Dawid Jurkiewicz
|
0bba61bbcd
|
Fix checking if response is a binary string.
Modify Makefile - increase to 40 parallel crawls.
Add 4XX HTTP codes to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
|
2018-04-13 21:45:20 +02:00 |
|
Dawid Jurkiewicz
|
21ba56a8fa
|
Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
|
2018-04-09 23:53:36 +02:00 |
|
Dawid Jurkiewicz
|
f9c5690657
|
Modify error logging in get_parishes_url. Enhance crawl_deon.py.
Fix Makefile - append instead of rewrite.
|
2018-04-06 23:33:18 +02:00 |
|
Dawid Jurkiewicz
|
3027e1e7cc
|
Switch to pure HTML download. Enhance URL filtering.
Update Makefile.
|
2018-03-11 18:02:31 +01:00 |
|
Dawid Jurkiewicz
|
b433a5e297
|
Code refactorings.
|
2018-03-01 18:16:11 +01:00 |
|
Dawid Jurkiewicz
|
8b72d0b351
|
Prototype rule-based masses extractor.
Added spider.
Started working on testsets.
|
2018-03-01 14:40:13 +01:00 |
|