9b76f4e8aa
Add robust recrawling of incomplete data.
...
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743
Tune download settings. Enable dummy cache with 7 days of expiration.
...
Fix generating spider commands.
Add appending of redirected domains to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance Python jsonline viewer for raw data.
2018-04-15 12:17:35 +02:00
Dawid Jurkiewicz
0bba61bbcd
Fix checking if response is a binary string.
...
Modify Makefile - increase to 40 parallel crawls.
Add 4XX HTTP codes to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa
Add domain-blacklist.txt, domain filter, modify crawler.
...
Add binary-or-not checker.
2018-04-09 23:53:36 +02:00
Dawid Jurkiewicz
f9c5690657
Modify error logging in get_parishes_url. Enhance crawl_deon.py
...
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
3027e1e7cc
Switch to pure HTML download. Enhance URL filtering.
...
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297
Code refactorings.
2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
8b72d0b351
Prototype rule-based masses extractor.
...
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00