9b76f4e8aa
Add robust recrawling of incomplete data.
...
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743
Tune download settings. Enable dummy cache with 7 days of expiration.
...
Fix generating spider commands.
Add appending of redirected domains to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance the Python jsonline raw data viewer
2018-04-15 12:17:35 +02:00
Dawid Jurkiewicz
c83c29e58e
Delete old files
2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd
Fix checking if response is a binary string.
...
Modify Makefile - increase to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa
Add domain-blacklist.txt, domain filter, modify crawler.
...
Add binary or not checker.
2018-04-09 23:53:36 +02:00
Dawid Jurkiewicz
56f704630e
Add raw data viewer.
2018-03-30 22:10:41 +02:00
Dawid Jurkiewicz
63c4a71812
Add converter of content field in jsonline from html to text.
2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc
Switch to pure html download. Enhanced urls filtering.
...
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297
Code refactorings.
2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
8b72d0b351
Prototype rule-based masses extractor.
...
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00