Commit Graph

9 Commits

Author SHA1 Message Date
e9c4dcd743 Tune download settings. Enable dummy cache with 7 days of expiration.
Fix generating spider commands.
Add redirected domain appending to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance the Python jsonline viewer for raw data.
2018-04-15 12:17:35 +02:00
Dawid Jurkiewicz
c83c29e58e Delete old files 2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd Fix checking if response is a binary string.
Modify Makefile - enlarge to 40 parallel crawlers.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
2018-04-09 23:53:36 +02:00
Dawid Jurkiewicz
56f704630e Add raw data viewer. 2018-03-30 22:10:41 +02:00
Dawid Jurkiewicz
63c4a71812 Add converter of content field in jsonline from html to text. 2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc Switch to pure html download. Enhanced urls filtering.
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297 Code refactorings. 2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
8b72d0b351 Prototype rule-based masses extractor.
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00