Commit Graph

35 Commits

Author SHA1 Message Date
9b76f4e8aa Add robust recrawling of not completed data.
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743 Tune download settings. Enable dummy cache with 7 days of expiration.
Fix generating spider commands.
Add appending of redirected domain to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance Python jsonline raw data viewer.
2018-04-15 12:17:35 +02:00
siulkilulki
a5cb3a090f remove plan.org 2018-04-13 22:45:37 +02:00
ee636a65f1 Update README.md 2018-04-13 22:36:04 +02:00
Dawid Jurkiewicz
c83c29e58e Delete old files 2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd Fix checking if response is a binary string.
Modify Makefile - increase to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
2018-04-09 23:53:36 +02:00
siulkilulki
6107a89c78 Update README.md 2018-04-06 23:43:14 +02:00
Dawid Jurkiewicz
f9c5690657 Modify error logging in get_parishes_url. Enhance crawl_deon.py
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
ccc4af3d51 Fix get parishes urls script. 2018-04-04 20:29:48 +02:00
Dawid Jurkiewicz
56f704630e Add raw data viewer. 2018-03-30 22:10:41 +02:00
siulkilulki
88c55891f4 Update README.md 2018-03-15 19:01:13 +01:00
Dawid Jurkiewicz
63c4a71812 Add converter of content field in jsonline from html to text. 2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc Switch to pure html download. Enhanced urls filtering.
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297 Code refactorings. 2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
0070ffe07d Merge branch 'master' of github.com:siulkilulki/mass-scraper 2018-03-01 14:50:49 +01:00
Dawid Jurkiewicz
8b72d0b351 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00
Dawid Jurkiewicz
c3b86fe5a9 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-01-20 21:55:26 +01:00
siulkilulki
7161193169 Add prototype basic crawl 2017-11-21 22:51:09 +01:00
siulkilulki
9f1423b362 fixed url checking 2017-06-21 22:51:53 +02:00
Dawid Jurkiewicz
5ad2a36499 urlschecker alpha & sync 2017-06-21 21:52:20 +02:00
siulkilulki
b17fe9b5c2 fix variable name 2017-06-19 08:13:08 +02:00
siulkilulki
4ae6cd24c0 fix proxy conditional statement 2017-06-18 21:44:12 +02:00
siulkilulki
f54e01581c code refactorings and improvements 2017-06-18 21:33:44 +02:00
siulkilulki
b16f29ef6d changed prdriver location 2017-06-12 22:17:23 +02:00
siulkilulki
57315f9b31 proof of concept alpha 2017-06-12 22:08:29 +02:00
siulkilulki
de56ecb253 done proxy.py 2017-06-11 00:00:22 +02:00
siulkilulki
c205e1b627 added proxy downloader 2017-06-10 02:09:22 +02:00
siulkilulki
35d3b11ec6 add downloaded parishes 2017-04-21 00:29:17 +02:00
siulkilulki
35db6760f7 Merge branch 'master' of https://github.com/siulkilulki/mass-scraper 2017-04-20 10:56:23 +02:00
siulkilulki
7aed0dda4f add parish scraping script 2017-04-20 10:51:02 +02:00
siulkilulki
d25f3f2757 Update temat.md 2017-03-14 17:11:33 +01:00
siulkilulki
b463dee0d2 Update temat.md 2017-03-14 17:10:24 +01:00
siulkilulki
5dc436781b add description of thesis 2017-03-14 17:08:44 +01:00
siulkilulki
af01adb7ab Initial commit 2017-03-10 16:05:59 +01:00