Commit Graph

31 Commits

Author SHA1 Message Date
Dawid Jurkiewicz
c83c29e58e Delete old files 2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd Fix checking if response is a binary string.
Modify Makefile - enlarge to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
2018-04-09 23:53:36 +02:00
siulkilulki
6107a89c78 Update README.md 2018-04-06 23:43:14 +02:00
Dawid Jurkiewicz
f9c5690657 Modify error logging in get_parishes_url. Enhance crawl_deon.py
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
ccc4af3d51 Fix get parishes urls script. 2018-04-04 20:29:48 +02:00
Dawid Jurkiewicz
56f704630e Add raw data viewer. 2018-03-30 22:10:41 +02:00
siulkilulki
88c55891f4 Update README.md 2018-03-15 19:01:13 +01:00
Dawid Jurkiewicz
63c4a71812 Add converter of content field in jsonline from html to text. 2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc Switch to pure html download. Enhanced urls filtering.
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297 Code refactorings. 2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
0070ffe07d Merge branch 'master' of github.com:siulkilulki/mass-scraper 2018-03-01 14:50:49 +01:00
Dawid Jurkiewicz
8b72d0b351 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00
Dawid Jurkiewicz
c3b86fe5a9 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-01-20 21:55:26 +01:00
siulkilulki
7161193169 Add prototype basic crawl 2017-11-21 22:51:09 +01:00
siulkilulki
9f1423b362 fixed url checking 2017-06-21 22:51:53 +02:00
Dawid Jurkiewicz
5ad2a36499 urlschecker alpha & sync 2017-06-21 21:52:20 +02:00
siulkilulki
b17fe9b5c2 fix variable name 2017-06-19 08:13:08 +02:00
siulkilulki
4ae6cd24c0 fix proxy conditional statement 2017-06-18 21:44:12 +02:00
siulkilulki
f54e01581c code refactorings and improvements 2017-06-18 21:33:44 +02:00
siulkilulki
b16f29ef6d changed prdriver location 2017-06-12 22:17:23 +02:00
siulkilulki
57315f9b31 proof of concept alpha 2017-06-12 22:08:29 +02:00
siulkilulki
de56ecb253 done proxy.py 2017-06-11 00:00:22 +02:00
siulkilulki
c205e1b627 added proxy downloader 2017-06-10 02:09:22 +02:00
siulkilulki
35d3b11ec6 add downloaded parishes 2017-04-21 00:29:17 +02:00
siulkilulki
35db6760f7 Merge branch 'master' of https://github.com/siulkilulki/mass-scraper 2017-04-20 10:56:23 +02:00
siulkilulki
7aed0dda4f add parish scraping script 2017-04-20 10:51:02 +02:00
siulkilulki
d25f3f2757 Update temat.md 2017-03-14 17:11:33 +01:00
siulkilulki
b463dee0d2 Update temat.md 2017-03-14 17:10:24 +01:00
siulkilulki
5dc436781b add description of thesis 2017-03-14 17:08:44 +01:00
siulkilulki
af01adb7ab Initial commit 2017-03-10 16:05:59 +01:00