Commit Graph

38 Commits

Author SHA1 Message Date
382666c563 Add test.py for data gathering (data for annotation)
Small changes to annotator.py (to be deleted in near future)
Add utils/iterator
Add redis to environment.yml
Rename, adapt and move rule based extractor.
Adapt find_hours.
Yapify webapp app (probably nothing more)
Rename buttons in index.html
2018-05-11 23:12:21 +02:00
c617018611 Restructure code. Add frontend template. (logic to be done) 2018-05-04 23:25:07 +02:00
6982ac2e59 Add basic wsgi app. Rename extractors, change directories.
Add gunicorn and flask to environment.yml
Update .gitignore
2018-04-27 22:44:15 +02:00
9b76f4e8aa Add robust recrawling of not completed data.
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743 Tune download settings. Enable dummy cache with 7 days of expiration.
Fix generating spider commands.
Add appending of redirected domain to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance view raw data python jsonline viewer
2018-04-15 12:17:35 +02:00
siulkilulki
a5cb3a090f remove plan.org 2018-04-13 22:45:37 +02:00
ee636a65f1 Update README.md 2018-04-13 22:36:04 +02:00
Dawid Jurkiewicz
c83c29e58e Delete old files 2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd Fix checking if response is a binary string.
Modify Makefile - increase to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
2018-04-09 23:53:36 +02:00
siulkilulki
6107a89c78 Update README.md 2018-04-06 23:43:14 +02:00
Dawid Jurkiewicz
f9c5690657 Modify error logging in get_parishes_url. Enhance crawl_deon.py
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
ccc4af3d51 Fix get parishes urls script. 2018-04-04 20:29:48 +02:00
Dawid Jurkiewicz
56f704630e Add raw data viewer. 2018-03-30 22:10:41 +02:00
siulkilulki
88c55891f4 Update README.md 2018-03-15 19:01:13 +01:00
Dawid Jurkiewicz
63c4a71812 Add converter of content field in jsonline from html to text. 2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc Switch to pure html download. Enhanced urls filtering.
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297 Code refactorings. 2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
0070ffe07d Merge branch 'master' of github.com:siulkilulki/mass-scraper 2018-03-01 14:50:49 +01:00
Dawid Jurkiewicz
8b72d0b351 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00
Dawid Jurkiewicz
c3b86fe5a9 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-01-20 21:55:26 +01:00
siulkilulki
7161193169 Add prototype basic crawl 2017-11-21 22:51:09 +01:00
siulkilulki
9f1423b362 fixed url checking 2017-06-21 22:51:53 +02:00
Dawid Jurkiewicz
5ad2a36499 urlschecker alpha & sync 2017-06-21 21:52:20 +02:00
siulkilulki
b17fe9b5c2 fix variable name 2017-06-19 08:13:08 +02:00
siulkilulki
4ae6cd24c0 fix proxy conditional statement 2017-06-18 21:44:12 +02:00
siulkilulki
f54e01581c code refactorings and improvements 2017-06-18 21:33:44 +02:00
siulkilulki
b16f29ef6d changed prdriver location 2017-06-12 22:17:23 +02:00
siulkilulki
57315f9b31 proof of concept alpha 2017-06-12 22:08:29 +02:00
siulkilulki
de56ecb253 done proxy.py 2017-06-11 00:00:22 +02:00
siulkilulki
c205e1b627 added proxy downloader 2017-06-10 02:09:22 +02:00
siulkilulki
35d3b11ec6 add downloaded parishes 2017-04-21 00:29:17 +02:00
siulkilulki
35db6760f7 Merge branch 'master' of https://github.com/siulkilulki/mass-scraper 2017-04-20 10:56:23 +02:00
siulkilulki
7aed0dda4f add parish scraping script 2017-04-20 10:51:02 +02:00
siulkilulki
d25f3f2757 Update temat.md 2017-03-14 17:11:33 +01:00
siulkilulki
b463dee0d2 Update temat.md 2017-03-14 17:10:24 +01:00
siulkilulki
5dc436781b add description of thesis 2017-03-14 17:08:44 +01:00
siulkilulki
af01adb7ab Initial commit 2017-03-10 16:05:59 +01:00