382666c563
Add test.py for data gathering (data for annotation)
...
Small changes to annotator.py (to be deleted in near future)
Add utils/iterator
Add redis to environment.yml
Rename, adapt and move rule based extractor.
Adapt find_hours.
Yapify webapp app (probably nothing more)
Rename buttons in index.html
2018-05-11 23:12:21 +02:00
c617018611
Restructure code. Add frontend template. (logic to be done)
2018-05-04 23:25:07 +02:00
6982ac2e59
Add basic wsgi app. Rename extractors, change directories.
...
Add gunicorn and flask to environment.yml
Update .gitignore
2018-04-27 22:44:15 +02:00
9b76f4e8aa
Add robust recrawling of not completed data.
...
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743
Tune download settings. Enable dummy cache with 7 days of expiration.
...
Fix generating spider commands.
Add appending of redirected domain to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance the Python jsonline raw data viewer
2018-04-15 12:17:35 +02:00
siulkilulki
a5cb3a090f
remove plan.org
2018-04-13 22:45:37 +02:00
ee636a65f1
Update README.md
2018-04-13 22:36:04 +02:00
Dawid Jurkiewicz
c83c29e58e
Delete old files
2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd
Fix checking if response is a binary string.
...
Modify Makefile - increase to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa
Add domain-blacklist.txt, domain filter, modify crawler.
...
Add binary or not checker.
2018-04-09 23:53:36 +02:00
siulkilulki
6107a89c78
Update README.md
2018-04-06 23:43:14 +02:00
Dawid Jurkiewicz
f9c5690657
Modify error logging in get_parishes_url. Enhance crawl_deon.py
...
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
ccc4af3d51
Fix get parishes urls script.
2018-04-04 20:29:48 +02:00
Dawid Jurkiewicz
56f704630e
Add raw data viewer.
2018-03-30 22:10:41 +02:00
siulkilulki
88c55891f4
Update README.md
2018-03-15 19:01:13 +01:00
Dawid Jurkiewicz
63c4a71812
Add converter of content field in jsonline from html to text.
2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc
Switch to pure html download. Enhanced urls filtering.
...
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297
Code refactorings.
2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
0070ffe07d
Merge branch 'master' of github.com:siulkilulki/mass-scraper
2018-03-01 14:50:49 +01:00
Dawid Jurkiewicz
8b72d0b351
Prototype rule based masses extractor.
...
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00
Dawid Jurkiewicz
c3b86fe5a9
Prototype rule based masses extractor.
...
Added spider.
Started working on testsets.
2018-01-20 21:55:26 +01:00
siulkilulki
7161193169
Add prototype basic crawl
2017-11-21 22:51:09 +01:00
siulkilulki
9f1423b362
fixed url checking
2017-06-21 22:51:53 +02:00
Dawid Jurkiewicz
5ad2a36499
urlschecker alpha & sync
2017-06-21 21:52:20 +02:00
siulkilulki
b17fe9b5c2
fix variable name
2017-06-19 08:13:08 +02:00
siulkilulki
4ae6cd24c0
fix proxy conditional statement
2017-06-18 21:44:12 +02:00
siulkilulki
f54e01581c
code refactorings and improvements
2017-06-18 21:33:44 +02:00
siulkilulki
b16f29ef6d
changed prdriver location
2017-06-12 22:17:23 +02:00
siulkilulki
57315f9b31
proof of concept alpha
2017-06-12 22:08:29 +02:00
siulkilulki
de56ecb253
done proxy.py
2017-06-11 00:00:22 +02:00
siulkilulki
c205e1b627
added proxy downloader
2017-06-10 02:09:22 +02:00
siulkilulki
35d3b11ec6
add downloaded parishes
2017-04-21 00:29:17 +02:00
siulkilulki
35db6760f7
Merge branch 'master' of https://github.com/siulkilulki/mass-scraper
2017-04-20 10:56:23 +02:00
siulkilulki
7aed0dda4f
add parish scraping script
2017-04-20 10:51:02 +02:00
siulkilulki
d25f3f2757
Update temat.md
2017-03-14 17:11:33 +01:00
siulkilulki
b463dee0d2
Update temat.md
2017-03-14 17:10:24 +01:00
siulkilulki
5dc436781b
add description of thesis
2017-03-14 17:08:44 +01:00
siulkilulki
af01adb7ab
Initial commit
2017-03-10 16:05:59 +01:00