Commit Graph

49 Commits

Author SHA1 Message Date
7dd903b3b5 First version of ml hour classifier.
Add last_access field to annotator_console user stats.

Add split-data script.
Add tsv2fasttext.py
Add todos.org.
2018-05-28 15:10:31 +02:00
6ff4f230db Add redis database to tsv method. 2018-05-27 14:44:55 +02:00
606ebb5260 Add redis data structures description. Handle banned users.
Rename annotation_stats ---> annotator_console.py
Add ban and users stats function in annotator_console.py
2018-05-26 19:07:08 +02:00
626307f135 Fix get next utterance (by score).
Switch to argparse in annotation_stats.py
2018-05-25 15:09:23 +02:00
6a3819eb0a Add passing users not annotated utterances (by them)
Switch to secrets module for cookie tokens.
Add console, exec mode to annotation_stats.py (todo rename script)
Add some more info in index.html helper modal.
2018-05-25 12:39:06 +02:00
fbcf3bad4e Add redis stats, helper script.
Add default colors to color_hour func in extractor.find_hours
Add annotation task description modal.
2018-05-24 12:56:02 +02:00
916703ed5e Refactor app.py and add robust undo functionality.
Add cookie js disclaimer script in index.html
2018-05-21 01:09:05 +02:00
0e5dc170f6 Configure loggers. Add redis stats script. Add logging by ip in db. 2018-05-19 01:05:30 +02:00
7da40e76ac Fix find_hours regex. Fix app.py (adapt to addition of utterances)
change header of website in index.html
get_utterances.py run with parameter instead of in script filename
2018-05-16 20:33:32 +02:00
95491b20a7 Working annotator. Without abuse handling, but logging actions.
Modify find_hours
Modify get_utterances
Add missing parish2text-commands.sh
working app.py
add hash.min.js (fingerprintjs)
modify index.html, make it prettier, add functions and more
2018-05-15 07:13:09 +02:00
1f6b1e6ffe Working utterances getting/pickling
Working converting parishes from html2text.
Add makefile parish2text goal.
Change to non-html(text) parishes in extract_rule_based and get_utterances
Enhance find_hours.py
Wrap render_template in make_response in webapp/app.py
2018-05-14 01:51:40 +02:00
382666c563 Add test.py for data gathering (data for annotation)
Small changes to annotator.py (to be deleted in near future)
Add utils/iterator
Add redis to environment.yml
Rename, adapt and move rule based extractor.
Adapt find_hours.
Yapify webapp app (probably nothing more)
Rename buttons in index.html
2018-05-11 23:12:21 +02:00
c617018611 Restructure code. Add frontend template. (logic to be done) 2018-05-04 23:25:07 +02:00
6982ac2e59 Add basic wsgi app. Rename extractors, change directories.
Add gunicorn and flask to environment.yml
Update .gitignore
2018-04-27 22:44:15 +02:00
9b76f4e8aa Add robust recrawling of not completed data.
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743 Tune download settings. Enable dummy cache with 7 days of expiration.
Fix generating spider commands.
Add appending of redirected domain to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance view raw data python jsonline viewer
2018-04-15 12:17:35 +02:00
siulkilulki
a5cb3a090f remove plan.org 2018-04-13 22:45:37 +02:00
ee636a65f1 Update README.md 2018-04-13 22:36:04 +02:00
Dawid Jurkiewicz
c83c29e58e Delete old files 2018-04-13 22:33:11 +02:00
Dawid Jurkiewicz
0bba61bbcd Fix checking if response is a binary string.
Modify Makefile - enlarge to 40 parallel crawls.
Add 4XX http code to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa Add domain-blacklist.txt, domain filter, modify crawler.
Add binary or not checker.
2018-04-09 23:53:36 +02:00
siulkilulki
6107a89c78 Update README.md 2018-04-06 23:43:14 +02:00
Dawid Jurkiewicz
f9c5690657 Modify error logging in get_parishes_url. Enhance crawl_deon.py
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
ccc4af3d51 Fix get parishes urls script. 2018-04-04 20:29:48 +02:00
Dawid Jurkiewicz
56f704630e Add raw data viewer. 2018-03-30 22:10:41 +02:00
siulkilulki
88c55891f4 Update README.md 2018-03-15 19:01:13 +01:00
Dawid Jurkiewicz
63c4a71812 Add converter of content field in jsonline from html to text. 2018-03-15 16:09:59 +01:00
Dawid Jurkiewicz
3027e1e7cc Switch to pure html download. Enhanced urls filtering.
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297 Code refactorings. 2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
0070ffe07d Merge branch 'master' of github.com:siulkilulki/mass-scraper 2018-03-01 14:50:49 +01:00
Dawid Jurkiewicz
8b72d0b351 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00
Dawid Jurkiewicz
c3b86fe5a9 Prototype rule based masses extractor.
Added spider.
Started working on testsets.
2018-01-20 21:55:26 +01:00
siulkilulki
7161193169 Add prototype basic crawl 2017-11-21 22:51:09 +01:00
siulkilulki
9f1423b362 fixed url checking 2017-06-21 22:51:53 +02:00
Dawid Jurkiewicz
5ad2a36499 urlschecker alpha & sync 2017-06-21 21:52:20 +02:00
siulkilulki
b17fe9b5c2 fix variable name 2017-06-19 08:13:08 +02:00
siulkilulki
4ae6cd24c0 fix proxy conditional statement 2017-06-18 21:44:12 +02:00
siulkilulki
f54e01581c code refactorings and improvements 2017-06-18 21:33:44 +02:00
siulkilulki
b16f29ef6d changed prdriver location 2017-06-12 22:17:23 +02:00
siulkilulki
57315f9b31 proof of concept alpha 2017-06-12 22:08:29 +02:00
siulkilulki
de56ecb253 done proxy.py 2017-06-11 00:00:22 +02:00
siulkilulki
c205e1b627 added proxy downloader 2017-06-10 02:09:22 +02:00
siulkilulki
35d3b11ec6 add downloaded parishes 2017-04-21 00:29:17 +02:00
siulkilulki
35db6760f7 Merge branch 'master' of https://github.com/siulkilulki/mass-scraper 2017-04-20 10:56:23 +02:00
siulkilulki
7aed0dda4f add parish scraping script 2017-04-20 10:51:02 +02:00
siulkilulki
d25f3f2757 Update temat.md 2017-03-14 17:11:33 +01:00
siulkilulki
b463dee0d2 Update temat.md 2017-03-14 17:10:24 +01:00
siulkilulki
5dc436781b add description of thesis 2017-03-14 17:08:44 +01:00
siulkilulki
af01adb7ab Initial commit 2017-03-10 16:05:59 +01:00