c99c218436
Fix MCC score.
...
Add comments to makefile.
Fix get_utterances condition.
Adjust crawler settings.
Change split-data script
2018-06-18 14:56:41 +02:00
7dd903b3b5
First version of ML hour classifier.
...
Add last_access field to annotator_console user stats.
Add split-data script.
Add tsv2fasttext.py
Add todos.org.
2018-05-28 15:10:31 +02:00
1f6b1e6ffe
Working retrieval and pickling of utterances.
...
Working converting parishes from html2text.
Add makefile parish2text goal.
Switch to non-HTML (plain text) parishes in extract_rule_based and get_utterances.
Enhance find_hours.py
Wrap render_template in make_response in webapp/app.py
2018-05-14 01:51:40 +02:00
9b76f4e8aa
Add robust recrawling of incomplete data.
...
Add annotator.py (highlighting hour within context done)
Enhance parish2text.py (enable more flags, convert button)
2018-04-16 23:54:03 +02:00
e9c4dcd743
Tune download settings. Enable dummy cache with 7 days of expiration.
...
Fix generating spider commands.
Add appending of redirected domain to allowed domains.
Configure loggers.
Add more meta info to *processed.txt
Enhance the Python jsonline viewer for raw data.
2018-04-15 12:17:35 +02:00
Dawid Jurkiewicz
0bba61bbcd
Fix checking if response is a binary string.
...
Modify Makefile - increase to 40 parallel crawls.
Add 4XX HTTP codes to retry list.
Remove processed.final.txt
Probably fix remove_blacklisted.py
2018-04-13 21:45:20 +02:00
Dawid Jurkiewicz
21ba56a8fa
Add domain-blacklist.txt, domain filter, modify crawler.
...
Add binary or not checker.
2018-04-09 23:53:36 +02:00
Dawid Jurkiewicz
f9c5690657
Modify error logging in get_parishes_url. Enhance crawl_deon.py
...
Fix Makefile - append instead of rewrite.
2018-04-06 23:33:18 +02:00
Dawid Jurkiewicz
3027e1e7cc
Switch to pure HTML download. Enhance URL filtering.
...
Update Makefile.
2018-03-11 18:02:31 +01:00
Dawid Jurkiewicz
b433a5e297
Code refactorings.
2018-03-01 18:16:11 +01:00
Dawid Jurkiewicz
8b72d0b351
Prototype rule-based masses extractor.
...
Added spider.
Started working on testsets.
2018-03-01 14:40:13 +01:00