mass-scraper/extractor/find_hours.py

import re
from colorama import Fore, Style

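# Matches times such as 7:30, 18.00 or 9.oo (letter "o" typed for zeros),
# as well as bare hours from 6 to 22.
hour_regex = re.compile(
    r'(0?[6-9]|1\d|2[0-2])[:.](oo|[0-5]\d)|6|7|8|9|1\d|2[0-2]')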


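def borders_ok(text, start, end):
    """Return True if the match text[start:end] has acceptable neighbours."""
    # Pad with spaces so a match at either edge of the text still has a
    # neighbouring character; the padding shifts every index by one.
    text = ' ' + text + ' '
    before_start_char = text[start]
    after_end_char = text[end + 1]
    # Reject a match that is fully wrapped in parentheses, e.g. "(12)".
    return ((before_start_char.isspace() or before_start_char in ',(/')
            and (after_end_char.isspace() or after_end_char in ',;)/')
            and (before_start_char != '(' or after_end_char != ')'))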


def delete_duplicates(text):
    """Collapse runs of spaces and of newlines in text."""
    text = re.sub(' +', ' ', text)
    text = re.sub(' ?\n ?', '\n', text)
    text = re.sub('\n{5,}', '\n\n\n', text)
    text = re.sub('\n\n', '\n', text)
    return text


def get_context(text, start, end, minsize):
    hour = text[start:end]
    prefix = delete_duplicates(text[:start]).rsplit(
        ' ', maxsplit=minsize + 12)[1:]
    suffix = delete_duplicates(text[end:]).split(
        ' ', maxsplit=minsize + 2)[:-1]
    return ' '.join(prefix), hour, ' '.join(suffix)


def hours_iterator(text, minsize=20, color=False):
    for hour_match in hour_regex.finditer(text):
        start = hour_match.start(0)
        end = hour_match.end(0)
        if not borders_ok(text, start, end):
            continue
        prefix, hour, suffix = get_context(text, start, end, minsize)
        if color:
            utterance = f'{prefix}&&&{hour}###{suffix}'
            yield utterance, color_hour(prefix, hour, suffix, Fore.GREEN,
                                        Style.BRIGHT)
        else:
            yield prefix, hour, suffix


# In the classifier, split so that \n is also kept as a separate token


def color_hour(prefix, hour, suffix, color=Fore.GREEN, style=Style.BRIGHT):
    return prefix + color + style + hour + Style.RESET_ALL + suffix