mass-scraper/utils/iterator.py
siulkilulki 382666c563 Add test.py for data gathering (data for annotation)
Small changes to annotator.py (to be deleted in near future)
Add utils/iterator
Add redis to enviroment.yml
Rename, adapt and move rule based extractor.
Adapt find_hours.
Yapify webapp app (probably nothing more)
Rename buttons in index.html
2018-05-11 23:12:21 +02:00


import os
import jsonlines
import random
from parishwebsites.parish2text import Parish2Text


def parish_path_iterator(directory):
    """Yield the path of every non-empty file found under `directory`."""
    for root, dirs, files in os.walk(directory):
        for fname in sorted(files):
            filepath = os.path.join(root, fname)
            if os.path.getsize(filepath) > 0:
                yield filepath


def parish_page_iterator(filepath):
    """Yield each page of a JSON Lines file, converted by Parish2Text.convert."""
    with jsonlines.open(filepath) as parish_reader:
        page_nr = 0
        for parish_page in parish_reader:
            page_nr += 1
            # Skip pages that contain PHP's execution-timeout error message.
            if 'Maximum execution time of 30 seconds exceeded in' in parish_page[
                    'content']:
                continue
            parish2text = Parish2Text()
            yield parish2text.convert(parish_page)
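
The two generators above are meant to be chained: walk a directory for non-empty files, then stream the pages of each file. The following is a minimal, standard-library-only sketch of that pattern, not the project's own code. It assumes each file holds one JSON object per line with a `content` key, replaces `jsonlines` with `json`, and leaves out the `Parish2Text` conversion step entirely. The names `nonempty_files`, `pages`, `demo`, and `TIMEOUT_MARKER`, as well as the sample file names and contents, are hypothetical.

```python
import json
import os
import tempfile

# Same marker string the original code filters on.
TIMEOUT_MARKER = 'Maximum execution time of 30 seconds exceeded in'


def nonempty_files(directory):
    """Yield paths of non-empty files under directory (hypothetical stand-in)."""
    for root, _dirs, files in os.walk(directory):
        for fname in sorted(files):
            filepath = os.path.join(root, fname)
            if os.path.getsize(filepath) > 0:
                yield filepath


def pages(filepath):
    """Yield parsed pages from a JSON Lines file, skipping timeout pages.

    Assumption: every non-blank line is a JSON object with a 'content' key.
    """
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            page = json.loads(line)
            if TIMEOUT_MARKER in page['content']:
                continue
            yield page


def demo():
    """Build a throwaway directory and return the page contents that survive."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, 'a.jsonl'), 'w', encoding='utf-8') as f:
            f.write(json.dumps({'content': 'Msze: 8:00, 10:00'}) + '\n')
            f.write(json.dumps(
                {'content': 'Fatal error: ' + TIMEOUT_MARKER + ' x.php'}) + '\n')
        # An empty file, which nonempty_files should skip.
        open(os.path.join(tmp, 'empty.jsonl'), 'w').close()
        # Consume both generators before the temporary directory is removed.
        return [page['content']
                for path in nonempty_files(tmp)
                for page in pages(path)]


if __name__ == '__main__':
    print(demo())
```

Because both functions are generators, only one page is held in memory at a time, which matters when the scraped data is large.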