Wikisource crawler and image downloader
Requirements:
Python 3.8 or later
Install/setup:
pip install -r requirements.txt
Usage crawler
python crawler.py --type {green, yellow, or red} --output_file_name {output TSV file name} --start_file_name {name of the file to start crawling from} --start_page_number {page number within that file to start crawling from}
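For orientation, the flags above could be parsed roughly as in the following minimal sketch. It is not the code in crawler.py: the function name build_crawler_parser is hypothetical, and which flags are required, their defaults, and the integer type of the page number are all assumptions made for illustration.

```python
import argparse


def build_crawler_parser():
    """Illustrative parser mirroring the documented crawler flags.

    Hypothetical helper: required/optional status, defaults, and types
    are assumptions, not taken from crawler.py.
    """
    parser = argparse.ArgumentParser(description="Wikisource crawler (sketch)")
    # The README lists exactly three accepted values for --type.
    parser.add_argument("--type", choices=["green", "yellow", "red"], required=True)
    # Name of the TSV file the crawler writes to.
    parser.add_argument("--output_file_name", required=True)
    # Assumed optional: without them, crawling would start from the beginning.
    parser.add_argument("--start_file_name", default=None)
    parser.add_argument("--start_page_number", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_crawler_parser().parse_args()
    print(args.type, args.output_file_name, args.start_file_name, args.start_page_number)
```

With a parser like this, an unsupported value such as --type blue is rejected with a usage error before any crawling starts.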
Usage image downloader
python image_download.py --file_path {TSV file listing the images to download} --output_folder {folder for downloaded images; default: images} --max_folder_size_mb {folder size in MB at which to stop; if omitted, everything is downloaded} --from_checkpoint {True to resume from a checkpoint if a pickle file is available}
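The size limit set by --max_folder_size_mb can be pictured with the sketch below, which sums the file sizes under the output folder and compares the total to the limit. This only illustrates the idea; the helper names folder_size_mb and should_stop are hypothetical, and image_download.py may measure the folder differently (for example, counting 1 MB as 1,000,000 bytes rather than 1024 * 1024).

```python
import os


def folder_size_mb(folder):
    """Return the total size of all files under `folder`, in MB (1024 * 1024 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Skip broken symlinks and other entries that are not regular files.
            if os.path.isfile(path):
                total_bytes += os.path.getsize(path)
    return total_bytes / (1024 * 1024)


def should_stop(folder, max_folder_size_mb=None):
    """Assumed rule: stop once the folder reaches the limit; no limit means never stop."""
    if max_folder_size_mb is None:
        return False
    return folder_size_mb(folder) >= max_folder_size_mb
```

A download loop could call should_stop("images", max_folder_size_mb) after saving each image and break out as soon as it returns True.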