
zzombely da0e4f3263 crawler classes and image downloader		2023-03-12 15:57:35 +00:00
notebooks	fixing	2023-01-10 22:57:22 +01:00
.gitignore	crawler classes and image downloader	2023-03-12 15:57:35 +00:00
crawler_class.py	crawler classes and image downloader	2023-03-12 15:57:35 +00:00
crawler.py	crawler classes and image downloader	2023-03-12 15:57:35 +00:00
image_class.py	crawler classes and image downloader	2023-03-12 15:57:35 +00:00
image_download.py	fixing	2023-01-10 22:57:22 +01:00
mail_test.py	crawler classes and image downloader	2023-03-12 15:57:35 +00:00
README.md	readme and update for mb lock	2023-01-10 19:05:56 +01:00
requirements.txt	requirements update	2023-01-07 14:41:47 +00:00

README.md

Wikisource crawler and image downloader

Requirements:

Python 3.8 or newer

Install/setup:

pip install -r requirements.txt

Usage (crawler):

python crawler.py --type {green | yellow | red} --output_file_name {output TSV file name} --start_file_name {name of the file to start crawling from} --start_page_number {page number of that file to start crawling from}
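As a rough illustration of how these flags could be parsed, here is a minimal argparse sketch. Only the flag names and the three `--type` values come from the usage line above; the function name `build_crawler_parser`, which flags are required, the defaults, and the integer type of `--start_page_number` are assumptions and may differ from what `crawler.py` actually does.

```python
import argparse


def build_crawler_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags from the crawler usage line."""
    parser = argparse.ArgumentParser(description="Wikisource crawler (sketch)")
    # The three allowed values are taken from the README.
    parser.add_argument("--type", choices=["green", "yellow", "red"], required=True)
    # Assumption: the output file name is mandatory.
    parser.add_argument("--output_file_name", required=True)
    # Assumption: both resume options are optional and default to None.
    parser.add_argument("--start_file_name", default=None)
    parser.add_argument("--start_page_number", type=int, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_crawler_parser().parse_args(
    ["--type", "green", "--output_file_name", "green.tsv"]
)
print(args.type, args.output_file_name, args.start_file_name, args.start_page_number)
```

Restricting `--type` with `choices` makes argparse reject any value other than the three listed ones before crawling starts.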

Usage (image downloader):

python image_download.py --file_path {TSV file with data to download} --output_folder {folder for downloaded images; default: images} --max_folder_size_mb {folder size in MB at which to stop; if omitted, everything is downloaded} --from_checkpoint {True to resume from a checkpoint if a pickle file is available}
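The `--max_folder_size_mb` limit can be pictured as a size check on the output folder before each download. The sketch below is one possible way to do that, not the code in `image_download.py`: the helper names `folder_size_mb` and `limit_reached`, the use of 1 MB = 1024 * 1024 bytes, and the "stop once the size is at or above the limit" rule are all assumptions.

```python
import os
from typing import Optional


def folder_size_mb(folder: str) -> float:
    """Return the total size of all files under `folder`, in MB (hypothetical helper)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total_bytes += os.path.getsize(path)
    # Assumption: 1 MB is treated as 1024 * 1024 bytes.
    return total_bytes / (1024 * 1024)


def limit_reached(folder: str, max_folder_size_mb: Optional[float]) -> bool:
    """True when downloading should stop; with no limit given, never stop."""
    if max_folder_size_mb is None:
        return False
    return folder_size_mb(folder) >= max_folder_size_mb
```

A download loop would call `limit_reached("images", max_folder_size_mb)` before fetching each image and break out once it returns True.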