Art Scraping Projects
A collection of web scraping projects for extracting art data from various online sources.
Watch the video demonstration »

A video showcasing the work of two web scrapers.
Timestamps:
00:00 Intro
00:11 Operation of the first code on Linux with Docker Compose
01:20 Start of the script's operation
03:25 Demonstration of stopping the code with Ctrl+C
04:28 Checking the code's functionality
05:00 Continuation of the script's operation
12:13 End of the code's operation and a functionality check
14:15 Demonstration of the "help" command
15:00 Operation of the second code on Linux with Docker Compose
16:05 Demonstration of stopping the code with Ctrl+C
16:35 Checking the code's functionality
17:25 Continuation of the script's operation
17:50 End of the code's operation and a functionality check
18:38 Demonstration of the "help" command
19:15 Operation of the first code on Windows
19:29 Running the script using a .bat file
21:05 Demonstration of stopping the code with Ctrl+C
21:45 Checking the code's functionality
22:45 Continuation of the script's operation
25:30 Demonstration of stopping the code with Ctrl+C
26:10 Checking the code's functionality
26:30 Continuation of the script's operation
32:15 End of the code's operation and a functionality check
33:10 Demonstration of the "help" command
34:05 Outro

Watch the video demonstration of the second code on Windows »

Operation of the second code on Windows

Timestamps:
00:00 Operation of the second code on Windows
00:38 Demonstration of stopping the code with Ctrl+C
01:11 Checking the code's functionality
01:40 Continuation of the script's operation
03:05 End of the code's operation and a functionality check
03:55 Demonstration of the "help" command
04:49 Outro
Table of Contents
About The Project
This repository contains web scraping projects designed to extract art-related data:
Carl-Heinz Kliemann Kunst Scraper
This project scrapes artwork data from Kunst-Archive.net, specifically focusing on the works of Carl-Heinz Kliemann.
Features:
- Fetches Artwork Details: Retrieves information such as year, materials, dimensions, signature, and exhibition history.
- Downloads Images: Downloads and saves artwork images locally.
- Progress Saving: Saves scraped data in JSON format to allow resuming.
- Interruption Handling: Gracefully handles interruptions (e.g., Ctrl+C) and saves progress.
- Git Integration: Creates a `.gitignore` to prevent accidental image commits.
- Comprehensive Logging: Logs the scraping process to both a file and the console.
- Mimics Browser Behavior: Uses random User-Agents and simulates Google Analytics requests.
- Asynchronous Operation: Uses asynchronous programming (`httpx`) for efficient network requests.
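As an illustration of the browser-mimicking behavior, here is a minimal sketch of a random User-Agent generator. The function name matches the one listed in the key-functions table, but the component lists below are invented for the example and are not the project's actual data:

```python
import random

# Hypothetical example data -- the real script may use different strings.
PLATFORMS = [
    "Windows NT 10.0; Win64; x64",
    "X11; Linux x86_64",
    "Macintosh; Intel Mac OS X 13_0",
]
BROWSERS = ["Chrome/120.0.0.0", "Firefox/115.0", "Edg/120.0.0.0"]

def generate_random_user_agent() -> str:
    """Combine a random platform and browser into a User-Agent string."""
    return (
        f"Mozilla/5.0 ({random.choice(PLATFORMS)}) "
        f"AppleWebKit/537.36 (KHTML, like Gecko) {random.choice(BROWSERS)}"
    )
```

Picking a fresh string per request makes consecutive requests look like they come from different browsers.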
Limitations Under High Load:
- Data Fetch Restrictions: When a high number of requests are made, the website may block data fetching.
- Error Handling: A 500 error is displayed in the terminal when the limit is reached.
- Docker Behavior: In Docker Compose, the service may terminate unexpectedly under such conditions.
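A common way to soften such 500 responses is to retry with exponential backoff. The sketch below is illustrative only: the `fetch_with_retries` helper and its injected `fetch` callable are hypothetical, not part of the project's code.

```python
import time

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call `fetch(url)` (which returns (status_code, body)) up to `retries`
    times, backing off exponentially after each 500 response.
    Returns the body on success, or None if every attempt failed."""
    for attempt in range(retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 500:
            # The site is rejecting us under load: wait before retrying.
            time.sleep(base_delay * (2 ** attempt))
    return None
```

Backoff reduces the request rate exactly when the site signals overload, which often lets a long scrape finish instead of terminating.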
Franz Kline Artwork Scraper
This script retrieves artwork information from a website dedicated to Franz Kline. It uses Selenium for web interaction and handles interruptions gracefully, allowing downloads and data extraction to be resumed.
Features:
- Configuration: The `config.yaml` file contains the crucial settings.
- Logging: Uses Python's `logging` module to record events (info, errors, warnings) to `data.log`.
- Output: Scraped data is saved to a JSON file (specified in `config.yaml`), and images are saved to a designated folder (also specified in `config.yaml`).
- Interruption Handling: The script handles `SIGINT` (Ctrl+C) to save progress before exiting.
- Resumption: Loads previously saved data from the JSON file to resume scraping where it left off, and calculates the correct starting page based on previously scraped items.
- Git Ignore: Creates a `.gitignore` file within the image folder to prevent images from being tracked by Git.
- Headless Chrome: The script runs Chrome in headless mode, so no browser window is displayed.
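The interruption handling can be sketched roughly as follows. The handler name `save_on_interrupt` matches the function listed later in this README, but `scraped_items`, `save_progress`, and the `progress.json` path are assumptions made for the example:

```python
import json
import signal
import sys

# Hypothetical in-memory store of results scraped so far.
scraped_items = []

def save_progress(path="progress.json"):
    """Persist the scraped items to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scraped_items, f, ensure_ascii=False, indent=2)

def save_on_interrupt(signum, frame):
    """SIGINT handler: save progress, then exit cleanly."""
    save_progress()
    sys.exit(0)

# Register the handler so Ctrl+C triggers a save instead of a crash.
signal.signal(signal.SIGINT, save_on_interrupt)
```

Because the data is flushed to JSON on exit, the next run can load the file and continue from the last processed item.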
Built With
- Python (httpx, BeautifulSoup, Selenium, tqdm, argparse, logging)
- Docker / Docker Compose
Getting Started
Prerequisites
- Python 3.7+
- pip
Installation
1. Clone the repo: `git clone https://git.wmi.amu.edu.pl/s498817/Projekt_Zaliczeniowy.git`
2. Navigate to the project directory: `cd Projekt_Zaliczeniowy`
3. Choose the appropriate scraper directory: `cd Carl_Heinz_Kliemann__Kunst_Scraper` or `cd Franz_Kline_Artwork_Scraper`
4. Install dependencies: `pip install -r requirements.txt`
Usage
Carl-Heinz Kliemann Kunst Scraper Usage
Using Docker Compose (Highly Recommended)
1. Navigate to the project directory: `cd Carl_Heinz_Kliemann__Kunst_Scraper`
2. Build the Docker image: `docker-compose build`
3. Run the container in detached mode: `docker-compose up -d`
4. Execute the script (run): `docker exec -it Carl_Heinz_Kliemann__Kunst_Scraper python Main_v3.py run`
5. Execute the script (help): `docker exec -it Carl_Heinz_Kliemann__Kunst_Scraper python Main_v3.py help`
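For reference, a `docker-compose.yml` along these lines would support the commands above. This is a hypothetical sketch, not the file shipped in the repository; the volume paths in particular are assumptions:

```yaml
version: "3.8"
services:
  scraper:
    build: .
    container_name: Carl_Heinz_Kliemann__Kunst_Scraper
    volumes:
      - ./images:/app/images   # keep downloaded images on the host
      - ./data:/app/data       # keep JSON progress on the host
    tty: true                  # keep the container alive for `docker exec`
```

Mounting the image and data folders as volumes means scraped results survive container restarts.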
Running Directly (Without Docker)
1. Navigate to the project directory: `cd Carl_Heinz_Kliemann__Kunst_Scraper`
2. Install dependencies (if not already installed): `pip install -r requirements.txt`
3. Run the script: `python Main_v3.py run`
4. For help: `python Main_v3.py help`
Franz Kline Artwork Scraper Usage
Running Directly
1. Navigate to the project directory: `cd Franz_Kline_Artwork_Scraper`
2. Ensure Python and the required libraries are installed, and that a `config.yaml` file is present in the same directory as the script: `pip install -r requirements.txt`
3. Run the scraper: `python <script_name.py>` (replace `<script_name.py>` with the actual name of the Python file, for example `python scraper.py`)
Commands
| Command | Description |
|---|---|
| `run` | Starts the scraping process. |
| `help` | Shows the help message with command and function details, plus Docker usage. |
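These two commands can be handled with `argparse` roughly as follows. This is an illustrative sketch, not the project's exact `parse_arguments` implementation:

```python
import argparse

def parse_arguments(argv=None):
    """Parse the positional command (`run` or `help`)."""
    parser = argparse.ArgumentParser(description="Art scraper")
    parser.add_argument(
        "command",
        choices=["run", "help"],
        help="run: start scraping; help: show detailed usage information",
    )
    return parser.parse_args(argv)
```

Restricting the positional argument with `choices` makes `argparse` reject any unknown command with a usage message.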
Key Functions (Combined for Both Scrapers)
This section outlines the key functions used across both the Carl-Heinz Kliemann Kunst Scraper and Franz Kline Artwork Scraper projects. These functions handle various aspects of the scraping process, from fetching and parsing web pages to managing data and handling interruptions.
| Function | Description | Scraper |
|---|---|---|
| `setup_logging()` | Configures and returns a logger for application messages, including errors, warnings, and informational messages. | Both scrapers |
| `handle_interrupt(...)` | Handles interruptions (e.g., Ctrl+C / `SIGINT`) gracefully, ensuring that any data being processed is saved before the script exits. | Both scrapers |
| `main()` | Entry point of the application; manages `asyncio` (Carl-Heinz) to speed up data fetching. | Both scrapers |
| `create_gitignore(...)` | Creates a `.gitignore` file in the specified folder that ignores all files except `.gitignore` itself (excludes downloaded images from Git). | Both scrapers |
| `log_event(...)` | Logs events (success, failure, etc.) with details. | Both scrapers |
| `fetch_html(url, ...)` | Asynchronously fetches the HTML content of a given URL using `httpx`; handles retries and exceptions for robust network requests. | Carl-Heinz Kliemann Kunst Scraper |
| `parse_html(html)` | Parses the HTML content using `BeautifulSoup` to extract relevant information about artworks. | Carl-Heinz Kliemann Kunst Scraper |
| `process_art_page(...)` | Fetches and processes a single artwork page, extracting detailed information and downloading the associated image. | Carl-Heinz Kliemann Kunst Scraper |
| `get_image_link(...)` | Extracts the image link from the artwork page. | Carl-Heinz Kliemann Kunst Scraper |
| `download_image(...)` | Downloads and saves an image to the local filesystem. | Carl-Heinz Kliemann Kunst Scraper |
| `save_to_json(...)` | Saves the scraped data to a JSON file; handles serialization and file writing. | Carl-Heinz Kliemann Kunst Scraper |
| `create_images_directory()` | Creates the directory for storing images if it doesn't exist. | Carl-Heinz Kliemann Kunst Scraper |
| `send_google_analytics(...)` | Sends a simulated request to Google Analytics, mimicking a real user's interaction with the website. | Carl-Heinz Kliemann Kunst Scraper |
| `interrupt_handler(...)` | Handles keyboard interruptions (Ctrl+C) gracefully, saving data before exit (the Carl-Heinz-specific variant of `handle_interrupt`). | Carl-Heinz Kliemann Kunst Scraper |
| `print_help()` | Prints the help message for the script, explaining available command-line arguments. | Carl-Heinz Kliemann Kunst Scraper |
| `parse_arguments()` | Parses command-line arguments using `argparse`. | Carl-Heinz Kliemann Kunst Scraper |
| `fetch_pages_data(...)` | Fetches data from multiple pages concurrently, using `asyncio` to manage tasks. | Carl-Heinz Kliemann Kunst Scraper |
| `process_page_and_artworks(...)` | Processes a single page of artworks, extracting the links and then processing each individual artwork page via `process_art_page`. | Carl-Heinz Kliemann Kunst Scraper |
| `create_progress_bar(...)` | Creates a `tqdm` progress bar to visualize the scraping progress. | Carl-Heinz Kliemann Kunst Scraper |
| `generate_random_user_agent()` | Generates a random User-Agent string to mimic different browsers. | Carl-Heinz Kliemann Kunst Scraper |
| `extract_section_data(...)` | Extracts data such as the provenance, exhibitions, and bibliography sections from the artwork page using Selenium. | Franz Kline Artwork Scraper |
| `click_next_page_button(...)` | Clicks the "Next Page" button to navigate through multiple pages of artwork listings; uses Selenium's `WebDriverWait` for robust element interaction. | Franz Kline Artwork Scraper |
| `save_on_interrupt(...)` | Saves the current progress (scraped data) to a JSON file when the script is interrupted (e.g., by Ctrl+C). | Franz Kline Artwork Scraper |
| `load_config(...)` | Loads configuration settings from a YAML file (`config.yaml`). | Franz Kline Artwork Scraper |
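As an example of one of the simpler shared helpers, `create_gitignore` can be sketched as below; the exact signature in the project may differ:

```python
import os

def create_gitignore(folder):
    """Write a .gitignore into `folder` that ignores every file
    except the .gitignore itself, keeping downloaded images out of Git."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, ".gitignore")
    with open(path, "w", encoding="utf-8") as f:
        f.write("*\n!.gitignore\n")
    return path
```

The `*` pattern ignores everything in the folder, and `!.gitignore` re-includes the ignore file itself so the folder stays tracked.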
Roadmap
- Develop Web Scrapers for Two Websites
  - Website 1 (Kunst-Archive.net):
    - Create scripts that extract information about artworks, including details and images, and store the data in JSON format.
    - Asynchronous Requests with httpx: Implement asynchronous requests using the httpx library to optimize data scraping.
    - Simulate Browser Behavior: Use random User-Agents and mimic requests to Google Analytics to avoid detection and blocking.
  - Website 2 (Franz Kline Artwork):
    - Create scripts that extract information about paintings and save the data in JSON format along with the images of the paintings.
    - Selenium Integration: Use Selenium to drive the web browser.
- Implement Automatic .gitignore Generation
  - Automatically generate a .gitignore file to keep unnecessary files, specifically downloaded images, out of the repository.
- Implement Data Persistence and Resume Functionality
  - Save Progress on Interruption: Save the current state of scraped data upon script interruption (e.g., Ctrl+C).
  - Resume from Saved State: Load previously saved data and resume scraping from the last processed page/item.
  - Limit Data Retrieval: If data already exists in the output file, avoid redownloading it.
- Dockerization and Deployment (Carl-Heinz Kliemann Kunst Scraper)
  - Utilize Docker Compose: Set up and manage the project environment using Docker Compose for better scalability and deployment.
  - Containerize Applications: Create a Docker container to ensure a consistent execution environment.
- Add Help and Run Commands
  - Implement help and run commands using argparse to provide usage guidance and facilitate execution of the scraping scripts.
- Enhanced Error Handling and Logging
  - Detailed Logging: Implement comprehensive logging using the logging module to track script progress, errors, and warnings.
  - Robust Error Handling: Gracefully manage exceptions during HTTP requests, parsing, and file operations.
- Configuration Management
  - Centralized Configuration: Use a configuration file (e.g., config.yaml for the Franz Kline Artwork Scraper) to manage settings such as URLs, headers, proxies, output paths, and logging preferences.
- Code Refactoring and Optimization
  - Modular Design: Refactor code into smaller, reusable functions/modules for better maintainability.
  - Performance Optimization: Continuously optimize for memory usage, network requests, and parsing efficiency.
- Documentation
  - Comprehensive README: A detailed README explaining how to set up, run, and use the scrapers, including Docker instructions.
  - Code Comments: Clear, concise comments explaining the purpose and logic of different sections.
  - Help Command Output: Improve the help command's output with more context and detailed usage instructions.
- Add support for more art websites.
- Implement more robust error handling.
- Create a web interface for easier interaction.
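The resume-from-saved-state behavior in the roadmap can be sketched as follows; `load_progress` and its page arithmetic are assumptions made for illustration, not the project's actual code:

```python
import json
import os

def load_progress(path, items_per_page):
    """Load previously scraped items from a JSON file and compute the
    listing page to resume from. Returns (items, start_page)."""
    if not os.path.exists(path):
        return [], 1  # nothing saved yet: start from the first page
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    # Every full page of items has already been processed.
    start_page = len(items) // items_per_page + 1
    return items, start_page
```

For example, with 25 saved items and 10 items per listing page, pages 1 and 2 are complete and scraping resumes on page 3.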
Contact
Mykyta Kyslytsia (s498817) - mykkys@st.amu.edu.pl - mykyta.kyslytsia@gmail.com
Project Link: https://git.wmi.amu.edu.pl/s498817/Projekt_Zaliczeniowy
Acknowledgments
- Kunst-Archive.net
- Website dedicated to Franz Kline.