

Art Scraping Projects

A collection of web scraping projects for extracting art data from various online sources.
The video demonstrates the scripts in action. »

Video Demonstration

Watch the video demonstration »

A video showcasing two web scrapers in action / Wideo prezentujące działanie 2 web scraperów.

Timestamps / Oznaczenia czasowe:
00:00 Intro / Wprowadzenie
00:11 Operation of the first code on Linux with Docker Compose / Działanie pierwszego kodu na Linuksie z Docker Compose
01:20 Start of the script's operation / Początek działania skryptu
03:25 Demonstration of stopping the code with Ctrl + C / Pokaz zatrzymania kodu za pomocą Ctrl + C
04:28 The moment of checking the code's functionality / Moment sprawdzania działania kodu
05:00 Continuation of the script's operation / Kontynuacja działania skryptu
12:13 End of the code's operation and checking its functionality / Zakończenie działania kodu i sprawdzenie jego funkcjonowania
14:15 Demonstration of the "help" command / Pokaz działania komendy "help"
15:00 Operation of the second code on Linux with Docker Compose / Działanie drugiego kodu na Linuksie z Docker Compose
16:05 Demonstration of stopping the code with Ctrl + C / Pokaz zatrzymania kodu za pomocą Ctrl + C
16:35 The moment of checking the code's functionality / Moment sprawdzania działania kodu
17:25 Continuation of the script's operation / Kontynuacja działania skryptu
17:50 End of the code's operation and checking its functionality / Zakończenie działania kodu i sprawdzenie jego funkcjonowania
18:38 Demonstration of the "help" command / Pokaz działania komendy "help"
19:15 Operation of the first code on Windows / Działanie pierwszego kodu na Windows
19:29 Running the script using a bat file / Uruchamianie skryptu za pomocą pliku bat
21:05 Demonstration of stopping the code with Ctrl + C / Pokaz zatrzymania kodu za pomocą Ctrl + C
21:45 The moment of checking the code's functionality / Moment sprawdzania działania kodu
22:45 Continuation of the script's operation / Kontynuacja działania skryptu
25:30 Demonstration of stopping the code with Ctrl + C / Pokaz zatrzymania kodu za pomocą Ctrl + C
26:10 The moment of checking the code's functionality / Moment sprawdzania działania kodu
26:30 Continuation of the script's operation / Kontynuacja działania skryptu
32:15 End of the code's operation and checking its functionality / Zakończenie działania kodu i sprawdzenie jego funkcjonowania
33:10 Demonstration of the "help" command / Pokaz działania komendy "help"
34:05 Outro / Zakończenie

Video Demonstration

Watch the video demonstration of the second code on Windows »

Operation of the second code on Windows / Działanie drugiego kodu na Windows

Timestamps / Oznaczenia czasowe:
00:00 Operation of the second code on Windows / Działanie drugiego kodu na Windows
00:38 Demonstration of stopping the code with Ctrl + C / Pokaz zatrzymania kodu za pomocą Ctrl + C
01:11 The moment of checking the code's functionality / Moment sprawdzania działania kodu
01:40 Continuation of the script's operation / Kontynuacja działania skryptu
03:05 End of the code's operation and checking its functionality / Zakończenie działania kodu i sprawdzenie jego funkcjonowania
03:55 Demonstration of the "help" command / Pokaz działania komendy "help"
04:49 Outro / Zakończenie

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This repository contains web scraping projects designed to extract art-related data:

Carl-Heinz Kliemann Kunst Scraper

This project scrapes artwork data from Kunst-Archive.net, specifically focusing on the works of Carl-Heinz Kliemann.

Features:

  • Fetches Artwork Details: Retrieves information such as year, materials, dimensions, signature, and exhibition history.
  • Downloads Images: Downloads and saves artwork images locally.
  • Progress Saving: Saves scraped data in JSON format to allow resuming.
  • Interruption Handling: Gracefully handles interruptions (e.g., Ctrl+C) and saves progress.
  • Git Integration: Creates a .gitignore to prevent accidental image commits.
  • Comprehensive Logging: Logs the scraping process to a file and console.
  • Mimics Browser Behavior: Uses random User-Agents and simulates Google Analytics requests.
  • Asynchronous Operation: Uses asynchronous programming (httpx) for efficient network requests.

Limitations Under High Load:

  • Data Fetch Restrictions: When a high number of requests are made, the website may block data fetching.
  • Error Handling: A 500 error is displayed in the terminal when the limit is reached.
  • Docker Behavior: In Docker Compose, the service may terminate unexpectedly under such conditions.
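The retry behavior around that 500 limit can be sketched as follows. This is a minimal, dependency-free sketch: `fetch_with_retries` and the fixed `USER_AGENTS` list are illustrative stand-ins (the real scraper makes requests with httpx and generates agents with the user_agent library), with the network call passed in as a callable:

```python
import asyncio
import random

# Illustrative User-Agent pool; the real script generates these with the
# user_agent library rather than keeping a fixed list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_user_agent():
    """Pick a User-Agent so consecutive requests look like different browsers."""
    return random.choice(USER_AGENTS)

async def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Retry an async `fetch(url) -> (status, body)` callable on non-200
    responses, e.g. the 500 the site returns under high load."""
    for attempt in range(retries):
        status, body = await fetch(url)
        if status == 200:
            return body
        await asyncio.sleep(delay)  # back off before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```

Abstracting the network call this way also makes the retry logic easy to test without touching the real site.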

Franz Kline Artwork Scraper

This script retrieves artwork information from a website dedicated to Franz Kline. It uses Selenium for web interaction and handles interruptions gracefully, allowing for resuming of downloads and data extraction.

Features:

  • Configuration: The config.yaml file contains crucial settings such as URLs, headers, proxies, output paths, and logging preferences.
  • Logging: Uses Python's logging module to record events (info, errors, warnings) to data.log.
  • Output: Scraped data is saved to a JSON file (specified in config.yaml), and images are saved to a designated folder (also specified in config.yaml).
  • Interruption Handling: The script handles SIGINT (Ctrl+C) to save progress before exiting.
  • Resumption: Loads previously saved data from the JSON file to resume scraping from where it left off. Calculates the correct starting page based on previously scraped items.
  • Git Ignore: Creates a .gitignore file within the image folder to prevent images from being tracked by Git.
  • Headless Chrome: The script runs Chrome in headless mode, meaning the browser will not be visually displayed.
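The interruption handling and resumption described above can be sketched as follows. Function names here are illustrative (the script's own names appear in the Key Functions section), and the `progress.json` filename and the page arithmetic in `starting_page` are assumptions; in the real script the paths come from config.yaml:

```python
import json
import os
import signal
import sys

scraped_items = []  # artwork records accumulated during the run

def load_previous(path):
    """Resume support: return previously scraped items, or [] on a first run."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def starting_page(done_count, items_per_page):
    """First listing page that still has unscraped items (pages numbered from 1)."""
    return done_count // items_per_page + 1

def save_progress(path):
    """Dump everything scraped so far so a later run can pick up where this left off."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(scraped_items, f, ensure_ascii=False, indent=2)

def handle_sigint(signum, frame):
    """SIGINT (Ctrl+C): persist progress, then exit cleanly."""
    save_progress("progress.json")  # hypothetical path; the real one is configured
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)
```

On the next run, `load_previous` plus `starting_page` determine where scraping resumes, so already-downloaded items are not fetched again.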

Built With

  • Python
  • httpx
  • Beautiful Soup
  • tqdm
  • user_agent
  • urllib3
  • PyYAML
  • pythonjsonlogger
  • Docker
  • Requests
  • lxml
  • PySocks
  • Selenium

Getting Started

Prerequisites

  • Python 3.7+
  • pip

Installation

  1. Clone the repo:

    git clone https://git.wmi.amu.edu.pl/s498817/Projekt_Zaliczeniowy.git
    
    
  2. Navigate to the project directory:

    cd Projekt_Zaliczeniowy
    
  3. Choose the appropriate scraper directory:

     cd Carl_Heinz_Kliemann__Kunst_Scraper
    

    or

    cd Franz_Kline_Artwork_Scraper
    
  4. Install dependencies:

    pip install -r requirements.txt
    

Usage

Carl-Heinz Kliemann Kunst Scraper Usage

  1. Navigate to the project directory:

    cd Carl_Heinz_Kliemann__Kunst_Scraper
    
  2. Build the Docker image:

    docker-compose build
    
  3. Run the container in detached mode:

    docker-compose up -d
    
  4. Execute script (run):

    docker exec -it Carl_Heinz_Kliemann__Kunst_Scraper python Main_v3.py run
    
  5. Execute script (help):

    docker exec -it Carl_Heinz_Kliemann__Kunst_Scraper python Main_v3.py help
    

Running Directly (Without Docker)

  1. Navigate to the project directory:

    cd Carl_Heinz_Kliemann__Kunst_Scraper
    
  2. Install dependencies (if not already installed):

    pip install -r requirements.txt
    
  3. Run the script:

    python Main_v3.py run
    
  4. For help:

    python Main_v3.py help
    

Franz Kline Artwork Scraper Usage

Running Directly

  1. Navigate to the project directory:

     cd Franz_Kline_Artwork_Scraper
    
  2. Ensure you have Python and the required libraries installed. A config.yaml file should be present in the same directory as the script.

    pip install -r requirements.txt
    
  3. Run the scraper: python <script_name.py> (replace <script_name.py> with the actual name of the Python file, for example python scraper.py).
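The README does not list the keys inside config.yaml; a plausible layout, based on the settings it mentions (URLs, headers, proxies, output paths, logging), might look like this. All key names and values below are hypothetical:

```yaml
# Hypothetical config.yaml layout -- key names are illustrative, not the
# project's actual schema.
base_url: "https://example.com/franz-kline/artworks"  # listing page to scrape
headers:
  Accept-Language: "en-US,en;q=0.9"
proxy: null                    # optional, e.g. "socks5://localhost:9050"
output_json: "artworks.json"   # where scraped data is saved
image_folder: "images"         # where artwork images are downloaded
log_file: "data.log"           # logging output (see data.log above)
```

The script's load_config function reads this file with PyYAML at startup.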

Commands

  • run: Starts the scraping process.
  • help: Shows the help message with command and function details, plus Docker usage.
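The run/help command handling with argparse (mentioned under parse_arguments below) can be sketched like this; the sketch is illustrative, and the stub body of main stands in for the actual scraping loop:

```python
import argparse

def parse_arguments(argv=None):
    """Parse the positional command; defaults to 'help' when nothing is given."""
    parser = argparse.ArgumentParser(prog="Main_v3.py", add_help=False)
    parser.add_argument(
        "command",
        nargs="?",
        choices=["run", "help"],
        default="help",
        help="'run' starts the scraper, 'help' prints usage details",
    )
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_arguments(argv)
    if args.command == "run":
        # the real script would kick off the scraping loop here
        return "run"
    print("Usage: python Main_v3.py [run|help]")
    return "help"
```

With `nargs="?"` and a default, invoking the script with no arguments falls through to the help message instead of raising an error.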

Key Functions (Combined for Both Scrapers)

This section outlines the key functions used across both the Carl-Heinz Kliemann Kunst Scraper and Franz Kline Artwork Scraper projects. These functions handle various aspects of the scraping process, from fetching and parsing web pages to managing data and handling interruptions.

Shared by both scrapers:

  • setup_logging(): Configures and returns a logger for application messages, including errors, warnings, and informational messages.
  • handle_interrupt(...): Handles interruptions (e.g., Ctrl+C or SIGINT) gracefully, ensuring that any data being processed is saved before the script exits.
  • main(): Entry point of the application. In the Carl-Heinz scraper it also manages asyncio to speed up data fetching.
  • create_gitignore(...): Creates a .gitignore file in the specified folder to ignore all files except .gitignore itself (for excluding downloaded images from Git).
  • log_event(...): Logs events (success, failure, etc.) with details.

Carl-Heinz Kliemann Kunst Scraper:

  • fetch_html(url, ...): Asynchronously fetches the HTML content of a given URL using httpx. Handles retries and exceptions for robust network requests.
  • parse_html(html): Parses the HTML content using BeautifulSoup to extract relevant information about artworks.
  • process_art_page(...): Fetches and processes a single artwork page, extracting detailed information and downloading the associated image.
  • get_image_link(...): Extracts the image link from the artwork page.
  • download_image(...): Downloads and saves an image to the local filesystem.
  • save_to_json(...): Saves the scraped data to a JSON file. Handles data serialization and file writing.
  • create_images_directory(): Creates the directory for storing images if it doesn't exist.
  • send_google_analytics(...): Sends a simulated request to Google Analytics, mimicking a real user's interaction with the website.
  • interrupt_handler(...): Handles keyboard interruptions (Ctrl+C) gracefully, ensuring data is saved before exiting (the Carl-Heinz-specific counterpart of handle_interrupt).
  • print_help(): Prints the help message for the script, explaining available command-line arguments.
  • parse_arguments(): Parses command-line arguments using argparse.
  • fetch_pages_data(...): Fetches data from multiple pages concurrently, using asyncio to manage tasks.
  • process_page_and_artworks(...): Processes a single page of artworks, extracting the links and then processing each individual artwork page using process_art_page.
  • create_progress_bar(...): Creates a tqdm progress bar to visualize the scraping progress.
  • generate_random_user_agent(): Generates a random User-Agent string to mimic different browsers.

Franz Kline Artwork Scraper:

  • extract_section_data(...): Extracts data from specific sections of the artwork page (e.g., provenance, exhibitions, bibliography) using Selenium.
  • click_next_page_button(...): Clicks the "Next Page" button to navigate through multiple pages of artwork listings. Uses Selenium's WebDriverWait for robust element interaction.
  • save_on_interrupt(...): Saves the current progress (scraped data) to a JSON file when the script is interrupted (e.g., by Ctrl+C).
  • load_config(...): Loads configuration settings from a YAML file (config.yaml).
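The create_gitignore behavior described above (ignore everything in the image folder except the .gitignore itself) can be sketched like this; the sketch is illustrative rather than the project's exact implementation:

```python
import os

def create_gitignore(folder):
    """Create `folder` if needed, then write a .gitignore that excludes every
    file in it except .gitignore itself, so downloaded images stay out of Git."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, ".gitignore")
    with open(path, "w", encoding="utf-8") as f:
        f.write("*\n!.gitignore\n")  # ignore all, then re-include this file
    return path
```

The `*` pattern matches every file in the folder, and the `!.gitignore` negation re-includes the .gitignore file so the folder itself stays tracked.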

Roadmap

  • Develop Web Scrapers for Two Websites

    • Website 1 (Kunst-Archive.net):
      • Create scripts that extract information about artworks, including details and images, and store the data in JSON format.
      • Asynchronous Requests with httpx: Implement asynchronous requests using the httpx library to optimize data scraping.
      • Simulate Browser Behavior: Use random User-Agents and mimic requests to Google Analytics to avoid detection and blocking.
    • Website 2 (Franz Kline Artwork):
      • Create scripts that extract information about paintings and save the data in JSON format along with the images of the paintings.
      • Selenium Integration: Use Selenium to drive the web browser.
  • Implement Automatic .gitignore Generation

    • Add functionality to automatically generate a .gitignore file to manage unnecessary files in the repository, specifically ignoring downloaded images.
  • Implement Data Persistence and Resume Functionality

    • Save Progress on Interruption: Add functionality to save the current state of scraped data upon script interruption (e.g., Ctrl+C).
    • Resume from Saved State: Implement the ability to load previously saved data and resume scraping from the last processed page/item.
    • Limit Data Retrieval: If data already exists in the output file, avoid redownloading it.
  • Dockerization and Deployment

    • Utilize Docker Compose: Set up and manage the project environment using Docker Compose for better scalability and deployment (for the Carl-Heinz Kliemann Kunst Scraper).
    • Containerize Applications: Create a Docker container for the scraper to ensure a consistent execution environment (for the Carl-Heinz Kliemann Kunst Scraper).
  • Add Help and Run Commands

    • Implement help and run commands using argparse to provide guidance on usage and facilitate the execution of the scraping scripts.
  • Enhanced Error Handling and Logging

    • Detailed Logging: Implement comprehensive logging using the logging module to track script progress, errors, and warnings.
    • Robust Error Handling: Add error handling to gracefully manage exceptions during HTTP requests, parsing, and file operations.
  • Configuration Management

    • Centralized Configuration: Use a configuration file (e.g., config.yaml for the Franz Kline Artwork Scraper) to manage settings such as URLs, headers, proxies, output paths, and logging preferences.
  • Code Refactoring and Optimization

    • Modular Design: Refactor code into smaller, reusable functions/modules for better maintainability.
    • Performance Optimization: Continuously optimize code for better performance, considering factors like memory usage, network requests, and parsing efficiency.
  • Documentation

    • Comprehensive README: Create a detailed README file that explains how to set up, run, and use the scrapers, including Docker instructions.
    • Code Comments: Add clear and concise comments to the code to explain the purpose and logic of different sections.
    • Help Command Output: Improve the help command's output to provide more context and detailed usage instructions.
  • Add support for more art websites.

  • Implement more robust error handling.

  • Create a web interface for easier interaction.

Contact

Mykyta Kyslytsia (s498817) - mykkys@st.amu.edu.pl - mykyta.kyslytsia@gmail.com

Project Link: https://git.wmi.amu.edu.pl/s498817/Projekt_Zaliczeniowy

Acknowledgments