docs: remove content from README.md
Some checks failed: Docker Image CI / build (push) has been cancelled
Need to write Documentation

Signed-off-by: paprykdev <58005447+paprykdev@users.noreply.github.com>
parent 634d8ae7fa
commit 74bcd14d41

README.md: 67 lines changed
@@ -4,71 +4,4 @@

This project is a web scraper designed to extract data from websites.
## Features

☑️ Extracts data from web pages
## Usage

### With Docker
1. Clone the repository:

```bash
git clone https://git.wmi.amu.edu.pl/s500042/webscraper
```
2. Navigate to the project directory:

```bash
cd webscraper
```
3. Build the Docker image and run it using the `start.py` script:

```bash
python scripts/start.py
```
On Mac, you'll have to use:

```bash
python3 scripts/start.py
```
4. Check the `/app/dist/data.json` file to see the extracted data.
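To sanity-check the results, you can pretty-print that file with a few lines of Python. A minimal sketch, assuming the container's output ends up at `app/dist/data.json` on the host and holds a JSON array (both are assumptions; the exact host-side path depends on how the Docker volume is mapped):

```python
# check_output.py - hypothetical helper, not part of the repository
import json
from pathlib import Path

# Assumed host-side location of the scraper output; adjust if your
# container maps /app/dist somewhere else.
data_path = Path("app/dist/data.json")

# Assumes the file holds a JSON array of records.
data = json.loads(data_path.read_text(encoding="utf-8"))
print(f"{len(data)} records extracted")

# Show the first record, if any, to confirm the structure looks right.
if data:
    print(json.dumps(data[0], indent=2, ensure_ascii=False))
```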
### Without Docker
1. Clone the repository:

```bash
git clone https://git.wmi.amu.edu.pl/s500042/webscraper
```
2. Install the required dependencies:

```bash
pip install -r app/requirements.txt
```
If you're on Arch Linux, you'll need to create a virtual environment.
Here's a [step-by-step guide](#) that will help you create it.
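Until that guide is written, here is a minimal sketch of the idea using Python's standard-library `venv` module (the `.venv` directory name is an arbitrary choice; running `python -m venv .venv` from a shell is equivalent):

```python
# make_venv.py - hypothetical helper; same effect as `python -m venv .venv`
import venv

# Create an isolated environment in ./.venv with pip available, so the
# packages from app/requirements.txt don't touch the system Python.
venv.create(".venv", with_pip=True)

# Afterwards, activate it from your shell (e.g. `source .venv/bin/activate`)
# and re-run: pip install -r app/requirements.txt
```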
3. Run the `run_with_no_docker.py` script:

```bash
python scripts/run_with_no_docker.py
```
On Mac, you'll need to use:

```bash
python3 scripts/run_with_no_docker.py
```
4. Check the `/app/dist/data.json` file to see the extracted data.
## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
```diff
@@ -1,7 +1,4 @@
 import time
-import requests
-import os
-import json
 from playwright.async_api import async_playwright
 import asyncio
 
```
```diff
@@ -11,6 +8,7 @@ NOTE:
 Some pages don't have info about paintings, so we need to skip them
 """
 
+
 class NoguchiScraper:
     def __init__(self, url="https://archive.noguchi.org/Browse/CR", base_url="https://archive.noguchi.org"):
         self.hrefs = []
```
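The NOTE in that docstring is the one behavioral hint in this hunk: some catalogue pages carry no painting details and must be skipped. A hypothetical sketch of such a guard, written as a method of `NoguchiScraper` (the `.painting-info` selector and the method name are invented for illustration; the real skip logic is not shown in this diff):

```python
# Hypothetical skip guard; ".painting-info" is an invented selector.
async def scrape_page(self):
    info = await self.page.query_selector(".painting-info")
    if info is None:
        return None  # no painting info on this page: skip it, per the NOTE
    return await info.inner_text()
```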
```diff
@@ -36,9 +34,6 @@ class NoguchiScraper:
         element = await self.find_el('a.acceptCookie')
         await element.click()
 
-    async def insert_value(self, selector, value):
-        await self.page.fill(selector, value)
-
     async def find_el(self, selector: str):
         await self.wait_for_el(selector)
         return await self.page.query_selector(selector)
```
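For context on the scraper code above, here is a minimal sketch of the async Playwright pattern the class is built around. Only the archive URL, the `a.acceptCookie` selector, and the wait-then-query idiom of `find_el` come from the diff; the driver itself is an assumption, not the repository's actual entry point:

```python
# sketch.py - hypothetical driver illustrating the pattern used by
# NoguchiScraper; the class's real entry point is not shown in this diff.
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://archive.noguchi.org/Browse/CR")

        # Same wait-then-query idiom as find_el() above: block until the
        # selector exists, then grab the element handle.
        await page.wait_for_selector("a.acceptCookie")
        element = await page.query_selector("a.acceptCookie")
        if element:
            await element.click()  # dismiss the cookie banner, as in the diff

        await browser.close()


asyncio.run(main())
```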