Information Extraction
2. Search engines: crawlers [lecture]
Filip Graliński (2021)
How do you build your own crawler?
Command-line tools
- wget
- curl
- aria2c
# Download recursively, limited to one level of recursion
! wget -r -l 1 https://laboratoria.wmi.amu.edu.pl/
--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving laboratoria.wmi.amu.edu.pl (laboratoria.wmi.amu.edu.pl)... 150.254.78.3
Connecting to laboratoria.wmi.amu.edu.pl (laboratoria.wmi.amu.edu.pl)|150.254.78.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6269 (6.1K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/index.html'
laboratoria.wmi.amu 100%[===================>] 6.12K --.-KB/s in 0.001s
2021-03-17 09:25:32 (4.19 MB/s) - 'laboratoria.wmi.amu.edu.pl/index.html' saved [6269/6269]

Loading robots.txt; please ignore errors.
--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/robots.txt
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 403 Forbidden
2021-03-17 09:25:32 ERROR 403: Forbidden.

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/page-resources/wmi.png
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 596 [image/png]
Saving to: 'laboratoria.wmi.amu.edu.pl/page-resources/wmi.png'
laboratoria.wmi.amu 100%[===================>] 596 --.-KB/s in 0s
2021-03-17 09:25:32 (53.7 MB/s) - 'laboratoria.wmi.amu.edu.pl/page-resources/wmi.png' saved [596/596]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/css/labs.css
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 6919 (6.8K) [text/css]
Saving to: 'laboratoria.wmi.amu.edu.pl/css/labs.css'
laboratoria.wmi.amu 100%[===================>] 6.76K --.-KB/s in 0s
2021-03-17 09:25:32 (18.5 MB/s) - 'laboratoria.wmi.amu.edu.pl/css/labs.css' saved [6919/6919]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/en/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 5946 (5.8K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/en/index.html'
laboratoria.wmi.amu 100%[===================>] 5.81K --.-KB/s in 0.002s
2021-03-17 09:25:32 (3.04 MB/s) - 'laboratoria.wmi.amu.edu.pl/en/index.html' saved [5946/5946]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/page-resources/wmi_transparent.png
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 15034 (15K) [image/png]
Saving to: 'laboratoria.wmi.amu.edu.pl/page-resources/wmi_transparent.png'
laboratoria.wmi.amu 100%[===================>] 14.68K --.-KB/s in 0.005s
2021-03-17 09:25:32 (2.62 MB/s) - 'laboratoria.wmi.amu.edu.pl/page-resources/wmi_transparent.png' saved [15034/15034]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/godziny-otwarcia/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 5317 (5.2K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/godziny-otwarcia/index.html'
laboratoria.wmi.amu 100%[===================>] 5.19K --.-KB/s in 0s
2021-03-17 09:25:32 (87.9 MB/s) - 'laboratoria.wmi.amu.edu.pl/godziny-otwarcia/index.html' saved [5317/5317]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/kontakt/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 4644 (4.5K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/kontakt/index.html'
laboratoria.wmi.amu 100%[===================>] 4.54K --.-KB/s in 0s
2021-03-17 09:25:32 (142 MB/s) - 'laboratoria.wmi.amu.edu.pl/kontakt/index.html' saved [4644/4644]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/pierwsze-kroki/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 6639 (6.5K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/pierwsze-kroki/index.html'
laboratoria.wmi.amu 100%[===================>] 6.48K --.-KB/s in 0.002s
2021-03-17 09:25:32 (3.61 MB/s) - 'laboratoria.wmi.amu.edu.pl/pierwsze-kroki/index.html' saved [6639/6639]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/przewodnik/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 5454 (5.3K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/przewodnik/index.html'
laboratoria.wmi.amu 100%[===================>] 5.33K --.-KB/s in 0.002s
2021-03-17 09:25:32 (2.97 MB/s) - 'laboratoria.wmi.amu.edu.pl/przewodnik/index.html' saved [5454/5454]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/regulamin-laboratoriow-komputerowych/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 14393 (14K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/regulamin-laboratoriow-komputerowych/index.html'
laboratoria.wmi.amu 100%[===================>] 14.06K --.-KB/s in 0.005s
2021-03-17 09:25:32 (2.65 MB/s) - 'laboratoria.wmi.amu.edu.pl/regulamin-laboratoriow-komputerowych/index.html' saved [14393/14393]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/nie-odpowiadamy/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 4481 (4.4K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/nie-odpowiadamy/index.html'
laboratoria.wmi.amu 100%[===================>] 4.38K --.-KB/s in 0s
2021-03-17 09:25:32 (101 MB/s) - 'laboratoria.wmi.amu.edu.pl/nie-odpowiadamy/index.html' saved [4481/4481]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/laboratoria/oprogramowanie/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 12821 (13K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/laboratoria/oprogramowanie/index.html'
laboratoria.wmi.amu 100%[===================>] 12.52K --.-KB/s in 0.004s
2021-03-17 09:25:32 (2.93 MB/s) - 'laboratoria.wmi.amu.edu.pl/laboratoria/oprogramowanie/index.html' saved [12821/12821]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/uslugi/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 10688 (10K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/uslugi/index.html'
laboratoria.wmi.amu 100%[===================>] 10.44K --.-KB/s in 0.004s
2021-03-17 09:25:32 (2.74 MB/s) - 'laboratoria.wmi.amu.edu.pl/uslugi/index.html' saved [10688/10688]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/uslugi-uniwersyteckie/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 4240 (4.1K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/uslugi-uniwersyteckie/index.html'
laboratoria.wmi.amu 100%[===================>] 4.14K --.-KB/s in 0.001s
2021-03-17 09:25:32 (3.27 MB/s) - 'laboratoria.wmi.amu.edu.pl/uslugi-uniwersyteckie/index.html' saved [4240/4240]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/problemy/docker/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 6326 (6.2K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/problemy/docker/index.html'
laboratoria.wmi.amu 100%[===================>] 6.18K --.-KB/s in 0s
2021-03-17 09:25:32 (182 MB/s) - 'laboratoria.wmi.amu.edu.pl/problemy/docker/index.html' saved [6326/6326]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/serwery-terminalowe/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 382 [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/serwery-terminalowe/index.html'
laboratoria.wmi.amu 100%[===================>] 382 --.-KB/s in 0s
2021-03-17 09:25:32 (15.9 MB/s) - 'laboratoria.wmi.amu.edu.pl/serwery-terminalowe/index.html' saved [382/382]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/vpn/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 334 [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/vpn/index.html'
laboratoria.wmi.amu 100%[===================>] 334 --.-KB/s in 0s
2021-03-17 09:25:32 (16.3 MB/s) - 'laboratoria.wmi.amu.edu.pl/vpn/index.html' saved [334/334]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/a126
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://laboratoria.wmi.amu.edu.pl/a126/ [following]
--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/a126/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 3671 (3.6K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/a126'
laboratoria.wmi.amu 100%[===================>] 3.58K --.-KB/s in 0s
2021-03-17 09:25:32 (194 MB/s) - 'laboratoria.wmi.amu.edu.pl/a126' saved [3671/3671]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/irc
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://laboratoria.wmi.amu.edu.pl/irc/ [following]
--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/irc/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 3946 (3.9K) [text/html]
Saving to: 'laboratoria.wmi.amu.edu.pl/irc'
laboratoria.wmi.amu 100%[===================>] 3.85K --.-KB/s in 0s
2021-03-17 09:25:32 (243 MB/s) - 'laboratoria.wmi.amu.edu.pl/irc' saved [3946/3946]

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/godziny-otwarcia
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://laboratoria.wmi.amu.edu.pl/godziny-otwarcia/ [following]
--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/godziny-otwarcia/
Reusing existing connection to laboratoria.wmi.amu.edu.pl:443.
HTTP request sent, awaiting response... 200 OK
Length: 5317 (5.2K) [text/html]
laboratoria.wmi.amu.edu.pl/godziny-otwarcia: Is a directory
Cannot write to 'laboratoria.wmi.amu.edu.pl/godziny-otwarcia' (Is a directory).

--2021-03-17 09:25:32-- https://laboratoria.wmi.amu.edu.pl/js/fix.js
Connecting to laboratoria.wmi.amu.edu.pl (laboratoria.wmi.amu.edu.pl)|150.254.78.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62 [application/javascript]
Saving to: 'laboratoria.wmi.amu.edu.pl/js/fix.js'
laboratoria.wmi.amu 100%[===================>] 62 --.-KB/s in 0s
2021-03-17 09:25:32 (6.51 MB/s) - 'laboratoria.wmi.amu.edu.pl/js/fix.js' saved [62/62]

FINISHED --2021-03-17 09:25:32--
Total wall clock time: 0.3s
Downloaded: 20 files, 115K in 0.03s (4.14 MB/s)
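The depth limit set with -l 1 can be pictured as a breadth-first traversal that stops following links after one hop from the start page. Below is a minimal sketch in Python using only the standard library; it is not how wget is implemented. The names crawl and LinkCollector, and the in-memory FAKE_SITE dictionary that stands in for real HTTP requests, are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl of one host, following links at most max_depth hops.

    fetch is any callable mapping a URL to HTML text (or None if unavailable).
    Returns the visited URLs in the order they were fetched.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from this page
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# Hypothetical in-memory site used instead of real network access.
FAKE_SITE = {
    "https://example.org/": '<a href="/a/">A</a> <a href="/b/">B</a>',
    "https://example.org/a/": '<a href="/a/deep/">deep</a>',
    "https://example.org/b/": '<a href="https://other.example/">external</a>',
    "https://example.org/a/deep/": "never reached with max_depth=1",
}

pages = crawl("https://example.org/", FAKE_SITE.get, max_depth=1)
print(pages)
```

To crawl a real site, fetch could be replaced with a function that performs an HTTP request and returns the response body; a real crawler would also need politeness delays and robots.txt handling, which this sketch leaves out.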
# aria2c makes it easy to download a list of URLs; specific options can be set for each URL
! (cd aria2c-example && cat aria.in)
! (cd aria2c-example && aria2c -i aria.in)
http://www.almanachmuszyny.pl/spisy/1991/AM1991_02_muszynski_zamek_prawda_i_legenda.pdf
  out=1991-1.pdf
http://www.almanachmuszyny.pl/spisy/1991/AM1991_03_muszyna_miasteczko_historyczne.pdf
  out=1991-2.pdf

03/17 09:31:54 [NOTICE] Downloading 2 item(s)
03/17 09:31:55 [NOTICE] Download complete: /home/filipg/ext/amu/aitech-eks/wyk/aria2c-example/1991-1.pdf
03/17 09:31:55 [NOTICE] Download complete: /home/filipg/ext/amu/aitech-eks/wyk/aria2c-example/1991-2.pdf

Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
3bf8a7|OK  |   458KiB/s|/home/filipg/ext/amu/aitech-eks/wyk/aria2c-example/1991-1.pdf
e0c4c1|OK  |   677KiB/s|/home/filipg/ext/amu/aitech-eks/wyk/aria2c-example/1991-2.pdf

Status Legend:
(OK):download completed.
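In an aria2c input file, each URI sits on its own line, and the options for that URI follow on lines that begin with whitespace, one option per line. A small sketch of generating such a file from Python; the helper name build_aria2_input and the example.org URLs are hypothetical.

```python
def build_aria2_input(downloads):
    """Render (url, options) pairs in aria2c's input-file format.

    Each URL goes on its own line; its options follow on indented lines,
    one option per line.
    """
    lines = []
    for url, options in downloads:
        lines.append(url)
        for key, value in options.items():
            lines.append(f"  {key}={value}")
    return "\n".join(lines) + "\n"


# Hypothetical URLs; the "out" option sets the output file name.
text = build_aria2_input([
    ("http://example.org/files/first.pdf", {"out": "1991-1.pdf"}),
    ("http://example.org/files/second.pdf", {"out": "1991-2.pdf"}),
])
print(text, end="")
```

Written to a file such as aria.in, the result could then be passed to aria2c with the -i option, as in the command above.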
Libraries/frameworks for building crawlers
Python
Useful libraries:
- urllib
- requests
- Beautiful Soup (for parsing HTML)
import urllib
import requests
from bs4 import BeautifulSoup
url = 'https://laboratoria.wmi.amu.edu.pl/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# extract all links (A elements)
links = soup.find_all('a')
print([(link['href'], link.get_text()) for link in links])
[('/en/', 'English'), ('/', '\n\n Laboratoria Komputerowe\n '), ('/', 'Strona główna'), ('/godziny-otwarcia/', 'Godziny otwarcia'), ('/kontakt/', 'Kontakt'), ('/pierwsze-kroki/', 'Pierwsze kroki'), ('/przewodnik/', 'Przewodnik po stronie'), ('/regulamin-laboratoriow-komputerowych/', 'Regulamin Wydziałowych Laboratoriów Komputerowych'), ('/nie-odpowiadamy/', 'Za co nie odpowiadamy'), ('/laboratoria/oprogramowanie/', 'Laboratoria'), ('/uslugi/', 'Usługi'), ('/uslugi-uniwersyteckie/', 'Usługi Uniwersyteckie'), ('/problemy/docker/', 'Problemy'), ('/serwery-terminalowe/', 'serwera terminalowego'), ('/vpn/', 'VPN'), ('https://help.wmi.amu.edu.pl/', 'https://help.wmi.amu.edu.pl/'), ('/a126', 'A1-26'), ('https://help.wmi.amu.edu.pl/', 'System helpdeskowy'), ('mailto:helpdesk@wmi.amu.edu.pl', 'helpdesk@wmi.amu.edu.pl'), ('/irc', 'users'), ('https://www.facebook.com/wmilabs/', 'Facebook'), ('/godziny-otwarcia', 'Godziny otwarcia')]
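Most of the href values printed above are relative to the site root, so a crawler has to resolve them against the page URL before it can fetch them. A minimal sketch using urllib.parse from the standard library; the sample hrefs are taken from the output above, and the helper name internal_links is an assumption.

```python
from urllib.parse import urljoin, urlparse

base_url = 'https://laboratoria.wmi.amu.edu.pl/'
hrefs = ['/en/', '/godziny-otwarcia', 'https://help.wmi.amu.edu.pl/',
         'mailto:helpdesk@wmi.amu.edu.pl', '/godziny-otwarcia']


def internal_links(base, hrefs):
    """Resolve hrefs against base; keep unique http(s) links on the same host."""
    host = urlparse(base).netloc
    result = []
    for href in hrefs:
        absolute = urljoin(base, href)
        parts = urlparse(absolute)
        if parts.scheme in ('http', 'https') and parts.netloc == host:
            if absolute not in result:
                result.append(absolute)
    return result


print(internal_links(base_url, hrefs))
```

The mailto: address and the link to a different host are dropped, and the duplicated link is kept only once.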
XPath
XPath is a language for addressing parts of an XML document.
- /html/body/div/p: the full path to all paragraphs inside top-level <DIV> elements
- //div/p: all paragraphs inside any <DIV> elements
- //a/@href: the values of the href attribute for all links
- //p[@id='foo']/img[5]: the fifth (indexing starts at 1!) image inside the paragraph with the identifier foo
- //p[img]/a: links in paragraphs that contain an image
How do these differ:
- //img[3] from (//img)[3]?
- //p[img]/a from //p[//img]/a?
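Both comparisons can be checked experimentally with lxml. The following is a small sketch on a made-up document; the markup, file names, and variable names are invented for illustration.

```python
from lxml import etree

# Made-up document: the first paragraph has two images, the second has three
# images and a link, the third has only a link.
doc = etree.fromstring("""
<html><body>
  <p><img src="a1.png"/><img src="a2.png"/></p>
  <p><img src="b1.png"/><img src="b2.png"/><img src="b3.png"/><a href="/with-img">x</a></p>
  <p><a href="/no-img">y</a></p>
</body></html>
""")

# //img[3]: each img that is the third img child of its own parent
third_in_parent = doc.xpath("//img[3]/@src")
# (//img)[3]: the third img in the whole document, in document order
third_overall = doc.xpath("(//img)[3]/@src")
# //p[img]/a: links in paragraphs that have an img child
links_img_child = doc.xpath("//p[img]/a/@href")
# //p[//img]/a: the predicate //img starts at the document root, so it holds
# for every paragraph as soon as the document contains any img at all
links_any_img = doc.xpath("//p[//img]/a/@href")

print(third_in_parent, third_overall)
print(links_img_child, links_any_img)
```

In this document the first pair selects different images, and the second pair differs in that the link from the image-free paragraph is matched only by the variant with //img in the predicate.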
from urllib.request import urlopen
from lxml import etree
url = 'https://laboratoria.wmi.amu.edu.pl/'
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# links from the sidebar panel
links = tree.xpath("//div[@class='sidebar-menu']//a/@href")
print(links)
['/', '/godziny-otwarcia/', '/kontakt/', '/pierwsze-kroki/', '/przewodnik/', '/regulamin-laboratoriow-komputerowych/', '/nie-odpowiadamy/', '/laboratoria/oprogramowanie/', '/uslugi/', '/uslugi-uniwersyteckie/', '/problemy/docker/']
How do you deal with dynamic pages?
HtmlUnit
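import java.io.InputStream;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Log in through the third form on the page.
        HtmlPage page = webClient.getPage("http://ceti.pl/?ceti=administracja");
        HtmlForm form = page.getForms().get(2);
        HtmlTextInput loginField = form.getInputByName("login");
        loginField.setValueAttribute("atrapa");
        HtmlPasswordInput passField = form.getInputByName("pass");
        passField.setValueAttribute("haslo1");
        HtmlImageInput button = form.getInputByValue("OK");
        HtmlPage page2 = (HtmlPage)button.click();
        // The same client keeps its session cookies for further requests.
        HtmlPage page3 = webClient.getPage("https://tau4.ceti.pl/cgi-bin/logs-user-show.cgi");
        System.out.println(page3.asXml());
        // A non-HTML response (here a ZIP archive) is returned as UnexpectedPage.
        UnexpectedPage page4 = webClient.getPage("https://adm.tau4.ceti.pl/logs.zip");
        InputStream istr = page4.getInputStream();
    }
}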
Selenium
# the Selenium server must be started beforehand
# wget https://selenium-release.storage.googleapis.com/3.141/selenium-server-standalone-3.141.59.jar
# java -jar selenium-server-standalone-3.141.59.jar
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
driver = webdriver.Remote(
command_executor='http://127.0.0.1:4444/wd/hub',
desired_capabilities=DesiredCapabilities.CHROME)
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("list")
elem.send_keys(Keys.RETURN)
links = driver.find_elements(By.XPATH, '//h3/a')
print([l.get_attribute('href') for l in links])
driver.close()
['https://www.python.org/community/sigs/guidelines', 'https://www.python.org/dev/peps/pep-0585/', 'https://www.python.org/community/lists', 'https://www.python.org/doc/essays/list2str', 'https://www.python.org/dev/core-mentorship', 'https://www.python.org/dev/peps/pep-3128/', 'https://www.python.org/dev/peps/pep-0204/', 'https://www.python.org/community/sigs/coordination', 'https://www.python.org/psf/committees', 'https://www.python.org/dev/peps/pep-0225/', 'https://www.python.org/dev/peps/pep-3132/', 'https://www.python.org/community/sigs/current/doc-sig/stext', 'https://www.python.org/dev/peps/pep-0202/', 'https://www.python.org/dev/peps/pep-0274/', 'https://www.python.org/dev/peps/pep-0469/', 'https://www.python.org/dev/peps/pep-0289/', 'https://www.python.org/dev/peps/pep-0270/', 'https://www.python.org/community/sigs/retired/string-sig', 'https://www.python.org/community/sigs/retired/progenv-sig', 'https://www.python.org/psf/records/board/minutes/2005-02-08']
Haskell and arrows
In Haskell, crawlers can be built with the HXT library, which is based on the formalism of arrows.