{
"cells": [
{
"cell_type": "markdown",
"id": "competitive-desire",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Computer-Assisted Translation </h1>\n",
"<h2> 9, 10. <i>Web scraping</i> [lab sessions]</h2>\n",
"<h3>Rafał Jaworski (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"id": "hungarian-davis",
"metadata": {},
"source": [
"As we well know, linguistic resources play a major role in computer-assisted translation and in natural language processing. These include parallel corpora (translation memories), monolingual corpora, and dictionaries. Sometimes these resources are not available for the language we want to work on."
]
},
{
"cell_type": "markdown",
"id": "featured-moisture",
"metadata": {},
"source": [
"Even then there is a way out: we can use resources that are publicly available on the Internet. In today's class we will practice techniques for downloading text from web pages."
]
},
{
"cell_type": "markdown",
"id": "underlying-isaac",
"metadata": {},
"source": [
"The code below downloads the content of a page (as HTML, into a variable) and searches that page for specific elements. Before running it, install the `requests` and BeautifulSoup modules:\n",
"`pip3 install requests beautifulsoup4`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "revolutionary-trust",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nastolatek ukradł znajomemu 4500 złotych. Wcześniej pił z nim alkohol\n",
"Czekają nas kolejne podwyżki rachunków. Tym razem za ogrzewanie i ciepłą wodę\n",
"Nie żyje Piotr Ś. Czyściciel kamienic miał 47 lat\n",
"Maciej Skorża nie zmienił zdania o systemie na mecz z Rakowem. Kolejorz ma szybką okazję do rehabilitacji\n",
"Kto zabił Kazimierę Kurkowiak? Poznańskie Archiwum X wraca do sprawy sprzed 30 lat\n",
"Mieszkańcy osiedla Kwiatowego zyskają nowy chodnik\n",
"Poznańskie ZOO ponownie się otwiera i apeluje o kupowanie biletów online\n",
"1700 zł mandatu dla motocyklisty: nie ma prawa jazdy, jechał za szybko\n",
"Plac Wolności ma tętnić życiem. Jest koncepcja zagospodarowania\n",
"Dzikie wysypisko w Wielkopolskim Parku Narodowym, a w nim paczka z telefonem odbiorcy\n",
"Dobre wieści z Łazarza! \"Zielona Perła\" sprzedana!\n",
"Sokoły wędrowne w gnieździe na kominie poznańskiej elektrociepłowni! Są 4 młode\n",
"720 nowych zakażeń w Wielkopolsce\n",
"Uderzył kobietę w sklepie: \"sprawca będzie rozliczony\"\n",
"Zespół Szkół Geodezyjno- Drogowych. Przyszłość rysuje się w kolorowych barwach!\n",
"Tajemniczy wypadek i pożar pod Kwilczem. Auto spłonęło, w środku nikogo nie było\n",
"Nad Jeziorem Maltańskim powstanie duży hotel? \"Ma uzupełniać infrastrukturę sportową\"\n",
"Śmiertelny wypadek na trasie S8: samochód potrącił rowerzystę\n",
"Specjaliści o poszukiwaniu Natalii Lick: \"niestety trop psa prowadził na Wartostradę\"\n",
"Korki przy skrzyżowaniu Grochowska / Grunwaldzka: ruszyły prace!\n",
"Restauracja w Kaliszu przyjmuje klientów: sanepid i policja \"odwiedzili\" lokal\n",
"Ile kosztuje wywóz odpadów?\n",
"Dachowanie auta na trasie Konin - Turek\n",
"Kierowca BMW pod wpływem narkotyków, pasażer w ich posiadaniu. Obaj zostali zatrzymani\n",
"Leszno: mężczyzna uderzył klientkę sklepu. Poszło o maseczkę?\n",
"Od poniedziałku zapłacimy za parkowanie na kolejnych ulicach\n",
"Włamał się do obiektu handlowego. Grozi mu nawet 15 lat więzienia\n",
"Rondo Śródka: kolizja z udziałem dwóch pojazdów\n",
"Europoseł PSL: oświadczenie Episkopatu ma wpływ na proces szczepień. \"Bardzo dużo ludzi zrezygnowało\"\n",
"Bezcenna wygrana Enea Energetyka. Poznanianki zagrają w fazie play-off\n",
"No to w drogę! Po odmienionych trasach w Wielkopolsce\n"
]
}
],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"url = 'https://epoznan.pl'\n",
"\n",
"# Download the page; fail early on a timeout or an HTTP error status.\n",
"page = requests.get(url, timeout=10)\n",
"page.raise_for_status()\n",
"soup = BeautifulSoup(page.content, 'html.parser')\n",
"\n",
"# Headlines are <h3> elements with the class 'postItem__title'.\n",
"headers = soup.find_all('h3', {'class': 'postItem__title'})\n",
"\n",
"print('\\n'.join(header.get_text() for header in headers))"
]
},
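{
"cell_type": "markdown",
"id": "sketch-find-all",
"metadata": {},
"source": [
"To see how the element search works without depending on a live site, the sketch below runs the same kind of `find_all` query on a small hand-written HTML fragment. The fragment and the helper name `extract_titles` are made up for illustration; only the tag `h3` and the class `postItem__title` come from the cell above. Passing `strip=True` to `get_text()` is an optional extra that trims surrounding whitespace."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-find-all-code",
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical HTML fragment imitating the structure searched above.\n",
"SAMPLE_HTML = '''\n",
"<div>\n",
"  <h3 class=\"postItem__title\">First headline</h3>\n",
"  <h3 class=\"other\">Not a headline</h3>\n",
"  <h3 class=\"postItem__title\"> Second headline </h3>\n",
"</div>\n",
"'''\n",
"\n",
"\n",
"def extract_titles(html):\n",
"    \"\"\"Return the stripped text of every <h3> with class postItem__title.\"\"\"\n",
"    soup = BeautifulSoup(html, 'html.parser')\n",
"    headers = soup.find_all('h3', class_='postItem__title')\n",
"    return [header.get_text(strip=True) for header in headers]\n",
"\n",
"\n",
"print(extract_titles(SAMPLE_HTML))"
]
},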
{
"cell_type": "markdown",
"id": "dental-combination",
"metadata": {},
"source": [
"### Exercise 1: Write a function that retrieves product names from Ceneo.pl. The product type (e.g. TV set, washing machine, laptop) is a parameter of the function. Fetching data from the first page of search results is enough."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "moving-clothing",
"metadata": {},
"outputs": [],
"source": [
"def get_names(article_type):\n",
"    # TODO (Exercise 1): return the product names from the first results page.\n",
"    return []"
]
},
{
"cell_type": "markdown",
"id": "mechanical-produce",
"metadata": {},
"source": [
"This is how we download data from a single page. Nothing, however, prevents us from simulating a switch between pages."
]
},
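{
"cell_type": "markdown",
"id": "sketch-pagination",
"metadata": {},
"source": [
"Simulating page switching usually comes down to generating one URL per page and repeating the same request for each of them. The sketch below only builds the URLs. The base address `https://example.com/search`, the query parameter `q` and the page parameter `page` are assumptions made for illustration, not the scheme of any particular site; the real pattern has to be read from the address bar of the site being scraped."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-pagination-code",
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urlencode\n",
"\n",
"\n",
"def build_page_urls(base_url, query, pages, page_param='page'):\n",
"    \"\"\"Return one search URL per page number, from 1 to `pages`.\n",
"\n",
"    The parameter names 'q' and `page_param` are hypothetical.\n",
"    \"\"\"\n",
"    urls = []\n",
"    for page_number in range(1, pages + 1):\n",
"        params = urlencode({'q': query, page_param: page_number})\n",
"        urls.append(f'{base_url}?{params}')\n",
"    return urls\n",
"\n",
"\n",
"for page_url in build_page_urls('https://example.com/search', 'laptop', 3):\n",
"    print(page_url)"
]
},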
{
"cell_type": "markdown",
"id": "legitimate-corrections",
"metadata": {},
"source": [
"### Exercise 2: Observe how the page URL changes as you move through successive pages of search results on Ceneo.pl. Use this information to run get_names() on more than one page of results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "german-dispute",
"metadata": {},
"outputs": [],
"source": [
"def scrape_names():\n",
"    # TODO (Exercise 2): collect names from more than one page of results.\n",
"    return []"
]
},
{
"cell_type": "markdown",
"id": "discrete-durham",
"metadata": {},
"source": [
"Downloading content from the Internet is a particularly effective way to obtain large amounts of text. The code fragment below downloads all of the text from a page."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "premium-button",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Yahoo Make Yahoo Your HomepageDiscover something new every day from News, Sports, Finance, Entertainment and more! HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SHOPPING YAHOO PLUS MORE... Download the Yahoo Home app Yahoo Home Search query Sign in Mail Sign in to view your mail Mail Mail COVID-19 COVID-19 News News Finance Finance Sports Sports Entertainment Entertainment Life Life Shopping Shopping Yahoo Plus Yahoo Plus More... More... Follow live:Closing arguments begin for Derek Chauvin's murder trial in the death of George Floyd 5 people in hospital after shooting in Louisiana One victim was shot in the head, and another suffered multiple gunshot wounds, according to local news outlet.Multiple police units dispatched to scene »2 dead in crash of Tesla with 'no one' drivingMall shooter, 16, faces 1st-degree murder charge'80s pop star rips 'Simpsons' for 'hateful' parodyConspiracy theorist Alex Jones faces a reckoningPig's head left at former home of Chauvin trial witness U.S.HuffPostFirst-Ever Wild Wolf Collar Camera Shows What They Really Do All Day LongThis canine's favorite meal might surprise you. Thanks for your feedback! CelebrityThe TelegraphRobert De Niro unable to turn down acting roles because of his ‘estranged wife's expensive lifestyle’Hollywood legend Robert De Niro is unable to turn down acting roles because he must pay for his estranged wife's expensive tastes, the actor's lawyer has claimed. Caroline Krauss told a Manhattan court that he is struggling financially because of the pandemic, a massive tax bill and the demands of Grace Hightower, who filed for divorce in 2018 after 21 years of marriage. The court has been asked to settle how much De Niro should pay Ms Hightower, 66, until the terms of the prenuptial agreement the couple negotiated in 2004 takes effect. “Mr De Niro is 77 years old, and while he loves his craft, he should not be forced to work at this prodigious pace because he has to,” Ms Krauss told the court. 
“When does that stop? When does he get the opportunity to not take every project that comes along and not work six-day weeks, 12-hour days so he can keep pace with Ms Hightower’s thirst for Stella McCartney?” Thanks for your feedback! U.S.Associated PressCouple: Man has tossed used cups in their yard for 3 yearsAn upstate New York couple may have finally solved the mystery of who's been tossing used coffee cups in their front yard for nearly three years. Edward and Cheryl Patton told The Buffalo News they tried mounting a camera in a tree in front of their home in Lake View to catch the phantom litterer. After Edward Patton called police, they waited and pulled over a vehicle driven by 76-year-old Larry Pope, who Cheryl Patton said had once worked with her and had had disagreements with her over union issues. Thanks for your feedback! U.S.INSIDERA leading conspiracy theorist who thought COVID-19 was a hoax died from the virus after hosting illegal house partiesA high-profile conspiracy theorist from Norway, who shared false information about the pandemic online, has died from COVID-19, officials say. Thanks for your feedback! PoliticsThe WeekOne America News Network producer says 'majority' of employees didn't believe reports on voter fraud claimsMarty Golingan, a producer at One America News Network, a right-wing cable news channel often noted for its affinity for former President Donald Trump, told The New York Times he was worried his work may have helped inspire the Jan. 6 Capitol riot. At one point during the incident, Golingan said he caught sight of someone in the mob holding a flag with OAN's logo. \"I was like, OK, that's not good. That's what happens when people listen to us,\" he told the Times, referring to OAN's coverage of the 2020 presidential election, which often gave credence to Trump's unfounded claims of widespread voter fraud and Democratic conspiracies. Golingan said that many of his colleagues, including himself, disagreed with the coverage. 
\"The majority of people did not believe the voter fraud claims being run on the air,\" he tol
]
}
],
"source": [
"import re\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"url = \"https://www.yahoo.com\"\n",
"\n",
"page = requests.get(url, timeout=10)\n",
"soup = BeautifulSoup(page.content, 'html.parser')\n",
"\n",
"# remove the script and style elements\n",
"for script in soup([\"script\", \"style\"]):\n",
"    script.extract()  # detach the element from the tree\n",
"\n",
"# get the text\n",
"text = soup.get_text()\n",
"\n",
"# collapse runs of whitespace into a single space\n",
"text = re.sub(r\"\\s+\", \" \", text)\n",
"\n",
"print(text)"
]
},
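{
"cell_type": "markdown",
"id": "sketch-visible-text",
"metadata": {},
"source": [
"The same cleaning steps can be tried on an inline document, which makes the effect of each step easy to check. In the sketch below the HTML string and the function name `visible_text` are invented for the example. It calls `decompose()` where the cell above calls `extract()`; both remove the element from the tree, and `decompose()` is a reasonable choice when the removed elements are not needed afterwards. The final `strip()` is an extra step that the cell above does not perform."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-visible-text-code",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical page with a style and a script element to be discarded.\n",
"SAMPLE_PAGE = '''\n",
"<html><head><style>p { color: red; }</style></head>\n",
"<body>\n",
"<p>First   paragraph.</p>\n",
"<script>console.log(1);</script>\n",
"<p>Second\n",
"paragraph.</p>\n",
"</body></html>\n",
"'''\n",
"\n",
"\n",
"def visible_text(html):\n",
"    \"\"\"Return the page text without scripts, styles or repeated whitespace.\"\"\"\n",
"    soup = BeautifulSoup(html, 'html.parser')\n",
"    for element in soup(['script', 'style']):\n",
"        element.decompose()  # drop the element and its contents\n",
"    text = soup.get_text()\n",
"    return re.sub(r'\\s+', ' ', text).strip()\n",
"\n",
"\n",
"print(visible_text(SAMPLE_PAGE))"
]
},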
{
"cell_type": "markdown",
"id": "assigned-necessity",
"metadata": {},
"source": [
"### Exercise 3: Write a program that downloads text from the website of the Faculty of Mathematics and Computer Science. Download all of the text from the home page, then find all internal links on that page and download the text from the pages they point to. Do not go any deeper.\n"
]
},
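{
"cell_type": "markdown",
"id": "sketch-internal-links",
"metadata": {},
"source": [
"One possible way to decide whether a link is internal is to resolve it against the address of the page and compare host names. The sketch below does this on an inline HTML string, so it runs without network access. The address `https://example.edu/` and the function name `internal_links` are placeholders, and treating \"same host\" as \"internal\" is a simplifying assumption that may need adjusting for a real site."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-internal-links-code",
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urljoin, urlparse\n",
"\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical fragment with relative, absolute, external and mailto links.\n",
"SAMPLE_LINKS = '''\n",
"<a href=\"/news\">News</a>\n",
"<a href=\"staff.html\">Staff</a>\n",
"<a href=\"https://example.edu/news\">News again</a>\n",
"<a href=\"https://other.example.org/\">External</a>\n",
"<a href=\"mailto:office@example.edu\">Mail</a>\n",
"'''\n",
"\n",
"\n",
"def internal_links(html, base_url):\n",
"    \"\"\"Return absolute http(s) URLs on the same host as base_url, no repeats.\"\"\"\n",
"    base_host = urlparse(base_url).netloc\n",
"    soup = BeautifulSoup(html, 'html.parser')\n",
"    links = []\n",
"    for anchor in soup.find_all('a', href=True):\n",
"        absolute = urljoin(base_url, anchor['href'])\n",
"        parsed = urlparse(absolute)\n",
"        same_host = parsed.netloc == base_host\n",
"        if parsed.scheme in ('http', 'https') and same_host:\n",
"            if absolute not in links:\n",
"                links.append(absolute)\n",
"    return links\n",
"\n",
"\n",
"print(internal_links(SAMPLE_LINKS, 'https://example.edu/'))"
]
},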
{
"cell_type": "code",
"execution_count": null,
"id": "regulation-sheriff",
"metadata": {},
"outputs": [],
"source": [
"def scrape_wmi():\n",
"    # TODO (Exercise 3): text of the home page and of the pages it links to.\n",
"    return []"
]
},
{
"cell_type": "markdown",
"id": "paperback-hello",
"metadata": {},
"source": [
"The techniques discussed above also work very well for dictionary resources."
]
},
{
"cell_type": "markdown",
"id": "after-activity",
"metadata": {},
"source": [
"### Exercise 4: Download as many Albanian words as you can from glosbe.com."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "surgical-ozone",
"metadata": {},
"outputs": [],
"source": [
"def scrape_shqip():\n",
"    # TODO (Exercise 4): collect Albanian words from glosbe.com.\n",
"    return []"
]
}
],
"metadata": {
"author": "Rafał Jaworski",
"email": "rjawor@amu.edu.pl",
"lang": "pl",
"subtitle": "9,10. Web scraping",
"title": "Komputerowe wspomaganie tłumaczenia",
"year": "2021",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}