Computer-Assisted Translation

9, 10. Web scraping [labs]

Rafał Jaworski (2021)

As we know, linguistic resources play a major role in computer-assisted translation and in natural language processing. They include parallel corpora (translation memories), monolingual corpora, and dictionaries. Sometimes these resources are not available for the language we want to work on.

In that case there is still a way out: we can use resources that are publicly available on the Internet. In today's class we will practise techniques for retrieving text from web pages.

The code below downloads the content of a page (as HTML, into a variable) and searches that page for specific elements. Before running it, install the BeautifulSoup module: pip3 install beautifulsoup4

import requests
from bs4 import BeautifulSoup

url = 'https://epoznan.pl'

# Download the page and parse the returned HTML.
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Headlines are <h3> elements with the class "postItem__title".
headers = soup.find_all('h3', {'class': 'postItem__title'})

print('\n'.join([header.get_text() for header in headers]))
Autorska biżuteria od Goldenmark
Głośno na pl Wolności. "Polskojęzyczny rząd i prokuratura plują na polski mundur"
Ostatnia szansa na tak wysoką dotację w Poznaniu. "Od 2025 r. będą one sukcesywnie spadać"
Bardzo niebezpieczna roślina zaczęła kwitnąć. Jest ostrzeżenie!
Miasto z Wielkopolski wśród szesnastu Cudów Polski 2024. "To wielka historia w małym mieście"
Rapid Motocykle - najstarsza firma motocyklowa w Poznaniu zaprasza!
Kierowca Toyoty jechał pod prąd na S11. Zderzył się z motocyklem
Kilka masowych imprez w najbliższych dniach w Poznaniu. ZTM wprowadzi zmiany, na ulicach pojawi się nowa linia tramwajowa
Ekipa popularnego programu pojawiła się w Poznaniu. "Będzie się działo"
Poważny wypadek na krajowej "jedenastce". Kierowcy uwięzieni w busach
Wyjątkowa okazja, by zwiedzić to miejsce z przewodnikiem. "Goście muszą mieć ukończone 7 lat i buty na płaskiej podeszwie"
Najnowsze trendy edukacyjne - żłobek, przedszkole i szkoła w OGRODZIE
Kolejna inwestycja szykuje się na północy Poznania. "Wybraliśmy wykonawcę"
Zatrzaśnięta 2-latka w aucie na parkingu przy galerii handlowej. Wybito szybę
Pijany 19-latek rozbił kamieniem witraże w kościele i zerwał baner wyborczy. Jednej nocy
Korek na autostradowej obwodnicy Poznania. Zderzenie 3 aut
Rower dla nastolatka - wyzwanie dla rodzica
Co dalej ze śmigłowcem uszkodzonym na A2?
Coraz bardziej zaawansowane prace przy budowie Mostów Berdychowskich, kolejne zmiany w organizacji ruchu
W regionie powstaje największy park logistyczny w Polsce. Właśnie kończą przebudowywać drogi w okolicy na koszt inwestora
Dachował busem w rowie, prawdopodobnie zasnął. Poważny wypadek
Pijany wszedł na rusztowanie przy remontowanej wieży kościoła, bo chciał zrobić zdjęcie miasta z góry
Mieszkańcy skarżyli się na hałas generowany przez motocykle, 5 na 6 skontrolowanych z zatrzymanym dowodem rejestracyjnym
KODANO świętuje 20. urodziny! Promocje jakich jeszcze nie było!
Podpisano umowę na budowę kolejnej wielkopolskiej obwodnicy!
Akcja CBŚP pod Poznaniem. Zatrzymano jedną osobę
Sarna utknęła na szkolnym terenie
Na krajowej "jedenastce" bez zmian, wciąż gigantyczne korki przez remont mostu
Są oficjalne wyniki wyborów: 6 mandatów w Wielkopolsce! Jest jedno zaskoczenie
Brzegi Warty połączono ostatnim stalowym elementem Mostów Berdychowskich. Jest w nim kapsuła czasu z aktem erekcyjnym
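
The selector logic can also be tried without any network access by parsing a small HTML string. The sketch below is illustrative only: the markup in SAMPLE_HTML and the helper name extract_titles are made up for this example, and only the postItem__title class name is taken from the code above.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name comes from the example above.
SAMPLE_HTML = """
<div>
  <h3 class="postItem__title"> First headline </h3>
  <h3 class="other">Not a headline</h3>
  <h3 class="postItem__title big">Second headline</h3>
</div>
"""


def extract_titles(html):
    """Return the stripped text of every <h3> carrying the postItem__title class."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches elements whose class list contains the given name,
    # so an element with additional classes is found as well.
    headers = soup.find_all("h3", class_="postItem__title")
    return [header.get_text(strip=True) for header in headers]


titles = extract_titles(SAMPLE_HTML)
print("\n".join(titles))
```

Passing strip=True to get_text() removes the surrounding whitespace that real pages often carry around headline text.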

Exercise 1: Write a function that retrieves product names from Ceneo.pl. The product type, e.g. TV set, washing machine, laptop, is a parameter of the function. It is enough to retrieve data from the first page of search results.

Ceneo.pl does not return the product listing to a plain HTTP request. As the response below shows, the server answers with a Cloudflare Turnstile challenge page that only completes in a browser running JavaScript. The page therefore cannot be scraped with a plain requests.get call; a browser would have to be simulated.

<!DOCTYPE html>

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    
    <script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>

    <style>
    html, body {
        font-size: 13px;
        font-family: Helvetica,Arial,Roboto,sans-serif;
        line-height: 1.5;
        color: #444;
        background-color: #f2f5f7;
        position: relative;
        margin: 0;
        padding: 0;
        height: 100%;
    }

    .footer {
        padding-top: 20px;
        position: absolute;
        z-index: 5;
        background-color: #f2f5f7;
        bottom: 0;
        width: 100%;
    }

    .page-header .header-container {
        position: relative;
        background-color: #f2f5f7;
    }

    .page-header .header-container > .wrapper {
        display: -webkit-box;
        display: -ms-flexbox;
        display: flex;
        height: 70px;
        position: relative;
        align-items: center;
    }

    .main-content {
        padding-bottom: 60px;
    }

    .page-header .header-container .header-logo {
        width: 220px;
    }

    .ceneo-logo {
        height: 53px;
        display: inline-block;
        vertical-align: top;
        position: relative;
        width: 116px;
        margin: 0px 10px;
    }

    .copyright {
        margin-top: 1.5em;
        text-align: center;
        background: rgba(0,0,0,.1);
    }

    .copyright .wrapper {
        padding-top: .75em;
        padding-bottom: .75em;
        position: relative;
    }

    a:active, a:hover {
        outline: 0;
    }

    a:hover {
        color: #0071c5;
    }

    a {
        -webkit-transition: color .3s;
        -moz-transition: color .3s;
        -o-transition: color .3s;
        transition: color .3s;
        text-decoration: none;
        color: #444;
    }

    .layout_container {
        min-height: 100%;
        position: relative;
    }

    .btn {
        margin: 0;
        display: inline-block;
        padding: 5px 14px;
        margin-bottom: 0;
        font-size: 13px;
        line-height: 19px;
        text-align: center;
        border-radius: 3px;
        vertical-align: middle;
        cursor: pointer;
        -webkit-appearance: button;
        font-family: sans-serif;
        color: #fff;
        border: 0;
        border-color: rgba(0,0,0,.1) rgba(0,0,0,.1) rgba(0,0,0,.25);
        background-color: #0071c5;
    }

    .captcha-bot {
        position: relative;
        width: 352px;
        margin: 10px auto;
        text-align: center;
    }

    @media (max-width: 960px) {
        #bot {
            display: none;
        }
    }
    
    .card-form {
        box-shadow: 0 1px 3px rgba(0,0,0,.12), 0 1px 2px rgba(0,0,0,.24);
        padding: 10px 25px!important;
        box-sizing: border-box;
        background-color: #fdfdfd;
    }

    form {
        margin: 0;
    }

    #bot {
        position: absolute;
        right: -72px;
        top: 0;
        content: '';
        width: 90px;
        height: 155px;
    }

    .card-form .col .col-wide {
        width: 100%;
    }
    .captcha-bot .col-wide {
        text-align: right;
        margin: 15px 0;
    }
    .field-validation-error {
        margin-bottom: .375em;
        display: block;
        border: 0;
        color: #e71226
    }

    </style>
</head>
<body>

    <div id="layout_container">
        <header class="page-header" role="banner">
            <div class="header-container">
                <div class="wrapper">
                    <div class="header-logo">
                        <span class="ceneo-logo">
                            <a href="/"><img src="https://www.ceneo.pl/Content/img/icons/logo-ceneo-simple-orange.svg" alt="Ceneo - znajdź, porównaj, kup"></a>
                        </span>
                    </div>
                </div>
            </div>
        </header>

        <div class="main-content">
            <div class="wrapper">
                <div class="card-form captcha-bot">
                    


<script type="text/javascript">
    function successChallengeCallback(token) {
        var form = document.getElementById("turnstileChallenge");
        form.elements["client-side-errors"].value = "";
        form.submit();
    }
    function errorChallengeCallback(errorCode) {
        var form = document.getElementById("turnstileChallenge");
        form.elements["client-side-errors"].value = errorCode;
    }
</script>

<form id="turnstileChallenge" method="post" action="/Captcha/Add">
    <input hidden type="text" id="ReturnUrl" name="ReturnUrl" value="%2fTelewizory" />
    <input hidden type="number" id="CookieSeconds" name="CookieSeconds" value="172800" />
    <input hidden type="text" id="CaptchaServiceProvider" name="CaptchaServiceProvider" value="Cloudflare" />
    <input type="hidden" name="client-side-errors"/>
    <div class="cf-turnstile" data-sitekey="0x4AAAAAAAUXnkfCYJAjoWMy" data-callback="successChallengeCallback" data-error-callback="errorChallengeCallback" data-language="pl-pl"></div>
    <div class="col col-wide">
        <button type="submit" class="btn btn-info">Przejdź dalej</button>
    </div>
</form>


                    <svg version="1.1" id="bot" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" viewBox="0 0 77 151.6" style="enable-background: new 0 0 77 151.6;" xml:space="preserve"><style type="text/css">.st0 {fill: #B55B28;}.st1 {fill: #F6921E;}.st2 {fill: #F1F1F2;}.st3 {fill: #1D427B;}.st4 {fill: #FFFFFF;}.st5 {fill: #BE222F;}.st6 {fill: #679B41;}.st7 {fill: #C43261;}.st8 {fill: none;stroke: #1D427B;stroke-width: 3;stroke-miterlimit: 10;}</style><g><rect x="14.3" y="60.1" transform="matrix(0.863 0.5053 -0.5053 0.863 35.5865 -4.3319)" class="st0" width="23" height="6.7"/><path class="st0" d="M66.9,94.6l-33.5,53.6c-0.8,1.3-2.4,1.7-3.7,0.9l-17.2-10.1V59.5L66,90.8C67.3,91.6,67.7,93.3,66.9,94.6z"/><path class="st1" d="M50.5,85.1L17,138.6c-0.8,1.3-2.4,1.7-3.7,0.9l-0.8-0.5V59.5l37.1,21.7C50.9,82,51.3,83.7,50.5,85.1z"/><path class="st2" d="M46.6,85.8L15.8,135c-0.5,0.9-1.5,1.4-2.4,1.4c-0.3,0-0.6,0-0.9-0.2v-0.8c1,0.5,2.2,0.1,2.7-0.8L46,85.4c0.3-0.5,0.4-1.1,0.3-1.7c-0.1-0.6-0.5-1-1-1.3L12.5,63.2v-0.9l33.2,19.4c0.7,0.4,1.2,1,1.3,1.8C47.2,84.3,47,85.1,46.6,85.8z"/><path class="st3" d="M40.8,86.1l-17,25.9c-0.3,0.5-1,0.8-1.7,0.8c-0.6,0-1.3-0.2-1.9-0.6l-7.8-4.6V67l26.9,15.7c0.9,0.5,1.5,1.3,1.7,2C41.1,85.3,41,85.7,40.8,86.1z"/><path class="st1" d="M37.9,84.4l-16.9,26c-0.4,0.6-1.5,0.6-2.5,0l-6-3.5v-39l24.3,14.2C37.8,82.7,38.3,83.7,37.9,84.4z"/><path class="st4" d="M18.7,78c0,0-0.3,0.2-0.9,2.3c-0.5,1.7,0.1,4.5-1.5,7.8c-0.7,1.5-1.5,2.3-2.4,3c-0.1-0.1-0.1-0.2-0.3-0.2c-0.4-0.2-0.8-0.1-1.1,0.2v-2c0.3-0.3,0.6-0.8,0.9-1.2c0.7-1.3,1-2.4,0.9-3.3c-0.1-0.9-0.7-1.7-1.6-2.2c-0.1,0-0.1-0.1-0.2-0.1v-7.3L18.7,78z"/><path class="st4" d="M12.7,92.4c0.4,0.2,1,0.1,1.2-0.3c0.2-0.3,0.1-0.7-0.1-1c-0.5,0.4-0.9,0.7-1.4,1.1C12.5,92.3,12.6,92.3,12.7,92.4z"/><path class="st4" 
d="M19.8,90.7c-0.7-0.4-1.4-0.1-1.8,0c0.2-0.4,0.4-0.8,0.3-0.8c0,0-0.1,0-0.1,0l-0.8-0.1c-0.1,0-0.1,0-0.2,0c-0.1,0.1-0.1,0.3-0.2,0.5c-0.1,0.1-0.1,0.3-0.3,0.4l-3.6,3.9c0,0-0.1,0.1-0.1,0.2c0,0,0,0.1,0,0.1c0,0,0.1,0,0.1,0l0.8,0.3c0.1,0,0.2,0,0.2-0.1l1.2-1.3c0.1,0.2,0.2,0.4,0.6,0.6c1.4,0.8,3.2-0.6,3.9-1.9C20.5,91.6,20.4,91,19.8,90.7z M16.5,93.5c-0.3-0.2-0.4-0.4-0.4-0.5l1.1-1.2c0.5-0.5,1.2-1,1.7-0.7c0.3,0.2,0.3,0.4,0.1,0.8C18.4,92.9,17.3,94,16.5,93.5z"/><path class="st4" d="M24.2,91.2c0,0,0.1-0.1,0.1-0.2c0-0.1,0-0.1,0-0.1c0,0-0.1,0-0.1,0l-0.8-0.3c-0.1,0-0.2,0-0.2,0.1l-3.4,3.7c-0.2,0.3-0.4,0.5-0.6,0.8c-0.3,0.5-0.1,0.9,0.4,1.2c0.3,0.2,0.9,0.4,1,0.3c0,0,0-0.1,0-0.1l0.3-0.6c0-0.1,0-0.1,0-0.1c0,0-0.2,0-0.3-0.1c-0.1-0.1-0.2-0.2-0.1-0.3c0.1-0.1,0.2-0.3,0.4-0.5L24.2,91.2z"/><path class="st4" d="M19,118.4l-5.9,10c-0.2,0.3-0.4,0.5-0.7,0.7v-16.6l5.9,3.5C19.2,116.4,19.5,117.5,19,118.4z"/><path class="st0" d="M20.8,145.6l-3,5.2c-0.5,0.8-1.6,1.1-2.4,0.6l-2.9-1.7v-11.1l7.8,4.5C21.1,143.6,21.3,144.7,20.8,145.6z"/><path class="st5" d="M50.3,23.2l-7.2-3.4c-1.5-0.7-2.2-2.5-1.5-4.1l5.2-11.2c0.7-1.5,2.5-2.2,4.1-1.5l7.2,3.4c1.5,0.7,2.2,2.5,1.5,4.1l-5.2,11.2C53.7,23.2,51.9,23.9,50.3,23.2z"/><path class="st0" d="M76.5,30.9L57.4,72c-1.2,2.6-4.3,3.7-6.9,2.5l-38-17.7V36.7L26.1,7.4c1.2-2.6,4.3-3.7,6.9-2.5l41,19.1C76.6,25.3,77.7,28.3,76.5,30.9z"/><path class="st1" d="M67,26.5l-19.1,41c-1.2,2.6-4.3,3.7-6.9,2.5L12.5,56.7V11.9L16.6,3c1.2-2.6,4.3-3.7,6.9-2.5l41,19.1C67.1,20.8,68.2,23.9,67,26.5z"/><ellipse transform="matrix(0.4228 -0.9062 0.9062 0.4228 -7.4003 89.152)" class="st3" cx="66.3" cy="50.4" rx="9.1" ry="6.5"/><path class="st1" d="M69.7,52c0.7-1.5,3.3-2.1,3.3-2.1s-1-5.7-2.9-6.6c-2.6-1.2-6.2,1.1-8.1,5.2c-1.9,4.1-1.3,8.3,1.3,9.5c1.9,0.9,6.9-2,6.9-2S69,53.4,69.7,52z"/><path class="st3" d="M73,49.8c-1.3-0.2-2.7,0.7-3.3,2.1c-0.7,1.5-0.4,3.1,0.5,3.9c0.8-0.8,1.5-1.7,2-2.8C72.7,52,72.9,50.9,73,49.8z"/><rect x="42" y="12" transform="matrix(0.9062 0.4228 -0.4228 0.9062 10.1025 -20.3141)" class="st3" 
width="17.8" height="1.1"/><path class="st0" d="M27,36.5l-8.4-3.9c-0.3-0.1-0.4-0.5-0.3-0.8l7.6-16.2c0.1-0.3,0.5-0.4,0.8-0.3l8.4,3.9c0.3,0.1,0.4,0.5,0.3,0.8l-7.6,16.2C27.7,36.5,27.3,36.6,27,36.5z"/><path class="st0" d="M43.7,44.3l-8.4-3.9c-0.3-0.1-0.4-0.5-0.3-0.8l7.6-16.2c0.1-0.3,0.5-0.4,0.8-0.3l8.4,3.9c0.3,0.1,0.4,0.5,0.3,0.8L44.5,44C44.4,44.3,44,44.4,43.7,44.3z"/><path class="st3" d="M25.6,35.8l-8.4-3.9c-0.3-0.1-0.4-0.5-0.3-0.8l7.6-16.2c0.1-0.3,0.5-0.4,0.8-0.3l8.4,3.9c0.3,0.1,0.4,0.5,0.3,0.8l-7.6,16.2C26.2,35.8,25.9,35.9,25.6,35.8z"/><path class="st3" d="M42.3,43.6l-8.4-3.9c-0.3-0.1-0.4-0.5-0.3-0.8l7.6-16.2c0.1-0.3,0.5-0.4,0.8-0.3l8.4,3.9c0.3,0.1,0.4,0.5,0.3,0.8l-7.6,16.2C42.9,43.6,42.6,43.7,42.3,43.6z"/><rect x="32.6" y="19.1" transform="matrix(0.9062 0.4228 -0.4228 0.9062 15.2188 -11.2479)" class="st6" width="0.8" height="19.1"/><rect x="49.3" y="26.9" transform="matrix(0.9062 0.4228 -0.4228 0.9062 20.1 -17.6055)" class="st5" width="0.8" height="19.1"/><rect x="26" y="16.4" transform="matrix(0.9062 0.4228 -0.4228 0.9062 10.0607 -10.6522)" class="st7" width="6.2" height="1.9"/><rect x="42.7" y="24.2" transform="matrix(0.9062 0.4228 -0.4228 0.9062 14.9233 -16.9857)" class="st7" width="6.2" height="1.9"/><path class="st8" d="M36.2,56.9c-3.9-0.6-7.9-2-11.7-4c-3.8-2-7.1-4.6-9.9-7.4"/><path class="st4" 
d="M61.9,112.1c2.6-0.8,4.5-4.7,4.7-9.6c0-0.5,0-0.9,0-1.3c-0.1-2.1-0.5-4-1.3-5.5c-0.7-1.5-1.7-2.5-2.9-3c-0.9-0.7-1.8-1.1-2.8-1.3c-1-1.2-2.2-1.8-3.6-1.9c-3.9-0.1-7.1,5.4-7.3,12.5c-0.1,3.3,0.5,6.5,1.7,8.8c0.6,1.1,1.2,2,2,2.7l-0.8,14.7l-16.9,1.4c-0.1-0.3-0.2-0.6-0.3-0.8c-0.6-1.2-1.3-2-2.2-2.4c-1-0.8-2.2-1.3-3.5-1.4c0,0-0.1,0-0.1,0c-1.6,0-3.2,0.8-4.4,2.1l-7.4-5.5c0,0,0,0,0,0c-0.1,0-0.1-0.1-0.2-0.1c0,0-0.1,0-0.1,0c0,0-0.1,0-0.1,0l-7.5-0.2c-0.2,0-0.4,0.1-0.5,0.2l-7.7,7.1c0,0,0,0,0,0c0,0,0,0,0,0c0,0,0,0,0,0c-0.1,0.1-0.1,0.1-0.1,0.2c0,0,0,0,0,0.1c0,0.1,0,0.1,0,0.2c0,0,0,0,0,0l0,1.4c0,0.4,0.3,0.8,0.7,0.8l9.3,0.2c0,0,0,0,0,0c0.1,0,0.2,0,0.2,0c0,0,0,0,0,0c0.1,0,0.1-0.1,0.2-0.1c0,0,0,0,0,0l3.6-2.7l1.9,1.3l0,1c0,0,0,0,0,0c-0.1,0.1-0.2,0.3-0.2,0.5l-0.2,6.8c0,0.1,0,0.3,0.1,0.4l-1.8,1.1l-3.4-2.9c0,0,0,0,0,0c0,0-0.1-0.1-0.1-0.1c0,0,0,0-0.1,0c0,0-0.1,0-0.1,0c0,0-0.1,0-0.1,0c0,0,0,0,0,0l-9.3-0.2c-0.2,0-0.4,0.1-0.5,0.2C0.1,136.8,0,137,0,137.2l0,1.4c0,0,0,0,0,0c0,0.1,0,0.1,0,0.2c0,0,0,0,0,0.1c0,0.1,0.1,0.1,0.1,0.2c0,0,0,0,0,0c0,0,0,0,0,0c0,0,0,0,0,0l7.3,7.5c0.1,0.1,0.3,0.2,0.5,0.2l7.5,0.2c0,0,0,0,0,0c0,0,0,0,0,0c0,0,0,0,0,0c0.1,0,0.1,0,0.2,0c0.1,0,0.1-0.1,0.2-0.1c0,0,0,0,0,0l7.7-5.1c1.2,1.4,2.7,2.2,4.4,2.3c0,0,0.1,0,0.1,0c1.2,0,2.4-0.4,3.4-1.2c0.9-0.4,1.7-1.2,2.3-2.3c0.1-0.2,0.2-0.5,0.3-0.7l18-1.6c1.1,2.1,2.7,3.5,4.6,3.5c0,0,0.1,0,0.1,0c2.2,0,4.2-1.8,5.3-4.5c0-0.1,0.1-0.1,0.1-0.2c0-0.1,0.1-0.2,0.1-0.4c0.1-0.2,0.1-0.3,0.2-0.5c0-0.1,0-0.1,0.1-0.2c0.1-0.2,0.1-0.4,0.2-0.6c0-0.1,0-0.1,0-0.2c0-0.2,0.1-0.4,0.1-0.7c0-0.1,0-0.2,0-0.2c0-0.2,0.1-0.4,0.1-0.7c0-0.1,0-0.2,0-0.2c0-0.3,0-0.6,0.1-0.9c0.1-2-0.3-3.7-0.8-5.2c0,0,0,0,0-0.1c-0.1-0.4-0.3-0.7-0.5-1c-0.1-0.1-0.1-0.3-0.2-0.4c-0.2-0.3-0.4-0.6-0.6-0.9l0.3-12.5C61.7,112.2,61.8,112.1,61.9,112.1z"/><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -47.982 155.167)" class="st3" cx="55.7" cy="102.2" rx="11.9" ry="6.3"/><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -45.7188 157.6167)" class="st1" cx="58.1" 
cy="102.3" rx="10.3" ry="7.8"/></g><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -76.4925 185.9519)" class="st1" cx="57.2" cy="132.2" rx="8.8" ry="5.4"/></g><rect x="38.9" y="123.8" transform="matrix(8.719753e-02 0.9962 -0.9962 8.719753e-02 172.9182 79.5027)" class="st1" width="8.4" height="20.6"/><polygon class="st3" points="33.8,131.5 31.4,130.7 52.3,129 54.8,129.5 "/><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -106.9022 159.3979)" class="st1" cx="28.4" cy="134.6" rx="8.8" ry="6.3"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -104.7693 161.7066)" class="st3" cx="30.6" cy="134.6" rx="7.8" ry="4.1"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -104.7693 161.7066)" class="st1" cx="30.6" cy="134.6" rx="6.6" ry="3.4"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -104.7693 161.7066)" class="st3" cx="30.6" cy="134.6" rx="5.3" ry="2.8"/></g><polygon class="st3" points="53.2,110.4 52.3,129 54.8,129.5 57.6,109.4 "/><polygon class="st1" points="57.6,109.4 60.8,111.5 60.5,125.1 55.1,127.1 "/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -75.2578 187.2884)" class="st3" cx="58.5" cy="132.3" rx="7.8" ry="4.1"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -75.2578 187.2884)" class="st1" cx="58.5" cy="132.3" rx="6.6" ry="3.4"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -75.2578 187.2884)" class="st3" cx="58.5" cy="132.3" rx="5.3" ry="2.8"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -42.8812 160.6881)" class="st3" cx="61.1" cy="102.4" rx="9.1" ry="4.8"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -42.8812 160.6881)" class="st1" cx="61.1" cy="102.4" rx="7.7" ry="4"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -42.8812 160.6881)" class="st3" cx="61.1" cy="102.4" rx="6.2" ry="3.3"/><g><polygon class="st1" points="25.2,132.5 
25.3,128.9 16.2,122.2 8.5,129.4 10.2,130.7 17,125.6 22.8,130.2 22.5,133.5 25.2,133.6 "/><polygon class="st0" points="10.2,130.7 1,130.5 1,129.1 8.5,129.4 "/><polygon class="st3" points="1,129.1 8.7,122 16.2,122.2 8.5,129.4 "/><polygon class="st3" points="22.5,133.5 16.8,133.3 16.9,129.6 14.2,127.7 17,125.6 22.8,130.2 "/></g><g><polygon class="st1" points="25.1,136.5 25,140.1 15.6,146.3 8.3,138.7 10.1,137.5 16.5,142.9 22.6,138.7 22.5,135.4 25.1,135.4 "/><polygon class="st0" points="10.1,137.5 0.8,137.2 0.7,138.7 8.3,138.7 "/><polygon class="st3" points="0.7,138.7 8.1,146.1 15.6,146.3 8.3,138.7 "/><polygon class="st3" points="22.5,135.4 16.7,135.2 16.6,139 13.9,140.7 16.5,142.9 22.6,138.7 "/></g><rect x="16.5" y="131.6" transform="matrix(0.9997 2.636193e-02 -2.636193e-02 0.9997 3.5652 -0.4773)" class="st1" width="6.7" height="6.8"/><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -112.6895 154.3782)" class="st1" cx="22.9" cy="135" rx="3.4" ry="2.5"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -111.8626 155.2733)" class="st3" cx="23.8" cy="135.1" rx="3" ry="1.6"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -111.8626 155.2733)" class="st1" cx="23.8" cy="135.1" rx="2.5" ry="1.3"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -111.8626 155.2733)" class="st3" cx="23.8" cy="135.1" rx="2.1" ry="1.1"/></g><g><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -118.7584 147.8609)" class="st1" cx="16.5" cy="134.9" rx="3.4" ry="2.5"/><ellipse transform="matrix(2.636193e-02 -0.9997 0.9997 2.636193e-02 -117.9314 148.756)" class="st1" cx="17.4" cy="134.9" rx="2.5" ry="1.3"/></g></g></g>
                    </svg>
                </div>
            </div>
        </div>

        <footer class="footer">

            <div class="copyright">
                <div class="wrapper">
                    © 2005-2024 Ceneo.pl sp. z o.o. Korzystanie z serwisu oznacza akceptację <a href="http://info.ceneo.pl/regulamin" target="_blank" title="Regulamin Ceneo.pl">regulaminu</a>
                </div>
            </div>

        </footer>
    </div>

</body>

def get_names(article_type):
    """Return product names from the first page of Ceneo results for a category."""
    # Spaces in the category name are replaced with "+" to form the URL path.
    product_type_url = article_type.replace(' ', '+')
    url = f'https://www.ceneo.pl/{product_type_url}'
    print(url)

    # A browser-like User-Agent; as the output shows, it does not get past the challenge page.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print("Failed to retrieve the page. Status code:", response.status_code)
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    products = soup.find_all('div', class_='cat-prod-row__body')
    print(products)
    product_names = []
    for product in products:
        link = product.find('a', class_='go-to-product')
        if link is not None:  # skip rows without a product link
            product_names.append(link.get_text(strip=True))

    return product_names

product_type = "Telewizory"
names = get_names(product_type)
print(names)
https://www.ceneo.pl/Telewizory
[]
[]
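
The empty lists above are consistent with the server returning the challenge page instead of the listing. A minimal sketch of how such a response could be recognised before parsing is given below. The helper name is_challenge_page is hypothetical, and the two markers it looks for (the turnstileChallenge form id and the cf-turnstile class) are simply the ones visible in the HTML shown earlier; the site may change them at any time.

```python
from bs4 import BeautifulSoup


def is_challenge_page(html):
    """Heuristic: True if the HTML looks like the CAPTCHA challenge page shown earlier."""
    soup = BeautifulSoup(html, "html.parser")
    # Either marker is enough: the challenge form id or the Turnstile widget class.
    has_form = soup.find("form", id="turnstileChallenge") is not None
    has_widget = soup.find("div", class_="cf-turnstile") is not None
    return has_form or has_widget


# Two tiny made-up documents: one imitating the challenge, one imitating a listing.
challenge = '<form id="turnstileChallenge"><div class="cf-turnstile"></div></form>'
listing = '<div class="cat-prod-row__body"><a class="go-to-product">TV</a></div>'

print(is_challenge_page(challenge))
print(is_challenge_page(listing))
```

A check like this makes it possible to tell "the selector found nothing" apart from "the request was blocked", which a status code of 200 alone does not reveal.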

This is how we retrieve data from a single page. Nothing prevents us, however, from simulating page switching.

Exercise 2: Observe how the URL changes as you move through successive pages of search results on Ceneo.pl. Use this information to run get_names() on more than one page of results.

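def scrape_names(name, pages):
    """Run get_names() on the first `pages` result pages; returns one list per page."""
    # The first page has no suffix; page i (counting from 0) ends with ";0020-30-0-0-{i}.htm".
    return [get_names(name if i == 0 else f'{name};0020-30-0-0-{i}.htm') for i in range(pages)]

names_from_pages = scrape_names(product_type, 10)
print(names_from_pages)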
https://www.ceneo.pl/Telewizory
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-1.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-2.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-3.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-4.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-5.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-6.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-7.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-8.htm
[]
https://www.ceneo.pl/Telewizory;0020-30-0-0-9.htm
[]
[[], [], [], [], [], [], [], [], [], []]
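
scrape_names() returns one list per page, so the result is nested. The sketch below shows one possible way to merge such per-page lists into a single list without duplicates. The helper name flatten_unique and the sample data are hypothetical and stand in for real results, which are empty here because of the challenge page.

```python
from itertools import chain


def flatten_unique(pages):
    """Merge per-page lists into one list, dropping duplicates but keeping order."""
    merged = chain.from_iterable(pages)
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return list(dict.fromkeys(merged))


# Hypothetical per-page results, standing in for the output of scrape_names().
sample_pages = [["TV A", "TV B"], [], ["TV B", "TV C"]]
print(flatten_unique(sample_pages))
```

Using dict.fromkeys instead of set() keeps the names in the order in which they appeared on the pages.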

Retrieving content from the Internet is a particularly effective way of obtaining large amounts of text. The code fragment below downloads all the text of a page.

import re

url = "https://www.yahoo.com"

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# remove script and style elements
for script in soup(["script", "style"]):
    script.extract()    # remove the element from the tree

# get the text
text = soup.get_text()

# collapse runs of whitespace into a single space
text = re.sub(r"\s+", " ", text)

print(text)
 Yahoo Make Yahoo Your HomepageDiscover something new every day from News, Sports, Finance, Entertainment and more! HOME MAIL NEWS FINANCE SPORTS ENTERTAINMENT LIFE SHOPPING YAHOO PLUS MORE... Download the Yahoo Home app Yahoo Home Search query Sign in Mail Sign in to view your mail Mail Mail COVID-19 COVID-19 News News Finance Finance Sports Sports Entertainment Entertainment Life Life Shopping Shopping Yahoo Plus Yahoo Plus More... More... Follow live:Closing arguments begin for Derek Chauvin's murder trial in the death of George Floyd 5 people in hospital after shooting in Louisiana One victim was shot in the head, and another suffered multiple gunshot wounds, according to local news outlet.Multiple police units dispatched to scene »2 dead in crash of Tesla with 'no one' drivingMall shooter, 16, faces 1st-degree murder charge'80s pop star rips 'Simpsons' for 'hateful' parodyConspiracy theorist Alex Jones faces a reckoningPig's head left at former home of Chauvin trial witness U.S.HuffPostFirst-Ever Wild Wolf Collar Camera Shows What They Really Do All Day LongThis canine's favorite meal might surprise you. Thanks for your feedback! CelebrityThe TelegraphRobert De Niro unable to turn down acting roles because of his estranged wife's expensive lifestyleHollywood legend Robert De Niro is unable to turn down acting roles because he must pay for his estranged wife's expensive tastes, the actor's lawyer has claimed. Caroline Krauss told a Manhattan court that he is struggling financially because of the pandemic, a massive tax bill and the demands of Grace Hightower, who filed for divorce in 2018 after 21 years of marriage. The court has been asked to settle how much De Niro should pay Ms Hightower, 66, until the terms of the prenuptial agreement the couple negotiated in 2004 takes effect. “Mr De Niro is 77 years old, and while he loves his craft, he should not be forced to work at this prodigious pace because he has to,” Ms Krauss told the court. “When does that stop? 
When does he get the opportunity to not take every project that comes along and not work six-day weeks, 12-hour days so he can keep pace with Ms Hightowers thirst for Stella McCartney?” Thanks for your feedback! U.S.Associated PressCouple: Man has tossed used cups in their yard for 3 yearsAn upstate New York couple may have finally solved the mystery of who's been tossing used coffee cups in their front yard for nearly three years. Edward and Cheryl Patton told The Buffalo News they tried mounting a camera in a tree in front of their home in Lake View to catch the phantom litterer. After Edward Patton called police, they waited and pulled over a vehicle driven by 76-year-old Larry Pope, who Cheryl Patton said had once worked with her and had had disagreements with her over union issues. Thanks for your feedback! U.S.INSIDERA leading conspiracy theorist who thought COVID-19 was a hoax died from the virus after hosting illegal house partiesA high-profile conspiracy theorist from Norway, who shared false information about the pandemic online, has died from COVID-19, officials say. Thanks for your feedback! PoliticsThe WeekOne America News Network producer says 'majority' of employees didn't believe reports on voter fraud claimsMarty Golingan, a producer at One America News Network, a right-wing cable news channel often noted for its affinity for former President Donald Trump, told The New York Times he was worried his work may have helped inspire the Jan. 6 Capitol riot. At one point during the incident, Golingan said he caught sight of someone in the mob holding a flag with OAN's logo. "I was like, OK, that's not good. That's what happens when people listen to us," he told the Times, referring to OAN's coverage of the 2020 presidential election, which often gave credence to Trump's unfounded claims of widespread voter fraud and Democratic conspiracies. Golingan said that many of his colleagues, including himself, disagreed with the coverage. 
"The majority of people did not believe the voter fraud claims being run on the air," he told the Times. Indeed, the Times interviewed 18 current and former OAN employees, 16 of whom said the channel has "broadcast reports that they considered misleading, inaccurate, or untrue." But Allysia Britton, a former producer and one of more than a dozen employees to leave OAN in the wake of the riot, explained that while "many people have raised concerns ... when people speak up about anything, you will get in trouble." Read more at The New York Times. More stories from theweek.comThe new HBO show you won't be able to stop watchingDonald Trump's most dangerous political legacyTrump's NSA general counsel Michael Ellis resigns, never having taken office Thanks for your feedback! BusinessMoneyWiseFourth stimulus check update: Biden faces mounting pressure for new paymentAdvocates and lawmakers say the crisis isn't over, and neither is the need for relief. Thanks for your feedback! CelebrityThe TelegraphLand Rover driver at Prince Philip's funeral spent week ensuring he could drive at correct speedHuffPostPrince Philip's Funeral, In PhotosUSA TODAY EntertainmentWhy did Prince Philip's Land Rover carry his casket? The story behind the strange hearse Thanks for your feedback! Trending Now1. Gianna Hammer2. Derek Chauvin3. Black Rob4. 2021 Acm Awards5. Baby Shower Invitations6. Amanda Broderick7. Mortgage Refinance Calculator8. Interest Rates Today9. Tesla Crash10. Mars Helicopter Yahoo! Mail WeatherWeatherGreater PolandView your LocationsRemove from favorite locationsDetect my locationEnter City or ZipcodeManage LocationsToday66°45°TueRain today with a high of 59 °F (15.0 °C) and a low of 41 °F (5.0 °C). There is a 50% chance of precipitation.59°41°WedPartly cloudy today with a high of 57 °F (13.9 °C) and a low of 41 °F (5.0 °C).57°41°ThuScattered showers today with a high of 48 °F (8.9 °C) and a low of 37 °F (2.8 °C). 
There is a 35% chance of precipitation.48°37°See More » ScoreboardChange Sports to display different scoresNBA NFL MLB NHL NCAAB NCAAF Trending YesterdayTodayTomorrowPortland Charlotte 101109FinalSacramento Dallas 121107FinalMinnesota LA Clippers 105124FinalMore scores » HoroscopeChange your horoscope signAriesTaurusGeminiCancerLeoVirgoLibraScorpioSagittariusCapricornAquariusPiscesApril 19 -Aries - You're feeling the heat, and you may find that your friends like it as much as you do! Your great energy is perfect for almost any activity, so light up the night and have a great time! See more » Yahoo! Mail Yahoo! Sports Terms (Updated)Privacy (Updated)AdvertiseAbout Our AdsCareersHelpFeedback Close this content, you can also use the Escape key at anytime 
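
In the output above, text from neighbouring elements is often glued together (for example "HomepageDiscover"), because get_text() joins text nodes with an empty string by default. A small sketch of one way around this, using the separator argument of get_text(), follows. The sample markup and the helper name page_text are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup imitating two adjacent elements plus a script to be removed.
SAMPLE_HTML = (
    "<div><h2>Make Yahoo Your Homepage</h2>"
    "<p>Discover something new</p>"
    "<script>var x = 1;</script></div>"
)


def page_text(html, separator=" "):
    """Return the visible text of an HTML string with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup(["script", "style"]):
        element.extract()  # drop code and styling so they do not pollute the text
    # The separator is inserted between the text of neighbouring nodes.
    text = soup.get_text(separator=separator)
    return re.sub(r"\s+", " ", text).strip()


glued = page_text(SAMPLE_HTML, separator="")
spaced = page_text(SAMPLE_HTML, separator=" ")
print(glued)
print(spaced)
```

For corpus building the spaced variant is usually the safer choice, since glued words are hard to repair afterwards.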

Exercise 3: Write a program that downloads text from the website of the Faculty of Mathematics and Computer Science. Download all the text of the home page, then find all internal links on that page and download the text of the pages they point to. Do not go any deeper.

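import re
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def get_page_text(url):
    """Return the visible text of a page with whitespace collapsed."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Drop script and style elements so their contents do not end up in the text.
    for element in soup(["script", "style"]):
        element.extract()

    text = soup.get_text()
    return re.sub(r"\s+", " ", text)  # collapse runs of whitespace


def get_internal_links(url, domain):
    """Return absolute URLs of all links on the page that stay within the domain."""
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    links = set()
    for anchor in soup.find_all('a', href=True):
        # urljoin resolves relative hrefs (e.g. "/path") against the page URL.
        absolute = urljoin(url, anchor['href']).split('#')[0]
        parsed = urlparse(absolute)
        if parsed.scheme in ('http', 'https') and parsed.netloc == domain:
            links.add(absolute)
    return sorted(links)


def scrape_wmi():
    base_url = "https://wmi.amu.edu.pl/"

    main_text = get_page_text(base_url)
    print("Text of the home page:\n", main_text[:1000], "...")
    internal_links = get_internal_links(base_url, "wmi.amu.edu.pl")

    # One level deep only: the linked pages are not searched for further links.
    return [get_page_text(link) for link in internal_links]


scrape_wmi()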
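
Deciding which links are internal is the delicate part of this exercise, because href values come in several forms: site-relative, page-relative, absolute, with fragments, or with a non-HTTP scheme. The sketch below, which uses only the standard library, shows how urljoin and urlparse treat each form. The helper name internal_links and all sample paths are hypothetical and not taken from the actual site.

```python
from urllib.parse import urljoin, urlparse


def internal_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep unique http(s) links on the same host."""
    base_host = urlparse(base_url).netloc
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)   # turns relative hrefs into full URLs
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            clean = absolute.split("#")[0]   # a fragment points into the same page
            if clean not in result:
                result.append(clean)
    return result


# Hypothetical href values covering the common cases.
sample_hrefs = [
    "/studia",                                   # site-relative
    "aktualnosci",                               # page-relative
    "https://wmi.amu.edu.pl/studia#rekrutacja",  # absolute, same page as the first
    "https://amu.edu.pl/",                       # different host
    "mailto:someone@example.com",                # not a web page
]
resolved = internal_links(sample_hrefs, "https://wmi.amu.edu.pl/")
print(resolved)
```

Comparing hosts after resolution avoids both missing relative links and accepting external URLs that merely contain the domain name somewhere in their text.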

The techniques discussed above also work very well for dictionary resources.

Exercise 4: Download as many Albanian words as possible from glosbe.com.

def get_page_words(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Collect the text of every link that points to an Albanian-Polish entry.
    words = []
    prefix = '/sq/pl'
    for word in soup.find_all('a', href=True):
        if word['href'].startswith(prefix):
            text = word.get_text().replace(prefix, '').replace('\n', '')
            if len(text) > 0:
                words.append(text)

    return words

def scrape_shqip():
    base_url = "https://glosbe.com/pl/sq/"

    # Get the words from the main page
    words = get_page_words(base_url)
    print("Words from the main page:\n", words)
    # Remove duplicates
    words = list(set(words))
    print("All words:", words)


scrape_shqip()
Words from the main page:
 ['tungjatjeta', 'lakmitar', 'lakmitar', 'zbret', 'naten e mirë', 'bir kurvë']
All words: ['lakmitar', 'naten e mirë', 'zbret', 'tungjatjeta', 'bir kurvë']
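
The main page alone yields only a handful of words. One way to get closer to "as many as possible" is to treat every word found as a seed and visit its entry page in turn. The sketch below shows only the traversal logic, under stated assumptions: collect_words and fetch_related are hypothetical names, and the fetching step is passed in as a function, so the example runs on a made-up in-memory "site" rather than on glosbe.com. In real use, fetch_related would wrap a function such as get_page_words() with a per-word URL, whose exact form would have to be checked on the site.

```python
from collections import deque


def collect_words(seed_words, fetch_related, limit=1000):
    """Breadth-first collection: process each word once and queue the words it leads to."""
    seen = set()
    ordered = []
    queue = deque(seed_words)
    while queue and len(ordered) < limit:
        word = queue.popleft()
        if word in seen:
            continue                      # already processed, do not fetch again
        seen.add(word)
        ordered.append(word)
        queue.extend(fetch_related(word)) # words found on this word's page
    return ordered


# Made-up link structure standing in for real entry pages.
FAKE_SITE = {
    "tungjatjeta": ["zbret", "lakmitar"],
    "zbret": ["lakmitar", "tungjatjeta"],
    "lakmitar": [],
}

words = collect_words(["tungjatjeta"], lambda w: FAKE_SITE.get(w, []))
print(words)
```

The limit parameter caps the number of collected words, which keeps the number of requests bounded when the function is pointed at a real site.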