{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Information Extraction </h1>\n",
"<h2> 1. <i>Search engines: an introduction</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Search engines: an introduction\n",
"\n",
"## Information retrieval systems\n",
"\n",
"![Information retrieval system](system-wyszukiwania-informacji.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"## Search engines\n",
"\n",
"![Search engines](wyszukiwarka-internetowa.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I want to build my own web search engine...\n",
"\n",
"1. Where do I get URLs from?\n",
"2. How do I download the files at those addresses?\n",
"3. How do I extract text from them?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ... or maybe not download anything at all?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The CommonCrawl corpus\n",
"\n",
"https://commoncrawl.org/the-data/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!-- スマホ用 --\n",
"<!-- \n",
"<!--table width='750' border='0' align='center' cellpadding='0' cellspacing='0'\n",
"<!--a href='index.phtml?CHANNEL=R51&FID=389924'\n",
"<!-- mail: \n",
"<!-- beige_lavender-3c --\n",
"<!--\n",
"<!-- Template Design By BeigeHeart_Chako_ http://beigeheart.blog9.fc2.com/ --\n",
"<!-- 関連記事_ http://beigeheart.blog9.fc2.com/blog-entry-99.html --\n",
"<!-- 利用規約_ http://beigeheart.blog9.fc2.com/blog-entry-103.html --\n",
"<!-- テンプレの再配布、営利目的の利用禁止 --\n",
"<!-- 画像の無断転載・再配布禁止 --\n",
"<!-- アダルト・法律違反サイト、使用不可 --\n",
"<!-- アクセス解析タグはここから --\n",
"<!-- アクセス解析タグはここまで --\n",
"<!--▼▼▼メインカラムカラム+右サイドカラム部分--\n",
"<!--▼ヘッダー--\n",
"<!--▼管理ページリンク--\n",
"<!--▲管理ページリンク--\n",
"<!--▼タイトル--\n"
]
}
],
"source": [
"# Directly from the service\n",
"\n",
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a '<!--[^\\[\\]<>]+' | uniq | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plain-text \"extracts\" are also available; see http://data.statmt.org/ngrams/raw/, e.g. 59 GB of plain Polish text from 2012."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df6fa1abb58549287111ba8d776733e9 0.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Przegląd okulistyczny \n",
"Focal points \n",
"Przegląd reumatologiczny \n",
"Biblioteka on-line \n",
"STRONA GŁÓWNA \n",
"WYDAWNICTWO \n",
"O wydawnictwie \n",
"Kontakt \n",
"Regulamin zamówień \n",
"Spotkania autorskie \n",
"Nasi autorzy \n",
"CZYTELNIA ONLINE \n",
"w dziale: anatomia \n",
"w dziale: okulistyka \n",
"w dziale: ratownictwo \n",
"CENNIK \n",
"LINKI \n",
"USŁUGI \n",
"df6fa1abb58549287111ba8d776733e9 2.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Licencjaty \n",
"Multimedia \n",
"Pulmonologia \n",
"Okulistyka \n",
"Ratownictwo \n",
"Reumatologia \n",
"Zestawy specjalne \n",
"Onkologia \n",
"Focal Points 4/2006\n",
"\n"
]
}
],
"source": [
"! (wget -O - -q http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/pl/raw/pl.2012.raw.xz \\\n",
" | xzcat | head -n 30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wikipedia dumps\n",
"\n",
"Do not download Wikipedia page by page!\n",
"\n",
"* you waste your own time\n",
"* and you waste the Wikipedia servers' time\n",
"\n",
"It is better to download a dump from https://dumps.wikimedia.org/backup-index.html"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1977]]\n",
"[[język skryptowy|skryptowy]]\n",
"[[programowanie proceduralne|proceduralny]]\n",
"[[Programowanie sterowane zdarzeniami|sterowany zdarzeniami]]\n",
"[[Alfred V. Aho|Alfred Aho]]\n",
"[[Peter J. Weinberger|Peter Weinberger]]\n",
"[[Brian Kernighan]]\n",
"[[wieloplatformowość|wieloplatformowy]]\n",
"[[język programowania]]\n",
"[[plik]]\n",
"[[system operacyjny|systemów operacyjnych]]\n",
"[[Unix|UNIX]]\n",
"[[tablica asocjacyjna|tablice asocjacyjne]]\n",
"[[Tekstowy typ danych|stringi]]\n",
"[[wyrażenie regularne|wyrażenia regularne]]\n",
"[[Alfred V. Aho|Alfreda V. Aho]]\n",
"[[Peter Weinberger|Petera Weinbergera]]\n",
"[[Brian Kernighan|Briana Kernighana]]\n",
"[[POSIX]]\n",
"[[System V|SVR4]]\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o '\\[\\[[^\\]]+\\]\\]' | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where do I get URLs from?\n",
"\n",
"### See the dumps above"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://ssl'\n",
"https://static.fc2.com/css_cn/common/headbar/120710style.css\n",
"https://blog.fc2.com/\n",
"https://spdeliver.i-mobile.co.jp/script/adsnativepc.js?20101001\n",
"https://media.fc2.com/counter_img.php?id=3493\n",
"https://plus.google.com/+apothekenumschau\n",
"https://script.ioam.de/iam.js\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/AGP-Kontaktformular--73317.html\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/Apotheker-HP--AGP-73319.html\n",
"https://login.apotheken-umschau.de/login?service=https://www.apotheken-umschau.de/j_spring_cas_security_check\n",
"https://forum.apotheken-umschau.de/portal/registration/register\n",
"https://www.facebook.com/Apotheken.Umschau\n",
"https://api.wortundbildverlag.com/drug-suggest/terms\n",
"https://07743rats-apotheke.apotheken-umschau.de/unternehmenskommunikation/Kontakt-zu-den-Redaktionen-53834.html\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/js/common.min.js?eBtyhdw\n",
"https://static.skyrock.net/img/favicon_v5b.ico\n",
"https://wir.skyrock.net/wir/v1/resize/?c=isi&im=%2F9775%2F59549775%2Fpics%2Fphoto_59549775_89.jpg&w=16\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/css/common.css?eahf2jw\n"
]
}
],
"source": [
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a 'https://[^ \"><]+' | uniq | head -n 20)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna\n",
"https://web.archive.org/web/20100116001012/http://homepages.cwi.nl/~dik/english/codes/stand.html#ascii\n",
"https://web.archive.org/web/20160613145224/http://www.aivosto.com/vbtips/charsets-7bit.html#body}}</ref>\n",
"https://web.archive.org/web/20160522024759/http://worldpowersystems.com/J/codes/#ASCII-1967\n",
"https://books.google.com/?id=NQSpNAEACAAJ&pg=PA28\n",
"https://web.archive.org/web/20160616084132/https://www.w3.org/blog/2008/05/utf8-web-growth/\n",
"https://web.archive.org/web/20160616084637/https://googleblog.blogspot.de/2008/05/moving-to-unicode-51.html\n",
"https://web.archive.org/web/20160616085323/https://googleblog.blogspot.de/2010/01/unicode-nearing-50-of-web.html\n",
"https://web.archive.org/web/20160827000956/http://dlx.bookzz.org/genesis/772000/c80a62495acf1e1a5b966de23c1f989a/_as/%5BInterface_Age_Staff%5D_Best_of_Interface_Age%2C_Volum%28BookZZ.org%29.pdf\n",
"https://books.google.com/books?id=bXLDwmIJNkUC&pg=PA13\n",
"https://web.archive.org/web/20161031223347/http://ethw.org/First-Hand%3AChad_is_Our_Most_Important_Product%3A_An_Engineer%27s_Memory_of_Teletype_Corporation\n",
"https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf\n",
"https://web.archive.org/web/20160526181319/http://longstreet.typepad.com/thesciencebookstore/2012/03/heres-the-link.html\n",
"https://web.archive.org/web/20120213005708/http://www.transbay.net/~enf/ascii/ascii.pdf\n",
"https://archive.org/details/dictionaryworldp00iann\n",
"https://archive.org/details/dictionaryworldp00iann/page/n80\n",
"https://www.theguardian.com/commentisfree/belief/2013/jan/28/lucretius-all-things-atoms\n",
"https://archive.org/details/distillingknowle00mora_557\n",
"https://archive.org/details/distillingknowle00mora_557/page/n156\n",
"https://archive.org/details/fromelementstoat00sieg\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o 'https://[^ \"><]+' | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DMOZ/ODP directory (unfortunately no longer active).\n",
"\n",
"Last archived link: https://web.archive.org/web/20160306230718/http://rdf.dmoz.org/rdf/content.rdf.u8.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Query another search engine \"parasitically\""
]
},
{
"cell_type": "code",
"execution_count": 1,
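"metadata": {},
"outputs": [],
"source": [
"# see https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal\n",
"\n",
"import urllib.parse\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"\n",
"def query_google(query):\n",
"    # Percent-encode the query so that spaces, quotes and non-ASCII characters survive in the URL.\n",
"    url = f\"https://google.com/search?q={urllib.parse.quote_plus(query)}\"\n",
"    response = requests.get(url)\n",
"    soup = BeautifulSoup(response.content, \"html.parser\")\n",
"\n",
"    results = []\n",
"    # href=True skips anchors without an href attribute (g['href'] would raise KeyError on them).\n",
"    for g in soup.find_all('a', href=True):\n",
"        link = g['href']\n",
"        # Result links look like '/url?q=<target>&sa=...'; drop the 7-character prefix.\n",
"        # The trailing '&sa=...' tracking parameters are left in place.\n",
"        if link.startswith('/url?q='):\n",
"            results.append((link[7:], g.parent.get_text()))\n",
"    return results"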
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQmhMwC3oECA0QDg&usg=AOvVaw0GUY96bFEsdrfOb9_ME9qP',\n",
" 'Wikipedia'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjASegQIDBAB&usg=AOvVaw3LMsdCuK3PBSunL8shYp-S',\n",
" 'Wielka Stopa (zwierzę) – Wikipedia, wolna encyklopediapl.wikipedia.org › wiki › Wielka_Stopa_(zwierzę)'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Opis&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQAg&usg=AOvVaw02WHiDgMZ18jJGW-y7agVg',\n",
" 'Opis'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Historia&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQAw&usg=AOvVaw10BrulHDJ4WgEOFkd-3-H6',\n",
" 'Historia'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Najwa%25C5%25BCniejsze_argumenty_%25E2%2580%259Eza%25E2%2580%259D_i_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQBA&usg=AOvVaw1nSHJDVeWEJTqpRJOMBcus',\n",
" 'Najważniejsze argumenty ...'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Argumenty_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQBQ&usg=AOvVaw3UqFIOr7y6yxvK-i1su1au',\n",
" 'Argumenty „przeciw”'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(w%25C3%25B3dz_Siuks%25C3%25B3w)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjATegQICxAB&usg=AOvVaw1lZSYrEp4ez0Kh4o4SXrY1',\n",
" 'Wielka Stopa (wódz Siuksów) – Wikipedia, wolna encyklopediapl.wikipedia.org › wiki › Wielka_Stopa_(wódz_Siuksów)'),\n",
" ('https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQtwIwFHoECAQQAQ&usg=AOvVaw2EugGtxH-FfMbNmqhS5py3',\n",
" 'Wielka Stopa w Suszu - YouTubewww.youtube.com › watch'),\n",
" ('https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQuAIwFHoECAQQAg&usg=AOvVaw17g24VY46PboJW54XyZGa1',\n",
" '23 cze 2017 · Od niedawna oczy naukowców poszukujących Wielkiej Stopy skierowane są na niewielkie ...Czas trwania: 6:24\\nOpublikowano: 23 cze 2017'),\n",
" ('https://www.ceneo.pl/%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAZegQIBhAB&usg=AOvVaw0HUE-TpszLKJjAMsV6lvPU',\n",
" 'Wielka Stopa - znaleziono na Ceneo.plwww.ceneo.pl › ...'),\n",
" ('https://www.antyradio.pl/News/Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hustala-sie-na-drzewie-ZDJECIE-43102&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAaegQICBAB&usg=AOvVaw1iIlPUpJwldL0MacDY4ebw',\n",
" 'Wielka Stopa - kolejny przypadek spotkania z potworem - Antyradiowww.antyradio.pl › News › Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hu...'),\n",
" ('https://allegro.pl/kategoria/gry%3Fstring%3DWielka%2520stopa%2520%253A)%2520-&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAbegQIABAB&usg=AOvVaw0mgn1YuyE65LFfA54P-gQo',\n",
" 'Wielka stopa :) - Gry - Allegro.plallegro.pl › Kultura i rozrywka › Gry'),\n",
" ('https://allegro.pl/listing%3Fstring%3DWielka%2520stopa%2520%253A%2529%2520-&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAcegQIAxAB&usg=AOvVaw3dzMG9f8K5w31r30AyxNEz',\n",
" 'Wielka stopa :) - Niska cena na Allegro.plallegro.pl › listing'),\n",
" ('https://www.empik.com/gra-strategiczna-yeti-wielka-stopa-jawa,p1103341700,zabawki-p&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAdegQIBxAB&usg=AOvVaw3xZ_RVxgMxK7vOUPAYO-pe',\n",
" 'Gra strategiczna Yeti Wielka stopa - | Sklep EMPIK.COMwww.empik.com › Zabawki › Gry › Strategiczne i ekonomiczne'),\n",
" ('https://tvn24.pl/tvnmeteo/informacje-pogoda/ciekawostki,49/wielka-stopa-nie-istnieje-naukowcy-to-nie-koniec-nadziei,127328,1,0.html&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAeegQICRAB&usg=AOvVaw3XECuxJKyNK_x4MTREa9Ui',\n",
" 'Wielka Stopa nie istnieje? Naukowcy: to nie koniec nadziei - TVN24tvn24.pl › Informacje pogodowe › Ciekawostki'),\n",
" ('https://www.monolith.pl/filmy/2020/mala-wielka-stopa-2-w-rodzinie-sila/&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAfegQIChAB&usg=AOvVaw3uFesbmGBr0dDWxK1ej5n_',\n",
" 'Mała Wielka Stopa 2 - Filmy - Monolith Filmswww.monolith.pl › filmy › mala-wielka-stopa-2-w-rodzinie-sila'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQty4IigE&usg=AOvVaw0fYQ97CWfJ8aCmNBcv3a_d',\n",
" 'Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522wielka%252Bstopa%252522%26hl%3Dpl&sa=U&ved=0ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQxs8CCIsB&usg=AOvVaw1V17_OrU9CNrErDjbwNZRj',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"wielka stopa\"')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Google hacking\n",
"\n",
"... that is, creative use of the Google search engine (not necessarily for malicious purposes)\n",
"\n",
"#### How to find bilingual material?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/english%2Bversion&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAAegQIABAB&usg=AOvVaw3RrHCxcaLe8qoaZfLEPV6Y',\n",
" 'english version - Tłumaczenie na polski - angielskich przykładów ...context.reverso.net › tłumaczenie › angielski-polski › english+version'),\n",
" ('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/An%2BEnglish%2Bversion&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjABegQIBhAB&usg=AOvVaw017LUPkNtKNdnPE8dToBSB',\n",
" 'An English version - Tłumaczenie na polski - angielskich przykładów ...context.reverso.net › tłumaczenie › angielski-polski › An+English+version'),\n",
" ('https://pl.bab.la/slownik/angielski-polski/english-version&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjACegQICRAB&usg=AOvVaw0BG6Y5Y4PWUDFAMQbF5OiB',\n",
" 'ENGLISH VERSION - Tłumaczenie na polski - bab.lapl.bab.la › slownik › angielski-polski › english-version'),\n",
" ('https://www.linguee.com/english-polish/translation/in%2Benglish%2Bversion.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjADegQIBxAB&usg=AOvVaw03YqBv17ZeVx2FwKA2Y2gu',\n",
" 'in English version - Polish translation – Lingueewww.linguee.com › english-polish › translation › in+english+version'),\n",
" ('https://www.linguee.com/english-polish/translation/an%2Benglish%2Bversion.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAEegQICBAB&usg=AOvVaw261dClyWD55TlTUkm5JNiI',\n",
" 'an English version - Polish translation – Lingueewww.linguee.com › english-polish › translation › an+english+version'),\n",
" ('https://www.youtube.com/watch%3Fv%3DdC8Jy0-VImU&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QtwIwBXoECAoQAQ&usg=AOvVaw1fvEyAWPyHIeWCqTmx5efS',\n",
" 'MELODIA - Sanah | PO ANGIELSKU | ENGLISH VERSION - YouTubewww.youtube.com › watch'),\n",
" ('https://www.youtube.com/watch%3Fv%3DdC8Jy0-VImU&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QuAIwBXoECAoQAg&usg=AOvVaw2n8-O6Aooitc2POfMr2eSI',\n",
" '2 lip 2020 · Z uwagi na to, że wersja angielska \"Szampana\" bardzo Wam się spodobała, postanowiłam ...Czas trwania: 3:16\\nOpublikowano: 2 lip 2020'),\n",
" ('https://www.linguee.pl/angielski-polski/t%25C5%2582umaczenie/english%2Bversion%2Bprevail.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAGegQIAhAB&usg=AOvVaw2gR32hWrps8JeETEZFcnC3',\n",
" 'English version prevail - Tłumaczenie na polski – słownik Lingueewww.linguee.pl › angielski-polski › tłumaczenie › english+version+prevail'),\n",
" ('https://www.linguee.pl/angielski-polski/t%25C5%2582umaczenie/english%2Bversion%2Bcoming%2Bsoon.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAHegQIARAB&usg=AOvVaw1Gy_8y1P8j2LkQmOcFNUho',\n",
" 'English version coming soon - Tłumaczenie na polski – słownik ...www.linguee.pl › angielski-polski › english+version+coming+soon'),\n",
" ('https://www.umcs.pl/pl/instrukcja-w-jezyku-angielskim-english-version-,15428.htm&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAIegQIBRAB&usg=AOvVaw2qxqPHA01a_XGp2OI2LwHh',\n",
" 'Instrukcja w języku angielskim (english version) - Nowi pracownicy ...www.umcs.pl › ... › Dla pracownika › Nowi pracownicy (instrukcja)'),\n",
" ('https://www.wsb.net.pl/en/&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAJegQIAxAB&usg=AOvVaw33uMYMxHmM5oTynwt9481F',\n",
" 'English version : - Wyższa Szkoła Bezpieczeństwawww.wsb.net.pl › ...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0Qty4INA&usg=AOvVaw3FvXRX8gjDnoExpLAPHyWl',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dsi%2525C4%252599%252B%252522English%252Bversion%252522%26hl%3Dpl&sa=U&ved=0ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0Qxs8CCDU&usg=AOvVaw3nXIS27h-FWwpKhQDIdB9y',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('się \"English version\"')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://www.ksk.gda.pl/%3Fs%3D%257Bsearch_term_string%257D%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Dde%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Dde%253Flang%253Den%253Flang%253Dde%253Flang%253Dde%253Flang%253Dde%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAAegQIAxAB&usg=AOvVaw1rz99qpelK6AKXNq32Y3DB',\n",
" '{search_term_string}?lang=en?lang=fr?lang=fr?lang=de?lang=en ...www.ksk.gda.pl › s={search_term_string}?lang=en?lang=fr?lang=fr?lang=...'),\n",
" ('https://emonitoring.poczta-polska.pl/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjABegQIBBAB&usg=AOvVaw3BgMdqycY5NWdhCmVHe6Eo',\n",
" 'Śledzenie przesyłek - Poczta Polskaemonitoring.poczta-polska.pl › lang=en'),\n",
" ('http://44mpa.pl/urban-adaptation-plans/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjACegQICxAB&usg=AOvVaw0yHXmZ8Tv3dujCVJIRKjR7',\n",
" 'Urban Adaptation Plans | Wczujmy się w klimat!44mpa.pl › urban-adaptation-plans › lang=en'),\n",
" ('http://www.apiscosmetics.pl/start-en/products/professional-products/home-terapis-en.html%3Fproduct%3D288%26lang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjADegQICBAB&usg=AOvVaw1QwK_aHzWym29dEM4w0MSw',\n",
" '<!doctype html> <html lang=\"en\"> <head> <meta http-equiv ... - Apiswww.apiscosmetics.pl › products › professional-products › home-terapis-en'),\n",
" ('https://ekursy.akademiakierowcy.pl/message/output/airnotifier/lang/en/&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAEegQIBxAB&usg=AOvVaw2fR_Xur4oOOIxEb1KiJBRL',\n",
" 'Index of /message/output/airnotifier/lang/en - Akademia Kierowcyekursy.akademiakierowcy.pl › message › output › airnotifier › lang'),\n",
" ('https://ekursy.akademiakierowcy.pl/message/output/popup/lang/en/&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAFegQICRAB&usg=AOvVaw38ifWqViF-gaqRnBYCs7ph',\n",
" 'Index of /message/output/popup/lang/en - Akademia Kierowcyekursy.akademiakierowcy.pl › message › output › popup › lang'),\n",
" ('https://www.zabierzow.org.pl/community/welcome/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAKegQIABAB&usg=AOvVaw1u_tc6Q_mK_qSy_JeUs21l',\n",
" 'Welcome - Oficjalny serwis internetowy Gminy Zabierzówwww.zabierzow.org.pl › Strona główna › Community'),\n",
" ('https://www.ipiss.com.pl/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjALegQIBRAB&usg=AOvVaw1v4Ep4-1xZU2aj34RQNyA6',\n",
" 'Institute of Labour and Social Studieswww.ipiss.com.pl › lang=en'),\n",
" ('https://support.google.com/webmasters/answer/7489871%3Fhl%3Dpl&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQvxowC3oECAUQAg&usg=AOvVaw3QrhPCjSv1m5Remte9HOQz',\n",
" 'Dowiedz się dlaczego'),\n",
" ('http://www.klub-spadkobiercow.com.pl/%3Fs%3D%25E2%259A%25BD%25E2%259A%25A1%25E2%2598%2598%25EF%25B8%258F%25E2%258F%25B2%2Bkupi%25C4%2599%2Bbmw%2Bseria%2B5%2Boferty%2BSamocholand.pl%2B%25F0%259F%2590%259D%25E2%259C%258B%2B-%2BKupno%2Bsamochod%25C3%25B3w%2B%25F0%259F%258C%258D%25F0%259F%2593%2598%2Bbmw%2Bseria%2B5%2Bkupno%252C%2BKup%2Bbmw%2Bseria%2B5%2Btanio%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAMegQIAhAB&usg=AOvVaw3OrIJeKwmccNn-Z0ci9WZ5',\n",
" 'kupię bmw seria 5 oferty Samocholand.pl - Kupno samochodów ...www.klub-spadkobiercow.com.pl › s=⚽⚡☘⏲+kupię+bmw+seria+5+oferty...'),\n",
" ('http://www.klub-spadkobiercow.com.pl/%3Fs%3D%25F0%259F%2594%2590%25F0%259F%2598%25B2%25F0%259F%258C%259F%25F0%259F%2592%259C%2BSprzedam%2Bsamochody%2Bhummer%2Bh3%2Bog%25C5%2582oszenia%2BSamocholand.pl%2B%25E2%258F%25B2%25F0%259F%2598%258B%2B-%2BSprzeda%25C5%25BC%2Bsamochod%25C3%25B3w%2B%25F0%259F%2592%259E%25F0%259F%2594%2590%2Bsamochody%2Bhummer%2Bh3%2Bog%25C5%2582oszenia%252C%2BSpprzedaj%2Bsamochody%2Bhummer%2Bh3%2Bpilnie%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjANegQIARAB&usg=AOvVaw2gGpRa2QRI0s5hif4sSG15',\n",
" 'Sprzedam samochody hummer h3 ogłoszenia Samocholand.pl ...www.klub-spadkobiercow.com.pl › s=🔐😲🌟💜+Sprzedam+samochody+hu...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQty4ISg&usg=AOvVaw3qJv9X5Au4qLqskqZgygmA',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dinurl:lang%25253Den%252Bsite:pl%26hl%3Dpl&sa=U&ved=0ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQxs8CCEs&usg=AOvVaw1bNj0srkIoKMTez1biljAK',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('inurl:lang=en site:pl')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAAegQIAxAB&usg=AOvVaw1VOWJd4mMu1wbrjT0N2fwg',\n",
" 'decided - Tłumaczenie na polski - angielskich przykładów | Reverso ...context.reverso.net › tłumaczenie › angielski-polski › decided'),\n",
" ('https://context.reverso.net/t%25C5%2582umaczenie/polski-angielski/zdecydowali&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjABegQIAhAB&usg=AOvVaw392MbfKZ25nbvv_wpUfF4s',\n",
" 'zdecydowali - Tłumaczenie na angielski - polskich przykładów ...context.reverso.net › tłumaczenie › polski-angielski › zdecydowali'),\n",
" ('https://pl.duolingo.com/dictionary/English/decided/f241156f8cd032ca9b65a8bd760439d8&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjACegQICxAB&usg=AOvVaw3ofU6NSr4cVJ7Wp75lDPWm',\n",
" 'Co oznacza „decided” po angielsku? - Duolingopl.duolingo.com › dictionary › English › decided'),\n",
" ('https://www.diki.pl/slownik-angielskiego%3Fq%3Ddecide&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjADegQICRAB&usg=AOvVaw3D_KS9QB14t8N79rhLEzXx',\n",
" 'decide - Tłumaczenie po polsku - Słownik angielsko-polski Dikiwww.diki.pl › slownik-angielskiego › q=decide'),\n",
" ('http://www.slownictwo.pl/dict1.php%3Ftxt%3Ddecided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAEegQIChAB&usg=AOvVaw2ho4z_VbbIZQfbaQTkaQir',\n",
" 'Internetowy słownik polsko-angielski i angielsko-polski z lektoremwww.slownictwo.pl › dict1 › txt=decided'),\n",
" ('https://pl.bab.la/slownik/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAFegQICBAB&usg=AOvVaw1UVHsgO7GZH-vm4_x5MGDW',\n",
" 'DECIDED - Tłumaczenie na polski - bab.lapl.bab.la › slownik › angielski-polski › decided'),\n",
" ('https://fiszkoteka.pl/slownik/pl/en/zdecydowa%25C5%2582&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAKegQIARAB&usg=AOvVaw1zaRQ2cAhJHPJFYPa5JCT8',\n",
" '→ zdecydował po angielsku, słownik polsko - angielski | Fiszkotekafiszkoteka.pl › słownik polsko - angielski › Z'),\n",
" ('https://fiszkoteka.pl/slownik/en/pl/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjALegQIBhAB&usg=AOvVaw3JyZ1e2LvRkwv_mjklzaiO',\n",
" '→ decided po polsku, słownik angielsko - polski | Fiszkotekafiszkoteka.pl › słownik angielsko - polski › D'),\n",
" ('https://ellalanguage.com/pl/slownik_angielski_decide/&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAMegQIBBAB&usg=AOvVaw2hbOA7JWSyFSTH04bVg5rS',\n",
" 'Odmiana czasownika DECIDE | Angielskie czasowniki | ELLAellalanguage.com › slownik_angielski_decide'),\n",
" ('https://tr-ex.me/t%25C5%2582umaczenie/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjANegQIBRAB&usg=AOvVaw0Fl5dYqoiEFcgUzWH0mN2S',\n",
" 'DECIDED ▷ Tłumaczenie Na Polski - Przykłady Użycia Decided W ...tr-ex.me › tłumaczenie › angielski-polski › decided'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi0-436s6XvAhUzo3EKHU0MAG8Qty4IQw&usg=AOvVaw1uu2p_1jLxzOHd7KfkS2NU',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dzdecydowali%252Bdecided%26hl%3Dpl&sa=U&ved=0ahUKEwi0-436s6XvAhUzo3EKHU0MAG8Qxs8CCEQ&usg=AOvVaw1sNjBEDjM9eZu9ozeQEJqs',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('zdecydowali decided')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://ispan.waw.pl/journals/index.php/sfps/article/view/sfps.2014.020&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAAegQIABAB&usg=AOvVaw3PKZOp-ZKdH0s_POMTQrv-',\n",
" 'Słowa kluczowe podawane przez autora publikacji jako podstawa ...ispan.waw.pl › journals › index.php › sfps › article › view › sfps.2014.020'),\n",
" ('http://www.wbios.us.edu.pl/tl_files/aktualnosci/revitare-2013/konferencja-streszczenie-wzor.pdf&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjABegQIAxAB&usg=AOvVaw1XgVp3uZUGn0Ig0sADojZO',\n",
" '[PDF] WZÓR STRESZCZENIAwww.wbios.us.edu.pl › revitare-2013 › konferencja-streszczenie-wzor'),\n",
" ('https://docs.microsoft.com/pl-pl/dotnet/csharp/language-reference/keywords/&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjACegQICxAB&usg=AOvVaw1Ppo-QeKIjwxw8D8zLOIDN',\n",
" 'Słowa kluczowe języka C#C# Keywords - Microsoft Docsdocs.microsoft.com › ... › Przewodnik dla języka C# › Dokumentacja języka'),\n",
" ('https://docs.microsoft.com/pl-pl/cpp/cpp/keywords-cpp&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjADegQICBAB&usg=AOvVaw09GBEO-bl_GHGuApWZv46H',\n",
" 'Słowa kluczowe (C++) | Microsoft Docsdocs.microsoft.com › ... › Konwencje leksykalne'),\n",
" ('https://www.researchgate.net/publication/271724450_Keywords_tags_and_what_else_Slowa_kluczowe_tagi_i_co_dalej&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAEegQICRAB&usg=AOvVaw2lYe8oCMu-372n8o6jjnvA',\n",
" '(PDF) Keywords, tags... and what else? [Słowa kluczowe, tagi…, i co ...www.researchgate.net › publication › 271724450_Keywords_tags_and_wh...'),\n",
" ('https://clarin-pl.eu/dspace/bitstream/handle/11321/589/S%25C5%2582owa%2520kluczowe%2520-%2520wytyczne%2520%2528publikacja%2529.pdf%3Fsequence%3D1%26isAllowed%3Dy&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAFegQIChAB&usg=AOvVaw1zbgvbNQDTRmK3GXVFB6Gx',\n",
" '[PDF] słowa kluczowe - CLARIN-PLclarin-pl.eu › dspace › bitstream › handle'),\n",
" ('https://pl.qaz.wiki/wiki/List_of_Java_keywords&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAKegQIBhAB&usg=AOvVaw32i5c9auW8kJ6j0fZPo2ml',\n",
" 'Lista słów kluczowych Java - List of Java keywords - qaz.wikipl.qaz.wiki › wiki › List_of_Java_keywords'),\n",
" ('http://www.standardy.pl/index.php/artykuly/drukuj/1316&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjALegQIAhAB&usg=AOvVaw0MgKxzmQaV_C8gvS9n_BU4',\n",
" '[PDF] x Keywords: x Autorzy: List otwarty do PTN Streszczenie: x Abstractwww.standardy.pl › index.php › artykuly › drukuj'),\n",
" ('http://cejsh.icm.edu.pl/cejsh/element/bwmeta1.element.ojs-doi-10_11649_sfps_2014_020&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAMegQIBRAB&usg=AOvVaw1ckSaZzuEVpMhFLEWNo7tU',\n",
" 'Słowa kluczowe podawane przez autora ... - CEJSH - ICM UWcejsh.icm.edu.pl › bwmeta1.element.ojs-doi-10_11649_sfps_2014_020'),\n",
" ('http://www.bobolanum.edu.pl/wydawnictwo-artykul&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjANegQIARAB&usg=AOvVaw1FzLP8mLAHuszJjWFoCtOZ',\n",
" 'Artykuł - wymogi edytorskie / The Article - Editorial Requirements ...www.bobolanum.edu.pl › wydawnictwo-artykul'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQty4ITQ&usg=AOvVaw275ECJoqdlgg6bzr8BjvBK',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522s%2525C5%252582owa%252Bkluczowe%252522%252Bkeywords%252Babstract%26hl%3Dpl&sa=U&ved=0ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQxs8CCE4&usg=AOvVaw22rLBFpQgI8blcDhcAZu1P',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"słowa kluczowe\" keywords abstract')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### How to find insecure or odd pages?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://smolarz.szczecin.lasy.gov.pl/test-grafika&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAAegQIAhAB&usg=AOvVaw00PjOy7FFcAzFOiEWBj5q-',\n",
" 'test grafika - Nadleśnictwo Smolarz - Lasy Państwowesmolarz.szczecin.lasy.gov.pl › test-grafika'),\n",
" ('http://www.malopolska.mw.gov.pl/aktualnosci/samorzad/blabla&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjABegQICRAB&usg=AOvVaw2FPiYfJO-h1e4cEis8U7Pu',\n",
" 'Małopolska na Dożynkach Prezydenckich w Spale » Małopolskawww.malopolska.mw.gov.pl › aktualnosci › samorzad › blabla'),\n",
" ('http://sejm.gov.pl/Sejm9.nsf/wypowiedz.xsp%3Fposiedzenie%3D20%26dzien%3D2%26wyp%3D113%26symbol%3DRWYSTAPIENIA_WYP%26id%3D073&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjACegQICBAB&usg=AOvVaw06C6-TRfwEa0vnqBZqICgI',\n",
" 'Wypowiedzi na posiedzeniach Sejmusejm.gov.pl › Sejm9.nsf › wypowiedz'),\n",
" ('https://www.gov.pl/web/psse-walbrzych/test3&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjADegQIBxAB&usg=AOvVaw0C4Wts3msCWyEcHpuou4Gv',\n",
" 'test - Powiatowa Stacja Sanitarno-Epidemiologiczna w Wałbrzychu ...www.gov.pl › web › psse-walbrzych › test3'),\n",
" ('https://www.biznes.gov.pl/glos-przedsiebiorcy/idea/porzadny-slownik&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAEegQIBRAB&usg=AOvVaw3sSgvJNIu57v7xRbsUaGPJ',\n",
" 'Pomysły na biznes.gov.plwww.biznes.gov.pl › glos-przedsiebiorcy › idea › porzadny-slownik'),\n",
" ('http://demo.licytacje.uzp.gov.pl/contest/view/sid/L-76-2011&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAFegQIBhAB&usg=AOvVaw3qQ5q60_RMk3yVEHZSsLgd',\n",
" 'Urząd Zamówień Publicznychdemo.licytacje.uzp.gov.pl › contest › view › sid'),\n",
" ('https://www.biznes.gov.pl/glos-przedsiebiorcy%3Fpage%3D24&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAGegQIABAB&usg=AOvVaw0-0BmNu2idsAGELz1ytQrr',\n",
" 'Pomysły na biznes.gov.plwww.biznes.gov.pl › glos-przedsiebiorcy'),\n",
" ('https://www.gddkia.gov.pl/frontend/web/userfiles/articles/o/ogloszenie-z-dnia-27112017_27828/za%25C5%2582.2.%2520do%2520regulaminu%2520-%2520%25C5%259Bwiadectwa%2520legalno%25C5%259Bci%2520ze%2520zdj%25C4%2599ciami.pdf&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAHegQIBBAB&usg=AOvVaw1QSAjj5hsgZD9v5dO65nt3',\n",
" '[PDF] ŚWIADECTWO LEGALNOŚCI POZYSKANIA DREWNA [pdf] - GDDKiAwww.gddkia.gov.pl › articles › ogloszenie-z-dnia-27112017_27828'),\n",
" ('https://www.arimr.gov.pl/wersja-testowa/zalaczniki-do-wniosku-w-2015-r/rejestr.html&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAIegQIARAB&usg=AOvVaw1kN54m-oEhXvu9HM_jf5r2',\n",
" 'rejestr | Agencja Restrukturyzacji i Modernizacji Rolnictwawww.arimr.gov.pl › wersja-testowa › zalaczniki-do-wniosku-w-2015-r › re...'),\n",
" ('http://www.zielona-gora.sr.gov.pl/download.php%3Finst%3D1%26id%3D1889&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAJegQIAxAB&usg=AOvVaw0C5yVLkbZgo3j_SPFeS3kD',\n",
" '[PDF] Untitledwww.zielona-gora.sr.gov.pl › download'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQty4IMA&usg=AOvVaw2jLSNJ1Fojm0RC3f1Rei7X',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dblabla%252Bsite:gov.pl%26hl%3Dpl&sa=U&ved=0ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQxs8CCDE&usg=AOvVaw0MEcRxsUFD_99cunMcln-U',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('blabla site:gov.pl')"
]
},
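{
"cell_type": "markdown",
"metadata": {},
"source": [
"The URLs in the results above end with what look like redirect parameters added by Google (`&sa=U&ved=...&usg=...`), and the target address itself is percent-encoded. A minimal sketch of cleaning them up could look as follows; `clean_result_url` is a hypothetical helper, not part of the notebook, and it assumes that the first `&sa=` marks the start of the appended parameters.\n",
"\n",
"```python\n",
"from urllib.parse import unquote\n",
"\n",
"\n",
"def clean_result_url(raw):\n",
"    '''Cut off the trailing &sa=...&ved=...&usg=... part and decode the rest once.'''\n",
"    # Assumption: the first '&sa=' starts the parameters appended by the search engine.\n",
"    target, _, _ = raw.partition('&sa=')\n",
"    return unquote(target)\n",
"\n",
"\n",
"raw = 'http://www.zielona-gora.sr.gov.pl/download.php%3Finst%3D1%26id%3D1889&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAJegQIAxAB&usg=AOvVaw0C5yVLkbZgo3j_SPFeS3kD'\n",
"print(clean_result_url(raw))\n",
"# http://www.zielona-gora.sr.gov.pl/download.php?inst=1&id=1889\n",
"```\n",
"\n",
"Decoding only once is deliberate: some results are double-encoded (`%25C5%2582`), and a single pass leaves them as ordinary percent-encoded URLs."
]
},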
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('http://www.gios.gov.pl/images/dokumenty/pms/monitoring_pol_elektormagnetycznych/raport/Zalacznik_1-_mapa_Szczecin.pdf&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAAegQIARAB&usg=AOvVaw3iiQhOAEZVJob4cs973EUY',\n",
" '[PDF] mapa Szczecinwww.gios.gov.pl › pms › raport › Zalacznik_1-_mapa_Szczecin'),\n",
" ('http://www.gios.gov.pl/images/dokumenty/pms/monitoring_pol_elektormagnetycznych/raport/Zalacznik_1-_mapa_Gdansk.pdf&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjABegQICRAB&usg=AOvVaw0QQC4LH21f3xjE0rM8PI6L',\n",
" '[PDF] C:\\\\Documents and Settings\\\\ja\\\\Pulpit\\\\Gdańsk\\\\Mapy.dwg A3 mapa ...www.gios.gov.pl › pms › raport › Zalacznik_1-_mapa_Gdansk'),\n",
" ('https://www.gddkia.gov.pl/pl/d/f7041e734f9b37cd88cae0a9000102a1&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjACegQIBxAB&usg=AOvVaw0gO4uj__F-7icHZYIQeTPL',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/fec8268b624add970e544fefefcd043f&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjADegQICBAB&usg=AOvVaw02AVxRGVLmdXAyqtSBZZRo',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/392dd80745a5a025df1d225bbf0b8e02&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAEegQIAxAB&usg=AOvVaw2Vjr_Ez89bHJrDaPepIRsF',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/dfc6e11545fb637fef5a00f53ce94414&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAFegQIBBAB&usg=AOvVaw3A8r1jWPXDCm7XwoWfkjzf',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/996b6076155b215e7ee8d5897fc6153b&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAGegQIAhAB&usg=AOvVaw1TIDEU5BMlHlGMYkNkbWM4',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/3010c117961da9877405841ef5c65a07&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAHegQIBRAB&usg=AOvVaw2IS01b7eg6XhHaQHZ3jK13',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/bed97709d7349e000a041a60388ab1ee&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAIegQIBhAB&usg=AOvVaw1X_Tq2PTGDaRTXm_xi5PQz',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\Malik_M\\\\Moje ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('http://www.gddkia.gov.pl/pl/d/0c5befb91a5b0b0c8bbc3b5a293ad0fc&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAJegQIABAB&usg=AOvVaw3sEannIxW2G91xP2bUK6Me',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl › ...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwjLkrnptKXvAhXYSRUIHatABOMQty4ILg&usg=AOvVaw0yirg8KksKVYdZKGNbhKol',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dintitle:settings%252Bfiletype:pdf%252Bsite:gov.pl%26hl%3Dpl&sa=U&ved=0ahUKEwjLkrnptKXvAhXYSRUIHatABOMQxs8CCC8&usg=AOvVaw0b9IEfcDUv6isVIMCWaieO',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('intitle:settings filetype:pdf site:gov.pl')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://www.gov.pl/attachment/3ddad90a-8136-4d9c-a56f-1ed206bf2b24&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAAegQIABAB&usg=AOvVaw3vizwfsDj6dYNSA8t3-tWi',\n",
" '[XLS] NAZWISKA_MEN A B 1 100 najpopularniejszych nazwisk męskich ...www.gov.pl › attachment'),\n",
" ('https://doc.rmf.pl/rmf_fm/store/Kopia_nazwiska_2010.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjABegQIBxAB&usg=AOvVaw3rhrn9Nyg5ac0TyxUqDi1t',\n",
" '[XLS] nazwiska A B C D E F G H I 1 Najcześciej występujące nazwiska ...doc.rmf.pl › rmf_fm › store › Kopia_nazwiska_2010'),\n",
" ('http://dydaktyka.polsl.pl/roz6/izdonek/Shared%2520Documents/MS%2520Excel/7_Dzia%25C5%2582ania%2520na%2520danych%2520typu%2520tekst_podr.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjACegQICBAB&usg=AOvVaw38RDxGxB5aMALBoLG9XEVR',\n",
" '[XLS] Wielkość liter A B C D 1 Przykład 7.1 2 Podany fragment bazy ...dydaktyka.polsl.pl › roz6 › izdonek'),\n",
" ('http://zprp.pl/wp-content/uploads/2015/02/Lista_transferowa_2017_18_v1.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjADegQICRAB&usg=AOvVaw2v7UWBKRjO57O-fM-Ox6-K',\n",
" '[XLS] Lista 2017 A B C D E F 1 Lp Nazwisko Imię Klub macierzysty Status ...zprp.pl › uploads › 2015/02 › Lista_transferowa_2017_18_v1'),\n",
" ('https://umostrow.pl/files/file_add/download/1163_kopia-2020-stmig-cooper-1-sprawozdanie-cz1.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAEegQIBhAB&usg=AOvVaw350iyfjLCkSKGxxX-ezFdj',\n",
" '[XLS] STMiG 2020 - formularz testu Coopera - Ostrów Wielkopolskiumostrow.pl › 1163_kopia-2020-stmig-cooper-1-sprawozdanie-cz1'),\n",
" ('https://www.mbank.pl/pobierz/mbankrejestumow.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAFegQIAhAB&usg=AOvVaw3fZtm8ph8HLJwAJIxTeoL5',\n",
" '[XLS] Sheet_1 A B C 1 Przedsiębiorca Siedziba Przedsiębiorcy NIP 2 ...www.mbank.pl › pobierz › mbankrejestumow'),\n",
" ('http://um.bip.legnica.eu/download/107/26919/drugiepolrocze2017.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAGegQIAxAB&usg=AOvVaw3JqmWVIkWufqX7a5NxLyeH',\n",
" '[XLS] Export Worksheet A B C D E 1 DATA_ZAWARCIA ...um.bip.legnica.eu › download › drugiepolrocze2017'),\n",
" ('http://szswielkopolska.pl/13-kk-io-44.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAHegQIARAB&usg=AOvVaw1SEckfGtKXrghNhgKx7UzB',\n",
" '[XLS] SP 7 Ostrów - SZS Wielkopolskaszswielkopolska.pl › 13-kk-io-44'),\n",
" ('http://www.wsm.edu.pl/fotos/dziekanat/karty_roczne_AIU_2009_2013.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAIegQIBBAB&usg=AOvVaw1i9Mt01azVHjxeFwZJMbLs',\n",
" '[XLS] sem 1 A B C D E F G H I J K L M N O P Q R S T U V W X 1 Wyższa ...www.wsm.edu.pl › fotos › dziekanat › karty_roczne_AIU_2009_2013'),\n",
" ('http://www.arimr.gov.pl/fileadmin/pliki/zdjecia_strony/132/OR07_los121_w.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAJegQIBRAB&usg=AOvVaw3QFCWasKloqTlTbK9HVfi0',\n",
" '[XLS] Kolejno** wylosowanych wniosków w ramach dzia*ania ...www.arimr.gov.pl › pliki › zdjecia_strony › OR07_los121_w'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4Qty4IMA&usg=AOvVaw3VOwJyWy4exubKqjpl7aPI',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dpesel%252Bfiletype:xls%252Bkaczmarek%26hl%3Dpl&sa=U&ved=0ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4Qxs8CCDE&usg=AOvVaw0f2Vo1eTV7WPUx-FUMYU8C',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('pesel filetype:xls kaczmarek')"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://akademia.nask.pl/foto/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAAegQIABAB&usg=AOvVaw1q9KOfc65WIi8jlO1z3TzI',\n",
" 'Index of /foto - Akademia NASKakademia.nask.pl › foto'),\n",
" ('http://ftp.man.poznan.pl/pub/apache/chemistry/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjABegQICRAB&usg=AOvVaw3jheEqWF7Iq_HaItKHR2H4',\n",
" 'Index of /pub/apache/chemistry - Nameftp.man.poznan.pl › pub › apache › chemistry'),\n",
" ('http://ftp.man.poznan.pl/pub/apache/kafka/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjACegQICBAB&usg=AOvVaw3oWl350iGMv7yN_zzmKlrj',\n",
" 'Index of /pub/apache/kafka - Descriptionftp.man.poznan.pl › pub › apache › kafka'),\n",
" ('http://www.ncac.torun.pl/~seyfert/%3FC%3DS%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjADegQIBxAB&usg=AOvVaw3IOMp-EkmpvsqzXfkzHLh_',\n",
" 'Index of /~seyfertwww.ncac.torun.pl › ~seyfert'),\n",
" ('http://www.mpu.pl/download/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAEegQIARAB&usg=AOvVaw2t4Py-QOSOgqH0JejD9OdE',\n",
" 'Index of /downloadwww.mpu.pl › download'),\n",
" ('http://www.psm-bielsk-podlaski.edu.pl/pl/images/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAFegQIBhAB&usg=AOvVaw1qPfo7aV0sGkb42ysGXzGS',\n",
" 'Index of /pl/images - PSM Bielsk Podlaskiwww.psm-bielsk-podlaski.edu.pl › images'),\n",
" ('http://www.matrix.umcs.lublin.pl/~akrajka/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAGegQIAxAB&usg=AOvVaw3op5HIl9tMV6GQhC1IkuB1',\n",
" 'Index of /~akrajka - matrix.umcs.lublin.plwww.matrix.umcs.lublin.pl › ~akrajka'),\n",
" ('http://www.combio.pl/mirex2.download/pen/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAHegQIBRAB&usg=AOvVaw2Hd6NmIvw6kn8ENWsSdJQk',\n",
" 'Index of /mirex2.download/pen - combio.plwww.combio.pl › mirex2.download › pen'),\n",
" ('http://www.iich.gliwice.pl/download/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAIegQIAhAB&usg=AOvVaw05o8hkDQv8hHPSqAjNp-wT',\n",
" 'Index of /downloadwww.iich.gliwice.pl › download'),\n",
" ('http://www.cs.put.poznan.pl/mkadzinski/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAJegQIBBAB&usg=AOvVaw1fkEik765hTNPbBbenF_Rq',\n",
" 'Index of /mkadzinskiwww.cs.put.poznan.pl › mkadzinski'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQty4ILg&usg=AOvVaw3x8sw8cv98HNTbBSAnJ58x',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522index%252Bof%252522%252B%252522last%252Bmodified%252522%252B%252522parent%252Bdirectory%252522%252Bapache%26hl%3Dpl&sa=U&ved=0ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQxs8CCC8&usg=AOvVaw0TVKuX1CIb5g3C-Y2_D4iC',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"index of\" \"last modified\" \"parent directory\" apache')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('http://filipg-jenkins.wmi.amu.edu.pl/ISI2019/lecture-2019-02.pdf&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAAegQIABAB&usg=AOvVaw2HIittTKuAOR1ATLm972d6',\n",
" '[PDF] Inteligentne systemy informacyjne - Filip Graliński / UAMfilipg-jenkins.wmi.amu.edu.pl › ISI2019 › lecture-2019-02'),\n",
" ('https://md5.gromweb.com/%3Fmd5%3D3fcedf144be9f3dff1145db6c515fb34&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjABegQICRAB&usg=AOvVaw0JTZQuMmrZH56enRrfBVG1',\n",
" 'MD5 reverse for 3fcedf144be9f3dff1145db6c515fb34md5.gromweb.com › ...'),\n",
" ('https://pastebin.pl/view/d872a388&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjACegQIBxAB&usg=AOvVaw3z3-Auzt_qQrkU08fj67q2',\n",
" 'Re: ruchanie - Pastebinpastebin.pl › view'),\n",
" ('http://people.cs.georgetown.edu/~clay/classes/fall2015/ia/MD5.pass.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjADegQICBAB&usg=AOvVaw2tW7zmVhmNYCeEKr-1vA7V',\n",
" 'cbae07efa0c6ed330a283e80a9c02e8d ...people.cs.georgetown.edu › ~clay › classes › fall2015 › MD5.pass.txt'),\n",
" ('http://wklejto.pl/59019&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAEegQIBhAB&usg=AOvVaw1DNIJXZyC5I05BQsnSKMDh',\n",
" 'Kod: 59019 WKLEJTO.PL Darmowa wklejka, na zawsze!wklejto.pl › ...'),\n",
" ('http://docs2.chomikuj.pl/2854898545,PL,0,0,cs-szambo.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAFegQIAhAB&usg=AOvVaw2VKr7YjOicUXzK4zqHIWKQ',\n",
" 'cs szambo.txt - Chomikuj.pldocs2.chomikuj.pl › 2854898545,PL,0,0,cs-szambo'),\n",
" ('https://hashkiller.io/download_list/Found/139863.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAGegQIBBAB&usg=AOvVaw0cPadq0BLUdJR2EN_w1cNs',\n",
" 'f24eba008b3b789e4ee5d3dc8a33af27:Gumimaci1 ...hashkiller.io › download_list › Found'),\n",
" ('https://195.201.31.93/rx6NiRIx/&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAHegQIBRAB&usg=AOvVaw31DG-mSBQmSTDaBgxi8_XX',\n",
" 'Latest MD5 leaked AA3 - BitBin195.201.31.93 › ...'),\n",
" ('https://pastebin.com/dEsgsTqV&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAIegQIARAB&usg=AOvVaw1vCC1iy8lVGuq0E6rELfeM',\n",
" 'INSERT INTO `auth` (`id`, `name`, `premium ... - Pastebin.compastebin.com › dEsgsTqV'),\n",
" ('https://paste2.org/DeGOC334&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAJegQIAxAB&usg=AOvVaw2zwShLX08T5j4hSmbBM3Je',\n",
" 'Viewing Paste DeGOC334 - Paste2.orgpaste2.org › DeGOC334'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQty4ILw&usg=AOvVaw3TNS8kxuTo_YOIBJwKVXG_',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D6d932c406fa15164ee48ff5a52f81dae%26hl%3Dpl&sa=U&ved=0ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQxs8CCDA&usg=AOvVaw0DmFxG-Qro2rfZ0Ot1z-4V',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('6d932c406fa15164ee48ff5a52f81dae')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Project 1\n",
"\n",
"Build a web application for semi-automatic,\n",
"systematic collection of interesting Google\n",
"hacking results:\n",
"\n",
"* the user enters a query\n",
"  * word lists may be used, e.g. vulgarisms, colloquial expressions, or fillers (“bla bla”, “foo bar”); the system should then generate a series of queries\n",
"* the application queries Google (and possibly other search engines)\n",
"* the application collects the results and presents them to the user\n",
"* the user tags each result as interesting or not interesting\n",
"* queries can be run periodically; the user does not have to review results that are already tagged\n",
"* the application can list all results tagged as interesting so far"
]
},
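{
"cell_type": "markdown",
"metadata": {},
"source": [
"The core of these requirements could be prototyped along the following lines. This is only a sketch under stated assumptions: `expand_queries`, `ResultStore` and the `{word}` placeholder are hypothetical names, results are assumed to be `(url, title)` pairs like those returned by `query_google`, and tags are kept in memory rather than in a database.\n",
"\n",
"```python\n",
"def expand_queries(template, words):\n",
"    '''Return one query per word, substituted for the {word} placeholder.'''\n",
"    if '{word}' not in template:\n",
"        return [template]  # a plain query, nothing to expand\n",
"    return [template.replace('{word}', word) for word in words]\n",
"\n",
"\n",
"class ResultStore:\n",
"    '''Remembers how the user tagged each result URL (hypothetical design).'''\n",
"\n",
"    def __init__(self):\n",
"        self.tags = {}  # url -> True (interesting) or False (not interesting)\n",
"\n",
"    def untagged(self, results):\n",
"        '''Keep only the (url, title) pairs the user has not tagged yet.'''\n",
"        return [(url, title) for url, title in results if url not in self.tags]\n",
"\n",
"    def tag(self, url, interesting):\n",
"        self.tags[url] = interesting\n",
"\n",
"    def interesting(self):\n",
"        '''List all URLs tagged as interesting so far.'''\n",
"        return sorted(url for url, flag in self.tags.items() if flag)\n",
"\n",
"\n",
"print(expand_queries('{word} site:gov.pl', ['blabla', 'test']))\n",
"# ['blabla site:gov.pl', 'test site:gov.pl']\n",
"```\n",
"\n",
"On a periodic run, each generated query would be sent to the search engine and only `store.untagged(results)` shown to the user, so results that are already tagged are not reviewed again."
]
},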
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What not to fetch?\n",
"\n",
"The robots.txt standard"
]
},
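{
"cell_type": "markdown",
"metadata": {},
"source": [
"A crawler is expected to consult these rules before fetching a page. A minimal sketch using `urllib.robotparser` from the Python standard library is shown below. The rules are a shortened excerpt modelled on the listing that follows, while `example.com` and the `MyCrawler` user agent are placeholders. Wildcard rules such as `/*servlet` are left out because this parser compares paths by plain prefix.\n",
"\n",
"```python\n",
"from urllib.robotparser import RobotFileParser\n",
"\n",
"# Shortened, illustrative rules (not a complete robots.txt file).\n",
"ROBOTS_TXT = '''\n",
"User-agent: *\n",
"Disallow: /znajdz.do\n",
"Disallow: /dfptools/adview/\n",
"\n",
"User-agent: Yandex\n",
"Disallow: /\n",
"'''\n",
"\n",
"\n",
"def build_parser(robots_txt):\n",
"    '''Parse robots.txt content that is already available as a string.'''\n",
"    parser = RobotFileParser()\n",
"    parser.parse(robots_txt.splitlines())\n",
"    return parser\n",
"\n",
"\n",
"parser = build_parser(ROBOTS_TXT)\n",
"print(parser.can_fetch('MyCrawler', 'https://example.com/znajdz.do'))   # False\n",
"print(parser.can_fetch('MyCrawler', 'https://example.com/index.html'))  # True\n",
"print(parser.can_fetch('Yandex', 'https://example.com/index.html'))     # False\n",
"```\n",
"\n",
"For a live site, `set_url()` followed by `read()` would download the file instead of parsing a string."
]
},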
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"User-agent: *\n",
"Disallow: /*/wyszukaj/\n",
"Disallow: /*servlet\n",
"Disallow: /reloadwww?\n",
"Disallow: /dfptools/adview/\n",
"Disallow: /pub/ips/*\n",
"Disallow: /ods?\n",
"Disallow: /getFile.servlet*\n",
"Disallow: /aliasy/blad.jsp\n",
"Disallow: /znajdz.do\n",
"Disallow: /portalSearch.do\n",
"Disallow: /im/ab/b4/10/z17515435Q.jpg\n",
"Disallow: /75224259/\n",
"\n",
"User-agent: Googlebot-News\n",
"Disallow: /nowy/\n",
"Disallow: /mapa_strony\n",
"Disallow: /*/wyszukaj/\n",
"Disallow: /*/51,\n",
"Disallow: /*/55,\n",
"Disallow: /*/2,\n",
"Disallow: /*order=\n",
"Disallow: /*obxx=\n",
"Disallow: /*tag=\n",
"Disallow: /reloadwww?\n",
"Disallow: /ods?\n",
"Disallow: /*servlet\n",
"Disallow: /dfptools/adview/\n",
"\n",
"User-agent: Yandex\n",
"Disallow: /\n",
"\n",
"User-Agent: bingbot\n",
"Disallow: /\n",
"\n",
"User-agent: 008\n",
"Disallow: /\n",
"\n",
"User-agent: 010\n",
"Disallow: /\n",
"\n",
"User-agent: 360Spider\n",
"Disallow: /\n",
"\n",
"User-agent: 80legs\n",
"Disallow: /\n",
"\n",
"User-agent: Aboundex\n",
"Disallow: /\n",
"\n",
"User-agent: accelobot\n",
"Disallow: /\n",
"\n",
"User-agent: Add\\ Catalog\n",
"Disallow: /\n",
"\n",
"User-agent: AhrefsBot\n",
"Disallow: /\n",
"\n",
"User-agent: aiHitBot\n",
"Disallow: /\n",
"\n",
"User-agent: Alexibot\n",
"Disallow: /\n",
"\n",
"User-agent: Aqua_Products\n",
"Disallow: /\n",
"\n",
"User-agent: AskJeeves\n",
"Disallow: /\n",
"\n",
"User-agent: asterias\n",
"Disallow: /\n",
"\n",
"User-agent: awcheckBot\n",
"Disallow: /\n",
"\n",
"User-agent: b2w/0.1\n",
"Disallow: /\n",
"\n",
"User-agent: BackDoorBot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: BacklinkCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: Baiduspider\n",
"Disallow: /\n",
"\n",
"User-agent: BecomeBot\n",
"Disallow: /\n",
"\n",
"User-agent: BLEXBot\n",
"Disallow: /\n",
"\n",
"User-agent: BlowFish/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: Bookmark search tool\n",
"Disallow: /\n",
"\n",
"User-agent: BotALot\n",
"Disallow: /\n",
"\n",
"User-agent: brandwatch.net\n",
"Disallow: /\n",
"\n",
"User-agent: BuiltBotTough\n",
"Disallow: /\n",
"\n",
"User-agent: Bullseye/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: BunnySlippers\n",
"Disallow: /\n",
"\n",
"User-agent: Butterfly\n",
"Disallow: /\n",
"\n",
"User-agent: CatchBot\n",
"Disallow: /\n",
"\n",
"User-agent: Charlotte\n",
"Disallow: /\n",
"\n",
"User-agent: CheeseBot\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPicker\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPickerElite/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPickerSE/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: CLIPish\n",
"Disallow: /\n",
"\n",
"User-agent: Cliqzbot\n",
"Disallow: /\n",
"\n",
"User-agent: COMODO\n",
"Disallow: /\n",
"\n",
"User-agent: Comodo-Certificates-Spider\n",
"Disallow: /\n",
"\n",
"User-agent: CompSpyBot\n",
"Disallow: /\n",
"\n",
"User-agent: Copernic\n",
"Disallow: /\n",
"\n",
"User-agent: CopyRightCheck\n",
"Disallow: /\n",
"\n",
"User-agent: cosmos\n",
"Disallow: /\n",
"\n",
"User-agent: crawler\n",
"Disallow: /\n",
"\n",
"User-agent: Crescent\n",
"Disallow: /\n",
"\n",
"User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0\n",
"Disallow: /\n",
"\n",
"User-agent: Curious\n",
"Disallow: /\n",
"\n",
"User-agent: curl\n",
"Disallow: /\n",
"\n",
"User-agent: dataprovider\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: DinoPing\n",
"Disallow: /\n",
"\n",
"User-agent: discoverybot\n",
"Disallow: /\n",
"\n",
"User-agent: DittoSpyder\n",
"Disallow: /\n",
"\n",
"User-agent: DomainCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: DomainCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: dotbot\n",
"Disallow: /\n",
"\n",
"User-agent: dotnetdotcom\n",
"Disallow: /\n",
"\n",
"User-agent: Dow\\ Jones\\ Searchbot\n",
"Disallow: /\n",
"\n",
"User-agent: dumbot\n",
"Disallow: /\n",
"\n",
"User-agent: EasouSpider\n",
"Disallow: /\n",
"\n",
"User-agent: EmailCollector\n",
"Disallow: /\n",
"\n",
"User-agent: EmailSiphon\n",
"Disallow: /\n",
"\n",
"User-agent: EmailWolf\n",
"Disallow: /\n",
"\n",
"User-agent: Enterprise_Search\n",
"Disallow: /\n",
"\n",
"User-agent: Enterprise_Search/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: EroCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: es\n",
"Disallow: /\n",
"\n",
"User-agent: Exabot\n",
"Disallow: /\n",
"\n",
"User-agent: ExtractorPro\n",
"Disallow: /\n",
"\n",
"User-agent: EzineArticlesLinkScanner\n",
"Disallow: /\n",
"\n",
"User-agent: Ezooms\n",
"Disallow: /\n",
"\n",
"User-agent: FairAd Client\n",
"Disallow: /\n",
"\n",
"User-agent: Flaming AttackBot\n",
"Disallow: /\n",
"\n",
"User-agent: Foobot\n",
"Disallow: /\n",
"\n",
"User-agent: FreeFind\n",
"Disallow: /\n",
"\n",
"User-agent: FTRF\\:\\ Friendly\n",
"Disallow: /\n",
"\n",
"User-agent: Gaisbot\n",
"Disallow: /\n",
"\n",
"User-agent: GetRight/4.2\n",
"Disallow: /\n",
"\n",
"User-agent: gigabot\n",
"Disallow: /\n",
"\n",
"User-agent: grub\n",
"Disallow: /\n",
"\n",
"User-agent: grub-client\n",
"Disallow: /\n",
"\n",
"User-agent: Harvest/1.5\n",
"Disallow: /\n",
"\n",
"User-agent: Hatena Antenna\n",
"Disallow: /\n",
"\n",
"User-agent: hloader\n",
"Disallow: /\n",
"\n",
"User-agent: http://www.SearchEngineWorld.com bot\n",
"Disallow: /\n",
"\n",
"User-agent: http://www.WebmasterWorld.com bot\n",
"Disallow: /\n",
"\n",
"User-agent: HTTP_Request\n",
"Disallow: /\n",
"\n",
"User-agent: HTTP_Request2\n",
"Disallow: /\n",
"\n",
"User-agent: httplib\n",
"Disallow: /\n",
"\n",
"User-agent: humanlinks\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver/1.6\n",
"Disallow: /\n",
"\n",
"User-agent: Indy\\ Library\n",
"Disallow: /\n",
"\n",
"User-agent: InfoNaviRobot\n",
"Disallow: /\n",
"\n",
"User-agent: ip\\-web\\-crawler\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: Iron33/1.0.2\n",
"Disallow: /\n",
"\n",
"User-agent: Jakarta\\ Commons-HttpClient\n",
"Disallow: /\n",
"\n",
"User-agent: Jeeves\n",
"Disallow: /\n",
"\n",
"User-agent: JennyBot\n",
"Disallow: /\n",
"\n",
"User-agent: Jetbot\n",
"Disallow: /\n",
"\n",
"User-agent: Jetbot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: JikeSpider\n",
"Disallow: /\n",
"\n",
"User-agent: Kenjin Spider\n",
"Disallow: /\n",
"\n",
"User-agent: Keyword Density/0.9\n",
"Disallow: /\n",
"\n",
"User-agent: larbin\n",
"Disallow: /\n",
"\n",
"User-agent: LexiBot\n",
"Disallow: /\n",
"\n",
"User-agent: libWeb/clsHTTP\n",
"Disallow: /\n",
"\n",
"User-agent: libwww-perl\n",
"Disallow: /\n",
"\n",
"User-agent: lindex\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: linkdex\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: linkdexbot\n",
"Disallow: /\n",
"\n",
"User-agent: LinkextractorPro\n",
"Disallow: /\n",
"\n",
"User-agent: LinkScan/8.1a Unix\n",
"Disallow: /\n",
"\n",
"User-agent: LinkWalker\n",
"Disallow: /\n",
"\n",
"User-agent: lipperhey\n",
"Disallow: /\n",
"\n",
"User-agent: LNSpiderguy\n",
"Disallow: /\n",
"\n",
"User-agent: looksmart\n",
"Disallow: /\n",
"\n",
"User-agent: ltbot\n",
"Disallow: /\n",
"\n",
"User-agent: lwp-trivial\n",
"Disallow: /\n",
"\n",
"User-agent: lwp-trivial/1.34\n",
"Disallow: /\n",
"\n",
"User-agent: Lynx\n",
"Disallow: /\n",
"\n",
"User-agent: magpie\\-crawler\n",
"Disallow: /\n",
"\n",
"User-agent: Mata Hari\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control - 5.01.4511\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control - 6.00.8169\n",
"Disallow: /\n",
"\n",
"User-agent: MIIxpc\n",
"Disallow: /\n",
"\n",
"User-agent: MIIxpc/4.2\n",
"Disallow: /\n",
"\n",
"User-agent: Mister PiX\n",
"Disallow: /\n",
"\n",
"User-agent: MJ12bot\n",
"Disallow: /\n",
"\n",
"User-agent: moget\n",
"Disallow: /\n",
"\n",
"User-agent: moget/2.1\n",
"Disallow: /\n",
"\n",
"User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)\n",
"Disallow: /\n",
"\n",
"User-agent: MSIE\\ or\\ Firefox\\ mutant\n",
"Disallow: /\n",
"\n",
"User-agent: MSIECrawler\n",
"Disallow: /\n",
"\n",
"User-agent: naver\n",
"Disallow: /\n",
"\n",
"User-agent: NCBot\n",
"Disallow: /\n",
"\n",
"User-agent: NetAnts\n",
"Disallow: /\n",
"\n",
"User-agent: NetcraftSurveyAgent\n",
"Disallow: /\n",
"\n",
"User-agent: netEstate\\ NE\\ Crawler\n",
"Disallow: /\n",
"\n",
"User-agent: NetMechanic\n",
"Disallow: /\n",
"\n",
"User-agent: Netseer\n",
"Disallow: /\n",
"\n",
"User-agent: NextGenSearchBot\n",
"Disallow: /\n",
"\n",
"User-agent: NICErsPRO\n",
"Disallow: /\n",
"\n",
"User-agent: Nutch\n",
"Disallow: /\n",
"\n",
"User-agent: Nutch\n",
"Disallow: /\n",
"\n",
"User-agent: Ocelli\n",
"Disallow: /\n",
"\n",
"User-agent: Offline Explorer\n",
"Disallow: /\n",
"\n",
"User-agent: OmniExplorer_Bot\n",
"Disallow: /\n",
"\n",
"User-agent: Openbot\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind data gathere\n",
"Disallow: /\n",
"\n",
"User-agent: OpenWebIndex\n",
"Disallow: /\n",
"\n",
"User-agent: Oracle Ultra Search\n",
"Disallow: /\n",
"\n",
"User-agent: PagesInventory\n",
"Disallow: /\n",
"\n",
"User-agent: PEAR\n",
"Disallow: /\n",
"\n",
"User-agent: PeoplePal\n",
"Disallow: /\n",
"\n",
"User-agent: PerMan\n",
"Disallow: /\n",
"\n",
"User-agent: ProCogSEOBot\n",
"Disallow: /\n",
"\n",
"User-agent: ProPowerBot/2.14\n",
"Disallow: /\n",
"\n",
"User-agent: ProWebWalker\n",
"Disallow: /\n",
"\n",
"User-agent: proximic\n",
"Disallow: /\n",
"\n",
"User-agent: psbot\n",
"Disallow: /\n",
"\n",
"User-agent: purebot\n",
"Disallow: /\n",
"\n",
"User-agent: QueryN Metasearch\n",
"Disallow: /\n",
"\n",
"User-agent: QuerySeekerSpider\n",
"Disallow: /\n",
"\n",
"User-agent: Radiation Retriever 1.1\n",
"Disallow: /\n",
"\n",
"User-agent: RepoMonkey\n",
"Disallow: /\n",
"\n",
"User-agent: RepoMonkey Bait & Tackle/v1.01\n",
"Disallow: /\n",
"\n",
"User-agent: Riddler\n",
"Disallow: /\n",
"\n",
"User-agent: RMA\n",
"Disallow: /\n",
"\n",
"User-agent: rojerbot\n",
"Disallow: /\n",
"\n",
"User-agent: RyteBot\n",
"Disallow: /\n",
"\n",
"User-agent: scooter\n",
"Disallow: /\n",
"\n",
"User-agent: ScoutJet\n",
"Disallow: /\n",
"\n",
"User-agent: Scrapy\n",
"Disallow: /\n",
"\n",
"User-agent: ScreenerBot\n",
"Disallow: /\n",
"\n",
"User-agent: searchmetrics\n",
"Disallow: /\n",
"\n",
"User-agent: searchpreview\n",
"Disallow: /\n",
"\n",
"User-agent: SemrushBot\n",
"Disallow: /\n",
"\n",
"User-agent: sentibot\n",
"Disallow: /\n",
"\n",
"User-agent: SEO-CRAWLING\n",
"Disallow: /\n",
"\n",
"User-agent: SEOENGWorldBot\n",
"Disallow: /\n",
"\n",
"User-agent: SEOkicks-Robot\n",
"Disallow: /\n",
"\n",
"User-agent: ShopWiki\n",
"Disallow: /\n",
"\n",
"User-agent: sistrix\n",
"Disallow: /\n",
"\n",
"User-agent: sitebot\n",
"Disallow: /\n",
"\n",
"User-agent: SiteSnagger\n",
"Disallow: /\n",
"\n",
"User-agent: Snoopy\n",
"Disallow: /\n",
"\n",
"User-agent: SocialSearcher\n",
"Disallow: /\n",
"\n",
"User-agent: Sogou\n",
"Disallow: /\n",
"\n",
"User-agent: SolomonoBot\n",
"Disallow: /\n",
"\n",
"User-agent: sootle\n",
"Disallow: /\n",
"\n",
"User-agent: Sosospider\n",
"Disallow: /\n",
"\n",
"User-agent: SpankBot\n",
"Disallow: /\n",
"\n",
"User-agent: spanner\n",
"Disallow: /\n",
"\n",
"User-agent: spbot\n",
"Disallow: /\n",
"\n",
"User-agent: Speedy\n",
"Disallow: /\n",
"\n",
"User-agent: Stanford\n",
"Disallow: /\n",
"\n",
"User-agent: Stanford Comp Sci\n",
"Disallow: /\n",
"\n",
"User-agent: SurveyBot\n",
"Disallow: /\n",
"\n",
"User-agent: suzuran\n",
"Disallow: /\n",
"\n",
"User-agent: Szukacz/1.4\n",
"Disallow: /\n",
"\n",
"User-agent: Szukacz/1.4\n",
"Disallow: /\n",
"\n",
"User-agent: Teleport\n",
"Disallow: /\n",
"\n",
"User-agent: TeleportPro\n",
"Disallow: /\n",
"\n",
"User-agent: Telesoft\n",
"Disallow: /\n",
"\n",
"User-agent: Teoma\n",
"Disallow: /\n",
"\n",
"User-agent: The Intraformant\n",
"Disallow: /\n",
"\n",
"User-agent: The\\ Incutio\\ XML-RPC\\ PHP\\ Library\n",
"Disallow: /\n",
"\n",
"User-agent: TheNomad\n",
"Disallow: /\n",
"\n",
"User-agent: toCrawl/UrlDispatcher\n",
"Disallow: /\n",
"\n",
"User-agent: True_Robot\n",
"Disallow: /\n",
"\n",
"User-agent: True_Robot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: turingos\n",
"Disallow: /\n",
"\n",
"User-agent: TurnitinBot\n",
"Disallow: /\n",
"\n",
"User-agent: uCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: URL Control\n",
"Disallow: /\n",
"\n",
"User-agent: URL_Spider_Pro\n",
"Disallow: /\n",
"\n",
"User-agent: URLy Warning\n",
"Disallow: /\n",
"\n",
"User-agent: VCI\n",
"Disallow: /\n",
"\n",
"User-agent: VCI WebViewer VCI WebViewer Win32\n",
"Disallow: /\n",
"\n",
"User-agent: visaduhoc\\.info\n",
"Disallow: /\n",
"\n",
"User-agent: WBSearchBot\n",
"Disallow: /\n",
"\n",
"User-agent: Web Image Collector\n",
"Disallow: /\n",
"\n",
"User-agent: WebAuto\n",
"Disallow: /\n",
"\n",
"User-agent: WebBandit\n",
"Disallow: /\n",
"\n",
"User-agent: WebBandit/3.50\n",
"Disallow: /\n",
"\n",
"User-agent: WebCapture\n",
"Disallow: /\n",
"\n",
"User-agent: WebCopier\n",
"Disallow: /\n",
"\n",
"User-agent: WebEnhancer\n",
"Disallow: /\n",
"\n",
"User-agent: WebInDetail\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: WebmasterWorld Extractor\n",
"Disallow: /\n",
"\n",
"User-agent: WebmasterWorldForumBot\n",
"Disallow: /\n",
"\n",
"User-agent: WebSauger\n",
"Disallow: /\n",
"\n",
"User-agent: Website Quester\n",
"Disallow: /\n",
"\n",
"User-agent: WEBSITEtheWEB\\.COM\n",
"Disallow: /\n",
"\n",
"User-agent: Webster Pro\n",
"Disallow: /\n",
"\n",
"User-agent: WebStripper\n",
"Disallow: /\n",
"\n",
"User-agent: WebVac\n",
"Disallow: /\n",
"\n",
"User-agent: WebZip\n",
"Disallow: /\n",
"\n",
"User-agent: WebZip/4.0\n",
"Disallow: /\n",
"\n",
"User-agent: Wget\n",
"Disallow: /\n",
"\n",
"User-agent: Wget/1.5.3\n",
"Disallow: /\n",
"\n",
"User-agent: Wget/1.6\n",
"Disallow: /\n",
"\n",
"User-agent: Wotbot\n",
"Disallow: /\n",
"\n",
"User-agent: www\\.integromedb\\.org\n",
"Disallow: /\n",
"\n",
"User-agent: WWW-Collector-E\n",
"Disallow: /\n",
"\n",
"User-agent: Xenu's\n",
"Disallow: /\n",
"\n",
"User-agent: Xenu's Link Sleuth 1.1c\n",
"Disallow: /\n",
"\n",
"User-agent: xpymep\\.exe\n",
"Disallow: /\n",
"\n",
"User-agent: YamanaLab-Robot\n",
"Disallow: /\n",
"\n",
"User-agent: YisouSpider\n",
"Disallow: /\n",
"\n",
"User-agent: YodaoBot\n",
"Disallow: /\n",
"\n",
"User-agent: YoudaoBot\n",
"Disallow: /\n",
"\n",
"User-agent: Zend_Http_Client\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus 32297 Webster Pro V2.9 Win32\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus Link Scout\n",
"Disallow: /\n",
"\n",
"User-agent: ZmEu\n",
"Disallow: /\n",
"\n",
"User-agent: ZumBot\n",
"Disallow: /\n",
"\n",
"User-agent: Linguee\n",
"Disallow: /\n",
"\n",
"User-agent: sogou\n",
"Disallow: /\n"
]
}
],
"source": [
"import requests\n",
"\n",
"# Download the site's robots.txt and print it as text\n",
"url = 'https://gazeta.pl/robots.txt'\n",
"response = requests.get(url)\n",
"print(response.content.decode('utf-8'))"
]
},
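{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rules such as the ones printed above can also be interpreted programmatically. The cell below is a minimal sketch, assuming the standard-library `urllib.robotparser` module is an acceptable tool for this; the sample rules, the crawler name `MyCrawler` and the `example.com` addresses are made up for illustration and are not taken from the file above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.robotparser import RobotFileParser\n",
"\n",
"# Hypothetical sample rules (an assumption, not copied from a real site)\n",
"sample_lines = [\n",
"    'User-agent: Wget',\n",
"    'Disallow: /',\n",
"    '',\n",
"    'User-agent: *',\n",
"    'Disallow: /private/',\n",
"]\n",
"\n",
"parser = RobotFileParser()\n",
"parser.parse(sample_lines)  # parse in-memory lines, no network access\n",
"\n",
"# Hypothetical (user agent, page) pairs to check against the rules\n",
"checks = [\n",
"    ('Wget', 'https://example.com/index.html'),\n",
"    ('MyCrawler', 'https://example.com/index.html'),\n",
"    ('MyCrawler', 'https://example.com/private/report.html'),\n",
"]\n",
"\n",
"for agent, page in checks:\n",
"    # can_fetch returns True when the rules allow this agent to fetch the page\n",
"    print(agent, page, parser.can_fetch(agent, page))"
]
},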
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Project 2\n",
"\n",
"Build a search engine for robots.txt files.\n",
"\n",
"* download robots.txt for (almost) all Polish websites\n",
"* support searching and sorting by every available field (blocked search engine, address, comment, file length, etc.)\n",
"* devise measures that automatically pick out “interesting” robots.txt files (length, presence of full links, dissimilarity from other robots.txt files); support sorting/filtering by such a measure"
]
},
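{
"cell_type": "markdown",
"metadata": {},
"source": [
"The measures mentioned in Project 2 could be prototyped roughly as follows. This is only a sketch under stated assumptions: the helper names (`features`, `jaccard_distance`, `oddness`), the toy corpus and the `*.example` hosts are hypothetical, and Jaccard distance over sets of lines is just one possible way to express dissimilarity from other robots.txt files."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def features(text):\n",
"    # Basic per-file fields: length, number of full links, set of distinct lines\n",
"    lines = {line.strip() for line in text.splitlines() if line.strip()}\n",
"    return {\n",
"        'length': len(text),\n",
"        'full_links': len(re.findall(r'https?://', text)),\n",
"        'lines': lines,\n",
"    }\n",
"\n",
"def jaccard_distance(a, b):\n",
"    # 0.0 for identical line sets, 1.0 for disjoint ones\n",
"    union = a | b\n",
"    return 1.0 - len(a & b) / len(union) if union else 0.0\n",
"\n",
"# Hypothetical toy corpus: host name -> robots.txt content\n",
"corpus = {\n",
"    'a.example': '''User-agent: *\n",
"Disallow: /admin/\n",
"''',\n",
"    'b.example': '''User-agent: *\n",
"Disallow: /admin/\n",
"''',\n",
"    'c.example': '''# see https://c.example/crawling-policy\n",
"User-agent: Wget\n",
"Disallow: /\n",
"Sitemap: https://c.example/sitemap.xml\n",
"''',\n",
"}\n",
"\n",
"feats = {host: features(text) for host, text in corpus.items()}\n",
"\n",
"def oddness(host):\n",
"    # Mean distance from this file to every other file in the corpus\n",
"    others = [h for h in feats if h != host]\n",
"    total = sum(jaccard_distance(feats[host]['lines'], feats[h]['lines']) for h in others)\n",
"    return total / len(others)\n",
"\n",
"# Most unusual files first\n",
"for host in sorted(feats, key=oddness, reverse=True):\n",
"    print(host, feats[host]['length'], feats[host]['full_links'], round(oddness(host), 2))"
]
},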
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"author": "Filip Graliński",
"email": "filipg@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"subtitle": "2.Wyszukiwarki — wprowadzenie[wykład]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}