aitech-eks-pub/wyk/01_Wyszukiwarki-wprowadzenie.ipynb
2021-09-27 07:42:48 +02:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Information Extraction </h1>\n",
"<h2> 1. <i>Search engines: an introduction</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Search engines: an introduction\n",
"\n",
"## Information retrieval systems\n",
"\n",
"![Information retrieval system](system-wyszukiwania-informacji.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"## Search engines\n",
"\n",
"![Search engines](wyszukiwarka-internetowa.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I want to build my own web search engine...\n",
"\n",
"1. Where do the URLs come from?\n",
"2. How do I download the files at those URLs?\n",
"3. How do I extract the text from them?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ... or maybe not download anything at all?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The CommonCrawl corpus\n",
"\n",
"https://commoncrawl.org/the-data/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!-- スマホ用 --\n",
"<!-- \n",
"<!--table width='750' border='0' align='center' cellpadding='0' cellspacing='0'\n",
"<!--a href='index.phtml?CHANNEL=R51&FID=389924'\n",
"<!-- mail: \n",
"<!-- beige_lavender-3c --\n",
"<!--\n",
"<!-- Template Design By BeigeHeart_Chako_http://beigeheart.blog9.fc2.com/ --\n",
"<!-- 関連記事_http://beigeheart.blog9.fc2.com/blog-entry-99.html --\n",
"<!-- 利用規約_http://beigeheart.blog9.fc2.com/blog-entry-103.html --\n",
"<!-- テンプレの再配布、営利目的の利用禁止 --\n",
"<!-- 画像の無断転載・再配布禁止 --\n",
"<!-- アダルト・法律違反サイト、使用不可 --\n",
"<!-- アクセス解析タグはここから --\n",
"<!-- アクセス解析タグはここまで --\n",
"<!--▼▼▼メインカラムカラム+右サイドカラム部分--\n",
"<!--▼ヘッダー--\n",
"<!--▼管理ページリンク--\n",
"<!--▲管理ページリンク--\n",
"<!--▼タイトル--\n"
]
}
],
"source": [
"# Directly from the service\n",
"\n",
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a '<!--[^\\[\\]<>]+' | uniq | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plain-text \"extracts\" are also available; see http://data.statmt.org/ngrams/raw/, e.g. 59 GB of plain Polish text from 2012."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df6fa1abb58549287111ba8d776733e9 0.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Przegląd okulistyczny \n",
"Focal points \n",
"Przegląd reumatologiczny \n",
"Biblioteka on-line \n",
"STRONA GŁÓWNA \n",
"WYDAWNICTWO \n",
"O wydawnictwie \n",
"Kontakt \n",
"Regulamin zamówień \n",
"Spotkania autorskie \n",
"Nasi autorzy \n",
"CZYTELNIA ONLINE \n",
"w dziale: anatomia \n",
"w dziale: okulistyka \n",
"w dziale: ratownictwo \n",
"CENNIK \n",
"LINKI \n",
"USŁUGI \n",
"df6fa1abb58549287111ba8d776733e9 2.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Licencjaty \n",
"Multimedia \n",
"Pulmonologia \n",
"Okulistyka \n",
"Ratownictwo \n",
"Reumatologia \n",
"Zestawy specjalne \n",
"Onkologia \n",
"Focal Points 4/2006\n",
"\n"
]
}
],
"source": [
"! (wget -O - -q http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/pl/raw/pl.2012.raw.xz \\\n",
" | xzcat | head -n 30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wikipedia dumps\n",
"\n",
"Do not download Wikipedia page by page!\n",
"\n",
"* you waste your own time\n",
"* and you waste the time of Wikipedia's servers\n",
"\n",
"It is better to download a dump from https://dumps.wikimedia.org/backup-index.html"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1977]]\n",
"[[język skryptowy|skryptowy]]\n",
"[[programowanie proceduralne|proceduralny]]\n",
"[[Programowanie sterowane zdarzeniami|sterowany zdarzeniami]]\n",
"[[Alfred V. Aho|Alfred Aho]]\n",
"[[Peter J. Weinberger|Peter Weinberger]]\n",
"[[Brian Kernighan]]\n",
"[[wieloplatformowość|wieloplatformowy]]\n",
"[[język programowania]]\n",
"[[plik]]\n",
"[[system operacyjny|systemów operacyjnych]]\n",
"[[Unix|UNIX]]\n",
"[[tablica asocjacyjna|tablice asocjacyjne]]\n",
"[[Tekstowy typ danych|stringi]]\n",
"[[wyrażenie regularne|wyrażenia regularne]]\n",
"[[Alfred V. Aho|Alfreda V. Aho]]\n",
"[[Peter Weinberger|Petera Weinbergera]]\n",
"[[Brian Kernighan|Briana Kernighana]]\n",
"[[POSIX]]\n",
"[[System V|SVR4]]\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o '\\[\\[[^\\]]+\\]\\]' | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where do the URLs come from?\n",
"\n",
"### See the dumps above"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://ssl'\n",
"https://static.fc2.com/css_cn/common/headbar/120710style.css\n",
"https://blog.fc2.com/\n",
"https://spdeliver.i-mobile.co.jp/script/adsnativepc.js?20101001\n",
"https://media.fc2.com/counter_img.php?id=3493\n",
"https://plus.google.com/+apothekenumschau\n",
"https://script.ioam.de/iam.js\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/AGP-Kontaktformular--73317.html\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/Apotheker-HP--AGP-73319.html\n",
"https://login.apotheken-umschau.de/login?service=https://www.apotheken-umschau.de/j_spring_cas_security_check\n",
"https://forum.apotheken-umschau.de/portal/registration/register\n",
"https://www.facebook.com/Apotheken.Umschau\n",
"https://api.wortundbildverlag.com/drug-suggest/terms\n",
"https://07743rats-apotheke.apotheken-umschau.de/unternehmenskommunikation/Kontakt-zu-den-Redaktionen-53834.html\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/js/common.min.js?eBtyhdw\n",
"https://static.skyrock.net/img/favicon_v5b.ico\n",
"https://wir.skyrock.net/wir/v1/resize/?c=isi&amp;im=%2F9775%2F59549775%2Fpics%2Fphoto_59549775_89.jpg&amp;w=16\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/css/common.css?eahf2jw\n"
]
}
],
"source": [
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a 'https://[^ \"><]+' | uniq | head -n 20)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna\n",
"https://web.archive.org/web/20100116001012/http://homepages.cwi.nl/~dik/english/codes/stand.html#ascii\n",
"https://web.archive.org/web/20160613145224/http://www.aivosto.com/vbtips/charsets-7bit.html#body}}&lt;/ref&gt;\n",
"https://web.archive.org/web/20160522024759/http://worldpowersystems.com/J/codes/#ASCII-1967\n",
"https://books.google.com/?id=NQSpNAEACAAJ&amp;pg=PA28\n",
"https://web.archive.org/web/20160616084132/https://www.w3.org/blog/2008/05/utf8-web-growth/\n",
"https://web.archive.org/web/20160616084637/https://googleblog.blogspot.de/2008/05/moving-to-unicode-51.html\n",
"https://web.archive.org/web/20160616085323/https://googleblog.blogspot.de/2010/01/unicode-nearing-50-of-web.html\n",
"https://web.archive.org/web/20160827000956/http://dlx.bookzz.org/genesis/772000/c80a62495acf1e1a5b966de23c1f989a/_as/%5BInterface_Age_Staff%5D_Best_of_Interface_Age%2C_Volum%28BookZZ.org%29.pdf\n",
"https://books.google.com/books?id=bXLDwmIJNkUC&amp;pg=PA13\n",
"https://web.archive.org/web/20161031223347/http://ethw.org/First-Hand%3AChad_is_Our_Most_Important_Product%3A_An_Engineer%27s_Memory_of_Teletype_Corporation\n",
"https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf\n",
"https://web.archive.org/web/20160526181319/http://longstreet.typepad.com/thesciencebookstore/2012/03/heres-the-link.html\n",
"https://web.archive.org/web/20120213005708/http://www.transbay.net/~enf/ascii/ascii.pdf\n",
"https://archive.org/details/dictionaryworldp00iann\n",
"https://archive.org/details/dictionaryworldp00iann/page/n80\n",
"https://www.theguardian.com/commentisfree/belief/2013/jan/28/lucretius-all-things-atoms\n",
"https://archive.org/details/distillingknowle00mora_557\n",
"https://archive.org/details/distillingknowle00mora_557/page/n156\n",
"https://archive.org/details/fromelementstoat00sieg\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o 'https://[^ \"><]+' | head -n 20)"
]
},
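{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough Python counterpart of the `grep -P -o 'https://[^ \"><]+'` step above, given as a minimal sketch rather than the lecture's own method. The function name `extract_urls` and the extra `html.unescape` call (which turns `&amp;` from the XML dump back into `&`) are assumptions added for illustration.\n",
"\n",
"```python\n",
"import html\n",
"import re\n",
"\n",
"# Same pattern as in the grep call: 'https://' followed by anything\n",
"# that is not a space, a double quote or an angle bracket.\n",
"URL_PATTERN = re.compile(r'https://[^ \"><]+')\n",
"\n",
"def extract_urls(lines):\n",
"    # Yield every https URL found in an iterable of text lines.\n",
"    for line in lines:\n",
"        for match in URL_PATTERN.findall(line):\n",
"            # Assumption: the input is XML/HTML, so entities such as &amp; are decoded.\n",
"            yield html.unescape(match)\n",
"```\n",
"\n",
"For a locally saved dump, the lines could come from `bz2.open(path, 'rt', encoding='utf-8')`, where `path` is whatever name the file was saved under."
]
},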
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DMOZ/ODP directory (unfortunately no longer active).\n",
"Last available link: https://web.archive.org/web/20160306230718/http://rdf.dmoz.org/rdf/content.rdf.u8.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying another search engine \"parasitically\""
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
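"# see https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def query_google(query):\n",
"    # Let requests URL-encode the query (spaces, quotes, non-ASCII characters).\n",
"    response = requests.get(\"https://google.com/search\", params={\"q\": query})\n",
"    soup = BeautifulSoup(response.content, \"html.parser\")\n",
"\n",
"    results = []\n",
"    # href=True skips anchors without an href attribute (g['href'] would raise KeyError).\n",
"    for g in soup.find_all('a', href=True):\n",
"        link = g['href']\n",
"        # Result links are redirects of the form /url?q=<target>&sa=...\n",
"        if '/url?q=' in link:\n",
"            # Drop the 7-character '/url?q=' prefix; tracking parameters stay attached.\n",
"            results.append((link[7:], g.parent.get_text()))\n",
"    return results"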
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQmhMwC3oECA0QDg&usg=AOvVaw0GUY96bFEsdrfOb9_ME9qP',\n",
" 'Wikipedia'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjASegQIDBAB&usg=AOvVaw3LMsdCuK3PBSunL8shYp-S',\n",
" 'Wielka Stopa (zwierzę) Wikipedia, wolna encyklopediapl.wikipedia.org wiki Wielka_Stopa_(zwierzę)'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Opis&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQAg&usg=AOvVaw02WHiDgMZ18jJGW-y7agVg',\n",
" 'Opis'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Historia&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQAw&usg=AOvVaw10BrulHDJ4WgEOFkd-3-H6',\n",
" 'Historia'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Najwa%25C5%25BCniejsze_argumenty_%25E2%2580%259Eza%25E2%2580%259D_i_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQBA&usg=AOvVaw1nSHJDVeWEJTqpRJOMBcus',\n",
" 'Najważniejsze argumenty ...'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Argumenty_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQ0gIwEnoECAwQBQ&usg=AOvVaw3UqFIOr7y6yxvK-i1su1au',\n",
" 'Argumenty „przeciw”'),\n",
" ('https://pl.wikipedia.org/wiki/Wielka_Stopa_(w%25C3%25B3dz_Siuks%25C3%25B3w)&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjATegQICxAB&usg=AOvVaw1lZSYrEp4ez0Kh4o4SXrY1',\n",
" 'Wielka Stopa (wódz Siuksów) Wikipedia, wolna encyklopediapl.wikipedia.org wiki Wielka_Stopa_(wódz_Siuksów)'),\n",
" ('https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQtwIwFHoECAQQAQ&usg=AOvVaw2EugGtxH-FfMbNmqhS5py3',\n",
" 'Wielka Stopa w Suszu - YouTubewww.youtube.com watch'),\n",
" ('https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQuAIwFHoECAQQAg&usg=AOvVaw17g24VY46PboJW54XyZGa1',\n",
" '23 cze 2017 · Od niedawna oczy naukowców poszukujących Wielkiej Stopy skierowane są na niewielkie ...Czas trwania: 6:24\\nOpublikowano: 23 cze 2017'),\n",
" ('https://www.ceneo.pl/%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAZegQIBhAB&usg=AOvVaw0HUE-TpszLKJjAMsV6lvPU',\n",
" 'Wielka Stopa - znaleziono na Ceneo.plwww.ceneo.pl ...'),\n",
" ('https://www.antyradio.pl/News/Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hustala-sie-na-drzewie-ZDJECIE-43102&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAaegQICBAB&usg=AOvVaw1iIlPUpJwldL0MacDY4ebw',\n",
" 'Wielka Stopa - kolejny przypadek spotkania z potworem - Antyradiowww.antyradio.pl News Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hu...'),\n",
" ('https://allegro.pl/kategoria/gry%3Fstring%3DWielka%2520stopa%2520%253A)%2520-&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAbegQIABAB&usg=AOvVaw0mgn1YuyE65LFfA54P-gQo',\n",
" 'Wielka stopa :) - Gry - Allegro.plallegro.pl Kultura i rozrywka Gry'),\n",
" ('https://allegro.pl/listing%3Fstring%3DWielka%2520stopa%2520%253A%2529%2520-&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAcegQIAxAB&usg=AOvVaw3dzMG9f8K5w31r30AyxNEz',\n",
" 'Wielka stopa :) - Niska cena na Allegro.plallegro.pl listing'),\n",
" ('https://www.empik.com/gra-strategiczna-yeti-wielka-stopa-jawa,p1103341700,zabawki-p&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAdegQIBxAB&usg=AOvVaw3xZ_RVxgMxK7vOUPAYO-pe',\n",
" 'Gra strategiczna Yeti Wielka stopa - | Sklep EMPIK.COMwww.empik.com Zabawki Gry Strategiczne i ekonomiczne'),\n",
" ('https://tvn24.pl/tvnmeteo/informacje-pogoda/ciekawostki,49/wielka-stopa-nie-istnieje-naukowcy-to-nie-koniec-nadziei,127328,1,0.html&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAeegQICRAB&usg=AOvVaw3XECuxJKyNK_x4MTREa9Ui',\n",
" 'Wielka Stopa nie istnieje? Naukowcy: to nie koniec nadziei - TVN24tvn24.pl Informacje pogodowe Ciekawostki'),\n",
" ('https://www.monolith.pl/filmy/2020/mala-wielka-stopa-2-w-rodzinie-sila/&sa=U&ved=2ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQFjAfegQIChAB&usg=AOvVaw3uFesbmGBr0dDWxK1ej5n_',\n",
" 'Mała Wielka Stopa 2 - Filmy - Monolith Filmswww.monolith.pl filmy mala-wielka-stopa-2-w-rodzinie-sila'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQty4IigE&usg=AOvVaw0fYQ97CWfJ8aCmNBcv3a_d',\n",
" 'Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522wielka%252Bstopa%252522%26hl%3Dpl&sa=U&ved=0ahUKEwj-ktTzsLfvAhW8ZxUIHQHnB5EQxs8CCIsB&usg=AOvVaw1V17_OrU9CNrErDjbwNZRj',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"wielka stopa\"')"
]
},
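{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tuples returned by `query_google` still carry Google's tracking parameters (`&sa=U&ved=...`), and the target address is percent-encoded one extra time (`%25C4%2599` instead of `%C4%99`). Below is a minimal sketch of a cleanup step. It assumes, based only on the output recorded above, that the tracking part always starts at `&sa=`; the helper name `clean_result_url` is hypothetical.\n",
"\n",
"```python\n",
"from urllib.parse import unquote\n",
"\n",
"def clean_result_url(link):\n",
"    # Keep only the part before the tracking parameters (assumed to start at '&sa=').\n",
"    target = link.split('&sa=', 1)[0]\n",
"    # Undo one level of percent-encoding: '%3F' -> '?', '%25C4' -> '%C4'.\n",
"    return unquote(target)\n",
"```\n",
"\n",
"It could be applied as `[(clean_result_url(url), text) for url, text in query_google('\"wielka stopa\"')]`."
]
},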
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Google hacking\n",
"\n",
"... that is, creative use of the Google search engine (not necessarily for malicious purposes)\n",
"\n",
"#### How to find bilingual material?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/english%2Bversion&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAAegQIABAB&usg=AOvVaw3RrHCxcaLe8qoaZfLEPV6Y',\n",
" 'english version - Tłumaczenie na polski - angielskich przykładów ...context.reverso.net tłumaczenie angielski-polski english+version'),\n",
" ('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/An%2BEnglish%2Bversion&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjABegQIBhAB&usg=AOvVaw017LUPkNtKNdnPE8dToBSB',\n",
" 'An English version - Tłumaczenie na polski - angielskich przykładów ...context.reverso.net tłumaczenie angielski-polski An+English+version'),\n",
" ('https://pl.bab.la/slownik/angielski-polski/english-version&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjACegQICRAB&usg=AOvVaw0BG6Y5Y4PWUDFAMQbF5OiB',\n",
" 'ENGLISH VERSION - Tłumaczenie na polski - bab.lapl.bab.la slownik angielski-polski english-version'),\n",
" ('https://www.linguee.com/english-polish/translation/in%2Benglish%2Bversion.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjADegQIBxAB&usg=AOvVaw03YqBv17ZeVx2FwKA2Y2gu',\n",
" 'in English version - Polish translation Lingueewww.linguee.com english-polish translation in+english+version'),\n",
" ('https://www.linguee.com/english-polish/translation/an%2Benglish%2Bversion.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAEegQICBAB&usg=AOvVaw261dClyWD55TlTUkm5JNiI',\n",
" 'an English version - Polish translation Lingueewww.linguee.com english-polish translation an+english+version'),\n",
" ('https://www.youtube.com/watch%3Fv%3DdC8Jy0-VImU&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QtwIwBXoECAoQAQ&usg=AOvVaw1fvEyAWPyHIeWCqTmx5efS',\n",
" 'MELODIA - Sanah | PO ANGIELSKU | ENGLISH VERSION - YouTubewww.youtube.com watch'),\n",
" ('https://www.youtube.com/watch%3Fv%3DdC8Jy0-VImU&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QuAIwBXoECAoQAg&usg=AOvVaw2n8-O6Aooitc2POfMr2eSI',\n",
" '2 lip 2020 · Z uwagi na to, że wersja angielska \"Szampana\" bardzo Wam się spodobała, postanowiłam ...Czas trwania: 3:16\\nOpublikowano: 2 lip 2020'),\n",
" ('https://www.linguee.pl/angielski-polski/t%25C5%2582umaczenie/english%2Bversion%2Bprevail.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAGegQIAhAB&usg=AOvVaw2gR32hWrps8JeETEZFcnC3',\n",
" 'English version prevail - Tłumaczenie na polski słownik Lingueewww.linguee.pl angielski-polski tłumaczenie english+version+prevail'),\n",
" ('https://www.linguee.pl/angielski-polski/t%25C5%2582umaczenie/english%2Bversion%2Bcoming%2Bsoon.html&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAHegQIARAB&usg=AOvVaw1Gy_8y1P8j2LkQmOcFNUho',\n",
" 'English version coming soon - Tłumaczenie na polski słownik ...www.linguee.pl angielski-polski english+version+coming+soon'),\n",
" ('https://www.umcs.pl/pl/instrukcja-w-jezyku-angielskim-english-version-,15428.htm&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAIegQIBRAB&usg=AOvVaw2qxqPHA01a_XGp2OI2LwHh',\n",
" 'Instrukcja w języku angielskim (english version) - Nowi pracownicy ...www.umcs.pl ... Dla pracownika Nowi pracownicy (instrukcja)'),\n",
" ('https://www.wsb.net.pl/en/&sa=U&ved=2ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0QFjAJegQIAxAB&usg=AOvVaw33uMYMxHmM5oTynwt9481F',\n",
" 'English version : - Wyższa Szkoła Bezpieczeństwawww.wsb.net.pl ...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0Qty4INA&usg=AOvVaw3FvXRX8gjDnoExpLAPHyWl',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dsi%2525C4%252599%252B%252522English%252Bversion%252522%26hl%3Dpl&sa=U&ved=0ahUKEwjJ24W_s6XvAhVcXRUIHTfDA_0Qxs8CCDU&usg=AOvVaw3nXIS27h-FWwpKhQDIdB9y',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('się \"English version\"')"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://www.ksk.gda.pl/%3Fs%3D%257Bsearch_term_string%257D%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Dde%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Den%253Flang%253Dfr%253Flang%253Dfr%253Flang%253Dde%253Flang%253Den%253Flang%253Dde%253Flang%253Dde%253Flang%253Dde%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAAegQIAxAB&usg=AOvVaw1rz99qpelK6AKXNq32Y3DB',\n",
" '{search_term_string}?lang=en?lang=fr?lang=fr?lang=de?lang=en ...www.ksk.gda.pl s={search_term_string}?lang=en?lang=fr?lang=fr?lang=...'),\n",
" ('https://emonitoring.poczta-polska.pl/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjABegQIBBAB&usg=AOvVaw3BgMdqycY5NWdhCmVHe6Eo',\n",
" 'Śledzenie przesyłek - Poczta Polskaemonitoring.poczta-polska.pl lang=en'),\n",
" ('http://44mpa.pl/urban-adaptation-plans/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjACegQICxAB&usg=AOvVaw0yHXmZ8Tv3dujCVJIRKjR7',\n",
" 'Urban Adaptation Plans | Wczujmy się w klimat!44mpa.pl urban-adaptation-plans lang=en'),\n",
" ('http://www.apiscosmetics.pl/start-en/products/professional-products/home-terapis-en.html%3Fproduct%3D288%26lang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjADegQICBAB&usg=AOvVaw1QwK_aHzWym29dEM4w0MSw',\n",
" '<!doctype html> <html lang=\"en\"> <head> <meta http-equiv ... - Apiswww.apiscosmetics.pl products professional-products home-terapis-en'),\n",
" ('https://ekursy.akademiakierowcy.pl/message/output/airnotifier/lang/en/&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAEegQIBxAB&usg=AOvVaw2fR_Xur4oOOIxEb1KiJBRL',\n",
" 'Index of /message/output/airnotifier/lang/en - Akademia Kierowcyekursy.akademiakierowcy.pl message output airnotifier lang'),\n",
" ('https://ekursy.akademiakierowcy.pl/message/output/popup/lang/en/&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAFegQICRAB&usg=AOvVaw38ifWqViF-gaqRnBYCs7ph',\n",
" 'Index of /message/output/popup/lang/en - Akademia Kierowcyekursy.akademiakierowcy.pl message output popup lang'),\n",
" ('https://www.zabierzow.org.pl/community/welcome/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAKegQIABAB&usg=AOvVaw1u_tc6Q_mK_qSy_JeUs21l',\n",
" 'Welcome - Oficjalny serwis internetowy Gminy Zabierzówwww.zabierzow.org.pl Strona główna Community'),\n",
" ('https://www.ipiss.com.pl/%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjALegQIBRAB&usg=AOvVaw1v4Ep4-1xZU2aj34RQNyA6',\n",
" 'Institute of Labour and Social Studieswww.ipiss.com.pl lang=en'),\n",
" ('https://support.google.com/webmasters/answer/7489871%3Fhl%3Dpl&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQvxowC3oECAUQAg&usg=AOvVaw3QrhPCjSv1m5Remte9HOQz',\n",
" 'Dowiedz się dlaczego'),\n",
" ('http://www.klub-spadkobiercow.com.pl/%3Fs%3D%25E2%259A%25BD%25E2%259A%25A1%25E2%2598%2598%25EF%25B8%258F%25E2%258F%25B2%2Bkupi%25C4%2599%2Bbmw%2Bseria%2B5%2Boferty%2BSamocholand.pl%2B%25F0%259F%2590%259D%25E2%259C%258B%2B-%2BKupno%2Bsamochod%25C3%25B3w%2B%25F0%259F%258C%258D%25F0%259F%2593%2598%2Bbmw%2Bseria%2B5%2Bkupno%252C%2BKup%2Bbmw%2Bseria%2B5%2Btanio%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjAMegQIAhAB&usg=AOvVaw3OrIJeKwmccNn-Z0ci9WZ5',\n",
" 'kupię bmw seria 5 oferty Samocholand.pl - Kupno samochodów ...www.klub-spadkobiercow.com.pl s=⚽⚡☘⏲+kupię+bmw+seria+5+oferty...'),\n",
" ('http://www.klub-spadkobiercow.com.pl/%3Fs%3D%25F0%259F%2594%2590%25F0%259F%2598%25B2%25F0%259F%258C%259F%25F0%259F%2592%259C%2BSprzedam%2Bsamochody%2Bhummer%2Bh3%2Bog%25C5%2582oszenia%2BSamocholand.pl%2B%25E2%258F%25B2%25F0%259F%2598%258B%2B-%2BSprzeda%25C5%25BC%2Bsamochod%25C3%25B3w%2B%25F0%259F%2592%259E%25F0%259F%2594%2590%2Bsamochody%2Bhummer%2Bh3%2Bog%25C5%2582oszenia%252C%2BSpprzedaj%2Bsamochody%2Bhummer%2Bh3%2Bpilnie%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%253Flang%253Den%3Flang%3Den&sa=U&ved=2ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQFjANegQIARAB&usg=AOvVaw2gGpRa2QRI0s5hif4sSG15',\n",
" 'Sprzedam samochody hummer h3 ogłoszenia Samocholand.pl ...www.klub-spadkobiercow.com.pl s=🔐😲🌟💜+Sprzedam+samochody+hu...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQty4ISg&usg=AOvVaw3qJv9X5Au4qLqskqZgygmA',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dinurl:lang%25253Den%252Bsite:pl%26hl%3Dpl&sa=U&ved=0ahUKEwiPwpzSs6XvAhVpSxUIHSLzBxQQxs8CCEs&usg=AOvVaw1bNj0srkIoKMTez1biljAK',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('inurl:lang=en site:pl')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://context.reverso.net/t%25C5%2582umaczenie/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAAegQIAxAB&usg=AOvVaw1VOWJd4mMu1wbrjT0N2fwg',\n",
" 'decided - Tłumaczenie na polski - angielskich przykładów | Reverso ...context.reverso.net tłumaczenie angielski-polski decided'),\n",
" ('https://context.reverso.net/t%25C5%2582umaczenie/polski-angielski/zdecydowali&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjABegQIAhAB&usg=AOvVaw392MbfKZ25nbvv_wpUfF4s',\n",
" 'zdecydowali - Tłumaczenie na angielski - polskich przykładów ...context.reverso.net tłumaczenie polski-angielski zdecydowali'),\n",
" ('https://pl.duolingo.com/dictionary/English/decided/f241156f8cd032ca9b65a8bd760439d8&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjACegQICxAB&usg=AOvVaw3ofU6NSr4cVJ7Wp75lDPWm',\n",
" 'Co oznacza „decided” po angielsku? - Duolingopl.duolingo.com dictionary English decided'),\n",
" ('https://www.diki.pl/slownik-angielskiego%3Fq%3Ddecide&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjADegQICRAB&usg=AOvVaw3D_KS9QB14t8N79rhLEzXx',\n",
" 'decide - Tłumaczenie po polsku - Słownik angielsko-polski Dikiwww.diki.pl slownik-angielskiego q=decide'),\n",
" ('http://www.slownictwo.pl/dict1.php%3Ftxt%3Ddecided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAEegQIChAB&usg=AOvVaw2ho4z_VbbIZQfbaQTkaQir',\n",
" 'Internetowy słownik polsko-angielski i angielsko-polski z lektoremwww.slownictwo.pl dict1 txt=decided'),\n",
" ('https://pl.bab.la/slownik/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAFegQICBAB&usg=AOvVaw1UVHsgO7GZH-vm4_x5MGDW',\n",
" 'DECIDED - Tłumaczenie na polski - bab.lapl.bab.la slownik angielski-polski decided'),\n",
" ('https://fiszkoteka.pl/slownik/pl/en/zdecydowa%25C5%2582&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAKegQIARAB&usg=AOvVaw1zaRQ2cAhJHPJFYPa5JCT8',\n",
" '→ zdecydował po angielsku, słownik polsko - angielski | Fiszkotekafiszkoteka.pl słownik polsko - angielski Z'),\n",
" ('https://fiszkoteka.pl/slownik/en/pl/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjALegQIBhAB&usg=AOvVaw3JyZ1e2LvRkwv_mjklzaiO',\n",
" '→ decided po polsku, słownik angielsko - polski | Fiszkotekafiszkoteka.pl słownik angielsko - polski D'),\n",
" ('https://ellalanguage.com/pl/slownik_angielski_decide/&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjAMegQIBBAB&usg=AOvVaw2hbOA7JWSyFSTH04bVg5rS',\n",
" 'Odmiana czasownika DECIDE | Angielskie czasowniki | ELLAellalanguage.com slownik_angielski_decide'),\n",
" ('https://tr-ex.me/t%25C5%2582umaczenie/angielski-polski/decided&sa=U&ved=2ahUKEwi0-436s6XvAhUzo3EKHU0MAG8QFjANegQIBRAB&usg=AOvVaw0Fl5dYqoiEFcgUzWH0mN2S',\n",
" 'DECIDED ▷ Tłumaczenie Na Polski - Przykłady Użycia Decided W ...tr-ex.me tłumaczenie angielski-polski decided'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi0-436s6XvAhUzo3EKHU0MAG8Qty4IQw&usg=AOvVaw1uu2p_1jLxzOHd7KfkS2NU',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dzdecydowali%252Bdecided%26hl%3Dpl&sa=U&ved=0ahUKEwi0-436s6XvAhUzo3EKHU0MAG8Qxs8CCEQ&usg=AOvVaw1sNjBEDjM9eZu9ozeQEJqs',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('zdecydowali decided')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://ispan.waw.pl/journals/index.php/sfps/article/view/sfps.2014.020&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAAegQIABAB&usg=AOvVaw3PKZOp-ZKdH0s_POMTQrv-',\n",
" 'Słowa kluczowe podawane przez autora publikacji jako podstawa ...ispan.waw.pl journals index.php sfps article view sfps.2014.020'),\n",
" ('http://www.wbios.us.edu.pl/tl_files/aktualnosci/revitare-2013/konferencja-streszczenie-wzor.pdf&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjABegQIAxAB&usg=AOvVaw1XgVp3uZUGn0Ig0sADojZO',\n",
" '[PDF] WZÓR STRESZCZENIAwww.wbios.us.edu.pl revitare-2013 konferencja-streszczenie-wzor'),\n",
" ('https://docs.microsoft.com/pl-pl/dotnet/csharp/language-reference/keywords/&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjACegQICxAB&usg=AOvVaw1Ppo-QeKIjwxw8D8zLOIDN',\n",
" 'Słowa kluczowe języka C#C# Keywords - Microsoft Docsdocs.microsoft.com ... Przewodnik dla języka C# Dokumentacja języka'),\n",
" ('https://docs.microsoft.com/pl-pl/cpp/cpp/keywords-cpp&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjADegQICBAB&usg=AOvVaw09GBEO-bl_GHGuApWZv46H',\n",
" 'Słowa kluczowe (C++) | Microsoft Docsdocs.microsoft.com ... Konwencje leksykalne'),\n",
" ('https://www.researchgate.net/publication/271724450_Keywords_tags_and_what_else_Slowa_kluczowe_tagi_i_co_dalej&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAEegQICRAB&usg=AOvVaw2lYe8oCMu-372n8o6jjnvA',\n",
" '(PDF) Keywords, tags... and what else? [Słowa kluczowe, tagi…, i co ...www.researchgate.net publication 271724450_Keywords_tags_and_wh...'),\n",
" ('https://clarin-pl.eu/dspace/bitstream/handle/11321/589/S%25C5%2582owa%2520kluczowe%2520-%2520wytyczne%2520%2528publikacja%2529.pdf%3Fsequence%3D1%26isAllowed%3Dy&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAFegQIChAB&usg=AOvVaw1zbgvbNQDTRmK3GXVFB6Gx',\n",
" '[PDF] słowa kluczowe - CLARIN-PLclarin-pl.eu dspace bitstream handle'),\n",
" ('https://pl.qaz.wiki/wiki/List_of_Java_keywords&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAKegQIBhAB&usg=AOvVaw32i5c9auW8kJ6j0fZPo2ml',\n",
" 'Lista słów kluczowych Java - List of Java keywords - qaz.wikipl.qaz.wiki wiki List_of_Java_keywords'),\n",
" ('http://www.standardy.pl/index.php/artykuly/drukuj/1316&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjALegQIAhAB&usg=AOvVaw0MgKxzmQaV_C8gvS9n_BU4',\n",
" '[PDF] x Keywords: x Autorzy: List otwarty do PTN Streszczenie: x Abstractwww.standardy.pl index.php artykuly drukuj'),\n",
" ('http://cejsh.icm.edu.pl/cejsh/element/bwmeta1.element.ojs-doi-10_11649_sfps_2014_020&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjAMegQIBRAB&usg=AOvVaw1ckSaZzuEVpMhFLEWNo7tU',\n",
" 'Słowa kluczowe podawane przez autora ... - CEJSH - ICM UWcejsh.icm.edu.pl bwmeta1.element.ojs-doi-10_11649_sfps_2014_020'),\n",
" ('http://www.bobolanum.edu.pl/wydawnictwo-artykul&sa=U&ved=2ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQFjANegQIARAB&usg=AOvVaw1FzLP8mLAHuszJjWFoCtOZ',\n",
" 'Artykuł - wymogi edytorskie / The Article - Editorial Requirements ...www.bobolanum.edu.pl wydawnictwo-artykul'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQty4ITQ&usg=AOvVaw275ECJoqdlgg6bzr8BjvBK',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522s%2525C5%252582owa%252Bkluczowe%252522%252Bkeywords%252Babstract%26hl%3Dpl&sa=U&ved=0ahUKEwju5sKVtKXvAhXyrnEKHS9jDrsQxs8CCE4&usg=AOvVaw22rLBFpQgI8blcDhcAZu1P',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"słowa kluczowe\" keywords abstract')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### How to find vulnerable or odd pages?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://smolarz.szczecin.lasy.gov.pl/test-grafika&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAAegQIAhAB&usg=AOvVaw00PjOy7FFcAzFOiEWBj5q-',\n",
" 'test grafika - Nadleśnictwo Smolarz - Lasy Państwowesmolarz.szczecin.lasy.gov.pl test-grafika'),\n",
" ('http://www.malopolska.mw.gov.pl/aktualnosci/samorzad/blabla&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjABegQICRAB&usg=AOvVaw2FPiYfJO-h1e4cEis8U7Pu',\n",
" 'Małopolska na Dożynkach Prezydenckich w Spale » Małopolskawww.malopolska.mw.gov.pl aktualnosci samorzad blabla'),\n",
" ('http://sejm.gov.pl/Sejm9.nsf/wypowiedz.xsp%3Fposiedzenie%3D20%26dzien%3D2%26wyp%3D113%26symbol%3DRWYSTAPIENIA_WYP%26id%3D073&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjACegQICBAB&usg=AOvVaw06C6-TRfwEa0vnqBZqICgI',\n",
" 'Wypowiedzi na posiedzeniach Sejmusejm.gov.pl Sejm9.nsf wypowiedz'),\n",
" ('https://www.gov.pl/web/psse-walbrzych/test3&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjADegQIBxAB&usg=AOvVaw0C4Wts3msCWyEcHpuou4Gv',\n",
" 'test - Powiatowa Stacja Sanitarno-Epidemiologiczna w Wałbrzychu ...www.gov.pl web psse-walbrzych test3'),\n",
" ('https://www.biznes.gov.pl/glos-przedsiebiorcy/idea/porzadny-slownik&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAEegQIBRAB&usg=AOvVaw3sSgvJNIu57v7xRbsUaGPJ',\n",
" 'Pomysły na biznes.gov.plwww.biznes.gov.pl glos-przedsiebiorcy idea porzadny-slownik'),\n",
" ('http://demo.licytacje.uzp.gov.pl/contest/view/sid/L-76-2011&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAFegQIBhAB&usg=AOvVaw3qQ5q60_RMk3yVEHZSsLgd',\n",
" 'Urząd Zamówień Publicznychdemo.licytacje.uzp.gov.pl contest view sid'),\n",
" ('https://www.biznes.gov.pl/glos-przedsiebiorcy%3Fpage%3D24&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAGegQIABAB&usg=AOvVaw0-0BmNu2idsAGELz1ytQrr',\n",
" 'Pomysły na biznes.gov.plwww.biznes.gov.pl glos-przedsiebiorcy'),\n",
" ('https://www.gddkia.gov.pl/frontend/web/userfiles/articles/o/ogloszenie-z-dnia-27112017_27828/za%25C5%2582.2.%2520do%2520regulaminu%2520-%2520%25C5%259Bwiadectwa%2520legalno%25C5%259Bci%2520ze%2520zdj%25C4%2599ciami.pdf&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAHegQIBBAB&usg=AOvVaw1QSAjj5hsgZD9v5dO65nt3',\n",
" '[PDF] ŚWIADECTWO LEGALNOŚCI POZYSKANIA DREWNA [pdf] - GDDKiAwww.gddkia.gov.pl articles ogloszenie-z-dnia-27112017_27828'),\n",
" ('https://www.arimr.gov.pl/wersja-testowa/zalaczniki-do-wniosku-w-2015-r/rejestr.html&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAIegQIARAB&usg=AOvVaw1kN54m-oEhXvu9HM_jf5r2',\n",
" 'rejestr | Agencja Restrukturyzacji i Modernizacji Rolnictwawww.arimr.gov.pl wersja-testowa zalaczniki-do-wniosku-w-2015-r re...'),\n",
" ('http://www.zielona-gora.sr.gov.pl/download.php%3Finst%3D1%26id%3D1889&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAJegQIAxAB&usg=AOvVaw0C5yVLkbZgo3j_SPFeS3kD',\n",
" '[PDF] Untitledwww.zielona-gora.sr.gov.pl download'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQty4IMA&usg=AOvVaw2jLSNJ1Fojm0RC3f1Rei7X',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dblabla%252Bsite:gov.pl%26hl%3Dpl&sa=U&ved=0ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQxs8CCDE&usg=AOvVaw0MEcRxsUFD_99cunMcln-U',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('blabla site:gov.pl')"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('http://www.gios.gov.pl/images/dokumenty/pms/monitoring_pol_elektormagnetycznych/raport/Zalacznik_1-_mapa_Szczecin.pdf&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAAegQIARAB&usg=AOvVaw3iiQhOAEZVJob4cs973EUY',\n",
" '[PDF] mapa Szczecinwww.gios.gov.pl pms raport Zalacznik_1-_mapa_Szczecin'),\n",
" ('http://www.gios.gov.pl/images/dokumenty/pms/monitoring_pol_elektormagnetycznych/raport/Zalacznik_1-_mapa_Gdansk.pdf&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjABegQICRAB&usg=AOvVaw0QQC4LH21f3xjE0rM8PI6L',\n",
" '[PDF] C:\\\\Documents and Settings\\\\ja\\\\Pulpit\\\\Gdańsk\\\\Mapy.dwg A3 mapa ...www.gios.gov.pl pms raport Zalacznik_1-_mapa_Gdansk'),\n",
" ('https://www.gddkia.gov.pl/pl/d/f7041e734f9b37cd88cae0a9000102a1&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjACegQIBxAB&usg=AOvVaw0gO4uj__F-7icHZYIQeTPL',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/fec8268b624add970e544fefefcd043f&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjADegQICBAB&usg=AOvVaw02AVxRGVLmdXAyqtSBZZRo',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/392dd80745a5a025df1d225bbf0b8e02&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAEegQIAxAB&usg=AOvVaw2Vjr_Ez89bHJrDaPepIRsF',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/dfc6e11545fb637fef5a00f53ce94414&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAFegQIBBAB&usg=AOvVaw3A8r1jWPXDCm7XwoWfkjzf',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/996b6076155b215e7ee8d5897fc6153b&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAGegQIAhAB&usg=AOvVaw1TIDEU5BMlHlGMYkNkbWM4',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/3010c117961da9877405841ef5c65a07&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAHegQIBRAB&usg=AOvVaw2IS01b7eg6XhHaQHZ3jK13',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://www.gddkia.gov.pl/pl/d/bed97709d7349e000a041a60388ab1ee&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAIegQIBhAB&usg=AOvVaw1X_Tq2PTGDaRTXm_xi5PQz',\n",
" '[PDF] mhtml:file://C:\\\\Documents and Settings\\\\Malik_M\\\\Moje ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('http://www.gddkia.gov.pl/pl/d/0c5befb91a5b0b0c8bbc3b5a293ad0fc&sa=U&ved=2ahUKEwjLkrnptKXvAhXYSRUIHatABOMQFjAJegQIABAB&usg=AOvVaw3sEannIxW2G91xP2bUK6Me',\n",
" 'mhtml:file://C:\\\\Documents and Settings\\\\user\\\\Pulpit ... - GDDKiAwww.gddkia.gov.pl ...'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwjLkrnptKXvAhXYSRUIHatABOMQty4ILg&usg=AOvVaw0yirg8KksKVYdZKGNbhKol',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dintitle:settings%252Bfiletype:pdf%252Bsite:gov.pl%26hl%3Dpl&sa=U&ved=0ahUKEwjLkrnptKXvAhXYSRUIHatABOMQxs8CCC8&usg=AOvVaw0b9IEfcDUv6isVIMCWaieO',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('intitle:settings filetype:pdf site:gov.pl')"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://www.gov.pl/attachment/3ddad90a-8136-4d9c-a56f-1ed206bf2b24&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAAegQIABAB&usg=AOvVaw3vizwfsDj6dYNSA8t3-tWi',\n",
" '[XLS] NAZWISKA_MEN A B 1 100 najpopularniejszych nazwisk męskich ...www.gov.pl attachment'),\n",
" ('https://doc.rmf.pl/rmf_fm/store/Kopia_nazwiska_2010.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjABegQIBxAB&usg=AOvVaw3rhrn9Nyg5ac0TyxUqDi1t',\n",
" '[XLS] nazwiska A B C D E F G H I 1 Najcześciej występujące nazwiska ...doc.rmf.pl rmf_fm store Kopia_nazwiska_2010'),\n",
" ('http://dydaktyka.polsl.pl/roz6/izdonek/Shared%2520Documents/MS%2520Excel/7_Dzia%25C5%2582ania%2520na%2520danych%2520typu%2520tekst_podr.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjACegQICBAB&usg=AOvVaw38RDxGxB5aMALBoLG9XEVR',\n",
" '[XLS] Wielkość liter A B C D 1 Przykład 7.1 2 Podany fragment bazy ...dydaktyka.polsl.pl roz6 izdonek'),\n",
" ('http://zprp.pl/wp-content/uploads/2015/02/Lista_transferowa_2017_18_v1.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjADegQICRAB&usg=AOvVaw2v7UWBKRjO57O-fM-Ox6-K',\n",
" '[XLS] Lista 2017 A B C D E F 1 Lp Nazwisko Imię Klub macierzysty Status ...zprp.pl uploads 2015/02 Lista_transferowa_2017_18_v1'),\n",
" ('https://umostrow.pl/files/file_add/download/1163_kopia-2020-stmig-cooper-1-sprawozdanie-cz1.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAEegQIBhAB&usg=AOvVaw350iyfjLCkSKGxxX-ezFdj',\n",
" '[XLS] STMiG 2020 - formularz testu Coopera - Ostrów Wielkopolskiumostrow.pl 1163_kopia-2020-stmig-cooper-1-sprawozdanie-cz1'),\n",
" ('https://www.mbank.pl/pobierz/mbankrejestumow.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAFegQIAhAB&usg=AOvVaw3fZtm8ph8HLJwAJIxTeoL5',\n",
" '[XLS] Sheet_1 A B C 1 Przedsiębiorca Siedziba Przedsiębiorcy NIP 2 ...www.mbank.pl pobierz mbankrejestumow'),\n",
" ('http://um.bip.legnica.eu/download/107/26919/drugiepolrocze2017.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAGegQIAxAB&usg=AOvVaw3JqmWVIkWufqX7a5NxLyeH',\n",
" '[XLS] Export Worksheet A B C D E 1 DATA_ZAWARCIA ...um.bip.legnica.eu download drugiepolrocze2017'),\n",
" ('http://szswielkopolska.pl/13-kk-io-44.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAHegQIARAB&usg=AOvVaw1SEckfGtKXrghNhgKx7UzB',\n",
" '[XLS] SP 7 Ostrów - SZS Wielkopolskaszswielkopolska.pl 13-kk-io-44'),\n",
" ('http://www.wsm.edu.pl/fotos/dziekanat/karty_roczne_AIU_2009_2013.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAIegQIBBAB&usg=AOvVaw1i9Mt01azVHjxeFwZJMbLs',\n",
" '[XLS] sem 1 A B C D E F G H I J K L M N O P Q R S T U V W X 1 Wyższa ...www.wsm.edu.pl fotos dziekanat karty_roczne_AIU_2009_2013'),\n",
" ('http://www.arimr.gov.pl/fileadmin/pliki/zdjecia_strony/132/OR07_los121_w.xls&sa=U&ved=2ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4QFjAJegQIBRAB&usg=AOvVaw3QFCWasKloqTlTbK9HVfi0',\n",
" '[XLS] Kolejno** wylosowanych wniosków w ramach dzia*ania ...www.arimr.gov.pl pliki zdjecia_strony OR07_los121_w'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4Qty4IMA&usg=AOvVaw3VOwJyWy4exubKqjpl7aPI',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253Dpesel%252Bfiletype:xls%252Bkaczmarek%26hl%3Dpl&sa=U&ved=0ahUKEwir_KT7tKXvAhXkQhUIHRFZBn4Qxs8CCDE&usg=AOvVaw0f2Vo1eTV7WPUx-FUMYU8C',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('pesel filetype:xls kaczmarek')"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('https://akademia.nask.pl/foto/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAAegQIABAB&usg=AOvVaw1q9KOfc65WIi8jlO1z3TzI',\n",
" 'Index of /foto - Akademia NASKakademia.nask.pl foto'),\n",
" ('http://ftp.man.poznan.pl/pub/apache/chemistry/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjABegQICRAB&usg=AOvVaw3jheEqWF7Iq_HaItKHR2H4',\n",
" 'Index of /pub/apache/chemistry - Nameftp.man.poznan.pl pub apache chemistry'),\n",
" ('http://ftp.man.poznan.pl/pub/apache/kafka/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjACegQICBAB&usg=AOvVaw3oWl350iGMv7yN_zzmKlrj',\n",
" 'Index of /pub/apache/kafka - Descriptionftp.man.poznan.pl pub apache kafka'),\n",
" ('http://www.ncac.torun.pl/~seyfert/%3FC%3DS%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjADegQIBxAB&usg=AOvVaw3IOMp-EkmpvsqzXfkzHLh_',\n",
" 'Index of /~seyfertwww.ncac.torun.pl ~seyfert'),\n",
" ('http://www.mpu.pl/download/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAEegQIARAB&usg=AOvVaw2t4Py-QOSOgqH0JejD9OdE',\n",
" 'Index of /downloadwww.mpu.pl download'),\n",
" ('http://www.psm-bielsk-podlaski.edu.pl/pl/images/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAFegQIBhAB&usg=AOvVaw1qPfo7aV0sGkb42ysGXzGS',\n",
" 'Index of /pl/images - PSM Bielsk Podlaskiwww.psm-bielsk-podlaski.edu.pl images'),\n",
" ('http://www.matrix.umcs.lublin.pl/~akrajka/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAGegQIAxAB&usg=AOvVaw3op5HIl9tMV6GQhC1IkuB1',\n",
" 'Index of /~akrajka - matrix.umcs.lublin.plwww.matrix.umcs.lublin.pl ~akrajka'),\n",
" ('http://www.combio.pl/mirex2.download/pen/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAHegQIBRAB&usg=AOvVaw2Hd6NmIvw6kn8ENWsSdJQk',\n",
" 'Index of /mirex2.download/pen - combio.plwww.combio.pl mirex2.download pen'),\n",
" ('http://www.iich.gliwice.pl/download/&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAIegQIAhAB&usg=AOvVaw05o8hkDQv8hHPSqAjNp-wT',\n",
" 'Index of /downloadwww.iich.gliwice.pl download'),\n",
" ('http://www.cs.put.poznan.pl/mkadzinski/%3FC%3DM%3BO%3DA&sa=U&ved=2ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQFjAJegQIBBAB&usg=AOvVaw1fkEik765hTNPbBbenF_Rq',\n",
" 'Index of /mkadzinskiwww.cs.put.poznan.pl mkadzinski'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQty4ILg&usg=AOvVaw3x8sw8cv98HNTbBSAnJ58x',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522index%252Bof%252522%252B%252522last%252Bmodified%252522%252B%252522parent%252Bdirectory%252522%252Bapache%26hl%3Dpl&sa=U&ved=0ahUKEwi8i6WStaXvAhW3QhUIHf4oCzYQxs8CCC8&usg=AOvVaw0TVKuX1CIb5g3C-Y2_D4iC',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"index of\" \"last modified\" \"parent directory\" apache')"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('http://filipg-jenkins.wmi.amu.edu.pl/ISI2019/lecture-2019-02.pdf&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAAegQIABAB&usg=AOvVaw2HIittTKuAOR1ATLm972d6',\n",
" '[PDF] Inteligentne systemy informacyjne - Filip Graliński / UAMfilipg-jenkins.wmi.amu.edu.pl ISI2019 lecture-2019-02'),\n",
" ('https://md5.gromweb.com/%3Fmd5%3D3fcedf144be9f3dff1145db6c515fb34&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjABegQICRAB&usg=AOvVaw0JTZQuMmrZH56enRrfBVG1',\n",
" 'MD5 reverse for 3fcedf144be9f3dff1145db6c515fb34md5.gromweb.com ...'),\n",
" ('https://pastebin.pl/view/d872a388&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjACegQIBxAB&usg=AOvVaw3z3-Auzt_qQrkU08fj67q2',\n",
" 'Re: ruchanie - Pastebinpastebin.pl view'),\n",
" ('http://people.cs.georgetown.edu/~clay/classes/fall2015/ia/MD5.pass.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjADegQICBAB&usg=AOvVaw2tW7zmVhmNYCeEKr-1vA7V',\n",
" 'cbae07efa0c6ed330a283e80a9c02e8d ...people.cs.georgetown.edu ~clay classes fall2015 MD5.pass.txt'),\n",
" ('http://wklejto.pl/59019&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAEegQIBhAB&usg=AOvVaw1DNIJXZyC5I05BQsnSKMDh',\n",
" 'Kod: 59019 WKLEJTO.PL Darmowa wklejka, na zawsze!wklejto.pl ...'),\n",
" ('http://docs2.chomikuj.pl/2854898545,PL,0,0,cs-szambo.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAFegQIAhAB&usg=AOvVaw2VKr7YjOicUXzK4zqHIWKQ',\n",
" 'cs szambo.txt - Chomikuj.pldocs2.chomikuj.pl 2854898545,PL,0,0,cs-szambo'),\n",
" ('https://hashkiller.io/download_list/Found/139863.txt&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAGegQIBBAB&usg=AOvVaw0cPadq0BLUdJR2EN_w1cNs',\n",
" 'f24eba008b3b789e4ee5d3dc8a33af27:Gumimaci1 ...hashkiller.io download_list Found'),\n",
" ('https://195.201.31.93/rx6NiRIx/&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAHegQIBRAB&usg=AOvVaw31DG-mSBQmSTDaBgxi8_XX',\n",
" 'Latest MD5 leaked AA3 - BitBin195.201.31.93 ...'),\n",
" ('https://pastebin.com/dEsgsTqV&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAIegQIARAB&usg=AOvVaw1vCC1iy8lVGuq0E6rELfeM',\n",
" 'INSERT INTO `auth` (`id`, `name`, `premium ... - Pastebin.compastebin.com dEsgsTqV'),\n",
" ('https://paste2.org/DeGOC334&sa=U&ved=2ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQFjAJegQIAxAB&usg=AOvVaw2zwShLX08T5j4hSmbBM3Je',\n",
" 'Viewing Paste DeGOC334 - Paste2.orgpaste2.org DeGOC334'),\n",
" ('https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQty4ILw&usg=AOvVaw3TNS8kxuTo_YOIBJwKVXG_',\n",
" 'Stare Miasto, Poznań\\xa0-\\xa0Z Twojego adresu internetowego\\xa0-\\xa0Dowiedz się więcej'),\n",
" ('https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D6d932c406fa15164ee48ff5a52f81dae%26hl%3Dpl&sa=U&ved=0ahUKEwiGipWetaXvAhUdSxUIHTzFDuoQxs8CCDA&usg=AOvVaw0DmFxG-Qro2rfZ0Ot1z-4V',\n",
" 'Zaloguj się')]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('6d932c406fa15164ee48ff5a52f81dae')"
]
},
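{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every URL in the results above ends with Google redirect parameters (`&sa=U&ved=...&usg=...`), and the target address itself is percent-encoded. A minimal sketch of how such a result could be cleaned up before storing it, assuming the suffix always starts at the first `&sa=` and that one round of decoding is enough; `clean_result_url` is a hypothetical helper, not part of `query_google`:\n",
"\n",
"```python\n",
"from urllib.parse import unquote\n",
"\n",
"\n",
"def clean_result_url(raw):\n",
"    # Assumption: Google's own parameters start at the first '&sa='.\n",
"    target = raw.split('&sa=', 1)[0]\n",
"    # Undo one level of percent-encoding (e.g. %3F -> ?, %3D -> =).\n",
"    return unquote(target)\n",
"\n",
"\n",
"raw = ('https://www.biznes.gov.pl/glos-przedsiebiorcy%3Fpage%3D24'\n",
"       '&sa=U&ved=2ahUKEwi_gcTatKXvAhXvXRUIHVHdBugQFjAGegQIABAB'\n",
"       '&usg=AOvVaw0-0BmNu2idsAGELz1ytQrr')\n",
"print(clean_result_url(raw))  # https://www.biznes.gov.pl/glos-przedsiebiorcy?page=24\n",
"```\n",
"\n",
"Cleaned URLs are also easier to deduplicate across repeated runs of the same query."
]
},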
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Project 1\n",
"\n",
"Build a web application for semi-automatic,\n",
"systematic collection of interesting Google\n",
"hacking results:\n",
"\n",
"* the user enters a query\n",
"  * word lists may be used, e.g. vulgarisms, colloquial expressions, \"fillers\" (\"bla bla\", \"foo bar\"); the system should then generate a series of queries\n",
"* the application queries Google (and possibly other search engines)\n",
"* the application collects the results and presents them to the user\n",
"* the user tags results as interesting / not interesting\n",
"* queries can be run periodically; the user does not have to review already tagged results again\n",
"* the application can list all results tagged as interesting so far"
]
},
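{
"cell_type": "markdown",
"metadata": {},
"source": [
"The step in which a word list is turned into a series of queries could be sketched as follows. This is only an illustration: `expand_queries`, the `{word}` placeholder and the sample filler list are assumptions, and sending the queries to a search engine is left out.\n",
"\n",
"```python\n",
"def expand_queries(template, words):\n",
"    # Substitute each word for the '{word}' placeholder, skipping duplicates\n",
"    # while keeping the original order.\n",
"    seen = set()\n",
"    queries = []\n",
"    for word in words:\n",
"        query = template.replace('{word}', word)\n",
"        if query not in seen:\n",
"            seen.add(query)\n",
"            queries.append(query)\n",
"    return queries\n",
"\n",
"\n",
"fillers = ['blabla', 'foo bar', 'blabla']\n",
"queries = expand_queries('{word} site:gov.pl', fillers)\n",
"print(queries)  # ['blabla site:gov.pl', 'foo bar site:gov.pl']\n",
"```\n",
"\n",
"Each generated query can then be passed to whatever function talks to the search engine, and the results stored together with the query that produced them."
]
},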
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What not to fetch?\n",
"\n",
"The robots.txt standard"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"User-agent: *\n",
"Disallow: /*/wyszukaj/\n",
"Disallow: /*servlet\n",
"Disallow: /reloadwww?\n",
"Disallow: /dfptools/adview/\n",
"Disallow: /pub/ips/*\n",
"Disallow: /ods?\n",
"Disallow: /getFile.servlet*\n",
"Disallow: /aliasy/blad.jsp\n",
"Disallow: /znajdz.do\n",
"Disallow: /portalSearch.do\n",
"Disallow: /im/ab/b4/10/z17515435Q.jpg\n",
"Disallow: /75224259/\n",
"\n",
"User-agent: Googlebot-News\n",
"Disallow: /nowy/\n",
"Disallow: /mapa_strony\n",
"Disallow: /*/wyszukaj/\n",
"Disallow: /*/51,\n",
"Disallow: /*/55,\n",
"Disallow: /*/2,\n",
"Disallow: /*order=\n",
"Disallow: /*obxx=\n",
"Disallow: /*tag=\n",
"Disallow: /reloadwww?\n",
"Disallow: /ods?\n",
"Disallow: /*servlet\n",
"Disallow: /dfptools/adview/\n",
"\n",
"User-agent: Yandex\n",
"Disallow: /\n",
"\n",
"User-Agent: bingbot\n",
"Disallow: /\n",
"\n",
"User-agent: 008\n",
"Disallow: /\n",
"\n",
"User-agent: 010\n",
"Disallow: /\n",
"\n",
"User-agent: 360Spider\n",
"Disallow: /\n",
"\n",
"User-agent: 80legs\n",
"Disallow: /\n",
"\n",
"User-agent: Aboundex\n",
"Disallow: /\n",
"\n",
"User-agent: accelobot\n",
"Disallow: /\n",
"\n",
"User-agent: Add\\ Catalog\n",
"Disallow: /\n",
"\n",
"User-agent: AhrefsBot\n",
"Disallow: /\n",
"\n",
"User-agent: aiHitBot\n",
"Disallow: /\n",
"\n",
"User-agent: Alexibot\n",
"Disallow: /\n",
"\n",
"User-agent: Aqua_Products\n",
"Disallow: /\n",
"\n",
"User-agent: AskJeeves\n",
"Disallow: /\n",
"\n",
"User-agent: asterias\n",
"Disallow: /\n",
"\n",
"User-agent: awcheckBot\n",
"Disallow: /\n",
"\n",
"User-agent: b2w/0.1\n",
"Disallow: /\n",
"\n",
"User-agent: BackDoorBot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: BacklinkCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: Baiduspider\n",
"Disallow: /\n",
"\n",
"User-agent: BecomeBot\n",
"Disallow: /\n",
"\n",
"User-agent: BLEXBot\n",
"Disallow: /\n",
"\n",
"User-agent: BlowFish/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: Bookmark search tool\n",
"Disallow: /\n",
"\n",
"User-agent: BotALot\n",
"Disallow: /\n",
"\n",
"User-agent: brandwatch.net\n",
"Disallow: /\n",
"\n",
"User-agent: BuiltBotTough\n",
"Disallow: /\n",
"\n",
"User-agent: Bullseye/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: BunnySlippers\n",
"Disallow: /\n",
"\n",
"User-agent: Butterfly\n",
"Disallow: /\n",
"\n",
"User-agent: CatchBot\n",
"Disallow: /\n",
"\n",
"User-agent: Charlotte\n",
"Disallow: /\n",
"\n",
"User-agent: CheeseBot\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPicker\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPickerElite/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: CherryPickerSE/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: CLIPish\n",
"Disallow: /\n",
"\n",
"User-agent: Cliqzbot\n",
"Disallow: /\n",
"\n",
"User-agent: COMODO\n",
"Disallow: /\n",
"\n",
"User-agent: Comodo-Certificates-Spider\n",
"Disallow: /\n",
"\n",
"User-agent: CompSpyBot\n",
"Disallow: /\n",
"\n",
"User-agent: Copernic\n",
"Disallow: /\n",
"\n",
"User-agent: CopyRightCheck\n",
"Disallow: /\n",
"\n",
"User-agent: cosmos\n",
"Disallow: /\n",
"\n",
"User-agent: crawler\n",
"Disallow: /\n",
"\n",
"User-agent: Crescent\n",
"Disallow: /\n",
"\n",
"User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0\n",
"Disallow: /\n",
"\n",
"User-agent: Curious\n",
"Disallow: /\n",
"\n",
"User-agent: curl\n",
"Disallow: /\n",
"\n",
"User-agent: dataprovider\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: DinoPing\n",
"Disallow: /\n",
"\n",
"User-agent: discoverybot\n",
"Disallow: /\n",
"\n",
"User-agent: DittoSpyder\n",
"Disallow: /\n",
"\n",
"User-agent: DomainCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: DomainCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: dotbot\n",
"Disallow: /\n",
"\n",
"User-agent: dotnetdotcom\n",
"Disallow: /\n",
"\n",
"User-agent: Dow\\ Jones\\ Searchbot\n",
"Disallow: /\n",
"\n",
"User-agent: dumbot\n",
"Disallow: /\n",
"\n",
"User-agent: EasouSpider\n",
"Disallow: /\n",
"\n",
"User-agent: EmailCollector\n",
"Disallow: /\n",
"\n",
"User-agent: EmailSiphon\n",
"Disallow: /\n",
"\n",
"User-agent: EmailWolf\n",
"Disallow: /\n",
"\n",
"User-agent: Enterprise_Search\n",
"Disallow: /\n",
"\n",
"User-agent: Enterprise_Search/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: EroCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: es\n",
"Disallow: /\n",
"\n",
"User-agent: Exabot\n",
"Disallow: /\n",
"\n",
"User-agent: ExtractorPro\n",
"Disallow: /\n",
"\n",
"User-agent: EzineArticlesLinkScanner\n",
"Disallow: /\n",
"\n",
"User-agent: Ezooms\n",
"Disallow: /\n",
"\n",
"User-agent: FairAd Client\n",
"Disallow: /\n",
"\n",
"User-agent: Flaming AttackBot\n",
"Disallow: /\n",
"\n",
"User-agent: Foobot\n",
"Disallow: /\n",
"\n",
"User-agent: FreeFind\n",
"Disallow: /\n",
"\n",
"User-agent: FTRF\\:\\ Friendly\n",
"Disallow: /\n",
"\n",
"User-agent: Gaisbot\n",
"Disallow: /\n",
"\n",
"User-agent: GetRight/4.2\n",
"Disallow: /\n",
"\n",
"User-agent: gigabot\n",
"Disallow: /\n",
"\n",
"User-agent: grub\n",
"Disallow: /\n",
"\n",
"User-agent: grub-client\n",
"Disallow: /\n",
"\n",
"User-agent: Harvest/1.5\n",
"Disallow: /\n",
"\n",
"User-agent: Hatena Antenna\n",
"Disallow: /\n",
"\n",
"User-agent: hloader\n",
"Disallow: /\n",
"\n",
"User-agent: http://www.SearchEngineWorld.com bot\n",
"Disallow: /\n",
"\n",
"User-agent: http://www.WebmasterWorld.com bot\n",
"Disallow: /\n",
"\n",
"User-agent: HTTP_Request\n",
"Disallow: /\n",
"\n",
"User-agent: HTTP_Request2\n",
"Disallow: /\n",
"\n",
"User-agent: httplib\n",
"Disallow: /\n",
"\n",
"User-agent: humanlinks\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver\n",
"Disallow: /\n",
"\n",
"User-agent: ia_archiver/1.6\n",
"Disallow: /\n",
"\n",
"User-agent: Indy\\ Library\n",
"Disallow: /\n",
"\n",
"User-agent: InfoNaviRobot\n",
"Disallow: /\n",
"\n",
"User-agent: ip\\-web\\-crawler\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: Iron33/1.0.2\n",
"Disallow: /\n",
"\n",
"User-agent: Jakarta\\ Commons-HttpClient\n",
"Disallow: /\n",
"\n",
"User-agent: Jeeves\n",
"Disallow: /\n",
"\n",
"User-agent: JennyBot\n",
"Disallow: /\n",
"\n",
"User-agent: Jetbot\n",
"Disallow: /\n",
"\n",
"User-agent: Jetbot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: JikeSpider\n",
"Disallow: /\n",
"\n",
"User-agent: Kenjin Spider\n",
"Disallow: /\n",
"\n",
"User-agent: Keyword Density/0.9\n",
"Disallow: /\n",
"\n",
"User-agent: larbin\n",
"Disallow: /\n",
"\n",
"User-agent: LexiBot\n",
"Disallow: /\n",
"\n",
"User-agent: libWeb/clsHTTP\n",
"Disallow: /\n",
"\n",
"User-agent: libwww-perl\n",
"Disallow: /\n",
"\n",
"User-agent: lindex\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: linkdex\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: linkdexbot\n",
"Disallow: /\n",
"\n",
"User-agent: LinkextractorPro\n",
"Disallow: /\n",
"\n",
"User-agent: LinkScan/8.1a Unix\n",
"Disallow: /\n",
"\n",
"User-agent: LinkWalker\n",
"Disallow: /\n",
"\n",
"User-agent: lipperhey\n",
"Disallow: /\n",
"\n",
"User-agent: LNSpiderguy\n",
"Disallow: /\n",
"\n",
"User-agent: looksmart\n",
"Disallow: /\n",
"\n",
"User-agent: ltbot\n",
"Disallow: /\n",
"\n",
"User-agent: lwp-trivial\n",
"Disallow: /\n",
"\n",
"User-agent: lwp-trivial/1.34\n",
"Disallow: /\n",
"\n",
"User-agent: Lynx\n",
"Disallow: /\n",
"\n",
"User-agent: magpie\\-crawler\n",
"Disallow: /\n",
"\n",
"User-agent: Mata Hari\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control - 5.01.4511\n",
"Disallow: /\n",
"\n",
"User-agent: Microsoft URL Control - 6.00.8169\n",
"Disallow: /\n",
"\n",
"User-agent: MIIxpc\n",
"Disallow: /\n",
"\n",
"User-agent: MIIxpc/4.2\n",
"Disallow: /\n",
"\n",
"User-agent: Mister PiX\n",
"Disallow: /\n",
"\n",
"User-agent: MJ12bot\n",
"Disallow: /\n",
"\n",
"User-agent: moget\n",
"Disallow: /\n",
"\n",
"User-agent: moget/2.1\n",
"Disallow: /\n",
"\n",
"User-agent: Mozilla/4.0 (compatible; BullsEye; Windows 95)\n",
"Disallow: /\n",
"\n",
"User-agent: MSIE\\ or\\ Firefox\\ mutant\n",
"Disallow: /\n",
"\n",
"User-agent: MSIECrawler\n",
"Disallow: /\n",
"\n",
"User-agent: naver\n",
"Disallow: /\n",
"\n",
"User-agent: NCBot\n",
"Disallow: /\n",
"\n",
"User-agent: NetAnts\n",
"Disallow: /\n",
"\n",
"User-agent: NetcraftSurveyAgent\n",
"Disallow: /\n",
"\n",
"User-agent: netEstate\\ NE\\ Crawler\n",
"Disallow: /\n",
"\n",
"User-agent: NetMechanic\n",
"Disallow: /\n",
"\n",
"User-agent: Netseer\n",
"Disallow: /\n",
"\n",
"User-agent: NextGenSearchBot\n",
"Disallow: /\n",
"\n",
"User-agent: NICErsPRO\n",
"Disallow: /\n",
"\n",
"User-agent: Nutch\n",
"Disallow: /\n",
"\n",
"User-agent: Nutch\n",
"Disallow: /\n",
"\n",
"User-agent: Ocelli\n",
"Disallow: /\n",
"\n",
"User-agent: Offline Explorer\n",
"Disallow: /\n",
"\n",
"User-agent: OmniExplorer_Bot\n",
"Disallow: /\n",
"\n",
"User-agent: Openbot\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind\n",
"Disallow: /\n",
"\n",
"User-agent: Openfind data gathere\n",
"Disallow: /\n",
"\n",
"User-agent: OpenWebIndex\n",
"Disallow: /\n",
"\n",
"User-agent: Oracle Ultra Search\n",
"Disallow: /\n",
"\n",
"User-agent: PagesInventory\n",
"Disallow: /\n",
"\n",
"User-agent: PEAR\n",
"Disallow: /\n",
"\n",
"User-agent: PeoplePal\n",
"Disallow: /\n",
"\n",
"User-agent: PerMan\n",
"Disallow: /\n",
"\n",
"User-agent: ProCogSEOBot\n",
"Disallow: /\n",
"\n",
"User-agent: ProPowerBot/2.14\n",
"Disallow: /\n",
"\n",
"User-agent: ProWebWalker\n",
"Disallow: /\n",
"\n",
"User-agent: proximic\n",
"Disallow: /\n",
"\n",
"User-agent: psbot\n",
"Disallow: /\n",
"\n",
"User-agent: purebot\n",
"Disallow: /\n",
"\n",
"User-agent: QueryN Metasearch\n",
"Disallow: /\n",
"\n",
"User-agent: QuerySeekerSpider\n",
"Disallow: /\n",
"\n",
"User-agent: Radiation Retriever 1.1\n",
"Disallow: /\n",
"\n",
"User-agent: RepoMonkey\n",
"Disallow: /\n",
"\n",
"User-agent: RepoMonkey Bait & Tackle/v1.01\n",
"Disallow: /\n",
"\n",
"User-agent: Riddler\n",
"Disallow: /\n",
"\n",
"User-agent: RMA\n",
"Disallow: /\n",
"\n",
"User-agent: rojerbot\n",
"Disallow: /\n",
"\n",
"User-agent: RyteBot\n",
"Disallow: /\n",
"\n",
"User-agent: scooter\n",
"Disallow: /\n",
"\n",
"User-agent: ScoutJet\n",
"Disallow: /\n",
"\n",
"User-agent: Scrapy\n",
"Disallow: /\n",
"\n",
"User-agent: ScreenerBot\n",
"Disallow: /\n",
"\n",
"User-agent: searchmetrics\n",
"Disallow: /\n",
"\n",
"User-agent: searchpreview\n",
"Disallow: /\n",
"\n",
"User-agent: SemrushBot\n",
"Disallow: /\n",
"\n",
"User-agent: sentibot\n",
"Disallow: /\n",
"\n",
"User-agent: SEO-CRAWLING\n",
"Disallow: /\n",
"\n",
"User-agent: SEOENGWorldBot\n",
"Disallow: /\n",
"\n",
"User-agent: SEOkicks-Robot\n",
"Disallow: /\n",
"\n",
"User-agent: ShopWiki\n",
"Disallow: /\n",
"\n",
"User-agent: sistrix\n",
"Disallow: /\n",
"\n",
"User-agent: sitebot\n",
"Disallow: /\n",
"\n",
"User-agent: SiteSnagger\n",
"Disallow: /\n",
"\n",
"User-agent: Snoopy\n",
"Disallow: /\n",
"\n",
"User-agent: SocialSearcher\n",
"Disallow: /\n",
"\n",
"User-agent: Sogou\n",
"Disallow: /\n",
"\n",
"User-agent: SolomonoBot\n",
"Disallow: /\n",
"\n",
"User-agent: sootle\n",
"Disallow: /\n",
"\n",
"User-agent: Sosospider\n",
"Disallow: /\n",
"\n",
"User-agent: SpankBot\n",
"Disallow: /\n",
"\n",
"User-agent: spanner\n",
"Disallow: /\n",
"\n",
"User-agent: spbot\n",
"Disallow: /\n",
"\n",
"User-agent: Speedy\n",
"Disallow: /\n",
"\n",
"User-agent: Stanford\n",
"Disallow: /\n",
"\n",
"User-agent: Stanford Comp Sci\n",
"Disallow: /\n",
"\n",
"User-agent: SurveyBot\n",
"Disallow: /\n",
"\n",
"User-agent: suzuran\n",
"Disallow: /\n",
"\n",
"User-agent: Szukacz/1.4\n",
"Disallow: /\n",
"\n",
"User-agent: Szukacz/1.4\n",
"Disallow: /\n",
"\n",
"User-agent: Teleport\n",
"Disallow: /\n",
"\n",
"User-agent: TeleportPro\n",
"Disallow: /\n",
"\n",
"User-agent: Telesoft\n",
"Disallow: /\n",
"\n",
"User-agent: Teoma\n",
"Disallow: /\n",
"\n",
"User-agent: The Intraformant\n",
"Disallow: /\n",
"\n",
"User-agent: The\\ Incutio\\ XML-RPC\\ PHP\\ Library\n",
"Disallow: /\n",
"\n",
"User-agent: TheNomad\n",
"Disallow: /\n",
"\n",
"User-agent: toCrawl/UrlDispatcher\n",
"Disallow: /\n",
"\n",
"User-agent: True_Robot\n",
"Disallow: /\n",
"\n",
"User-agent: True_Robot/1.0\n",
"Disallow: /\n",
"\n",
"User-agent: turingos\n",
"Disallow: /\n",
"\n",
"User-agent: TurnitinBot\n",
"Disallow: /\n",
"\n",
"User-agent: uCrawler\n",
"Disallow: /\n",
"\n",
"User-agent: URL Control\n",
"Disallow: /\n",
"\n",
"User-agent: URL_Spider_Pro\n",
"Disallow: /\n",
"\n",
"User-agent: URLy Warning\n",
"Disallow: /\n",
"\n",
"User-agent: VCI\n",
"Disallow: /\n",
"\n",
"User-agent: VCI WebViewer VCI WebViewer Win32\n",
"Disallow: /\n",
"\n",
"User-agent: visaduhoc\\.info\n",
"Disallow: /\n",
"\n",
"User-agent: WBSearchBot\n",
"Disallow: /\n",
"\n",
"User-agent: Web Image Collector\n",
"Disallow: /\n",
"\n",
"User-agent: WebAuto\n",
"Disallow: /\n",
"\n",
"User-agent: WebBandit\n",
"Disallow: /\n",
"\n",
"User-agent: WebBandit/3.50\n",
"Disallow: /\n",
"\n",
"User-agent: WebCapture\n",
"Disallow: /\n",
"\n",
"User-agent: WebCopier\n",
"Disallow: /\n",
"\n",
"User-agent: WebEnhancer\n",
"Disallow: /\n",
"\n",
"User-agent: WebInDetail\\.com\n",
"Disallow: /\n",
"\n",
"User-agent: WebmasterWorld Extractor\n",
"Disallow: /\n",
"\n",
"User-agent: WebmasterWorldForumBot\n",
"Disallow: /\n",
"\n",
"User-agent: WebSauger\n",
"Disallow: /\n",
"\n",
"User-agent: Website Quester\n",
"Disallow: /\n",
"\n",
"User-agent: WEBSITEtheWEB\\.COM\n",
"Disallow: /\n",
"\n",
"User-agent: Webster Pro\n",
"Disallow: /\n",
"\n",
"User-agent: WebStripper\n",
"Disallow: /\n",
"\n",
"User-agent: WebVac\n",
"Disallow: /\n",
"\n",
"User-agent: WebZip\n",
"Disallow: /\n",
"\n",
"User-agent: WebZip/4.0\n",
"Disallow: /\n",
"\n",
"User-agent: Wget\n",
"Disallow: /\n",
"\n",
"User-agent: Wget/1.5.3\n",
"Disallow: /\n",
"\n",
"User-agent: Wget/1.6\n",
"Disallow: /\n",
"\n",
"User-agent: Wotbot\n",
"Disallow: /\n",
"\n",
"User-agent: www\\.integromedb\\.org\n",
"Disallow: /\n",
"\n",
"User-agent: WWW-Collector-E\n",
"Disallow: /\n",
"\n",
"User-agent: Xenu's\n",
"Disallow: /\n",
"\n",
"User-agent: Xenu's Link Sleuth 1.1c\n",
"Disallow: /\n",
"\n",
"User-agent: xpymep\\.exe\n",
"Disallow: /\n",
"\n",
"User-agent: YamanaLab-Robot\n",
"Disallow: /\n",
"\n",
"User-agent: YisouSpider\n",
"Disallow: /\n",
"\n",
"User-agent: YodaoBot\n",
"Disallow: /\n",
"\n",
"User-agent: YoudaoBot\n",
"Disallow: /\n",
"\n",
"User-agent: Zend_Http_Client\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus 32297 Webster Pro V2.9 Win32\n",
"Disallow: /\n",
"\n",
"User-agent: Zeus Link Scout\n",
"Disallow: /\n",
"\n",
"User-agent: ZmEu\n",
"Disallow: /\n",
"\n",
"User-agent: ZumBot\n",
"Disallow: /\n",
"\n",
"User-agent: Linguee\n",
"Disallow: /\n",
"\n",
"User-agent: sogou\n",
"Disallow: /\n"
]
}
],
"source": [
"import urllib\n",
"import requests\n",
"\n",
"url = 'https://gazeta.pl/robots.txt'\n",
"response = requests.get(url)\n",
"print(response.content.decode('utf-8'))\n",
"\n",
" "
]
},
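{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above is a long series of `User-agent` / `Disallow` groups. As a minimal sketch of how a crawler could honour such rules, the block below uses `urllib.robotparser` from the Python standard library. The sample rules, the `MyCrawler` agent name, the `build_parser` helper and the `example.com` URLs are assumptions made for illustration; they are not taken from the file fetched above.\n",
"\n",
"```python\n",
"from urllib.robotparser import RobotFileParser\n",
"\n",
"# A short hand-written sample in the style of the listing above\n",
"# (an assumption, not the real file served by any site).\n",
"SAMPLE_ROBOTS_TXT = '''\n",
"User-agent: Wget\n",
"Disallow: /\n",
"\n",
"User-agent: *\n",
"Disallow: /private/\n",
"'''\n",
"\n",
"\n",
"def build_parser(robots_txt):\n",
"    # Hypothetical helper: parse robots.txt text that was already downloaded.\n",
"    parser = RobotFileParser()\n",
"    parser.parse(robots_txt.splitlines())\n",
"    return parser\n",
"\n",
"\n",
"parser = build_parser(SAMPLE_ROBOTS_TXT)\n",
"\n",
"# Wget is blocked from the whole site.\n",
"print(parser.can_fetch('Wget', 'https://example.com/index.html'))           # False\n",
"# Any other agent falls under the '*' group.\n",
"print(parser.can_fetch('MyCrawler', 'https://example.com/index.html'))      # True\n",
"print(parser.can_fetch('MyCrawler', 'https://example.com/private/a.html'))  # False\n",
"```\n",
"\n",
"Parsing text that is already in memory, rather than calling `RobotFileParser.read()`, keeps downloading and rule checking separate, which may be convenient when many robots.txt files are collected first and analysed later."
]
},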
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Projekt 2\n",
"\n",
"Opracować wyszukiwarkę plików robots.txt.\n",
"\n",
"* pobrać robots.txt dla (prawie) wszystkich polskich stron WWW\n",
"* umożliwić wyszukiwanie i sortowanie według wszystkich możliwych pól (blokowana wyszukiwarka, adres, komentarz,\n",
"długość pliku itd.)\n",
"* opracować miary pozwalające automatycznie wyłuskać „ciekawe” pliki robots.txt (długość, występowanie pełnych\n",
"linków, odmienność od innych plików robots.txt); umożliwić sortowanie/filtrowanie według tej miary"
]
},
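{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last point of the project above names three signals: file length, presence of full links, and dissimilarity to other robots.txt files. One possible way to compute them is sketched below. Measuring dissimilarity as one minus the mean Jaccard similarity over sets of lines is an assumption chosen for simplicity, not something the project prescribes; all function names and the three toy files are hypothetical.\n",
"\n",
"```python\n",
"def line_set(text):\n",
"    # Normalised, non-empty lines of one robots.txt file.\n",
"    return {line.strip().lower() for line in text.splitlines() if line.strip()}\n",
"\n",
"\n",
"def jaccard(a, b):\n",
"    # Jaccard similarity of two sets; two empty sets count as identical.\n",
"    if not a and not b:\n",
"        return 1.0\n",
"    return len(a & b) / len(a | b)\n",
"\n",
"\n",
"def count_full_links(text):\n",
"    # Count absolute http(s) URLs by counting their scheme prefixes.\n",
"    return text.count('http://') + text.count('https://')\n",
"\n",
"\n",
"def interest_features(files):\n",
"    # files: dict mapping a site address to the text of its robots.txt.\n",
"    sets = {site: line_set(text) for site, text in files.items()}\n",
"    features = {}\n",
"    for site, text in files.items():\n",
"        others = [s for other, s in sets.items() if other != site]\n",
"        if others:\n",
"            mean_sim = sum(jaccard(sets[site], s) for s in others) / len(others)\n",
"        else:\n",
"            mean_sim = 0.0\n",
"        features[site] = {\n",
"            'length': len(text),\n",
"            'full_links': count_full_links(text),\n",
"            'dissimilarity': 1.0 - mean_sim,\n",
"        }\n",
"    return features\n",
"\n",
"\n",
"def rank_by_interest(features):\n",
"    # Most unusual first; ties broken by link count, then by length.\n",
"    def key(site):\n",
"        f = features[site]\n",
"        return (f['dissimilarity'], f['full_links'], f['length'])\n",
"    return sorted(features, key=key, reverse=True)\n",
"\n",
"\n",
"# Toy data (hypothetical sites and contents).\n",
"files = {\n",
"    'a.example': '''User-agent: *\n",
"Disallow: /admin/\n",
"''',\n",
"    'b.example': '''User-agent: *\n",
"Disallow: /admin/\n",
"''',\n",
"    'c.example': '''# Jobs: https://c.example/careers\n",
"User-agent: Wget\n",
"Disallow: /\n",
"Sitemap: https://c.example/sitemap.xml\n",
"''',\n",
"}\n",
"\n",
"features = interest_features(files)\n",
"ranking = rank_by_interest(features)\n",
"print(ranking[0], features[ranking[0]])\n",
"```\n",
"\n",
"On the toy data `c.example` ranks first, because it shares no lines with the two identical files and contains two full links. Comparing every pair of files is quadratic in the number of files, so a real collection would likely need a cheaper approximation."
]
},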
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"author": "Filip Graliński",
"email": "filipg@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"subtitle": "2.Wyszukiwarki — wprowadzenie[wykład]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}