{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Wyszukiwarki - wprowadzenie\n",
"\n",
"## Systemy wyszukiwania informacji\n",
"\n",
"![System wyszukiwania informacji](system-wyszukiwania-informacji.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wyszukiwarki\n",
"\n",
"![Wyszukiwarki](wyszukiwarka-internetowa.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chcę stworzyć swoją własną wyszukiwarkę internetową...\n",
"\n",
"1. Skąd brać adresy URL?\n",
"2. Jak pobrać pliki z tych adresów?\n",
"3. Jak wydobyć z nich tekst?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ... a może w ogóle nie pobierać?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Korpus CommonCrawl\n",
"\n",
"https://commoncrawl.org/the-data/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!-- スマホ用 --\n",
"<!-- \n",
"<!--table width='750' border='0' align='center' cellpadding='0' cellspacing='0'\n",
"<!--a href='index.phtml?CHANNEL=R51&FID=389924'\n",
"<!-- mail: \n",
"<!-- beige_lavender-3c --\n",
"<!--\n",
"<!-- Template Design By BeigeHeart_Chako_http://beigeheart.blog9.fc2.com/ --\n",
"<!-- 関連記事_http://beigeheart.blog9.fc2.com/blog-entry-99.html --\n",
"<!-- 利用規約_http://beigeheart.blog9.fc2.com/blog-entry-103.html --\n",
"<!-- テンプレの再配布、営利目的の利用禁止 --\n",
"<!-- 画像の無断転載・再配布禁止 --\n",
"<!-- アダルト・法律違反サイト、使用不可 --\n",
"<!-- アクセス解析タグはここから --\n",
"<!-- アクセス解析タグはここまで --\n",
"<!--▼▼▼メインカラムカラム+右サイドカラム部分--\n",
"<!--▼ヘッダー--\n",
"<!--▼管理ページリンク--\n",
"<!--▲管理ページリンク--\n",
"<!--▼タイトル--\n"
]
}
],
"source": [
"# Bezpośrednio z serwisu\n",
"\n",
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a '<!--[^\\[\\]<>]+' | uniq | head -n 20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dostępne są też \"ekstrakty\" czystego tekstu - zob. http://data.statmt.org/ngrams/raw/, np. 59 GB czystego tekstu po polsku z 2012 roku."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df6fa1abb58549287111ba8d776733e9 0.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Przegląd okulistyczny \n",
"Focal points \n",
"Przegląd reumatologiczny \n",
"Biblioteka on-line \n",
"STRONA GŁÓWNA \n",
"WYDAWNICTWO \n",
"O wydawnictwie \n",
"Kontakt \n",
"Regulamin zamówień \n",
"Spotkania autorskie \n",
"Nasi autorzy \n",
"CZYTELNIA ONLINE \n",
"w dziale: anatomia \n",
"w dziale: okulistyka \n",
"w dziale: ratownictwo \n",
"CENNIK \n",
"LINKI \n",
"USŁUGI \n",
"df6fa1abb58549287111ba8d776733e9 2.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Licencjaty \n",
"Multimedia \n",
"Pulmonologia \n",
"Okulistyka \n",
"Ratownictwo \n",
"Reumatologia \n",
"Zestawy specjalne \n",
"Onkologia \n",
"Focal Points 4/2006\n",
"\n"
]
}
],
"source": [
"! (wget -O - -q http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/pl/raw/pl.2012.raw.xz \\\n",
" | xzcat | head -n 30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Zrzuty Wikipedii\n",
"\n",
"Nie pobieraj Wikipedii strona po stronie!\n",
"\n",
"* tracisz swój czas\n",
"* i tracisz czas serwerów Wikipedii\n",
"\n",
"Lepiej pobrać zrzut (_dump_) ze strony https://dumps.wikimedia.org/backup-index.html"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1977]]\n",
"[[język skryptowy|skryptowy]]\n",
"[[programowanie proceduralne|proceduralny]]\n",
"[[Programowanie sterowane zdarzeniami|sterowany zdarzeniami]]\n",
"[[Alfred V. Aho|Alfred Aho]]\n",
"[[Peter J. Weinberger|Peter Weinberger]]\n",
"[[Brian Kernighan]]\n",
"[[wieloplatformowość|wieloplatformowy]]\n",
"[[język programowania]]\n",
"[[plik]]\n",
"[[system operacyjny|systemów operacyjnych]]\n",
"[[Unix|UNIX]]\n",
"[[tablica asocjacyjna|tablice asocjacyjne]]\n",
"[[Tekstowy typ danych|stringi]]\n",
"[[wyrażenie regularne|wyrażenia regularne]]\n",
"[[Alfred V. Aho|Alfreda V. Aho]]\n",
"[[Peter Weinberger|Petera Weinbergera]]\n",
"[[Brian Kernighan|Briana Kernighana]]\n",
"[[POSIX]]\n",
"[[System V|SVR4]]\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o '\\[\\[[^\\]]+\\]\\]' | head -n 20)"
]
},
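{
"cell_type": "markdown",
"metadata": {},
"source": [
"The links printed above use MediaWiki's `[[target|label]]` syntax. Below is a minimal sketch of splitting such links into target and label with a regular expression, assuming only simple links matter (nested brackets and templates are not handled). The name `parse_wikilinks` is ours for illustration and is not part of any library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Matches [[target]] or [[target|label]]; nested brackets are not handled.\n",
"WIKILINK = re.compile(r\"\\[\\[([^\\[\\]|]+)(?:\\|([^\\[\\]]*))?\\]\\]\")\n",
"\n",
"def parse_wikilinks(text):\n",
"    \"\"\"Return (target, label) pairs; the label defaults to the target.\"\"\"\n",
"    return [(m.group(1), m.group(2) or m.group(1)) for m in WIKILINK.finditer(text)]\n",
"\n",
"parse_wikilinks(\"[[Unix|UNIX]] and [[POSIX]]\")"
]
},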
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Skąd brać adresy URL\n",
"\n",
"### Zob. dumpy powyżej"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://ssl'\n",
"https://static.fc2.com/css_cn/common/headbar/120710style.css\n",
"https://blog.fc2.com/\n",
"https://spdeliver.i-mobile.co.jp/script/adsnativepc.js?20101001\n",
"https://media.fc2.com/counter_img.php?id=3493\n",
"https://plus.google.com/+apothekenumschau\n",
"https://script.ioam.de/iam.js\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/AGP-Kontaktformular--73317.html\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/Apotheker-HP--AGP-73319.html\n",
"https://login.apotheken-umschau.de/login?service=https://www.apotheken-umschau.de/j_spring_cas_security_check\n",
"https://forum.apotheken-umschau.de/portal/registration/register\n",
"https://www.facebook.com/Apotheken.Umschau\n",
"https://api.wortundbildverlag.com/drug-suggest/terms\n",
"https://07743rats-apotheke.apotheken-umschau.de/unternehmenskommunikation/Kontakt-zu-den-Redaktionen-53834.html\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/js/common.min.js?eBtyhdw\n",
"https://static.skyrock.net/img/favicon_v5b.ico\n",
"https://wir.skyrock.net/wir/v1/resize/?c=isi&amp;im=%2F9775%2F59549775%2Fpics%2Fphoto_59549775_89.jpg&amp;w=16\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/css/common.css?eahf2jw\n"
]
}
],
"source": [
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a 'https://[^ \"><]+' | uniq | head -n 20)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna\n",
"https://web.archive.org/web/20100116001012/http://homepages.cwi.nl/~dik/english/codes/stand.html#ascii\n",
"https://web.archive.org/web/20160613145224/http://www.aivosto.com/vbtips/charsets-7bit.html#body}}&lt;/ref&gt;\n",
"https://web.archive.org/web/20160522024759/http://worldpowersystems.com/J/codes/#ASCII-1967\n",
"https://books.google.com/?id=NQSpNAEACAAJ&amp;pg=PA28\n",
"https://web.archive.org/web/20160616084132/https://www.w3.org/blog/2008/05/utf8-web-growth/\n",
"https://web.archive.org/web/20160616084637/https://googleblog.blogspot.de/2008/05/moving-to-unicode-51.html\n",
"https://web.archive.org/web/20160616085323/https://googleblog.blogspot.de/2010/01/unicode-nearing-50-of-web.html\n",
"https://web.archive.org/web/20160827000956/http://dlx.bookzz.org/genesis/772000/c80a62495acf1e1a5b966de23c1f989a/_as/%5BInterface_Age_Staff%5D_Best_of_Interface_Age%2C_Volum%28BookZZ.org%29.pdf\n",
"https://books.google.com/books?id=bXLDwmIJNkUC&amp;pg=PA13\n",
"https://web.archive.org/web/20161031223347/http://ethw.org/First-Hand%3AChad_is_Our_Most_Important_Product%3A_An_Engineer%27s_Memory_of_Teletype_Corporation\n",
"https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf\n",
"https://web.archive.org/web/20160526181319/http://longstreet.typepad.com/thesciencebookstore/2012/03/heres-the-link.html\n",
"https://web.archive.org/web/20120213005708/http://www.transbay.net/~enf/ascii/ascii.pdf\n",
"https://archive.org/details/dictionaryworldp00iann\n",
"https://archive.org/details/dictionaryworldp00iann/page/n80\n",
"https://www.theguardian.com/commentisfree/belief/2013/jan/28/lucretius-all-things-atoms\n",
"https://archive.org/details/distillingknowle00mora_557\n",
"https://archive.org/details/distillingknowle00mora_557/page/n156\n",
"https://archive.org/details/fromelementstoat00sieg\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o 'https://[^ \"><]+' | head -n 20)"
]
},
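{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shell pipelines above can also be reproduced in Python. This is a minimal sketch, assuming the compressed data is available as a binary file-like object (an in-memory buffer here; a streamed HTTP response would be another option). `iter_urls` is a hypothetical helper name, and the regular expression only approximates the `grep` pattern used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import bz2\n",
"import gzip\n",
"import io\n",
"import re\n",
"\n",
"# Roughly the same pattern as in the grep calls, applied to raw bytes.\n",
"URL_RE = re.compile(rb'https://[^ \"><\\s]+')\n",
"\n",
"def iter_urls(fileobj, opener=gzip.open):\n",
"    \"\"\"Lazily yield URLs from a compressed binary stream, line by line.\"\"\"\n",
"    with opener(fileobj, \"rb\") as stream:\n",
"        for line in stream:\n",
"            for match in URL_RE.finditer(line):\n",
"                yield match.group().decode(\"utf-8\", errors=\"replace\")\n",
"\n",
"# Tiny in-memory samples: gzip (as in CommonCrawl) and bzip2 (as in Wikipedia dumps).\n",
"sample = b'<a href=\"https://example.org/a\">x</a>\\n'\n",
"print(list(iter_urls(io.BytesIO(gzip.compress(sample)))))\n",
"print(list(iter_urls(io.BytesIO(bz2.compress(sample)), opener=bz2.open)))"
]
},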
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Serwis DMOZ/ODP (niestety już nieaktywny)\n",
"Ostatni link: https://web.archive.org/web/20160306230718/http://rdf.dmoz.org/rdf/content.rdf.u8.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Odpytywać \"pasożytniczo\" inną wyszukiwarkę"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# see https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal\n",
"\n",
"import urllib\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def query_google(query):\n",
" url = f\"https://google.com/search?q={query}\"\n",
" response = requests.get(url)\n",
" soup = BeautifulSoup(response.content, \"html.parser\")\n",
" \n",
" results = []\n",
" for g in soup.find_all('a'):\n",
" link = g['href']\n",
" if '/url?q=' in link:\n",
" results.append(link[7:])\n",
" return results"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QmhMwC3oECAwQDg&usg=AOvVaw1F4NoOH13sPHmkkVrKPKPc',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAQegQICxAB&usg=AOvVaw0cBRsP3ORH8ItFxcBkkaXl',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Opis&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAg&usg=AOvVaw2pQXVnDLY_DxI-QJncPJ-J',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Historia&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAw&usg=AOvVaw3Fkx-NtoxRASml4JWUS68g',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Najwa%25C5%25BCniejsze_argumenty_%25E2%2580%259Eza%25E2%2580%259D_i_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBA&usg=AOvVaw2pTlj01g4WYUd9G__fMDdO',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Argumenty_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBQ&usg=AOvVaw09DHFpaDfQ8rbvPCsALuqQ',\n",
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEXoECAQQAQ&usg=AOvVaw0oHXUaa0kvQwNCNe5W9JIh',\n",
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEXoECAQQAg&usg=AOvVaw2CrxxVzwVVwE4Xsj31_w3T',\n",
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEnoECAUQAQ&usg=AOvVaw12i_Qq-aNn2KMbZciKlmAM',\n",
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEnoECAUQAg&usg=AOvVaw3zdXkOsnuCMFVR8USryFDw',\n",
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwF3oECAMQAQ&usg=AOvVaw3jgTHagNopqqBsCo594Zip',\n",
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwF3oECAMQAg&usg=AOvVaw0iwfh9wM9EkhqRY_YoXuYU',\n",
" 'https://www.ceneo.pl/%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAYegQICBAB&usg=AOvVaw38rQfzltST6zIW8eCRdta-',\n",
" 'https://www.ceneo.pl/Filmy%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAZegQIAhAB&usg=AOvVaw3WqL8324pgm8Rd57USPD8M',\n",
" 'https://www.antyradio.pl/News/Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hustala-sie-na-drzewie-ZDJECIE-43102&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAaegQIChAB&usg=AOvVaw30c7T2Ymn-Q4Vqq5C962BO',\n",
" 'https://allegro.pl/kategoria/gry%3Fstring%3DWielka%2520stopa%2520%253A)%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAbegQIBxAB&usg=AOvVaw2kdw9sx7alxFh5IwLfsVX4',\n",
" 'https://allegro.pl/listing%3Fstring%3DWielka%2520stopa%2520%253A%2529%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAcegQICRAB&usg=AOvVaw0nK7AoJJjmr1oWrN46umA_',\n",
" 'https://tvn24.pl/tvnmeteo/informacje-pogoda/ciekawostki,49/wielka-stopa-nie-istnieje-naukowcy-to-nie-koniec-nadziei,127328,1,0.html&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAdegQIBhAB&usg=AOvVaw0WWcyH9m2XpHzz7koN1IrJ',\n",
" 'https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qty4Ifw&usg=AOvVaw177POHJ8_tlgAuIzWDTzhM',\n",
" 'https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522wielka%252Bstopa%252522%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qxs8CCIAB&usg=AOvVaw0OmJ8GZoJAvzg7NX5Aby4M']"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"wielka stopa\"')"
]
},
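{
"cell_type": "markdown",
"metadata": {},
"source": [
"The links returned above still carry Google's redirect parameters (`sa`, `ved`, `usg`) and are percent-encoded. Below is a minimal sketch of cleaning them up, assuming the redirect format seen in the output above (which Google may change at any time). `clean_result` is a hypothetical helper, not part of the original scraper."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import parse_qs\n",
"\n",
"def clean_result(result):\n",
"    \"\"\"Drop the tracking parameters and undo one level of URL encoding.\"\"\"\n",
"    # query_google() returns everything after '/url?q=', so put 'q=' back\n",
"    # and let parse_qs split off '&sa=...', '&ved=...' and '&usg=...'.\n",
"    return parse_qs(\"q=\" + result)[\"q\"][0]\n",
"\n",
"clean_result(\"https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&usg=AOvVaw0\")"
]
},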
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}