More on Lecture 1

Filip Gralinski 2021-03-09 22:24:20 +01:00
parent f62e68ccab
commit 6deecf43dd


@ -20,6 +20,337 @@
"![Search engines](wyszukiwarka-internetowa.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# I want to build my own web search engine...\n",
"\n",
"1. Where do the URLs come from?\n",
"2. How do I download the files at those URLs?\n",
"3. How do I extract the text from them?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ... or maybe not download anything at all?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The CommonCrawl corpus\n",
"\n",
"https://commoncrawl.org/the-data/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<!-- スマホ用 --\n",
"<!-- \n",
"<!--table width='750' border='0' align='center' cellpadding='0' cellspacing='0'\n",
"<!--a href='index.phtml?CHANNEL=R51&FID=389924'\n",
"<!-- mail: \n",
"<!-- beige_lavender-3c --\n",
"<!--\n",
"<!-- Template Design By BeigeHeart_Chako_http://beigeheart.blog9.fc2.com/ --\n",
"<!-- 関連記事_http://beigeheart.blog9.fc2.com/blog-entry-99.html --\n",
"<!-- 利用規約_http://beigeheart.blog9.fc2.com/blog-entry-103.html --\n",
"<!-- テンプレの再配布、営利目的の利用禁止 --\n",
"<!-- 画像の無断転載・再配布禁止 --\n",
"<!-- アダルト・法律違反サイト、使用不可 --\n",
"<!-- アクセス解析タグはここから --\n",
"<!-- アクセス解析タグはここまで --\n",
"<!--▼▼▼メインカラムカラム+右サイドカラム部分--\n",
"<!--▼ヘッダー--\n",
"<!--▼管理ページリンク--\n",
"<!--▲管理ページリンク--\n",
"<!--▼タイトル--\n"
]
}
],
"source": [
"# Directly from the service\n",
"\n",
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a '<!--[^\\[\\]<>]+' | uniq | head -n 20)"
]
},
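The shell pipeline above can also be sketched in Python. This is only an illustration under stated assumptions, not the notebook's own method: the helper name `iter_html_comments` is hypothetical, and the example feeds it a tiny in-memory gzip stream instead of the real WARC file so that it runs offline.

```python
import gzip
import io
import re

# Roughly the grep pattern: "<!--" followed by characters other than [, ], <, >.
# The newline is excluded as well, because grep only matches within a single line.
COMMENT_RE = re.compile(rb"<!--[^\[\]<>\n]+")

def iter_html_comments(gz_stream):
    """Yield comment openings found in a gzip-compressed byte stream, line by line."""
    with gzip.open(gz_stream, "rb") as lines:
        for line in lines:
            for match in COMMENT_RE.finditer(line):
                # Like grep -a, treat the data as bytes and tolerate bad encodings.
                yield match.group(0).decode("utf-8", errors="replace")

# Tiny stand-in for a WARC file (assumption: real data would be streamed from the network).
sample = gzip.compress(b"<html><!-- header --><p>text</p>\n<!--footer--></html>\n")
comments = list(iter_html_comments(io.BytesIO(sample)))
print(comments)
```

Reading line by line keeps memory use flat, which matters because a single WARC segment is far too large to decompress into memory at once.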
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plain-text \"extracts\" are also available; see http://data.statmt.org/ngrams/raw/, e.g. 59 GB of plain Polish text from 2012."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"df6fa1abb58549287111ba8d776733e9 0.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Przegląd okulistyczny \n",
"Focal points \n",
"Przegląd reumatologiczny \n",
"Biblioteka on-line \n",
"STRONA GŁÓWNA \n",
"WYDAWNICTWO \n",
"O wydawnictwie \n",
"Kontakt \n",
"Regulamin zamówień \n",
"Spotkania autorskie \n",
"Nasi autorzy \n",
"CZYTELNIA ONLINE \n",
"w dziale: anatomia \n",
"w dziale: okulistyka \n",
"w dziale: ratownictwo \n",
"CENNIK \n",
"LINKI \n",
"USŁUGI \n",
"df6fa1abb58549287111ba8d776733e9 2.000000 http://www.gornicki.pl/focal_points_4/2006\n",
"Licencjaty \n",
"Multimedia \n",
"Pulmonologia \n",
"Okulistyka \n",
"Ratownictwo \n",
"Reumatologia \n",
"Zestawy specjalne \n",
"Onkologia \n",
"Focal Points 4/2006\n",
"\n"
]
}
],
"source": [
"! (wget -O - -q http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/pl/raw/pl.2012.raw.xz \\\n",
" | xzcat | head -n 30)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wikipedia dumps\n",
"\n",
"Do not download Wikipedia page by page!\n",
"\n",
"* you waste your own time\n",
"* and you waste the time of Wikipedia's servers\n",
"\n",
"It is better to download a dump from https://dumps.wikimedia.org/backup-index.html"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1977]]\n",
"[[język skryptowy|skryptowy]]\n",
"[[programowanie proceduralne|proceduralny]]\n",
"[[Programowanie sterowane zdarzeniami|sterowany zdarzeniami]]\n",
"[[Alfred V. Aho|Alfred Aho]]\n",
"[[Peter J. Weinberger|Peter Weinberger]]\n",
"[[Brian Kernighan]]\n",
"[[wieloplatformowość|wieloplatformowy]]\n",
"[[język programowania]]\n",
"[[plik]]\n",
"[[system operacyjny|systemów operacyjnych]]\n",
"[[Unix|UNIX]]\n",
"[[tablica asocjacyjna|tablice asocjacyjne]]\n",
"[[Tekstowy typ danych|stringi]]\n",
"[[wyrażenie regularne|wyrażenia regularne]]\n",
"[[Alfred V. Aho|Alfreda V. Aho]]\n",
"[[Peter Weinberger|Petera Weinbergera]]\n",
"[[Brian Kernighan|Briana Kernighana]]\n",
"[[POSIX]]\n",
"[[System V|SVR4]]\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o '\\[\\[[^\\]]+\\]\\]' | head -n 20)"
]
},
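The link extraction above can be taken one step further by separating the link target from its display label. The following is a minimal sketch, assuming a hypothetical helper `extract_wikilinks` and a small bz2-compressed string standing in for the real dump; it extends the grep pattern rather than reproducing it exactly.

```python
import bz2
import re

# [[target]] or [[target|label]]; the target stops at "|" or "]".
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def extract_wikilinks(text):
    """Return (target, label) pairs; the label falls back to the target."""
    return [(m.group(1), m.group(2) or m.group(1)) for m in WIKILINK_RE.finditer(text)]

# Stand-in for a fragment of the dump (assumption: real input would be the .xml.bz2 file).
compressed = bz2.compress("[[język skryptowy|skryptowy]] z [[1977]] roku".encode("utf-8"))
text = bz2.decompress(compressed).decode("utf-8")
links = extract_wikilinks(text)
print(links)
```

The targets are the article titles, so they are the part that is useful for building a link graph; the labels are the inflected forms that appear in running text.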
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Where to get URLs\n",
"\n",
"### See the dumps above"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://ssl'\n",
"https://static.fc2.com/css_cn/common/headbar/120710style.css\n",
"https://blog.fc2.com/\n",
"https://spdeliver.i-mobile.co.jp/script/adsnativepc.js?20101001\n",
"https://media.fc2.com/counter_img.php?id=3493\n",
"https://plus.google.com/+apothekenumschau\n",
"https://script.ioam.de/iam.js\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/AGP-Kontaktformular--73317.html\n",
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/Apotheker-HP--AGP-73319.html\n",
"https://login.apotheken-umschau.de/login?service=https://www.apotheken-umschau.de/j_spring_cas_security_check\n",
"https://forum.apotheken-umschau.de/portal/registration/register\n",
"https://www.facebook.com/Apotheken.Umschau\n",
"https://api.wortundbildverlag.com/drug-suggest/terms\n",
"https://07743rats-apotheke.apotheken-umschau.de/unternehmenskommunikation/Kontakt-zu-den-Redaktionen-53834.html\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/js/common.min.js?eBtyhdw\n",
"https://static.skyrock.net/img/favicon_v5b.ico\n",
"https://wir.skyrock.net/wir/v1/resize/?c=isi&amp;im=%2F9775%2F59549775%2Fpics%2Fphoto_59549775_89.jpg&amp;w=16\n",
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
"https://static.skyrock.net/css/common.css?eahf2jw\n"
]
}
],
"source": [
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a 'https://[^ \"><]+' | uniq | head -n 20)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna\n",
"https://web.archive.org/web/20100116001012/http://homepages.cwi.nl/~dik/english/codes/stand.html#ascii\n",
"https://web.archive.org/web/20160613145224/http://www.aivosto.com/vbtips/charsets-7bit.html#body}}&lt;/ref&gt;\n",
"https://web.archive.org/web/20160522024759/http://worldpowersystems.com/J/codes/#ASCII-1967\n",
"https://books.google.com/?id=NQSpNAEACAAJ&amp;pg=PA28\n",
"https://web.archive.org/web/20160616084132/https://www.w3.org/blog/2008/05/utf8-web-growth/\n",
"https://web.archive.org/web/20160616084637/https://googleblog.blogspot.de/2008/05/moving-to-unicode-51.html\n",
"https://web.archive.org/web/20160616085323/https://googleblog.blogspot.de/2010/01/unicode-nearing-50-of-web.html\n",
"https://web.archive.org/web/20160827000956/http://dlx.bookzz.org/genesis/772000/c80a62495acf1e1a5b966de23c1f989a/_as/%5BInterface_Age_Staff%5D_Best_of_Interface_Age%2C_Volum%28BookZZ.org%29.pdf\n",
"https://books.google.com/books?id=bXLDwmIJNkUC&amp;pg=PA13\n",
"https://web.archive.org/web/20161031223347/http://ethw.org/First-Hand%3AChad_is_Our_Most_Important_Product%3A_An_Engineer%27s_Memory_of_Teletype_Corporation\n",
"https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf\n",
"https://web.archive.org/web/20160526181319/http://longstreet.typepad.com/thesciencebookstore/2012/03/heres-the-link.html\n",
"https://web.archive.org/web/20120213005708/http://www.transbay.net/~enf/ascii/ascii.pdf\n",
"https://archive.org/details/dictionaryworldp00iann\n",
"https://archive.org/details/dictionaryworldp00iann/page/n80\n",
"https://www.theguardian.com/commentisfree/belief/2013/jan/28/lucretius-all-things-atoms\n",
"https://archive.org/details/distillingknowle00mora_557\n",
"https://archive.org/details/distillingknowle00mora_557/page/n156\n",
"https://archive.org/details/fromelementstoat00sieg\n"
]
}
],
"source": [
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
" | bzcat | grep -P -o 'https://[^ \"><]+' | head -n 20)"
]
},
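Two details of the raw URL lists above are worth handling before the addresses are used. `uniq` only collapses adjacent duplicates (the same skyrock.net image appears twice in the CommonCrawl output), and some URLs still carry HTML entities such as `&amp;`. A minimal sketch of a cleanup step follows; the helper name `clean_urls` is hypothetical and the input is a short hand-picked sample from the outputs above.

```python
import html

def clean_urls(raw_urls):
    """Unescape HTML entities and drop duplicates while keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = html.unescape(raw)  # e.g. "&amp;" becomes "&"
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned

raw = [
    "https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg",
    "https://books.google.com/?id=NQSpNAEACAAJ&amp;pg=PA28",
    "https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg",
]
urls = clean_urls(raw)
print(urls)
```

Keeping a set of seen URLs works for a sample like this one; for a crawl-sized list the set itself may not fit in memory, which this sketch does not address.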
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DMOZ/ODP directory (unfortunately no longer active)\n",
"\n",
"Last available link: https://web.archive.org/web/20160306230718/http://rdf.dmoz.org/rdf/content.rdf.u8.gz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Querying another search engine \"parasitically\""
]
},
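{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# see https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal\n",
"\n",
"import urllib.parse\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def query_google(query):\n",
"    # URL-encode the query so that spaces and quotes survive in the query string\n",
"    url = f\"https://google.com/search?q={urllib.parse.quote_plus(query)}\"\n",
"    response = requests.get(url)\n",
"    soup = BeautifulSoup(response.content, \"html.parser\")\n",
"\n",
"    prefix = '/url?q='\n",
"    results = []\n",
"    # href=True skips anchors without an href attribute (avoids a KeyError)\n",
"    for g in soup.find_all('a', href=True):\n",
"        link = g['href']\n",
"        # result links are wrapped in /url?q=<target>&... redirects\n",
"        if link.startswith(prefix):\n",
"            results.append(link[len(prefix):])\n",
"    return results"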
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QmhMwC3oECAwQDg&usg=AOvVaw1F4NoOH13sPHmkkVrKPKPc',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAQegQICxAB&usg=AOvVaw0cBRsP3ORH8ItFxcBkkaXl',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Opis&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAg&usg=AOvVaw2pQXVnDLY_DxI-QJncPJ-J',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Historia&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAw&usg=AOvVaw3Fkx-NtoxRASml4JWUS68g',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Najwa%25C5%25BCniejsze_argumenty_%25E2%2580%259Eza%25E2%2580%259D_i_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBA&usg=AOvVaw2pTlj01g4WYUd9G__fMDdO',\n",
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Argumenty_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBQ&usg=AOvVaw09DHFpaDfQ8rbvPCsALuqQ',\n",
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEXoECAQQAQ&usg=AOvVaw0oHXUaa0kvQwNCNe5W9JIh',\n",
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEXoECAQQAg&usg=AOvVaw2CrxxVzwVVwE4Xsj31_w3T',\n",
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEnoECAUQAQ&usg=AOvVaw12i_Qq-aNn2KMbZciKlmAM',\n",
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEnoECAUQAg&usg=AOvVaw3zdXkOsnuCMFVR8USryFDw',\n",
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwF3oECAMQAQ&usg=AOvVaw3jgTHagNopqqBsCo594Zip',\n",
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwF3oECAMQAg&usg=AOvVaw0iwfh9wM9EkhqRY_YoXuYU',\n",
" 'https://www.ceneo.pl/%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAYegQICBAB&usg=AOvVaw38rQfzltST6zIW8eCRdta-',\n",
" 'https://www.ceneo.pl/Filmy%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAZegQIAhAB&usg=AOvVaw3WqL8324pgm8Rd57USPD8M',\n",
" 'https://www.antyradio.pl/News/Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hustala-sie-na-drzewie-ZDJECIE-43102&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAaegQIChAB&usg=AOvVaw30c7T2Ymn-Q4Vqq5C962BO',\n",
" 'https://allegro.pl/kategoria/gry%3Fstring%3DWielka%2520stopa%2520%253A)%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAbegQIBxAB&usg=AOvVaw2kdw9sx7alxFh5IwLfsVX4',\n",
" 'https://allegro.pl/listing%3Fstring%3DWielka%2520stopa%2520%253A%2529%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAcegQICRAB&usg=AOvVaw0nK7AoJJjmr1oWrN46umA_',\n",
" 'https://tvn24.pl/tvnmeteo/informacje-pogoda/ciekawostki,49/wielka-stopa-nie-istnieje-naukowcy-to-nie-koniec-nadziei,127328,1,0.html&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAdegQIBhAB&usg=AOvVaw0WWcyH9m2XpHzz7koN1IrJ',\n",
" 'https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qty4Ifw&usg=AOvVaw177POHJ8_tlgAuIzWDTzhM',\n",
" 'https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522wielka%252Bstopa%252522%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qxs8CCIAB&usg=AOvVaw0OmJ8GZoJAvzg7NX5Aby4M']"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_google('\"wielka stopa\"')"
]
},
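The results above still carry Google's tracking parameters (`&sa=...`, `&ved=...`, `&usg=...`), and the target URL is percent-encoded one extra time. A minimal sketch of recovering the destination from a raw redirect link follows, using only the standard library. The helper name `target_from_redirect` is hypothetical, it expects the original `href` (including the `/url?q=` prefix) rather than the already-sliced strings, and the `ved`/`usg` values in the example are shortened placeholders.

```python
from urllib.parse import parse_qs, urlparse

def target_from_redirect(href):
    """Return the destination URL from a '/url?q=...' redirect link, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    # parse_qs splits the parameters on "&" and percent-decodes each value once.
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None

href = "/url?q=https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=abc&usg=xyz"
target = target_from_redirect(href)
print(target)
```

Parsing the query string instead of slicing off a fixed number of characters separates the target from the tracking parameters and undoes the extra layer of encoding in one step.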
{
"cell_type": "code",
"execution_count": null,