More on Lecture 1
parent
f62e68ccab
commit
6deecf43dd
@ -20,6 +20,337 @@
|
||||
"![Search engines](wyszukiwarka-internetowa.png)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# I want to build my own web search engine...\n",
"\n",
"1. Where do the URLs come from?\n",
"2. How do I download the files at those URLs?\n",
"3. How do I extract the text from them?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## ... or maybe not download anything at all?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### The Common Crawl corpus\n",
|
||||
"\n",
|
||||
"https://commoncrawl.org/the-data/"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<!-- スマホ用 --\n",
|
||||
"<!-- \n",
|
||||
"<!--table width='750' border='0' align='center' cellpadding='0' cellspacing='0'\n",
|
||||
"<!--a href='index.phtml?CHANNEL=R51&FID=389924'\n",
|
||||
"<!-- mail: \n",
|
||||
"<!-- beige_lavender-3c --\n",
|
||||
"<!--\n",
|
||||
"<!-- Template Design By BeigeHeart_Chako_http://beigeheart.blog9.fc2.com/ --\n",
|
||||
"<!-- 関連記事_http://beigeheart.blog9.fc2.com/blog-entry-99.html --\n",
|
||||
"<!-- 利用規約_http://beigeheart.blog9.fc2.com/blog-entry-103.html --\n",
|
||||
"<!-- テンプレの再配布、営利目的の利用禁止 --\n",
|
||||
"<!-- 画像の無断転載・再配布禁止 --\n",
|
||||
"<!-- アダルト・法律違反サイト、使用不可 --\n",
|
||||
"<!-- アクセス解析タグはここから --\n",
|
||||
"<!-- アクセス解析タグはここまで --\n",
|
||||
"<!--▼▼▼メインカラムカラム+右サイドカラム部分--\n",
|
||||
"<!--▼ヘッダー--\n",
|
||||
"<!--▼管理ページリンク--\n",
|
||||
"<!--▲管理ページリンク--\n",
|
||||
"<!--▼タイトル--\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Straight from the service\n",
|
||||
"\n",
|
||||
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a '<!--[^\\[\\]<>]+' | uniq | head -n 20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Plain-text \"extracts\" are also available, see http://data.statmt.org/ngrams/raw/, e.g. 59 GB of plain Polish text from 2012."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"df6fa1abb58549287111ba8d776733e9 0.000000 http://www.gornicki.pl/focal_points_4/2006\n",
|
||||
"Przegląd okulistyczny \n",
|
||||
"Focal points \n",
|
||||
"Przegląd reumatologiczny \n",
|
||||
"Biblioteka on-line \n",
|
||||
"STRONA GŁÓWNA \n",
|
||||
"WYDAWNICTWO \n",
|
||||
"O wydawnictwie \n",
|
||||
"Kontakt \n",
|
||||
"Regulamin zamówień \n",
|
||||
"Spotkania autorskie \n",
|
||||
"Nasi autorzy \n",
|
||||
"CZYTELNIA ONLINE \n",
|
||||
"w dziale: anatomia \n",
|
||||
"w dziale: okulistyka \n",
|
||||
"w dziale: ratownictwo \n",
|
||||
"CENNIK \n",
|
||||
"LINKI \n",
|
||||
"USŁUGI \n",
|
||||
"df6fa1abb58549287111ba8d776733e9 2.000000 http://www.gornicki.pl/focal_points_4/2006\n",
|
||||
"Licencjaty \n",
|
||||
"Multimedia \n",
|
||||
"Pulmonologia \n",
|
||||
"Okulistyka \n",
|
||||
"Ratownictwo \n",
|
||||
"Reumatologia \n",
|
||||
"Zestawy specjalne \n",
|
||||
"Onkologia \n",
|
||||
"Focal Points 4/2006\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"! (wget -O - -q http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/pl/raw/pl.2012.raw.xz \\\n",
|
||||
" | xzcat | head -n 30)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Wikipedia dumps\n",
"\n",
"Do not download Wikipedia page by page!\n",
"\n",
"* you waste your own time\n",
"* and you waste the time of the Wikipedia servers\n",
"\n",
"It is better to download a dump from https://dumps.wikimedia.org/backup-index.html"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[1977]]\n",
|
||||
"[[język skryptowy|skryptowy]]\n",
|
||||
"[[programowanie proceduralne|proceduralny]]\n",
|
||||
"[[Programowanie sterowane zdarzeniami|sterowany zdarzeniami]]\n",
|
||||
"[[Alfred V. Aho|Alfred Aho]]\n",
|
||||
"[[Peter J. Weinberger|Peter Weinberger]]\n",
|
||||
"[[Brian Kernighan]]\n",
|
||||
"[[wieloplatformowość|wieloplatformowy]]\n",
|
||||
"[[język programowania]]\n",
|
||||
"[[plik]]\n",
|
||||
"[[system operacyjny|systemów operacyjnych]]\n",
|
||||
"[[Unix|UNIX]]\n",
|
||||
"[[tablica asocjacyjna|tablice asocjacyjne]]\n",
|
||||
"[[Tekstowy typ danych|stringi]]\n",
|
||||
"[[wyrażenie regularne|wyrażenia regularne]]\n",
|
||||
"[[Alfred V. Aho|Alfreda V. Aho]]\n",
|
||||
"[[Peter Weinberger|Petera Weinbergera]]\n",
|
||||
"[[Brian Kernighan|Briana Kernighana]]\n",
|
||||
"[[POSIX]]\n",
|
||||
"[[System V|SVR4]]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
|
||||
" | bzcat | grep -P -o '\\[\\[[^\\]]+\\]\\]' | head -n 20)"
|
||||
]
|
||||
},
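The same link extraction can be sketched in Python instead of a shell pipeline. This is a minimal sketch, not the notebook's own method: `iter_wikilinks` is a hypothetical helper name, the regular expression mirrors the `grep` pattern above, and the demo runs on a tiny made-up in-memory sample rather than on the real dump.

```python
import bz2
import io
import re

# Same pattern as the grep call: "[[", anything but "]", then "]]"
WIKILINK = re.compile(r"\[\[[^\]]+\]\]")


def iter_wikilinks(fileobj):
    """Yield wikilinks from a bz2-compressed, UTF-8 encoded binary file object."""
    # bz2.open decompresses lazily, so the whole dump never sits in memory
    with bz2.open(fileobj, mode="rt", encoding="utf-8") as text:
        for line in text:
            yield from WIKILINK.findall(line)


# Tiny in-memory stand-in for the real dump (made-up sample text)
sample = "'''AWK''' to [[język programowania]] z [[1977]] roku dla [[Unix|UNIX]].\n"
compressed = io.BytesIO(bz2.compress(sample.encode("utf-8")))
links = list(iter_wikilinks(compressed))
print(links)
```

Any binary file object should work in place of the `BytesIO` buffer, for instance a locally downloaded dump opened with `open(path, "rb")`.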
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Where to get URLs\n",
"\n",
"### See the dumps above"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"https://ssl'\n",
|
||||
"https://static.fc2.com/css_cn/common/headbar/120710style.css\n",
|
||||
"https://blog.fc2.com/\n",
|
||||
"https://spdeliver.i-mobile.co.jp/script/adsnativepc.js?20101001\n",
|
||||
"https://media.fc2.com/counter_img.php?id=3493\n",
|
||||
"https://plus.google.com/+apothekenumschau\n",
|
||||
"https://script.ioam.de/iam.js\n",
|
||||
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/AGP-Kontaktformular--73317.html\n",
|
||||
"https://07743rats-apotheke.apotheken-umschau.de/News--Wissen/Apotheker-HP--AGP-73319.html\n",
|
||||
"https://login.apotheken-umschau.de/login?service=https://www.apotheken-umschau.de/j_spring_cas_security_check\n",
|
||||
"https://forum.apotheken-umschau.de/portal/registration/register\n",
|
||||
"https://www.facebook.com/Apotheken.Umschau\n",
|
||||
"https://api.wortundbildverlag.com/drug-suggest/terms\n",
|
||||
"https://07743rats-apotheke.apotheken-umschau.de/unternehmenskommunikation/Kontakt-zu-den-Redaktionen-53834.html\n",
|
||||
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
|
||||
"https://static.skyrock.net/js/common.min.js?eBtyhdw\n",
|
||||
"https://static.skyrock.net/img/favicon_v5b.ico\n",
|
||||
"https://wir.skyrock.net/wir/v1/resize/?c=isi&im=%2F9775%2F59549775%2Fpics%2Fphoto_59549775_89.jpg&w=16\n",
|
||||
"https://i.skyrock.net/9775/59549775/pics/photo_59549775_89.jpg\n",
|
||||
"https://static.skyrock.net/css/common.css?eahf2jw\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"! (wget -O - -q https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00719.warc.gz | zcat| grep -P -o -a 'https://[^ \"><]+' | uniq | head -n 20)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"https://pl.wikipedia.org/wiki/Wikipedia:Strona_g%C5%82%C3%B3wna\n",
|
||||
"https://web.archive.org/web/20100116001012/http://homepages.cwi.nl/~dik/english/codes/stand.html#ascii\n",
|
||||
"https://web.archive.org/web/20160613145224/http://www.aivosto.com/vbtips/charsets-7bit.html#body}}</ref>\n",
|
||||
"https://web.archive.org/web/20160522024759/http://worldpowersystems.com/J/codes/#ASCII-1967\n",
|
||||
"https://books.google.com/?id=NQSpNAEACAAJ&pg=PA28\n",
|
||||
"https://web.archive.org/web/20160616084132/https://www.w3.org/blog/2008/05/utf8-web-growth/\n",
|
||||
"https://web.archive.org/web/20160616084637/https://googleblog.blogspot.de/2008/05/moving-to-unicode-51.html\n",
|
||||
"https://web.archive.org/web/20160616085323/https://googleblog.blogspot.de/2010/01/unicode-nearing-50-of-web.html\n",
|
||||
"https://web.archive.org/web/20160827000956/http://dlx.bookzz.org/genesis/772000/c80a62495acf1e1a5b966de23c1f989a/_as/%5BInterface_Age_Staff%5D_Best_of_Interface_Age%2C_Volum%28BookZZ.org%29.pdf\n",
|
||||
"https://books.google.com/books?id=bXLDwmIJNkUC&pg=PA13\n",
|
||||
"https://web.archive.org/web/20161031223347/http://ethw.org/First-Hand%3AChad_is_Our_Most_Important_Product%3A_An_Engineer%27s_Memory_of_Teletype_Corporation\n",
|
||||
"https://textfiles.meulie.net/bitsaved/Books/Mackenzie_CodedCharSets.pdf\n",
|
||||
"https://web.archive.org/web/20160526181319/http://longstreet.typepad.com/thesciencebookstore/2012/03/heres-the-link.html\n",
|
||||
"https://web.archive.org/web/20120213005708/http://www.transbay.net/~enf/ascii/ascii.pdf\n",
|
||||
"https://archive.org/details/dictionaryworldp00iann\n",
|
||||
"https://archive.org/details/dictionaryworldp00iann/page/n80\n",
|
||||
"https://www.theguardian.com/commentisfree/belief/2013/jan/28/lucretius-all-things-atoms\n",
|
||||
"https://archive.org/details/distillingknowle00mora_557\n",
|
||||
"https://archive.org/details/distillingknowle00mora_557/page/n156\n",
|
||||
"https://archive.org/details/fromelementstoat00sieg\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"! (wget -O - -q https://dumps.wikimedia.org/plwiki/20210301/plwiki-20210301-pages-articles-multistream.xml.bz2 \\\n",
|
||||
" | bzcat | grep -P -o 'https://[^ \"><]+' | head -n 20)"
|
||||
]
|
||||
},
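The outputs above show that the `grep` pattern over-matches in places: one result ends in a stray quote (`https://ssl'`) and another drags along wikitext (`}}</ref>`). A stricter pattern could look like the sketch below. The set of excluded characters is an assumption chosen for these two cases, `extract_urls` is a hypothetical helper name, and the sample text is made up. Unlike `uniq`, the helper also drops duplicates that are not adjacent, and it accepts `http://` as well as `https://`.

```python
import re

# Stop at whitespace, quotes, angle brackets and wikitext delimiters.
URL = re.compile(r"https?://[^\s\"'<>\[\]{}|]+")


def extract_urls(text):
    """Return the URLs found in text, in order of first appearance, without duplicates."""
    seen = set()
    urls = []
    for match in URL.finditer(text):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample mixing wikitext and a quoted URL
sample = (
    "<ref>{{cite web|url=https://example.org/a.html#body}}</ref> "
    "[https://example.org/b title] 'https://example.org/a.html#body'"
)
print(extract_urls(sample))
```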
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The DMOZ/ODP directory (unfortunately no longer active)\n",
"Last archived link: https://web.archive.org/web/20160306230718/http://rdf.dmoz.org/rdf/content.rdf.u8.gz"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Query another search engine \"parasitically\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# see https://hackernoon.com/how-to-scrape-google-with-python-bo7d2tal\n",
"\n",
"import urllib.parse\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"\n",
"def query_google(query):\n",
"    # URL-encode the query so that spaces and quotes survive in the URL\n",
"    url = f\"https://google.com/search?q={urllib.parse.quote_plus(query)}\"\n",
"    response = requests.get(url)\n",
"    soup = BeautifulSoup(response.content, \"html.parser\")\n",
"\n",
"    results = []\n",
"    # href=True skips anchors without an href attribute (avoids a KeyError)\n",
"    for a in soup.find_all('a', href=True):\n",
"        link = a['href']\n",
"        # result links have the form /url?q=<target>&sa=...\n",
"        if link.startswith('/url?q='):\n",
"            results.append(link[len('/url?q='):])\n",
"    return results"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QmhMwC3oECAwQDg&usg=AOvVaw1F4NoOH13sPHmkkVrKPKPc',\n",
|
||||
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAQegQICxAB&usg=AOvVaw0cBRsP3ORH8ItFxcBkkaXl',\n",
|
||||
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Opis&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAg&usg=AOvVaw2pQXVnDLY_DxI-QJncPJ-J',\n",
|
||||
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Historia&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQAw&usg=AOvVaw3Fkx-NtoxRASml4JWUS68g',\n",
|
||||
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Najwa%25C5%25BCniejsze_argumenty_%25E2%2580%259Eza%25E2%2580%259D_i_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBA&usg=AOvVaw2pTlj01g4WYUd9G__fMDdO',\n",
|
||||
" 'https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)%23Argumenty_%25E2%2580%259Eprzeciw%25E2%2580%259D&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Q0gIwEHoECAsQBQ&usg=AOvVaw09DHFpaDfQ8rbvPCsALuqQ',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEXoECAQQAQ&usg=AOvVaw0oHXUaa0kvQwNCNe5W9JIh',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3DEPRggWavPX4&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEXoECAQQAg&usg=AOvVaw2CrxxVzwVVwE4Xsj31_w3T',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwEnoECAUQAQ&usg=AOvVaw12i_Qq-aNn2KMbZciKlmAM',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3DIhS1d56aPOc&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwEnoECAUQAg&usg=AOvVaw3zdXkOsnuCMFVR8USryFDw',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QtwIwF3oECAMQAQ&usg=AOvVaw3jgTHagNopqqBsCo594Zip',\n",
|
||||
" 'https://www.youtube.com/watch%3Fv%3D_r4_GIfTn2o&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QuAIwF3oECAMQAg&usg=AOvVaw0iwfh9wM9EkhqRY_YoXuYU',\n",
|
||||
" 'https://www.ceneo.pl/%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAYegQICBAB&usg=AOvVaw38rQfzltST6zIW8eCRdta-',\n",
|
||||
" 'https://www.ceneo.pl/Filmy%3Bszukaj-wielka%2Bstopa&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAZegQIAhAB&usg=AOvVaw3WqL8324pgm8Rd57USPD8M',\n",
|
||||
" 'https://www.antyradio.pl/News/Kobieta-twierdzi-ze-spotkala-Wielka-Stope-Hustala-sie-na-drzewie-ZDJECIE-43102&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAaegQIChAB&usg=AOvVaw30c7T2Ymn-Q4Vqq5C962BO',\n",
|
||||
" 'https://allegro.pl/kategoria/gry%3Fstring%3DWielka%2520stopa%2520%253A)%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAbegQIBxAB&usg=AOvVaw2kdw9sx7alxFh5IwLfsVX4',\n",
|
||||
" 'https://allegro.pl/listing%3Fstring%3DWielka%2520stopa%2520%253A%2529%2520-&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAcegQICRAB&usg=AOvVaw0nK7AoJJjmr1oWrN46umA_',\n",
|
||||
" 'https://tvn24.pl/tvnmeteo/informacje-pogoda/ciekawostki,49/wielka-stopa-nie-istnieje-naukowcy-to-nie-koniec-nadziei,127328,1,0.html&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAdegQIBhAB&usg=AOvVaw0WWcyH9m2XpHzz7koN1IrJ',\n",
|
||||
" 'https://support.google.com/websearch%3Fp%3Dws_settings_location%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qty4Ifw&usg=AOvVaw177POHJ8_tlgAuIzWDTzhM',\n",
|
||||
" 'https://accounts.google.com/ServiceLogin%3Fcontinue%3Dhttps://www.google.com/search%253Fq%253D%252522wielka%252Bstopa%252522%26hl%3Dpl&sa=U&ved=0ahUKEwje-6mWk6TvAhUxpHEKHVatAO0Qxs8CCIAB&usg=AOvVaw0OmJ8GZoJAvzg7NX5Aby4M']"
|
||||
]
|
||||
},
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"query_google('\"wielka stopa\"')"
|
||||
]
|
||||
},
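The results above still carry Google's tracking parameters (`&sa=U&ved=...&usg=...`), and the target URL is percent-encoded one level too deep (`%25C4%2599` instead of `%C4%99`). One way to recover the plain target is to parse the raw `href` as a URL and read its `q` parameter. This is a minimal sketch using only the standard library. `target_from_google_href` is a hypothetical helper name, and the sample `href` is rebuilt from the second result above by putting back the `/url?q=` prefix.

```python
from urllib.parse import parse_qs, urlsplit


def target_from_google_href(href):
    """Return the target URL from a '/url?q=...' href, or None for other links."""
    parts = urlsplit(href)
    if parts.path != "/url":
        return None
    # parse_qs splits on '&' and undoes one level of percent-encoding
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else None


# Sample href rebuilt from the output above
href = ("/url?q=https://pl.wikipedia.org/wiki/Wielka_Stopa_(zwierz%25C4%2599)"
        "&sa=U&ved=2ahUKEwje-6mWk6TvAhUxpHEKHVatAO0QFjAQegQICxAB"
        "&usg=AOvVaw0cBRsP3ORH8ItFxcBkkaXl")
print(target_from_google_href(href))
```

The helper works on the raw `href` values, so it would replace the `link[7:]` slice rather than post-process the returned list.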
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
|