forked from filipg/aitech-eks-pub
Merge branch 'master' of git.wmi.amu.edu.pl:filipg/aitech-eks
commit
11b1397252
81
cw/00_Informacje_na_temat_przedmiotu.ipynb
Normal file
@ -0,0 +1,81 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"# General information"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Contacting the instructor\n",
"\n",
"Instructor: mgr inż. Jakub Pokrywka\n",
"\n",
"The best way to reach me is through MS Teams, either in the channel group (general matters) or by private message. I reply every 2-3 days. You can also arrange a call during my office hours (Tue 12:00-13:00) or at another agreed time.\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
"## Literature\n",
"Recommended reading for the course:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (free). I recommend at least skimming it.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (I recommend this one less; it is somewhat outdated.)\n",
"- Alex Graves. 2012. Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence, vol. 385. Springer, Berlin, Heidelberg.\n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (NAACL).\n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, vol. 21, no. 140, pages 1-67.\n",
"\n",
"- Filip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356\n",
"\n",
"- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087\n",
|
||||||
|
"\n",
"## Grading\n",
"\n",
"At least 500 points will be available.\n",
"\n",
"Grade:\n",
"\n",
"- up to 299 points: 2\n",
"\n",
"- 300-349: 3\n",
"\n",
"- 350-399: 3+\n",
"\n",
"- 400-449: 4\n",
"\n",
"- 450-499: 4+\n",
"\n",
"- 500 or more: 5\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.8.5"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
|
257
cw/01_Wyszukiwarki-wprowadzenie.ipynb
Normal file
@ -0,0 +1,257 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"# Class 1\n",
"\n",
"In this class you can earn 5 points for each valuable contribution to the discussion. One person can earn at most 15 points during these exercises."
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"# Useful materials:\n",
|
||||||
|
"\n",
|
||||||
|
"https://www.google.com/advanced_search\n",
|
||||||
|
"\n",
|
||||||
|
"https://www.google.pl/advanced_image_search\n",
|
||||||
|
"\n",
|
||||||
|
"https://support.google.com/websearch/answer/2466433?hl=en\n",
|
||||||
|
"\n",
|
||||||
|
"https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
|
||||||
|
"\n",
|
||||||
|
"https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo\n",
|
||||||
|
"\n",
|
||||||
|
"https://developer.allegro.pl/about/\n",
|
||||||
|
"\n",
|
||||||
|
"https://serpapi.com/"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"We will discuss:\n",
"- General-purpose search engines (Google, Bing, ...)\n",
"- Platform-specific search engines (Amazon, Allegro, OLX, Spar, ...)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Google advanced search\n",
|
||||||
|
"\n",
|
||||||
|
"- \"job steve\"\n",
|
||||||
|
"- poduszka |/OR drzwi \n",
|
||||||
|
"- poduszka -biała\n",
|
||||||
|
"- poduszka * drzwi\n",
|
||||||
|
"- define:pillow\n",
|
||||||
|
"- cache:wp.pl\n",
|
||||||
|
"- poduszka filetype:pdf\n",
|
||||||
|
"- poduszka site:allegro.pl\n",
|
||||||
|
"- related:allegro.pl\n",
|
||||||
|
"- intitle:poduszka\n",
|
||||||
|
"- allintitle:poduszka biała\n",
|
||||||
|
"- inurl:poduszka\n",
|
||||||
|
"- allinurl:poduszka biała\n",
|
||||||
|
"- poduszka AROUND(4) drzwi\n",
|
||||||
|
"- weather:poznan\n",
|
||||||
|
"- stocks:gme\n",
|
||||||
|
"- map:poznań\n",
|
||||||
|
"- $329 in pln\n",
|
||||||
|
"- euro 1990..2000\n",
|
||||||
|
"- 15*30\n",
|
||||||
|
"- color picker\n",
|
||||||
|
"- elon musk @twitter\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
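The operators listed in the cell above can be combined in a single query. The sketch below is not part of the original notebook: the helper names `build_google_query` and `search_url` and their parameters are hypothetical, and the `https://www.google.com/search?q=` URL shape is an assumption about the public search page, not a documented API. It only shows how a query such as `poduszka -biała site:allegro.pl` could be assembled and URL-encoded.

```python
from urllib.parse import urlencode


def build_google_query(terms, site=None, filetype=None, exclude=(), exact=None):
    """Combine plain terms with a few advanced-search operators (hypothetical helper)."""
    parts = []
    if exact:
        parts.append(f'"{exact}"')                  # exact phrase, e.g. "job steve"
    parts.extend(terms)                             # plain keywords
    parts.extend(f"-{word}" for word in exclude)    # excluded words, e.g. -biała
    if site:
        parts.append(f"site:{site}")                # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")        # restrict to one file type
    return " ".join(parts)


def search_url(query):
    """URL-encode the query; the base URL is an assumption, not an official API."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_google_query(["poduszka"], site="allegro.pl", exclude=["biała"])
print(query)
print(search_url(query))
```

Keeping query construction separate from URL encoding makes it easy to paste the plain query into the search box by hand when checking how each operator behaves.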
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Components of the Google search engine\n",
"- text input field and search button\n",
"- suggestions while typing\n",
"- ghosting\n",
"- autocorrection, e.g. for the misspelled query pdouszka\n",
"- number of views for a result\n",
"- additional elements shown after entering a phrase (answers to general questions, related searches, etc.)\n",
"- list of items (split into pages)\n",
"- how do the pages work on mobile devices?\n",
"- presentation of results: the page title, with matching text in bold (does Google have the right to show such text on its own page?)\n",
"- other components, e.g. for the query best games for nintendo switch\n",
"- ads"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Components of a specialized search engine, using Allegro as an example\n",
"\n",
"- text search or direct navigation through categories\n",
"- every user has their own way of searching\n",
"- search box\n",
"- suggestions while typing a phrase\n",
"- ghosting (e.g. santander.pl)\n",
"- autocorrection (suggestion and redirect)\n",
"- you can also choose to search in descriptions, parameters, etc.\n",
"- comment: this is where we type some phrase\n",
"- we get a set of documents, sorted in some way (though they need not be)\n",
"- how does document retrieval work?\n",
"    - stop words\n",
"    - lowercase normalization\n",
"    - synonym lists and inflection (including normalization to a single form → wielka poduszka / wielki poduszka, kubek / kubki)\n",
"- sorting options (discuss the possible orderings): a feature Google does not have\n",
"https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo#moja-oferta-ma-duza-sprzedaz-a-mimo-tego-jest-ona-nizej-w-sortowaniu-po-trafnosci-niz-inne-nowe-oferty-dlaczego-\n",
"- relevance can mean something different to each person\n",
"- default sorting: what is its significance?\n",
"- other kinds of sorting\n",
"- reranking\n",
"- on the left there are category narrowing and filters (faceted search), which Google does not have\n",
"- there are also sponsored and promoted offers; the dilemma: which matters more, the business or the user?\n",
"- recommendations for users at the bottom; strictly speaking, this is a separate field\n",
"- other options (searching for multiple items at once)\n",
"- advanced search: https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
"- evaluating search engine quality: discussion, who would choose what, and where does machine learning come in?\n",
"- what goals must a relevance engineer meet?\n",
"- how to evaluate search engines?"
|
||||||
|
]
|
||||||
|
},
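The retrieval steps named in the cell above (stop words, lowercase normalization, mapping inflected forms and synonyms to one form) can be illustrated with a toy normalizer. This is only a sketch under stated assumptions: the word lists `STOP_WORDS`, `LEMMAS`, and `SYNONYMS` are made up for the example, and nothing here reflects how Allegro actually processes queries.

```python
import re

# Tiny illustrative resources (assumptions); a real engine would use full dictionaries.
STOP_WORDS = {"i", "w", "na", "do", "z"}
LEMMAS = {"wielka": "wielki", "kubki": "kubek", "poduszki": "poduszka"}
SYNONYMS = {"jasiek": "poduszka"}


def normalize(text):
    """Lowercase, tokenize, drop stop words, then map forms and synonyms to one term."""
    tokens = re.findall(r"\w+", text.lower())            # lowercase + simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    tokens = [LEMMAS.get(t, t) for t in tokens]          # collapse inflected forms
    tokens = [SYNONYMS.get(t, t) for t in tokens]        # collapse synonyms
    return tokens


print(normalize("Wielka poduszka i KUBKI"))
print(normalize("wielki poduszka") == normalize("WIELKA Poduszka"))
```

Applying the same function to both the indexed titles and the incoming query is what lets "kubek" match an offer titled "kubki".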
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Search engine APIs\n",
|
||||||
|
"- https://developer.allegro.pl/listing/"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Google trends"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## SEO (Search Engine Optimization)\n",
"- for Google\n",
"- for search engines such as Allegro and OLX\n",
|
||||||
|
"- https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Homework\n",
"\n",
"----------------------\n",
"Maximum points for task 100: 30\n",
"\n",
"Maximum points for tasks 101-107: 50\n",
"\n",
"\n",
"Please submit your solutions as a PDF file in MS Teams (channel group → Assignments) by the end of 17.03.2021.\n",
"\n",
"Besides the solution itself, please describe how you arrived at it (e.g. the phrases you typed into the search engine).\n",
|
||||||
|
"\n",
"## Task 100\n",
"\n",
"Find examples of research \"challenges\": monetary rewards for\n",
"finding a piece of information, the earliest occurrence of a word, and so on.\n",
"The challenge must consist in finding information in publicly available sources (the internet, libraries).\n",
"Rewards for, say, information about a murderer therefore do not count.\n",
"We are only interested in \"open\" challenges. The challenge may concern any language.\n",
"\n",
"Present the challenges in a table: reward, link, short description.\n",
"\n",
"Points for each challenge found: min(30, 5*log_10(reward in dollars))\n",
"\n",
"Example: [a $250 reward for finding a mention of the chupacabra\n",
"(a monster) from before 1990](http://www.cryptozoonews.com/chupa-250/).\n",
"\n",
"Maximum number of points: 30.\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
"## Task 101\n",
"\n",
"Give 3 examples of Allegro queries that return surprising or unsatisfying results. Explain what might cause such results.\n",
"\n",
"Maximum number of points: 20.\n",
"\n",
"## Task 102\n",
"\n",
"Find the French-language PDF with the most pages that was published on the Internet\n",
"before 10 March 2021.\n",
"\n",
"Points: 30 (for the largest file).\n",
"\n",
"## Task 103\n",
"\n",
"Find the earliest attestation of the word \"coronavirus\" in English.\n",
"\n",
"Points: 35\n",
"\n",
"## Task 104\n",
"\n",
"Find the earliest attestation of the word \"SARS-CoV-2\" in English.\n",
"\n",
"Points: 35\n",
|
||||||
|
" \n",
|
||||||
|
" \n",
"## Task 105\n",
"\n",
"Give 3 examples of offers on marketplaces (Allegro, OLX, others) whose titles are non-obvious so that the offer shows up\n",
"for as many queries as possible. Each example should illustrate a different reason. State the reason next to each offer.\n",
"\n",
"Points: 20\n",
"\n",
"\n",
"## Task 106\n",
"\n",
"Find a Google Trends chart in which interest in one phrase rises while\n",
"interest in another phrase falls. Both phrases should be at least somewhat popular. A causal\n",
"link is not required, but if there is one, all the better. Use the trend comparison option.\n",
"\n",
"Points: 20\n",
"\n",
"## Task 107\n",
"\n",
"Find a query on Google Trends that is popular in some regions of Poland but not in others. What might explain these differences?\n",
"\n",
"Points: 20\n",
|
||||||
|
" \n",
|
||||||
|
" \n"
|
||||||
|
]
|
||||||
|
}
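The per-challenge scoring rule from Task 100 can be checked with a few lines of arithmetic. The sketch below assumes that the 30-point figure acts as a cap, so the smaller of 30 and 5·log10(reward) is awarded; the function name `challenge_points` and the `cap` parameter are hypothetical.

```python
import math


def challenge_points(reward_usd, cap=30):
    """Points for one challenge: 5 * log10(reward), capped at `cap` (assumed reading)."""
    if reward_usd <= 0:
        raise ValueError("reward must be positive")
    return min(cap, 5 * math.log10(reward_usd))


# The $250 chupacabra example from the task description.
print(round(challenge_points(250), 2))
# Under this reading, the cap is reached only for rewards of a million dollars or more.
print(challenge_points(10**6))
```

Because the score grows with the logarithm of the reward, a tenfold larger reward adds only 5 points.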
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.8.5"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
|
1125
cw/02a_tfidf_tasks.ipynb
Normal file
File diff suppressed because it is too large
32
cw/02a_tfidf_tasks_ODPOWIEDZI.ipynb
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": []
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.8.5"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
|
708
cw/02b_tfidf_newsgroup.ipynb
Normal file
@ -0,0 +1,708 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"# Class 2\n",
"\n",
"Useful materials:\n",
|
||||||
|
"\n",
|
||||||
|
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
|
||||||
|
"\n",
|
||||||
|
"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Imports"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 1,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import sklearn.metrics\n",
|
||||||
|
"\n",
|
||||||
|
"from sklearn.datasets import fetch_20newsgroups\n",
|
||||||
|
"\n",
|
||||||
|
"from sklearn.feature_extraction.text import TfidfVectorizer"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Dataset"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 2,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"newsgroups = fetch_20newsgroups()['data']"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 3,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"11314"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 3,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"len(newsgroups)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 4,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"From: lerxst@wam.umd.edu (where's my thing)\n",
|
||||||
|
"Subject: WHAT car is this!?\n",
|
||||||
|
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
|
||||||
|
"Organization: University of Maryland, College Park\n",
|
||||||
|
"Lines: 15\n",
|
||||||
|
"\n",
|
||||||
|
" I was wondering if anyone out there could enlighten me on this car I saw\n",
|
||||||
|
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
|
||||||
|
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
|
||||||
|
"the front bumper was separate from the rest of the body. This is \n",
|
||||||
|
"all I know. If anyone can tellme a model name, engine specs, years\n",
|
||||||
|
"of production, where this car is made, history, or whatever info you\n",
|
||||||
|
"have on this funky looking car, please e-mail.\n",
|
||||||
|
"\n",
|
||||||
|
"Thanks,\n",
|
||||||
|
"- IL\n",
|
||||||
|
" ---- brought to you by your neighborhood Lerxst ----\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(newsgroups[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## Naive search"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 5,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
"# Naive search: keep every document that contains the substring 'car'.\n",
"all_documents = []\n",
"for document in newsgroups:\n",
"    if 'car' in document:\n",
"        all_documents.append(document)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 6,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"From: lerxst@wam.umd.edu (where's my thing)\n",
|
||||||
|
"Subject: WHAT car is this!?\n",
|
||||||
|
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
|
||||||
|
"Organization: University of Maryland, College Park\n",
|
||||||
|
"Lines: 15\n",
|
||||||
|
"\n",
|
||||||
|
" I was wondering if anyone out there could enlighten me on this car I saw\n",
|
||||||
|
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
|
||||||
|
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
|
||||||
|
"the front bumper was separate from the rest of the body. This is \n",
|
||||||
|
"all I know. If anyone can tellme a model name, engine specs, years\n",
|
||||||
|
"of production, where this car is made, history, or whatever info you\n",
|
||||||
|
"have on this funky looking car, please e-mail.\n",
|
||||||
|
"\n",
|
||||||
|
"Thanks,\n",
|
||||||
|
"- IL\n",
|
||||||
|
" ---- brought to you by your neighborhood Lerxst ----\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(all_documents[0])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 7,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
|
||||||
|
"Subject: SI Clock Poll - Final Call\n",
|
||||||
|
"Summary: Final call for SI clock reports\n",
|
||||||
|
"Keywords: SI,acceleration,clock,upgrade\n",
|
||||||
|
"Article-I.D.: shelley.1qvfo9INNc3s\n",
|
||||||
|
"Organization: University of Washington\n",
|
||||||
|
"Lines: 11\n",
|
||||||
|
"NNTP-Posting-Host: carson.u.washington.edu\n",
|
||||||
|
"\n",
|
||||||
|
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
|
||||||
|
"shared their experiences for this poll. Please send a brief message detailing\n",
|
||||||
|
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
|
||||||
|
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
|
||||||
|
"functionality with 800 and 1.4 m floppies are especially requested.\n",
|
||||||
|
"\n",
|
||||||
|
"I will be summarizing in the next two days, so please add to the network\n",
|
||||||
|
"knowledge base if you have done the clock upgrade and haven't answered this\n",
|
||||||
|
"poll. Thanks.\n",
|
||||||
|
"\n",
|
||||||
|
"Guy Kuo <guykuo@u.washington.edu>\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"print(all_documents[1])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"### What are the problems with this approach?\n"
|
||||||
|
]
|
||||||
|
},
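One problem is visible in the output above: the second matched post is not about cars at all. It matches only because the substring `car` occurs inside words such as `carson` and `cards`. The sketch below uses three made-up one-line documents modelled on those posts to contrast plain substring matching with whole-word, case-insensitive matching; it is an illustration, not the notebook's own solution.

```python
import re

# Made-up miniature documents modelled on the posts printed above.
docs = [
    "Subject: WHAT car is this!?",
    "NNTP-Posting-Host: carson.u.washington.edu; add on cards and adapters",
    "My Car broke down",
]

# Substring test, as in the naive loop: also matches 'carson' and 'cards',
# and misses 'Car' because the comparison is case-sensitive.
substring_hits = [d for d in docs if "car" in d]

# Whole-word, case-insensitive test using word boundaries.
word_hits = [d for d in docs if re.search(r"\bcar\b", d, flags=re.IGNORECASE)]

print(substring_hits)
print(word_hits)
```

Even the whole-word version returns an unordered set of matches: it says nothing about which document is most relevant, which is the gap that TF-IDF weighting with cosine similarity addresses.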
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
"## TF-IDF and cosine similarity: ready-made libraries"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 8,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"vectorizer = TfidfVectorizer()\n",
|
||||||
|
"#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 9,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"document_vectors = vectorizer.fit_transform(newsgroups)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 10,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||||||
|
"\twith 1787565 stored elements in Compressed Sparse Row format>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 10,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"document_vectors"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 11,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||||||
|
"\twith 89 stored elements in Compressed Sparse Row format>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 11,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"document_vectors[0]"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 12,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"document_vectors[0].todense()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 13,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
|
||||||
|
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||||||
|
" [0., 0., 0., ..., 0., 0., 0.],\n",
|
||||||
|
" [0., 0., 0., ..., 0., 0., 0.]])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 13,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"document_vectors[0:4].todense()"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 14,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"query_str = 'speed'\n",
|
||||||
|
"#query_str = 'speed car'\n",
|
||||||
|
"#query_str = 'spider man'"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"query_vector = vectorizer.transform([query_str])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||||||
|
"\twith 1787565 stored elements in Compressed Sparse Row format>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 16,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"document_vectors"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
|
||||||
|
"\twith 1 stored elements in Compressed Sparse Row format>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 17,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"query_vector"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 18,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)"
|
||||||
|
]
|
||||||
|
},
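To make the library calls above less of a black box, here is a standard-library sketch of the same idea on a three-document toy corpus. It assumes whitespace tokenization and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` documented as scikit-learn's default; the corpus and the helper names `idf`, `tfidf`, and `cosine` are made up for illustration, and `TfidfVectorizer` tokenizes differently.

```python
import math
from collections import Counter

docs = ["fast car speed", "clock speed speed", "spider man"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    """Sparse TF-IDF vector as a dict; terms outside the vocabulary are ignored."""
    counts = Counter(t for t in tokens if t in df)
    return {t: c * idf(t) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_vectors = [tfidf(tokens) for tokens in tokenized]
query_vector = tfidf("speed".split())
scores = [cosine(query_vector, v) for v in doc_vectors]
print(scores)
```

The document that repeats `speed` scores highest, the one that mentions it once alongside other terms scores lower, and the document without the term scores zero.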
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 19,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"np.sort(similarities)[0][-4:]"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"data": {
|
||||||
|
"text/plain": [
|
||||||
|
"array([4517, 5509, 2116, 9921])"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"execution_count": 20,
|
||||||
|
"metadata": {},
|
||||||
|
"output_type": "execute_result"
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"similarities.argsort()[0][-4:]"
|
||||||
|
]
|
||||||
|
},
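Note that slicing `[-4:]` from an ascending sort puts the best match last. With NumPy the slice can be reversed with `[::-1]`. The sketch below shows the same top-k selection in best-first order using only the standard library; the `scores` values are made up and the helper name `top_k` is hypothetical.

```python
import heapq


def top_k(scores, k):
    """Return (index, score) pairs of the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


scores = [0.0, 0.27, 0.48, 0.0, 0.44, 0.35]  # made-up similarity values
for index, score in top_k(scores, 4):
    print(index, score)
```

Returning indices together with scores keeps the link back to the original document list, so results can be printed next to their similarity values.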
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 21,
|
||||||
|
"metadata": {
|
||||||
|
"scrolled": false
|
||||||
|
},
|
||||||
|
"outputs": [
|
||||||
|
{
|
||||||
|
"name": "stdout",
|
||||||
|
"output_type": "stream",
|
||||||
|
"text": [
|
||||||
|
"From: ray@netcom.com (Ray Fischer)\n",
|
||||||
|
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||||
|
"Organization: Netcom. San Jose, California\n",
|
||||||
|
"Distribution: usa\n",
|
||||||
|
"Lines: 36\n",
|
||||||
|
"\n",
|
||||||
|
"dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
|
||||||
|
">I'm sure Intel and Motorola are competing neck-and-neck for \n",
|
||||||
|
">crunch-power, but for a given clock speed, how do we rank the\n",
|
||||||
|
">following (from 1st to 6th):\n",
|
||||||
|
"> 486\t\t68040\n",
|
||||||
|
"> 386\t\t68030\n",
|
||||||
|
"> 286\t\t68020\n",
|
||||||
|
"\n",
|
||||||
|
"040 486 030 386 020 286\n",
|
||||||
|
"\n",
|
||||||
|
">While you're at it, where will the following fit into the list:\n",
|
||||||
|
"> 68060\n",
|
||||||
|
"> Pentium\n",
|
||||||
|
"> PowerPC\n",
|
||||||
|
"\n",
|
||||||
|
"060 fastest, then Pentium, with the first versions of the PowerPC\n",
|
||||||
|
"somewhere in the vicinity.\n",
|
||||||
|
"\n",
|
||||||
|
">And about clock speed: Does doubling the clock speed double the\n",
|
||||||
|
">overall processor speed? And fill in the __'s below:\n",
|
||||||
|
"> 68030 @ __ MHz = 68040 @ __ MHz\n",
|
||||||
|
"\n",
|
||||||
|
"No. Computer speed is only partly dependent of processor/clock speed.\n",
|
||||||
|
"Memory system speed play a large role as does video system speed and\n",
|
||||||
|
"I/O speed. As processor clock rates go up, the speed of the memory\n",
|
||||||
|
"system becomes the greatest factor in the overall system speed. If\n",
|
||||||
|
"you have a 50MHz processor, it can be reading another word from memory\n",
|
||||||
|
"every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
|
||||||
|
"it will cost 10 times as much as the slower 80ns SIMMs.\n",
|
||||||
|
"\n",
|
||||||
|
"And roughly, the 68040 is twice as fast at a given clock\n",
|
||||||
|
"speed as is the 68030.\n",
|
||||||
|
"\n",
|
||||||
|
"-- \n",
|
||||||
|
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||||
|
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||||
|
"\n",
|
||||||
|
"0.4778416465020907\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
|
||||||
|
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||||
|
"Distribution: usa\n",
|
||||||
|
"Organization: University of Illinois at Urbana\n",
|
||||||
|
"Lines: 59\n",
|
||||||
|
"\n",
|
||||||
|
"ray@netcom.com (Ray Fischer) writes:\n",
|
||||||
|
"\n",
|
||||||
|
">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
|
||||||
|
">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
|
||||||
|
">>crunch-power, but for a given clock speed, how do we rank the\n",
|
||||||
|
">>following (from 1st to 6th):\n",
|
||||||
|
">> 486\t\t68040\n",
|
||||||
|
">> 386\t\t68030\n",
|
||||||
|
">> 286\t\t68020\n",
|
||||||
|
"\n",
|
||||||
|
">040 486 030 386 020 286\n",
|
||||||
|
"\n",
|
||||||
|
"How about some numbers here? Some kind of benchmark?\n",
|
||||||
|
"If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
|
||||||
|
"\n",
|
||||||
|
">>While you're at it, where will the following fit into the list:\n",
|
||||||
|
">> 68060\n",
|
||||||
|
">> Pentium\n",
|
||||||
|
">> PowerPC\n",
|
||||||
|
"\n",
|
||||||
|
">060 fastest, then Pentium, with the first versions of the PowerPC\n",
|
||||||
|
">somewhere in the vicinity.\n",
|
||||||
|
"\n",
|
||||||
|
"Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
|
||||||
|
"\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
|
||||||
|
" (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
|
||||||
|
"\n",
|
||||||
|
">>And about clock speed: Does doubling the clock speed double the\n",
|
||||||
|
">>overall processor speed? And fill in the __'s below:\n",
|
||||||
|
">> 68030 @ __ MHz = 68040 @ __ MHz\n",
|
||||||
|
"\n",
|
||||||
|
">No. Computer speed is only partly dependent of processor/clock speed.\n",
|
||||||
|
">Memory system speed play a large role as does video system speed and\n",
|
||||||
|
">I/O speed. As processor clock rates go up, the speed of the memory\n",
|
||||||
|
">system becomes the greatest factor in the overall system speed. If\n",
|
||||||
|
">you have a 50MHz processor, it can be reading another word from memory\n",
|
||||||
|
">every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
|
||||||
|
">it will cost 10 times as much as the slower 80ns SIMMs.\n",
|
||||||
|
"\n",
|
||||||
|
"Not in a clock-doubled system. There isn't a doubling in performance, but\n",
|
||||||
|
"it _is_ quite significant. Maybe about a 70% increase in performance.\n",
|
||||||
|
"\n",
|
||||||
|
"Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
|
||||||
|
"who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
|
||||||
|
"memory speed corresponds to a clock speed of 12.5 MHz.\n",
|
||||||
|
"\n",
|
||||||
|
">And roughly, the 68040 is twice as fast at a given clock\n",
|
||||||
|
">speed as is the 68030.\n",
|
||||||
|
"\n",
|
||||||
|
"Numbers?\n",
|
||||||
|
"\n",
|
||||||
|
">-- \n",
|
||||||
|
">Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||||
|
">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||||
|
"-- \n",
|
||||||
|
"Ravikumar Venkateswar\n",
|
||||||
|
"rvenkate@uiuc.edu\n",
|
||||||
|
"\n",
|
||||||
|
"A pun is a no' blessed form of whit.\n",
|
||||||
|
"\n",
|
||||||
|
"0.44292082969477664\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"From: ray@netcom.com (Ray Fischer)\n",
|
||||||
|
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
|
||||||
|
"Organization: Netcom. San Jose, California\n",
|
||||||
|
"Distribution: usa\n",
|
||||||
|
"Lines: 30\n",
|
||||||
|
"\n",
|
||||||
|
"rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
|
||||||
|
">ray@netcom.com (Ray Fischer) writes:\n",
|
||||||
|
">>040 486 030 386 020 286\n",
|
||||||
|
">\n",
|
||||||
|
">How about some numbers here? Some kind of benchmark?\n",
|
||||||
|
"\n",
|
||||||
|
"Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n",
|
||||||
|
"you happy, the 486 is faster than the 040. BFD. Both architectures\n",
|
||||||
|
"are nearing then end of their lifetimes. And especially with the x86\n",
|
||||||
|
"architecture: good riddance.\n",
|
||||||
|
"\n",
|
||||||
|
">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
|
||||||
|
">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
|
||||||
|
">memory speed corresponds to a clock speed of 12.5 MHz.\n",
|
||||||
|
"\n",
|
||||||
|
"The point being the processor speed is only one of many aspects of a\n",
|
||||||
|
"computers performance. Clock speed, processor, memory speed, CPU\n",
|
||||||
|
"architecture, I/O systems, even the application program all contribute \n",
|
||||||
|
"to the overall system performance.\n",
|
||||||
|
"\n",
|
||||||
|
">>And roughly, the 68040 is twice as fast at a given clock\n",
|
||||||
|
">>speed as is the 68030.\n",
|
||||||
|
">\n",
|
||||||
|
">Numbers?\n",
|
||||||
|
"\n",
|
||||||
|
"Look them up yourself.\n",
|
||||||
|
"\n",
|
||||||
|
"-- \n",
|
||||||
|
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
|
||||||
|
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
|
||||||
|
"\n",
|
||||||
|
"0.3491800997095306\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"From: mb4008@cehp11 (Morgan J Bullard)\n",
|
||||||
|
"Subject: Re: speeding up windows\n",
|
||||||
|
"Keywords: speed\n",
|
||||||
|
"Organization: University of Illinois at Urbana\n",
|
||||||
|
"Lines: 30\n",
|
||||||
|
"\n",
|
||||||
|
"djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
|
||||||
|
"\n",
|
||||||
|
">I have a 386/33 with 8 megs of memory\n",
|
||||||
|
"\n",
|
||||||
|
">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
|
||||||
|
">my computer \"boggs\" down and becomes really sluggish!\n",
|
||||||
|
"\n",
|
||||||
|
">What can I do to increase performance? What should I turn on or off\n",
|
||||||
|
"\n",
|
||||||
|
">Will not loading wallpapers or stuff like that help when it comes to\n",
|
||||||
|
">the running speed of windows and the programs that run under it?\n",
|
||||||
|
"\n",
|
||||||
|
">Thanx in advance\n",
|
||||||
|
"\n",
|
||||||
|
">Derek\n",
|
||||||
|
"\n",
|
||||||
|
"1) make sure your hard drive is defragmented. This will speed up more than \n",
|
||||||
|
" just windows BTW. Use something like Norton's or PC Tools.\n",
|
||||||
|
"2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
|
||||||
|
" will speed up your machine but I could very will be wrong on this.\n",
|
||||||
|
"There's a good chance you've already done this but if not it may speed things\n",
|
||||||
|
"up. good luck\n",
|
||||||
|
"\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
|
||||||
|
"\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n",
|
||||||
|
"\n",
|
||||||
|
">--\n",
|
||||||
|
">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n",
|
||||||
|
">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n",
|
||||||
|
">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
|
||||||
|
">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n",
|
||||||
|
"\n",
|
||||||
|
"0.26949927393886913\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n",
|
||||||
|
"----------------------------------------------------------------------------------------------------\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"source": [
|
||||||
|
"for i in range(1, 5):\n",
|
||||||
|
" print(newsgroups[similarities.argsort()[0][-i]])\n",
|
||||||
|
" print(np.sort(similarities)[0,-i])\n",
|
||||||
|
" print('-'*100)\n",
|
||||||
|
" print('-'*100)\n",
|
||||||
|
" print('-'*100)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Homework\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"- Choose a text collection that contains at least 5000 documents.\n",
|
||||||
|
"- Build a search engine on top of it based on Okapi BM25, i.e. a system that, for a given query, returns a few (5-10) best-matching documents in ranked order. You may use existing libraries to vectorize the documents, but you must implement Okapi BM25 yourself.\n",
|
||||||
|
"- Find a query for which the results are unsatisfactory.\n",
|
||||||
|
"- Improve the search engine (e.g. by changing the text preprocessing, the vectorizer, the parameters of the ranking algorithm, or the algorithm itself) so that it returns satisfactory results for that query.\n",
|
||||||
|
"- Present your work at the next class (15.03), answering the following questions:\n",
|
||||||
|
" - what the collection and the search system look like before the changes\n",
|
||||||
|
" - for which query the results are unsatisfactory (show the results)\n",
|
||||||
|
" - what changes were made\n",
|
||||||
|
" - what the search results look like after the changes\n",
|
||||||
|
" - how the changes affected the results (1-2 sentences)\n",
|
||||||
|
" \n",
|
||||||
|
"The presentation should be as simple as possible and take no more than 2-3 minutes.\n",
|
||||||
|
"Points available: 40\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": []
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.8.5"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 4
|
||||||
|
}
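The homework in the notebook asks for a hand-written Okapi BM25 ranker. As a rough orientation only, here is a minimal sketch of the scoring idea in plain Python. Everything in it is an assumption rather than part of the notebook: the `BM25` class, the `tokenize` helper, the whitespace tokenization, the default values `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`, which is one of several formulations in use.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real preprocessing
    # would likely handle punctuation, stop words, stemming, etc.).
    return text.lower().split()


class BM25:
    """Hypothetical minimal Okapi BM25 ranker over an in-memory list of strings."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1  # term-frequency saturation parameter
        self.b = b    # strength of document-length normalization
        self.docs = [tokenize(d) for d in documents]
        self.doc_lens = [len(d) for d in self.docs]
        # Assumes a non-empty collection with at least one non-empty document.
        self.avgdl = sum(self.doc_lens) / len(self.docs)
        self.term_freqs = [Counter(d) for d in self.docs]

        # Document frequency: in how many documents each term occurs.
        df = Counter()
        for d in self.docs:
            df.update(set(d))
        n = len(self.docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        # Sum the per-term BM25 contributions for one document.
        tf = self.term_freqs[index]
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query, top_k=5):
        # Return (document index, score) pairs, best first, dropping zero scores.
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        scored.sort(reverse=True)
        return [(i, s) for s, i in scored[:top_k] if s > 0]
```

Calling `BM25(list_of_texts).search("some query", top_k=5)` returns the indices of the highest-scoring documents. Varying `k1`, `b`, or the tokenizer is one way to explore the "improve the search engine" step of the assignment.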
|