Merge branch 'master' of git.wmi.amu.edu.pl:filipg/aitech-eks

This commit is contained in:
Filip Gralinski 2021-03-09 22:24:39 +01:00
commit 11b1397252
5 changed files with 2203 additions and 0 deletions

View File

@ -0,0 +1,81 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Informacje ogólne"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kontakt z prowadzącym\n",
"\n",
"prowadzący: mgr inż. Jakub Pokrywka\n",
"\n",
"Najlepiej kontaktowąć się ze mną przez MS TEAMS na grupie kanału (ogólne sprawy) lub w prywatnych wiadomościach. Odpisuję co 2-3 dni. Można też umówić się na zdzwonko w godzinach dyżuru (wt 12.00-13.00) lub umówić się w innym terminie.\n",
"\n",
"\n",
"## Literatura\n",
"Polecana literatura do przedmiotu:\n",
"\n",
"\n",
"- https://www.manning.com/books/relevant-search#toc (darmowa) Polecam chociaż przejrzeć.\n",
"- Marie-Francine Moens. 2006. Information Extraction: Algorithms and Prospects in a Retrieval Context. Springer. (polecam mniej, jest trochę nieaktualna)\n",
"- Alex Graves. 2012. Supervised sequence labelling. Studies in Computational Intelligence, vol 385. Springer. Berlin, Heidelberg. \n",
"\n",
"- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Association for Computational Linguistics (NAACL). \n",
"\n",
"- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research vol 21, number 140, pages 1-67. \n",
"\n",
"- Flip Graliński, Tomasz Stanisławek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Bartosz Topolski, Przemysław Biecek. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. URL https://arxiv.org/abs/2003.02356 \n",
"\n",
"- Łukasz Garncarek, Rafał Powalski, Tomasz Stanisławek, Bartosz Topolski, Piotr Halama, Filip Graliński. 2020. LAMBERT: Layout-Aware (Language) Modeling using BERT. URL https://arxiv.org/pdf/2002.08087 \n",
"\n",
"## Zaliczenie\n",
"\n",
"\n",
"\n",
"Do zdobycia będzie conajmniej 500 punktów.\n",
"\n",
"Ocena:\n",
"\n",
"- -299 — 2\n",
"\n",
"- 300-349 — 3\n",
"\n",
"- 350-399 — 3+\n",
"\n",
"- 400-449 — 4\n",
"\n",
"- 450—499 — 4+\n",
"\n",
"- 500- — 5\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,257 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zajecia 1\n",
"\n",
"Na tych zajęciach za aktywnośc można otrzymać po 5 punktów za wartościową wypowiedź. Maksymalnie jedna osoba może zdobyć na tych ćwiczeniach do 15 punktów."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Przydatne materiały:\n",
"\n",
"https://www.google.com/advanced_search\n",
"\n",
"https://www.google.pl/advanced_image_search\n",
"\n",
"https://support.google.com/websearch/answer/2466433?hl=en\n",
"\n",
"https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
"\n",
"https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo\n",
"\n",
"https://developer.allegro.pl/about/\n",
"\n",
"https://serpapi.com/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Będziemy omawiać: \n",
"- Wyszukiwarki ogólnego przeznaczenia (google, bing, ...)\n",
"- Wyszukiwarki na konkretną platformę (amazon, allegro, olx, spar, ...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wyszukiwanie zaawansowane google\n",
"\n",
"- \"job steve\"\n",
"- poduszka |/OR drzwi \n",
"- poduszka -biała\n",
"- poduszka * drzwi\n",
"- define:pillow\n",
"- cache:wp.pl\n",
"- poduszka filetype:pdf\n",
"- poduszka site:allegro.pl\n",
"- related:allegro.pl\n",
"- intitle:poduszka\n",
"- allintitle:poduszka biała\n",
"- inurl:poduszka\n",
"- allinurl:poduszka biała\n",
"- poduszka AROUND(4) drzwi\n",
"- weather:poznan\n",
"- stocks:gme\n",
"- map:poznań\n",
"- $329 in pln\n",
"- euro 1990..2000\n",
"- 15*30\n",
"- color picker\n",
"- elon musk @twitter\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Komponenty wyszukiwarki google\n",
"- pole do wpisywania tekstu i search button \n",
"- sugestie do wpisywania\n",
"- ghosting\n",
"- autokorekta, np. pdouszka\n",
"- ilość wyświetleń dla wyniku \n",
"- elementy dodaktowe po wpisaniu frazy (odpowiedzi na pytania ogólne, wyszukiwania powiązane, itp)\n",
"- lista elementów (podzielona na strony)\n",
"- jak działają strony na urządzeniach mobilnych?\n",
"- prezentacja wyników: nazwa strony oraz tam gdzie jest match pogrubienie (czy google ma prawo do umieszczania takich tekstów na swojej stronie)?\n",
"- inne komponenty - np best games for nintendo switch\n",
"- reklamy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Komponenty wyszukiwarki specjalistycznej na przykładzie allegro\n",
"\n",
"- wyszukiwarna tekstowa albo nawigowanie bezpośrednio po kategoriach\n",
"- każdy ma własny unikalny sposób wyszukiwania\n",
"- okno wyszukiwania\n",
"- sugestie przy wpisywaniu frazy\n",
"- ghosting (np santander.pl)\n",
"- autokorekta (sugestia oraz przekierowanie)\n",
"- można też wpisać, że szukamy również w opisach, parametrach itp.\n",
"- komentarz: tutaj wpisujemy jakąś frazę\n",
"- mamy zbiór dokumumentów oraz są posortowane w jakiś sposób (ale niekoniecznie tak musi być)\n",
"- jak działa odzyskiwanie dokumentów?\n",
" - stopwordy \n",
" - normalizacja do lowercase\n",
" - lista synonimów, fleksja, odmiana (także ujednoznacznienie do jednej formy → wielka poduszka/ wielki poduszka, kubek kubki)\n",
"- sortowania (omówić możliwe sortowania)- element którego nie ma w google\n",
"https://allegro.pl/dla-sprzedajacych/trafnosc-xGmVjoPwOTo#moja-oferta-ma-duza-sprzedaz-a-mimo-tego-jest-ona-nizej-w-sortowaniu-po-trafnosci-niz-inne-nowe-oferty-dlaczego-\n",
"- trafność dla każdego może znaczyć coś innego\n",
"- sortowanie domyślne- jakie jest jego znaczenie?\n",
"- inne rodzaje sortowania\n",
"- rerankowanie \n",
"- po lewej stronie mamy zawężenie do kategorii oraz filtry, wyszukiwanie facetowe- nie ma w google\n",
"- mamy także oferty sponsorowane oraz promowane - dylemat- ważniejszy jest biznes czy użytkownik\n",
"- rekomendacje dla użytkowników na dole- właściwie to jest osobny dział \n",
"- inne możliwości (szukaj wielu)\n",
"- wyszukiwanie zaawansowane: https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7\n",
"- ewaluacja jakości wyszukiwarki- dyskusja, kto by co wybrał, jak wygląda sprawa z uczeniem maszynowym?\n",
"- jakie cele musi spełniać inżynier trafonośći?\n",
"- jak ewaluować wyszukiwarki?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API do wyszukiwarek\n",
"- https://developer.allegro.pl/listing/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Google trends"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SEO (Search Engine Optimization)\n",
"- pod google\n",
"- pod wyszukiwarki typu allegro, olx \n",
"- https://allegro.pl/pomoc/dla-kupujacych/wyszukiwanie-i-obserwowanie/jak-korzystac-z-wyszukiwarki-i-znalezc-przedmiot-mGwAg2jRrU7"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zadanie domowe\n",
"\n",
"----------------------\n",
"Maksymalnie do zdobycia za zadania 100: 30\n",
"\n",
"Maksymalnie do zdobycia za zadania 101-107: 50\n",
"\n",
"\n",
"Zadania proszę oddawać w formie pliku pdf w MS TEAMS (grupa kanału → assignments) do końca 17.03.2021.\n",
"\n",
"Oprocz samego rozwiązania, proszę umieścić sposób w jaki Państwo do niego doszli (np frazy wpisywane w wysuzkiwarkę, itp.).\n",
"\n",
"## Zadanie 100\n",
"\n",
"Znaleźć przykłady „wyzwań” researcherskich — nagród pieniężnych za\n",
"znalezienie jakiejś informacji, najwcześniejszego wystąpienia jakiegoś słowa itp.\n",
"Wyzwanie musi polegać na znalezieniu jakieś informacji w powszechnie dostępnych źródłach (internet, biblioteki).\n",
"Zatem nie liczą sie np. nagrody za udzielenie informacji o jakimś mordercy, itp.\n",
"Interesują nas tylko „otwarte” wyzwania. Język, jakiego dotyczy wyzwanie — dowolny.\n",
"\n",
"Wyzwania podać w formie tabelki: nagroda, link, krótki opis.\n",
"\n",
"Liczba punktów za każde znalezione wyzwanie: max( 30, 5*log_10(nagroda w dolarach) )\n",
"\n",
"Przykład: [nagroda $250 za znalezienie wzmianki dotyczącej chupacabry\n",
"(potwora) przed 1990 rokiem](http://www.cryptozoonews.com/chupa-250/).\n",
"\n",
"Maksymalna liczba punktów: 30.\n",
"\n",
"\n",
"## Zadanie 101\n",
"\n",
"Podać 3 przykłady zapytań na allegro, które daje zaskakujące/niesatysfakcjonujące wyniki. Napisz jaka może być przyczyna takich wyników?\n",
"\n",
"Maksymalna liczba punktów: 20.\n",
"\n",
"## Zadanie 102\n",
" \n",
"Znaleźć PDF-a w języku francuskim opublikowanego w Internecie przed\n",
"10 marca 2021 roku z największą ilością stron.\n",
"\n",
"Punkty: 30 (za największy plik).\n",
" \n",
"## Zadanie 103\n",
"\n",
"Znajdź najwcześniejsze poświadczenie w języku angielskim słowa \"coronavirus\".\n",
"\n",
"Punkty: 35\n",
"\n",
"## Zadanie 104\n",
"\n",
"Znajdź najwcześniejsze poświadczenie w języku angielskim słowa \"SARS-CoV-2\".\n",
"Punkty: 35\n",
" \n",
" \n",
"## Zadanie 105\n",
" \n",
"Podaj 3 przykłady ofert na portalach (allegro, olx, inne), które mają nieoczywiste tytuły w celu pojawienia się\n",
"dla jak największej ilości zapytań. Powinny to być 3 różne powody. Napisz jakie to są powody przy ofercie.\n",
"\n",
"Punkty: 20\n",
"\n",
"\n",
"## Zadanie 106\n",
"\n",
"Znajdź wykres na google trends, który pokazuje równoczesny wzrost zainteresowania jednej frazy, gdy maleje\n",
"zainteresowanie drugą frazą. Obie frazy powinny być choć trochę popularne. Niekoniecznie musi występować \n",
"powiązanie przyczynowo-skutkowe, ale jeżeli zachodzi- tym lepiej. Skorzystaj z opcji porównywania trendów.\n",
"\n",
"Punkty: 20\n",
"\n",
"## Zadanie 107\n",
"\n",
"Znajdź zapytanie na google trends, które jest popularne w niektórych regionach polski, a w innych nie. Z czego mogą wynikać te różnice?\n",
"\n",
"Punkty: 20\n",
" \n",
" \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

1125
cw/02a_tfidf_tasks.ipynb Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,32 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,708 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Zajecia 2\n",
"\n",
"Przydatne materiały:\n",
"\n",
"https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html\n",
"\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importy"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import sklearn.metrics\n",
"\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zbiór danych"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"newsgroups = fetch_20newsgroups()['data']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11314"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(newsgroups[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naiwne przeszukiwanie"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"all_documents = list() \n",
"for document in newsgroups:\n",
" if 'car' in document:\n",
" all_documents.append(document)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: lerxst@wam.umd.edu (where's my thing)\n",
"Subject: WHAT car is this!?\n",
"Nntp-Posting-Host: rac3.wam.umd.edu\n",
"Organization: University of Maryland, College Park\n",
"Lines: 15\n",
"\n",
" I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n",
"\n",
"Thanks,\n",
"- IL\n",
" ---- brought to you by your neighborhood Lerxst ----\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"print(all_documents[0])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: guykuo@carson.u.washington.edu (Guy Kuo)\n",
"Subject: SI Clock Poll - Final Call\n",
"Summary: Final call for SI clock reports\n",
"Keywords: SI,acceleration,clock,upgrade\n",
"Article-I.D.: shelley.1qvfo9INNc3s\n",
"Organization: University of Washington\n",
"Lines: 11\n",
"NNTP-Posting-Host: carson.u.washington.edu\n",
"\n",
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4 m floppies are especially requested.\n",
"\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll. Thanks.\n",
"\n",
"Guy Kuo <guykuo@u.washington.edu>\n",
"\n"
]
}
],
"source": [
"print(all_documents[1])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### jakie są problemy z takim podejściem?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TFIDF i odległość cosinusowa- gotowe biblioteki"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = TfidfVectorizer()\n",
"#vectorizer = TfidfVectorizer(use_idf = False, ngram_range=(1,2))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"document_vectors = vectorizer.fit_transform(newsgroups)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 89 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0].todense()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"matrix([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors[0:4].todense()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"query_str = 'speed'\n",
"#query_str = 'speed car'\n",
"#query_str = 'spider man'"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"query_vector = vectorizer.transform([query_str])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<11314x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1787565 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document_vectors"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1x130107 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 1 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query_vector"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector,document_vectors)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.26949927, 0.3491801 , 0.44292083, 0.47784165])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sort(similarities)[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4517, 5509, 2116, 9921])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"similarities.argsort()[0][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 36\n",
"\n",
"dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">I'm sure Intel and Motorola are competing neck-and-neck for \n",
">crunch-power, but for a given clock speed, how do we rank the\n",
">following (from 1st to 6th):\n",
"> 486\t\t68040\n",
"> 386\t\t68030\n",
"> 286\t\t68020\n",
"\n",
"040 486 030 386 020 286\n",
"\n",
">While you're at it, where will the following fit into the list:\n",
"> 68060\n",
"> Pentium\n",
"> PowerPC\n",
"\n",
"060 fastest, then Pentium, with the first versions of the PowerPC\n",
"somewhere in the vicinity.\n",
"\n",
">And about clock speed: Does doubling the clock speed double the\n",
">overall processor speed? And fill in the __'s below:\n",
"> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
"No. Computer speed is only partly dependent of processor/clock speed.\n",
"Memory system speed play a large role as does video system speed and\n",
"I/O speed. As processor clock rates go up, the speed of the memory\n",
"system becomes the greatest factor in the overall system speed. If\n",
"you have a 50MHz processor, it can be reading another word from memory\n",
"every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
"it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"And roughly, the 68040 is twice as fast at a given clock\n",
"speed as is the 68030.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.4778416465020907\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Distribution: usa\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 59\n",
"\n",
"ray@netcom.com (Ray Fischer) writes:\n",
"\n",
">dhk@ubbpc.uucp (Dave Kitabjian) writes ...\n",
">>I'm sure Intel and Motorola are competing neck-and-neck for \n",
">>crunch-power, but for a given clock speed, how do we rank the\n",
">>following (from 1st to 6th):\n",
">> 486\t\t68040\n",
">> 386\t\t68030\n",
">> 286\t\t68020\n",
"\n",
">040 486 030 386 020 286\n",
"\n",
"How about some numbers here? Some kind of benchmark?\n",
"If you want, let me start it - 486DX2-66 - 32 SPECint92, 16 SPECfp92 .\n",
"\n",
">>While you're at it, where will the following fit into the list:\n",
">> 68060\n",
">> Pentium\n",
">> PowerPC\n",
"\n",
">060 fastest, then Pentium, with the first versions of the PowerPC\n",
">somewhere in the vicinity.\n",
"\n",
"Numbers? Pentium @66MHz - 65 SPECint92, 57 SPECfp92 .\n",
"\t PowerPC @66MHz - 50 SPECint92, 80 SPECfp92 . (Note this is the 601)\n",
" (Alpha @150MHz - 74 SPECint92,126 SPECfp92 - just for comparison)\n",
"\n",
">>And about clock speed: Does doubling the clock speed double the\n",
">>overall processor speed? And fill in the __'s below:\n",
">> 68030 @ __ MHz = 68040 @ __ MHz\n",
"\n",
">No. Computer speed is only partly dependent of processor/clock speed.\n",
">Memory system speed play a large role as does video system speed and\n",
">I/O speed. As processor clock rates go up, the speed of the memory\n",
">system becomes the greatest factor in the overall system speed. If\n",
">you have a 50MHz processor, it can be reading another word from memory\n",
">every 20ns. Sure, you can put all 20ns memory in your computer, but\n",
">it will cost 10 times as much as the slower 80ns SIMMs.\n",
"\n",
"Not in a clock-doubled system. There isn't a doubling in performance, but\n",
"it _is_ quite significant. Maybe about a 70% increase in performance.\n",
"\n",
"Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
"who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
"memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
">And roughly, the 68040 is twice as fast at a given clock\n",
">speed as is the 68030.\n",
"\n",
"Numbers?\n",
"\n",
">-- \n",
">Ray Fischer \"Convictions are more dangerous enemies of truth\n",
">ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"-- \n",
"Ravikumar Venkateswar\n",
"rvenkate@uiuc.edu\n",
"\n",
"A pun is a no' blessed form of whit.\n",
"\n",
"0.44292082969477664\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: ray@netcom.com (Ray Fischer)\n",
"Subject: Re: x86 ~= 680x0 ?? (How do they compare?)\n",
"Organization: Netcom. San Jose, California\n",
"Distribution: usa\n",
"Lines: 30\n",
"\n",
"rvenkate@ux4.cso.uiuc.edu (Ravikuma Venkateswar) writes ...\n",
">ray@netcom.com (Ray Fischer) writes:\n",
">>040 486 030 386 020 286\n",
">\n",
">How about some numbers here? Some kind of benchmark?\n",
"\n",
"Benchmarks are for marketing dweebs and CPU envy. OK, if it will make\n",
"you happy, the 486 is faster than the 040. BFD. Both architectures\n",
"are nearing then end of their lifetimes. And especially with the x86\n",
"architecture: good riddance.\n",
"\n",
">Besides, for 0 wait state performance, you'd need a cache anyway. I mean,\n",
">who uses a processor that runs at the speed of 80ns SIMMs? Note that this\n",
">memory speed corresponds to a clock speed of 12.5 MHz.\n",
"\n",
"The point being the processor speed is only one of many aspects of a\n",
"computers performance. Clock speed, processor, memory speed, CPU\n",
"architecture, I/O systems, even the application program all contribute \n",
"to the overall system performance.\n",
"\n",
">>And roughly, the 68040 is twice as fast at a given clock\n",
">>speed as is the 68030.\n",
">\n",
">Numbers?\n",
"\n",
"Look them up yourself.\n",
"\n",
"-- \n",
"Ray Fischer \"Convictions are more dangerous enemies of truth\n",
"ray@netcom.com than lies.\" -- Friedrich Nietzsche\n",
"\n",
"0.3491800997095306\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"From: mb4008@cehp11 (Morgan J Bullard)\n",
"Subject: Re: speeding up windows\n",
"Keywords: speed\n",
"Organization: University of Illinois at Urbana\n",
"Lines: 30\n",
"\n",
"djserian@flash.LakeheadU.Ca (Reincarnation of Elvis) writes:\n",
"\n",
">I have a 386/33 with 8 megs of memory\n",
"\n",
">I have noticed that lately when I use programs like WpfW or Corel Draw\n",
">my computer \"boggs\" down and becomes really sluggish!\n",
"\n",
">What can I do to increase performance? What should I turn on or off\n",
"\n",
">Will not loading wallpapers or stuff like that help when it comes to\n",
">the running speed of windows and the programs that run under it?\n",
"\n",
">Thanx in advance\n",
"\n",
">Derek\n",
"\n",
"1) make sure your hard drive is defragmented. This will speed up more than \n",
" just windows BTW. Use something like Norton's or PC Tools.\n",
"2) I _think_ that leaving the wall paper out will use less RAM and therefore\n",
" will speed up your machine but I could very will be wrong on this.\n",
"There's a good chance you've already done this but if not it may speed things\n",
"up. good luck\n",
"\t\t\t\tMorgan Bullard mb4008@coewl.cen.uiuc.edu\n",
"\t\t\t\t\t or mjbb@uxa.cso.uiuc.edu\n",
"\n",
">--\n",
">$_ /|$Derek J.P. Serianni $ E-Mail : djserian@flash.lakeheadu.ca $ \n",
">$\\'o.O' $Sociologist $ It's 106 miles to Chicago,we've got a full tank$\n",
">$=(___)=$Lakehead University $ of gas, half a pack of cigarettes,it's dark,and$\n",
">$ U $Thunder Bay, Ontario$ we're wearing sunglasses. -Elwood Blues $ \n",
"\n",
"0.26949927393886913\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n",
"----------------------------------------------------------------------------------------------------\n"
]
}
],
"source": [
"for i in range (1,5):\n",
" print(newsgroups[similarities.argsort()[0][-i]])\n",
" print(np.sort(similarities)[0,-i])\n",
" print('-'*100)\n",
" print('-'*100)\n",
" print('-'*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zadanie domowe\n",
"\n",
"\n",
"- Wybrać zbiór tekstowy, który ma conajmniej 5000 dokumentów.\n",
"- Na jego podstawie stworzyć wyszukiwarkę bazującą na OKAPI BM25, tzn. system który dla podanej frazy podaje kilka (5-10) posortowanych najbardziej pasujących dokumentów. Można korzystać z gotowych bibliotek do wektoryzacji dokumentów, należy jednak samemu zaimplementować OKAPI BM25.\n",
"- Znaleźć frazę (query), dla której wynik nie jest satysfakcjonujący.\n",
"- Poprawić wyszukiwarkę (np. poprzez zmianę preprocessingu tekstu, wektoryzer, zmianę parametrów algotytmu rankującego lub sam algorytm) tak, żeby zwracała satysfakcjonujące wyniki dla poprzedniej frazy\n",
"- prezentować pracę na następnych zajęciach (15.03) odpowiadając na pytania:\n",
" - jak wygląda zbiór i system wyszukiwania przed zmianami\n",
" - dla jakiej frazy wyniki są niesatysfakcjonujące (pokazać wyniki)\n",
" - jakie zmiany zostały naniesione\n",
" - jak wyglądają wyniki wyszukiwania po zmianach\n",
" - jak zmiany wpłynęły na wyniki (1-2 zdania)\n",
" \n",
"Prezentacja powinna być maksymalnie prosta i trwać maksymalnie 2-3 minuty.\n",
"punktów do zdobycia: 40\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}