similarity search

This commit is contained in:
Jakub Pokrywka 2021-06-21 13:09:59 +02:00
parent 9c30361526
commit 6016fadb68
3 changed files with 2328 additions and 0 deletions

1491
BBC News Train.csv Normal file

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,522 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sprawy organizacyjne- zaliczenie\n",
"\n",
"dodatkowy mini projekt/zadanie domowe? "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"true_relevance = np.asarray([[10, 2, 0, 1, 5]])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"scores = np.asarray([[9, 5, 2, 1, 1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"idealny- 9,2,5\n",
"\n",
"\n",
"nasz- 10,2,0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"dla p = 3"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"CG = 10 + 2 + 0 "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CG"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.log2(2)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"DCG = 10 / np.log2(2) + 2 / np.log2(3) + 0 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DCG"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"IDCG = 10 / np.log2(2) + 5 / np.log2(3) + 2 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14.154648767857287"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"IDCG"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DCG / IDCG"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ndcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pytanie:\n",
"jak to się odnosi do praktyki?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ZADANIE\n",
"\n",
"policz ręcznie CG, DCG, nDCG i sprawdź czy zgadza się to z scikit-learn:\n",
"dla k = 10"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"TRUE = np.asarray([[1,0,1,0,0,0,0,1,0,1,0,0,0,0,1]])\n",
"PREDICTED = np.asarray([[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7137727261090451"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ndcg_score(TRUE, PREDICTED, k=10)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0.7137727261090451"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ndcg_score(TRUE, PREDICTED, k=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wyszukiwarka TFIDF - jakie są plusy i minusy?\n",
"\n",
"W jaki sposób można zrobić lepszą wyszukiwarkę (wykorzystując transformery lub inne modele neuronowe)?\n",
"Jakie są potencjalne zalety i wady takiego podejścia?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-0.07142267 -0.07716201 -0.0304776 ... 0.01356028 -0.04016104\n",
" -0.02446149]\n",
" [-0.06508801 -0.06923407 -0.03735013 ... 0.01013562 -0.04027326\n",
" -0.02171571]]\n"
]
}
],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Zadanie \n",
"1. zainstaluj faiss i zrób tutorial: https://github.com/facebookresearch\n",
"2. wczytaj treści artykułów z BBC News Train.csv\n",
"3. Użyj któregoś z transformerów (możesz użyć biblioteki sentence-transformers) do stworzenia embeddingów dokumentów\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"r = pd.read_csv('BBC News Train.csv')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"DOCUMENTS = list(r.Text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"embeddings = model.encode(DOCUMENTS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"embeddings = model.encode(list(r.Text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pickle.dump(embeddings, open('embeddings.pkl','wb'))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"QUERY_STR = 'consumer electronics market'"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"query = model.encode([QUERY_STR])"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"embeddings = pickle.load(open('embeddings.pkl','rb'))"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"index = faiss.IndexFlatL2(embeddings.shape[1]) "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"index.add(np.ascontiguousarray(embeddings))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"D, I = index.search(query, 5) "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1363, 1371, 898, 744, 292]])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"I"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.3110979, 1.4027176, 1.4045265, 1.442167 , 1.442167 ]],\n",
" dtype=float32)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"D"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'internet boom for gift shopping cyberspace is becoming a very popular destination for christmas shoppers. forecasts predict that british people will spend £4bn buying gifts online during the festive season an increase of 64% on 2003. surveys also show that the average amount that people are spending is rising as is the range of goods that they are happy to buy online. savvy shoppers are also using the net to find the hot presents that are all but sold out in high street stores. almost half of the uk population now shop online according to figures collected by the interactive media in retail group which represents web retailers. about 85% of this group 18m people expect to do a lot of their christmas gift buying online this year reports the industry group. on average each shopper will spend £220 and britons lead europe in their affection for online shopping. almost a third of all the money spent online this christmas will come out of british wallets and purses compared to 29% from german shoppers and only 4% from italian gift buyers. james roper director of the imrg said shoppers were now much happier to buy so-called big ticket items such as lcd television sets and digital cameras. mr roper added that many retailers were working hard to reassure consumers that online shopping was safe and that goods ordered as presents would arrive in time for christmas. he advised consumers to give shops a little more time than usual to fulfil orders given that online buying is proving so popular. a survey by hostway suggests that many men prefer to shop online to avoid the embarrassment of buying some types of presents such as lingerie for wives and girlfriends. much of this online shopping is likely to be done during work time according to research carried out by security firm saint bernard software. the research reveals that up to two working days will be lost by staff who do their shopping via their work computer. worst offenders will be those in the 18-35 age bracket suggests the research who will spend up to five hours per week in december browsing and buying at online shops. iggy fanlo chief revenue officer at shopping.com said that the growing numbers of people using broadband was driving interest in online shopping. when you consider narrowband and broadband the conversion to sale is two times higher he said. higher speeds meant that everything happened much faster he said which let people spend time browsing and finding out about products before they buy. the behaviour of online shoppers was also changing he said. the single biggest reason people went online before this year was price he said. the number one reason now is convenience. very few consumers click on the lowest price he said. they are looking for good prices and merchant reliability. consumer comments and reviews were also proving popular with shoppers keen to find out who had the most reliable customer service. data collected by ebay suggests that some smart shoppers are getting round the shortages of hot presents by buying them direct through the auction site. according to ebay uk there are now more than 150 robosapiens remote control robots for sale via the site. the robosapiens toy is almost impossible to find in online and offline stores. similarly many shoppers are turning to ebay to help them get hold of the hard-to-find slimline playstation 2 which many retailers are only selling as part of an expensive bundle. the high demand for the playstation 2 has meant that prices for it are being driven up. in shops the ps2 is supposed to sell for £104.99. in some ebay uk auctions the price has risen to more than double this figure. many people are also using ebay to get hold of gadgets not even released in this country. the portable version of the playstation has only just gone on sale in japan yet some enterprising ebay users are selling the device to uk gadget fans.'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DOCUMENTS[1363]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

315
similarity search.ipynb Normal file
View File

@ -0,0 +1,315 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sprawy organizacyjne- zaliczenie\n",
"\n",
"dodatkowy mini projekt/zadanie domowe? "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"true_relevance = np.asarray([[10, 2, 0, 1, 5]])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"scores = np.asarray([[9, 5, 2, 1, 1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"idealny- 9,2,5\n",
"\n",
"\n",
"nasz- 10,2,0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"dla p = 3"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"CG = 10 + 2 + 0 "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CG"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.log2(2)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"DCG = 10 / np.log2(2) + 2 / np.log2(3) + 0 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DCG"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"IDCG = 10 / np.log2(2) + 5 / np.log2(3) + 2 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14.154648767857287"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"IDCG"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DCG / IDCG"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ndcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pytanie:\n",
"jak to się odnosi do praktyki?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ZADANIE\n",
"\n",
"policz ręcznie CG, DCG, nDCG i sprawdź czy zgadza się to z scikit-learn:\n",
"dla k = 10"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"TRUE = np.asarray([[1,0,1,0,0,0,0,1,0,1,0,0,0,0,1]])\n",
"PREDICTED = np.asarray([[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wyszukiwarka TFIDF - jakie są plusy i minusy?\n",
"\n",
"W jaki sposób można zrobić lepszą wyszukiwarkę (wykorzystując transformery lub inne modele neuronowe)?\n",
"Jakie są potencjalne zalety i wady takiego podejścia?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"model = SentenceTransformer('LaBSE')\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Zadanie \n",
"1. zainstaluj faiss i zrób tutorial: https://github.com/facebookresearch\n",
"2. wczytaj treści artykułów z BBC News Train.csv\n",
"3. Użyj któregoś z transformerów (możesz użyć biblioteki sentence-transformers) do stworzenia embeddingów dokumentów\n",
"4. wczytaj embeddingi do bazy danych faiss\n",
"5. wyszukaj query 'consumer electronics market'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}