{
"cells": [
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pickle\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.metrics import ndcg_score, dcg_score, average_precision_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Organizational matters: course credit\n",
"\n",
"An additional mini project or homework assignment?"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Graded ground-truth relevance of five documents for one query\n",
"true_relevance = np.asarray([[10, 2, 0, 1, 5]])"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# Scores predicted by the ranker; sorting by them gives the evaluated order\n",
"scores = np.asarray([[9, 5, 2, 1, 1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ideal ranking (true relevances of the top 3): 10, 5, 2\n",
"\n",
"\n",
"our ranking (true relevances of the top 3): 10, 2, 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"for p = 3 (the top 3 results, i.e. `k=3` in scikit-learn)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Cumulative gain at 3: sum of the top-3 relevances, ignoring rank\n",
"CG = 10 + 2 + 0"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CG"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Discount for rank 1: log2(1 + 1) = 1, so the first result is not discounted\n",
"np.log2(2)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"# DCG@3: each relevance is divided by log2(rank + 1)\n",
"DCG = 10 / np.log2(2) + 2 / np.log2(3) + 0 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DCG"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"11.261859507142916"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# scikit-learn's DCG@3 should match the manual value above\n",
"dcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# Ideal DCG@3: the same sum over the best possible ordering (10, 5, 2)\n",
"IDCG = 10 / np.log2(2) + 5 / np.log2(3) + 2 / np.log2(4)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14.154648767857287"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"IDCG"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# nDCG@3: DCG normalized by the ideal DCG\n",
"DCG / IDCG"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7956297391650307"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# scikit-learn's nDCG@3 should match the manual ratio above\n",
"ndcg_score(true_relevance, scores, k=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question:\n",
"How does this relate to practice?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ZADANIE\n",
"\n",
"policz ręcznie CG, DCG, nDCG i sprawdź czy zgadza się to z scikit-learn:\n",
"dla k = 10"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"TRUE = np.asarray([[1,0,1,0,0,0,0,1,0,1,0,0,0,0,1]])\n",
"PREDICTED = np.asarray([[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A TF-IDF search engine: what are its pros and cons?\n",
"\n",
"How could you build a better search engine (using transformers or other neural models)?\n",
"What are the potential advantages and drawbacks of such an approach?"
]
},
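{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"# The same greeting in English and German\n",
"sentences = [\"Hello World\", \"Hallo Welt\"]\n",
"\n",
"# LaBSE is a multilingual sentence-embedding model\n",
"model = SentenceTransformer('LaBSE')\n",
"# encode() returns one embedding vector per input sentence\n",
"embeddings = model.encode(sentences)\n",
"print(embeddings)"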
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"1. Install faiss and work through the tutorial: https://github.com/facebookresearch\n",
"2. Load the article texts from BBC News Train.csv\n",
"3. Use one of the transformer models (you can use the sentence-transformers library) to create document embeddings\n",
"4. Load the embeddings into a faiss index\n",
"5. Search for the query 'consumer electronics market'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}