{ "cells": [ { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pickle\n", "import numpy as np\n", "import faiss\n", "from sklearn.metrics import ndcg_score, dcg_score, average_precision_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Organizational matters: course credit\n", "\n", "An additional mini project or homework assignment?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# Ground-truth relevance of each document\n", "true_relevance = np.asarray([[10, 2, 0, 1, 5]])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# Predicted scores; sorting them in descending order gives our ranking\n", "scores = np.asarray([[9, 5, 2, 1, 1]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ideal ordering (three highest true relevances): 10, 5, 2\n", "\n", "\n", "Our ordering (true relevances of the three top-scored documents): 10, 2, 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For p = 3 (the top 3 positions, i.e. `k=3` in scikit-learn)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# Cumulative gain: sum of the true relevances of our top-3 results\n", "CG = 10 + 2 + 0" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "CG" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.log2(2)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Discounted CG: divide the gain at position i (starting from 1) by log2(i + 1)\n", "DCG = 10 / np.log2(2) + 2 / np.log2(3) + 0 / np.log2(4)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11.261859507142916" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DCG" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "11.261859507142916" ] }, "execution_count": 35, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "dcg_score(true_relevance, scores, k=3)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "# Ideal DCG: the same formula applied to the best possible ordering (10, 5, 2)\n", "IDCG = 10 / np.log2(2) + 5 / np.log2(3) + 2 / np.log2(4)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14.154648767857287" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "IDCG" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7956297391650307" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# nDCG: DCG normalized by the ideal DCG, so a perfect ranking scores 1.0\n", "DCG / IDCG" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7956297391650307" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ndcg_score(true_relevance, scores, k=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question:\n", "How does this relate to practice?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TASK\n", "\n", "Compute CG, DCG, and nDCG by hand and check that the results match scikit-learn:\n", "for k = 10" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# Binary relevance labels and the predicted scores (already in descending order)\n", "TRUE = np.asarray([[1,0,1,0,0,0,0,1,0,1,0,0,0,0,1]])\n", "PREDICTED = np.asarray([[15,14,13,12,11,10,9,8,7,6,5,4,3,2,1]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A TF-IDF search engine: what are its pros and cons?\n", "\n", "How could a better search engine be built (using transformers or other neural models)?\n", "What are the potential advantages and disadvantages of such an approach?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "sentences = [\"Hello World\", \"Hallo Welt\"]\n", "\n", "# LaBSE is a multilingual sentence-embedding model\n", "model = SentenceTransformer('LaBSE')\n", "# encode() returns one embedding vector per input sentence\n", "embeddings = model.encode(sentences)\n", "print(embeddings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task\n", "1. Install faiss and work through the tutorial: https://github.com/facebookresearch\n", "2. Load the article texts from BBC News Train.csv\n", "3. Use one of the transformer models (you can use the sentence-transformers library) to create document embeddings\n", "4. Load the embeddings into a faiss index\n", "5. Search for the query 'consumer electronics market'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 4 }