{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Ekstrakcja informacji

\n", "

14. Ekstrakcja informacji seq2seq [ćwiczenia]

\n", "

Jakub Pokrywka (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SIMILARITY SEARCH\n", "1. zainstaluj faiss i zrób tutorial: https://github.com/facebookresearch/faiss\n", "2. wczytaj treści artykułów z BBC News Train.csv\n", "3. Użyj któregoś z transformerów (możesz użyć biblioteki sentence-transformers) do stworzenia embeddingów dokumentów\n", "4. wczytaj embeddingi do bazy danych faiss\n", "5. wyszukaj query 'consumer electronics market'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://www.kaggle.com/avishi/bbc-news-train-data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import pickle\n", "import numpy as np\n", "import faiss\n", "from sklearn.metrics import ndcg_score, dcg_score, average_precision_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install sentence-transformers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "sentences = [\"Hello World\", \"Hallo Welt\"]\n", "\n", "model = SentenceTransformer('LaBSE')\n", "embeddings = model.encode(sentences)\n", "print(embeddings)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "r = pd.read_csv('BBC News Train.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DOCUMENTS = list(r.Text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings = model.encode(DOCUMENTS)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "embeddings = model.encode(list(r.Text))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "QUERY_STR = 'consumer electronics market'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query = model.encode([QUERY_STR])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index = faiss.IndexFlatL2(embeddings.shape[1]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index.add(np.ascontiguousarray(embeddings))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "D, I = index.search(query, 5) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "I" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "D" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DOCUMENTS[1363]" ] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "subtitle": "14.Ekstrakcja informacji seq2seq[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }