Add zadania04

2024-05-17 15:52:00 +02:00 · 2024-05-17 15:52:00 +02:00 · 4abe74e453
commit 4abe74e453
parent f853ab3d68
2 changed files with 255 additions and 0 deletions
--- a/04_zadania.ipynb
+++ b/04_zadania.ipynb
@@ -0,0 +1,51 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Please place your solutions in new cells between the tasks\n",
+    "Tasks that require writing a program are to be written in Python"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Task 1 (150 points)\n",
+    "\n",
+    "Using the dataset https://huggingface.co/datasets/mteb/tweet_sentiment_extraction, build a model based on a bidirectional LSTM neural network (please use the ready-made LSTM module from the torch library) that classifies the sentiment of tweets. You may use pretrained embeddings or train your own, at your discretion. Choosing how to filter the texts (they often contain many different characters/symbols that may carry meaning) is also part of your task.\n",
+    "\n",
+    "Train the model on the \"train\" split of the dataset and evaluate it on the \"test\" split.\n",
+    "\n",
+    "The number of points depends on the accuracy achieved on the test set:\n",
+    "* 0-50% - 0 points\n",
+    "* 50-60% - 40 points\n",
+    "* 60-70% - 70 points\n",
+    "* 70-80% - 120 points\n",
+    "* 80-100% (or the 2 best results above 70%) - 170 points"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "python39",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
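For orientation, the kind of model Task 1 describes could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: `BiLSTMClassifier` and all of its parameters are hypothetical names that do not appear in the notebook, three sentiment classes are assumed, the sizes are toy values, and tokenization, training and evaluation are left out.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence


class BiLSTMClassifier(nn.Module):
    """Hypothetical minimal bidirectional LSTM classifier (illustration only)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, padding_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Final forward and backward states are concatenated, hence 2 * hidden_dim.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids, lengths):
        embedded = self.embedding(token_ids)
        # Packing lets the LSTM skip the pad positions of shorter sequences.
        packed = pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)
        # hidden[-2] is the last layer's forward state, hidden[-1] its backward state.
        features = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.classifier(features)


# Toy batch: two sequences padded with index 9 (assumed pad token id).
model = BiLSTMClassifier(vocab_size=10, embed_dim=5, hidden_dim=8, num_classes=3, padding_idx=9)
token_ids = torch.tensor([[4, 7, 2, 9, 9], [7, 3, 2, 7, 5]])
lengths = torch.tensor([3, 5])
logits = model(token_ids, lengths)
print(logits.shape)
```

The logits have one row per example and one column per assumed class, so they can be passed directly to `torch.nn.CrossEntropyLoss` together with integer labels.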
--- a/04_zadania_helpful_codeblocks.ipynb
+++ b/04_zadania_helpful_codeblocks.ipynb
@@ -0,0 +1,204 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Loading the dataset into a DataFrame"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "c:\\Users\\ryssta\\AppData\\Local\\anaconda3\\envs\\python39\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "from datasets import load_dataset\n",
+    "import pandas as pd\n",
+    "import torch\n",
+    "from torch.nn.utils.rnn import pad_sequence\n",
+    "\n",
+    "hf_dataset = load_dataset(\"mteb/tweet_sentiment_extraction\")\n",
+    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
+    "test_df = pd.DataFrame(hf_dataset[\"test\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Example text modification (the same operations must be applied to the test split)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0               I`d have responded, if I were going\n",
+      "1     Sooo SAD I will miss you here in San Diego!!!\n",
+      "2                         my boss is bullying me...\n",
+      "Name: text, dtype: object\n",
+      "0               I`D HAVE RESPONDED, IF I WERE GOING\n",
+      "1     SOOO SAD I WILL MISS YOU HERE IN SAN DIEGO!!!\n",
+      "2                         MY BOSS IS BULLYING ME...\n",
+      "Name: text, dtype: object\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
+    "print(df[\"text\"].head(3))\n",
+    "df[\"text\"] = df[\"text\"].apply(lambda text_row: text_row.upper())\n",
+    "print(df[\"text\"].head(3))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Adding an embedding layer with a pad token (a \"filler\" used to pad out the matrix)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tensor([[0., 0., 0., 0., 0.]], grad_fn=<EmbeddingBackward0>)\n"
+     ]
+    }
+   ],
+   "source": [
+    "padding_idx = 9\n",
+    "embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)\n",
+    "\n",
+    "pad_embedding = embedding(torch.tensor([padding_idx]))\n",
+    "print(pad_embedding)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Padding sequences with the pad_sequence function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "tensor([[4, 7, 2, 9, 9, 9, 9],\n",
+      "        [7, 3, 2, 7, 5, 3, 2],\n",
+      "        [1, 7, 4, 2, 5, 9, 9]])\n",
+      "Input lengths\n",
+      "[3, 7, 5]\n"
+     ]
+    }
+   ],
+   "source": [
+    "input_token_ids = [[4,7,2], [7,3,2,7,5,3,2], [1,7,4,2,5]]\n",
+    "\n",
+    "max_length = max(len(seq) for seq in input_token_ids)\n",
+    "padded_input = pad_sequence([torch.tensor(seq) for seq in input_token_ids], batch_first=True, padding_value=padding_idx)\n",
+    "lengths = [len(seq) for seq in input_token_ids]\n",
+    "\n",
+    "print(padded_input)\n",
+    "print(\"Input lengths\")\n",
+    "print(lengths)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Passing the embeddings through the LSTM layer (using the packing and padding helper functions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lstm_layer = torch.nn.LSTM(input_size=5, hidden_size=5, num_layers=2, batch_first=True, bidirectional=True)\n",
+    "\n",
+    "embedded_inputs = embedding(padded_input)\n",
+    "x = torch.nn.utils.rnn.pack_padded_sequence(embedded_inputs, lengths, batch_first=True, enforce_sorted=False)\n",
+    "output, (hidden, cell) = lstm_layer(x)\n",
+    "output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### The values to use for classification are (one of two options):\n",
+    "* the concatenation of the last and second-to-last elements of hidden (the network is bidirectional, so we want the last-layer states of both directions)\n",
+    "* the first element of each example in the output variable (the outputs of both directions are concatenated there automatically, which is why the last dimension is 10; note that at position 0 only the backward direction has processed the whole sequence)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "torch.Size([4, 3, 5])\n",
+      "torch.Size([3, 7, 10])\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(hidden.shape)\n",
+    "print(output.shape)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "python39",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
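The relationship between `hidden` and `output` described in the last markdown cell can be checked with a small experiment. The sketch below reuses the toy sizes and token ids from the notebook; the names `from_hidden`, `from_output`, `forward_last` and `backward_first` are hypothetical and introduced only for illustration. It assumes PyTorch's documented layout, where the first `hidden_size` features of `output` belong to the forward direction and the rest to the backward direction.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

torch.manual_seed(0)
hidden_size = 5
padding_idx = 9
embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)
lstm = torch.nn.LSTM(5, hidden_size, num_layers=2, batch_first=True, bidirectional=True)

sequences = [[4, 7, 2], [7, 3, 2, 7, 5, 3, 2], [1, 7, 4, 2, 5]]
lengths = torch.tensor([len(seq) for seq in sequences])
padded = pad_sequence(
    [torch.tensor(seq) for seq in sequences], batch_first=True, padding_value=padding_idx
)

packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True, enforce_sorted=False)
packed_output, (hidden, cell) = lstm(packed)
output, _ = pad_packed_sequence(packed_output, batch_first=True)

# Option 1: last layer's forward (hidden[-2]) and backward (hidden[-1]) final states.
from_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)

# The same features read from `output`: the forward half at each sequence's
# last real token, and the backward half at position 0.
batch_index = torch.arange(len(sequences))
forward_last = output[batch_index, lengths - 1, :hidden_size]
backward_first = output[:, 0, hidden_size:]
from_output = torch.cat([forward_last, backward_first], dim=1)

print(from_hidden.shape)
print(torch.allclose(from_hidden, from_output))
```

Taking `output[:, 0, :]` as a whole is still a usable feature vector, but its forward half has seen only the first token, which is why the comparison above reads the forward half at position `lengths - 1` instead.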