moj-2024-ns-cw/04_zadania_helpful_codeblocks.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading the dataset into a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "import torch\n",
    "from torch.nn.utils.rnn import pad_sequence\n",
    "\n",
    "hf_dataset = load_dataset(\"mteb/tweet_sentiment_extraction\")\n",
    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
    "test_df = pd.DataFrame(hf_dataset[\"test\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example text modification (the same operations must be applied to the test subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0               I`d have responded, if I were going\n",
      "1     Sooo SAD I will miss you here in San Diego!!!\n",
      "2                         my boss is bullying me...\n",
      "Name: text, dtype: object\n",
      "0               I`D HAVE RESPONDED, IF I WERE GOING\n",
      "1     SOOO SAD I WILL MISS YOU HERE IN SAN DIEGO!!!\n",
      "2                         MY BOSS IS BULLYING ME...\n",
      "Name: text, dtype: object\n"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
    "print(df[\"text\"].head(3))\n",
    "df[\"text\"] = df[\"text\"].apply(lambda text_row: text_row.upper())\n",
    "print(df[\"text\"].head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adding an embedding layer with a pad token (a \"filler\" used to pad the matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[0., 0., 0., 0., 0.]], grad_fn=<EmbeddingBackward0>)\n"
     ]
    }
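   ],
   "source": [
    "padding_idx = 9\n",
    "# 10 embeddings of size 5; the row at padding_idx is all zeros and receives no gradient updates\n",
    "embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)\n",
    "\n",
    "# Look up the pad token: the result is a zero vector\n",
    "pad_embedding = embedding(torch.tensor([padding_idx]))\n",
    "print(pad_embedding)"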
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Padding sequences with the pad_sequence function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[4, 7, 2, 9, 9, 9, 9],\n",
      "        [7, 3, 2, 7, 5, 3, 2],\n",
      "        [1, 7, 4, 2, 5, 9, 9]])\n",
      "Input lengths\n",
      "[3, 7, 5]\n"
     ]
    }
   ],
   "source": [
    "input_token_ids = [[4,7,2], [7,3,2,7,5,3,2], [1,7,4,2,5]]\n",
    "\n",
    "max_length = max(len(seq) for seq in input_token_ids)\n",
    "padded_input = pad_sequence([torch.tensor(seq) for seq in input_token_ids], batch_first=True, padding_value=padding_idx)\n",
    "lengths = [len(seq) for seq in input_token_ids]\n",
    "\n",
    "print(padded_input)\n",
    "print(\"Input lengths\")\n",
    "print(lengths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Passing the embeddings through an LSTM layer (using the packing and padding helper functions)"
   ]
  },
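  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bidirectional LSTM: 5 input features, hidden size 5, 30 stacked layers\n",
    "lstm_layer = torch.nn.LSTM(input_size=5, hidden_size=5, num_layers=30, batch_first=True, bidirectional=True)\n",
    "\n",
    "embedded_inputs = embedding(padded_input)\n",
    "# Pack the batch so the LSTM skips pad positions; enforce_sorted=False allows unsorted lengths\n",
    "x = torch.nn.utils.rnn.pack_padded_sequence(embedded_inputs, lengths, batch_first=True, enforce_sorted=False)\n",
    "output, (hidden, cell) = lstm_layer(x)\n",
    "# Unpack back into a padded tensor of shape (batch, max_len, 2 * hidden_size)\n",
    "output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)"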
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The `hidden` variable holds the final hidden state of every layer (for both directions), whereas `output` holds the hidden states at every time step, but only from the last layer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The values to use for classification are one of the following two options:\n",
    "* the concatenation of the last and second-to-last elements of `hidden` (the network is bidirectional, so we want the last-layer states of both directions)\n",
    "* the first element of `output` for each example (the outputs of both directions are concatenated there automatically, which is why the last dimension has size 10); note that at this position only the backward half has read the whole sequence, while the forward half has seen just the first token"
   ]
  },
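  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Sketch: extracting both candidate feature tensors\n",
    "\n",
    "The two options listed above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions, not a reference solution: it uses a smaller LSTM (`num_layers=2` instead of 30) to keep it fast, and the names `features_hidden`, `features_first` and `forward_last` are hypothetical. It reuses the toy token ids and `padding_idx = 9` from the earlier cells. The two checks at the end compare each half of option 1 with the matching slice of `output`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence\n",
    "\n",
    "torch.manual_seed(0)\n",
    "\n",
    "padding_idx = 9\n",
    "hidden_size = 5\n",
    "embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)\n",
    "# Assumption: 2 layers instead of 30, only to keep the sketch small\n",
    "lstm = torch.nn.LSTM(input_size=5, hidden_size=hidden_size, num_layers=2, batch_first=True, bidirectional=True)\n",
    "\n",
    "token_ids = [[4, 7, 2], [7, 3, 2, 7, 5, 3, 2], [1, 7, 4, 2, 5]]\n",
    "lengths = [len(seq) for seq in token_ids]\n",
    "padded = pad_sequence([torch.tensor(seq) for seq in token_ids], batch_first=True, padding_value=padding_idx)\n",
    "\n",
    "packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True, enforce_sorted=False)\n",
    "packed_output, (hidden, cell) = lstm(packed)\n",
    "output, _ = pad_packed_sequence(packed_output, batch_first=True)\n",
    "\n",
    "# Option 1: last layer, forward (hidden[-2]) and backward (hidden[-1]) final states\n",
    "features_hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # shape (batch, 2 * hidden_size)\n",
    "\n",
    "# Option 2: first time step of output for every example\n",
    "features_first = output[:, 0, :]  # shape (batch, 2 * hidden_size)\n",
    "\n",
    "# The backward halves of both options are the same values\n",
    "print(torch.allclose(features_hidden[:, hidden_size:], features_first[:, hidden_size:]))\n",
    "\n",
    "# The forward half of option 1 sits at the last real (non-pad) position of output\n",
    "last_idx = torch.tensor(lengths) - 1\n",
    "forward_last = output[torch.arange(len(lengths)), last_idx, :hidden_size]\n",
    "print(torch.allclose(features_hidden[:, :hidden_size], forward_last))"
   ]
  },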
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([60, 3, 5])\n",
      "torch.Size([3, 7, 10])\n"
     ]
    }
   ],
   "source": [
    "print(hidden.shape)\n",
    "print(output.shape)"
   ]
  },
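  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expected shapes for the configuration above (30 layers, bidirectional, batch of 3, longest sequence of 7 tokens):\n",
    "* `hidden`: `torch.Size([60, 3, 5])`, i.e. `(num_layers * 2, batch, hidden_size)`\n",
    "* `output`: `torch.Size([3, 7, 10])`, i.e. `(batch, max_length, 2 * hidden_size)`"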
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python39",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}