{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wczytanie zbioru danych do postaci DataFrame" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\ryssta\\AppData\\Local\\anaconda3\\envs\\python39\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from datasets import load_dataset\n", "import pandas as pd\n", "import torch\n", "from torch.nn.utils.rnn import pad_sequence\n", "\n", "hf_dataset = load_dataset(\"mteb/tweet_sentiment_extraction\")\n", "df = pd.DataFrame(hf_dataset[\"train\"])\n", "test_df = pd.DataFrame(hf_dataset[\"test\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Przykładowa modyfikacja tekstu (analogiczne operacje należy wykonać dla podzbioru test)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 I`d have responded, if I were going\n", "1 Sooo SAD I will miss you here in San Diego!!!\n", "2 my boss is bullying me...\n", "Name: text, dtype: object\n", "0 I`D HAVE RESPONDED, IF I WERE GOING\n", "1 SOOO SAD I WILL MISS YOU HERE IN SAN DIEGO!!!\n", "2 MY BOSS IS BULLYING ME...\n", "Name: text, dtype: object\n" ] } ], "source": [ "df = pd.DataFrame(hf_dataset[\"train\"])\n", "print(df[\"text\"].head(3))\n", "df[\"text\"] = df[\"text\"].apply(lambda text_row: text_row.upper())\n", "print(df[\"text\"].head(3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dodanie warstwy embedding z tokenem pad (czyli \"zapychaczem\" służącym do wypełniania macierzy)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[0., 0., 0., 0., 0.]], grad_fn=)\n" ] } ], "source": [ "padding_idx = 9\n", "embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)\n", "\n", "pad_embedding = embedding(torch.LongTensor([9]))\n", "print(pad_embedding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Padowanie sekwencji przy pomocy funkcji" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[4, 7, 2, 9, 9, 9, 9],\n", " [7, 3, 2, 7, 5, 3, 2],\n", " [1, 7, 4, 2, 5, 9, 9]])\n", "Długości inputów\n", "[3, 7, 5]\n" ] } ], "source": [ "input_token_ids = [[4,7,2], [7,3,2,7,5,3,2], [1,7,4,2,5]]\n", "\n", "max_length = max(len(seq) for seq in input_token_ids)\n", "padded_input = pad_sequence([torch.tensor(seq) for seq in input_token_ids], batch_first=True, padding_value=padding_idx)\n", "lengths = [len(seq) for seq in input_token_ids]\n", "\n", "print(padded_input)\n", "print(\"Długości inputów\")\n", "print(lengths)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Przepuszczanie embeddingów przez warstwę LSTM (przy pomocy funkcji padujących)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "lstm_layer = torch.nn.LSTM(5, 5, 30, batch_first=True, bidirectional=True)\n", "\n", "embedded_inputs = embedding(padded_input)\n", "x = torch.nn.utils.rnn.pack_padded_sequence(embedded_inputs, lengths, batch_first=True, enforce_sorted=False)\n", "output, (hidden, cell) = lstm_layer(x)\n", "output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Zmienna hidden zawiera wszystkie ukryte stany na przestrzeni wszystkich warstw, natomiast zmienna output zawiera jedynie stany w ostatniej warstwie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wartościami, które należy wykorzystać do klasyfikacji to (jedna z dwóch opcji):\n", "* konkatenacja ostatniego i przedostatniego elementu ze zmiennej hidden (sieć jest dwukierunkowa, więc chcemy się dostać do stanów z ostatniej warstwy jednego oraz drugiego kierunku)\n", "* pierwszy element dla każdego przykładu ze zmiennej out (tam jest automatycznie skonkatenowany output dla obu kierunków, dlatego mamy na końcu rozmiar 10)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([60, 3, 5])\n", "torch.Size([3, 7, 10])\n" ] } ], "source": [ "print(hidden.shape)\n", "print(output.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "torch.Size([6, 3, 5])\n", "torch.Size([3, 7, 10])" ] } ], "metadata": { "kernelspec": { "display_name": "python39", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 2 }