{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading the dataset into a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\ryssta\\AppData\\Local\\anaconda3\\envs\\python39\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      " from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "import pandas as pd\n",
    "import torch\n",
    "from torch.nn.utils.rnn import pad_sequence\n",
    "\n",
    "# Download the dataset from the Hugging Face Hub and convert each split to a DataFrame\n",
    "hf_dataset = load_dataset(\"mteb/tweet_sentiment_extraction\")\n",
    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
    "test_df = pd.DataFrame(hf_dataset[\"test\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example text modification (the same operations should be applied to the test subset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 I`d have responded, if I were going\n",
      "1 Sooo SAD I will miss you here in San Diego!!!\n",
      "2 my boss is bullying me...\n",
      "Name: text, dtype: object\n",
      "0 I`D HAVE RESPONDED, IF I WERE GOING\n",
      "1 SOOO SAD I WILL MISS YOU HERE IN SAN DIEGO!!!\n",
      "2 MY BOSS IS BULLYING ME...\n",
      "Name: text, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# Rebuild the DataFrame from the raw training split so the cell can be re-run safely\n",
    "df = pd.DataFrame(hf_dataset[\"train\"])\n",
    "print(df[\"text\"].head(3))\n",
    "# Example transformation: upper-case every tweet\n",
    "df[\"text\"] = df[\"text\"].apply(lambda text_row: text_row.upper())\n",
    "print(df[\"text\"].head(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adding an embedding layer with a pad token (a \"filler\" used to pad the matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[0., 0., 0., 0., 0.]], grad_fn=<EmbeddingBackward0>)\n"
     ]
    }
   ],
   "source": [
    "padding_idx = 9\n",
    "# 10 tokens in the vocabulary, 5-dimensional vectors; the row at padding_idx is all zeros and receives no gradient updates\n",
    "embedding = torch.nn.Embedding(10, 5, padding_idx=padding_idx)\n",
    "\n",
    "# Look up the pad token to confirm that it maps to a zero vector\n",
    "pad_embedding = embedding(torch.LongTensor([padding_idx]))\n",
    "print(pad_embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Padding sequences with the pad_sequence function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tensor([[4, 7, 2, 9, 9, 9, 9],\n",
      "        [7, 3, 2, 7, 5, 3, 2],\n",
      "        [1, 7, 4, 2, 5, 9, 9]])\n",
      "Input lengths\n",
      "[3, 7, 5]\n"
     ]
    }
   ],
   "source": [
    "input_token_ids = [[4,7,2], [7,3,2,7,5,3,2], [1,7,4,2,5]]\n",
    "\n",
    "# Length of the longest sequence; pad_sequence pads up to this length on its own\n",
    "max_length = max(len(seq) for seq in input_token_ids)\n",
    "# Fill the shorter sequences with padding_idx so the batch forms a rectangular tensor\n",
    "padded_input = pad_sequence([torch.tensor(seq) for seq in input_token_ids], batch_first=True, padding_value=padding_idx)\n",
    "# The original lengths are needed later to pack the batch for the LSTM\n",
    "lengths = [len(seq) for seq in input_token_ids]\n",
    "\n",
    "print(padded_input)\n",
    "print(\"Input lengths\")\n",
    "print(lengths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Passing the embeddings through an LSTM layer (using the packing and padding functions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bidirectional LSTM: 5 input features, hidden size 5, 30 stacked layers\n",
    "lstm_layer = torch.nn.LSTM(input_size=5, hidden_size=5, num_layers=30, batch_first=True, bidirectional=True)\n",
    "\n",
    "embedded_inputs = embedding(padded_input)\n",
    "# Pack the batch so the LSTM skips the pad positions; enforce_sorted=False accepts unsorted lengths\n",
    "x = torch.nn.utils.rnn.pack_padded_sequence(embedded_inputs, lengths, batch_first=True, enforce_sorted=False)\n",
    "output, (hidden, cell) = lstm_layer(x)\n",
    "# Unpack back into a padded tensor of shape (batch, max_length, 2 * hidden_size)\n",
    "output, _ = torch.nn.utils.rnn.pad_packed_sequence(output, batch_first=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The variable hidden holds the final hidden state of every layer and direction, whereas the variable output holds the hidden states of the last layer only, at every time step"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The values to use for classification are (one of two options):\n",
    "* the concatenation of the second-to-last and last elements of the variable hidden (the network is bidirectional, so we want the last-layer states of both directions)\n",
    "* the first element of the variable output for each example (the outputs of both directions are already concatenated there, which is why the last dimension has size 10); note that at position 0 only the backward half is a final state, while the forward half has seen just the first token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([60, 3, 5])\n",
      "torch.Size([3, 7, 10])\n"
     ]
    }
   ],
   "source": [
    "# (num_layers * num_directions, batch, hidden_size)\n",
    "print(hidden.shape)\n",
    "# (batch, max_length, num_directions * hidden_size)\n",
    "print(output.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "torch.Size([6, 3, 5])\n",
    "torch.Size([3, 7, 10])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "python39",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}