{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"# Classification in Python"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 1** Which of the following problems is a regression problem, and which is a classification problem?\n",
" 1. Checking whether a message is spam.\n",
" 1. Predicting a rating (from 1 to 5 stars) based on a comment.\n",
" 1. Digit OCR: recognizing a digit from an image.\n",
" \n",
" If a problem is a classification problem, what are the classes?"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"1. classification\n",
"2. can be treated as either classification or regression. If treated as regression, the real-valued output has to be mapped to one of {1,2,3,4,5}\n",
"3. classification"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## Metrics for classification"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"There are many measures (metrics) that can be used to assess the quality of a model. As with linear regression, we need two lists: the list of correct classes and the list of predictions from the model. The most popular metric is accuracy, defined as follows:\n",
" $$ACC = \\frac{k}{N}$$ \n",
" \n",
" where: \n",
" * $k$ is the number of correctly classified cases,\n",
" * $N$ is the size of the test set."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task** Write a function that takes two lists as parameters (the list of correct classes and the classifier output) and returns the accuracy."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"ACC: 0.4\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
"source": [
"def accuracy_measure(true, predicted):\n",
"    # fraction of positions where the true label matches the prediction\n",
"    return sum([1 if t==p else 0 for t,p in zip(true, predicted)]) / len(true)\n",
"\n",
"true_label = [1, 1, 1, 0, 0]\n",
"predicted = [0, 1, 0, 1, 0]\n",
"print(\"ACC:\", accuracy_measure(true_label, predicted))"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## The $k$-nearest neighbors classifier *(KNN)*"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"The [KNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) classifier, introduced in the last lecture, is very intuitive. The idea behind it is simple: to classify a new object, we look for the $k$ most similar examples among the training data and decide on its class based on them (e.g. by taking the majority vote)."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Example 1** The task is to assign objects to one of two classes: triangles or squares. The object under consideration is marked with a green circle. Taking $k=3$, the neighbors include 2 triangles and 1 square, so the object should be classified as a triangle. How does the situation change if we take $k=5$?\n",
"\n",
"![Example 1](./KnnClassification.svg.png)\n",
"\n",
"(The image comes from https://pl.wikipedia.org/wiki/K_najbli%C5%BCszych_s%C4%85siad%C3%B3w )"
]
|
|||
|
},
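{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is a minimal from-scratch sketch of this idea, added for illustration (it is not part of the original lecture code): Euclidean distance plus a majority vote among the $k$ closest training points. The toy data and the function name are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"import math\n",
"\n",
"def knn_predict(train_points, train_labels, x, k=3):\n",
"    # distance from x to every training point (Euclidean)\n",
"    distances = sorted((math.dist(p, x), label) for p, label in zip(train_points, train_labels))\n",
"    # majority vote among the k closest points\n",
"    k_labels = [label for _, label in distances[:k]]\n",
"    return Counter(k_labels).most_common(1)[0][0]\n",
"\n",
"# toy data mirroring the picture: squares vs triangles in 2D\n",
"points = [(1, 1), (2, 1), (6, 5), (7, 6), (6, 7)]\n",
"labels = ['square', 'square', 'triangle', 'triangle', 'triangle']\n",
"print(knn_predict(points, labels, (6, 6), k=3))"
]
},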
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## The Iris dataset"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"*Iris* is a classic machine-learning dataset, created in 1936. It contains measurements of 150 plant specimens, each belonging to one of 3 varieties."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 2** Load the *Iris* dataset, stored in the file ``iris.data``, into the variable ``data``. It is a csv file.\n",
"\n",
"The columns are as follows:\n",
"\n",
"1. sepal length in cm\n",
"2. sepal width in cm\n",
"3. petal length in cm\n",
"4. petal width in cm\n",
"5. class: \n",
"    * Iris Setosa\n",
"    * Iris Versicolour\n",
"    * Iris Virginica"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"5.1,3.5,1.4,0.2,Iris-setosa\r\n",
|
|||
|
"4.9,3.0,1.4,0.2,Iris-setosa\r\n",
|
|||
|
"4.7,3.2,1.3,0.2,Iris-setosa\r\n",
|
|||
|
"4.6,3.1,1.5,0.2,Iris-setosa\r\n",
|
|||
|
"5.0,3.6,1.4,0.2,Iris-setosa\r\n",
|
|||
|
"5.4,3.9,1.7,0.4,Iris-setosa\r\n",
|
|||
|
"4.6,3.4,1.4,0.3,Iris-setosa\r\n",
|
|||
|
"5.0,3.4,1.5,0.2,Iris-setosa\r\n",
|
|||
|
"4.4,2.9,1.4,0.2,Iris-setosa\r\n",
|
|||
|
"4.9,3.1,1.5,0.1,Iris-setosa\r\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"!head iris.data"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"import pandas as pd\n",
|
|||
|
"data = pd.read_csv('iris.data', names=('sepal_length', 'sepal_width', 'petal_length', 'petal_width','class'),index_col=False)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/html": [
|
|||
|
"<div>\n",
|
|||
|
"<style scoped>\n",
|
|||
|
" .dataframe tbody tr th:only-of-type {\n",
|
|||
|
" vertical-align: middle;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe tbody tr th {\n",
|
|||
|
" vertical-align: top;\n",
|
|||
|
" }\n",
|
|||
|
"\n",
|
|||
|
" .dataframe thead th {\n",
|
|||
|
" text-align: right;\n",
|
|||
|
" }\n",
|
|||
|
"</style>\n",
|
|||
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|||
|
" <thead>\n",
|
|||
|
" <tr style=\"text-align: right;\">\n",
|
|||
|
" <th></th>\n",
|
|||
|
" <th>sepal_length</th>\n",
|
|||
|
" <th>sepal_width</th>\n",
|
|||
|
" <th>petal_length</th>\n",
|
|||
|
" <th>petal_width</th>\n",
|
|||
|
" <th>class</th>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </thead>\n",
|
|||
|
" <tbody>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>0</th>\n",
|
|||
|
" <td>5.1</td>\n",
|
|||
|
" <td>3.5</td>\n",
|
|||
|
" <td>1.4</td>\n",
|
|||
|
" <td>0.2</td>\n",
|
|||
|
" <td>Iris-setosa</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>1</th>\n",
|
|||
|
" <td>4.9</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>1.4</td>\n",
|
|||
|
" <td>0.2</td>\n",
|
|||
|
" <td>Iris-setosa</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>2</th>\n",
|
|||
|
" <td>4.7</td>\n",
|
|||
|
" <td>3.2</td>\n",
|
|||
|
" <td>1.3</td>\n",
|
|||
|
" <td>0.2</td>\n",
|
|||
|
" <td>Iris-setosa</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>3</th>\n",
|
|||
|
" <td>4.6</td>\n",
|
|||
|
" <td>3.1</td>\n",
|
|||
|
" <td>1.5</td>\n",
|
|||
|
" <td>0.2</td>\n",
|
|||
|
" <td>Iris-setosa</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>4</th>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>3.6</td>\n",
|
|||
|
" <td>1.4</td>\n",
|
|||
|
" <td>0.2</td>\n",
|
|||
|
" <td>Iris-setosa</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>...</th>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" <td>...</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>145</th>\n",
|
|||
|
" <td>6.7</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>5.2</td>\n",
|
|||
|
" <td>2.3</td>\n",
|
|||
|
" <td>Iris-virginica</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>146</th>\n",
|
|||
|
" <td>6.3</td>\n",
|
|||
|
" <td>2.5</td>\n",
|
|||
|
" <td>5.0</td>\n",
|
|||
|
" <td>1.9</td>\n",
|
|||
|
" <td>Iris-virginica</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>147</th>\n",
|
|||
|
" <td>6.5</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>5.2</td>\n",
|
|||
|
" <td>2.0</td>\n",
|
|||
|
" <td>Iris-virginica</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>148</th>\n",
|
|||
|
" <td>6.2</td>\n",
|
|||
|
" <td>3.4</td>\n",
|
|||
|
" <td>5.4</td>\n",
|
|||
|
" <td>2.3</td>\n",
|
|||
|
" <td>Iris-virginica</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" <tr>\n",
|
|||
|
" <th>149</th>\n",
|
|||
|
" <td>5.9</td>\n",
|
|||
|
" <td>3.0</td>\n",
|
|||
|
" <td>5.1</td>\n",
|
|||
|
" <td>1.8</td>\n",
|
|||
|
" <td>Iris-virginica</td>\n",
|
|||
|
" </tr>\n",
|
|||
|
" </tbody>\n",
|
|||
|
"</table>\n",
|
|||
|
"<p>150 rows × 5 columns</p>\n",
|
|||
|
"</div>"
|
|||
|
],
|
|||
|
"text/plain": [
|
|||
|
" sepal_length sepal_width petal_length petal_width class\n",
|
|||
|
"0 5.1 3.5 1.4 0.2 Iris-setosa\n",
|
|||
|
"1 4.9 3.0 1.4 0.2 Iris-setosa\n",
|
|||
|
"2 4.7 3.2 1.3 0.2 Iris-setosa\n",
|
|||
|
"3 4.6 3.1 1.5 0.2 Iris-setosa\n",
|
|||
|
"4 5.0 3.6 1.4 0.2 Iris-setosa\n",
|
|||
|
".. ... ... ... ... ...\n",
|
|||
|
"145 6.7 3.0 5.2 2.3 Iris-virginica\n",
|
|||
|
"146 6.3 2.5 5.0 1.9 Iris-virginica\n",
|
|||
|
"147 6.5 3.0 5.2 2.0 Iris-virginica\n",
|
|||
|
"148 6.2 3.4 5.4 2.3 Iris-virginica\n",
|
|||
|
"149 5.9 3.0 5.1 1.8 Iris-virginica\n",
|
|||
|
"\n",
|
|||
|
"[150 rows x 5 columns]"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"data"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 3** Answer the following questions:\n",
" 1. Which attributes are the inputs, and which column holds the output class?\n",
" 1. How many different classes are there? Print them to the screen.\n",
" 1. What is the mean value of the ``sepal_length`` column? How does the mean behave if we compute it for each class separately?"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"1. the inputs are sepal_length, sepal_width, petal_length and petal_width. The output class is in the class column"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
"source": [
"data['class'].unique()\n",
"# 3 classes"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"5.843333333333334"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"data['sepal_length'].mean()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"class\n",
|
|||
|
"Iris-setosa 5.006\n",
|
|||
|
"Iris-versicolor 5.936\n",
|
|||
|
"Iris-virginica 6.588\n",
|
|||
|
"Name: sepal_length, dtype: float64"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"data.groupby('class')['sepal_length'].mean()"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"Let's train a *KNN* classifier, but first let's prepare the data. The ``train_test_split`` function splits a given dataset into two parts. We will use it to split the data into a training set (67%) and a test set (33%); the ``test_size`` parameter controls the split."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.model_selection import train_test_split\n",
|
|||
|
"\n",
|
|||
|
"X = data.loc[:, 'sepal_length':'petal_width']\n",
|
|||
|
"Y = data['class']\n",
|
|||
|
"\n",
|
|||
|
"(train_X, test_X, train_Y, test_Y) = train_test_split(X, Y, test_size=0.33, random_state=42)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"Training the classifier looks very similar to training a linear regression model:"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
|||
|
"text/plain": [
|
|||
|
"KNeighborsClassifier(n_neighbors=3)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.neighbors import KNeighborsClassifier\n",
|
|||
|
"\n",
|
|||
|
"model = KNeighborsClassifier(n_neighbors=3)\n",
|
|||
|
"model.fit(train_X, train_Y)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"With a trained model, we can use it to make predictions on the test set."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
"text": [
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n",
"Predicted: Iris-setosa, Actual: Iris-setosa\n",
"Predicted: Iris-virginica, Actual: Iris-virginica\n",
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n",
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n",
"Predicted: Iris-setosa, Actual: Iris-setosa\n",
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n",
"Predicted: Iris-virginica, Actual: Iris-virginica\n",
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n",
"Predicted: Iris-versicolor, Actual: Iris-versicolor\n"
]
|
|||
|
}
|
|||
|
],
|
|||
"source": [
"predicted = model.predict(test_X)\n",
"\n",
"# compare the first 10 predictions with the true labels\n",
"for i in range(10):\n",
"    print(\"Predicted: {}, Actual: {}\".format(predicted[i], test_Y.reset_index()['class'][i]))\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"We can compute the *accuracy*:"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"0.98\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.metrics import accuracy_score\n",
|
|||
|
"\n",
|
|||
|
"print(accuracy_score(test_Y, predicted))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 4** Train a new model ``model_2``, changing the number of neighbors to 20. Did the results change?"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 44,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"0.98\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.neighbors import KNeighborsClassifier\n",
|
|||
|
"\n",
|
|||
|
"model = KNeighborsClassifier(n_neighbors=10)\n",
|
|||
|
"model.fit(train_X, train_Y)\n",
|
|||
|
"predicted = model.predict(test_X)\n",
|
|||
|
"\n",
|
|||
|
"print(accuracy_score(test_Y, predicted))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 5** Train a model with $k=1$ and run the validation on the training set instead of the test set. What results did you get? Is this a coincidence? Why does this happen?"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 45,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"1.0\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.neighbors import KNeighborsClassifier\n",
|
|||
|
"\n",
|
|||
|
"model = KNeighborsClassifier(n_neighbors=1)\n",
|
|||
|
"model.fit(train_X, train_Y)\n",
|
|||
|
"predicted = model.predict(train_X)\n",
|
|||
|
"\n",
|
|||
|
"print(accuracy_score(train_Y, predicted))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## Cross-validation"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"The *Iris* dataset is very small. Carving a test set out of it leads to high variance in the results, i.e. depending on how the test set is chosen, the results can differ a lot. To remedy this, we use [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). The algorithm looks as follows:\n",
" 1. Split the dataset into $n$ parts (randomly).\n",
" 1. For every $i$ from 1 to $n$:\n",
"     1. Take the $i$-th part as the test set and the remaining data as the training set.\n",
"     1. Train the model on the training set.\n",
"     1. Run the model on the test data and record the results.\n",
" 1. The final result is the mean of the $n$ partial results. \n",
" \n",
" In Python this is done by the ``cross_val_score`` function, which takes (in order) the model, the X data and the Y data as parameters. We can set the ``cv`` parameter, which controls how many parts the dataset is split into, and the ``scoring`` parameter, which selects the metric.\n",
" \n",
" In the example below we split the dataset into 10 parts (10-fold cross-validation) and set accuracy as the metric. A hand-rolled sketch of the same procedure follows the example."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 46,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
"text": [
"Cross-validation score: 0.9666666666666668\n"
]
|
|||
|
}
|
|||
|
],
|
|||
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"k = 10\n",
"knn = KNeighborsClassifier(n_neighbors=k)\n",
"scores = cross_val_score(knn, X, Y, cv=10, scoring='accuracy')\n",
"print(\"Cross-validation score:\", scores.mean())"
]
|
|||
|
},
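{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the algorithm above concrete, the cell below is a hand-rolled sketch of the same procedure (added for illustration, not part of the original material). It uses ``KFold`` with shuffling, while ``cross_val_score`` defaults to stratified folds for classifiers, so the mean score may differ slightly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"import numpy as np\n",
"\n",
"# 1. split the data into 10 parts (randomly)\n",
"kf = KFold(n_splits=10, shuffle=True, random_state=42)\n",
"partial_scores = []\n",
"\n",
"# 2. for every part: train on the rest, test on that part, record the score\n",
"for train_idx, test_idx in kf.split(X):\n",
"    knn = KNeighborsClassifier(n_neighbors=10)\n",
"    knn.fit(X.iloc[train_idx], Y.iloc[train_idx])\n",
"    partial_scores.append(accuracy_score(Y.iloc[test_idx], knn.predict(X.iloc[test_idx])))\n",
"\n",
"# 3. the final result is the mean of the partial results\n",
"print(np.mean(partial_scores))"
]
},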
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"**Task 6** The $k$-nearest neighbors classifier has one parameter: $k$, which sets the number of neighbors used during classification. As we have seen, the choice of $k$ can matter a lot for the quality of the classifier. Do the following:\n",
" 1. Create a list ``neighbors`` of all odd numbers from 1 to 50.\n",
" 1. For every element ``i`` of ``neighbors``, build a *KNN* classifier with ``i`` neighbors. Then run cross-validation (with the same parameters as above) and store the results in the ``cv_scores`` list.\n",
" 1. Find the ``k`` for which the classifier achieves the best score."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 66,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
"source": [
"neighbors = list(range(1, 50, 2))\n",
"cv_scores = list()\n",
"max_score = -1\n",
"for neighbor_num in neighbors:\n",
"    knn = KNeighborsClassifier(n_neighbors=neighbor_num)\n",
"    score = cross_val_score(knn, X, Y, cv=10, scoring='accuracy').mean()\n",
"    # remember the best score and the k that achieved it\n",
"    max_score = score if score > max_score else max_score\n",
"    neighbor_num_best = neighbor_num if score == max_score else neighbor_num_best\n",
"    cv_scores.append(score)"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 67,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"13\n",
|
|||
|
"0.9800000000000001\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"print(neighbor_num_best)\n",
|
|||
|
"print(max_score)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"A plot of the error rate as a function of the number of neighbors."
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 68,
|
|||
|
"metadata": {
|
|||
|
"scrolled": true
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAkAAAAGxCAYAAACKvAkXAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABbOklEQVR4nO3dd3hUZdo/8O/MZErqhJBe6CWVBFIwoIKaFQQpghhYVxTRXV0QNKuv4C7ivv4UfXdRcGV12XfV1101iAgqKsVQVFpIgySEFkp6hfQ+c35/JDMkJoFkMpMz5fu5rrkuOHPmzD0nwNw8z/3cj0QQBAFERERENkQqdgBEREREg40JEBEREdkcJkBERERkc5gAERERkc1hAkREREQ2hwkQERER2RwmQERERGRzmAARERGRzbETOwBzpdVqUVRUBGdnZ0gkErHDISIioj4QBAG1tbXw9fWFVNr7OA8ToF4UFRUhICBA7DCIiIjIAPn5+fD39+/1eSZAvXB2dgbQfgNdXFxEjoaIiIj6oqamBgEBAfrv8d4wAeqFbtrLxcWFCRAREZGFuVX5CougiYiIyOYwASIiIiKbwwSIiIiIbA4TICIiIrI5TICIiIjI5jABIiIiIpvDBIiIiIhsDhMgIiIisjlMgIiIiMjmMAEiIiIim8MEiIiIiGwOEyAiIiKyOdwMlYiIyEpdLKvDB0cuo6VNa5Tr+bnaY9U9YyGT3nyjUUvABIiIiMhKvb3/PL7NLDbqNUd7OmFuuK9RrykGJkBERERW6lRBFQDgkduGw9fVfkDXSsu7jv1nSrHtZB4TICIiIjJPVQ0tKLjeCAB4/t7xUDvIB3S9/GsN+CGnFEcuViKvsgHDhjoYI0zRsAiaiIjICmUX1QAAAtzsB5z8tF/HAbePcQcAfJ6SP+DriY0JEBERkRXKKqwGAIT5qY12zcXRwwAA21Pz0aYxTmG1WJgAERERWaGsjhGgEF/jJUC/CvaCm6MCpTXNOHy+3GjXFQMTICIiIiuU3TECFGrEESCFnRQLJ/kBAD5LtuxpMCZAREREVqa2qRWXKuoBACG+Lka9dnx0AADg4LkylNY0GfXag4kJEBERkZXJKa4FAPioVXB3Uhr12mM8nRE1fAg0WgFfpBYY9dqDiQkQERGRldEVQBuz/qcz3SjQ5yn50GoFk7yHqTEBIiIisjJZRbr6H+NOf+nMnuADZ6UdrlY24PjlSpO8h6kxASIiIrIyuhGgUBONADko7DA3or0bdKKFFkMzASIiIrIijS0aXCyrA2DcFWC/pOsJtCerBNfrW0z2PqbCBIiIiMiK5JTUQCsA7k4KeLkYtwC6s1A/FwT7uKBFo8WujEKTvY+pMAEiIiKyIp37/0gkEpO9j0QiweKY9mLoxOR8CIJlFUMzASIiIrIiWYXtHaBNVf/T2bwIPyjtpDhXWouM/CqTv58xMQEiIiKyIqZeAdaZ2l6O2WE+AIBtJy2rGJoJEBERkZVobtPgfGl7E0RT9QD6JV1PoK9PFaGuuW1Q3tMYmAARERFZiQuldWjVCFDby+E/xH5Q3jNmpBtGujuioUWDb08XDcp7GgMTICIiIiuh7//j52LSAujOJBKJfhQo0YKmwZgAERERWQl9/c8gTX/pLJzkDzupBOl5VThXUjuo720oJkBERERWIrNjBViICRsg9sTDWYm4IC8AQOLJvEF9b0MxASIiIrICrRotcop1S+BNvwLsl+I7egLtTC9EU6tm0N+/v5gAERERWYHc8jq0tGnhpLTDiKGOg/7+d471gI9ahaqGVuw7Uzro799fTICIiIisgK4BYrCvC6TSwSmA7kwmlWBRVPso0DYLmAZjAkRERGQFTL0DfF88FOUPiQQ4crESVyvrRYujL5gAERERWYHsQewA3Rv/IQ64Y6wHAODzFPNeEs8EiIiIyMJptQKyizoKoAd5BdgvLe7oCbQ9pQBtGq2osdwMEyAiIiILd7myHg0tGqjkUoxyH/wC6M7igrzg5qhAWW0zDp0rFzWWm2ECREREZOF09T9BPi6wk4n71a6wk2LhJD8A5t0ZmgkQERGRhdNPf4lYAN2ZbmuMg+fKUFrTJHI0PWMCREREZOEyC8QvgO5sjKczokcMgUYr4IvUArHD6RETICIiIgsmCMKNPcBELoDuLD56GABg28l8aLWCyNF0xwSIiIjIguVfa0RtUxsUMinGejqLHY7erDBvOCvtkHetAccvVYodTjdMgIiIiCyYbvRnvLczFHbm87XuoLDD3AhfAOZZDG0+d4qIiIj6Td8B2kzqfzpb3DENtierBNfrW0SOpismQERERBYsq2MFWIiZrADrLMxfjRBfF7RotNiZXih2OF0wASIiIrJQgiAgu9D8CqA703WG3nYyH4JgPsXQTICIiIgsVElNEyrrWyCTShDobT4F0J3NjfCD0k6Kc6W1yMivEjscPSZAREREFiqrsH36a6ynE1RymcjR9ExtL8fsMB8A7aNA5oIJEBERkYXSFUCbY/1PZ7rO0F+fKkJdc5vI0bRjAkRERGShzHkFWGcxI90wyt0RDS0a7D5VJHY4AJgAERERWSxdD6AwMy2A1pFIJPpRIHPpCcQEiIiIyAKV1TahtKYZEkn7LvDmbsEkf9hJJcjIr8LZkhqxwzGPBGjLli0YMWIEVCoVJk+ejOTk5Juev337dgQGBkKlUiEsLAzfffddt3NycnIwd+5cqNVqODo6Ijo6Gnl5eab6CERERINKtwP8KHdHOCrtRI7m1jyclYgL8gJgHsXQoidA27ZtQ0JCAtavX4+0tDSEh4djxowZKCsr6/H8o0ePYsmSJVi+fDnS09Mxf/58zJ8/H1lZWfpzcnNzcfvttyMwMBCHDh3C6dOnsW7dOqhUqsH6WERERCZl7v1/ehIf0z4NtjO9EE2tGlFjkQgidyWaPHkyoqOj8e677wIAtFotAgIC8Mwzz2DNmjXdzo+Pj0d9fT12796tP3bbbbchIiIC77//PgBg8eLFkMvl+Pe//21wXDU1NVCr1aiuroaLi/kPLRIRkW156t+p2JNdgj/OCsKTd44SO5w+0WgF3PHmARRVN2Hz4gjMi/Az+nv09ftb1BGglpYWpKamIi4uTn9MKpUiLi4Ox44d6/E1x44d63I+AMyYMUN/vlarxbfffotx48ZhxowZ8PT0xOTJk7Fr166bxtLc3IyampouDyIiInOlK4AOMfMVYJ3JpBIsirrRGVpMoiZAFRUV0Gg08PLy6nLcy8sLJSUlPb6mpKTkpueXlZWhrq4Ob7zxBmbOnIl9+/bhgQcewIIFC3D48OFeY9mwYQPUarX+ERAQMMBPR0REZBpVDS0ouN4IwPx7AP3Soih/SCTA0dxKXK2sFy0O86+a6ietVgsAmDdvHp577jkAQEREBI4ePYr3338f06ZN6/F1a9euRUJCgv73NTU1TIKIiMgs6Qqgh7k5QG0vFzma/vEf4oAV08cg0McZ3mrxanNFTYDc3d0hk8lQWlra5XhpaSm8vb17fI23t/dNz3d3d4ednR2Cg4O7n
BMUFISff/6511iUSiWUSqUhH4OIiGhQWUoDxN48P2O82CGIOwWmUCgQGRmJpKQk/TGtVoukpCTExsb2+JrY2Ngu5wPA/v379ecrFApER0fj3LlzXc45f/48hg8fbuRPQERENPgyLXAFmLkRfQosISEBjz76KKKiohATE4NNmzahvr4ey5YtAwAsXboUfn5+2LBhAwBg9erVmDZtGjZu3IjZs2cjMTERKSkp2Lp1q/6aL7zwAuLj43HnnXfirrvuwp49e/DNN9/g0KFDYnxEIiIio9JNgYVaWP2PORE9AYqPj0d5eTlefvlllJSUICIiAnv27NEXOufl5UEqvTFQNWXKFHz66af405/+hJdeegljx47Frl27EBoaqj/ngQcewPvvv48NGzZg1apVGD9+PHbs2IHbb7990D8fERGRMdU2teJyRXvxcIivZU6BmQPR+wCZK/YBIiIic3TiUiXitx6Hr1qFo2vvETscs2MRfYCIiIiof7I6pr9CWP8zIEyAiIiILIh+CwzW/wwIEyAiIiI
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 640x480 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# convert accuracy to misclassification error\n",
"error_rate = [1 - x for x in cv_scores]\n",
"\n",
"# plot misclassification error vs k\n",
"plt.plot(neighbors, error_rate)\n",
"plt.xlabel('Number of neighbors')\n",
"plt.ylabel('Error rate')\n",
"plt.show()"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## TF-IDF Vectorizer\n",
"\n",
"Sometimes a given type of data cannot be used to train a model directly, because the input to an ML algorithm is usually a vector, a matrix or a tensor.\n",
"Text data therefore also has to be transformed into vectors. The TF-IDF Vectorizer is useful here.\n",
"Here is an example from the documentation showing how to use it (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n",
"\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 70,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"(4, 9)\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
|||
|
"corpus = [\n",
|
|||
|
" 'This is the first document.',\n",
|
|||
|
" 'This document is the second document.',\n",
|
|||
|
" 'And this is the third one.',\n",
|
|||
|
" 'Is this the first document?',\n",
|
|||
|
"]\n",
|
|||
|
"vectorizer = TfidfVectorizer()\n",
|
|||
|
"X = vectorizer.fit_transform(corpus)\n",
|
|||
|
"vectorizer.get_feature_names_out()\n",
|
|||
|
"print(X.shape)\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 73,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',\n",
|
|||
|
" 'this'], dtype=object)"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 73,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"vectorizer.get_feature_names_out()\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 71,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<4x9 sparse matrix of type '<class 'numpy.float64'>'\n",
|
|||
|
"\twith 21 stored elements in Compressed Sparse Row format>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 71,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"X"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 72,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"matrix([[0. , 0.46979139, 0.58028582, 0.38408524, 0. ,\n",
|
|||
|
" 0. , 0.38408524, 0. , 0.38408524],\n",
|
|||
|
" [0. , 0.6876236 , 0. , 0.28108867, 0. ,\n",
|
|||
|
" 0.53864762, 0.28108867, 0. , 0.28108867],\n",
|
|||
|
" [0.51184851, 0. , 0. , 0.26710379, 0.51184851,\n",
|
|||
|
" 0. , 0.26710379, 0.51184851, 0.26710379],\n",
|
|||
|
" [0. , 0.46979139, 0.58028582, 0.38408524, 0. ,\n",
|
|||
|
" 0. , 0.38408524, 0. , 0.38408524]])"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 72,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"X.todense()"
|
|||
|
]
|
|||
|
},
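{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small extra illustration (not from the documentation): once fitted, the same vectorizer can transform new text into the same 9-dimensional space. Words outside the learned vocabulary are simply ignored. The example document below is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# reuse the vectorizer fitted on `corpus` above on an unseen document\n",
"new_docs = ['This is another document.']\n",
"X_new = vectorizer.transform(new_docs)\n",
"\n",
"print(X_new.shape)    # (1, 9) - the same 9 features as the training corpus\n",
"print(X_new.toarray())"
]
},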
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"Based on this data we can train a logistic regression model. It is a linear regression model with the logistic function applied on top:\n",
"( https://en.wikipedia.org/wiki/Logistic_function )\n",
" \n",
" \n",
"![Logistic function](./logistic.png)\n",
"\n",
"\n",
"Because the model's output is always between 0 and 1, it can be interpreted as a probability.\n"
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 75,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"from sklearn.linear_model import LogisticRegression"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array([0, 0, 1, 0])"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 81,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"y = [0,0,1,1]\n",
|
|||
|
"model = LogisticRegression()\n",
|
|||
|
"model.fit(X, y)\n",
|
|||
|
"model.predict(X)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 82,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"array([[0.51514316, 0.48485684],\n",
|
|||
|
" [0.56428483, 0.43571517],\n",
|
|||
|
" [0.40543928, 0.59456072],\n",
|
|||
|
" [0.51514316, 0.48485684]])"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 82,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"model.predict_proba(X)"
|
|||
|
]
|
|||
|
},
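{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an extra illustration, not part of the original material): for a binary problem, the class-1 column of ``predict_proba`` is just the logistic function applied to the linear score returned by ``decision_function``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# linear part: w . x + b for each document\n",
"scores = model.decision_function(X)\n",
"\n",
"# the logistic function squashes the scores into (0, 1)\n",
"probs_class_1 = 1 / (1 + np.exp(-scores))\n",
"\n",
"print(probs_class_1)                 # should match the column for class 1 ...\n",
"print(model.predict_proba(X)[:, 1])  # ... returned by predict_proba above"
]
},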
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## Neural networks\n",
"\n",
"It is worth noting that, in their simplest variant, neural networks are really a composition of many logistic regression functions, where the input of one logistic regression model is the output of the previous one. For text data a different representation than TF-IDF is usually chosen in that case, because TF-IDF does not take word order into account."
]
|
|||
|
},
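{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny numpy sketch of that composition, added for illustration: each layer is a linear map followed by the logistic function. The weights below are random and untrained, so the outputs are meaningless - only the structure matters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    return 1 / (1 + np.exp(-z))\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# a hidden layer of 5 units on top of the 9 TF-IDF features, then a single output unit\n",
"W1, b1 = rng.normal(size=(9, 5)), np.zeros(5)\n",
"W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)\n",
"\n",
"hidden = sigmoid(X.toarray() @ W1 + b1)   # first \"logistic regression\" layer\n",
"output = sigmoid(hidden @ W2 + b2)        # second layer stacked on its output\n",
"print(output.ravel())"
]
},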
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
"source": [
"## Standard Scaler\n",
"\n",
"**Task 7**\n",
"\n",
"Check the documentation https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html\n",
"\n",
"KNN is sensitive to linear scaling of the data (unlike regression-based models, whose linear coefficients compensate for linear scaling).\n",
"\n",
"Train any KNN model on features produced by StandardScaler. Remember to scale both the test and the train data. (One possible sketch is shown below; the last cell is left empty for your own attempt.)\n",
"\n",
"Note that the scaler has a similar API (fit, transform) to the TF-IDF Vectorizer."
]
|
|||
|
},
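{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible solution (added for illustration). It assumes the ``train_X``/``test_X`` split of the Iris data created earlier in the notebook; the scaler is fitted on the training data only and then applied to both sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# fit the scaler on the training data only, then apply the same scaling to both sets\n",
"scaler = StandardScaler()\n",
"train_X_scaled = scaler.fit_transform(train_X)\n",
"test_X_scaled = scaler.transform(test_X)\n",
"\n",
"knn_scaled = KNeighborsClassifier(n_neighbors=3)\n",
"knn_scaled.fit(train_X_scaled, train_Y)\n",
"print(accuracy_score(test_Y, knn_scaled.predict(test_X_scaled)))"
]
},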
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": []
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3 (ipykernel)",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.11.7"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 4
|
|||
|
}
|