|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## Machine learning: applications\n",
"# 4. Evaluation methods"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## 4.1. Testing methodology"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Evaluating the model being built is a key part of machine learning. For this reason, it is good practice to split the available data into separate sets: one for training and one for testing. In some cases an additional, so-called validation set also has to be set aside."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Training set vs. test set"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"* We fit algorithms on the training set and check how well they perform on the test set.\n",
"* The training set should be several times larger than the test set (e.g. 4:1 or 9:1).\n",
"* The test set is often unknown in advance.\n",
"* Test and training data must not be mixed: never “contaminate” the training data with test data!"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Sometimes we need to choose the parameters of a model, e.g. $\\alpha$. Which set should be used for that?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Validation set"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"To choose the parameters, it is best to use yet another set, the so-called **validation set**."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
" * The validation set should be about the same size as the test set, so the data can be split into the three sets in proportions such as 3:1:1 or 8:1:1."
|
|||
|
]
|
|||
|
},
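{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A three-way split like the one described above can be sketched as follows. This is only an illustration: the helper name `split_train_val_test`, the fixed random seed and the 8:1:1 default are assumptions made for this example, not part of the original material."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def split_train_val_test(X, y, ratios=(8, 1, 1), seed=0):\n",
"    \"\"\"Shuffle the examples and split them into train/validation/test parts (hypothetical helper).\"\"\"\n",
"    m = len(X)\n",
"    rng = np.random.default_rng(seed)\n",
"    idx = rng.permutation(m)  # shuffle before splitting\n",
"    total = sum(ratios)\n",
"    n_train = m * ratios[0] // total\n",
"    n_val = m * ratios[1] // total\n",
"    # the remaining examples form the test set\n",
"    train, val, test = np.split(idx, [n_train, n_train + n_val])\n",
"    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])\n",
"\n",
"# Toy data: 10 examples with 2 features each\n",
"X_demo = np.arange(20).reshape(10, 2)\n",
"y_demo = np.arange(10)\n",
"train, val, test = split_train_val_test(X_demo, y_demo)\n",
"print(len(train[0]), len(val[0]), len(test[0]))"
]
},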
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Cross-validation"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Which part of the data should be set aside as the validation set so that the result is “best”?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
" * Let each batch of the data take on that role in turn!"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"<img width=\"100%\" src=\"https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png\"/>\n",
|
|||
|
"Source: https://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Cross-validation\n",
"\n",
"* Split the data $D = \\left\\{ (x^{(1)}, y^{(1)}), \\ldots, (x^{(m)}, y^{(m)})\\right\\} $ into $N$ disjoint sets $T_1,\\ldots,T_N$.\n",
"* For $i=1,\\ldots,N$:\n",
"    * Use $T_i$ for validation and $S_i$ for training, where $S_i = D \\smallsetminus T_i$.\n",
"    * Save the model $\\theta_i$.\n",
"* Accumulate the results of the models $\\theta_i$ on the sets $T_i$.\n",
"* Choose the training parameters based on the accumulated results."
|
|||
|
]
|
|||
|
},
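{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The procedure above can be sketched in a few lines of NumPy. The names `k_fold_indices` and `cross_validate`, the `fit` and `score` callables, and the choice of a plain average as the way of accumulating results are all assumptions made for this illustration; the toy model simply predicts the mean of the training targets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def k_fold_indices(m, n_folds, seed=0):\n",
"    \"\"\"Yield (train_idx, val_idx) pairs for N-fold cross-validation (assumes n_folds >= 2).\"\"\"\n",
"    rng = np.random.default_rng(seed)\n",
"    folds = np.array_split(rng.permutation(m), n_folds)  # disjoint sets T_1..T_N\n",
"    for i in range(n_folds):\n",
"        val_idx = folds[i]  # T_i\n",
"        train_idx = np.concatenate(folds[:i] + folds[i+1:])  # S_i: everything except T_i\n",
"        yield train_idx, val_idx\n",
"\n",
"def cross_validate(fit, score, X, y, n_folds=5):\n",
"    \"\"\"Average the validation score over all folds (one possible accumulation rule).\"\"\"\n",
"    scores = []\n",
"    for train_idx, val_idx in k_fold_indices(len(X), n_folds):\n",
"        model = fit(X[train_idx], y[train_idx])  # theta_i\n",
"        scores.append(score(model, X[val_idx], y[val_idx]))\n",
"    return float(np.mean(scores))\n",
"\n",
"# Toy example: the model is the mean of the training targets, scored with MSE\n",
"fit_mean = lambda X, y: y.mean()\n",
"mse_score = lambda model, X, y: float(np.mean((model - y) ** 2))\n",
"X_demo = np.arange(10, dtype=float).reshape(10, 1)\n",
"y_demo = 2 * X_demo[:, 0]\n",
"print(cross_validate(fit_mean, mse_score, X_demo, y_demo, n_folds=5))"
]
},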
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Cross-validation: tips\n",
"\n",
"* $N$ is usually chosen between $4$ and $10$; this is known as $N$-fold cross-validation.\n",
"* It is worth shuffling $D$ before splitting.\n",
"* How should the results for all the sets $T_i$ be accumulated?\n",
"* Once the parameters have been chosen using all the sets $T_i$, we train the model on the entire training data with those parameters.\n",
"* We then test on the test set (if we have one)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### _Leave-one-out_\n",
|
|||
|
"\n",
|
|||
|
"This is a special case of cross-validation in which $N = m$."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"* What is the size of a single set $T_i$?\n",
"* What are the advantages and disadvantages of this method?\n",
"* When can it be useful?"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Validation set and optimization algorithms\n",
"\n",
"* If the error on the training set increases, the parameter $\\alpha$ is badly chosen and should be decreased.\n",
"* If the error decreases on the training set but increases on the validation set, we are dealing with **overfitting**.\n",
"* Optimization should then be stopped. Automating this is called _early stopping_."
|
|||
|
]
|
|||
|
},
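{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"One possible way to automate the stopping rule described above is sketched below. The function `train_with_early_stopping`, its `step` and `val_error` callables and the `patience` parameter are hypothetical names introduced for this example; real training loops differ in detail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"def train_with_early_stopping(step, val_error, theta, patience=5, max_steps=1000):\n",
"    \"\"\"Apply `step` repeatedly; stop once the validation error has not improved for `patience` steps.\"\"\"\n",
"    best_theta, best_err, bad_steps = theta, val_error(theta), 0\n",
"    for _ in range(max_steps):\n",
"        theta = step(theta)  # one optimization step on the training set\n",
"        err = val_error(theta)  # error measured on the validation set\n",
"        if err < best_err:\n",
"            best_theta, best_err, bad_steps = theta, err, 0\n",
"        else:\n",
"            bad_steps += 1\n",
"            if bad_steps >= patience:\n",
"                break  # validation error keeps growing: stop\n",
"    return best_theta, best_err\n",
"\n",
"# Toy example (assumed): each step moves theta by 0.1, validation error is lowest near theta = 1\n",
"best_theta, best_err = train_with_early_stopping(\n",
"    step=lambda t: t + 0.1,\n",
"    val_error=lambda t: (t - 1.0) ** 2,\n",
"    theta=0.0)\n",
"print(best_theta, best_err)"
]
},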
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## 4.2. Quality metrics"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"To evaluate a model, we have to choose the **measure** (**metric**) we are going to use.\n",
"\n",
"Which metric is best?\n",
" * That depends on the type of task.\n",
" * Regression and classification use different metrics."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Metrics for regression tasks\n",
"\n",
"For regression tasks we can use, for example:\n",
" * mean squared error (MSE):\n",
" $$ \\mathrm{MSE} \\, = \\, \\frac{1}{m} \\sum_{i=1}^{m} \\left( \\hat{y}^{(i)} - y^{(i)} \\right)^2 $$\n",
" * root-mean-square error (RMSE):\n",
" $$ \\mathrm{RMSE} \\, = \\, \\sqrt{ \\frac{1}{m} \\sum_{i=1}^{m} \\left( \\hat{y}^{(i)} - y^{(i)} \\right)^2 } $$\n",
" * mean absolute error (MAE):\n",
" $$ \\mathrm{MAE} \\, = \\, \\frac{1}{m} \\sum_{i=1}^{m} \\left| \\hat{y}^{(i)} - y^{(i)} \\right| $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"In the formulas above, $y^{(i)}$ denotes the **expected** value of the variable $y$ in the $i$-th example, and $\\hat{y}^{(i)}$ denotes the value of $y$ in the $i$-th example computed (**predicted**) by our model."
|
|||
|
]
|
|||
|
},
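{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The three formulas above translate almost directly into NumPy. The function names `mse`, `rmse` and `mae` and the four-element toy vectors are assumptions made for this sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def mse(y_true, y_pred):\n",
"    \"\"\"Mean squared error.\"\"\"\n",
"    return float(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))\n",
"\n",
"def rmse(y_true, y_pred):\n",
"    \"\"\"Root-mean-square error: the square root of MSE.\"\"\"\n",
"    return float(np.sqrt(mse(y_true, y_pred)))\n",
"\n",
"def mae(y_true, y_pred):\n",
"    \"\"\"Mean absolute error.\"\"\"\n",
"    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))\n",
"\n",
"# Toy values (assumed): the prediction errors are -1, 0, 2 and 1\n",
"y_true = [3.0, 5.0, 2.0, 7.0]\n",
"y_pred = [2.0, 5.0, 4.0, 8.0]\n",
"print('MSE: ', mse(y_true, y_pred))\n",
"print('RMSE:', rmse(y_true, y_pred))\n",
"print('MAE: ', mae(y_true, y_pred))"
]
},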
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"### Metrics for classification tasks\n",
"\n",
"To present some of the most popular metrics used for classification tasks, let us use the following example:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 1,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Useful imports\n",
|
|||
|
"\n",
|
|||
|
"import ipywidgets as widgets\n",
|
|||
|
"import matplotlib.pyplot as plt\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"import pandas\n",
|
|||
|
"import random\n",
|
|||
|
"import seaborn\n",
|
|||
|
"\n",
|
|||
|
"%matplotlib inline"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 2,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def powerme(x1, x2, n):\n",
"    \"\"\"Generate powers of x1 and x2 and their products, up to total degree n\"\"\"\n",
"    X = []\n",
"    for m in range(n+1):\n",
"        for i in range(m+1):\n",
"            X.append(np.multiply(np.power(x1,i),np.power(x2,(m-i))))\n",
"    return np.hstack(X)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 3,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def plot_data_for_classification(X, Y, xlabel=None, ylabel=None, Y_predicted=[], highlight=None):\n",
"    \"\"\"Plot the data for a classification task\"\"\"\n",
"    fig = plt.figure(figsize=(16*.6, 9*.6))\n",
"    ax = fig.add_subplot(111)\n",
"    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
"    X = X.tolist()\n",
"    Y = Y.tolist()\n",
"    X1n = [x[1] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X1p = [x[1] for x, y in zip(X, Y) if y[0] == 1]\n",
"    X2n = [x[2] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X2p = [x[2] for x, y in zip(X, Y) if y[0] == 1]\n",
"\n",
"    if len(Y_predicted) > 0:\n",
"        Y_predicted = Y_predicted.tolist()\n",
"        X1tn = [x[1] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 0 and yp[0] == 0]\n",
"        X1fn = [x[1] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 1 and yp[0] == 0]\n",
"        X1tp = [x[1] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 1 and yp[0] == 1]\n",
"        X1fp = [x[1] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 0 and yp[0] == 1]\n",
"        X2tn = [x[2] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 0 and yp[0] == 0]\n",
"        X2fn = [x[2] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 1 and yp[0] == 0]\n",
"        X2tp = [x[2] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 1 and yp[0] == 1]\n",
"        X2fp = [x[2] for x, y, yp in zip(X, Y, Y_predicted) if y[0] == 0 and yp[0] == 1]\n",
"\n",
"        if highlight == 'tn':\n",
"            ax.scatter(X1tn, X2tn, c='r', marker='x', s=100, label='Data')\n",
"            ax.scatter(X1fn, X2fn, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1tp, X2tp, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1fp, X2fp, c='k', marker='x', s=50, label='Data')\n",
"        elif highlight == 'fn':\n",
"            ax.scatter(X1tn, X2tn, c='k', marker='x', s=50, label='Data')\n",
"            ax.scatter(X1fn, X2fn, c='g', marker='o', s=100, label='Data')\n",
"            ax.scatter(X1tp, X2tp, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1fp, X2fp, c='k', marker='x', s=50, label='Data')\n",
"        elif highlight == 'tp':\n",
"            ax.scatter(X1tn, X2tn, c='k', marker='x', s=50, label='Data')\n",
"            ax.scatter(X1fn, X2fn, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1tp, X2tp, c='g', marker='o', s=100, label='Data')\n",
"            ax.scatter(X1fp, X2fp, c='k', marker='x', s=50, label='Data')\n",
"        elif highlight == 'fp':\n",
"            ax.scatter(X1tn, X2tn, c='k', marker='x', s=50, label='Data')\n",
"            ax.scatter(X1fn, X2fn, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1tp, X2tp, c='k', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1fp, X2fp, c='r', marker='x', s=100, label='Data')\n",
"        else:\n",
"            ax.scatter(X1tn, X2tn, c='r', marker='x', s=50, label='Data')\n",
"            ax.scatter(X1fn, X2fn, c='g', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1tp, X2tp, c='g', marker='o', s=50, label='Data')\n",
"            ax.scatter(X1fp, X2fp, c='r', marker='x', s=50, label='Data')\n",
"\n",
"    else:\n",
"        ax.scatter(X1n, X2n, c='r', marker='x', s=50, label='Data')\n",
"        ax.scatter(X1p, X2p, c='g', marker='o', s=50, label='Data')\n",
"\n",
"    if xlabel:\n",
"        ax.set_xlabel(xlabel)\n",
"    if ylabel:\n",
"        ax.set_ylabel(ylabel)\n",
"\n",
"    ax.margins(.05, .05)\n",
"    return fig"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 4,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Load the data\n",
|
|||
|
"import pandas\n",
|
|||
|
"import numpy as np\n",
|
|||
|
"\n",
|
|||
|
"alldata = pandas.read_csv('data-metrics.tsv', sep='\\t')\n",
|
|||
|
"data = np.matrix(alldata)\n",
|
|||
|
"\n",
|
|||
|
"m, n_plus_1 = data.shape\n",
|
|||
|
"n = n_plus_1 - 1\n",
|
|||
|
"\n",
|
|||
|
"X2 = powerme(data[:, 1], data[:, 2], n)\n",
|
|||
|
"Y2 = np.matrix(data[:, 0]).reshape(m, 1)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 5,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "(base64-encoded PNG data omitted; plot of the classification data with axes $x_1$ and $x_2$)",
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 960x540 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$')"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 6,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def safeSigmoid(x, eps=0):\n",
"    \"\"\"Sigmoid function modified so that its values\n",
"    always stay at least eps away from the asymptotes\n",
"    \"\"\"\n",
"    y = 1.0/(1.0 + np.exp(-x))\n",
"    if eps > 0:\n",
"        y[y < eps] = eps\n",
"        y[y > 1 - eps] = 1 - eps\n",
"    return y\n",
"\n",
"def h(theta, X, eps=0.0):\n",
"    \"\"\"Hypothesis function (logistic regression)\"\"\"\n",
"    return safeSigmoid(X*theta, eps)\n",
"\n",
"def J(h,theta,X,y, lamb=0):\n",
"    \"\"\"Cost function for logistic regression\"\"\"\n",
"    m = len(y)\n",
"    f = h(theta, X, eps=10**-7)\n",
"    j = -np.sum(np.multiply(y, np.log(f)) +\n",
"                np.multiply(1 - y, np.log(1 - f)), axis=0)/m\n",
"    if lamb > 0:\n",
"        j += lamb/(2*m) * np.sum(np.power(theta[1:],2))\n",
"    return j\n",
"\n",
"def dJ(h,theta,X,y,lamb=0):\n",
"    \"\"\"Gradient of the cost function\"\"\"\n",
"    g = 1.0/y.shape[0]*(X.T*(h(theta,X)-y))\n",
"    if lamb > 0:\n",
"        g[1:] += lamb/float(y.shape[0]) * theta[1:]\n",
"    return g\n",
"\n",
"def classifyBi(theta, X):\n",
"    \"\"\"Prediction function for binary classification (returns the probability of class 1)\"\"\"\n",
"    prob = h(theta, X)\n",
"    return prob"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 7,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def GD(h, fJ, fdJ, theta, X, y, alpha=0.01, eps=10**-3, maxSteps=10000):\n",
"    \"\"\"Gradient descent for logistic regression\"\"\"\n",
"    errorCurr = fJ(h, theta, X, y)\n",
"    errors = [[errorCurr, theta]]\n",
"    while True:\n",
"        # compute the new theta\n",
"        theta = theta - alpha * fdJ(h, theta, X, y)\n",
"        # record the current error level\n",
"        errorCurr, errorPrev = fJ(h, theta, X, y), errorCurr\n",
"        # stopping criteria\n",
"        if abs(errorPrev - errorCurr) <= eps:\n",
"            break\n",
"        if len(errors) > maxSteps:\n",
"            break\n",
"        errors.append([errorCurr, theta])\n",
"    return theta, errors"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 8,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"theta = [[ 1.37136167]\n",
|
|||
|
" [ 0.90128948]\n",
|
|||
|
" [ 0.54708112]\n",
|
|||
|
" [-5.9929264 ]\n",
|
|||
|
" [ 2.64435168]\n",
|
|||
|
" [-4.27978238]]\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Run gradient descent for logistic regression\n",
|
|||
|
"theta_start = np.matrix(np.zeros(X2.shape[1])).reshape(X2.shape[1],1)\n",
|
|||
|
"theta, errors = GD(h, J, dJ, theta_start, X2, Y2, \n",
|
|||
|
" alpha=0.1, eps=10**-7, maxSteps=10000)\n",
|
|||
|
"print('theta = {}'.format(theta))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 9,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"def plot_decision_boundary(fig, theta, X):\n",
"    \"\"\"Plot the decision boundary between the classes\"\"\"\n",
"    ax = fig.axes[0]\n",
"    xx, yy = np.meshgrid(np.arange(-1.0, 1.0, 0.02),\n",
"                         np.arange(-1.0, 1.0, 0.02))\n",
"    l = len(xx.ravel())\n",
"    C = powerme(xx.reshape(l, 1), yy.reshape(l, 1), n)\n",
"    z = classifyBi(theta, C).reshape(int(np.sqrt(l)), int(np.sqrt(l)))\n",
"\n",
"    # draw the 0.5-probability contour on the figure's own axes\n",
"    ax.contour(xx, yy, z, levels=[0.5], linewidths=3)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 10,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"Y_expected = Y2.astype(int)\n",
|
|||
|
"Y_predicted = (classifyBi(theta, X2) > 0.5).astype(int)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 11,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Prepare the interactive plot\n",
"\n",
"dropdown_highlight = widgets.Dropdown(options=['all', 'tp', 'fp', 'tn', 'fn'], value='all', description='highlight')\n",
"\n",
"def interactive_classification(highlight):\n",
"    fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$',\n",
"                                       Y_predicted=Y_predicted, highlight=highlight)\n",
"    plot_decision_boundary(fig, theta, X2)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"application/vnd.jupyter.widget-view+json": {
|
|||
|
"model_id": "660e6cef0e7c4e8e85052791cceacf7f",
|
|||
|
"version_major": 2,
|
|||
|
"version_minor": 0
|
|||
|
},
|
|||
|
"text/plain": [
|
|||
|
"interactive(children=(Dropdown(description='highlight', options=('all', 'tp', 'fp', 'tn', 'fn'), value='all'),…"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
},
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"text/plain": [
|
|||
|
"<function __main__.interactive_classification(highlight)>"
|
|||
|
]
|
|||
|
},
|
|||
|
"execution_count": 12,
|
|||
|
"metadata": {},
|
|||
|
"output_type": "execute_result"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"widgets.interact(interactive_classification, highlight=dropdown_highlight)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"The classification task in the example above is to assign points to one of two categories:\n",
" 0. <font color=\"red\">red crosses</font>\n",
" 1. <font color=\"green\">green circles</font>\n",
"\n",
"Logistic regression was used for this purpose.\n",
"\n",
"The resulting model divides the plane into two regions:\n",
" 0. <font color=\"red\">outside the navy-blue curve</font>\n",
" 1. <font color=\"green\">inside the navy-blue curve</font>\n",
"\n",
"The model predicts class <font color=\"red\">0 (“red”)</font> for points lying in the region outside the curve, and class <font color=\"green\">1 (“green”)</font> for points lying in the region inside the curve."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"All observations can therefore be divided into four groups:\n",
" * **true positives (TP)**: correctly classified positive examples (<font color=\"green\">green circles</font> in the <font color=\"green\">inner region</font>)\n",
" * **true negatives (TN)**: correctly classified negative examples (<font color=\"red\">red crosses</font> in the <font color=\"red\">outer region</font>)\n",
" * **false positives (FP)**: negative examples classified as positive (<font color=\"red\">red crosses</font> in the <font color=\"green\">inner region</font>)\n",
" * **false negatives (FN)**: positive examples classified as negative (<font color=\"green\">green circles</font> in the <font color=\"red\">outer region</font>)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"In other words:\n",
|
|||
|
"\n",
|
|||
|
"<img width=\"50%\" src=\"https://blog.aimultiple.com/wp-content/uploads/2019/07/positive-negative-true-false-matrix.png\">"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 13,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "skip"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"TP = 5\n",
|
|||
|
"TN = 35\n",
|
|||
|
"FP = 3\n",
|
|||
|
"FN = 6\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"# Compute TP, TN, FP and FN\n",
"\n",
"tp = 0\n",
"tn = 0\n",
"fp = 0\n",
"fn = 0\n",
"\n",
"for i in range(len(Y_expected)):\n",
"    if Y_expected[i] == 1 and Y_predicted[i] == 1:\n",
"        tp += 1\n",
"    elif Y_expected[i] == 0 and Y_predicted[i] == 0:\n",
"        tn += 1\n",
"    elif Y_expected[i] == 0 and Y_predicted[i] == 1:\n",
"        fp += 1\n",
"    elif Y_expected[i] == 1 and Y_predicted[i] == 0:\n",
"        fn += 1\n",
"\n",
"print('TP =', tp)\n",
"print('TN =', tn)\n",
"print('FP =', fp)\n",
"print('FN =', fn)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"We can now define the following metrics:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### Accuracy\n",
"$$ \\mbox{accuracy} = \\frac{\\mbox{correctly classified cases}}{\\mbox{all cases}} = \\frac{TP + TN}{TP + TN + FP + FN} $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Accuracy is obtained by dividing the number of correctly classified cases by the number of all cases:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 14,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Accuracy: 0.8163265306122449\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"accuracy = (tp + tn) / (tp + tn + fp + fn)\n",
|
|||
|
"print('Accuracy:', accuracy)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"**Note:** Accuracy is not always a good measure, especially when the classes are highly imbalanced!\n",
"\n",
"*Example:* Imagine a coronavirus test that **always** returns a negative result. How useful would such a test be in practice? Not at all. And what would its *accuracy* be? Let us compute:\n",
"$$ \\mbox{accuracy} \\, = \\, \\frac{\\mbox{estimated number of healthy people in the world}}{\\mbox{population of the Earth}} \\, \\approx \\, \\frac{7\\,700\\,000\\,000 - 600\\,000}{7\\,700\\,000\\,000} \\, \\approx \\, 0.99992 $$\n",
"(rounded figures as of 27 March 2020)\n",
"\n",
"This result is so high because the vast majority of people in the world are not infected, so for a randomly chosen person we can blindly guess that they do not have the coronavirus.\n",
"\n",
"In this case, the large difference in the sizes of the two classes (infected/not infected) means that *accuracy* is not a good metric.\n",
"\n",
"That is why other metrics are also available:"
|
|||
|
]
|
|||
|
},
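{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The same effect can be reproduced on a small scale. In the sketch below the class sizes (990 negative and 10 positive examples) are made-up numbers chosen only for illustration; the classifier always answers “negative”."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Assumed toy data: 990 negative examples and 10 positive ones\n",
"y_true = np.array([0] * 990 + [1] * 10)\n",
"y_pred = np.zeros_like(y_true)  # a classifier that always predicts the negative class\n",
"\n",
"accuracy = float(np.mean(y_true == y_pred))  # fraction of correct answers\n",
"detected = int(np.sum((y_true == 1) & (y_pred == 1)))  # positive examples actually found\n",
"\n",
"print('Accuracy:', accuracy)\n",
"print('Positive examples detected:', detected)"
]
},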
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### Precision\n",
|
|||
|
"$$ \\mbox{precision} = \\frac{TP}{TP + FP} $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 15,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Precision: 0.625\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"precision = tp / (tp + fp)\n",
|
|||
|
"print('Precision:', precision)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Precyzja określa, jaka część przykładów sklasyfikowanych jako pozytywne to faktycznie przykłady pozytywne."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### Pokrycie (czułość, *recall*)\n",
|
|||
|
"$$ \\mbox{recall} = \\frac{TP}{TP + FN} $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 16,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"Recall: 0.45454545454545453\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"recall = tp / (tp + fn)\n",
|
|||
|
"print('Recall:', recall)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Pokrycie mówi nam, jaka część przykładów pozytywnych została poprawnie sklasyfikowana."
|
|||
|
]
|
|||
|
},
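{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The always-negative test from the accuracy example makes the difference between these metrics concrete. The sketch below recomputes its confusion-matrix counts from the rounded figures quoted earlier (a world population of 7.7 billion, of whom 600,000 are infected). The variable names are chosen here for illustration and are not part of the original notebook. Accuracy stays close to 1, whereas recall is 0 because no infected person is ever detected; precision is undefined, since the test never returns a positive result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Illustrative sketch: metrics of a test that always answers 'negative'.\n",
"# The counts use the rounded figures from the accuracy example.\n",
"population = 7_700_000_000\n",
"infected = 600_000\n",
"\n",
"tp_neg = 0                       # no infected person is ever flagged\n",
"fp_neg = 0                       # no healthy person is ever flagged\n",
"fn_neg = infected                # every infected person is missed\n",
"tn_neg = population - infected   # every healthy person gets a negative result\n",
"\n",
"accuracy_neg = (tp_neg + tn_neg) / population\n",
"recall_neg = tp_neg / (tp_neg + fn_neg)\n",
"# Precision would be 0 / 0 here, so it is reported as undefined (None).\n",
"precision_neg = tp_neg / (tp_neg + fp_neg) if (tp_neg + fp_neg) > 0 else None\n",
"\n",
"print('Accuracy:', round(accuracy_neg, 5))\n",
"print('Recall:', recall_neg)\n",
"print('Precision:', precision_neg)"
]
},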
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"#### *$F$-measure* (*$F$-score*)\n",
|
|||
|
"$$ F = \\frac{2 \\cdot \\mbox{precision} \\cdot \\mbox{recall}}{\\mbox{precision} + \\mbox{recall}} $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 17,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"name": "stdout",
|
|||
|
"output_type": "stream",
|
|||
|
"text": [
|
|||
|
"F-score: 0.5263157894736842\n"
|
|||
|
]
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"fscore = (2 * precision * recall) / (precision + recall)\n",
|
|||
|
"print('F-score:', fscore)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"$F$-_measure_ jest kompromisem między precyzją a pokryciem (a ściślej: jest średnią harmoniczną precyzji i pokrycia)."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"$F$-_measure_ jest szczególnym przypadkiem ogólniejszej miary:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"*$F_\\beta$-measure*:\n",
|
|||
|
"$$ F_\\beta = \\frac{(1 + \\beta) \\cdot \\mbox{precision} \\cdot \\mbox{recall}}{\\beta^2 \\cdot \\mbox{precision} + \\mbox{recall}} $$"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "fragment"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Dla $\\beta = 1$ otrzymujemy:\n",
|
|||
|
"$$ F_1 \\, = \\, \\frac{(1 + 1) \\cdot \\mbox{precision} \\cdot \\mbox{recall}}{1^2 \\cdot \\mbox{precision} + \\mbox{recall}} \\, = \\, \\frac{2 \\cdot \\mbox{precision} \\cdot \\mbox{recall}}{\\mbox{precision} + \\mbox{recall}} \\, = \\, F $$"
|
|||
|
]
|
|||
|
},
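{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A minimal sketch of the general $F_\\beta$ measure in code, assuming the standard definition with the factor $(1 + \\beta^2)$ in the numerator. The function name `f_beta`, the zero-denominator convention and the sample values (which match the precision and recall printed earlier) are introduced here for illustration. With $\\beta = 1$ the function reduces to the $F$-measure; $\\beta > 1$ gives more weight to recall and $\\beta < 1$ gives more weight to precision."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"def f_beta(precision, recall, beta=1.0):\n",
"    \"\"\"F_beta measure: a weighted harmonic mean of precision and recall.\"\"\"\n",
"    denominator = beta**2 * precision + recall\n",
"    if denominator == 0:\n",
"        return 0.0  # convention assumed here for the degenerate case\n",
"    return (1 + beta**2) * precision * recall / denominator\n",
"\n",
"# Sample values equal to the precision and recall printed earlier\n",
"p, r = 5 / 8, 5 / 11\n",
"for beta in (0.5, 1.0, 2.0):\n",
"    print('beta =', beta, '-> F_beta =', f_beta(p, r, beta))"
]
},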
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "slide"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"## 4.3. Obserwacje odstające"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"**Obserwacje odstające** (*outliers*) – to wszelkie obserwacje posiadające nietypową wartość.\n",
|
|||
|
"\n",
|
|||
|
"Mogą być na przykład rezultatem błędnego pomiaru albo pomyłki przy wprowadzaniu danych do bazy, ale nie tylko.\n",
|
|||
|
"\n",
|
|||
|
"Obserwacje odstające mogą niekiedy znacząco wpłynąć na parametry modelu, dlatego ważne jest, żeby takie obserwacje odrzucić zanim przystąpi się do tworzenia modelu."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"W poniższym przykładzie można zobaczyć wpływ obserwacji odstających na wynik modelowania na przykładzie danych dotyczących cen mieszkań zebranych z ogłoszeń na portalu Gratka.pl: tutaj przykładem obserwacji odstającej może być ogłoszenie, w którym podano cenę w tys. zł zamiast ceny w zł."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 18,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Przydatne funkcje\n",
|
|||
|
"\n",
|
|||
|
"def h_linear(Theta, x):\n",
|
|||
|
" \"\"\"Funkcja regresji liniowej\"\"\"\n",
|
|||
|
" return x * Theta\n",
|
|||
|
"\n",
|
|||
|
"def linear_regression(theta):\n",
|
|||
|
" \"\"\"Ta funkcja zwraca funkcję regresji liniowej dla danego wektora parametrów theta\"\"\"\n",
|
|||
|
" return lambda x: h_linear(theta, x)\n",
|
|||
|
"\n",
|
|||
|
"def cost(theta, X, y):\n",
|
|||
|
" \"\"\"Wersja macierzowa funkcji kosztu\"\"\"\n",
|
|||
|
" m = len(y)\n",
|
|||
|
" J = 1.0 / (2.0 * m) * ((X * theta - y).T * (X * theta - y))\n",
|
|||
|
" return J.item()\n",
|
|||
|
"\n",
|
|||
|
"def gradient(theta, X, y):\n",
|
|||
|
" \"\"\"Wersja macierzowa gradientu funkcji kosztu\"\"\"\n",
|
|||
|
" return 1.0 / len(y) * (X.T * (X * theta - y)) \n",
|
|||
|
"\n",
|
|||
|
"def gradient_descent(fJ, fdJ, theta, X, y, alpha=0.1, eps=10**-5):\n",
|
|||
|
" \"\"\"Algorytm gradientu prostego (wersja macierzowa)\"\"\"\n",
|
|||
|
" current_cost = fJ(theta, X, y)\n",
|
|||
|
" logs = [[current_cost, theta]]\n",
|
|||
|
" while True:\n",
|
|||
|
" theta = theta - alpha * fdJ(theta, X, y)\n",
|
|||
|
" current_cost, prev_cost = fJ(theta, X, y), current_cost\n",
|
|||
|
" if abs(prev_cost - current_cost) > 10**15:\n",
|
|||
|
" print('Algorithm does not converge!')\n",
|
|||
|
" break\n",
|
|||
|
" if abs(prev_cost - current_cost) <= eps:\n",
|
|||
|
" break\n",
|
|||
|
" logs.append([current_cost, theta]) \n",
|
|||
|
" return theta, logs\n",
|
|||
|
"\n",
|
|||
|
"def plot_data(X, y, xlabel, ylabel):\n",
|
|||
|
" \"\"\"Wykres danych (wersja macierzowa)\"\"\"\n",
|
|||
|
" fig = plt.figure(figsize=(16*.6, 9*.6))\n",
|
|||
|
" ax = fig.add_subplot(111)\n",
|
|||
|
" fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
|
|||
|
" ax.scatter([X[:, 1]], [y], c='r', s=50, label='Dane')\n",
|
|||
|
" \n",
|
|||
|
" ax.set_xlabel(xlabel)\n",
|
|||
|
" ax.set_ylabel(ylabel)\n",
|
|||
|
" ax.margins(.05, .05)\n",
|
|||
|
" plt.ylim(y.min() - 1, y.max() + 1)\n",
|
|||
|
" plt.xlim(np.min(X[:, 1]) - 1, np.max(X[:, 1]) + 1)\n",
|
|||
|
" return fig\n",
|
|||
|
"\n",
|
|||
|
"def plot_regression(fig, fun, theta, X):\n",
|
|||
|
" \"\"\"Wykres krzywej regresji (wersja macierzowa)\"\"\"\n",
|
|||
|
" ax = fig.axes[0]\n",
|
|||
|
" x0 = np.min(X[:, 1]) - 1.0\n",
|
|||
|
" x1 = np.max(X[:, 1]) + 1.0\n",
|
|||
|
" L = [x0, x1]\n",
|
|||
|
" LX = np.matrix([1, x0, 1, x1]).reshape(2, 2)\n",
|
|||
|
" ax.plot(L, fun(theta, LX), linewidth='2',\n",
|
|||
|
" label=(r'$y={theta0:.2}{op}{theta1:.2}x$'.format(\n",
|
|||
|
" theta0=float(theta[0][0]),\n",
|
|||
|
" theta1=(float(theta[1][0]) if theta[1][0] >= 0 else float(-theta[1][0])),\n",
|
|||
|
" op='+' if theta[1][0] >= 0 else '-')))"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 19,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Wczytanie danych (mieszkania) przy pomocy biblioteki pandas\n",
|
|||
|
"\n",
|
|||
|
"alldata = pandas.read_csv('data_flats_with_outliers.tsv', sep='\\t',\n",
|
|||
|
" names=['price', 'isNew', 'rooms', 'floor', 'location', 'sqrMetres'])\n",
|
|||
|
"data = np.matrix(alldata[['price', 'sqrMetres']])\n",
|
|||
|
"\n",
|
|||
|
"m, n_plus_1 = data.shape\n",
|
|||
|
"n = n_plus_1 - 1\n",
|
|||
|
"Xn = data[:, 0:n]\n",
|
|||
|
"\n",
|
|||
|
"Xo = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n + 1)\n",
|
|||
|
"yo = np.matrix(data[:, -1]).reshape(m, 1)\n",
|
|||
|
"\n",
|
|||
|
"Xo /= np.amax(Xo, axis=0)\n",
|
|||
|
"yo /= np.amax(yo, axis=0)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 20,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA14AAAH0CAYAAAAtwPxTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAz3ElEQVR4nO3de3RV5Z038N8h4dZCooKE0KJoW6kVrQhWUbG2tiC2LARnen2p2qkzKN6gji3Ou6a1nXex2mVRu1pAZ7Q37VvnLeJlvFSsgqCoxaJOl0idVoVqIoI2EcyASfb7x5lEQwLkcp6Ek3w+a50V9j7Pfs7vZLMP58uz97NzWZZlAQAAQDL9eroAAACA3k7wAgAASEzwAgAASEzwAgAASEzwAgAASEzwAgAASEzwAgAASEzwAgAASEzwAgAASEzwAgAASKxogtfChQvj+OOPj6FDh8aIESPirLPOio0bN+5zu1WrVsWECRNi0KBBcfjhh8fSpUu7oVoAAIB3FE3wWrVqVcydOzcee+yxWLFiRdTX18eUKVNix44de9zmhRdeiDPPPDMmT54c69evjyuvvDIuueSSWLZsWTdWDgAA9HW5LMuyni6iM1577bUYMWJErFq1Kk499dQ223zjG9+IO++8MzZs2NC8bs6cOfH000/H2rVru6tUAACgjyvt6QI6q6amJiIiDjrooD22Wbt2bUyZMqXFuqlTp8aNN94Yb7/9dvTv37/VNjt37oydO3c2Lzc2Nsbrr78ew4YNi1wuV6DqAQCA/VGWZfHmm2/GqFGjol+/wp0gWJTBK8uymD9/fpxyyikxbty4Pbarrq6OioqKFusqKiqivr4+tm7dGpWVla22WbhwYVx11VUFrxkAACgemzdvjve///0F668og9dFF10UzzzzTKxZs2afbXcfpWo6s3JPo1cLFiyI+fPnNy/X1NTEIYccEps3b46ysrIuVA0AAOzvamtrY/To0TF06NCC9lt0weviiy+OO++8Mx5++OF9JtCRI0dGdXV1i3VbtmyJ0tLSGDZsWJvbDBw4MAYOHNhqfVlZmeAFAAB9RKEvMyqaWQ2zLIuLLroobrvttnjwwQfjsMMO2+c2kyZNihUrVrRYd//998fEiRPbvL4LAAAghaIJXnPnzo2bb745fvnLX8bQoUOjuro6qquro66urrnNggUL4itf+Urz8pw5c+Kll16K+fPnx4YNG+Kmm26KG2+8MS6//PKeeAsAAEAfVTTBa8mSJVFTUxOnnXZaVFZWNj9uvfXW5jZVVVWxadOm5uXDDjss7rnnnli5cmUce+yx8d3vfjd++MMfxtlnn90TbwEAAOijivY+Xt2ltrY2ysvLo6amxjVeAADQy6X6/l80I14AAADFSvACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABIrKiC18MPPxzTp0+PUaNGRS6Xi9tvv32v7VeuXBm5XK7V47nnnuueggEAACKitKcL6IgdO3bERz/60TjvvPPi7LPPbvd2GzdujLKysublgw8+OEV5AAAAbSqq4D
Vt2rSYNm1ah7cbMWJEHHDAAYUvCAAAoB2K6lTDzho/fnxUVlbG6aefHg899NBe2+7cuTNqa2tbPAAAALqiVwevysrKuOGGG2LZsmVx2223xdixY+P000+Phx9+eI/bLFy4MMrLy5sfo0eP7saKAQCA3iiXZVnW00V0Ri6Xi+XLl8dZZ53Voe2mT58euVwu7rzzzjaf37lzZ+zcubN5uba2NkaPHh01NTUtrhMDAAB6n9ra2igvLy/49/9ePeLVlhNPPDGef/75PT4/cODAKCsra/EAAADoij4XvNavXx+VlZU9XQYAANCHFNWshtu3b4//+q//al5+4YUX4qmnnoqDDjooDjnkkFiwYEG8/PLL8fOf/zwiIq699toYM2ZMHHXUUbFr1664+eabY9myZbFs2bKeegsAAEAfVFTBa926dfGJT3yieXn+/PkREXHOOefET3/606iqqopNmzY1P79r1664/PLL4+WXX47BgwfHUUcdFXfffXeceeaZ3V47AADQdxXt5BrdJdXFdQAAwP7H5BoAAABFSvACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPACAABITPAC6Ki6uohXX83/BABoB8ELoL3WrImYNStiyJCIkSPzP2fNinjkkZ6uDADYzxVV8Hr44Ydj+vTpMWrUqMjlcnH77bfvc5tVq1bFhAkTYtCgQXH44YfH0qVL0xcK9D5LlkScemrEXXdFNDbm1zU25pcnT47w2QIA7EVRBa8dO3bERz/60fjRj37UrvYvvPBCnHnmmTF58uRYv359XHnllXHJJZfEsmXLElcK9Cpr1kTMnRuRZRH19S2fq6/Pr7/wQiNfAMAelfZ0AR0xbdq0mDZtWrvbL126NA455JC49tprIyLiyCOPjHXr1sXVV18dZ599dqIqgV5n0aKIkpLWoevdSkoirrkm4uSTu68uAKBoFNWIV0etXbs2pkyZ0mLd1KlTY926dfH222+3uc3OnTujtra2xQPow+rqIu64Y++hKyL//PLlJtwAANrUq4NXdXV1VFRUtFhXUVER9fX1sXXr1ja3WbhwYZSXlzc/Ro8e3R2lAvur2tp3runal8bGfHsAgN306uAVEZHL5VosZ1nW5vomCxYsiJqamubH5s2bk9cI7MfKyiL6tfOjsl+/fHsAgN306uA1cuTIqK6ubrFuy5YtUVpaGsOGDWtzm4EDB0ZZWVmLB9CHDR4cMWNGROk+LoktLY2YOTPfHgBgN706eE2aNClWrFjRYt39998fEydOjP79+/dQVUDRmT8/oqFh720aGiLmzeueegCAolNUwWv79u3x1FNPxVNPPRUR+enin3rqqdi0aVNE5E8T/MpXvtLcfs6cOfHSSy/F/PnzY8OGDXHTTTfFjTfeGJdffnlPlA8Uq1NOiVi8OCKXaz3yVVqaX794sRkNAYA9KqrgtW7duhg/fnyMHz8+IiLmz58f48ePj3/+53+OiIiqqqrmEBYRcdhhh8U999wTK1eujGOPPTa++93vxg9/+ENTyQMdN2dOxOrV+dMOm6756tcvv7x6df55AIA9yGVNs03Qptra2igvL4
+amhrXewF5dXX52QvLylzTBQC9TKrv/0V1A2WA/cLgwQIXANAhRXWqIQAAQDESvAAAABITvAAAABITvAAAABITvAA
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 960x540 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"fig = plot_data(Xo, yo, xlabel=u'metraż', ylabel=u'cena')\n",
|
|||
|
"theta_start = np.matrix([0.0, 0.0]).reshape(2, 1)\n",
|
|||
|
"theta, logs = gradient_descent(cost, gradient, theta_start, Xo, yo, alpha=0.01)\n",
|
|||
|
"plot_regression(fig, h_linear, theta, Xo)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Na powyższym przykładzie obserwacja odstająca jawi sie jako pojedynczy punkt po prawej stronie wykresu. Widzimy, że otrzymana krzywa regresji zamiast odwzorowywać ogólny trend, próbuje „dopasować się” do tej pojedynczej obserwacji.\n",
|
|||
|
"\n",
|
|||
|
"Dlatego taką obserwację należy usunąć ze zbioru danych (zobacz ponizej)."
|
|||
|
]
|
|||
|
},
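{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The same effect can be reproduced on a small synthetic example. The sketch below is not the notebook's gradient-descent fit: it uses `np.polyfit` (ordinary least squares) on made-up numbers, with one price deliberately entered in thousands of PLN, and compares the fitted slope with and without that point. All names and values in it are illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Made-up data: floor area in square metres and price in PLN.\n",
"area = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 80.0])\n",
"price = np.array([240000.0, 310000.0, 395000.0, 470000.0, 560000.0, 630.0])\n",
"# The last price is an artificial outlier: 630 (thousands of PLN) instead of 630000.\n",
"\n",
"# Least-squares line through all points; polyfit returns [slope, intercept]\n",
"slope_all, intercept_all = np.polyfit(area, price, 1)\n",
"\n",
"# Drop the suspicious point with a hand-picked price threshold and refit\n",
"keep = price > 10000\n",
"slope_clean, intercept_clean = np.polyfit(area[keep], price[keep], 1)\n",
"\n",
"print('slope with the outlier:   ', slope_all)\n",
"print('slope without the outlier:', slope_clean)"
]
},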
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 21,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"# Odrzućmy obserwacje odstające\n",
|
|||
|
"alldata_no_outliers = [\n",
|
|||
|
" (index, item) for index, item in alldata.iterrows() \n",
|
|||
|
" if item.price > 10000 and item.sqrMetres < 1000]\n",
|
|||
|
"\n",
|
|||
|
"# Alternatywnie można to zrobić w następujący sposób\n",
|
|||
|
"alldata_no_outliers = alldata.loc[(alldata['price'] > 10000) & (alldata['sqrMetres'] < 1000)]"
|
|||
|
]
|
|||
|
},
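{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The thresholds above were picked by hand for this dataset. When no such domain knowledge is available, a common rule of thumb (a different technique from the one used in this notebook) is to keep only the values lying within 1.5 interquartile ranges of the quartiles. The sketch below applies that rule to a small made-up DataFrame; the helper `iqr_mask` and the data are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import pandas\n",
"\n",
"def iqr_mask(series, k=1.5):\n",
"    \"\"\"Boolean mask of the values lying within k * IQR of the quartiles.\"\"\"\n",
"    q1 = series.quantile(0.25)\n",
"    q3 = series.quantile(0.75)\n",
"    iqr = q3 - q1\n",
"    return series.between(q1 - k * iqr, q3 + k * iqr)\n",
"\n",
"# Made-up listings; the last price is given in thousands of PLN by mistake\n",
"flats = pandas.DataFrame({\n",
"    'price': [250000.0, 310000.0, 295000.0, 390000.0, 380.0],\n",
"    'sqrMetres': [38.0, 52.0, 47.0, 70.0, 60.0],\n",
"})\n",
"\n",
"# Keep only the rows that pass the rule in both columns\n",
"mask = iqr_mask(flats['price']) & iqr_mask(flats['sqrMetres'])\n",
"flats_no_outliers = flats.loc[mask]\n",
"print(flats_no_outliers)"
]
},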
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 22,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"data = np.matrix(alldata_no_outliers[['price', 'sqrMetres']])\n",
|
|||
|
"\n",
|
|||
|
"m, n_plus_1 = data.shape\n",
|
|||
|
"n = n_plus_1 - 1\n",
|
|||
|
"Xn = data[:, 0:n]\n",
|
|||
|
"\n",
|
|||
|
"Xo = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n + 1)\n",
|
|||
|
"yo = np.matrix(data[:, -1]).reshape(m, 1)\n",
|
|||
|
"\n",
|
|||
|
"Xo /= np.amax(Xo, axis=0)\n",
|
|||
|
"yo /= np.amax(yo, axis=0)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": 23,
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "subslide"
|
|||
|
}
|
|||
|
},
|
|||
|
"outputs": [
|
|||
|
{
|
|||
|
"data": {
|
|||
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA14AAAH0CAYAAAAtwPxTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAzmElEQVR4nO3dfXRV1Z038N8l4a1CooKEMEXBtqIVX8EKKtTWFsSWheDM2NpF1U6dwXdAR4t9nra28yxWZ1nUrhbQGe2bduqsIlbHl4pVEBS1WNSZJVKnVaGagKBNBCmY5Dx/3BINBAjJ3Ulu8vmsdVY8++6z7+/meG/ul3POPrksy7IAAAAgmR4dXQAAAEBXJ3gBAAAkJngBAAAkJngBAAAkJngBAAAkJngBAAAkJngBAAAkJngBAAAkJngBAAAkJngBAAAkVjTBa+7cuXHSSSdF//79Y9CgQXH22WfH2rVr97ndsmXLYtSoUdGnT584/PDDY+HChe1QLQAAwPuKJngtW7YsLr300njqqadiyZIlUVdXFxMmTIitW7fucZtXXnklzjrrrBg3blysXr06rrvuurjiiiti0aJF7Vg5AADQ3eWyLMs6uojWePPNN2PQoEGxbNmyGD9+fLN9rr322rj33ntjzZo1jW0zZsyI559/PlauXNlepQIAAN1caUcX0Fo1NTUREXHwwQfvsc/KlStjwoQJTdomTpwYt912W7z33nvRs2fP3bbZvn17bN++vXG9oaEh3nrrrRgwYEDkcrkCVQ8AAHRGWZbFO++8E0OGDIkePQp3gmBRBq8sy2L27Nlx2mmnxciRI/fYr7q6OioqKpq0VVRURF1dXWzatCkqKyt322bu3Llx/fXXF7xmAACgeKxfvz4+/OEPF2y8ogxel112WbzwwguxYsWKffbd9SjVzjMr93T0as6cOTF79uzG9Zqamjj00ENj/fr1UVZW1oaqAQCAzq62tjaGDh0a/fv3L+i4RRe8Lr/88rj33nvj8ccf32cCHTx4cFRXVzdp27hxY5SWlsaAAQOa3aZ3797Ru3fv3drLysoELwAA6CYKfZlR0cxqmGVZXHbZZXH33XfHo48+GsOHD9/nNmPHjo0lS5Y0aXv44Ydj9OjRzV7fBQAAkELRBK9LL7007rjjjvj5z38e/fv3j+rq6qiuro5t27Y19pkzZ058+ctfblyfMWNGvPbaazF79uxYs2ZN3H777XHbbbfF1Vdf3REvAQAA6KaKJngtWLAgampq4vTTT4/KysrG5a677mrsU1VVFevWrWtcHz58eDzwwAOxdOnSOP744+M73/lOfP/7349zzjmnI14CAADQTRXtfbzaS21tbZSXl0dNTY1rvAAAoItL9f2/aI54AQAAFCvBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAIDHBCwAAILGiCl6PP/54TJ48OYYMGRK5XC7uueeevfZfunRp5HK53ZaXXnqpfQoGAACIiNKOLmB/bN26NY477ri48MIL45xzzmnxdmvXro2ysrLG9UMOOSRFeQAAAM0qquA1ad
KkmDRp0n5vN2jQoDjwwAMLXxAAAEALFNWphq11wgknRGVlZZxxxhnx2GOP7bXv9u3bo7a2tskCAADQFl06eFVWVsatt94aixYtirvvvjtGjBgRZ5xxRjz++ON73Gbu3LlRXl7euAwdOrQdKwYAALqiXJZlWUcX0Rq5XC4WL14cZ5999n5tN3ny5MjlcnHvvfc2+/j27dtj+/btjeu1tbUxdOjQqKmpaXKdGAAA0PXU1tZGeXl5wb//d+kjXs0ZM2ZMvPzyy3t8vHfv3lFWVtZkAQAAaItuF7xWr14dlZWVHV0GAADQjRTVrIZbtmyJ//3f/21cf+WVV+K5556Lgw8+OA499NCYM2dOvP766/HTn/40IiJuuummGDZsWBx99NGxY8eOuOOOO2LRokWxaNGijnoJAABAN1RUwWvVqlXxqU99qnF99uzZERFx/vnnx49//OOoqqqKdevWNT6+Y8eOuPrqq+P111+Pvn37xtFHHx33339/nHXWWe1eOwAA0H0V7eQa7SXVxXUAAEDnY3INAACAIiV4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AQAAJCZ4AeyvbdsiNmzI/wQAaAHBC6ClVqyImDYtol+/iMGD8z+nTYt44omOrgwA6OSKKng9/vjjMXny5BgyZEjkcrm455579rnNsmXLYtSoUdGnT584/PDDY+HChekLBbqeBQsixo+PuO++iIaGfFtDQ3593LgIny0AwF4UVfDaunVrHHfccfGDH/ygRf1feeWVOOuss2LcuHGxevXquO666+KKK66IRYsWJa4U6FJWrIi49NKILIuoq2v6WF1dvv2SSxz5AgD2qLSjC9gfkyZNikmTJrW4/8KFC+PQQw+Nm266KSIijjrqqFi1alXccMMNcc455ySqEuhy5s2LKCnZPXR9UElJxI03Rpx6avvVBQAUjaI64rW/Vq5cGRMmTGjSNnHixFi1alW89957zW6zffv2qK2tbbIA3di2bRG/+tXeQ1dE/vHFi024AQA0q0sHr+rq6qioqGjSVlFREXV1dbFp06Zmt5k7d26Ul5c3LkOHDm2PUoHOqrb2/Wu69qWhId8fAGAXXTp4RUTkcrkm61mWNdu+05w5c6KmpqZxWb9+ffIagU6srCyiRws/Knv0yPcHANhFlw5egwcPjurq6iZtGzdujNLS0hgwYECz2/Tu3TvKysqaLEA31rdvxJQpEaX7uCS2tDRi6tR8fwCAXXTp4DV27NhYsmRJk7aHH344Ro8eHT179uygqoCiM3t2RH393vvU10fMmtU+9QAARaeogteWLVviueeei+eeey4i8tPFP/fcc7Fu3bqIyJ8m+OUvf7mx/4wZM+K1116L2bNnx5o1a+L222+P2267La6++uqOKB8oVqedFjF/fkQut/uRr9LSfPv8+WY0BAD2qKiC16pVq+KEE06IE044ISIiZs+eHSeccEJ84xvfiIiIqqqqxhAWETF8+PB44IEHYunSpXH88cfHd77znfj+979vKnlg/82YEbF8ef60w53XfPXokV9fvjz/OADAHuSynbNN0Kza2tooLy+Pmp
oa13sBedu25WcvLCtzTRcAdDGpvv8X1Q2UATqFvn0FLgBgvxTVqYYAAADFSPACAABITPACAABITPACAABITPACAAB
|
|||
|
"text/plain": [
|
|||
|
"<Figure size 960x540 with 1 Axes>"
|
|||
|
]
|
|||
|
},
|
|||
|
"metadata": {},
|
|||
|
"output_type": "display_data"
|
|||
|
}
|
|||
|
],
|
|||
|
"source": [
|
|||
|
"fig = plot_data(Xo, yo, xlabel=u'metraż', ylabel=u'cena')\n",
|
|||
|
"theta_start = np.matrix([0.0, 0.0]).reshape(2, 1)\n",
|
|||
|
"theta, logs = gradient_descent(cost, gradient, theta_start, Xo, yo, alpha=0.01)\n",
|
|||
|
"plot_regression(fig, h_linear, theta, Xo)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {
|
|||
|
"slideshow": {
|
|||
|
"slide_type": "notes"
|
|||
|
}
|
|||
|
},
|
|||
|
"source": [
|
|||
|
"Na powyższym wykresie widać, że po odrzuceniu obserwacji odstających otrzymujemy dużo bardziej „wiarygodną” krzywą regresji."
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": []
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {},
|
|||
|
"outputs": [],
|
|||
|
"source": []
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"celltoolbar": "Slideshow",
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "Python 3 (ipykernel)",
|
|||
|
"language": "python",
|
|||
|
"name": "python3"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": {
|
|||
|
"name": "ipython",
|
|||
|
"version": 3
|
|||
|
},
|
|||
|
"file_extension": ".py",
|
|||
|
"mimetype": "text/x-python",
|
|||
|
"name": "python",
|
|||
|
"nbconvert_exporter": "python",
|
|||
|
"pygments_lexer": "ipython3",
|
|||
|
"version": "3.9.13"
|
|||
|
},
|
|||
|
"latex_envs": {
|
|||
|
"LaTeX_envs_menu_present": true,
|
|||
|
"autoclose": false,
|
|||
|
"autocomplete": true,
|
|||
|
"bibliofile": "biblio.bib",
|
|||
|
"cite_by": "apalike",
|
|||
|
"current_citInitial": 1,
|
|||
|
"eqLabelWithNumbers": true,
|
|||
|
"eqNumInitial": 1,
|
|||
|
"hotkeys": {
|
|||
|
"equation": "Ctrl-E",
|
|||
|
"itemize": "Ctrl-I"
|
|||
|
},
|
|||
|
"labels_anchors": false,
|
|||
|
"latex_user_defs": false,
|
|||
|
"report_style_numbering": false,
|
|||
|
"user_envs_cfg": false
|
|||
|
},
|
|||
|
"livereveal": {
|
|||
|
"start_slideshow_at": "selected",
|
|||
|
"theme": "white"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 4
|
|||
|
}
|