{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Machine learning: applications\n",
"# 7. The $k$-nearest neighbors algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### KNN: intuition"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Which class should the point marked with a star belong to?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Useful imports\n",
"\n",
"import ipywidgets as widgets\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Load the data (iris species)\n",
"\n",
"data_iris = pandas.read_csv('iris.csv')\n",
"data_iris_setosa = pandas.DataFrame()\n",
"data_iris_setosa['petal length'] = data_iris['pl']  # \"pl\" stands for \"petal length\"\n",
"data_iris_setosa['petal width'] = data_iris['pw']  # \"pw\" stands for \"petal width\"\n",
"data_iris_setosa['Iris setosa?'] = data_iris['Gatunek'].apply(lambda x: 1 if x=='Iris-setosa' else 0)\n",
"\n",
"m, n_plus_1 = data_iris_setosa.values.shape\n",
"n = n_plus_1 - 1\n",
"Xn = data_iris_setosa.values[:, 0:n].reshape(m, n)\n",
"\n",
"X = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)\n",
"Y = np.matrix(data_iris_setosa.values[:, 2]).reshape(m, 1)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Plot the data (matrix version)\n",
"def plot_data_for_classification(X, Y, xlabel, ylabel):\n",
"    fig = plt.figure(figsize=(16*.6, 9*.6))\n",
"    ax = fig.add_subplot(111)\n",
"    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
"    X = X.tolist()\n",
"    Y = Y.tolist()\n",
"    # Split the points into negative (y == 0) and positive (y == 1) examples\n",
"    X1n = [x[1] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X1p = [x[1] for x, y in zip(X, Y) if y[0] == 1]\n",
"    X2n = [x[2] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X2p = [x[2] for x, y in zip(X, Y) if y[0] == 1]\n",
"    ax.scatter(X1n, X2n, c='r', marker='x', s=50, label='Data')\n",
"    ax.scatter(X1p, X2p, c='g', marker='o', s=50, label='Data')\n",
"\n",
"    ax.set_xlabel(xlabel)\n",
"    ax.set_ylabel(ylabel)\n",
"    ax.margins(.05, .05)\n",
"    return fig"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"def plot_new_example(fig, x, y):\n",
"    ax = fig.axes[0]\n",
"    ax.scatter([x], [y], c='k', marker='*', s=100, label='?')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X, Y, xlabel=u'petal length', ylabel=u'petal width')\n",
"plot_new_example(fig, 2.8, 0.9)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* It seems reasonable to assume that the point marked with a star should be red, because its neighboring points are red: the nearest red points lie closer to it than the nearest green ones."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* An algorithm based on this intuition is called the **$k$-nearest neighbors** algorithm (KNN)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* Idea (KNN for $k = 1$):\n",
"    1. For a new example $x'$, find the nearest example $x$ in the training set.\n",
"    1. Its class $y$ is the class $y'$ we are looking for."
]
},
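{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The 1NN idea above can be sketched in a few lines of NumPy. This is only an illustrative sketch, assuming Euclidean distance and plain arrays; the names `nearest_neighbor_class`, `X_train`, `y_train` and the toy data are hypothetical and are not part of the lecture code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def nearest_neighbor_class(X_train, y_train, x_new):\n",
"    # Euclidean distance from x_new to every training example\n",
"    distances = np.linalg.norm(X_train - x_new, axis=1)\n",
"    # The class of the closest training example is the prediction\n",
"    return y_train[np.argmin(distances)]\n",
"\n",
"# Toy data (made up for illustration): two features, two classes\n",
"X_train = np.array([[1.0, 0.2], [1.5, 0.3], [4.5, 1.5], [5.0, 1.8]])\n",
"y_train = np.array([1, 1, 0, 0])\n",
"prediction = nearest_neighbor_class(X_train, y_train, np.array([2.8, 0.9]))\n",
"print(prediction)"
]
},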
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"from scipy.spatial import Voronoi, voronoi_plot_2d\n",
"\n",
"def plot_voronoi(fig, points):\n",
"    ax = fig.axes[0]\n",
"    vor = Voronoi(points)\n",
"    ax.scatter(vor.vertices[:, 0], vor.vertices[:, 1], s=1)\n",
"\n",
"    # Draw only the finite ridges (index -1 marks a vertex outside the diagram)\n",
"    for simplex in vor.ridge_vertices:\n",
"        simplex = np.asarray(simplex)\n",
"        if np.all(simplex >= 0):\n",
"            ax.plot(vor.vertices[simplex, 0], vor.vertices[simplex, 1],\n",
"                    color='orange', linewidth=1)\n",
"\n",
"    # Limit the axes to the range of the data plus a small margin\n",
"    xmin, ymin = points.min(axis=0).tolist()[0]\n",
"    xmax, ymax = points.max(axis=0).tolist()[0]\n",
"    pad = 0.1\n",
"    ax.set_xlim(xmin - pad, xmax + pad)\n",
"    ax.set_ylim(ymin - pad, ymax + pad)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X, Y, xlabel=u'petal length', ylabel=u'petal width')\n",
"plot_new_example(fig, 2.8, 0.9)\n",
"plot_voronoi(fig, X[:, 1:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* The partition of the plane shown in the plot above is called a **Voronoi diagram**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* This approach produces fairly impressive class boundaries, especially for such a simple algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Unfortunately, it is very sensitive to outliers:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"X_outliers = np.vstack((X, np.matrix([[1.0, 3.9, 1.7]])))\n",
"Y_outliers = np.vstack((Y, np.matrix([[1]])))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X_outliers, Y_outliers, xlabel=u'petal length', ylabel=u'petal width')\n",
"plot_new_example(fig, 2.8, 0.9)\n",
"plot_voronoi(fig, X_outliers[:, 1:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* A single outlier dramatically changes the class boundaries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* To remedy this, we use more than one nearest neighbor ($k > 1$)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for classification\n",
"\n",
"1. We are given a training set of examples $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is a class.\n",
"1. We are given a test example $x'$ whose class we want to determine.\n",
"1. Compute the distance $d(x', x_i)$ for every example $x_i$ in the training set.\n",
"1. Select the $k$ examples $x_{i_1}, \\ldots, x_{i_k}$ with the smallest computed distances.\n",
"1. Return as the result $y'$ the class that occurs most often among $y_{i_1}, \\ldots, y_{i_k}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for classification: an example"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Euclidean distance\n",
"def euclidean_distance(x1, x2):\n",
"    return np.linalg.norm(x1 - x2)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# The k-nearest neighbors algorithm for a single observation\n",
"def knn(X, Y, x_new, k, distance=euclidean_distance):\n",
"    \"\"\"Return the class predicted by the KNN algorithm for a single\n",
"    observation x_new, i.e. the most common class among its k nearest neighbors.\n",
"\n",
"    Arguments:\n",
"    X, Y - training set\n",
"    x_new - observation for which we want a prediction\n",
"    k - number of neighbors\n",
"    distance - distance function\n",
"    \"\"\"\n",
"    data = np.concatenate((X, Y), axis=1)\n",
"    # Sort the training examples by distance to x_new and keep the k closest\n",
"    nearest = sorted(\n",
"        data, key=lambda xy: distance(xy[0, :-1], x_new))[:k]\n",
"    y_nearest = [xy[0, -1] for xy in nearest]\n",
"    # Majority vote among the labels of the k nearest neighbors\n",
"    return max(y_nearest, key=lambda y: y_nearest.count(y))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Plot the class boundary for KNN\n",
"def plot_knn(fig, X, Y, k, distance=euclidean_distance):\n",
"    ax = fig.axes[0]\n",
"    x1min, x2min = X.min(axis=0).tolist()[0]\n",
"    x1max, x2max = X.max(axis=0).tolist()[0]\n",
"    pad1 = (x1max - x1min) / 10\n",
"    pad2 = (x2max - x2min) / 10\n",
"    step1 = (x1max - x1min) / 50\n",
"    step2 = (x2max - x2min) / 50\n",
"    x1grid, x2grid = np.meshgrid(\n",
"        np.arange(x1min - pad1, x1max + pad1, step1),\n",
"        np.arange(x2min - pad2, x2max + pad2, step2))\n",
"    # Classify every grid point, then draw the contour between classes 0 and 1\n",
"    z = np.matrix([[knn(X, Y, [x1, x2], k, distance)\n",
"                    for x1, x2 in zip(x1row, x2row)]\n",
"                   for x1row, x2row in zip(x1grid, x2grid)])\n",
"    plt.contour(x1grid, x2grid, z, levels=[0.5]);"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"# Prepare the interactive plot\n",
"\n",
"slider_k = widgets.IntSlider(min=1, max=10, step=1, value=1, description=r'$k$', width=300)\n",
"\n",
"def interactive_knn_1(k):\n",
"    fig = plot_data_for_classification(X_outliers, Y_outliers, xlabel=u'petal length', ylabel=u'petal width')\n",
"    plot_voronoi(fig, X_outliers[:, 1:])\n",
"    plot_knn(fig, X_outliers[:, 1:], Y_outliers, k)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4edcfd52a18e4381867255f3eaf06f7c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='$k$', max=10, min=1), Button(description='Run Interact',…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.interactive_knn_1(k)>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"widgets.interact_manual(interactive_knn_1, k=slider_k)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Load the data (another example)\n",
"\n",
"alldata = pandas.read_csv('classification.tsv', sep='\\t')\n",
"data = np.matrix(alldata)\n",
"\n",
"m, n_plus_1 = data.shape\n",
"n = n_plus_1 - 1\n",
"Xn = data[:, 1:].reshape(m, n)\n",
"\n",
"X2 = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)\n",
"Y2 = np.matrix(data[:, 0]).reshape(m, 1)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Prepare the interactive plot\n",
"\n",
"slider_k = widgets.IntSlider(min=1, max=10, step=1, value=1, description=r'$k$', width=300)\n",
"\n",
"def interactive_knn_2(k):\n",
"    fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$')\n",
"    plot_voronoi(fig, X2[:, 1:])\n",
"    plot_knn(fig, X2[:, 1:], Y2, k)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "55c13a313b8046ebbf77be896397a52f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='$k$', max=10, min=1), Button(description='Run Interact',…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.interactive_knn_2(k)>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"widgets.interact_manual(interactive_knn_2, k=slider_k)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for regression\n",
"\n",
"1. We are given a training set of examples $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is a real number.\n",
"1. We are given a test example $x'$ for which we want to predict the value.\n",
"1. Compute the distance $d(x', x_i)$ for every example $x_i$ in the training set.\n",
"1. Select the $k$ examples $x_{i_1}, \\ldots, x_{i_k}$ with the smallest computed distances.\n",
"1. Return as the result $y'$ the mean of the numbers $y_{i_1}, \\ldots, y_{i_k}$:\n",
"   $$ y' = \\frac{1}{k} \\sum_{j=1}^{k} y_{i_j} $$"
]
},
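{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The regression steps above could be sketched as follows. This is a minimal illustration, assuming Euclidean distance and NumPy arrays; the function name `knn_regression` and the toy data are hypothetical and do not come from the lecture code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def knn_regression(X_train, y_train, x_new, k):\n",
"    # Euclidean distance from x_new to every training example\n",
"    distances = np.linalg.norm(X_train - x_new, axis=1)\n",
"    # Indices of the k training examples with the smallest distances\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # The prediction is the mean of the neighbors' target values\n",
"    return y_train[nearest].mean()\n",
"\n",
"# Toy data (made up for illustration): one feature, real-valued target\n",
"X_train = np.array([[1.0], [2.0], [3.0], [10.0]])\n",
"y_train = np.array([1.0, 2.0, 3.0, 20.0])\n",
"y_pred = knn_regression(X_train, y_train, np.array([2.2]), k=3)\n",
"print(y_pred)"
]
},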
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Choosing $k$\n",
"\n",
"* The value of $k$ strongly affects the result of the KNN algorithm:\n",
"    * If $k$ is too large, all new examples are classified as the majority class.\n",
"    * If $k$ is too small, the class boundaries are unstable and the algorithm is very sensitive to outliers.\n",
"* To select the optimal value of $k$, it is best to use a validation set."
]
},
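{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"One possible way to select $k$ on a validation set is sketched below: evaluate the accuracy for several candidate values and keep the best one. The helpers `knn_predict` and `choose_k`, the candidate list and the synthetic two-blob data are assumptions made for this illustration only; with real data the split and the candidates would need to be chosen with more care."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def knn_predict(X_train, y_train, x_new, k):\n",
"    distances = np.linalg.norm(X_train - x_new, axis=1)\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # Majority vote among the k nearest labels (ties go to the smallest label)\n",
"    labels, counts = np.unique(y_train[nearest], return_counts=True)\n",
"    return labels[np.argmax(counts)]\n",
"\n",
"def choose_k(X_train, y_train, X_val, y_val, candidates):\n",
"    # Validation accuracy for every candidate value of k\n",
"    accuracy = {}\n",
"    for k in candidates:\n",
"        predictions = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])\n",
"        accuracy[k] = float(np.mean(predictions == y_val))\n",
"    # Keep the candidate with the highest accuracy (the first one on ties)\n",
"    best_k = max(candidates, key=lambda k: accuracy[k])\n",
"    return best_k, accuracy\n",
"\n",
"# Synthetic data (made up for illustration): two Gaussian blobs\n",
"rng = np.random.default_rng(0)\n",
"X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(2.5, 1.0, (60, 2))])\n",
"y = np.array([0] * 60 + [1] * 60)\n",
"order = rng.permutation(len(y))\n",
"train, val = order[:80], order[80:]\n",
"best_k, accuracy = choose_k(X[train], y[train], X[val], y[val], candidates=[1, 3, 5, 7, 9])\n",
"print(best_k, accuracy)"
]
},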
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Similarity measures"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Euclidean distance\n",
"$$ d(x, x') = \\sqrt{ \\sum_{i=1}^n \\left( x_i - x'_i \\right) ^2 } $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A good choice for numerical features.\n",
"* Symmetric; treats all dimensions equally.\n",
"* Sensitive to large fluctuations in a single feature."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Hamming distance\n",
"$$ d(x, x') = \\sum_{i=1}^n \\mathbf{1}_{x_i \\neq x'_i} $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A good choice for binary (0/1) features.\n",
"* The number of features on which the two examples differ."
]
},
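{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"As a small illustration of the Hamming distance, the sketch below counts the positions at which two binary vectors differ. The function name `hamming_distance` and the example vectors are hypothetical; a function like this could be passed as a distance function to a KNN implementation that accepts one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def hamming_distance(x1, x2):\n",
"    # Count the positions at which the two vectors differ\n",
"    return int(np.sum(np.asarray(x1) != np.asarray(x2)))\n",
"\n",
"# Example vectors (made up for illustration)\n",
"d = hamming_distance([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])\n",
"print(d)"
]
},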
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Minkowski distance ($p$-norm)\n",
"$$ d(x, x') = \\sqrt[p]{ \\sum_{i=1}^n \\left| x_i - x'_i \\right| ^p } $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
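},
"source": [
"* For $p = 2$ it is the Euclidean distance.\n",
"* For $p = 1$ it is the taxicab (Manhattan) distance.\n",
"* As $p \\to \\infty$, the $p$-norm approaches the largest coordinate difference $\\max_i |x_i - x'_i|$ (the Chebyshev distance), which for binary features behaves like a logical disjunction (OR).\n",
"* As $p \\to 0$, the $p$-norm behaves, informally, like a logical conjunction (AND); note that for $p < 1$ the formula is no longer a true metric."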
]
},
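{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The special cases of the Minkowski distance can be checked numerically. The sketch below is only an illustration: `minkowski_distance` and the example points are hypothetical, and a large finite $p$ (here 50) is used as a stand-in for the limit $p \\to \\infty$."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def minkowski_distance(x1, x2, p):\n",
"    # Absolute coordinate differences raised to the power p, summed, then the p-th root\n",
"    diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))\n",
"    return float(np.sum(diff ** p) ** (1.0 / p))\n",
"\n",
"# Example points (made up for illustration)\n",
"x, x_prime = [0.0, 0.0], [3.0, 4.0]\n",
"d1 = minkowski_distance(x, x_prime, 1)        # taxicab distance\n",
"d2 = minkowski_distance(x, x_prime, 2)        # Euclidean distance\n",
"d_large = minkowski_distance(x, x_prime, 50)  # close to the largest coordinate difference\n",
"print(d1, d2, d_large)"
]
},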
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### KNN: practical tips"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* What to do about ties?\n",
"    * Pick a class at random.\n",
"    * Pick the class with the higher _a priori_ probability.\n",
"    * Pick the class indicated by the 1NN algorithm."
]
},
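{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A possible implementation of the last tie-breaking option is sketched below: when several classes share the top vote count, the label of the closest neighbor among the tied classes wins, which reduces to the 1NN answer when all classes are tied. The function name `knn_with_tie_break` and the toy data are hypothetical; this is one reasonable reading of the rule, not the only one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"from collections import Counter\n",
"import numpy as np\n",
"\n",
"def knn_with_tie_break(X_train, y_train, x_new, k):\n",
"    distances = np.linalg.norm(X_train - x_new, axis=1)\n",
"    nearest = np.argsort(distances)[:k]  # sorted from closest to farthest\n",
"    votes = Counter(y_train[nearest].tolist())\n",
"    top_count = max(votes.values())\n",
"    tied = {label for label, count in votes.items() if count == top_count}\n",
"    # Walk through the neighbors from the closest one and return the first\n",
"    # label that belongs to the set of top-voted classes\n",
"    for index in nearest:\n",
"        if y_train[index] in tied:\n",
"            return y_train[index]\n",
"\n",
"# Toy data (made up for illustration): with k=4 both classes get two votes\n",
"X_train = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])\n",
"y_train = np.array([0, 0, 1, 1])\n",
"prediction = knn_with_tie_break(X_train, y_train, np.array([2.4, 0.0]), k=4)\n",
"print(prediction)"
]
},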
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* KNN copes poorly with missing feature values (the distance cannot then be computed sensibly)."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"livereveal": {
"start_slideshow_at": "selected",
"theme": "white"
}
},
"nbformat": 4,
"nbformat_minor": 4
}