{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Machine Learning (UMZ) 2019/2020\n",
"### 28 April 2020\n",
"# 7. The $k$-nearest neighbors algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### KNN: intuition"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Which class should the point marked with a star belong to?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Useful imports\n",
"\n",
"import ipywidgets as widgets\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Load the data (iris species)\n",
"\n",
"data_iris = pandas.read_csv('iris.csv')\n",
"data_iris_setosa = pandas.DataFrame()\n",
"data_iris_setosa['petal length'] = data_iris['pl']  # \"pl\" stands for \"petal length\"\n",
"data_iris_setosa['petal width'] = data_iris['pw']  # \"pw\" stands for \"petal width\"\n",
"# 'Gatunek' is the species column of iris.csv\n",
"data_iris_setosa['Iris setosa?'] = data_iris['Gatunek'].apply(lambda x: 1 if x=='Iris-setosa' else 0)\n",
"\n",
"m, n_plus_1 = data_iris_setosa.values.shape\n",
"n = n_plus_1 - 1\n",
"Xn = data_iris_setosa.values[:, 0:n].reshape(m, n)\n",
"\n",
"X = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)\n",
"Y = np.matrix(data_iris_setosa.values[:, 2]).reshape(m, 1)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Plot the data (matrix version)\n",
"def plot_data_for_classification(X, Y, xlabel, ylabel):\n",
"    fig = plt.figure(figsize=(16*.6, 9*.6))\n",
"    ax = fig.add_subplot(111)\n",
"    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
"    X = X.tolist()\n",
"    Y = Y.tolist()\n",
"    # Column 0 of X is the bias term; columns 1 and 2 hold the two features\n",
"    X1n = [x[1] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X1p = [x[1] for x, y in zip(X, Y) if y[0] == 1]\n",
"    X2n = [x[2] for x, y in zip(X, Y) if y[0] == 0]\n",
"    X2p = [x[2] for x, y in zip(X, Y) if y[0] == 1]\n",
"    ax.scatter(X1n, X2n, c='r', marker='x', s=50, label='Data')\n",
"    ax.scatter(X1p, X2p, c='g', marker='o', s=50, label='Data')\n",
"\n",
"    ax.set_xlabel(xlabel)\n",
"    ax.set_ylabel(ylabel)\n",
"    ax.margins(.05, .05)\n",
"    return fig"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"def plot_new_example(fig, x, y):\n",
"    ax = fig.axes[0]\n",
"    ax.scatter([x], [y], c='k', marker='*', s=100, label='?')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X, Y, xlabel='petal length', ylabel='petal width')\n",
"plot_new_example(fig, 2.8, 0.9)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* It seems reasonable to assume that the point marked with a star should be red, because its neighbors are red: the nearest red points lie closer to it than the nearest green ones."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* An algorithm based on this intuition is called the **$k$-nearest neighbors** algorithm (KNN)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Idea (KNN for $k = 1$):\n",
"    1. For a new example $x'$, find the nearest example $x$ in the training set.\n",
"    1. Its class $y$ is the class $y'$ we are looking for."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"from scipy.spatial import Voronoi, voronoi_plot_2d\n",
"\n",
"def plot_voronoi(fig, points):\n",
"    ax = fig.axes[0]\n",
"    vor = Voronoi(points)\n",
"    ax.scatter(vor.vertices[:, 0], vor.vertices[:, 1], s=1)\n",
"\n",
"    # Draw only the finite ridges (index -1 marks a vertex at infinity)\n",
"    for simplex in vor.ridge_vertices:\n",
"        simplex = np.asarray(simplex)\n",
"        if np.all(simplex >= 0):\n",
"            ax.plot(vor.vertices[simplex, 0], vor.vertices[simplex, 1],\n",
"                    color='orange', linewidth=1)\n",
"\n",
"    xmin, ymin = points.min(axis=0).tolist()[0]\n",
"    xmax, ymax = points.max(axis=0).tolist()[0]\n",
"    pad = 0.1\n",
"    ax.set_xlim(xmin - pad, xmax + pad)\n",
"    ax.set_ylim(ymin - pad, ymax + pad)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X, Y, xlabel='petal length', ylabel='petal width')\n",
"plot_new_example(fig, 2.8, 0.9)\n",
"plot_voronoi(fig, X[:, 1:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A partition of the plane like the one in the plot above is called a **Voronoi diagram**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* This procedure yields quite impressive class boundaries, especially for such a simple algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* Unfortunately, it is very sensitive to outliers:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"X_outliers = np.vstack((X, np.matrix([[1.0, 3.9, 1.7]])))\n",
"Y_outliers = np.vstack((Y, np.matrix([[1]])))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X_outliers, Y_outliers, xlabel='petal length', ylabel='petal width')\n",
"plot_new_example(fig, 2.8, 0.9)\n",
"plot_voronoi(fig, X_outliers[:, 1:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* A single outlier changes the class boundaries dramatically."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* To remedy this, we use more than one nearest neighbor ($k > 1$)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for classification\n",
"\n",
"1. Given a training set of examples $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is a class label.\n",
"1. Given a test example $x'$ whose class we want to determine.\n",
"1. Compute the distance $d(x', x_i)$ for every example $x_i$ in the training set.\n",
"1. Select the $k$ examples $x_{i_1}, \\ldots, x_{i_k}$ with the smallest distances.\n",
"1. Return as $y'$ the class that occurs most often among $y_{i_1}, \\ldots, y_{i_k}$."
]
},
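{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The majority vote in the last step can also be written as a formula. This is only a restatement of the steps above; the symbol $\\mathcal{C}$ for the set of classes is introduced here for illustration and is not used elsewhere in these notes:\n",
"$$ y' = \\underset{c \\in \\mathcal{C}}{\\arg\\max} \\sum_{j=1}^{k} \\mathbf{1}_{y_{i_j} = c} $$\n",
"For example, with $k = 3$ and neighbor classes $(1, 0, 1)$, class 1 gets two votes and class 0 gets one, so $y' = 1$."
]
},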
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for classification: an example"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Euclidean distance\n",
"def euclidean_distance(x1, x2):\n",
"    return np.linalg.norm(x1 - x2)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# The k-nearest neighbors algorithm\n",
"def knn(X, Y, x_new, k, distance=euclidean_distance):\n",
"    data = np.concatenate((X, Y), axis=1)\n",
"    # Sort the training rows by distance to x_new and keep the k closest\n",
"    nearest = sorted(\n",
"        data, key=lambda xy: distance(xy[0, :-1], x_new))[:k]\n",
"    y_nearest = [xy[0, -1] for xy in nearest]\n",
"    # Majority vote over the classes of the k nearest neighbors\n",
"    return max(y_nearest, key=lambda y: y_nearest.count(y))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Plot the class boundary for KNN\n",
"def plot_knn(fig, X, Y, k, distance=euclidean_distance):\n",
"    ax = fig.axes[0]\n",
"    x1min, x2min = X.min(axis=0).tolist()[0]\n",
"    x1max, x2max = X.max(axis=0).tolist()[0]\n",
"    pad1 = (x1max - x1min) / 10\n",
"    pad2 = (x2max - x2min) / 10\n",
"    step1 = (x1max - x1min) / 50\n",
"    step2 = (x2max - x2min) / 50\n",
"    x1grid, x2grid = np.meshgrid(\n",
"        np.arange(x1min - pad1, x1max + pad1, step1),\n",
"        np.arange(x2min - pad2, x2max + pad2, step2))\n",
"    # Classify every grid point; the 0.5 contour separates class 0 from class 1\n",
"    z = np.matrix([[knn(X, Y, [x1, x2], k, distance)\n",
"                    for x1, x2 in zip(x1row, x2row)]\n",
"                   for x1row, x2row in zip(x1grid, x2grid)])\n",
"    plt.contour(x1grid, x2grid, z, levels=[0.5]);"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Set up the interactive plot\n",
"\n",
"slider_k = widgets.IntSlider(min=1, max=10, step=1, value=1, description=r'$k$', width=300)\n",
"\n",
"def interactive_knn_1(k):\n",
"    fig = plot_data_for_classification(X_outliers, Y_outliers, xlabel='petal length', ylabel='petal width')\n",
"    plot_voronoi(fig, X_outliers[:, 1:])\n",
"    plot_knn(fig, X_outliers[:, 1:], Y_outliers, k)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f17d13b7c8634fcb9ca571208f1c9212",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='$k$', max=10, min=1), Button(description='Run Interact',…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.interactive_knn_1(k)>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"widgets.interact_manual(interactive_knn_1, k=slider_k)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Load the data (another example)\n",
"\n",
"alldata = pandas.read_csv('classification.tsv', sep='\\t')\n",
"data = np.matrix(alldata)\n",
"\n",
"m, n_plus_1 = data.shape\n",
"n = n_plus_1 - 1\n",
"Xn = data[:, 1:].reshape(m, n)\n",
"\n",
"X2 = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)\n",
"Y2 = np.matrix(data[:, 0]).reshape(m, 1)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 691.2x388.8 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Set up the interactive plot\n",
"\n",
"slider_k = widgets.IntSlider(min=1, max=10, step=1, value=1, description=r'$k$', width=300)\n",
"\n",
"def interactive_knn_2(k):\n",
"    fig = plot_data_for_classification(X2, Y2, xlabel=r'$x_1$', ylabel=r'$x_2$')\n",
"    plot_voronoi(fig, X2[:, 1:])\n",
"    plot_knn(fig, X2[:, 1:], Y2, k)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6c38b927db174f9f9ddb05513daf2612",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=1, description='$k$', max=10, min=1), Button(description='Run Interact',…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.interactive_knn_2(k)>"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"widgets.interact_manual(interactive_knn_2, k=slider_k)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### The $k$-nearest neighbors algorithm for regression\n",
"\n",
"1. Given a training set of examples $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i$ is a real number.\n",
"1. Given a test example $x'$ for which we want to predict the value $y'$.\n",
"1. Compute the distance $d(x', x_i)$ for every example $x_i$ in the training set.\n",
"1. Select the $k$ examples $x_{i_1}, \\ldots, x_{i_k}$ with the smallest distances.\n",
"1. Return as $y'$ the mean of the numbers $y_{i_1}, \\ldots, y_{i_k}$:\n",
"    $$ y' = \\frac{1}{k} \\sum_{j=1}^{k} y_{i_j} $$"
]
},
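{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The regression steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the lecture: the name `knn_regression`, the use of plain arrays instead of `np.matrix`, the Euclidean distance and the toy data are all assumptions made here.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def knn_regression(X_train, y_train, x_new, k):\n",
"    # Hypothetical helper: mean target of the k nearest training examples\n",
"    X_train = np.asarray(X_train, dtype=float)\n",
"    y_train = np.asarray(y_train, dtype=float)\n",
"    # Euclidean distance from x_new to every training example\n",
"    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)\n",
"    # Indices of the k smallest distances\n",
"    nearest = np.argsort(distances)[:k]\n",
"    return y_train[nearest].mean()\n",
"\n",
"# Toy data made up for illustration: y is exactly 2 * x\n",
"X_toy = [[1.0], [2.0], [3.0], [10.0]]\n",
"y_toy = [2.0, 4.0, 6.0, 20.0]\n",
"prediction = knn_regression(X_toy, y_toy, [2.5], k=2)\n",
"print(prediction)  # mean of 4.0 and 6.0, i.e. 5.0\n",
"```\n",
"\n",
"With $k = 2$ the two nearest neighbors of $2.5$ are $x = 2$ and $x = 3$, so the prediction is the mean of their targets."
]
},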
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Choosing $k$\n",
"\n",
"* The value of $k$ strongly affects the result of the KNN algorithm:\n",
"    * If $k$ is too large (in the extreme, equal to the size of the training set), every new example is assigned to the majority class.\n",
"    * If $k$ is too small, the class boundaries are unstable and the algorithm is very sensitive to outliers.\n",
"* The best way to choose $k$ is to compare candidate values on a validation set."
]
},
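{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"One possible way to choose $k$ with a validation set is sketched below. It is an illustration under stated assumptions, not the lecture's procedure: `knn_predict` and `choose_k` are hypothetical helpers, the data is synthetic, accuracy is used as the selection criterion, and only odd values of $k$ are tried to reduce ties between two classes.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from collections import Counter\n",
"\n",
"def knn_predict(X_train, y_train, x_new, k):\n",
"    # Majority class among the k nearest training examples (Euclidean distance)\n",
"    distances = np.linalg.norm(X_train - x_new, axis=1)\n",
"    nearest = np.argsort(distances)[:k]\n",
"    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]\n",
"\n",
"def choose_k(X_train, y_train, X_val, y_val, candidates):\n",
"    # Return the candidate k with the highest accuracy on the validation set\n",
"    best_k, best_acc = None, -1.0\n",
"    for k in candidates:\n",
"        predictions = np.array([knn_predict(X_train, y_train, x, k) for x in X_val])\n",
"        acc = float(np.mean(predictions == y_val))\n",
"        if acc > best_acc:\n",
"            best_k, best_acc = k, acc\n",
"    return best_k, best_acc\n",
"\n",
"# Synthetic two-class data (an assumption, for illustration only)\n",
"rng = np.random.default_rng(0)\n",
"X_all = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(3.0, 1.0, (60, 2))])\n",
"y_all = np.array([0] * 60 + [1] * 60)\n",
"order = rng.permutation(len(y_all))\n",
"train_idx, val_idx = order[:80], order[80:]\n",
"best_k, best_acc = choose_k(X_all[train_idx], y_all[train_idx],\n",
"                            X_all[val_idx], y_all[val_idx], candidates=range(1, 16, 2))\n",
"print(best_k, best_acc)\n",
"```\n",
"\n",
"The validation examples are never used as neighbors, so the reported accuracy estimates how each $k$ behaves on unseen data."
]
},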
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Similarity measures (distance metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Euclidean distance\n",
"$$ d(x, x') = \\sqrt{ \\sum_{i=1}^n \\left( x_i - x'_i \\right) ^2 } $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A good choice for numerical features.\n",
"* Symmetric; treats all dimensions equally.\n",
"* Sensitive to large fluctuations in a single feature."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Hamming distance\n",
"$$ d(x, x') = \\sum_{i=1}^n \\mathbf{1}_{x_i \\neq x'_i} $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* A good choice for binary (0/1) features.\n",
"* Equal to the number of features on which the two examples differ."
]
},
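{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The Hamming distance is straightforward to compute. The sketch below is an illustration only; the function name `hamming_distance` and the two example vectors are assumptions made here.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def hamming_distance(x1, x2):\n",
"    # Number of positions at which the two feature vectors differ\n",
"    x1, x2 = np.asarray(x1), np.asarray(x2)\n",
"    return int(np.sum(x1 != x2))\n",
"\n",
"a = [1, 0, 1, 1, 0]\n",
"b = [1, 1, 1, 0, 0]\n",
"print(hamming_distance(a, b))  # the vectors differ at positions 1 and 3, so 2\n",
"```"
]
},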
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Minkowski distance ($p$-norm)\n",
"$$ d(x, x') = \\sqrt[p]{ \\sum_{i=1}^n \\left| x_i - x'_i \\right| ^p } $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
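},
"source": [
"* For $p = 2$ this is the Euclidean distance.\n",
"* For $p = 1$ this is the taxicab (Manhattan) distance.\n",
"* As $p \\to \\infty$, the $p$-norm approaches the maximum (Chebyshev) distance $\\max_i |x_i - x'_i|$; for 0/1 features this acts like a logical OR of the per-feature differences.\n",
"* As $p \\to 0$, the averaged variant $\\left( \\frac{1}{n} \\sum_{i=1}^n |x_i - x'_i|^p \\right)^{1/p}$ approaches the geometric mean of the differences, which is nonzero only when every feature differs, so loosely speaking it acts like a logical AND. Note that for $p < 1$ the triangle inequality no longer holds."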
]
},
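{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The special cases of the Minkowski distance can be checked numerically. The following is a small sketch, assuming a hypothetical helper `minkowski_distance` that treats `p = np.inf` as the maximum (Chebyshev) distance; the example vectors are made up.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def minkowski_distance(x1, x2, p):\n",
"    # p-norm of the difference; p = np.inf gives the Chebyshev (maximum) distance\n",
"    diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))\n",
"    if np.isinf(p):\n",
"        return float(diff.max())\n",
"    return float(np.sum(diff ** p) ** (1.0 / p))\n",
"\n",
"u, v = [0.0, 0.0], [3.0, 4.0]\n",
"print(minkowski_distance(u, v, 1))       # taxicab: 3 + 4 = 7\n",
"print(minkowski_distance(u, v, 2))       # Euclidean: sqrt(9 + 16) = 5\n",
"print(minkowski_distance(u, v, np.inf))  # Chebyshev: max(3, 4) = 4\n",
"```\n",
"\n",
"As $p$ grows, the largest coordinate difference dominates the sum, which is why the value moves from 7 toward 4."
]
},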
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### KNN: practical tips"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* What to do about ties?\n",
"    * Pick one of the tied classes at random.\n",
"    * Pick the class with the higher _a priori_ probability.\n",
"    * Pick the class indicated by the 1NN algorithm."
]
},
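{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The last tie-breaking option can be sketched as follows. This is one possible reading of the 1NN rule, not the lecture's implementation: the hypothetical function `knn_with_tie_break` returns the class of the closest neighbor that belongs to one of the tied classes, and the toy data is made up.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"import numpy as np\n",
"\n",
"def knn_with_tie_break(X_train, y_train, x_new, k):\n",
"    # Majority vote among the k nearest neighbors; ties fall back to the nearest one\n",
"    X_train = np.asarray(X_train, dtype=float)\n",
"    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)\n",
"    nearest = np.argsort(distances, kind='stable')[:k]\n",
"    votes = Counter(y_train[i] for i in nearest)\n",
"    top_count = max(votes.values())\n",
"    tied = [label for label, count in votes.items() if count == top_count]\n",
"    if len(tied) == 1:\n",
"        return tied[0]\n",
"    # Tie: return the label of the closest neighbor whose class is among the tied ones\n",
"    for i in nearest:\n",
"        if y_train[i] in tied:\n",
"            return y_train[i]\n",
"\n",
"X_toy = [[0.0], [1.0], [4.0], [5.0]]\n",
"y_toy = ['red', 'red', 'green', 'green']\n",
"print(knn_with_tie_break(X_toy, y_toy, [2.0], k=4))  # 2 vs 2 tie; the nearest point is red\n",
"```\n",
"\n",
"Unlike a random choice, this rule is deterministic, so repeated predictions for the same point agree."
]
},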
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* KNN copes poorly with missing feature values, because distances cannot then be computed in a meaningful way."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"livereveal": {
"start_slideshow_at": "selected",
"theme": "amu"
}
},
"nbformat": 4,
"nbformat_minor": 4
}