{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Machine Learning\n",
"# 7. Unsupervised Learning"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Imagine the following problem:\n",
"\n",
"We have a collection of plant specimens and some measurements describing them (petal length, petal width, etc.), but we **do not know** which species they belong to (we do not even know how many species there are).\n",
"\n",
"We want to automatically split the specimens into at most $k$ groups (clusters), with $k$ fixed in advance; in other words, we want to perform **clustering** (cluster analysis) of the set of examples.\n",
"\n",
"This is an unsupervised learning problem.\n",
"\n",
"To solve it, we will use the $k$-means algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 7.1. The $k$-means algorithm"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Useful imports\n",
"\n",
"import ipywidgets as widgets\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas\n",
"import random\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" x1 x2\n",
"0 1.4 3.4\n",
"1 1.5 3.7\n",
"2 5.6 3.1\n",
"3 5.1 3.2\n",
"4 4.5 2.5\n",
".. ... ...\n",
"145 1.2 4.0\n",
"146 1.4 3.5\n",
"147 5.8 3.0\n",
"148 5.4 3.1\n",
"149 5.7 3.3\n",
"\n",
"[150 rows x 2 columns]\n"
]
}
],
"source": [
"# Load the data (iris species)\n",
"\n",
"data_iris_raw = pandas.read_csv(\"iris.csv\")\n",
"\n",
"# We do not use the last column (\"Gatunek\", the species label) at all,\n",
"# because we want to perform unsupervised learning.\n",
"# We assume that we have no information about the species;\n",
"# all we have is 150 unknown plants.\n",
"\n",
"# To make the k-means algorithm easier to illustrate, we restrict ourselves to two features.\n",
"\n",
"data_iris = pandas.DataFrame()\n",
"data_iris[\"x1\"] = data_iris_raw[\"pl\"]\n",
"data_iris[\"x2\"] = data_iris_raw[\"sw\"]\n",
"\n",
"print(data_iris)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# To simplify the notation, store the whole dataset in X\n",
"# and its columns (features) in X1 and X2\n",
"\n",
"X = data_iris.values\n",
"X1 = data_iris[\"x1\"].tolist()\n",
"X2 = data_iris[\"x2\"].tolist()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Plot of the data\n",
"def plot_unlabeled_data(X1, X2, x1label=r\"$x_1$\", x2label=r\"$x_2$\"):\n",
"    fig = plt.figure(figsize=(16 * 0.7, 9 * 0.7))\n",
"    ax = fig.add_subplot(111)\n",
"    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
"    ax.scatter(X1, X2, c=\"k\", marker=\"o\", s=50, label=\"Data\")\n",
"    ax.set_xlabel(x1label)\n",
"    ax.set_ylabel(x2label)\n",
"    ax.margins(0.05, 0.05)\n",
"    return fig"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1120x630 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Generate the plot\n",
"fig = plot_unlabeled_data(X1, X2, x1label=\"$x_1$\", x2label=\"$x_2$\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Euclidean distance\n",
"def euclidean_distance(x1, x2):\n",
"    return np.linalg.norm(x1 - x2)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Cost function\n",
"def cost_function(X, Y, centroids, distance):\n",
"    if len(Y) == 0:\n",
"        return np.nan  # no assignments yet, so the cost is undefined\n",
"    return np.mean([distance(x, centroids[y]) ** 2 for x, y in zip(X, Y)])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# k-means algorithm\n",
"def k_means(X, k, distance=euclidean_distance):\n",
"    history = []\n",
"    Y = []\n",
"\n",
"    # Draw a random centroid for each cluster\n",
"    centroids = [\n",
"        [random.uniform(X.min(axis=0)[f], X.max(axis=0)[f]) for f in range(X.shape[1])]\n",
"        for c in range(k)\n",
"    ]\n",
"    cost = cost_function(X, Y, centroids, distance)\n",
"    history.append((centroids, Y, cost))\n",
"\n",
"    # Repeat as long as the cluster assignments keep changing\n",
"    while True:\n",
"        distances = [[distance(centroids[c], x) for c in range(k)] for x in X]\n",
"        Y_new = [d.index(min(d)) for d in distances]\n",
"        if Y_new == Y:\n",
"            break\n",
"        Y = Y_new\n",
"        cost = cost_function(X, Y, centroids, distance)\n",
"        history.append((centroids, Y, cost))\n",
"        # Move each centroid to the mean of its cluster;\n",
"        # a cluster with no points keeps its previous centroid\n",
"        Y_arr = np.array(Y)\n",
"        centroids = [\n",
"            list(X[Y_arr == c].mean(axis=0)) if np.any(Y_arr == c) else centroids[c]\n",
"            for c in range(k)\n",
"        ]\n",
"        cost = cost_function(X, Y, centroids, distance)\n",
"        history.append((centroids, Y, cost))\n",
"\n",
"    result = history[-1][1]\n",
"    return result, history"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Plot of the data: clustering\n",
"def plot_clusters(X, Y, k, centroids=None, cost=None):\n",
"    color = [\"r\", \"g\", \"b\", \"c\", \"m\", \"y\", \"k\"]\n",
"    fig = plt.figure(figsize=(16 * 0.7, 9 * 0.7))\n",
"    ax = fig.add_subplot(111)\n",
"    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)\n",
"\n",
"    # Before the first assignment step, draw all points in gray\n",
"    if not Y:\n",
"        ax.scatter(X[:, 0], X[:, 1], c=\"gray\", marker=\"o\", s=25, label=\"Data\")\n",
"\n",
"    X1 = [[x for x, y in zip(X[:, 0].tolist(), Y) if y == c] for c in range(k)]\n",
"    X2 = [[x for x, y in zip(X[:, 1].tolist(), Y) if y == c] for c in range(k)]\n",
"\n",
"    for c in range(k):\n",
"        ax.scatter(X1[c], X2[c], c=color[c], marker=\"o\", s=25, label=\"Data\")\n",
"        if centroids:\n",
"            ax.scatter(\n",
"                [centroids[c][0]],\n",
"                [centroids[c][1]],\n",
"                c=color[c],\n",
"                marker=\"+\",\n",
"                s=500,\n",
"                label=\"Centroid\",\n",
"            )\n",
"\n",
"    ax.set_xlabel(r\"$x_1$\")\n",
"    ax.set_ylabel(r\"$x_2$\")\n",
"    # Show the cost only when a numeric value was passed in\n",
"    if cost is not None:\n",
"        ax.annotate(\n",
"            f\"cost: {cost:.3f}\",\n",
"            xy=(1, 0),\n",
"            xycoords=\"axes fraction\",\n",
"            xytext=(-20, 20),\n",
"            textcoords=\"offset pixels\",\n",
"            horizontalalignment=\"right\",\n",
"            verticalalignment=\"bottom\",\n",
"            fontsize=12,\n",
"        )\n",
"    ax.margins(0.05, 0.05)\n",
"    return fig"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/pawel/.local/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.\n",
" return _methods._mean(a, axis=axis, dtype=dtype,\n",
"/home/pawel/.local/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide\n",
" ret = ret.dtype.type(ret / rcount)\n"
]
}
],
"source": [
"Y, history = k_means(X, 2)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1120x630 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"fig = plot_clusters(X, Y, 2, centroids=history[-1][0], cost=history[-1][2])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"# Set up the interactive plot\n",
"\n",
"MAXSTEPS = 15\n",
"\n",
"slider_k = widgets.IntSlider(\n",
"    min=1, max=7, step=1, value=2, description=r\"$k$\", width=300\n",
")\n",
"\n",
"\n",
"def interactive_kmeans_k(steps, history, k):\n",
"    if steps >= len(history) or steps == MAXSTEPS:\n",
"        steps = len(history) - 1\n",
"    fig = plot_clusters(\n",
"        X, history[steps][1], k, centroids=history[steps][0], cost=history[steps][2]\n",
"    )\n",
"\n",
"\n",
"def interactive_kmeans(k):\n",
"    slider_steps = widgets.IntSlider(\n",
"        min=0, max=MAXSTEPS, step=1, value=0, description=r\"steps\", width=300\n",
"    )\n",
"    _, history = k_means(X, k)\n",
"    widgets.interact(\n",
"        interactive_kmeans_k,\n",
"        steps=slider_steps,\n",
"        history=widgets.fixed(history),\n",
"        k=widgets.fixed(k),\n",
"    )"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2c45fd83bd37477eaa2555f33b878064",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"interactive(children=(IntSlider(value=2, description='$k$', max=7, min=1), Button(description='Run Interact', …"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<function __main__.interactive_kmeans(k)>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"widgets.interact_manual(interactive_kmeans, k=slider_k)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The $k$-means algorithm: input\n",
"\n",
"* $k$: the number of clusters\n",
"* training set $X = \\{ x^{(1)}, x^{(2)}, \\ldots, x^{(m)} \\}$, $x^{(i)} \\in \\mathbb{R}^n$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The input contains no set $Y$, because this is unsupervised learning!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The $k$-means algorithm: pseudocode\n",
"\n",
"1. Randomly initialize $k$ centroids (cluster centers of mass): $\\mu_1, \\ldots, \\mu_k$.\n",
"1. Repeat as long as the cluster assignments keep changing:\n",
"   1. For $i = 1$ to $m$:\n",
"      set $y^{(i)}$ to the index of the cluster whose centroid is closest to $x^{(i)}$.\n",
"   1. For $c = 1$ to $k$:\n",
"      set $\\mu_c$ to the mean of all points $x^{(i)}$ such that $y^{(i)} = c$."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# k-means algorithm\n",
"def kmeans(X, k, distance=euclidean_distance):\n",
"    Y = []\n",
"    centroids = [\n",
"        [random.uniform(X.min(axis=0)[f], X.max(axis=0)[f]) for f in range(X.shape[1])]\n",
"        for c in range(k)\n",
"    ]  # Draw random centroids\n",
"    while True:\n",
"        distances = [\n",
"            [distance(centroids[c], x) for c in range(k)] for x in X\n",
"        ]  # Compute the distances\n",
"        Y_new = [d.index(min(d)) for d in distances]\n",
"        if Y_new == Y:\n",
"            break  # Stop when nothing changes\n",
"        Y = Y_new\n",
"        Y_arr = np.array(Y)\n",
"        centroids = [\n",
"            list(X[Y_arr == c].mean(axis=0)) if np.any(Y_arr == c) else centroids[c]\n",
"            for c in range(k)\n",
"        ]  # Move the centroids (an empty cluster keeps its previous centroid)\n",
"    return Y"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* The number of clusters is fixed in advance and equals $k$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* If at some step of the algorithm no example is assigned to one of the clusters, that cluster is skipped; as a result, the algorithm may return fewer than $k$ clusters."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Cost function for the clustering problem\n",
"\n",
"$$ J \\left( y^{(1)}, \\ldots, y^{(m)}, \\mu_{1}, \\ldots, \\mu_{k} \\right) = \\frac{1}{m} \\sum_{i=1}^{m} || x^{(i)} - \\mu_{y^{(i)}} || ^2 $$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* Note that with every step of the $k$-means algorithm the cost decreases (or at worst stays the same)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Multiple initialization\n",
"\n",
"* The $k$-means algorithm always finds a local minimum of the cost function $J$, but this is not always the global minimum."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* To remedy this, we can run the $k$-means algorithm many times, each time with a different random placement of the centroids (so-called **multiple random initialization**). Each time we compute the cost $J$. We then pick the result with the lowest cost."
]
},
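{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The multiple-initialization procedure described above can be sketched as follows. This is only an illustrative sketch: the helper name `best_of_restarts` and the parameter `n_init` are assumptions made for this example, and the code relies on the `k_means` function and the data matrix `X` defined earlier in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"# Sketch of multiple random initialization (hypothetical helper, assumes k_means and X from above)\n",
"def best_of_restarts(X, k, n_init=10):\n",
"    best_Y, best_cost = None, np.inf\n",
"    for _ in range(n_init):\n",
"        Y, history = k_means(X, k)  # each call draws new random centroids\n",
"        cost = history[-1][2]  # cost J of the final assignment\n",
"        if cost < best_cost:\n",
"            best_Y, best_cost = Y, cost\n",
"    return best_Y, best_cost\n",
"\n",
"\n",
"Y_best, cost_best = best_of_restarts(X, 3)\n",
"print(f\"lowest cost over 10 restarts: {cost_best:.3f}\")"
]
},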
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Choosing the number of clusters $k$\n",
"\n",
"What should the number of groups $k$ be?\n",
"* It is best to choose $k$ by hand, depending on the shape of the data and the goal we want to achieve.\n",
"* We can also use the \"elbow method\"."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_66762/241557999.py:24: RuntimeWarning: Mean of empty slice.\n",
" centroids = [[Xc[c].mean(axis=0)[f] for f in range(X.shape[1])]\n",
"/home/pawel/.local/lib/python3.10/site-packages/numpy/core/_methods.py:121: RuntimeWarning: invalid value encountered in divide\n",
" ret = um.true_divide(\n"
]
}
],
"source": [
"# Prepare the plot\n",
"\n",
"ks = []\n",
"costs = []\n",
"for k in range(1, 10):\n",
"    min_cost = 100000.0\n",
"    best_Y = None\n",
"    for _ in range(10):  # multiple initialization\n",
"        Y, history = k_means(X, k)\n",
"        cost = history[-1][2]\n",
"        if cost < min_cost:\n",
"            best_Y = Y\n",
"            min_cost = cost\n",
"    ks.append(k)\n",
"    costs.append(min_cost)\n",
"\n",
"\n",
"def elbow_plot(ks, costs):\n",
"    fig = plt.figure(figsize=(16 * 0.7, 9 * 0.7))\n",
"    ax = fig.add_subplot(111)\n",
"    ax.set_xlabel(r\"$k$\")\n",
"    ax.set_ylabel(\"cost\")\n",
"    ax.plot(ks, costs, marker=\"o\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1120x630 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"elbow_plot(ks, costs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 7.2. Principal component analysis"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Principal component analysis is another example of an unsupervised learning problem.\n",
"\n",
"It attempts to reduce the number of dimensions of multidimensional data, i.e. to decrease the number of features when the examples have many features."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Dimensionality reduction\n",
"\n",
"Why would we want to reduce the number of dimensions?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* We want to get rid of redundant features, e.g. \"length in cm\" / \"length in inches\", \"length\" and \"width\" / \"area\"."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* We want to find a better combination of features."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* We want to speed up the algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"* We want to visualize the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Projection error\n",
"\n",
"**Projection error**: the mean squared error between the original data and the projected data."
]
},
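{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Written as a formula, the definition above could read as follows. This is a sketch under the assumption that $x^{(i)}_{\\mathrm{proj}}$, a symbol introduced here only for illustration, denotes the projection of $x^{(i)}$ onto the lower-dimensional subspace, expressed in the original $\\mathbb{R}^n$ coordinates:\n",
"\n",
"$$ \\frac{1}{m} \\sum_{i=1}^{m} \\left\\| x^{(i)} - x^{(i)}_{\\mathrm{proj}} \\right\\|^2 $$"
]
},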
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Problem statement\n",
"\n",
"**Principal component analysis** (PCA):\n",
"\n",
"Reduce the number of dimensions from $n$ to $k$, i.e. find $k$ vectors $u^{(1)}, u^{(2)}, \\ldots, u^{(k)}$ such that projecting the data onto the subspace spanned by these vectors minimizes the projection error."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"* **Note:** despite superficial similarities, principal component analysis is a completely different problem from linear regression!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### The PCA algorithm\n",
"\n",
"1. We are given a set of examples $x^{(1)}, x^{(2)}, \\ldots, x^{(m)} \\in \\mathbb{R}^n$.\n",
"1. We want to reduce the number of dimensions from $n$ to $k$ ($k < n$).\n",
"1. As preprocessing, we perform feature scaling and mean normalization.\n",
"1. We compute the covariance matrix:\n",
"   $$ \\Sigma = \\frac{1}{m} \\sum_{i=1}^{m} \\left( x^{(i)} \\right) \\left( x^{(i)} \\right)^T $$\n",
"1. We find the eigenvectors of the matrix $\\Sigma$ (SVD decomposition):\n",
"   $$ (U, S, V) := \\mathop{\\rm SVD}(\\Sigma) $$\n",
"1. The first $k$ columns of the matrix $U$ are the vectors we are looking for."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"\n",
"# PCA algorithm: implementation\n",
"def pca(X, k):\n",
"    X_std = StandardScaler().fit_transform(X)  # standardization\n",
"    mean_vec = np.mean(X_std, axis=0)\n",
"    cov_mat = np.cov(X_std.T)  # covariance matrix\n",
"    n = cov_mat.shape[0]\n",
"    eig_vals, eig_vecs = np.linalg.eig(cov_mat)  # eigenvalues and eigenvectors\n",
"    eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]\n",
"    # Sort by eigenvalue only, largest first (comparing whole tuples would fail on ties)\n",
"    eig_pairs.sort(key=lambda pair: pair[0], reverse=True)\n",
"    matrix_w = np.hstack([eig_pairs[i][1].reshape(n, 1) for i in range(k)])  # selection\n",
"    return X_std.dot(matrix_w)  # transformation"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" sl sw pl pw\n",
"0 5.2 3.4 1.4 0.2\n",
"1 5.1 3.7 1.5 0.4\n",
"2 6.7 3.1 5.6 2.4\n",
"3 6.5 3.2 5.1 2.0\n",
"4 4.9 2.5 4.5 1.7\n",
".. ... ... ... ...\n",
"145 5.8 4.0 1.2 0.2\n",
"146 5.1 3.5 1.4 0.3\n",
"147 6.5 3.0 5.8 2.2\n",
"148 6.9 3.1 5.4 2.1\n",
"149 6.7 3.3 5.7 2.1\n",
"\n",
"[150 rows x 4 columns]\n"
]
}
],
"source": [
"data_iris_no_labels = data_iris_raw[[\"sl\", \"sw\", \"pl\", \"pw\"]]\n",
"print(data_iris_no_labels)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/pawel/.local/lib/python3.10/site-packages/seaborn/axisgrid.py:2100: UserWarning: The `size` parameter has been renamed to `height`; please update your code.\n",
" warnings.warn(msg, UserWarning)\n"
]
},
{
"data": {
"text/plain": [
"<seaborn.axisgrid.PairGrid at 0x7f591cd3a710>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABAsAAAJRCAYAAAAj/z4cAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/H5lhTAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOyde5xT1bn3f7lOMpdkhgQQKiMjQbkNMoiizIC3Y1srWhD1LeopMFjPUSlteU8PgoJXQGy1PV7PsQX09FU8PUWo0ru1KjP1zrRcBMsoOrSMjBlnksl15/b+MeyQy76sJDvJTvJ8Px8+HyZ7Z+21117PWk/Wfp7f0sRisRgIgiAIgiAIgiAIgiBOoi12BQiCIAiCIAiCIAiCUBe0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBK0WEAQBEEQBEEQBEEQRBJFXSyIRCJYt24dmpqaYDabMWHCBNx///2IxWJM34/FYnC73cznEwRRGpBtE0T5QvZNEOUJ2TZBlB/6Yl588+bNeOqpp/Dss89i6tSpeO+997Bs2TJYrVasXLlS9vtDQ0OwWq1wuVywWCwFqDFBEIWAbJsgyheyb4IoT8i2CaL8KOpiwZ///Gd8/etfx5VXXgkAGD9+PLZv34533nmnmNUiCIIgCIIgCIIgiIqmqIsFc+bMwdNPP42//e1vOOuss/DXv/4VHR0deOSRRwTPDwaDCAaD8b/dbnehqkoQRB4h2yZKEZePg9PDwR0IwWI2wF5jhLXaWOxqqQ6yb4LIDbWONWTbRLmiVpsrBkVdLLjjjjvgdrsxadIk6HQ6RCIRbNiwATfeeKPg+Zs2bcK9995b4FoSBJFvyLaJUuP4oB+rd+zDniPO+GfzJtrx4KLpGFtvLmLN1AfZN0Fkj5rHGrJtohxRs80VA02siCokL7zwAr7//e/jBz/4AaZOnYq//OUv+O53v4tHHnkES5YsSTtfaAVz3LhxlBtFECUO2TZRSrh8HFZs70pyJHjmTbTjscUtFfsGQgiyb4LIDrWPNWTbRLmhdpsrBkWNLPj+97+PO+64A9/4xjcAAM3Nzfj000+xadMmwcWCqqoqVFVVFbqaBEHkGbJtopRwejhBRwIA3jjihNPDVZwzIQXZN0Fkh9rHGrJtotxQu80Vg6Junejz+aDVJldBp9MhGo0WqUYEQRAEIY07EJI8PiRznCAIggUaawiisJDNpVPUyIKrrroKGzZsQGNjI6ZOnYquri488sgjaG9vL2a1CIIgCEIUi8kgebxO5jhBEAQLNNYQRGEhm0unqIsFjz32GNatW4fbbrsNfX19GDt2LP7lX/4F69evL2a1CIIgiooSKrxKKfmSInA69lojLp88CmePsaBlXD2C4ShMBh329gzgw1437LWV3T4EQcjDMrbaa42YN9GON0Typ4XGmhPuAAa8HNyBMCxmPRqqjRhtMTFfkyAqmWxsTgoWm5OyWTVQVIHDXHG73bBarSSkQhBlRiXbthIqvEop+ZIisDg9/V6s2bkfnd398c/aHDZsXNiMRltNEWumfirZvgkCyGxsPT7oxx079iX9eJk30Y7Ni6ZjTMq5YuPSpoXN0Om0eR/PybaJcqCn34u1O/ejI8f5ncXOS8GXoMUCgiBUR6XathIqvEop+ZIisDjUNrlRqfZNEEB24wf/dnIoEEKdyQB7rfDbyVU//0vSjw6eTQun4df7P8Oe7vyOWWTbRKnj8nH4v//7V0xKiBys0mvRdWwQH/a68cPrzlHMhwqEo6I22+aw4eHrZ6giwqCoaQgEQRDEKZRQ4VVKydfp4fD+pwNYcakjLdR+a8fRilQE5iG1ZIIgsiWb8cNaLZ4uwC8kfOHlsL
ztTLQ0NmBrx1H4uEj8nFEWk+BCgdQ1CaIScXo4vHKoD68c6hM9ztuKVIoBi52HIlHBhQIA6Ojux4CXo8UCgiAI4hRKqPAqpeTrCYbw6OIWbOs8isdf7Y5/3uqw4dHFLfAGK08RmIfUkgmCyBYlxw+hMGd+jF65vSu+YBAMS+8yRmMWQQzDap9yKQYs5XAR6eB+dyDMWOv8UtStEwmCIIhTKKHCq5SSb73ZiG2dR9NWvTu7+7Gt8yis5sp9C0VqyQRBZItS44fLx6X9WAFOjdHtbU3xz6r00u4+jVkEMQyLfYrZ3htHnLhjxz64fBxTORaT9Dt7ueOFghYLCIIgVAKvwisEqwqvEmUAACcRHtfZ3Q8uIv2mqpxRqo0Jgqg8lBo/pMKcO7v70TKuPv53nzuAuTRmEYQsLPbJkmLAUk5DjRFtDpvgOW0OGxpq1GGXtFhAEAShEqzVRjy4aHraBMMrX7PklCpRBgB4gtLhb16Z4+WMUm1MEETlodT4IRfmzKcetDlsaHMMl01jFkFIw2KfLCkGLOWMtpiwcWFz2oIBvxuCGvQKANoNgSAIFVLpts2ifJ3vMj7q8+CyR14XPf7HVRdhwqjajOpUbijxnCqRSrdvggDyP0b/amUbdBoNGmpO7dme7zGLbJsoF6RsJRP/iHUnkwEvB3cgDItJn2SzakAdyRAEQRBEHCnl60yJAYAm8+/xIXRviGz7Q2Gryj4ngiAqk3yN0afXmzPaVYEgKgWpXQx4pGwlE/+IxeZGW0yqWhxIhRYLCIIgygw5lV4W+BC6O3bsS5oQKWyVIAgiN2iMJojiQLaXOZSGQBCE6iDbzh6Xj8OK7V2C4jvzJtrx2OKWjCYyCrUnlIbsm6hkynmMJtsm1Ew5214+ocgCgiCIMoJFpZefzHINxSMIgiAyQ26M7vdy8fOkxmYeGqMJ4hRSfk0m/hELlWJ7tFhAEARRRrCo9ALKhOIRBEEQmSE1RlcbdYgBaW8/aWwmCHnk/BpW/4hIhrZOJAiCKCMsJoPk8TqTAS4flzahAsMr63fs2AeXj8tnFQmCICoWqTG6va0J9/zyAI3NBJEhLH4Ni39EpEORBQRBEGUEi0qv0qkKLChVjlKorT4EQZQ+cuOKy8chGoth69JZsJqNMOg06HUFYNBpsbdnAK0TbHj81W7BsrMJkyaIUiHXOdnp4fD+pwNYcakDLePqEQxHYTLosLdnAFs7jsLp4cp6l6d8+jS0WEAQBFFGsKj0fuz0SpahdKqC2lIe1FYfgiBKH7lxReh4q8OGZa1NWPH8XsxsrMfFZ42UvAaFSRPliBJzsicYwqOLW7Ct82jSglurw4ZHF7fAGwxhwqjastzFIN8+De2GQBCE6iDbzh0pld6P+jy47JHXRb/7x1UXwV5rVEQ1WGn14VxRW30qEbJvotyQG1d+cN05+Lf//avg8VaHDS2NDXj81W48f/Ns3PDTt0Wv88dVF2HCqFpF664kZNtEpig1J3/q9GLtrv3o7O5PO9bqsGHjgmacYa+JX7NcdjEohE9DmgUEQRBliLXaiAmjajGjsQETRtUmTRZ8KJ4QmaQqsKBUOUqhtvoQBFH6yI0rA17x453d/WgZVw8A+PPH/ZgrMzYTRDmh1JzMRaKCCwXAsI1xkWj8byn/qNQohE9DaQgEQRAVBmuqQrVRh/a2JsH8P9ZwWLWpD6utPgRBlD5C40ri+OnyS48rwfDwD5mtHUfx8rfbcO9LB8sqTJogxFBqTvYEw5LHvTLHs0EN2keF8GlosYAgCKICGVtvxmOLW0RD8axmg2T+n8XMphqsNvVhtdWHIIjSJ3VcqTbqksbPLUtmSX6/Sj8c6OvjItAAkmMzQZQTSs3JhZ7b1aJ9VIj7pjQEgiCICkUqFK+mSo9tnUfTwvo6u/vxTOdR1FSxrTWzpDwUErXVhyCI0id1XGlva0oaP7uODaLVYRP8bqvDhq5jgwCGxyDbybeT5RImTRBSKDUnF3JuV9P204
W476IuFowfPx4ajSbt3+23317MahEEQVQ8nkBYNP+vo7sfngBbSB+f8pA6mRUrrFZt9SEIgh2Xj8NHfR509Qzgo88
"text/plain": [
"<Figure size 1050x600 with 20 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn\n",
"\n",
"# Oryginalne dane (4 cechy 4 wymiary)\n",
"seaborn.pairplot(\n",
" data_iris_no_labels, vars=data_iris_no_labels.columns, size=1.5, aspect=1.75\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA8YAAAI3CAYAAABZKELJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/H5lhTAAAACXBIWXMAAA9hAAAPYQGoP6dpAABZiklEQVR4nO3df3Aj533f8c8u1pLgQGB6rUT6pHMsOZ10XMWSLNuKjCP1O5FsRxBwcVs300iu6wlFKInISxrJcappxx51RjZZ/xBATye1PGlspYaHouE0rn5ZMkDHjm31UsmNPJFjVfKdQCtSTQh36MkAtn8opI+/wF1gF7uLfb9m+MfxluBDEsDuZ5/n+X4N27ZtAQAAAAAQU2bQAwAAAAAAIEgEYwAAAABArBGMAQAAAACxRjAGAAAAAMQawRgAAAAAEGsEYwAAAABArBGMAQAAAACxZgU9gGHqdrs6duyYzjzzTBmGEfRwAAAAAAA+sW1bL7/8svbv3y/T7D0nHKtgfOzYMR04cCDoYQAAAAAAhuS5557Tueee2/OYWAXjM888U9Krv5h0Oh3waAAAAAAAfmk0Gjpw4MBGDuwlVsF4ffl0Op0mGAMAAABADDjZRkvxLQAAAABArBGMAQAAAACxRjAGAAAAAMQawRgAAAAAEGsEYwAAAABArBGMAQAAAACxRjAGAAAAAMQawRgAAAAAEGsEYwAAAABArBGMAQAAAACxRjAGAAAAAMQawRgAAAAAEGsEYwAAAABArBGMAQAAAACxRjAGAGCLVqul1dVVtVqtoIcCAACGgGAMAMDfq9VqyufzSqVSmpiYUCqVUj6f18rKStBDAwAAPiIYAwAgqVQqaWpqSpVKRd1uV5LU7XZVqVQ0OTmpxcXFgEcIAAD8QjAGAMRerVZToVCQbdtqt9ub/q/dbsu2bc3MzDBzDADAiCIYAwBib35+XolEoucxiURCCwsLQxoRAAAYJsO2bTvoQQxLo9HQ2NiY1tbWlE6ngx4OACAEWq2WUqnUxvLpXkzTVLPZVDKZHMLIAADAINzkP2aMAQCx1mg0HIVi6dU9x41Gw+cRAQCAYSMYAwBiLZ1OyzSdnQ5N02TFEQAAI4hgDACItWQyqWw2K8uyeh5nWZZyuRzLqAEAGEEEYwBA7M3NzanT6fQ8ptPpaHZ2dkgjAgAAw0QwBgDE3sGDB1UsFmUYxraZY8uyZBiGisWiMplMQCMEAAB+IhgDACBpenpa1WpV2Wx2Y8+xaZrKZrOqVquanp4OeIQAAMAvtGsCAGCLVqulRqOhdDrNnmIAACLKTf7rXWkEAIAYSiaTBGIAAGKEpdQAAAAAgFgjGAMAAAAAYo1gDAAAAACINYIxAAAAACDWCMYAAAAAgFgjGAMAAAAAYo1gDAAAAACINYIxAAAAACDWCMYAAAAAgFgjGAMAAAAAYo1gDAAAAACINYIxAAAAACDWCMYAAAAAgFgjGAMAAAAAYo1gDAxZq9XS6uqqWq1W0EMBAAAAIIIxMDS1Wk35fF6pVEoTExNKpVLK5/NaWVkJemgAAABArBGMgSEolUqamppSpVJRt9uVJHW7XVUqFU1OTmpxcTHgEQIAAADxRTAGfFar1VQoFGTbttrt9qb/a7fbsm1bMzMzzBwDAAAAASEYAz6bn59XIpHoeUwikdDCwsKQRgQAAADgVIZt23bQgxiWRqOhsbExra2tKZ1OBz0cxECr1VIqldpYPt2LaZpqNptKJpNDGBkAAAAw2tzkP2aMAR81Gg1HoVh6dc9xo9HweUQAAAAAtiIYAz5Kp9MyTWcvM9M0WckAAAAABIBgDPgomUwqm83Ksqyex1mWpVwuxzJqAAAAIAAEY8Bnc3Nz6nQ6PY/pdDqanZ0d0ogAAAAAnIpgDPjs4MGDKhaLMgxj28yxZVkyDEPFYlGZTCagEQIAAADxRjAGhmB6elrValXZbHZjz7Fpmspms6pWq5qeng54hAAAAEB80a
4JGLJWq6VGo6F0Os2eYgAAAMAnbvJf74pAADyXTCYJxAAAAECIsJQaAAAAABBrBGMAAAAAQKwRjAEAAAAAsUYwBgAAAADEGsEYAAAAABBrBGMAAAAAQKwRjAEAAAAAsUYwBgAAAADEGsEYAAAAABBrBGMAAAAAQKwRjAEAAAAAsUYwBgAAAADEGsEYAAAAABBrBGMAQGi0Wi2trq6q1WoFPRQAABAjBGMAQOBqtZry+bxSqZQmJiaUSqWUz+e1srIS9NAAAEAMEIwBAIEqlUqamppSpVJRt9uVJHW7XVUqFU1OTmpxcTHgEQIAgFEXqWB811136W1ve5vOPPNMnX322brxxhv1ve99L+hhAQD6VKvVVCgUZNu22u32pv9rt9uybVszMzPMHAMAAF9FKhg/9thjKhQK+sY3vqEHH3xQP/nJT/TLv/zLOn78eNBDAwD0YX5+XolEoucxiURCCwsLQxoRAACII8O2bTvoQfTrhRde0Nlnn63HHntMU1NTex7faDQ0NjamtbU1pdPpIYwQALCbVqulVCq1sXy6F9M01Ww2lUwmhzAyAAAwCtzkP2tIY/LF2tqaJGnfvn07/v/Jkyd18uTJjX83Go2hjAsAsLdGo+EoFEuv7jluNBoEYwAA4ItILaU+Vbfb1W233aZMJqMLLrhgx2PuuusujY2NbXwcOHBgyKMEAOwmnU7LNJ2dhkzTZKUPAADwTWSDcaFQ0JNPPqn77rtv12PuuOMOra2tbXw899xzQxwhAKCXZDKpbDYry+q9eMmyLOVyOWaLAQCAbyIZjG+99VZ9+ctf1le/+lWde+65ux53+umnK51Ob/oAAITH3NycOp1Oz2M6nY5mZ2eHNCIAABBHkQrGtm3r1ltv1dLSkh555BGdd955QQ8JADCAgwcPqlgsyjCMbTPHlmXJMAwVi0VlMpmARggAAOIgUsG4UCjov/7X/6rPfe5zOvPMM1Wv11Wv19VqtYIeGgCgT9PT06pWq8pmsxt7jk3TVDabVbVa1fT0dMAjBAAAoy5S7ZoMw9jx85/5zGd088037/n1tGsCgHBrtVpqNBpKp9PsKQYAAAMZ2XZNEcrwAIA+JJNJAjEAABi6SC2lBgCEW6vV0urqKltcAABApBCMAQADq9VqyufzSqVSmpiYUCqVUj6f18rKStBDAwAA2BPBGAAwkFKppKmpKVUqFXW7XUlSt9tVpVLR5OSkFhcXAx4hAABAbwRjAEDfarWaCoWCbNtWu93e9H/tdlu2bWtmZoaZYwAAEGoEYwBA3+bn55VIJHoek0gktLCwMKQRAQAAuBepdk2Dol0TAHin1WoplUptLJ/uxTRNNZtNKk4DAIChcZP/mDEGAPSl0Wg4CsXSq3uOG42GzyMCAADoD8EYANCXdDot03R2GjFNk5U62Ib2XgCAsCAYAwD6kkwmlc1mZVlWz+Msy1Iul2MZ9QjwKsjS3gsAEDYEYwBA3+bm5tTpdHoe0+l0NDs7O6QRwQ9eBlnaewEAwohgDADo28GDB1UsFmUYxraZY8uyZBiGisWiMplMQCPEoLwMsrT3AgCEFcEYADCQ6elpVatVZbPZjT3Hpmkqm82qWq1qeno64BGiX14HWdp7hQN7uwFgO9o1AQA802q11Gg0lE6n2VM8AvL5vCqVyrZQfCrLspTNZlUul3s+Fu29gler1TQ/P6/l5WV1u92NG1iHDx9mVQeAkeQm/xGMAQDANl4H2dXVVU1MTDj+/vV6XePj446PR2+lUkmFQkGJRGLTjQ7LstTpdFQsFlndAWDk0McYAAAMxOs+1bT3Cg57uwFgbwRjAACwjddBlvZewWFvNwDsjWAMAAC28SPI0t5r+FqtlpaXl3vuE5denTleWlqiIBeA2CIYAwCAHXkdZGnvNXxeL4kHgFFFMAYAADvyI8jS3mu4XvOa18gwDEfHsrcbQJwRjAEAwK78CLKZTEblclnNZlP1el3NZlPlcpmZYg/VajXl83mdddZZct
KAhL3dAOKOdk0AAMAR+lRHw26tmXoxDEPVapWbEwBGipv817uiBgAAwN9LJpME4pDr1ZppJ6f2MSYUA4gzllIDAAC
2022-12-06 09:55:40 +01:00
"text/plain": [
"<Figure size 1120x630 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"transformed_data = pca(data_iris_no_labels, 2) # dane przekształcone za pomocą PCA\n",
"fig = plot_unlabeled_data(transformed_data[:, 0], transformed_data[:, 1])"
2022-12-06 09:55:40 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Analiza głównych składowych umożliwiła stworzenie powyższego wykresu, który wizualizuje 4-wymiarowe dane ze zbioru *iris* na 2-wymiarowej płaszczyźnie.\n",
"\n",
"Współrzędne $x_1$ i $x_2$, stanowiące osi wykresu, zostały uzyskane w wyniku działania algorytmu PCA (**nie** są to żadne z oryginalnych cech ze zbioru *iris* ani długość płatka, ani szerokość płatka itp. tylko nowo utworzone cechy)."
2022-12-06 09:55:40 +01:00
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Tutaj można zobaczyć, jak algorytmy redukcji wymiarów (w tym PCA) działają w praktyce:\n",
" * https://projector.tensorflow.org\n",
" * https://biit.cs.ut.ee/clustvis\n",
" * https://labriata.github.io/jsinscience/pca/index.html"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"livereveal": {
"start_slideshow_at": "selected",
"theme": "white"
}
},
"nbformat": 4,
"nbformat_minor": 4
}