mat-proj/bayes_final.ipynb

2024-06-12 15:03:12 +02:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Authors\n",
"\n",
"Second-cycle (master's level) studies, Computer Science (part-time), year 1\n",
"\n",
"- Krzysztof Bojakowski\n",
"- Adam Stelmaszyk\n",
"- Patryk Osiński\n",
"- Marcin Jakubik"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Task\n",
"\n",
"Classification with the naive Bayes method (continuous distributions). The implementation should assume that the features are continuous (with a choice between the normal distribution and kernel smoothing). The input is a dataset containing p continuous features, a vector of labels, and a vector of prior class probabilities. The output is the predicted labels and the posterior probabilities. An optional extra is a visualization of the decision regions in the case of two features.\n"
]
},
{
"attachments": {
"image.png": {
"image/png": "(base64-encoded PNG image data omitted)"
}
},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset\n",
"\n",
"To present the results we used data from the [UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). The dataset contains 30 features, on the basis of which a tumor is classified as malignant or benign.\n",
"\n",
"![image.png](attachment:image.png)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mean radius</th>\n",
" <th>mean texture</th>\n",
" <th>mean perimeter</th>\n",
" <th>mean area</th>\n",
" <th>mean smoothness</th>\n",
" <th>mean compactness</th>\n",
" <th>mean concavity</th>\n",
" <th>mean concave points</th>\n",
" <th>mean symmetry</th>\n",
" <th>mean fractal dimension</th>\n",
" <th>...</th>\n",
" <th>worst texture</th>\n",
" <th>worst perimeter</th>\n",
" <th>worst area</th>\n",
" <th>worst smoothness</th>\n",
" <th>worst compactness</th>\n",
" <th>worst concavity</th>\n",
" <th>worst concave points</th>\n",
" <th>worst symmetry</th>\n",
" <th>worst fractal dimension</th>\n",
" <th>target</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>17.99</td>\n",
" <td>10.38</td>\n",
" <td>122.80</td>\n",
" <td>1001.0</td>\n",
" <td>0.11840</td>\n",
" <td>0.27760</td>\n",
" <td>0.3001</td>\n",
" <td>0.14710</td>\n",
" <td>0.2419</td>\n",
" <td>0.07871</td>\n",
" <td>...</td>\n",
" <td>17.33</td>\n",
" <td>184.60</td>\n",
" <td>2019.0</td>\n",
" <td>0.1622</td>\n",
" <td>0.6656</td>\n",
" <td>0.7119</td>\n",
" <td>0.2654</td>\n",
" <td>0.4601</td>\n",
" <td>0.11890</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>20.57</td>\n",
" <td>17.77</td>\n",
" <td>132.90</td>\n",
" <td>1326.0</td>\n",
" <td>0.08474</td>\n",
" <td>0.07864</td>\n",
" <td>0.0869</td>\n",
" <td>0.07017</td>\n",
" <td>0.1812</td>\n",
" <td>0.05667</td>\n",
" <td>...</td>\n",
" <td>23.41</td>\n",
" <td>158.80</td>\n",
" <td>1956.0</td>\n",
" <td>0.1238</td>\n",
" <td>0.1866</td>\n",
" <td>0.2416</td>\n",
" <td>0.1860</td>\n",
" <td>0.2750</td>\n",
" <td>0.08902</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>19.69</td>\n",
" <td>21.25</td>\n",
" <td>130.00</td>\n",
" <td>1203.0</td>\n",
" <td>0.10960</td>\n",
" <td>0.15990</td>\n",
" <td>0.1974</td>\n",
" <td>0.12790</td>\n",
" <td>0.2069</td>\n",
" <td>0.05999</td>\n",
" <td>...</td>\n",
" <td>25.53</td>\n",
" <td>152.50</td>\n",
" <td>1709.0</td>\n",
" <td>0.1444</td>\n",
" <td>0.4245</td>\n",
" <td>0.4504</td>\n",
" <td>0.2430</td>\n",
" <td>0.3613</td>\n",
" <td>0.08758</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>11.42</td>\n",
" <td>20.38</td>\n",
" <td>77.58</td>\n",
" <td>386.1</td>\n",
" <td>0.14250</td>\n",
" <td>0.28390</td>\n",
" <td>0.2414</td>\n",
" <td>0.10520</td>\n",
" <td>0.2597</td>\n",
" <td>0.09744</td>\n",
" <td>...</td>\n",
" <td>26.50</td>\n",
" <td>98.87</td>\n",
" <td>567.7</td>\n",
" <td>0.2098</td>\n",
" <td>0.8663</td>\n",
" <td>0.6869</td>\n",
" <td>0.2575</td>\n",
" <td>0.6638</td>\n",
" <td>0.17300</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>20.29</td>\n",
" <td>14.34</td>\n",
" <td>135.10</td>\n",
" <td>1297.0</td>\n",
" <td>0.10030</td>\n",
" <td>0.13280</td>\n",
" <td>0.1980</td>\n",
" <td>0.10430</td>\n",
" <td>0.1809</td>\n",
" <td>0.05883</td>\n",
" <td>...</td>\n",
" <td>16.67</td>\n",
" <td>152.20</td>\n",
" <td>1575.0</td>\n",
" <td>0.1374</td>\n",
" <td>0.2050</td>\n",
" <td>0.4000</td>\n",
" <td>0.1625</td>\n",
" <td>0.2364</td>\n",
" <td>0.07678</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 31 columns</p>\n",
"</div>"
],
"text/plain": [
"   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \\\n",
"0        17.99         10.38          122.80     1001.0          0.11840   \n",
"1        20.57         17.77          132.90     1326.0          0.08474   \n",
"2        19.69         21.25          130.00     1203.0          0.10960   \n",
"3        11.42         20.38           77.58      386.1          0.14250   \n",
"4        20.29         14.34          135.10     1297.0          0.10030   \n",
"\n",
"   mean compactness  mean concavity  mean concave points  mean symmetry  \\\n",
"0           0.27760          0.3001              0.14710         0.2419   \n",
"1           0.07864          0.0869              0.07017         0.1812   \n",
"2           0.15990          0.1974              0.12790         0.2069   \n",
"3           0.28390          0.2414              0.10520         0.2597   \n",
"4           0.13280          0.1980              0.10430         0.1809   \n",
"\n",
"   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \\\n",
"0                 0.07871  ...          17.33           184.60      2019.0   \n",
"1                 0.05667  ...          23.41           158.80      1956.0   \n",
"2                 0.05999  ...          25.53           152.50      1709.0   \n",
"3                 0.09744  ...          26.50            98.87       567.7   \n",
"4                 0.05883  ...          16.67           152.20      1575.0   \n",
"\n",
"   worst smoothness  worst compactness  worst concavity  worst concave points  \\\n",
"0            0.1622             0.6656           0.7119                0.2654   \n",
"1            0.1238             0.1866           0.2416                0.1860   \n",
"2            0.1444             0.4245           0.4504                0.2430   \n",
"3            0.2098             0.8663           0.6869                0.2575   \n",
"4            0.1374             0.2050           0.4000                0.1625   \n",
"\n",
"   worst symmetry  worst fractal dimension  target  \n",
"0          0.4601                  0.11890     0.0  \n",
"1          0.2750                  0.08902     0.0  \n",
"2          0.3613                  0.08758     0.0  \n",
"3          0.6638                  0.17300     0.0  \n",
"4          0.2364                  0.07678     0.0  \n",
"\n",
"[5 rows x 31 columns]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import load_breast_cancer\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"breast_cancer = load_breast_cancer()\n",
"\n",
"# Sanity-check the array shapes\n",
"assert len(breast_cancer.data[0]) == len(breast_cancer.feature_names)\n",
"assert len(breast_cancer.target) == len(breast_cancer.data)\n",
"\n",
"values = np.c_[breast_cancer.data, breast_cancer.target]\n",
"columns = np.array(list(breast_cancer.feature_names) + ['target'])\n",
"\n",
"df = pd.DataFrame(values, columns=columns)\n",
"\n",
"X = df.iloc[:, 0:-1]\n",
"y = df.iloc[:,-1]\n",
"\n",
"df.head()"
]
},
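{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plots and printed predictions later in this notebook treat class 0 as malignant and class 1 as benign. A quick way to confirm that encoding is to inspect the metadata returned by `load_breast_cancer`. The cell below is an illustrative check added for clarity; it relies only on the `breast_cancer` and `df` objects created above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check (not part of the original analysis): map each numeric\n",
"# target value to the class name reported by scikit-learn.\n",
"for value, name in enumerate(breast_cancer.target_names):\n",
"    print(value, name)\n",
"\n",
"# Number of samples per class, ordered by target value\n",
"print(df['target'].value_counts().sort_index())"
]
},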
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Implementation of the naive Bayes method\n",
"\n",
"The implementation comes in two variants:\n",
"- Normal density: the Gaussian (normal) likelihood is evaluated directly\n",
"- Log of the normal density: the same Gaussian likelihood is evaluated in log space (a log-likelihood, not a log-normal distribution)\n",
"\n",
"# Mathematical description of the algorithm\n",
"\n",
"## Bayes' theorem\n",
"\n",
"Let ${X}$ be an observation with ${n}$ features $$ X = (x_1, x_2, \\ldots, x_n)$$\n",
"The posterior probability ${P(C_k | X)}$ of class ${C_k}$ is given by: $$P(C_k | X) = \\frac{P(X | C_k) P(C_k)}{P(X)}$$\n",
"\n",
"where:\n",
"$$\n",
"\\begin{aligned}\n",
"& \\text{(1)}\\hspace{1cm}P(C_k | X)\\text{ is the posterior probability of class } C_k \\text{ given the observation } X \\\\\n",
"& \\text{(2)}\\hspace{1cm}P(X | C_k) \\text{ is the likelihood of observing } X \\text{ given class } C_k \\\\\n",
"& \\text{(3)}\\hspace{1cm}P(C_k) \\text{ is the prior probability of class } C_k \\\\\n",
"& \\text{(4)}\\hspace{1cm}P(X) \\text{ is the total probability of observing } X \\text{ across all classes}\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"\n",
"### Formulas for the mean and standard deviation\n",
"\n",
"#### Mean\n",
"\n",
"The mean ${\\mu}$ of a dataset ${X}$ consisting of ${n}$ observations is computed as:\n",
"\n",
"$$\n",
"\\mu = \\frac{1}{n} \\sum_{i=1}^{n} x_i\n",
"$$\n",
"\n",
"where:\n",
"- ${\\mu}$ is the mean\n",
"- ${n}$ is the number of observations in the dataset\n",
"- ${x_i}$ is a single observation in the dataset\n",
"\n",
"#### Standard deviation\n",
"\n",
"The standard deviation ${ \\sigma }$ of a dataset ${ X }$ consisting of ${ n }$ observations is computed as the square root of the variance:\n",
"\n",
"$$\n",
"\\sigma = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} (x_i - \\mu)^2}\n",
"$$\n",
"\n",
"where:\n",
"- ${\\sigma}$ is the standard deviation\n",
"- ${\\mu}$ is the mean of the dataset\n",
"- ${n}$ is the number of observations in the dataset\n",
"- ${x_i}$ is a single observation in the dataset\n",
"\n",
"\n",
"### Likelihood of an observation: normal distribution\n",
"$$ \n",
"\\text{Likelihood}(x | \\mu, \\sigma) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{(x - \\mu)^2}{2\\sigma^2}\\right) \n",
"$$\n",
"\n",
"$$\n",
"\\text{Posterior} \\propto \\text{Likelihood} \\times \\text{Prior}\n",
"$$\n",
"\n",
"### Log-likelihood of an observation: log of the normal density\n",
"\n",
"$$\n",
"\\text{Log-Likelihood}(x | \\mu, \\sigma) = -\\frac{1}{2} \\log(2\\pi\\sigma^2) - \\frac{(x - \\mu)^2}{2\\sigma^2} \n",
"$$\n",
"\n",
"$$\n",
"\\text{Log-Posterior} = \\text{Log-Likelihood} + \\text{Log-Prior} - \\log P(X)\n",
"$$"
]
},
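{
"cell_type": "markdown",
"metadata": {},
"source": [
"The density formulas above are written for a single feature. As a sketch of how they combine, assume (as naive Bayes does) that the features are conditionally independent given the class. The class-conditional likelihood then factorizes over the $n$ features. Here $\\mu_{k,i}$ and $\\sigma_{k,i}$ denote the mean and standard deviation of feature $i$ within class $C_k$; this indexing is introduced for illustration only.\n",
"\n",
"$$\n",
"P(X | C_k) = \\prod_{i=1}^{n} \\frac{1}{\\sqrt{2\\pi\\sigma_{k,i}^2}} \\exp\\left(-\\frac{(x_i - \\mu_{k,i})^2}{2\\sigma_{k,i}^2}\\right), \\qquad P(X) = \\sum_{j} P(X | C_j) P(C_j)\n",
"$$\n",
"\n",
"In log space the product becomes a sum. Writing $s_k$ for the unnormalized log-posterior of class $C_k$ and $m = \\max_j s_j$, the normalization can be carried out without underflow (the log-sum-exp trick):\n",
"\n",
"$$\n",
"s_k = \\log P(C_k) + \\sum_{i=1}^{n} \\log P(x_i | C_k), \\qquad P(C_k | X) = \\frac{\\exp(s_k - m)}{\\sum_{j} \\exp(s_j - m)}\n",
"$$\n",
"\n",
"Subtracting $m$ does not change the ratio, because the factor $\\exp(-m)$ cancels between numerator and denominator, but it guarantees that the largest exponent is exactly $0$."
]
},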
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from scipy.stats import norm\n",
"\n",
"class GaussianNaiveBayes:\n",
"    def __init__(self, priors=None, var_smoothing=1e-8, log=False):\n",
"        self.priors = priors\n",
"        self.classes = None\n",
"        self.means = None\n",
"        self.stds = None\n",
"        self.log = log\n",
"        self.var_smoothing = var_smoothing\n",
"\n",
"    def fit(self, X, y):\n",
"        # Smoothing term scaled by the largest feature variance. It is added to the\n",
"        # standard deviations below (scikit-learn adds its term to the variances).\n",
"        self.epsilon = self.var_smoothing * np.var(X, axis=0).max()\n",
"\n",
"        self.classes = np.unique(y)\n",
"        self.means = np.zeros((len(self.classes), X.shape[1]))\n",
"        self.stds = np.zeros((len(self.classes), X.shape[1]))\n",
"\n",
"        # Per-class mean and standard deviation of every feature\n",
"        for idx, cls in enumerate(self.classes):\n",
"            X_c = X[y == cls]\n",
"            self.means[idx, :] = X_c.mean(axis=0)\n",
"            self.stds[idx, :] = X_c.std(axis=0)\n",
"\n",
"        if self.priors is None:\n",
"            # Empirical class frequencies. np.unique also accepts float-valued labels,\n",
"            # which np.bincount rejects.\n",
"            _, counts = np.unique(y, return_counts=True)\n",
"            self.priors = counts / len(y)\n",
"\n",
"        self.stds += self.epsilon\n",
"\n",
"    def predict(self, X):\n",
"        # Returns the index of the most probable class; this equals the label\n",
"        # when the classes are encoded as 0, 1, ..., K-1.\n",
"        posteriors = self.predict_proba(X)\n",
"        return np.argmax(posteriors, axis=1)\n",
"\n",
"    def score(self, X, y_true):\n",
"        posteriors = self.predict_proba(X)\n",
"        y_pred = np.argmax(posteriors, axis=1)\n",
"\n",
"        correct_predictions = np.sum(y_true == y_pred)\n",
"        accuracy = correct_predictions / len(y_true)\n",
"        return accuracy\n",
"\n",
"    def score_visualize(self, X, y_true):\n",
"        \"\"\"\n",
"        Draws two bar charts showing the actual and the predicted class counts.\n",
"        \"\"\"\n",
"        posteriors = self.predict_proba(X)\n",
"        y_pred = np.argmax(posteriors, axis=1)\n",
"\n",
"        # Count the occurrences of each class among the actual and the predicted values.\n",
"        # Labels are cast to int because np.bincount rejects float input.\n",
"        true_counts = np.bincount(np.asarray(y_true).astype(int), minlength=2)\n",
"        pred_counts = np.bincount(y_pred, minlength=2)\n",
"\n",
"        # Create the subplots: two charts side by side\n",
"        fig, axes = plt.subplots(1, 2, figsize=(12, 6))\n",
"\n",
"        # First chart: actual values\n",
"        axes[0].bar([0, 1], true_counts, color=['blue', 'orange'])\n",
"        axes[0].set_title('Actual values')\n",
"        axes[0].set_xticks([0, 1])\n",
"        axes[0].set_xticklabels(['Malignant tumor', 'Benign tumor'])\n",
"        axes[0].set_ylabel('Number of occurrences')\n",
"\n",
"        # Second chart: predicted values\n",
"        axes[1].bar([0, 1], pred_counts, color=['blue', 'orange'])\n",
"        axes[1].set_title('Predicted values')\n",
"        axes[1].set_xticks([0, 1])\n",
"        axes[1].set_xticklabels(['Malignant tumor', 'Benign tumor'])\n",
"        axes[1].set_ylabel('Number of occurrences')\n",
"\n",
"        # Show the charts\n",
"        plt.show()\n",
"\n",
"    def predict_proba(self, X):\n",
"        if self.log:\n",
"            return self._calculate_posterior_log(X)\n",
"        else:\n",
"            return self._calculate_posterior_normal_vectorized(X)\n",
"\n",
"    def _normal_gaussian_pdf(self, x, mean, std):\n",
"        \"\"\"\n",
"        Gaussian likelihood computed with SciPy; used for testing.\n",
"        \"\"\"\n",
"        return np.prod(norm.pdf(x, mean, std))\n",
"\n",
"    def normal_gaussian_pdf(self, x, mean, std):\n",
"        \"\"\"\n",
"        Gaussian likelihood.\n",
"\n",
"        x is reshaped to (n_samples, 1, n_features) so that NumPy can broadcast it\n",
"        against the means, whose shape is (n_classes, n_features).\n",
"        \"\"\"\n",
"        coefficient = 1 / (np.sqrt(2 * np.pi * std**2))\n",
"        exponent = np.exp(-(np.square(x[:, np.newaxis] - mean)) / (2 * np.square(std)))\n",
"        return coefficient * exponent\n",
"\n",
"    def log_gaussian_pdf(self, x, mean, std):\n",
"        \"\"\"\n",
"        Logarithm of the Gaussian likelihood (log-likelihood).\n",
"        \"\"\"\n",
"        return -0.5 * np.log(2 * np.pi * std**2) - ((x - mean)**2 / (2 * std**2))\n",
"\n",
"    def _calculate_posterior_log(self, x):\n",
"        \"\"\"\n",
"        Posterior computed in log space.\n",
"        Uses the log-sum-exp trick for numerical stability.\n",
"        \"\"\"\n",
"        log_likelihoods = np.zeros((x.shape[0], len(self.classes)))\n",
"        for idx, _ in enumerate(self.classes):\n",
"            prior = np.log(self.priors[idx])\n",
"            log_likelihood = np.sum(self.log_gaussian_pdf(x, self.means[idx, :], self.stds[idx, :]), axis=1)\n",
"            log_likelihoods[:, idx] = log_likelihood + prior\n",
"\n",
"        # The log-sum-exp trick: shift by the row maximum before exponentiating\n",
"        max_log_likelihoods = np.max(log_likelihoods, axis=1, keepdims=True)\n",
"        log_posteriors = log_likelihoods - max_log_likelihoods\n",
"        posteriors = np.exp(log_posteriors)\n",
"        posteriors = posteriors / np.sum(posteriors, axis=1, keepdims=True)\n",
"        return posteriors\n",
"\n",
"    def _calculate_posterior_normal_vectorized(self, X):\n",
"        \"\"\"\n",
"        Vectorized implementation; returns the same results as _calculate_posterior_normal_iterative.\n",
"        It is faster and avoids an explicit Python loop.\n",
"        \"\"\"\n",
"        likelihoods = self.normal_gaussian_pdf(X, self.means, self.stds)\n",
"        posteriors = likelihoods.prod(axis=2) * self.priors\n",
"        posteriors = posteriors / np.sum(posteriors, axis=1, keepdims=True)\n",
"        return posteriors\n",
"\n",
"    def _calculate_posterior_normal_iterative(self, X):\n",
"        \"\"\"\n",
"        Iterative implementation; returns the same results as _calculate_posterior_normal_vectorized.\n",
"        Used in tests to check that the vectorized version works correctly.\n",
"        \"\"\"\n",
"        posteriors = []\n",
"        for x in X:\n",
"            likelihoods = np.zeros((len(self.classes)))\n",
"            for idx, _ in enumerate(self.classes):\n",
"                likelihoods[idx] = self._normal_gaussian_pdf(x, self.means[idx, :], self.stds[idx, :]) * self.priors[idx]\n",
"            posteriors.append(likelihoods)\n",
"        posteriors = np.array(posteriors)\n",
"        posteriors = posteriors / np.sum(posteriors, axis=1, keepdims=True)\n",
"        return posteriors\n",
"\n",
"def pp_predictions(preds, probs):\n",
"    # Pretty-print each prediction together with its posterior probabilities\n",
"    for i, (pred, prob) in enumerate(zip(preds, probs), start=1):\n",
"        print(f\"\\n=== {i} ===\")\n",
"\n",
"        if pred == 0:\n",
"            print(f\"Prediction: {pred} - (Malignant tumor)\")\n",
"        else:\n",
"            print(f\"Prediction: {pred} - (Benign tumor)\")\n",
"\n",
"        print(f\"Probability (Malignant, Benign): {np.round(prob, 4)}\")"
]
},
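{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `log=False` and `log=True` variants evaluate the same model, so on well-scaled, low-dimensional data their posteriors should agree up to floating-point error. The cell below is a minimal sketch of such a check on synthetic data. The names `X_demo`, `y_demo`, `direct` and `in_log` are made up for this illustration, and integer labels are used so that the check does not depend on how the priors are counted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sanity check on synthetic data (not part of the original analysis)\n",
"rng = np.random.default_rng(0)\n",
"X_demo = np.vstack([\n",
"    rng.normal(0.0, 1.0, size=(50, 3)),  # 50 samples of class 0\n",
"    rng.normal(3.0, 1.0, size=(50, 3)),  # 50 samples of class 1\n",
"])\n",
"y_demo = np.array([0] * 50 + [1] * 50)\n",
"\n",
"direct = GaussianNaiveBayes(log=False)\n",
"direct.fit(X_demo, y_demo)\n",
"in_log = GaussianNaiveBayes(log=True)\n",
"in_log.fit(X_demo, y_demo)\n",
"\n",
"# Compare the posteriors of the direct and the log-space computation\n",
"print(np.allclose(direct.predict_proba(X_demo), in_log.predict_proba(X_demo)))"
]
},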
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demonstration of the algorithm"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG image data omitted)",
"text/plain": [
"<Figure size 792x432 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Split the dataset into training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=48)\n",
"\n",
"gnb = GaussianNaiveBayes(log=False)\n",
"gnb.fit(X_train.values, y_train)\n",
"\n",
"# Prior probabilities\n",
"priors = np.array(gnb.priors)\n",
"counts = [np.sum(y_train == 0.0), np.sum(y_train == 1.0)]\n",
"\n",
"# Class labels\n",
"labels = ['Malignant tumor', 'Benign tumor']\n",
"\n",
"# Create the pie charts\n",
"fig, axes = plt.subplots(1, 2, figsize=(11, 6), facecolor='white')\n",
"\n",
"wedges, _, _ = axes[0].pie(priors, autopct='%1.1f%%', startangle=90, colors=['skyblue', 'orange'])\n",
"axes[0].axis('equal') # Equal aspect ratio so the pie is drawn as a circle\n",
"axes[0].set_title('Prior probabilities')\n",
"\n",
"# Convert the percentage passed by autopct back to an absolute count\n",
"def func(pct, allvals):\n",
"    absolute = int(np.round(pct/100.*np.sum(allvals)))\n",
"    return f\"{absolute}\"\n",
"\n",
"axes[1].pie(counts, autopct=lambda pct: func(pct, counts), startangle=90, colors=['skyblue', 'orange'])\n",
"axes[1].axis('equal') # Equal aspect ratio so the pie is drawn as a circle\n",
"axes[1].set_title('Number of occurrences of each class')\n",
"\n",
"# Add a single legend shared by both charts\n",
"fig.legend(wedges, labels, title=\"Classes\", loc=\"center right\", bbox_to_anchor=(0.6, 0.75))\n",
"\n",
"# Use a white background for the figure\n",
"fig.patch.set_facecolor('white')\n",
"\n",
"# Show the figure\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prior probabilities: [0.38085938 0.61914062]\n",
"Accuracy on the test set: 0.9824561403508771\n",
"\n",
"=== 1 ===\n",
"Prediction: 1 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0. 1.]\n",
"\n",
"=== 2 ===\n",
"Prediction: 1 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0.1798 0.8202]\n",
"\n",
"=== 3 ===\n",
"Prediction: 1 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0.0067 0.9933]\n",
"\n",
"=== 4 ===\n",
"Prediction: 1 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0.0472 0.9528]\n",
"\n",
"=== 5 ===\n",
"Prediction: 0 - (Malignant tumor)\n",
"Probability (Malignant, Benign): [1. 0.]\n"
]
}
],
"source": [
"print(f\"Prior probabilities: {gnb.priors}\")\n",
"score = gnb.score(X_test.values, y_test)\n",
"print(\"Accuracy on the test set:\", score)\n",
"\n",
"# Predictions\n",
"preds = gnb.predict(X_test.values[15:20])\n",
"probs = gnb.predict_proba(X_test.values[15:20])\n",
"pp_predictions(preds, probs)"
]
},
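{
"cell_type": "markdown",
"metadata": {},
"source": [
"The printed numbers can be cross-checked by hand. Assuming the standard 569-sample version of the dataset, the 10% test split used above leaves 57 test samples and 512 training samples. The reported values are then consistent with the following fractions:\n",
"\n",
"$$\n",
"P(C_0) = \\frac{195}{512} \\approx 0.3809, \\qquad P(C_1) = \\frac{317}{512} \\approx 0.6191, \\qquad \\text{accuracy} = \\frac{56}{57} \\approx 0.9825\n",
"$$\n",
"\n",
"That is, a single misclassified sample among the 57 test samples. The class counts 195 and 317 are inferred here from the printed priors; they are not stated anywhere in the notebook."
]
},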
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1008x1008 with 0 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "(base64-encoded PNG image data omitted)",
"text/plain": [
"<Figure size 864x432 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(14, 14))\n",
"score = gnb.score_visualize(X_test.values, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comparison with the ready-made scikit-learn implementation"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy on the test set: 0.9824561403508771\n",
"\n",
"=== 1 ===\n",
"Prediction: 1.0 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0. 1.]\n",
"\n",
"=== 2 ===\n",
"Prediction: 1.0 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0.1123 0.8877]\n",
"\n",
"=== 3 ===\n",
"Prediction: 1.0 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0. 1.]\n",
"\n",
"=== 4 ===\n",
"Prediction: 1.0 - (Benign tumor)\n",
"Probability (Malignant, Benign): [0.047 0.953]\n",
"\n",
"=== 5 ===\n",
"Prediction: 0.0 - (Malignant tumor)\n",
"Probability (Malignant, Benign): [1. 0.]\n"
]
}
],
"source": [
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"# Reference model: scikit-learn's Gaussian naive Bayes, fitted on the training split\n",
"clf = GaussianNB()\n",
"clf.fit(X_train.values, y_train)\n",
"print(\"Accuracy on the test set:\", clf.score(X_test.values, y_test))\n",
"# Predicted labels and posterior probabilities for five test samples\n",
"preds = clf.predict(X_test.values[15:20])\n",
"probs = clf.predict_proba(X_test.values[15:20])\n",
"pp_predictions(preds, probs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Decision region visualization"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAs4AAAIYCAYAAAB9i2oeAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAACkSUlEQVR4nOzddZyUVRfA8d+dma3ZYFm6O6SlFAMDFAxAMcBuTBRb7O5OFAywMBEEUQFfAUWRBunuZmG7Zu77x5l1Z3dnloFtON/3sx/ZeSbuzO4L5znPuecYay1KKaWUUkqpojnKewFKKaWUUkpVBho4K6WUUkopFQINnJVSSimllAqBBs5KKaWUUkqFQANnpZRSSimlQqCBs1JKKaWUUiHQwFkpVeqMMScbY1aW8ms8boz5zPfnhsaYFGOMs5Re60FjzKhSeN7/3kN5M8a0MMYsMsY0KoXn/t0Yc31JP29x+X5nmpbi819tjPmjtJ5fKVX6NHBWSh0yY8xwY8xPBW5bHeS2wdbamdbaVmW1PmvtJmttjLXWU0rP/6y1tsIFfiXFGFMFGAlcaK3dWN7rKSu+35l15b0OpVTFpYGzUupwzABOzM3oGmNqA2FA5wK3NffdV1VwRjgArLUHrLWnWmtXl/e6lFKqItHAWSl1OOYggXIn3/c9gf8BKwvcttZau80Yc6oxZguAMWaQ75J47lemMeZ337F8l/ALXto2xrxhjNlsjEkyxswzxpwcaHHGmMbGGGuMcRljehR4vQxjzAbf/bobY/4yxuw3xmw3xrxtjAn3e562xpgpxph9xpidxpgHfbfnK6kwxnxjjNlhjDlgjJlhjGkb7IMzxtQ1xkzwPecaY8wNBe4SaYz5yhiTbIyZb4zp6PfY+40xW33HVhpjevlu3+/3/lJ9772xMaaqMWaiMWa3MSbR9+f6fs/3uzHmGWPMn0Aa0NQYc4IxZo7vvcwxxpzgu+9pxpglfo+daoz5x+/7P4wx5wV5z2cYY1b4nvNtwBQ4fq0xZrlvjb/4l4cE+hkYY2obY9KMMdX87tfF9z7DjJSY+P/Mre93cJIxZmiB116cu27f/Zr7/vyJMeYd32OSjTGzjTHNfMfeMca8UuB5fjTGDPP9uYEx5nvfevb63rP/fV/2vdf1xpizAn1mSqmKSQNnpdQhs9ZmAbOR4Bjff2cCfxS4rVC22Vr7le+SeAxQF1gHfBniS89BAvME4AvgG2NM5EHW+pff61UF/vZ7PQ9wJ1Ad6AH0Am4BMMbEAlOBn33rbA5MC/Iyk4EWQE1gPvB5EUv6Etjie84LgWdzA2CfAcA3fu/xB18w2Aq4DehmrY0F+gAbfO8x3u89voH8LLYif8d/DDQCGgLpQL4gDrgCGALEAsnAJOBNoBrwKjDJF6D+BTQ3xlQ3xriAdkB9Y0ysMSYK6OJ73XyMMdWB74CHkc95LXCi3/HzgAeBgUAN33N86TsW8Gdgrd0B/A5c7PdSlwNjrbXZ1tqOfp/HXcgJ3XxgtO9+ua/dEagH5Csx8nMJ8ATye7MGeMZ3+2jgEuPL0PveYy/gSyNXXCYCG4HGvucf6/ecx/nWUx14EfjQGJPvREIpVXFp4KyUOlzTyQuST0YCnpkFbpse7MG+oOML4Hdr7fuhvKC19jNr7V5rbY619hUgAjiU2uk3gVTgId/zzbPW/u17vg3A+8ApvvueC+yw1r5irc2w1iZba2cHWddHvuOZwONARyN1wgXfcwPgJOB+33MuBEYhwWuuedbab6212UjgGgkcjwT5EUAbY0yYtXaDtXZtgecfBFwKXOALIPdaa7+z1qZZa5ORwO8U8vvEWrvUWpsDnAmsttZ+6vtMvgRWAP2stRnAXOTn2xVYjJwonehb32pr7d4AH8/ZwDK/9/Q6sMPv+I3Ac9ba5b41PAt08mWdi/oZ/BcE+4LVS4BPC3weJwFPA/2ttUnAeKCFMaaF7y5XAF/5TgQD+d5a+49vXZ/ju5
pirf0HOIAEywCDkd/jnUB3JMi/11qb6lu3/4bAjdbakb76+9FAHaBWkNdXSlUwGjgrpQ7XDOAkY0xVoIavHnYWcILvtnYUXd/8DJLlvD3UFzTG3O27pH/AGLMfqIJk7kJ57I3AqcCl1lqv77aWvvKFHcaYJCRoy32+Bkh29GDP6zTGPG+MWet7jg2+Q4HWVRfY5wtic21EspK5Nuf+wbfOLUBda+0aYBgSmO8yxow1xtT1W8exSDb5fGvtbt9tbmPM+8aYjb61zQDiTf5uI5v9/lzXtx5//uubjnyGPX1//h0JxE8h+ElS3QLvyRZ4zUbAG0bKTfYD+5BSjnoU/TMYj5xENAXOAA74Atrcz6MB8DVwlbV2le+1M323Xe47cSsUbBfgH+CnATF+3/tnry/3e54GSHCcc7DntNam+f4YE+S+SqkKRgNnpdTh+gsJXIcAfwL4snrbfLdts9auD/RAY8xgJGi50JeFzJUKuP2+r+33mJOB+5HL81WttfFI1u+gl7l9j30KGGCtPeB36D0ko9rCWhuHlAzkPt9moNnBnhvJ8A4AeiOfR+Pclw1w321Agq8EIVdDpKwiVwO/dTuA+r7HYa39wlp7EhJsWuAF3/1qAOOA26y1C/ye624kI3+c7/3lXg3wX5stsL6C7ef811cwcJ7OwQPn7QXek/H/Hvmcb/SVm+R+RVlrZ1HEz8CXAf8auAzJHP8XAPtKR34AXrfWTi7w0NG+x/QC0qy1fwVZ98F8BgzwlXsc43u93PfT0FfOopQ6wmjgrJQ6LNbadOTS/V3kr239w3dbwGyzLzP6FnBebmbUz0JgoC9T2hy4zu9YLJAD7AZcxphHgbiDrdOXefwKuDI381jgOZOAFGNMa+Bmv2MTgdrGmGHGmAhfLe9xAV4iFsgE9iJB/7PB1mKt3Yxk5Z8zxkQaYzr43qN/TXQXY8xAX+A1zPfcfxtjWhljTjfGRAAZSL2yx3e/74DPrbVfBVhbOrDfGJMAPBZsbT4/AS2NMZca2Vg5CGjj+yzwrb0VUo7wj7V2KRJoH0fwqwuTgLZ+7+l2/E6IgBHAcOPbUGmMqWKMuch37GA/gzHA1UB/JJDN9RGwwlr7YsHF+AJlL/AKRWebi2St3YLU3H8KfOf7/wPAP8jJwvPGmGjfz/nEYM+jlKpcNHBWShXHdGRDnH8N50zfbcECqQHIZqs/TF7Xg9ys4GtAFrATyQz6B5S/IJvwViHlAxnkv+QfTC8kUPvW7/WW+o7dg2SMk5G+xf8Fnr5yijOAfsjl9dXAaQGef4xvPVuBZcjmw6JcgmSltyFZ4sestVP8jo8HBgGJSCZ1oC8rHwE8D+zxracmkiGvj9STDzP5O0k0ROqJo3yP+RvZZBeUr0b5XCRTvRe4DzjXWrvHdzwV2WS31K8u+C+kNGFXkOfcA1zkW/teZBPln37HxyGZ87G+cpJ/gbN8x4r8GVhr/0SC4Pm+GvVcg4HzC3we/h1YxgDtyR9sH47Rvuf5LwD31S73QzYybkJKbQYV83WUUhWEkXIzpZRSoTLGPAnUt9ZeW95rOdoZY34DvrDWhjzJ0RhzJTDEV/aSe5sD2YDZyFq7KcTn6YkE341z6+aVUkc2zTgrpdQh8NXotgEC1m+rsmOM6QZ0xu9KQQiPcSMtBz8ocKgdchVjR6EHBX6eMOAOYJQGzUodPTRwVkqpQzMfKY8YWd4LOZoZY0YjPZ6HFehSUtRj+iA18juRVoi5t1+ADPC5v4jWdP7PcwywH2kl9/qhrl0pVXlpqYZSSimllFIh0IyzUkoppZRSIdDAWSmllFJKqRBUmgbt1atXt40bNy7vZSillFLlbt684MeMgc6dy24tSh1p5s2bt8daWyPQsUoTODdu3Ji5c+eW9zKUUkqpcnfhhTBuHHgL9PNwOGDgQPjmm/JZl1JHAmPMxmDHtFRDKaWUqmReeQWqVoWwsLzbwsLktpdfLr91KXWk08BZKaWUqmQaNYIlS+DWW6
F+ffm69Va5rVGj8l6dUkeuStOOrmvXrlZLNZRSSimlVGkyxsyz1nYNdEwzzkoppZRSSoVAA2ellFJKKaVCoIGzUko
"text/plain": [
"<Figure size 864x648 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from matplotlib.colors import ListedColormap\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Plot the decision regions of a fitted two-feature classifier\n",
"def plot_decision_boundaries(X, y, model, title, h=0.9):\n",
"    # Grid covering the data range with a margin of 1; h is the grid step\n",
"    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n",
"    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n",
"    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n",
"\n",
"    # Predict a class for every grid point, then reshape back to the grid\n",
"    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
"    Z = Z.reshape(xx.shape)\n",
"    plt.figure(figsize=(12, 9))\n",
"    colors_background = ['#FFFFFF', '#00AACC']\n",
"    colors_points = ['#0000FF', '#00FF00']\n",
"    plt.contourf(xx, yy, Z, alpha=0.8, cmap=ListedColormap(colors_background))\n",
"    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, edgecolor='b', cmap=ListedColormap(colors_points))\n",
"    plt.title(title)\n",
"    plt.show()\n",
"\n",
"# The 30 features are reduced to two principal components.\n",
"# PCA performs linear dimensionality reduction using Singular Value Decomposition (SVD).\n",
"pca = PCA(n_components=2)\n",
"X_pca = pca.fit_transform(X_train.values)\n",
"\n",
"gnb = GaussianNaiveBayes(log=True)\n",
"gnb.fit(X_pca, y_train)\n",
"plot_decision_boundaries(X_pca, y_train, gnb, \"Decision regions\")"
]
}
],
"metadata": {
"kernelspec": {
2024-06-13 19:55:44 +02:00
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}