848 lines
23 KiB
Plaintext
848 lines
23 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Uczenie maszynowe UMZ 2017/2018\n",
|
||
"# 3. Naiwny klasyfikator bayesowski, drzewa decyzyjne\n",
|
||
"### Część 2"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## 3.2. Drzewa decyzyjne"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Drzewa decyzyjne – przykład"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "notes"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Przydatne importy\n",
|
||
"\n",
|
||
"import ipywidgets as widgets\n",
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import numpy as np\n",
|
||
"import pandas\n",
|
||
"\n",
|
||
"%matplotlib inline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style>\n",
|
||
" .dataframe thead tr:only-child th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: left;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Day</th>\n",
|
||
" <th>Outlook</th>\n",
|
||
" <th>Humidity</th>\n",
|
||
" <th>Wind</th>\n",
|
||
" <th>Play</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Sunny</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>2</td>\n",
|
||
" <td>Sunny</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>3</td>\n",
|
||
" <td>Overcast</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>4</td>\n",
|
||
" <td>Rain</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>5</td>\n",
|
||
" <td>Rain</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>6</td>\n",
|
||
" <td>Rain</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>7</td>\n",
|
||
" <td>Overcast</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>8</td>\n",
|
||
" <td>Sunny</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>9</td>\n",
|
||
" <td>Sunny</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9</th>\n",
|
||
" <td>10</td>\n",
|
||
" <td>Rain</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10</th>\n",
|
||
" <td>11</td>\n",
|
||
" <td>Sunny</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>11</th>\n",
|
||
" <td>12</td>\n",
|
||
" <td>Overcast</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>12</th>\n",
|
||
" <td>13</td>\n",
|
||
" <td>Overcast</td>\n",
|
||
" <td>Normal</td>\n",
|
||
" <td>Weak</td>\n",
|
||
" <td>Yes</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>13</th>\n",
|
||
" <td>14</td>\n",
|
||
" <td>Rain</td>\n",
|
||
" <td>High</td>\n",
|
||
" <td>Strong</td>\n",
|
||
" <td>No</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Day Outlook Humidity Wind Play\n",
|
||
"0 1 Sunny High Weak No\n",
|
||
"1 2 Sunny High Strong No\n",
|
||
"2 3 Overcast High Weak Yes\n",
|
||
"3 4 Rain High Weak Yes\n",
|
||
"4 5 Rain Normal Weak Yes\n",
|
||
"5 6 Rain Normal Strong No\n",
|
||
"6 7 Overcast Normal Strong Yes\n",
|
||
"7 8 Sunny High Weak No\n",
|
||
"8 9 Sunny Normal Weak Yes\n",
|
||
"9 10 Rain Normal Weak Yes\n",
|
||
"10 11 Sunny Normal Strong Yes\n",
|
||
"11 12 Overcast High Strong Yes\n",
|
||
"12 13 Overcast Normal Weak Yes\n",
|
||
"13 14 Rain High Strong No"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"alldata = pandas.read_csv('tennis.tsv', sep='\\t')\n",
|
||
"alldata"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"{'Humidity': {'High', 'Normal'},\n",
|
||
" 'Outlook': {'Overcast', 'Rain', 'Sunny'},\n",
|
||
" 'Wind': {'Strong', 'Weak'}}"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Dane jako lista słowników\n",
|
||
"data = alldata.T.to_dict().values()\n",
|
||
"features = ['Outlook', 'Humidity', 'Wind']\n",
|
||
"\n",
|
||
"# Możliwe wartości w poszczególnych kolumnach\n",
|
||
"values = {feature: set(row[feature] for row in data)\n",
|
||
" for feature in features}\n",
|
||
"values"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Czy John zagra w tenisa, jeżeli będzie padać, przy wysokiej wilgotności i silnym wietrze?\n",
|
||
"* Algorytm drzew decyzyjnych spróbuje _zrozumieć_ „taktykę” Johna.\n",
|
||
"* Wykorzystamy metodę „dziel i zwyciężaj”."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Podziel dane\n",
|
||
"def split(features, data):\n",
|
||
" values = {feature: list(set(row[feature]\n",
|
||
" for row in data))\n",
|
||
" for feature in features}\n",
|
||
" if not features:\n",
|
||
" return data\n",
|
||
" return {val: split(features[1:],\n",
|
||
" [row for row in data\n",
|
||
" if row[features[0]] == val])\n",
|
||
" for val in values[features[0]]}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 3:\tOvercast\tHigh\tWeak\tYes\n",
|
||
"Day 7:\tOvercast\tNormal\tStrong\tYes\n",
|
||
"Day 12:\tOvercast\tHigh\tStrong\tYes\n",
|
||
"Day 13:\tOvercast\tNormal\tWeak\tYes\n",
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 1:\tSunny\tHigh\tWeak\tNo\n",
|
||
"Day 2:\tSunny\tHigh\tStrong\tNo\n",
|
||
"Day 8:\tSunny\tHigh\tWeak\tNo\n",
|
||
"Day 9:\tSunny\tNormal\tWeak\tYes\n",
|
||
"Day 11:\tSunny\tNormal\tStrong\tYes\n",
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 4:\tRain\tHigh\tWeak\tYes\n",
|
||
"Day 5:\tRain\tNormal\tWeak\tYes\n",
|
||
"Day 6:\tRain\tNormal\tStrong\tNo\n",
|
||
"Day 10:\tRain\tNormal\tWeak\tYes\n",
|
||
"Day 14:\tRain\tHigh\tStrong\tNo\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"split_data = split(['Outlook'], data)\n",
|
||
"\n",
|
||
"for outlook in values['Outlook']:\n",
|
||
" print('\\n\\tOutlook\\tHumid\\tWind\\tPlay')\n",
|
||
" for row in split_data[outlook]:\n",
|
||
" print('Day {Day}:\\t{Outlook}\\t{Humidity}\\t{Wind}\\t{Play}'\n",
|
||
" .format(**row))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Obserwacja: John lubi grać, gdy jest pochmurnie.\n",
|
||
"\n",
|
||
"W pozostałych przypadkach podzielmy dane ponownie:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 1:\tSunny\tHigh\tWeak\tNo\n",
|
||
"Day 2:\tSunny\tHigh\tStrong\tNo\n",
|
||
"Day 8:\tSunny\tHigh\tWeak\tNo\n",
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 9:\tSunny\tNormal\tWeak\tYes\n",
|
||
"Day 11:\tSunny\tNormal\tStrong\tYes\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"split_data_sunny = split(['Outlook', 'Humidity'], data)\n",
|
||
"\n",
|
||
"for humidity in values['Humidity']:\n",
|
||
" print('\\n\\tOutlook\\tHumid\\tWind\\tPlay')\n",
|
||
" for row in split_data_sunny['Sunny'][humidity]:\n",
|
||
" print('Day {Day}:\\t{Outlook}\\t{Humidity}\\t{Wind}\\t{Play}'\n",
|
||
" .format(**row))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 6:\tRain\tNormal\tStrong\tNo\n",
|
||
"Day 14:\tRain\tHigh\tStrong\tNo\n",
|
||
"\n",
|
||
"\tOutlook\tHumid\tWind\tPlay\n",
|
||
"Day 4:\tRain\tHigh\tWeak\tYes\n",
|
||
"Day 5:\tRain\tNormal\tWeak\tYes\n",
|
||
"Day 10:\tRain\tNormal\tWeak\tYes\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"split_data_rain = split(['Outlook', 'Wind'], data)\n",
|
||
"\n",
|
||
"for wind in values['Wind']:\n",
|
||
" print('\\n\\tOutlook\\tHumid\\tWind\\tPlay')\n",
|
||
" for row in split_data_rain['Rain'][wind]:\n",
|
||
" print('Day {Day}:\\t{Outlook}\\t{Humidity}\\t{Wind}\\t{Play}'\n",
|
||
" .format(**row))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Outlook=\n",
|
||
" * Overcast\n",
|
||
" * → Playing\n",
|
||
" * Sunny\n",
|
||
" * Humidity=\n",
|
||
" * High\n",
|
||
" * → Not playing\n",
|
||
" * Normal\n",
|
||
" * → Playing\n",
|
||
" * Rain\n",
|
||
" * Wind=\n",
|
||
" * Weak\n",
|
||
" * → Playing\n",
|
||
" * Strong\n",
|
||
" * → Not playing"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* (9/5)\n",
|
||
" * Outlook=Overcast (4/0)\n",
|
||
" * YES\n",
|
||
" * Outlook=Sunny (2/3)\n",
|
||
" * Humidity=High (0/3)\n",
|
||
" * NO\n",
|
||
" * Humidity=Normal (2/0)\n",
|
||
" * YES\n",
|
||
" * Outlook=Rain (3/2)\n",
|
||
" * Wind=Weak (3/0)\n",
|
||
" * YES\n",
|
||
" * Wind=Strong (0/2)\n",
|
||
" * NO"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm ID3\n",
|
||
"\n",
|
||
"* podziel(węzeł, zbiór przykładów):\n",
|
||
" 1. A ← najlepszy atrybut do podziału zbioru przykładów\n",
|
||
" 1. Dla każdej wartości atrybutu A, utwórz nowy węzeł potomny\n",
|
||
" 1. Podziel zbiór przykładów na podzbiory według węzłów potomnych\n",
|
||
" 1. Dla każdego węzła potomnego i podzbioru:\n",
|
||
" * jeżeli podzbiór jest jednolity: zakończ\n",
|
||
" * w przeciwnym przypadku: podziel(węzeł potomny, podzbiór)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Jak wybrać „najlepszy atrybut”?\n",
|
||
"* powinien zawierać jednolity podzbiór\n",
|
||
"* albo przynajmniej „w miarę jednolity”\n",
|
||
"\n",
|
||
"Skąd wziąć miarę „jednolitości” podzbioru?\n",
|
||
"* miara powinna być symetryczna (4/0 vs. 0/4)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Entropia\n",
|
||
"\n",
|
||
"$$ H(S) = - p_{(+)} \\log p_{(+)} - p_{(-)} \\log p_{(-)} $$\n",
|
||
"\n",
|
||
"* $S$ – podzbiór przykładów\n",
|
||
"* $p_{(+)}$, $p_{(-)}$ – procent pozytywnych/negatywnych przykładów w $S$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Entropię można traktować jako „liczbę bitów” potrzebną do sprawdzenia, czy losowo wybrany $x \\in S$ jest pozytywnym, czy negatywnym przykładem."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### Entropia – przykład\n",
|
||
"\n",
|
||
"* (3 TAK / 3 NIE):\n",
|
||
"$$ H(S) = -\\frac{3}{6} \\log\\frac{3}{6} - \\frac{3}{6} \\log\\frac{3}{6} = 1 \\mbox{ bit} $$\n",
|
||
"* (4 TAK / 0 NIE):\n",
|
||
"$$ H(S) = -\\frac{4}{4} \\log\\frac{4}{4} - \\frac{0}{4} \\log\\frac{0}{4} = 0 \\mbox{ bitów} $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### _Information gain_\n",
|
||
"\n",
|
||
"_Information gain_ – różnica między entropią przed podziałem a entropią po podziale (podczas podziału entropia zmienia się):\n",
|
||
"\n",
|
||
"$$ \\mathop{\\rm Gain}(S,A) = H(S) - \\sum_{V \\in \\mathop{\\rm Values(A)}} \\frac{|S_V|}{|S|} H(S_V) $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### _Information gain_ – przykład\n",
|
||
"\n",
|
||
"$$ \\mathop{\\rm Gain}(S, Wind) = H(S) - \\frac{8}{14} H(S_{Wind={\\rm Weak}}) - \\frac{6}{14} H(S_{Wind={\\rm Strong}}) = \\\\\n",
|
||
"= 0.94 - \\frac{8}{14} \\cdot 0.81 - \\frac{6}{14} \\cdot 1.0 = 0.049 $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* _Information gain_ jest całkiem sensowną heurystyką wskazującą, który atrybut jest najlepszy do dokonania podziału.\n",
|
||
"* **Ale**: _information gain_ przeszacowuje użyteczność atrybutów, które mają dużo różnych wartości.\n",
|
||
"* **Przykład**: gdybyśmy wybrali jako atrybut _datę_, otrzymalibyśmy bardzo duży _information gain_, ponieważ każdy podzbiór byłby jednolity, a nie byłoby to ani trochę użyteczne!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### _Information gain ratio_\n",
|
||
"\n",
|
||
"$$ \\mathop{\\rm GainRatio}(S, A) = \\frac{ \\mathop{\\rm Gain}(S, A) }{ -\\sum_{V \\in \\mathop{\\rm Values}(A)} \\frac{|S_V|}{|S|} \\log\\frac{|S_V|}{|S|} } $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* _Information gain ratio_ może być lepszym wyborem heurystyki wskazującej najużyteczniejszy atrybut, jeżeli atrybuty mają wiele różnych wartości."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Drzewa decyzyjne a formuły logiczne\n",
|
||
"\n",
|
||
"Drzewo decyzyjne można pzekształcić na formułę logiczną w postaci normalnej (DNF):\n",
|
||
"\n",
|
||
"$$ Play={\\rm True} \\Leftrightarrow \\left( Outlook={\\rm Overcast} \\vee \\\\\n",
|
||
"( Outlook={\\rm Rain} \\wedge Wind={\\rm Weak} ) \\vee \\\\\n",
|
||
"( Outlook={\\rm Sunny} \\wedge Humidity={\\rm Normal} ) \\right) $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Klasyfikacja wieloklasowa przy użyciu drzew decyzyjnych\n",
|
||
"\n",
|
||
"Algorytm przebiega analogicznie, zmienia się jedynie wzór na entropię:\n",
|
||
"\n",
|
||
"$$ H(S) = -\\sum_{y \\in Y} p_{(y)} \\log p_{(y)} $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Skuteczność algorytmu ID3\n",
|
||
"\n",
|
||
"* Przyjmujemy, że wśród danych uczących nie ma duplikatów (tj. przykładów, które mają jednakowe cechy $x$, a mimo to należą do różnych klas $y$).\n",
|
||
"* Wówczas algorytm drzew decyzyjnych zawsze znajdzie rozwiązanie, ponieważ w ostateczności będziemy mieli węzły 1-elementowe na liściach drzewa."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Nadmierne dopasowanie drzew decyzyjnych\n",
|
||
"\n",
|
||
"* Zauważmy, że w miarę postępowania algorytmu dokładność przewidywań drzewa (_accuracy_) liczona na zbiorze uczącym dąży do 100% (i w ostateczności osiąga 100%, nawet kosztem jednoelementowych liści).\n",
|
||
"* Takie rozwiązanie niekoniecznie jest optymalne. Dokładność na zbiorze testowym może być dużo niższa, a to oznacza nadmierne dopasowanie."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Jak zapobiec nadmiernemu dopasowaniu?\n",
|
||
"\n",
|
||
"Aby zapobiegać nadmiernemu dopasowaniu drzew decyzyjnych, należy je przycinać (_pruning_)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Można tego dokonywać na kilka sposobów:\n",
|
||
"* Można zatrzymywać procedurę podziału w pewnym momencie (np. kiedy podzbiory staja się zbyt małe).\n",
|
||
"* Można najpierw wykonać algorytm ID3 w całości, a następnie przyciąć drzewo, np. kierując się wynikami uzyskanymi na zbiorze walidacyjnym.\n",
|
||
"* Algorytm _sub-tree replacement pruning_ (algorytm zachłanny)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### _Sub-tree replacement pruning_\n",
|
||
"\n",
|
||
"1. Dla każdego węzła:\n",
|
||
" 1. Udaj, że usuwasz węzeł wraz z całym zaczepionym w nim poddrzewem.\n",
|
||
" 1. Dokonaj ewaluacji na zbiorze walidacyjnym.\n",
|
||
"1. Usuń węzeł, którego usunięcie daje największą poprawę wyniku.\n",
|
||
"1. Powtarzaj, dopóki usuwanie węzłów poprawia wynik."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Zalety drzew decyzyjnych\n",
|
||
"\n",
|
||
"* Zasadę działania drzew decyzyjnych łatwo zrozumieć człowiekowi.\n",
|
||
"* Atrybuty, które nie wpływają na wynik, mają _gain_ równy 0, zatem są od razu pomijane przez algorytm.\n",
|
||
"* Po zbudowaniu, drzewo decyzyjne jest bardzo szybkim klasyfikatorem (złożoność $O(d)$, gdzie $d$ jest głębokościa dzewa)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Wady drzew decyzyjnych\n",
|
||
"\n",
|
||
"* ID3 jest algorytmem zachłannym – może nie wskazać najlepszego drzewa.\n",
|
||
"* Nie da się otrzymać granic klas (_decision boundaries_), które nie są równoległe do osi wykresu."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## 3.3. Lasy losowe"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm lasów losowych – idea\n",
|
||
"\n",
|
||
"* Algorytm lasów losowych jest rozwinięciem algorytmu ID3.\n",
|
||
"* Jest to bardzo wydajny algorytm klasyfikacji.\n",
|
||
"* Zamiast jednego, będziemy budować $k$ drzew."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm lasów losowych – budowa lasu\n",
|
||
"\n",
|
||
"1. Weź losowy podzbiór $S_r$ zbioru uczącego.\n",
|
||
"1. Zbuduj pełne (tj. bez przycinania) drzewo decyzyjne dla $S_r$, używając algorytmu ID3 z następującymi modyfikacjami:\n",
|
||
" * podczas podziału używaj losowego $d$-elementowego podzbioru atrybutów,\n",
|
||
" * obliczaj _gain_ względem $S_r$.\n",
|
||
"1. Powyższą procedurę powtórz $k$-krotnie, otrzymując $k$ drzew ($T_1, T_2, \\ldots, T_k$)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm lasów losowych – predykcja\n",
|
||
"\n",
|
||
"1. Sklasyfikuj $x$ według każdego z drzew $T_1, T_2, \\ldots, T_k$ z osobna.\n",
|
||
"1. Użyj głosowania większościowego: przypisz klasę przewidzianą przez najwięcej drzew."
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"celltoolbar": "Slideshow",
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 2
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython2",
|
||
"version": "2.7.15rc1"
|
||
},
|
||
"livereveal": {
|
||
"start_slideshow_at": "selected",
|
||
"theme": "amu"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|