1074 lines
27 KiB
Plaintext
1074 lines
27 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Uczenie maszynowe UMZ 2019/2020\n",
|
||
"### 19 maja 2020\n",
|
||
"# 10. Sieci neuronowe – propagacja wsteczna"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "notes"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"%matplotlib inline\n",
|
||
"\n",
|
||
"import numpy as np\n",
|
||
"import math"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## 10.1. Metoda propagacji wstecznej – wprowadzenie"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"<img src=\"nn1.png\" />"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Architektura sieci neuronowych\n",
|
||
"\n",
|
||
"* Budowa warstwowa, najczęściej sieci jednokierunkowe i gęste.\n",
|
||
"* Liczbę i rozmiar warstw dobiera się do każdego problemu.\n",
|
||
"* Rozmiary sieci określane poprzez liczbę neuronów lub parametrów."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### _Feedforward_\n",
|
||
"\n",
|
||
"Mając daną $n$-warstwową sieć neuronową oraz jej parametry $\\Theta^{(1)}, \\ldots, \\Theta^{(L)} $ oraz $\\beta^{(1)}, \\ldots, \\beta^{(L)} $, obliczamy:\n",
|
||
"\n",
|
||
"$$a^{(l)} = g^{(l)}\\left( a^{(l-1)} \\Theta^{(l)} + \\beta^{(l)} \\right). $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"<img src=\"nn2.png\" />"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Funkcje $g^{(l)}$ to **funkcje aktywacji**.<br/>\n",
|
||
"Dla $i = 0$ przyjmujemy $a^{(0)} = x$ (wektor wierszowy cech) oraz $g^{(0)}(x) = x$ (identyczność)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Parametry $\\Theta$ to wagi na połączeniach miedzy neuronami dwóch warstw.<br/>\n",
|
||
"Rozmiar macierzy $\\Theta^{(l)}$, czyli macierzy wag na połączeniach warstw $a^{(l-1)}$ i $a^{(l)}$, to $\\dim(a^{(l-1)}) \\times \\dim(a^{(l)})$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Parametry $\\beta$ zastępują tutaj dodawanie kolumny z jedynkami do macierzy cech.<br/>Macierz $\\beta^{(l)}$ ma rozmiar równy liczbie neuronów w odpowiedniej warstwie, czyli $1 \\times \\dim(a^{(l)})$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* **Klasyfikacja**: dla ostatniej warstwy $L$ (o rozmiarze równym liczbie klas) przyjmuje się $g^{(L)}(x) = \\mathop{\\mathrm{softmax}}(x)$.\n",
|
||
"* **Regresja**: pojedynczy neuron wyjściowy; funkcją aktywacji może wtedy być np. funkcja identycznościowa."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Pozostałe funkcje aktywacji najcześciej mają postać sigmoidy, np. sigmoidalna, tangens hiperboliczny.<br/> Ale niekoniecznie, np. ReLU, leaky ReLU, maxout."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Jak uczyć sieci neuronowe?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* W poznanych do tej pory algorytmach (regresja liniowa, regresja logistyczna) do uczenia używaliśmy funkcji kosztu, jej gradientu oraz algorytmu gradientu prostego (GD/SGD)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Dla sieci neuronowych potrzebowalibyśmy również znaleźć gradient funkcji kosztu."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Co sprowadza się do bardziej ogólnego problemu:<br/>jak obliczyć gradient $\\nabla f(x)$ dla danej funkcji $f$ i wektora wejściowego $x$?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Pochodna funkcji\n",
|
||
"\n",
|
||
"* **Pochodna** mierzy, jak szybko zmienia się wartość funkcji względem zmiany jej argumentów:\n",
|
||
"\n",
|
||
"$$ \\frac{d f(x)}{d x} = \\lim_{h \\to 0} \\frac{ f(x + h) - f(x) }{ h } $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Pochodna cząstkowa i gradient\n",
|
||
"\n",
|
||
"* **Pochodna cząstkowa** mierzy, jak szybko zmienia się wartość funkcji względem zmiany jej *pojedynczego argumentu*."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* **Gradient** to wektor pochodnych cząstkowych:\n",
|
||
"\n",
|
||
"$$ \\nabla f = \\left( \\frac{\\partial f}{\\partial x_1}, \\ldots, \\frac{\\partial f}{\\partial x_n} \\right) $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Gradient – przykłady\n",
|
||
"\n",
|
||
"$$ f(x_1, x_2) = x_1 + x_2 \\qquad \\to \\qquad \\frac{\\partial f}{\\partial x_1} = 1, \\quad \\frac{\\partial f}{\\partial x_2} = 1, \\quad \\nabla f = (1, 1) $$ "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"$$ f(x_1, x_2) = x_1 \\cdot x_2 \\qquad \\to \\qquad \\frac{\\partial f}{\\partial x_1} = x_2, \\quad \\frac{\\partial f}{\\partial x_2} = x_1, \\quad \\nabla f = (x_2, x_1) $$ "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"$$ f(x_1, x_2) = \\max(x_1 + x_2) \\hskip{12em} \\\\\n",
|
||
"\\to \\qquad \\frac{\\partial f}{\\partial x_1} = \\mathbb{1}_{x \\geq y}, \\quad \\frac{\\partial f}{\\partial x_2} = \\mathbb{1}_{y \\geq x}, \\quad \\nabla f = (\\mathbb{1}_{x \\geq y}, \\mathbb{1}_{y \\geq x}) $$ "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Własności pochodnych cząstkowych\n",
|
||
"\n",
|
||
"Jezeli $f(x, y, z) = (x + y) \\, z$ oraz $x + y = q$, to:\n",
|
||
"$$f = q z,\n",
|
||
"\\quad \\frac{\\partial f}{\\partial q} = z,\n",
|
||
"\\quad \\frac{\\partial f}{\\partial z} = q,\n",
|
||
"\\quad \\frac{\\partial q}{\\partial x} = 1,\n",
|
||
"\\quad \\frac{\\partial q}{\\partial y} = 1 $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Reguła łańcuchowa\n",
|
||
"\n",
|
||
"$$ \\frac{\\partial f}{\\partial x} = \\frac{\\partial f}{\\partial q} \\, \\frac{\\partial q}{\\partial x},\n",
|
||
"\\quad \\frac{\\partial f}{\\partial y} = \\frac{\\partial f}{\\partial q} \\, \\frac{\\partial q}{\\partial y} $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Propagacja wsteczna – prosty przykład"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Dla ustalonego wejścia\n",
|
||
"x = -2; y = 5; z = -4"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"(3, -12)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Krok w przód\n",
|
||
"q = x + y\n",
|
||
"f = q * z\n",
|
||
"print(q, f)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[-4, -4, 3]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Propagacja wsteczna dla f = q * z\n",
|
||
"dz = q\n",
|
||
"dq = z\n",
|
||
"# Propagacja wsteczna dla q = x + y\n",
|
||
"dx = 1 * dq # z reguły łańcuchowej\n",
|
||
"dy = 1 * dq # z reguły łańcuchowej\n",
|
||
"print([dx, dy, dz])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"<img src=\"exp1.png\" />"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Właśnie tak wygląda obliczanie pochodnych metodą propagacji wstecznej!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Spróbujmy czegoś bardziej skomplikowanego:<br/>metodą propagacji wstecznej obliczmy pochodną funkcji sigmoidalnej."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Propagacja wsteczna – funkcja sigmoidalna"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"Funkcja sigmoidalna:\n",
|
||
"\n",
|
||
"$$f(\\theta,x) = \\frac{1}{1+e^{-(\\theta_0 x_0 + \\theta_1 x_1 + \\theta_2)}}$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"$$\n",
|
||
"\\begin{array}{lcl}\n",
|
||
"f(x) = \\frac{1}{x} \\quad & \\rightarrow & \\quad \\frac{df}{dx} = -\\frac{1}{x^2} \\\\\n",
|
||
"f_c(x) = c + x \\quad & \\rightarrow & \\quad \\frac{df}{dx} = 1 \\\\\n",
|
||
"f(x) = e^x \\quad & \\rightarrow & \\quad \\frac{df}{dx} = e^x \\\\\n",
|
||
"f_a(x) = ax \\quad & \\rightarrow & \\quad \\frac{df}{dx} = a \\\\\n",
|
||
"\\end{array}\n",
|
||
"$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"<img src=\"exp2.png\" />"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0.3932238664829637, -0.5898357997244456]\n",
|
||
"[-0.19661193324148185, -0.3932238664829637, 0.19661193324148185]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Losowe wagi i dane\n",
|
||
"w = [2,-3,-3]\n",
|
||
"x = [-1, -2]\n",
|
||
"\n",
|
||
"# Krok w przód\n",
|
||
"dot = w[0]*x[0] + w[1]*x[1] + w[2]\n",
|
||
"f = 1.0 / (1 + math.exp(-dot)) # funkcja sigmoidalna\n",
|
||
"\n",
|
||
"# Krok w tył\n",
|
||
"ddot = (1 - f) * f # pochodna funkcji sigmoidalnej\n",
|
||
"dx = [w[0] * ddot, w[1] * ddot]\n",
|
||
"dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]\n",
|
||
"\n",
|
||
"print(dx)\n",
|
||
"print(dw)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Obliczanie gradientów – podsumowanie\n",
|
||
"\n",
|
||
"* Gradient $f$ dla $x$ mówi jak zmieni się całe wyrażenie przy zmianie wartości $x$.\n",
|
||
"* Gradienty łączymy korzystając z **reguły łańcuchowej**.\n",
|
||
"* W kroku wstecz gradienty informują, które części grafu powinny być zwiększone lub zmniejszone (i z jaką siłą), aby zwiększyć wartość na wyjściu.\n",
|
||
"* W kontekście implementacji chcemy dzielić funkcję $f$ na części, dla których można łatwo obliczyć gradienty."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## 10.2. Uczenie wielowarstwowych sieci neuronowych metodą propagacji wstecznej"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Mając algorytm SGD oraz gradienty wszystkich wag, moglibyśmy trenować każdą sieć."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Niech:\n",
|
||
"$$\\Theta = (\\Theta^{(1)},\\Theta^{(2)},\\Theta^{(3)},\\beta^{(1)},\\beta^{(2)},\\beta^{(3)})$$\n",
|
||
"\n",
|
||
"* Funkcja sieci neuronowej z grafiki:\n",
|
||
"\n",
|
||
"$$\\small h_\\Theta(x) = \\tanh(\\tanh(\\tanh(x\\Theta^{(1)}+\\beta^{(1)})\\Theta^{(2)} + \\beta^{(2)})\\Theta^{(3)} + \\beta^{(3)})$$\n",
|
||
"* Funkcja kosztu dla regresji:\n",
|
||
"$$J(\\Theta) = \\dfrac{1}{2m} \\sum_{i=1}^{m} (h_\\Theta(x^{(i)})- y^{(i)})^2 $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Jak obliczymy gradienty?\n",
|
||
"\n",
|
||
"$$\\nabla_{\\Theta^{(l)}} J(\\Theta) = ? \\quad \\nabla_{\\beta^{(l)}} J(\\Theta) = ?$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### W kierunku propagacji wstecznej\n",
|
||
"\n",
|
||
"* Pewna (niewielka) zmiana wagi $\\Delta z^l_j$ dla $j$-ego neuronu w warstwie $l$ pociąga za sobą (niewielką) zmianę kosztu: \n",
|
||
"\n",
|
||
"$$\\frac{\\partial J(\\Theta)}{\\partial z^{l}_j} \\Delta z^{l}_j$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Jeżeli $\\frac{\\partial J(\\Theta)}{\\partial z^{l}_j}$ jest duża, $\\Delta z^l_j$ ze znakiem przeciwnym zredukuje koszt.\n",
|
||
"* Jeżeli $\\frac{\\partial J(\\Theta)}{\\partial z^l_j}$ jest bliska zeru, koszt nie będzie mocno poprawiony."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"* Definiujemy błąd $\\delta^l_j$ neuronu $j$ w warstwie $l$: \n",
|
||
"\n",
|
||
"$$\\delta^l_j := \\dfrac{\\partial J(\\Theta)}{\\partial z^l_j}$$ \n",
|
||
"$$\\delta^l := \\nabla_{z^l} J(\\Theta) \\quad \\textrm{ (zapis wektorowy)} $$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Podstawowe równania propagacji wstecznej\n",
|
||
"\n",
|
||
"$$\n",
|
||
"\\begin{array}{rcll}\n",
|
||
"\\delta^L & = & \\nabla_{a^L}J(\\Theta) \\odot { \\left( g^{L} \\right) }^{\\prime} \\left( z^L \\right) & (BP1) \\\\[2mm]\n",
|
||
"\\delta^{l} & = & \\left( \\left( \\Theta^{l+1} \\right) \\! ^\\top \\, \\delta^{l+1} \\right) \\odot {{ \\left( g^{l} \\right) }^{\\prime}} \\left( z^{l} \\right) & (BP2)\\\\[2mm]\n",
|
||
"\\nabla_{\\beta^l} J(\\Theta) & = & \\delta^l & (BP3)\\\\[2mm]\n",
|
||
"\\nabla_{\\Theta^l} J(\\Theta) & = & a^{l-1} \\odot \\delta^l & (BP4)\\\\\n",
|
||
"\\end{array}\n",
|
||
"$$\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### (BP1)\n",
|
||
"$$ \\delta^L_j \\; = \\; \\frac{ \\partial J }{ \\partial a^L_j } \\, g' \\!\\! \\left( z^L_j \\right) $$\n",
|
||
"$$ \\delta^L \\; = \\; \\nabla_{a^L}J(\\Theta) \\odot { \\left( g^{L} \\right) }^{\\prime} \\left( z^L \\right) $$\n",
|
||
"Błąd w ostatniej warstwie jest iloczynem szybkości zmiany kosztu względem $j$-tego wyjścia i szybkości zmiany funkcji aktywacji w punkcie $z^L_j$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### (BP2)\n",
|
||
"$$ \\delta^{l} \\; = \\; \\left( \\left( \\Theta^{l+1} \\right) \\! ^\\top \\, \\delta^{l+1} \\right) \\odot {{ \\left( g^{l} \\right) }^{\\prime}} \\left( z^{l} \\right) $$\n",
|
||
"Aby obliczyć błąd w $l$-tej warstwie, należy przemnożyć błąd z następnej ($(l+1)$-szej) warstwy przez transponowany wektor wag, a uzyskaną macierz pomnożyć po współrzędnych przez szybkość zmiany funkcji aktywacji w punkcie $z^l$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### (BP3)\n",
|
||
"$$ \\nabla_{\\beta^l} J(\\Theta) \\; = \\; \\delta^l $$\n",
|
||
"Błąd w $l$-tej warstwie jest równy wartości gradientu funkcji kosztu."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"#### (BP4)\n",
|
||
"$$ \\nabla_{\\Theta^l} J(\\Theta) \\; = \\; a^{l-1} \\odot \\delta^l $$\n",
|
||
"Gradient funkcji kosztu względem wag $l$-tej warstwy można obliczyć jako iloczyn po współrzędnych $a^{l-1}$ przez $\\delta^l$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm propagacji wstecznej"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Dla jednego przykładu $(x,y)$:\n",
|
||
"\n",
|
||
"1. **Wejście**: Ustaw aktywacje w warstwie cech $a^{(0)}=x$ \n",
|
||
"2. **Feedforward:** dla $l=1,\\dots,L$ oblicz \n",
|
||
"$$z^{(l)} = a^{(l-1)} \\Theta^{(l)} + \\beta^{(l)} \\textrm{ oraz } a^{(l)}=g^{(l)} \\!\\! \\left( z^{(l)} \\right) $$\n",
|
||
"3. **Błąd wyjścia $\\delta^{(L)}$:** oblicz wektor $$\\delta^{(L)}= \\nabla_{a^{(L)}}J(\\Theta) \\odot {g^{\\prime}}^{(L)} \\!\\! \\left( z^{(L)} \\right) $$\n",
|
||
"4. **Propagacja wsteczna błędu:** dla $l = L-1,L-2,\\dots,1$ oblicz $$\\delta^{(l)} = \\delta^{(l+1)}(\\Theta^{(l+1)})^T \\odot {g^{\\prime}}^{(l)} \\!\\! \\left( z^{(l)} \\right) $$\n",
|
||
"5. **Gradienty:** \n",
|
||
" * $\\dfrac{\\partial}{\\partial \\Theta_{ij}^{(l)}} J(\\Theta) = a_i^{(l-1)}\\delta_j^{(l)} \\textrm{ oraz } \\dfrac{\\partial}{\\partial \\beta_{j}^{(l)}} J(\\Theta) = \\delta_j^{(l)}$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"W naszym przykładzie:\n",
|
||
"\n",
|
||
"$$\\small J(\\Theta) = \\frac{1}{2} \\left( a^{(L)} - y \\right) ^2 $$\n",
|
||
"$$\\small \\dfrac{\\partial}{\\partial a^{(L)}} J(\\Theta) = a^{(L)} - y$$\n",
|
||
"\n",
|
||
"$$\\small \\tanh^{\\prime}(x) = 1 - \\tanh^2(x)$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"<img src=\"nn3.png\" />"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Algorytm SGD z propagacją wsteczną\n",
|
||
"\n",
|
||
"Pojedyncza iteracja:\n",
|
||
"* Dla parametrów $\\Theta = (\\Theta^{(1)},\\ldots,\\Theta^{(L)})$ utwórz pomocnicze macierze zerowe $\\Delta = (\\Delta^{(1)},\\ldots,\\Delta^{(L)})$ o takich samych wymiarach (dla uproszczenia opuszczono wagi $\\beta$).\n",
|
||
"* Dla $m$ przykładów we wsadzie (_batch_), $i = 1,\\ldots,m$:\n",
|
||
" * Wykonaj algortym propagacji wstecznej dla przykładu $(x^{(i)}, y^{(i)})$ i przechowaj gradienty $\\nabla_{\\Theta}J^{(i)}(\\Theta)$ dla tego przykładu;\n",
|
||
" * $\\Delta := \\Delta + \\dfrac{1}{m}\\nabla_{\\Theta}J^{(i)}(\\Theta)$\n",
|
||
"* Wykonaj aktualizację wag: $\\Theta := \\Theta - \\alpha \\Delta$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"### Propagacja wsteczna – podsumowanie\n",
|
||
"\n",
|
||
"* Algorytm pierwszy raz wprowadzony w latach 70. XX w.\n",
|
||
"* W 1986 David Rumelhart, Geoffrey Hinton i Ronald Williams pokazali, że jest znacznie szybszy od wcześniejszych metod.\n",
|
||
"* Obecnie najpopularniejszy algorytm uczenia sieci neuronowych."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## 10.3. Implementacja sieci neuronowych"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>łod.dł.</th>\n",
|
||
" <th>łod.sz.</th>\n",
|
||
" <th>pł.dł.</th>\n",
|
||
" <th>pł.sz.</th>\n",
|
||
" <th>Iris setosa?</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>5.2</td>\n",
|
||
" <td>3.4</td>\n",
|
||
" <td>1.4</td>\n",
|
||
" <td>0.2</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>5.1</td>\n",
|
||
" <td>3.7</td>\n",
|
||
" <td>1.5</td>\n",
|
||
" <td>0.4</td>\n",
|
||
" <td>1.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>6.7</td>\n",
|
||
" <td>3.1</td>\n",
|
||
" <td>5.6</td>\n",
|
||
" <td>2.4</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>6.5</td>\n",
|
||
" <td>3.2</td>\n",
|
||
" <td>5.1</td>\n",
|
||
" <td>2.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>4.9</td>\n",
|
||
" <td>2.5</td>\n",
|
||
" <td>4.5</td>\n",
|
||
" <td>1.7</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>6.0</td>\n",
|
||
" <td>2.7</td>\n",
|
||
" <td>5.1</td>\n",
|
||
" <td>1.6</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" łod.dł. łod.sz. pł.dł. pł.sz. Iris setosa?\n",
|
||
"0 5.2 3.4 1.4 0.2 1.0\n",
|
||
"1 5.1 3.7 1.5 0.4 1.0\n",
|
||
"2 6.7 3.1 5.6 2.4 0.0\n",
|
||
"3 6.5 3.2 5.1 2.0 0.0\n",
|
||
"4 4.9 2.5 4.5 1.7 0.0\n",
|
||
"5 6.0 2.7 5.1 1.6 0.0"
|
||
]
|
||
},
|
||
"execution_count": 6,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"import pandas\n",
|
||
"src_cols = ['łod.dł.', 'łod.sz.', 'pł.dł.', 'pł.sz.', 'Gatunek']\n",
|
||
"trg_cols = ['łod.dł.', 'łod.sz.', 'pł.dł.', 'pł.sz.', 'Iris setosa?']\n",
|
||
"data = (\n",
|
||
" pandas.read_csv('iris.csv', usecols=src_cols)\n",
|
||
" .apply(lambda x: [x[0], x[1], x[2], x[3], 1 if x[4] == 'Iris-setosa' else 0], axis=1))\n",
|
||
"data.columns = trg_cols\n",
|
||
"data[:6]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[1. 5.2 3.4 1.4 0.2]\n",
|
||
" [1. 5.1 3.7 1.5 0.4]\n",
|
||
" [1. 6.7 3.1 5.6 2.4]\n",
|
||
" [1. 6.5 3.2 5.1 2. ]\n",
|
||
" [1. 4.9 2.5 4.5 1.7]\n",
|
||
" [1. 6. 2.7 5.1 1.6]]\n",
|
||
"[[1.]\n",
|
||
" [1.]\n",
|
||
" [0.]\n",
|
||
" [0.]\n",
|
||
" [0.]\n",
|
||
" [0.]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"m, n_plus_1 = data.values.shape\n",
|
||
"n = n_plus_1 - 1\n",
|
||
"Xn = data.values[:, 0:n].reshape(m, n)\n",
|
||
"X = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)\n",
|
||
"Y = np.matrix(data.values[:, n]).reshape(m, 1)\n",
|
||
"\n",
|
||
"print(X[:6])\n",
|
||
"print(Y[:6])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stderr",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"/home/pawel/.local/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
|
||
" from ._conv import register_converters as _register_converters\n",
|
||
"Using TensorFlow backend.\n"
|
||
]
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Epoch 1/1\n",
|
||
"150/150 [==============================] - 0s 2ms/step - loss: 3.6282 - acc: 0.3333\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<keras.callbacks.History at 0x7f9bd195e190>"
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"from keras.models import Sequential\n",
|
||
"from keras.layers import Dense\n",
|
||
"\n",
|
||
"model = Sequential()\n",
|
||
"model.add(Dense(3, input_dim=5))\n",
|
||
"model.add(Dense(3))\n",
|
||
"model.add(Dense(1, activation='sigmoid'))\n",
|
||
"\n",
|
||
"model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
|
||
"\n",
|
||
"model.fit(X, Y)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.05484907701611519"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"model.predict(np.array([1.0, 3.0, 1.0, 2.0, 4.0]).reshape(-1, 5)).tolist()[0][0]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"150/150 [==============================] - 0s 293us/step\n",
|
||
"()\n",
|
||
"loss:\t3.4469\n",
|
||
"acc:\t0.3333\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"scores = model.evaluate(X, Y)\n",
|
||
"print()\n",
|
||
"for i in range(len(scores)):\n",
|
||
" print('{}:\\t{:.4f}'.format(model.metrics_names[i], scores[i]))"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"celltoolbar": "Slideshow",
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.8.3"
|
||
},
|
||
"livereveal": {
|
||
"start_slideshow_at": "selected",
|
||
"theme": "amu"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|