246 KiB

Raw Blame History

Uczenie maszynowe

3. Regresja liniowa – część 2

3.1. Regresja liniowa wielu zmiennych

Do przewidywania wartości $y$ możemy użyć więcej niż jednej cechy $x$:

Przykład – ceny mieszkań

import numpy as np
import pandas as pd

data = pd.read_csv("data02_train.tsv", sep="\t")
data.rename(columns={col: f"x{i}:{col}" if i > 0 else f"y:{col}" for i, col in enumerate(data.columns)}, inplace=True)
data.index = np.arange(1, len(data) + 1)
print(data)

       y:price  x1:isNew  x2:rooms  x3:floor   x4:location  x5:sqrMetres
1     476118.0     False         3         1       Centrum            78
2     459531.0     False         3         2        Sołacz            62
3     411557.0     False         3         0        Sołacz            15
4     496416.0     False         4         0        Sołacz            14
5     406032.0     False         3         0        Sołacz            15
...        ...       ...       ...       ...           ...           ...
1335  349000.0     False         4         0  Szczepankowo            29
1336  399000.0     False         5         0  Szczepankowo            68
1337  234000.0      True         2         7         Wilda            50
1338  210000.0      True         2         1         Wilda            65
1339  279000.0      True         2         2        Łazarz            36

[1339 rows x 6 columns]

$$ x^{(2)} = ({\rm "False"}, 3, 2, {\rm "Sołacz"}, 62), \quad x_3^{(2)} = 2 $$

Hipoteza

W naszym przypadku (wybraliśmy 5 cech):

$$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \theta_5 x_5 $$

W ogólności ($n$ cech):

$$ h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n $$

Jeżeli zdefiniujemy $x_0 = 1$, będziemy mogli powyższy wzór zapisać w bardziej kompaktowy sposób:

$$ \begin{array}{rcl} h_\theta(x) & = & \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n \\ & = & \displaystyle\sum_{i=0}^{n} \theta_i x_i \\ & = & \theta^T , x \\ & = & x^T , \theta \\ \end{array} $$

Metoda gradientu prostego – notacja macierzowa

Metoda gradientu prostego przyjmie bardzo elegancką formę, jeżeli do jej zapisu użyjemy wektorów i macierzy.

$$ X=\left[\begin{array}{cc} 1 & \left( \vec x^{(1)} \right)^T \\ 1 & \left( \vec x^{(2)} \right)^T \\ \vdots & \vdots\\ 1 & \left( \vec x^{(m)} \right)^T \\ \end{array}\right] = \left[\begin{array}{cccc} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \\ \end{array}\right] \quad \vec{y} = \left[\begin{array}{c} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)}\\ \end{array}\right] \quad \theta = \left[\begin{array}{c} \theta_0\\ \theta_1\\ \vdots\\ \theta_n\\ \end{array}\right] $$

# Wersje macierzowe funkcji rysowania wykresów punktowych oraz krzywej regresyjnej

import matplotlib.pyplot as plt
import numpy as np


def h(theta, x):
    return x * theta


def regdots(x, y, xlabel="populacja", ylabel="zysk"):
    fig = plt.figure(figsize=(16 * 0.6, 9 * 0.6))
    ax = fig.add_subplot(111)
    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)
    ax.scatter([x[:, 1]], [y], c="r", s=50, label="Dane")

    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.margins(0.05, 0.05)
    plt.ylim(y.min() - 1, y.max() + 1)
    plt.xlim(np.min(x[:, 1]) - 1, np.max(x[:, 1]) + 1)
    return fig


def regline(fig, fun, theta, x):
    ax = fig.axes[0]
    x_min = np.min(x[:, 1])
    x_max = np.max(x[:, 1])
    x_range = [x_min, x_max]
    x_matrix = np.matrix([1, x_min, 1, x_max]).reshape(2, 2)
    ax.plot(
        x_range,
        fun(theta, x_matrix),
        linewidth="2",
        label=(
            r"$y={theta0:.2}{op}{theta1:.2}x$".format(
                theta0=float(theta[0][0]),
                theta1=(
                    float(theta[1][0]) if theta[1][0] >= 0 else float(-theta[1][0])
                ),
                op="+" if theta[1][0] >= 0 else "-",
            )
        ),
    )

# Wczytwanie danych z pliku – regresja liniowa wielu zmiennych – notacja macierzowa

import pandas as pd

data = pd.read_csv(
    "data02_train.tsv", delimiter="\t", usecols=["price", "rooms", "floor", "sqrMetres"]
)
m, n_plus_1 = data.values.shape
n = n_plus_1 - 1
xn = data.values[:, 1:].reshape(m, n)

# Dodaj kolumnę jedynek do macierzy
x = np.matrix(np.concatenate((np.ones((m, 1)), xn), axis=1)).reshape(m, n_plus_1)
y = np.matrix(data.values[:, 0]).reshape(m, 1)

print(f"{x[:5]=}")
print(f"{x.shape=}")
print(f"{y[:5]=}")
print(f"{y.shape=}")

x[:5]=matrix([[ 1.,  3.,  1., 78.],
        [ 1.,  3.,  2., 62.],
        [ 1.,  3.,  0., 15.],
        [ 1.,  4.,  0., 14.],
        [ 1.,  3.,  0., 15.]])
x.shape=(1339, 4)
y[:5]=matrix([[476118.],
        [459531.],
        [411557.],
        [496416.],
        [406032.]])
y.shape=(1339, 1)

Funkcja kosztu – notacja macierzowa

$$J(\theta)=\dfrac{1}{2|\vec y|}\left(X\theta-\vec{y}\right)^T\left(X\theta-\vec{y}\right)$$

from IPython.display import display, Math, Latex


def J(theta, X, y):
    """Wersja macierzowa funkcji kosztu"""
    m = len(y)
    cost = 1.0 / (2.0 * m) * ((X * theta - y).T * (X * theta - y))
    return cost.item()


theta = np.matrix([10, 90, -1, 2.5]).reshape(4, 1)

cost = J(theta, x, y)
display(Math(r"\Large J(\theta) = %.4f" % cost))

$\displaystyle \Large J(\theta) = 85104141370.9717$

Gradient – notacja macierzowa

$$\nabla J(\theta) = \frac{1}{|\vec y|} X^T\left(X\theta-\vec y\right)$$

# Wyświetlanie macierzy w LaTeX-u


def latex_matrix(matrix):
    ltx = r"\left[\begin{array}"
    m, n = matrix.shape
    ltx += "{" + ("r" * n) + "}"
    for i in range(m):
        ltx += r" & ".join([("%.4f" % j.item()) for j in matrix[i]]) + r" \\\\ "
    ltx += r"\end{array}\right]"
    return ltx

from IPython.display import display, Math, Latex


def dJ(theta, X, y):
    """Wersja macierzowa gradientu funckji kosztu"""
    return 1.0 / len(y) * (X.T * (X * theta - y))


theta = np.matrix([10, 90, -1, 2.5]).reshape(4, 1)

display(
    Math(
        r"\large \theta = "
        + latex_matrix(theta)
        + r"\quad"
        + r"\large \nabla J(\theta) = "
        + latex_matrix(dJ(theta, x, y))
    )
)

$\displaystyle \large \theta = \left[\begin{array}{r}10.0000 \\\\ 90.0000 \\\\ -1.0000 \\\\ 2.5000 \\\\ \end{array}\right]\quad\large \nabla J(\theta) = \left[\begin{array}{r}-373492.7442 \\\\ -1075656.5086 \\\\ -989554.4921 \\\\ -23806475.6561 \\\\ \end{array}\right]$

Algorytm gradientu prostego – notacja macierzowa

$$ \theta := \theta - \alpha , \nabla J(\theta) $$

# Implementacja algorytmu gradientu prostego za pomocą numpy i macierzy


def gradient_descent(fJ, fdJ, theta, X, y, alpha, eps):
    current_cost = fJ(theta, X, y)
    history = [[current_cost, theta]]
    while True:
        theta = theta - alpha * fdJ(theta, X, y)  # implementacja wzoru
        current_cost, prev_cost = fJ(theta, X, y), current_cost
        if abs(prev_cost - current_cost) <= eps:
            break
        if current_cost > prev_cost:
            print("Długość kroku (alpha) jest zbyt duża!")
            break
        history.append([current_cost, theta])
    return theta, history


theta_start = np.zeros((n + 1, 1))

# Zmieniamy wartości alpha (rozmiar kroku) oraz eps (kryterium stopu)
theta_best, history = gradient_descent(J, dJ, theta_start, x, y, alpha=0.0001, eps=0.1)

display(
    Math(
        r"\large\textrm{Wynik:}\quad \theta = "
        + latex_matrix(theta_best)
        + (r" \quad J(\theta) = %.4f" % history[-1][0])
        + r" \quad \textrm{po %d iteracjach}" % len(history)
    )
)

$\displaystyle \large\textrm{Wynik:}\quad \theta = \left[\begin{array}{r}17446.2135 \\\\ 86476.7960 \\\\ -1374.8950 \\\\ 2165.0689 \\\\ \end{array}\right] \quad J(\theta) = 10324864803.1591 \quad \textrm{po 374575 iteracjach}$

3.2. Metoda gradientu prostego w praktyce

Kryterium stopu

Algorytm gradientu prostego polega na wykonywaniu określonych kroków w pętli. Pytanie brzmi: kiedy należy zatrzymać wykonywanie tej pętli?

W każdej kolejnej iteracji wartość funkcji kosztu maleje o coraz mniejszą wartość. Parametr eps określa, jaka wartość graniczna tej różnicy jest dla nas wystarczająca:

Im mniejsza wartość eps, tym dokładniejszy wynik, ale dłuższy czas działania algorytmu.
Im większa wartość eps, tym krótszy czas działania algorytmu, ale mniej dokładny wynik.

Na wykresie zobaczymy porównanie regresji dla różnych wartości eps

theta_start = np.zeros((2, 1))

fig = regdots(x[1], y)
# theta_e1, history1 = GDMx(
#     JMx, dJMx, thetaStartMx, XMx, yMx, alpha=0.01, eps=0.01
# )  # niebieska linia
# reglineMx(fig, hMx, theta_e1, XMx)
# theta_e2, history2 = GDMx(
#     JMx, dJMx, thetaStartMx, XMx, yMx, alpha=0.01, eps=0.000001
# )  # pomarańczowa linia
# reglineMx(fig, hMx, theta_e2, XMx)
# legend(fig)

[0;31m---------------------------------------------------------------------------[0m
[0;31mValueError[0m                                Traceback (most recent call last)
Cell [0;32mIn [18], line 3[0m
[1;32m      1[0m theta_start [39m=[39m np[39m.[39mzeros(([39m2[39m, [39m1[39m))
[0;32m----> 3[0m fig [39m=[39m regdots(x[[39m1[39m], y)

Cell [0;32mIn [6], line 15[0m, in [0;36mregdots[0;34m(x, y, xlabel, ylabel)[0m
[1;32m     13[0m ax [39m=[39m fig[39m.[39madd_subplot([39m111[39m)
[1;32m     14[0m fig[39m.[39msubplots_adjust(left[39m=[39m[39m0.1[39m, right[39m=[39m[39m0.9[39m, bottom[39m=[39m[39m0.1[39m, top[39m=[39m[39m0.9[39m)
[0;32m---> 15[0m ax[39m.[39;49mscatter([x[:, [39m1[39;49m]], [y], c[39m=[39;49m[39m"[39;49m[39mr[39;49m[39m"[39;49m, s[39m=[39;49m[39m50[39;49m, label[39m=[39;49m[39m"[39;49m[39mDane[39;49m[39m"[39;49m)
[1;32m     17[0m ax[39m.[39mset_xlabel(xlabel)
[1;32m     18[0m ax[39m.[39mset_ylabel(ylabel)

File [0;32m~/.local/lib/python3.10/site-packages/matplotlib/__init__.py:1423[0m, in [0;36m_preprocess_data.<locals>.inner[0;34m(ax, data, *args, **kwargs)[0m
[1;32m   1420[0m [39m@functools[39m[39m.[39mwraps(func)
[1;32m   1421[0m [39mdef[39;00m [39minner[39m(ax, [39m*[39margs, data[39m=[39m[39mNone[39;00m, [39m*[39m[39m*[39mkwargs):
[1;32m   1422[0m     [39mif[39;00m data [39mis[39;00m [39mNone[39;00m:
[0;32m-> 1423[0m         [39mreturn[39;00m func(ax, [39m*[39;49m[39mmap[39;49m(sanitize_sequence, args), [39m*[39;49m[39m*[39;49mkwargs)
[1;32m   1425[0m     bound [39m=[39m new_sig[39m.[39mbind(ax, [39m*[39margs, [39m*[39m[39m*[39mkwargs)
[1;32m   1426[0m     auto_label [39m=[39m (bound[39m.[39marguments[39m.[39mget(label_namer)
[1;32m   1427[0m                   [39mor[39;00m bound[39m.[39mkwargs[39m.[39mget(label_namer))

File [0;32m~/.local/lib/python3.10/site-packages/matplotlib/axes/_axes.py:4512[0m, in [0;36mAxes.scatter[0;34m(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, edgecolors, plotnonfinite, **kwargs)[0m
[1;32m   4510[0m y [39m=[39m np[39m.[39mma[39m.[39mravel(y)
[1;32m   4511[0m [39mif[39;00m x[39m.[39msize [39m!=[39m y[39m.[39msize:
[0;32m-> 4512[0m     [39mraise[39;00m [39mValueError[39;00m([39m"[39m[39mx and y must be the same size[39m[39m"[39m)
[1;32m   4514[0m [39mif[39;00m s [39mis[39;00m [39mNone[39;00m:
[1;32m   4515[0m     s [39m=[39m ([39m20[39m [39mif[39;00m mpl[39m.[39mrcParams[[39m'[39m[39m_internal.classic_mode[39m[39m'[39m] [39melse[39;00m
[1;32m   4516[0m          mpl[39m.[39mrcParams[[39m'[39m[39mlines.markersize[39m[39m'[39m] [39m*[39m[39m*[39m [39m2.0[39m)

[0;31mValueError[0m: x and y must be the same size

display(
    Math(
        r"\theta_{10^{-2}} = "
        + latex_matrix(theta_e1)
        + r"\quad\theta_{10^{-6}} = "
        + latex_matrix(theta_e2)
    )
)

$\displaystyle \theta_{10^{-2}} = \left[\begin{array}{r}0.0531 \\\\ 0.8365 \\\\ \end{array}\right]\quad\theta_{10^{-6}} = \left[\begin{array}{r}-3.4895 \\\\ 1.1786 \\\\ \end{array}\right]$

Długość kroku ($\alpha$)

# Jak zmienia się koszt w kolejnych krokach w zależności od alfa


def costchangeplot(history):
    fig = plt.figure(figsize=(16 * 0.6, 9 * 0.6))
    ax = fig.add_subplot(111)
    fig.subplots_adjust(left=0.1, right=0.9, bottom=0.1, top=0.9)
    ax.set_xlabel("krok")
    ax.set_ylabel(r"$J(\theta)$")

    X = np.arange(0, 500, 1)
    Y = [history[step][0] for step in X]
    ax.plot(X, Y, linewidth="2", label=(r"$J(\theta)$"))
    return fig


def slide7(alpha):
    best_theta, history = gradient_descent(
        h, J, [0.0, 0.0], x, y, alpha=alpha, eps=0.0001
    )
    fig = costchangeplot(history)
    legend(fig)


sliderAlpha1 = widgets.FloatSlider(
    min=0.01, max=0.03, step=0.001, value=0.02, description=r"$\alpha$", width=300
)

widgets.interact_manual(slide7, alpha=sliderAlpha1)

interactive(children=(FloatSlider(value=0.02, description='$\\\\alpha$', max=0.03, min=0.01, step=0.001), Button…

<function __main__.slide7(alpha)>

3.3. Normalizacja danych

Normalizacja danych to proces, który polega na dostosowaniu danych wejściowych w taki sposób, żeby ułatwić działanie algorytmowi gradientu prostego.

Wyjaśnię to na przykladzie.

Użyjemy danych z „Gratka flats challenge 2017”.

Rozważmy model $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, w którym cena mieszkania prognozowana jest na podstawie liczby pokoi $x_1$ i metrażu $x_2$:

# Wczytanie danych przy pomocy biblioteki pandas
import pandas

alldata = pandas.read_csv(
    "data_flats.tsv", header=0, sep="\t", usecols=["price", "rooms", "sqrMetres"]
)
alldata[:10]

	price	rooms	sqrMetres
0	476118.00	3	78
1	459531.00	3	62
2	411557.00	3	15
3	496416.00	4	14
4	406032.00	3	15
5	450026.00	3	80
6	571229.15	2	39
7	325000.00	3	54
8	268229.00	2	90
9	604836.00	4	40

# Funkcja, która pokazuje wartości minimalne i maksymalne w macierzy X


def show_mins_and_maxs(XMx):
    mins = np.amin(XMx, axis=0).tolist()[0]  # wartości minimalne
    maxs = np.amax(XMx, axis=0).tolist()[0]  # wartości maksymalne
    for i, (xmin, xmax) in enumerate(zip(mins, maxs)):
        display(Math(r"${:.2F} \leq x_{} \leq {:.2F}$".format(xmin, i, xmax)))

# Przygotowanie danych

import numpy as np

%matplotlib inline

data2 = np.matrix(alldata[['rooms', 'sqrMetres', 'price']])

m, n_plus_1 = data2.shape
n = n_plus_1 - 1
Xn = data2[:, 0:n]

XMx2 = np.matrix(np.concatenate((np.ones((m, 1)), Xn), axis=1)).reshape(m, n_plus_1)
yMx2 = np.matrix(data2[:, -1]).reshape(m, 1) / 1000.0

Cechy w danych treningowych przyjmują wartości z zakresu:

show_mins_and_maxs(XMx2)

$\displaystyle 1.00 \leq x_0 \leq 1.00$

$\displaystyle 2.00 \leq x_1 \leq 7.00$

$\displaystyle 12.00 \leq x_2 \leq 196.00$

Jak widzimy, $x_2$ przyjmuje wartości dużo większe niż $x_1$. Powoduje to, że wykres funkcji kosztu jest bardzo „spłaszczony” wzdłuż jednej z osi:

def contour_plot(X, y, rescale=10**8):
    theta0_vals = np.linspace(-100000, 100000, 100)
    theta1_vals = np.linspace(-100000, 100000, 100)

    J_vals = np.zeros(shape=(theta0_vals.size, theta1_vals.size))
    for t1, element in enumerate(theta0_vals):
        for t2, element2 in enumerate(theta1_vals):
            thetaT = np.matrix([1.0, element, element2]).reshape(3, 1)
            J_vals[t1, t2] = JMx(thetaT, X, y) / rescale

    plt.figure()
    plt.contour(theta0_vals, theta1_vals, J_vals.T, np.logspace(-2, 3, 20))
    plt.xlabel(r"$\theta_1$")
    plt.ylabel(r"$\theta_2$")

contour_plot(XMx2, yMx2, rescale=10**10)

Jeżeli funkcja kosztu ma kształt taki, jak na powyższym wykresie, to łatwo sobie wyobrazić, że znalezienie minimum lokalnego przy użyciu metody gradientu prostego musi stanowć nie lada wyzwanie: algorytm szybko znajdzie „rynnę”, ale „zjazd” wzdłuż „rynny” w poszukiwaniu minimum będzie odbywał się bardzo powoli.

Jak temu zaradzić?

Spróbujemy przekształcić dane tak, żeby funkcja kosztu miała „ładny”, regularny kształt.

Skalowanie

Będziemy dążyć do tego, żeby każda z cech przyjmowała wartości w podobnym zakresie.

W tym celu przeskalujemy wartości każdej z cech, dzieląc je przez wartość maksymalną:

$$ \hat{x_i}^{(j)} := \frac{x_i^{(j)}}{\max_j x_i^{(j)}} $$

XMx2_scaled = XMx2 / np.amax(XMx2, axis=0)

show_mins_and_maxs(XMx2_scaled)

$\displaystyle 1.00 \leq x_0 \leq 1.00$

$\displaystyle 0.29 \leq x_1 \leq 1.00$

$\displaystyle 0.06 \leq x_2 \leq 1.00$

contour_plot(XMx2_scaled, yMx2)

Normalizacja średniej

Będziemy dążyć do tego, żeby dodatkowo średnia wartość każdej z cech była w okolicach $0$.

W tym celu oprócz przeskalowania odejmiemy wartość średniej od wartości każdej z cech:

$$ \hat{x_i}^{(j)} := \frac{x_i^{(j)} - \mu_i}{\max_j x_i^{(j)}} $$

XMx2_norm = (XMx2 - np.mean(XMx2, axis=0)) / np.amax(XMx2, axis=0)

show_mins_and_maxs(XMx2_norm)

$\displaystyle 0.00 \leq x_0 \leq 0.00$

$\displaystyle -0.10 \leq x_1 \leq 0.62$

$\displaystyle -0.23 \leq x_2 \leq 0.70$

contour_plot(XMx2_norm, yMx2)

Teraz funkcja kosztu ma wykres o bardzo regularnym kształcie – algorytm gradientu prostego zastosowany w takim przypadku bardzo szybko znajdzie minimum funkcji kosztu.

246 KiB Raw Blame History Unescape Escape