linear regression

Marcin Szczepański 2023-10-18 14:37:13 +02:00
parent 0358f4ee6f
commit f192e3fb2c
2 changed files with 662 additions and 2 deletions


@@ -215,10 +215,340 @@
"## Example code for linear regression"
]
},
{
"cell_type": "markdown",
"id": "1c1dd0af-74e4-4f7a-93bf-64a9d2be14c3",
"metadata": {},
"source": [
"We start by importing a few libraries that make it easier to work with data and to run machine learning algorithms:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "74d886ec-b819-48fa-aac2-e69c6160287b",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
" \n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error"
]
},
{
"cell_type": "markdown",
"id": "d9e1ee83-3b9d-439f-9aee-5ac095ded83a",
"metadata": {},
"source": [
"Let's load the data and take a look at our dataset:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f1111528-20ce-4ed5-9c5a-511f2c1041d2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cwiczenia</th>\n",
" <th>czas_min</th>\n",
" <th>wejscia</th>\n",
" <th>nieodwiedzone</th>\n",
" <th>czas_do_testu_godziny</th>\n",
" <th>test</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4</td>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>20</td>\n",
" <td>20.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>8</td>\n",
" <td>25</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>36</td>\n",
" <td>60.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>14</td>\n",
" <td>48</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>33</td>\n",
" <td>92.86</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>11</td>\n",
" <td>37</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>42</td>\n",
" <td>65.34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>29</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>22</td>\n",
" <td>42.35</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>13.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>16</td>\n",
" <td>36</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>47</td>\n",
" <td>85.13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>18</td>\n",
" <td>55</td>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>39</td>\n",
" <td>98.33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>20</td>\n",
" <td>48</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>45</td>\n",
" <td>100.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>42</td>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>37</td>\n",
" <td>70.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cwiczenia czas_min wejscia nieodwiedzone czas_do_testu_godziny test\n",
"0 4 12 1 1 20 20.00\n",
"1 8 25 3 3 36 60.00\n",
"2 14 48 3 1 33 92.86\n",
"3 11 37 5 0 42 65.34\n",
"4 5 29 2 0 22 42.35\n",
"5 2 5 1 4 5 13.14\n",
"6 16 36 7 1 47 85.13\n",
"7 18 55 5 0 39 98.33\n",
"8 20 48 4 0 45 100.00\n",
"9 10 42 7 1 37 70.00"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv('data.csv', sep=';')\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "2158a205-2c9e-41b9-9fe0-d56dd7b97213",
"metadata": {},
"source": [
"Now we fit a basic linear regression model and compute its error, that is, how much the values computed by the regression function differ from the actual ones. We use the RMSE measure (we will discuss it in a later class; for now just remember: the lower the error, the better):"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "f7eade7c-def7-4944-8eea-7cba985e5bf5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Coefficient a: 4.8869552414605435\n",
"Intercept: 11.935883392226131\n",
"Error: Root Mean Squared Error (RMSE): 8.27\n"
]
}
],
"source": [
"model = LinearRegression()\n",
" \n",
"# One feature (number of exercises) and the target (test score)\n",
"X = data[['cwiczenia']]\n",
"y = data['test']\n",
" \n",
"model.fit(X, y)\n",
"y_pred = model.predict(X)\n",
" \n",
"print('Coefficient a:', model.coef_[0])\n",
"print('Intercept:', model.intercept_)\n",
"print('Error: Root Mean Squared Error (RMSE): %.2f' % np.sqrt(mean_squared_error(y, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"id": "04bf6923-da81-40ca-96bd-33f1ad973ddc",
"metadata": {},
"source": [
"What if we also take into account the time spent in the course:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "2b232e81-9d0e-4817-ac0a-5f7b1d293666",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Coefficient for \"cwiczenia\": 2.793448759670846\n",
"Coefficient for \"czas_min\": 0.9017691542564438\n",
"Intercept: 4.156132897112705\n",
"Error: Root Mean Squared Error (RMSE): 5.19\n"
]
}
],
"source": [
"# Two features: number of exercises and time spent in the course (minutes)\n",
"X = data[['cwiczenia', 'czas_min']]\n",
"y = data['test']\n",
" \n",
"model.fit(X, y)\n",
"y_pred = model.predict(X)\n",
" \n",
"print('Coefficient for \"cwiczenia\":', model.coef_[0])\n",
"print('Coefficient for \"czas_min\":', model.coef_[1])\n",
"print('Intercept:', model.intercept_)\n",
"print('Error: Root Mean Squared Error (RMSE): %.2f' % np.sqrt(mean_squared_error(y, y_pred)))\n"
]
},
{
"cell_type": "markdown",
"id": "e49fa078-6216-4d83-bbbd-db0b58aa7c8e",
"metadata": {},
"source": [
"We get a better result: the error is lower :)"
]
},
{
"cell_type": "markdown",
"id": "d501bcc0-d3bf-4552-bf2b-c78753b4d5d6",
"metadata": {},
"source": [
"And what do the values computed by our regression function look like?"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "e539d645-86c6-42c0-9059-7a3c310cc2a3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 26.15115779, 49.04795183, 86.54933494, 68.24952796,\n",
" 44.27468217, 14.25187619, 81.31500261, 104.03551406,\n",
" 103.31002749, 69.96492497])"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_with_y_pred = data.copy()\n",
"data_with_y_pred['y_pred'] = y_pred\n",
"data_with_y_pred\n",
"y_pred"
]
},
{
"cell_type": "markdown",
"id": "7ea9c2cb-b654-4190-9e4f-972e4d3bf92f",
"metadata": {},
"source": [
"## Task 1.\n",
"\n",
"For the whole dataset, compute the sum of squared differences between the **test** and **y_pred** columns. Round the result to two decimal places (remember that 5 and above rounds up). Hint:"
]
},
{
"cell_type": "markdown",
"id": "90d8f306-e823-4552-8f41-6aac1ee37611",
"metadata": {},
"source": [
"- `data_with_y_pred['nazwa_kolumny'].size` - the number of elements (rows) in the given column (`nazwa_kolumny` stands for the column name)\n",
"\n",
"- `data_with_y_pred['nazwa_kolumny'][0]` - the first element (row) of the column (indexing starts at zero)"
]
},
{
"cell_type": "markdown",
"id": "8d9c213b-5758-409b-a73d-1012a3709032",
"metadata": {},
"source": [
"The actual task: write a function named `fun1()` that returns the expected result. Paste the code of this function on Moodle in the activity named **Regresja liniowa - zadanie 1**. On Moodle the code will be checked not only on the dataset above but also on another dataset, so a trivial function that returns a value computed, say, on paper will not work; unfortunately, you have to write a function that actually computes the required value ;)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28e2ef8c-dbbf-4c3d-924e-f59318e4f383",
"id": "7d1a8edd-0a16-4504-8383-1e9240504306",
"metadata": {},
"outputs": [],
"source": []

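The single-feature fit in the notebook above can be cross-checked by hand. The following is a minimal sketch, assuming the textbook closed-form ordinary least squares solution for one feature; it is not a description of how scikit-learn computes the fit internally. The helper names `fit_line` and `rmse` are hypothetical and do not appear in the notebook; the numbers are copied from the `cwiczenia` and `test` columns of the table.

```python
import math

# Values copied from the `cwiczenia` and `test` columns of the dataset shown above.
cwiczenia = [4, 8, 14, 11, 5, 2, 16, 18, 20, 10]
test = [20.00, 60.00, 92.86, 65.34, 42.35, 13.14, 85.13, 98.33, 100.00, 70.00]


def fit_line(x, y):
    """Closed-form ordinary least squares for one feature: y is approximated by a * x + b.

    Hypothetical helper, written only for illustration.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations and sum of squared deviations of x
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    a = s_xy / s_xx
    b = mean_y - a * mean_x
    return a, b


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


a, b = fit_line(cwiczenia, test)
predictions = [a * xi + b for xi in cwiczenia]
error = rmse(test, predictions)

print(f"Coefficient a: {a:.4f}")
print(f"Intercept: {b:.4f}")
print(f"RMSE: {error:.2f}")
```

If the sketch is right, the printed values should agree, up to rounding, with the coefficient, intercept and RMSE that the notebook reports for the model fitted on `cwiczenia` alone.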