{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next part of the course introduces a second widely used Python library: `sklearn`. The session takes the form of a case study interleaved with exercises. Let's start by loading the required libraries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# ! pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we load the data. In today's session we will use data from [gapminder.org](https://www.gapminder.org/data/)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('gapminder.csv', index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data contain various statistics for most countries of the world (from 2008). The columns are described below:\n",
" * female_BMI - mean BMI of women\n",
" * male_BMI - mean BMI of men\n",
" * gdp - GDP per capita\n",
" * population - population size\n",
" * under5mortality - mortality rate of children under 5 (per 1000 children born)\n",
" * life_expectancy - average life expectancy\n",
" * fertility - fertility rate"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 1**\n",
"Using the data in `df`, answer the following questions:\n",
" * What was the fertility rate in Poland in 2008?\n",
" * In which country do people live the longest?\n",
" * How many countries does the data cover?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"175"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)\n",
"\n",
"df.loc['Poland', 'fertility']\n",
"df[df['life_expectancy'].max() == df['life_expectancy']]\n",
"len(df.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 2** Create a column `gdp_log` by applying the `log` function (natural logarithm) to the `gdp` column.\n",
"\n",
"Hint 1: Use the `apply` method (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
"\n",
"Hint 2: Use the `log` function from the `np` package."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"df['gdp_log'] = df['gdp'].apply(np.log)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our task is to estimate life expectancy (the `life_expectancy` column) from the remaining variables. We start with a univariate regression on `fertility`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175,)\n",
"X shape: (175,)\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['fertility'].values\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the ready-made linear regression implementation from the sklearn package. sklearn expects the feature matrix `X` to be two-dimensional, with shape `(n_samples, n_features)`, so we first have to reshape the arrays."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175, 1)\n",
"X shape: (175, 1)\n"
]
}
],
"source": [
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the actual analysis, let's draw a plot and see whether there is a \"visual\" relationship between the columns."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='gdp', ylabel='life_expectancy'>"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.plot.scatter('gdp', 'life_expectancy')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 3** Import `LinearRegression` from the `sklearn.linear_model` package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a linear regression model object."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"model = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training the model comes down to calling the `fit` method, which takes two arguments, the features `X` and the target values `y`:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model coefficients:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intercept (bias): [83.2025629]\n",
"Feature coefficients: [[-4.41400624]]\n"
]
}
],
"source": [
"print(\"Intercept (bias):\", model.intercept_)\n",
"print(\"Feature coefficients:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 4** Train a new model, `model2`, that takes the `gdp_log` column as X. Print the parameters of the new model."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intercept (bias): [20.24520889]\n",
"Feature coefficients: [[5.47719379]]\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['gdp_log'].values\n",
"\n",
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"model.fit(X, y)\n",
"\n",
"print(\"Intercept (bias):\", model.intercept_)\n",
"print(\"Feature coefficients:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a trained model in hand, we can use it for prediction. It is enough to call the `predict` method."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input: 7.1785454837637\t predicted: 59.56349361537759\t expected: 52.8\n",
"input: 9.064620717626777\t predicted: 69.89389316839483\t expected: 76.8\n",
"input: 9.418492105471156\t predicted: 71.83211533534711\t expected: 75.5\n",
"input: 8.86827250899781\t predicted: 68.81845597997369\t expected: 56.7\n",
"input: 10.155646068918863\t predicted: 75.86965044411781\t expected: 75.5\n"
]
}
],
"source": [
"X_test = X[:5,:]\n",
"y_test = y[:5,:]\n",
"output = model.predict(X_test)\n",
"\n",
"for i in range(5):\n",
"    print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking model quality - metrics: $MSE$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following metrics describe how good our model is:\n",
" * $MSE$: [mean squared error](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy)\n",
" * $RMSE = \\sqrt{MSE}$"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Mean Squared Error: 5.542126033117308\n"
]
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2: 0.768485708231896\n",
"Root Mean Squared Error: 4.108807300711791\n"
]
}
],
"source": [
"# Import necessary modules\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"# Create the regressor: reg_all\n",
"reg_all = LinearRegression() \n",
"\n",
"# Fit the regressor to the training data\n",
"reg_all.fit(X_train, y_train)\n",
"\n",
"# Predict on the test data: y_pred\n",
"y_pred = reg_all.predict(X_test)\n",
"\n",
"# Compute and print R^2 and RMSE\n",
"print(\"R^2: {}\".format(reg_all.score(X_test, y_test)))\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multivariable regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A multivariable linear regression model does not differ significantly from a single-variable one. For example, to build a model based on two columns, `fertility` and `gdp`, it is enough to change X (the input features):"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(175, 2)\n",
"Intercept (bias): [78.39388437]\n",
"Feature coefficients: [[-3.68816683e+00 1.38298454e-04]]\n",
"Root Mean Squared Error: 4.347105512793037\n"
]
}
],
"source": [
"X = df[['fertility', 'gdp']]\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)\n",
"\n",
"print(X.shape)\n",
"\n",
"model_mv = LinearRegression()\n",
"model_mv.fit(X_train, y_train)\n",
"\n",
"print(\"Intercept (bias):\", model_mv.intercept_)\n",
"print(\"Feature coefficients:\", model_mv.coef_)\n",
"\n",
"y_pred = model_mv.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 7**\n",
" * Build a linear regression model that estimates the `life_expectancy` column from the remaining columns.\n",
" * Print the model coefficients.\n",
" * Compute the RMSE metric on the training set."
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intercept (bias): 51.77966340645703\n",
"Feature coefficients: [-1.26558913e+00 1.58457647e+00 -1.19465585e-05 8.99682207e-10\n",
" -1.32027358e-01 3.09413223e-01 1.74214537e+00]\n",
"Root Mean Squared Error: 3.42188778846474\n"
]
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop('life_expectancy', axis='columns')\n",
"y = df['life_expectancy']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(\"Intercept (bias):\", model.intercept_)\n",
"print(\"Feature coefficients:\", model.coef_)\n",
"\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(f\"Root Mean Squared Error: {rmse}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 6**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Implement the $RMSE$ metric as a function `rmse` (template below). The function takes two parameters of type list and should return the value of the $RMSE$ metric."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Found input variables with inconsistent numbers of samples: [175, 176]",
"output_type": "error",
"traceback": [
"---------------------------------------------------------------------------",
"ValueError Traceback (most recent call last)",
"j:\\Python\\2023-programowanie-w-pythonie\\zajecia3\\sklearn cz. 1.ipynb Cell 39 line 2\n     26 expected.append(1)\n     28 # print(rmse(predicted,expected))\n---> 29 print(np.sqrt(mean_squared_error(predicted, expected)))\n",
"File c:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\_param_validation.py:211, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)\n    205 try:\n    206     with config_context(\n    207         skip_parameter_validation=(\n    208             prefer_skip_nested_validation or global_skip_validation\n    209         )\n    210     ):\n--> 211         return func(*args, **kwargs)\n    212 except InvalidParameterError as e:\n    213     # When the function is just a wrapper around an estimator, we allow\n    214     # the function to delegate validation to the estimator, but we replace\n    215     # the name of the estimator by the name of the function in the error\n    216     # message to avoid confusion.\n    217     msg = re.sub(\n    218         r\"parameter of \\w+ must be\",\n    219         f\"parameter of {func.__qualname__} must be\",\n    220         str(e),\n    221     )",
"File c:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:474, in mean_squared_error(y_true, y_pred, sample_weight, multioutput, squared)\n    404 @validate_params(\n    405     {\n    406         \"y_true\": [\"array-like\"],\n   (...)\n    415     y_true, y_pred, *, sample_weight=None, multioutput=\"uniform_average\", squared=True\n    416 ):\n    417     \"\"\"Mean squared error regression loss.\n    418 \n    419     Read more in the :ref:`User Guide <mean_squared_error>`.\n   (...)\n    472     0.825...\n    473     \"\"\"\n--> 474     y_type, y_true, y_pred, multioutput = _check_reg_targets(\n    475         y_true, y_pred, multioutput\n    476     )\n    477     check_consistent_length(y_true, y_pred, sample_weight)\n    478     output_errors = np.average((y_true - y_pred) ** 2, axis=0, weights=sample_weight)",
"File c:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:99, in _check_reg_targets(y_true, y_pred, multioutput, dtype)\n     65 def _check_reg_targets(y_true, y_pred, multioutput, dtype=\"numeric\"):\n     66     \"\"\"Check that y_true and y_pred belong to the same regression task.\n     67 \n     68     Parameters\n   (...)\n     97         correct keyword.\n     98     \"\"\"\n---> 99     check_consistent_length(y_true, y_pred)\n    100     y_true = check_array(y_true, ensure_2d=False, dtype=dtype)\n    101     y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)",
"File c:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\validation.py:409, in check_consistent_length(*arrays)\n    407 uniques = np.unique(lengths)\n    408 if len(uniques) > 1:\n--> 409     raise ValueError(\n    410         \"Found input variables with inconsistent numbers of samples: %r\"\n    411         % [int(l) for l in lengths]\n    412     )",
"ValueError: Found input variables with inconsistent numbers of samples: [175, 176]"
]
}
],
"source": [
"def rmse(expected, predicted):\n",
"    \"\"\"\n",
"    Return the root mean squared error (RMSE).\n",
"\n",
"    Arguments:\n",
"        expected (type: list): correct values\n",
"        predicted (type: list): estimates from the model\n",
"    \"\"\"\n",
"    if len(expected) != len(predicted):\n",
"        raise ValueError(\"Lists have to be of equal length, can't proceed.\")\n",
"\n",
"    # Accumulate the squared errors, then average and take the square root.\n",
"    squared_error_sum = 0\n",
"    for i in range(len(expected)):\n",
"        squared_error_sum += (expected[i] - predicted[i]) ** 2\n",
"    return np.sqrt(squared_error_sum / len(expected))\n",
"\n",
"\n",
"y = df['life_expectancy'].values\n",
"X = df[['fertility', 'gdp']].values\n",
"\n",
"test_model = LinearRegression()\n",
"test_model.fit(X, y)\n",
"\n",
"predicted = list(test_model.predict(X))\n",
"expected = list(y)\n",
"# expected.append(1)\n",
"\n",
"# Compare the hand-written metric with the sklearn-based computation.\n",
"print(rmse(expected, predicted))\n",
"print(np.sqrt(mean_squared_error(expected, predicted)))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}