2023-programowanie-w-pythonie/zajecia3/sklearn cz. 1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kkolejna część zajęć będzie wprowadzeniem do drugiej, szeroko używanej biblioteki w Pythonie: `sklearn`. Zajęcia będą miały charaktere case-study poprzeplatane zadaniami do wykonania. Zacznijmy od załadowania odpowiednich bibliotek."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# ! pip install matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zacznijmy od załadowania danych. Na dzisiejszych zajęciach będziemy korzystać z danych z portalu [gapminder.org](https://www.gapminder.org/data/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('gapminder.csv', index_col=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dane zawierają różne informacje z większość państw świata (z roku 2008). Poniżej znajduje się opis kolumn:\n",
    " * female_BMI - średnie BMI u kobiet\n",
    " * male_BMI - średnie BMI u mężczyzn\n",
    " * gdp - PKB na obywatela\n",
    " * population - wielkość populacji\n",
    " * under5mortality - wskaźnik śmiertelności dzieni pon. 5 roku życia (na 1000 urodzonych dzieci)\n",
    " * life_expectancy - średnia długość życia\n",
    " * fertility - wskaźnik dzietności"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 1**\n",
    "Na podstawie danych zawartych w `df` odpowiedz na następujące pytania:\n",
    " * Jaki był współczynniki dzietności w Polsce w 2018?\n",
    " * W którym kraju ludzie żyją najdłużej?\n",
    " * Z ilu krajów zostały zebrane dane?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "175"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)\n",
    "\n",
    "df.loc['Poland', 'fertility']\n",
    "df[df['life_expectancy'].max() == df['life_expectancy']]\n",
    "len(df.index)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 2** Stwórz kolumnę `gdp_log`, która powstanie z kolumny `gdp` poprzez zastowanie funkcji `log` (logarytm). \n",
    "\n",
    "Hint 1: Wykorzystaj funkcję `apply` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
    "\n",
    "Hint 2: Wykorzystaj fukcję `log` z pakietu `np`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['gdp_log'] = df['gdp'].apply(np.log)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Naszym zadaniem będzie oszacowanie długości życia (kolumna `life_expectancy`) na podstawie pozostałych zmiennych. Na samym początku, zastosujemy regresje jednowymiarową na `fertility`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Y shape: (175,)\n",
      "X shape: (175,)\n"
     ]
    }
   ],
   "source": [
    "y = df['life_expectancy'].values\n",
    "X = df['fertility'].values\n",
    "\n",
    "print(\"Y shape:\", y.shape)\n",
    "print(\"X shape:\", X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Będziemy korzystać z gotowej implementacji regreji liniowej z pakietu sklearn. Żeby móc wykorzystać, musimy napierw zmienić shape na dwuwymiarowy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Y shape: (175, 1)\n",
      "X shape: (175, 1)\n"
     ]
    }
   ],
   "source": [
    "y = y.reshape(-1, 1)\n",
    "X = X.reshape(-1, 1)\n",
    "\n",
    "print(\"Y shape:\", y.shape)\n",
    "print(\"X shape:\", X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Jeszcze przed właściwą analizą, narysujmy wykres i zobaczny czy istnieje \"wizualny\" związek pomiędzy kolumnami."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Axes: xlabel='gdp', ylabel='life_expectancy'>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAGwCAYAAABcnuQpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAABS3ElEQVR4nO3deVxVdfoH8M9lXxREQBBDFiE3tHBHzNzSyiktxyaiNDVNs8zMMqbMrMzlp+bEJGVjlKPiZJljTcuokeYKKWqmIYrgForIIqCIcH5/OPfGhbuce+5yzuF+3q8Xr1ecczh8OcNwHr/f5/k+GkEQBBARERGplIvcAyAiIiKyBoMZIiIiUjUGM0RERKRqDGaIiIhI1RjMEBERkaoxmCEiIiJVYzBDREREquYm9wDsrb6+HhcuXEDLli2h0WjkHg4RERGJIAgCrl69irCwMLi4mJ57afbBzIULFxAeHi73MIiIiEiCs2fP4rbbbjN5TbMPZlq2bAng1sPw8/OTeTREREQkRkVFBcLDw3XvcVOafTCjXVry8/NjMENERKQyYlJEmABMREREqsZghoiIiFSNwQwRERGpGoMZIiIiUjUGM0RERKRqDGaIiIhI1RjMEBERkaoxmCEiIiJVYzBDREREqsZghoiIiFSt2bczILKF/OJKFF6pRmSgL6KCfOUejlFqGScRkS0xmCEyoaz6BmZkHMLOvGLdsYGxwUhNioe/j7vNv5/UYMTR4yQiUhKNIAiC3IOwp4qKCvj7+6O8vJyNJsli41ZnYffJy6hr8H8TV40GiTFBWDOpj82+j7XBiKPGSUTkKJa8v5kzQ2REfnElduYV6wUIAFAnCNiZV4zTl6ts9r1mZBzC7pOX9Y7tPnkZz2XkKGqcRERKxGCGyIjCK9UmzxeU2CZIsDYYcdQ4TckvrkRm7iUGTkQkC+bMEBkR0drH5PnIQNsk2IoJRkzlzzhqnIYwV4eIlIAzM0RGRAe3wMDYYLhqNHrHXTUaDIwNtlm1kLXBiKPGaYg1y2NERLbCYIbIhNSkeCTGBOkdS4wJQmpSvM2+hy2CEUeMszHm6hCRUnCZicgEfx93rJnUB6cvV6GgpMou+7fkF1fikd634VrtTWQXlOqOWxKMOGKcjVm7PEZEZCsMZohEiAoyHxxYukeMoXyT3hEBeLJ/JLq085cUCIgZp63ImatDRNQQgxkiK0lNgjWUb3LwTBm8Pc5hzR1hdhuvrWiXx4ztb8NZGSJyFObMEFlJShJsc8k3kSNXh4ioMc7MEJlhavlIG5Q01jAoMTRDITbfROm9luTI1SEiaozBDJERYpaPpCbBmss3ae3jgXGrs1Szf4sjc3Wo+VF60E7Kx2UmIiPELB9JTYI1V4697L8nuH8LNXtl1TcwbnUWhizbgQnp2Ri89EeMW52F8upauYdGKsNghsgAR+S0GMs3eXH47c0in0ZubLGgfNx0kWyFy0ykevaYoha7fGTNXisCDDesP8f9W6zCFgvqIDXfjMgQBjOkWoZeWnHt/PDOQ93Q/bZWVt1b7PKRNXutGPtXafWNm5LvSYaf6668YiT/Yx9SH+vBF6RCcNNFsiUuM5FqGXppHT1fgQf/vtvqdXexLQaktiIwtYz1c2EpekcGNLmniwboHRnAP/AmGHuu9QCOXqhgToaCcNNFsiUGM6RKxl5aWrvyikWtu5vKqxC7h4qUvVbM/at0fP/IJvesF4DsglLRL2NnzBkx91wB5mQohZwNUqn54TITqZK5l1Y9YHLdXUxehdg9VKTstWLuX6Vdw/yxZlIYxqbtwYHCUtQ3OKd9Ga+Z1Mfg1zpzzoi55wowJ0NJUpPi8VxGjt7vKjddJCkYzJAqiXlpAcbX3Z9ZdxB7TpXoHduZV4xp6w5g/eR+esfF7qFiyV4rYloB5BdXIruwtMnXmnsZG8wZOVlsMgBqLow9V0OYkyE/brpItiLrMlNdXR3mzp2LqKgoeHt7o0OHDnjrrbcgNPgjJAgCXn/9dbRt2xbe3t4YNmwY8vLyZBw1KYH2peWiMX2doXX3/OLKJoGM1p5TJWaXZWy1fGNueUpMgqShsRnMGRFuBWtHzpZZNWY1MPRcDWFOhnJEBflicMc2DGRIMlmDmcWLFyMtLQ1///vfcfz4cSxevBhLlixBamqq7polS5bgvffewwcffID9+/fD19cXI0aMwPXr12UcOSlBalI8BsQEGzxnat19/2nDgYzufL7h87be4MtYabaWlARJcwHQX7/8xfzAVE77r/3M2YMQ186vScDLnAyi5kfWYGbPnj0YNWoURo4cicjISPz5z3/G8OHDkZWVBeDWrMyKFSvw2muvYdSoUejevTvWrFmDCxcuYPPmzXIOXTWacxKo9qW1ZXoi4sL89M6ZXnc3PZ1jLMSw9QZf5u4nJUHSXAB09EJFs/xdMCQqyBfrJvVrEvAyJ4Oo+ZE1Z6Z///5YtWoVTpw4gdtvvx2HDx/Grl27sHz5cgDA6dOnUVRUhGHDhum+xt/fH3379sXevXvx6KOPNrlnTU0NampqdJ9XVFTY/wdRIGdKAu0e3gpfz7hL9Lp736jWJu/XLzqwyTGxG3yJ3cBP7P0sTZCMDm6BuHZ+OHre+O+9M+WKMCeDyDnIGsy88sorqKioQKdOneDq6oq6ujosWLAAycnJAICioiIAQEhIiN7XhYSE6M41tnDhQsyfP9++A1cBU//qb65JoGITcKODWyAhOhB7DSwnJUQHSupyffRCOeb9+1fRwaO5++3PL0FUkK+kl/GC0XEY9f4eo+edMVeEjTCJmjdZl5k+++wzrFu3DuvXr8fBgwfx6aefYunSpfj0008l3zMlJQXl5eW6j7Nnz9pwxOrgiL5CSiF1Ge2Dx3tiYKz+8sPA2GB88HhPg9ebW75Zs6fAoiUoc/d7ZdMvevk4liRI3hEecCs5utFx5ooQUXMl68zMSy+9hFdeeUW3XNStWzcUFhZi4cKFGD9+PEJDQwEAFy9eRNu2bXVfd/HiRdx5550G7+np6QlPT0+7j13JnGGbcGuX0Syd8TBVSh3fvhWyCywrodbez9BSk5Y1M2ncv6P5skcvMiK1kzWYqa6uhouL/r8fXV1dUV9/a4uwqKgohIaGYvv27brgpaKiAvv378e0adMcPVzVcIZtwm21jGbJ8oOxAOEvvW7Dzwb2g9EyFjy+ODzWZDBjzeZuzBVpfpwpD47IUrIGMw888AAWLFiA9u3bo2vXrsjJycHy5csxceJEAIBGo8HMmTPx9ttvIzY2FlFRUZg7dy7CwsIwevRoOYeuaGI2ZFMzubrtGgsQ8osrTX6dseDxisiSbmtm0pgr0nw4Yx4ckViyBjOpqamYO3cunnnmGVy6dAlhYWF4+umn8frrr+uuefnll1FVVYUpU6agrKwMAwYMwHfffQcvLy8ZR658zXmZQe5ltMYBgtTgUewuxs1hJo2sI1cAT6QWGkEws+e3ylVUVMDf3x/l5eXw8/Mz/wXNTHNcZsgvrsSQZTuMns+cPcjhP2t5dW2T4FHMEsC41VlGt97XBkOG/tXNvAnnkpl7CRPSs42eT5/QG4M7tnHgiIjsz5L3N3szNXPNcZlBictoUnNUDM2gaRmaSWPehHNyhjw4ImtwZoZUScpMiJJnM7RBkJuLBjfrBaNjNDSTY2oGh5oP/m9PzsaS9zeDmWbO3i9wuQMEMTMhzWU2Q4nLa+Q4UpcyidSKy0xk9xe4UgIEMctozaUKRO7EZ5IXy+2JjJN1B2CyH1s3RXT0/W2lOe2GbO7/rNbkTTTnhqTNjSW7QRM5C87MqJSp5R17l3GqqUy0OcxmGJoFa8iaxGelzLAREVmDwYzKiHn52PsFrqYAoTlUgRiaBWvImv2DmssSHBE5Ny4zqYyY5R17v8DVFCBoy7hdNRq942ppumhsmUzrn5P6YM2kPpJmUZrTEhwROTcGMyoi9uUj9gUuJk/C0DVqCxBSk+KRGBOkd0wtuyGbmwW7WS+9GFHMDBsRkSlKybfjMpOKWLK8Y6qdgZilKnPXqKldgpqrQOw5C6amGTYiUhal5dsxmFERS14+pl7g2s2
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df.plot.scatter('gdp', 'life_expectancy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 3** Zaimportuj `LinearRegression` z pakietu `sklearn.linear_model`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tworzymy obiekt modelu regresji liniowej."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "model = LinearRegression()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Trening modelu ogranicza się do wywołania metodu `fit`, która przyjmuje dwa argumenty:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-r
      ],
      "text/plain": [
       "LinearRegression()"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Współczynniki modelu:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wyraz wolny (bias): [83.2025629]\n",
      "Współczynniki cech: [[-4.41400624]]\n"
     ]
    }
   ],
   "source": [
    "print(\"Wyraz wolny (bias):\", model.intercept_)\n",
    "print(\"Współczynniki cech:\", model.coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 4** Wytrenuj nowy model `model2`, który będzie jako X przyjmie kolumnę `gdp_log`. Wyświetl parametry nowego modelu."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wyraz wolny (bias): [20.24520889]\n",
      "Współczynniki cech: [[5.47719379]]\n"
     ]
    }
   ],
   "source": [
    "y = df['life_expectancy'].values\n",
    "X = df['gdp_log'].values\n",
    "\n",
    "y = y.reshape(-1, 1)\n",
    "X = X.reshape(-1, 1)\n",
    "\n",
    "model.fit(X, y)\n",
    "\n",
    "print(\"Wyraz wolny (bias):\", model.intercept_)\n",
    "print(\"Współczynniki cech:\", model.coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mając wytrenowany model możemy wykorzystać go do predykcji. Wystarczy wywołać metodę `predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "input: 7.1785454837637\t predicted: 59.56349361537759\t expected: 52.8\n",
      "input: 9.064620717626777\t predicted: 69.89389316839483\t expected: 76.8\n",
      "input: 9.418492105471156\t predicted: 71.83211533534711\t expected: 75.5\n",
      "input: 8.86827250899781\t predicted: 68.81845597997369\t expected: 56.7\n",
      "input: 10.155646068918863\t predicted: 75.86965044411781\t expected: 75.5\n"
     ]
    }
   ],
   "source": [
    "X_test = X[:5,:]\n",
    "y_test = y[:5,:]\n",
    "output = model.predict(X_test)\n",
    "\n",
    "for i in range(5):\n",
    "    print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sprawdzenie jakości modelu - metryki: $MSE$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Istnieją 3 metryki, które określają jak dobry jest nasz model:\n",
    " * $MSE$: [błąd średnio-kwadratowy](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy) \n",
    " * $RMSE = \\sqrt{MSE}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Root Mean Squared Error: 5.542126033117308\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
    "print(\"Root Mean Squared Error: {}\".format(rmse))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R^2: 0.768485708231896\n",
      "Root Mean Squared Error: 4.108807300711791\n"
     ]
    }
   ],
   "source": [
    "# Import necessary modules\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Create training and test sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
    "\n",
    "# Create the regressor: reg_all\n",
    "reg_all = LinearRegression()  \n",
    "\n",
    "# Fit the regressor to the training data\n",
    "reg_all.fit(X_train, y_train)\n",
    "\n",
    "# Predict on the test data: y_pred\n",
    "y_pred = reg_all.predict(X_test)\n",
    "\n",
    "# Compute and print R^2 and RMSE\n",
    "print(\"R^2: {}\".format(reg_all.score(X_test, y_test)))\n",
    "rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
    "print(\"Root Mean Squared Error: {}\".format(rmse))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regresja wielu zmiennych"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Model regresji liniowej wielu zmiennych nie różni się istotnie od modelu jednej zmiennej. Np. chcąc zbudować model oparty o dwie kolumny: `fertility` i `gdp` wystarczy zmienić X (cechy wejściowe):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(175, 2)\n",
      "Wyraz wolny (bias): [78.39388437]\n",
      "Współczynniki cech: [[-3.68816683e+00  1.38298454e-04]]\n",
      "Root Mean Squared Error: 4.347105512793037\n"
     ]
    }
   ],
   "source": [
    "X = df[['fertility', 'gdp']]\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
    "\n",
    "print(X.shape)\n",
    "\n",
    "model_mv = LinearRegression()\n",
    "model_mv.fit(X_train, y_train)\n",
    "\n",
    "print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
    "print(\"Współczynniki cech:\", model_mv.coef_)\n",
    "\n",
    "y_pred = model_mv.predict(X_test)\n",
    "\n",
    "rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
    "print(\"Root Mean Squared Error: {}\".format(rmse))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 7** \n",
    " * Zbuduj model regresji liniowej, która oszacuje wartność kolumny `life_expectancy` na podstawie pozostałych kolumn.\n",
    " * Wyświetl współczynniki modelu.\n",
    " * Oblicz wartości metryki rmse na zbiorze trenującym.\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wyraz wolny (bias): 51.77966340645703\n",
      "Współczynniki cech: [-1.26558913e+00  1.58457647e+00 -1.19465585e-05  8.99682207e-10\n",
      " -1.32027358e-01  3.09413223e-01  1.74214537e+00]\n",
      "Root Mean Sqared Error: 3.42188778846474\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression, LinearRegression, LogisticRegressionCV\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = df.drop('life_expectancy', axis='columns')\n",
    "y = df['life_expectancy']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)\n",
    "\n",
    "model = LinearRegression()\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "print(\"Wyraz wolny (bias):\", model.intercept_)\n",
    "print(\"Współczynniki cech:\", model.coef_)\n",
    "\n",
    "y_pred = model.predict(X_test)\n",
    "\n",
    "rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
    "print(f\"Root Mean Sqared Error: {rmse}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**zad. 6**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " Zaimplementuj metrykę  $RMSE$  jako fukcję rmse (szablon poniżej). Fukcja rmse przyjmuje dwa parametry typu list i ma zwrócić wartość metryki  $RMSE$ ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Found input variables with inconsistent numbers of samples: [175, 176]",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32mj:\\Python\\2023-programowanie-w-pythonie\\zajecia3\\sklearn cz. 1.ipynb Cell 39\u001b[0m line \u001b[0;36m2\n\u001b[0;32m     <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=25'>26</a>\u001b[0m expected\u001b[39m.\u001b[39mappend(\u001b[39m1\u001b[39m)\n\u001b[0;32m     <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=27'>28</a>\u001b[0m \u001b[39m# print(rmse(predicted,expected))\u001b[39;00m\n\u001b[1;32m---> <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=28'>29</a>\u001b[0m \u001b[39mprint\u001b[39m(np\u001b[39m.\u001b[39msqrt(mean_squared_error(predicted, expected)))\n",
      "File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\_param_validation.py:211\u001b[0m, in \u001b[0;36mvalidate_params.<locals>.decorator.<locals>.wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m    205\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m    206\u001b[0m     \u001b[39mwith\u001b[39;00m config_context(\n\u001b[0;32m    207\u001b[0m         skip_parameter_validation\u001b[39m=\u001b[39m(\n\u001b[0;32m    208\u001b[0m             prefer_skip_nested_validation \u001b[39mor\u001b[39;00m global_skip_validation\n\u001b[0;32m    209\u001b[0m         )\n\u001b[0;32m    210\u001b[0m     ):\n\u001b[1;32m--> 211\u001b[0m         \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39margs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[0;32m    212\u001b[0m \u001b[39mexcept\u001b[39;00m InvalidParameterError \u001b[39mas\u001b[39;00m e:\n\u001b[0;32m    213\u001b[0m     \u001b[39m# When the function is just a wrapper around an estimator, we allow\u001b[39;00m\n\u001b[0;32m    214\u001b[0m     \u001b[39m# the function to delegate validation to the estimator, but we replace\u001b[39;00m\n\u001b[0;32m    215\u001b[0m     \u001b[39m# the name of the estimator by the name of the function in the error\u001b[39;00m\n\u001b[0;32m    216\u001b[0m     \u001b[39m# message to avoid confusion.\u001b[39;00m\n\u001b[0;32m    217\u001b[0m     msg \u001b[39m=\u001b[39m re\u001b[39m.\u001b[39msub(\n\u001b[0;32m    218\u001b[0m         \u001b[39mr\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m\\\u001b[39m\u001b[39mw+ must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m    219\u001b[0m         \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m{\u001b[39;00mfunc\u001b[39m.\u001b[39m\u001b[39m__qualname__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m    220\u001b[0m         \u001b[39mstr\u001b[39m(e),\n\u001b[0;32m    221\u001b[0m     )\n",
      "File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:474\u001b[0m, in \u001b[0;36mmean_squared_error\u001b[1;34m(y_true, y_pred, sample_weight, multioutput, squared)\u001b[0m\n\u001b[0;32m    404\u001b[0m \u001b[39m@validate_params\u001b[39m(\n\u001b[0;32m    405\u001b[0m     {\n\u001b[0;32m    406\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39my_true\u001b[39m\u001b[39m\"\u001b[39m: [\u001b[39m\"\u001b[39m\u001b[39marray-like\u001b[39m\u001b[39m\"\u001b[39m],\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m    415\u001b[0m     y_true, y_pred, \u001b[39m*\u001b[39m, sample_weight\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, multioutput\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39muniform_average\u001b[39m\u001b[39m\"\u001b[39m, squared\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m\n\u001b[0;32m    416\u001b[0m ):\n\u001b[0;32m    417\u001b[0m \u001b[39m    \u001b[39m\u001b[39m\"\"\"Mean squared error regression loss.\u001b[39;00m\n\u001b[0;32m    418\u001b[0m \n\u001b[0;32m    419\u001b[0m \u001b[39m    Read more in the :ref:`User Guide <mean_squared_error>`.\u001b[39;00m\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m    472\u001b[0m \u001b[39m    0.825...\u001b[39;00m\n\u001b[0;32m    473\u001b[0m \u001b[39m    \"\"\"\u001b[39;00m\n\u001b[1;32m--> 474\u001b[0m     y_type, y_true, y_pred, multioutput \u001b[39m=\u001b[39m _check_reg_targets(\n\u001b[0;32m    475\u001b[0m         y_true, y_pred, multioutput\n\u001b[0;32m    476\u001b[0m     )\n\u001b[0;32m    477\u001b[0m     check_consistent_length(y_true, y_pred, sample_weight)\n\u001b[0;32m    478\u001b[0m     output_errors \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39maverage((y_true \u001b[39m-\u001b[39m y_pred) \u001b[39m*\u001b[39m\u001b[39m*\u001b[39m \u001b[39m2\u001b[39m, axis\u001b[39m=\u001b[39m\u001b[39m0\u001b[39m, weights\u001b[39m=\u001b[39msample_weight)\n",
      "File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:99\u001b[0m, in \u001b[0;36m_check_reg_targets\u001b[1;34m(y_true, y_pred, multioutput, dtype)\u001b[0m\n\u001b[0;32m     65\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_check_reg_targets\u001b[39m(y_true, y_pred, multioutput, dtype\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mnumeric\u001b[39m\u001b[39m\"\u001b[39m):\n\u001b[0;32m     66\u001b[0m \u001b[39m    \u001b[39m\u001b[39m\"\"\"Check that y_true and y_pred belong to the same regression task.\u001b[39;00m\n\u001b[0;32m     67\u001b[0m \n\u001b[0;32m     68\u001b[0m \u001b[39m    Parameters\u001b[39;00m\n\u001b[1;32m   (...)\u001b[0m\n\u001b[0;32m     97\u001b[0m \u001b[39m        correct keyword.\u001b[39;00m\n\u001b[0;32m     98\u001b[0m \u001b[39m    \"\"\"\u001b[39;00m\n\u001b[1;32m---> 99\u001b[0m     check_consistent_length(y_true, y_pred)\n\u001b[0;32m    100\u001b[0m     y_true \u001b[39m=\u001b[39m check_array(y_true, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n\u001b[0;32m    101\u001b[0m     y_pred \u001b[39m=\u001b[39m check_array(y_pred, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n",
      "File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\validation.py:409\u001b[0m, in \u001b[0;36mcheck_consistent_length\u001b[1;34m(*arrays)\u001b[0m\n\u001b[0;32m    407\u001b[0m uniques \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39munique(lengths)\n\u001b[0;32m    408\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(uniques) \u001b[39m>\u001b[39m \u001b[39m1\u001b[39m:\n\u001b[1;32m--> 409\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m    410\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[39m%r\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[0;32m    411\u001b[0m         \u001b[39m%\u001b[39m [\u001b[39mint\u001b[39m(l) \u001b[39mfor\u001b[39;00m l \u001b[39min\u001b[39;00m lengths]\n\u001b[0;32m    412\u001b[0m     )\n",
      "\u001b[1;31mValueError\u001b[0m: Found input variables with inconsistent numbers of samples: [175, 176]"
     ]
    }
   ],
   "source": [
    "def rmse(expected, predicted):\n",
    "    \"\"\"\n",
    "    argumenty:\n",
    "    expected (type: list): poprawne wartości\n",
    "    predicted (type: list): oszacowanie z modelu\n",
    "    \"\"\"\n",
    "\n",
    "    if len(expected) != len(predicted):\n",
    "        raise ValueError(\"Lists have to be equal length, can't proceed.\")\n",
    "\n",
    "    mse = 0\n",
    "    for i in range(len(expected)):\n",
    "        mse += pow((expected[i] - predicted[i]),2)\n",
    "    return np.sqrt(mse/len(expected))\n",
    "    \n",
    "    \n",
    "\n",
    "y = df['life_expectancy'].values\n",
    "X = df[['fertility', 'gdp']].values\n",
    "\n",
    "test_model = LinearRegression()\n",
    "test_model.fit(X, y)\n",
    "\n",
    "predicted = list(test_model.predict(X))\n",
    "expected = list(y)\n",
    "# expected.append(1)\n",
    "\n",
    "print(rmse(predicted,expected))\n",
    "print(np.sqrt(mean_squared_error(predicted, expected)))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}