{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kkolejna część zajęć będzie wprowadzeniem do drugiej, szeroko używanej biblioteki w Pythonie: `sklearn`. Zajęcia będą miały charaktere case-study poprzeplatane zadaniami do wykonania. Zacznijmy od załadowania odpowiednich bibliotek."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# ! pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zacznijmy od załadowania danych. Na dzisiejszych zajęciach będziemy korzystać z danych z portalu [gapminder.org](https://www.gapminder.org/data/)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('gapminder.csv', index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dane zawierają różne informacje z większość państw świata (z roku 2008). Poniżej znajduje się opis kolumn:\n",
" * female_BMI - średnie BMI u kobiet\n",
" * male_BMI - średnie BMI u mężczyzn\n",
" * gdp - PKB na obywatela\n",
" * population - wielkość populacji\n",
" * under5mortality - wskaźnik śmiertelności dzieni pon. 5 roku życia (na 1000 urodzonych dzieci)\n",
" * life_expectancy - średnia długość życia\n",
" * fertility - wskaźnik dzietności"
]
},
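{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the exercises it is worth taking a quick look at the data itself (a minimal check; it only assumes that `df` was loaded in the cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column names, dtypes and non-null counts of the loaded data frame\n",
"df.info()"
]
},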
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 1**\n",
"Na podstawie danych zawartych w `df` odpowiedz na następujące pytania:\n",
" * Jaki był współczynniki dzietności w Polsce w 2018?\n",
" * W którym kraju ludzie żyją najdłużej?\n",
" * Z ilu krajów zostały zebrane dane?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"175"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)\n",
"\n",
"df.loc['Poland', 'fertility']\n",
"df[df['life_expectancy'].max() == df['life_expectancy']]\n",
"len(df.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 2** Stwórz kolumnę `gdp_log`, która powstanie z kolumny `gdp` poprzez zastowanie funkcji `log` (logarytm). \n",
"\n",
"Hint 1: Wykorzystaj funkcję `apply` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
"\n",
"Hint 2: Wykorzystaj fukcję `log` z pakietu `np`."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"df['gdp_log'] = df['gdp'].apply(np.log)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naszym zadaniem będzie oszacowanie długości życia (kolumna `life_expectancy`) na podstawie pozostałych zmiennych. Na samym początku, zastosujemy regresje jednowymiarową na `fertility`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175,)\n",
"X shape: (175,)\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['fertility'].values\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Będziemy korzystać z gotowej implementacji regreji liniowej z pakietu sklearn. Żeby móc wykorzystać, musimy napierw zmienić shape na dwuwymiarowy."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175, 1)\n",
"X shape: (175, 1)\n"
]
}
],
"source": [
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jeszcze przed właściwą analizą, narysujmy wykres i zobaczny czy istnieje \"wizualny\" związek pomiędzy kolumnami."
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='gdp_log', ylabel='gdp'>"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.plot.scatter('gdp_log', 'gdp')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 3** Zaimportuj `LinearRegression` z pakietu `sklearn.linear_model`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tworzymy obiekt modelu regresji liniowej."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"model = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trening modelu ogranicza się do wywołania metodu `fit`, która przyjmuje dwa argumenty:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container 
{/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-r
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Współczynniki modelu:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [83.2025629]\n",
"Współczynniki cech: [[-4.41400624]]\n"
]
}
],
"source": [
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 4** Wytrenuj nowy model `model2`, który będzie jako X przyjmie kolumnę `gdp_log`. Wyświetl parametry nowego modelu."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [20.24520889]\n",
"Współczynniki cech: [[5.47719379]]\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['gdp_log'].values\n",
"\n",
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"model.fit(X, y)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mając wytrenowany model możemy wykorzystać go do predykcji. Wystarczy wywołać metodę `predict`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input: 7.1785454837637\t predicted: 59.56349361537759\t expected: 52.8\n",
"input: 9.064620717626777\t predicted: 69.89389316839483\t expected: 76.8\n",
"input: 9.418492105471156\t predicted: 71.83211533534711\t expected: 75.5\n",
"input: 8.86827250899781\t predicted: 68.81845597997369\t expected: 56.7\n",
"input: 10.155646068918863\t predicted: 75.86965044411781\t expected: 75.5\n"
]
}
],
"source": [
"X_test = X[:5,:]\n",
"y_test = y[:5,:]\n",
"output = model.predict(X_test)\n",
"\n",
"for i in range(5):\n",
" print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
]
},
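{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick visual check of the fit (a minimal sketch; it assumes `X`, `y` and `model2` from the cells above are still defined), we can overlay the regression line on the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the observations and the line predicted by the trained model\n",
"order = X[:, 0].argsort()  # sort by x so the line is drawn left to right\n",
"plt.scatter(X, y, alpha=0.5, label='data')\n",
"plt.plot(X[order], model2.predict(X)[order], color='red', label='fitted line')\n",
"plt.xlabel('gdp_log')\n",
"plt.ylabel('life_expectancy')\n",
"plt.legend()\n",
"plt.show()"
]
},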
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sprawdzenie jakości modelu - metryki: $MSE$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Istnieją 3 metryki, które określają jak dobry jest nasz model:\n",
" * $MSE$: [błąd średnio-kwadratowy](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy) \n",
" * $RMSE = \\sqrt{MSE}$"
]
},
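{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, for $n$ observations with true values $y_i$ and predictions $\\hat{y}_i$:\n",
"\n",
"$$MSE = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2, \\qquad RMSE = \\sqrt{MSE}$$"
]
},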
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Mean Squared Error: 5.542126033117308\n"
]
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
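{
"cell_type": "markdown",
"metadata": {},
"source": [
"The RMSE above is computed on the same data the model was trained on, which tends to be optimistic. A more honest estimate comes from holding out part of the data with `train_test_split` and evaluating on examples the model has never seen, as in the cell below (the 30% test size and the fixed `random_state` are arbitrary choices)."
]
},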
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2: 0.768485708231896\n",
"Root Mean Squared Error: 4.108807300711791\n"
]
}
],
"source": [
"# Import necessary modules\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"# Create the regressor: reg_all\n",
"reg_all = LinearRegression() \n",
"\n",
"# Fit the regressor to the training data\n",
"reg_all.fit(X_train, y_train)\n",
"\n",
"# Predict on the test data: y_pred\n",
"y_pred = reg_all.predict(X_test)\n",
"\n",
"# Compute and print R^2 and RMSE\n",
"print(\"R^2: {}\".format(reg_all.score(X_test, y_test)))\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regresja wielu zmiennych"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model regresji liniowej wielu zmiennych nie różni się istotnie od modelu jednej zmiennej. Np. chcąc zbudować model oparty o dwie kolumny: `fertility` i `gdp` wystarczy zmienić X (cechy wejściowe):"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(175, 2)\n",
"Wyraz wolny (bias): [78.39388437]\n",
"Współczynniki cech: [[-3.68816683e+00 1.38298454e-04]]\n",
"Root Mean Squared Error: 4.347105512793037\n"
]
}
],
"source": [
"X = df[['fertility', 'gdp']]\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"print(X.shape)\n",
"\n",
"model_mv = LinearRegression()\n",
"model_mv.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
"print(\"Współczynniki cech:\", model_mv.coef_)\n",
"\n",
"y_pred = model_mv.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 7** \n",
" * Zbuduj model regresji liniowej, która oszacuje wartność kolumny `life_expectancy` na podstawie pozostałych kolumn.\n",
2023-11-24 20:16:16 +01:00
" * Wyświetl współczynniki modelu.\n",
" * Oblicz wartości metryki rmse na zbiorze trenującym.\n",
2023-11-19 12:33:16 +01:00
" "
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): 51.77966340645703\n",
"Współczynniki cech: [-1.26558913e+00 1.58457647e+00 -1.19465585e-05 8.99682207e-10\n",
" -1.32027358e-01 3.09413223e-01 1.74214537e+00]\n",
"Root Mean Sqared Error: 3.42188778846474\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression, LinearRegression, LogisticRegressionCV\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop('life_expectancy', axis='columns')\n",
"y = df['life_expectancy']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)\n",
"\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(f\"Root Mean Sqared Error: {rmse}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 6**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Zaimplementuj metrykę $RMSE$ jako fukcję rmse (szablon poniżej). Fukcja rmse przyjmuje dwa parametry typu list i ma zwrócić wartość metryki $RMSE$ ."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Found input variables with inconsistent numbers of samples: [175, 176]",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32mj:\\Python\\2023-programowanie-w-pythonie\\zajecia3\\sklearn cz. 1.ipynb Cell 39\u001b[0m line \u001b[0;36m2\n\u001b[0;32m <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=25'>26</a>\u001b[0m expected\u001b[39m.\u001b[39mappend(\u001b[39m1\u001b[39m)\n\u001b[0;32m <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=27'>28</a>\u001b[0m \u001b[39m# print(rmse(predicted,expected))\u001b[39;00m\n\u001b[1;32m---> <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=28'>29</a>\u001b[0m \u001b[39mprint\u001b[39m(np\u001b[39m.\u001b[39msqrt(mean_squared_error(predicted, expected)))\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\_param_validation.py:211\u001b[0m, in \u001b[0;36mvalidate_params.<locals>.decorator.<locals>.wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m 206\u001b[0m \u001b[39mwith\u001b[39;00m config_context(\n\u001b[0;32m 207\u001b[0m skip_parameter_validation\u001b[39m=\u001b[39m(\n\u001b[0;32m 208\u001b[0m prefer_skip_nested_validation \u001b[39mor\u001b[39;00m global_skip_validation\n\u001b[0;32m 209\u001b[0m )\n\u001b[0;32m 210\u001b[0m ):\n\u001b[1;32m--> 211\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39margs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[0;32m 212\u001b[0m \u001b[39mexcept\u001b[39;00m InvalidParameterError \u001b[39mas\u001b[39;00m e:\n\u001b[0;32m 213\u001b[0m \u001b[39m# When the function is just a wrapper around an estimator, we allow\u001b[39;00m\n\u001b[0;32m 214\u001b[0m \u001b[39m# the function to delegate validation to the estimator, but we replace\u001b[39;00m\n\u001b[0;32m 215\u001b[0m \u001b[39m# the name of the estimator by the name of the function in the error\u001b[39;00m\n\u001b[0;32m 216\u001b[0m \u001b[39m# message to avoid confusion.\u001b[39;00m\n\u001b[0;32m 217\u001b[0m msg \u001b[39m=\u001b[39m re\u001b[39m.\u001b[39msub(\n\u001b[0;32m 218\u001b[0m \u001b[39mr\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m\\\u001b[39m\u001b[39mw+ must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m 219\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m{\u001b[39;00mfunc\u001b[39m.\u001b[39m\u001b[39m__qualname__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m 220\u001b[0m \u001b[39mstr\u001b[39m(e),\n\u001b[0;32m 221\u001b[0m )\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:474\u001b[0m, in \u001b[0;36mmean_squared_error\u001b[1;34m(y_true, y_pred, sample_weight, multioutput, squared)\u001b[0m\n\u001b[0;32m 404\u001b[0m \u001b[39m@validate_params\u001b[39m(\n\u001b[0;32m 405\u001b[0m {\n\u001b[0;32m 406\u001b[0m \u001b[39m\"\u001b[39m\u001b[39my_true\u001b[39m\u001b[39m\"\u001b[39m: [\u001b[39m\"\u001b[39m\u001b[39marray-like\u001b[39m\u001b[39m\"\u001b[39m],\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 415\u001b[0m y_true, y_pred, \u001b[39m*\u001b[39m, sample_weight\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, multioutput\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39muniform_average\u001b[39m\u001b[39m\"\u001b[39m, squared\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m\n\u001b[0;32m 416\u001b[0m ):\n\u001b[0;32m 417\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Mean squared error regression loss.\u001b[39;00m\n\u001b[0;32m 418\u001b[0m \n\u001b[0;32m 419\u001b[0m \u001b[39m Read more in the :ref:`User Guide <mean_squared_error>`.\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 472\u001b[0m \u001b[39m 0.825...\u001b[39;00m\n\u001b[0;32m 473\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[1;32m--> 474\u001b[0m y_type, y_true, y_pred, multioutput \u001b[39m=\u001b[39m _check_reg_targets(\n\u001b[0;32m 475\u001b[0m y_true, y_pred, multioutput\n\u001b[0;32m 476\u001b[0m )\n\u001b[0;32m 477\u001b[0m check_consistent_length(y_true, y_pred, sample_weight)\n\u001b[0;32m 478\u001b[0m output_errors \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39maverage((y_true \u001b[39m-\u001b[39m y_pred) \u001b[39m*\u001b[39m\u001b[39m*\u001b[39m \u001b[39m2\u001b[39m, axis\u001b[39m=\u001b[39m\u001b[39m0\u001b[39m, weights\u001b[39m=\u001b[39msample_weight)\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:99\u001b[0m, in \u001b[0;36m_check_reg_targets\u001b[1;34m(y_true, y_pred, multioutput, dtype)\u001b[0m\n\u001b[0;32m 65\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_check_reg_targets\u001b[39m(y_true, y_pred, multioutput, dtype\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mnumeric\u001b[39m\u001b[39m\"\u001b[39m):\n\u001b[0;32m 66\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Check that y_true and y_pred belong to the same regression task.\u001b[39;00m\n\u001b[0;32m 67\u001b[0m \n\u001b[0;32m 68\u001b[0m \u001b[39m Parameters\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 97\u001b[0m \u001b[39m correct keyword.\u001b[39;00m\n\u001b[0;32m 98\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[1;32m---> 99\u001b[0m check_consistent_length(y_true, y_pred)\n\u001b[0;32m 100\u001b[0m y_true \u001b[39m=\u001b[39m check_array(y_true, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n\u001b[0;32m 101\u001b[0m y_pred \u001b[39m=\u001b[39m check_array(y_pred, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\validation.py:409\u001b[0m, in \u001b[0;36mcheck_consistent_length\u001b[1;34m(*arrays)\u001b[0m\n\u001b[0;32m 407\u001b[0m uniques \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39munique(lengths)\n\u001b[0;32m 408\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(uniques) \u001b[39m>\u001b[39m \u001b[39m1\u001b[39m:\n\u001b[1;32m--> 409\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 410\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[39m%r\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[0;32m 411\u001b[0m \u001b[39m%\u001b[39m [\u001b[39mint\u001b[39m(l) \u001b[39mfor\u001b[39;00m l \u001b[39min\u001b[39;00m lengths]\n\u001b[0;32m 412\u001b[0m )\n",
"\u001b[1;31mValueError\u001b[0m: Found input variables with inconsistent numbers of samples: [175, 176]"
]
}
],
"source": [
"def rmse(expected, predicted):\n",
" \"\"\"\n",
" argumenty:\n",
" expected (type: list): poprawne wartości\n",
" predicted (type: list): oszacowanie z modelu\n",
" \"\"\"\n",
"\n",
" if len(expected) != len(predicted):\n",
" raise ValueError(\"Lists have to be equal length, can't proceed.\")\n",
"\n",
" mse = 0\n",
" for i in range(len(expected)):\n",
" mse += pow((expected[i] - predicted[i]),2)\n",
" return np.sqrt(mse/len(expected))\n",
" \n",
" \n",
"\n",
"y = df['life_expectancy'].values\n",
"X = df[['fertility', 'gdp']].values\n",
"\n",
"test_model = LinearRegression()\n",
"test_model.fit(X, y)\n",
"\n",
"predicted = list(test_model.predict(X))\n",
"expected = list(y)\n",
"# expected.append(1)\n",
"\n",
"print(rmse(predicted,expected))\n",
"print(np.sqrt(mean_squared_error(predicted, expected)))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}