{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kkolejna część zajęć będzie wprowadzeniem do drugiej, szeroko używanej biblioteki w Pythonie: `sklearn`. Zajęcia będą miały charaktere case-study poprzeplatane zadaniami do wykonania. Zacznijmy od załadowania odpowiednich bibliotek."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# ! pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zacznijmy od załadowania danych. Na dzisiejszych zajęciach będziemy korzystać z danych z portalu [gapminder.org](https://www.gapminder.org/data/)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('gapminder.csv', index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dane zawierają różne informacje z większość państw świata (z roku 2008). Poniżej znajduje się opis kolumn:\n",
" * female_BMI - średnie BMI u kobiet\n",
" * male_BMI - średnie BMI u mężczyzn\n",
" * gdp - PKB na obywatela\n",
" * population - wielkość populacji\n",
" * under5mortality - wskaźnik śmiertelności dzieni pon. 5 roku życia (na 1000 urodzonych dzieci)\n",
" * life_expectancy - średnia długość życia\n",
" * fertility - wskaźnik dzietności"
]
},
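{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the exercises it is worth taking a quick look at the data itself (a minimal check; it only assumes that `df` was loaded in the cell above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Column names, dtypes and non-null counts of the loaded data frame\n",
"df.info()"
]
},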
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 1**\n",
"Na podstawie danych zawartych w `df` odpowiedz na następujące pytania:\n",
" * Jaki był współczynniki dzietności w Polsce w 2018?\n",
" * W którym kraju ludzie żyją najdłużej?\n",
" * Z ilu krajów zostały zebrane dane?"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"175"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(10)\n",
"\n",
"df.loc['Poland', 'fertility']\n",
"df[df['life_expectancy'].max() == df['life_expectancy']]\n",
"len(df.index)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 2** Stwórz kolumnę `gdp_log`, która powstanie z kolumny `gdp` poprzez zastowanie funkcji `log` (logarytm). \n",
"\n",
"Hint 1: Wykorzystaj funkcję `apply` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
"\n",
"Hint 2: Wykorzystaj fukcję `log` z pakietu `np`."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"df['gdp_log'] = df['gdp'].apply(np.log)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naszym zadaniem będzie oszacowanie długości życia (kolumna `life_expectancy`) na podstawie pozostałych zmiennych. Na samym początku, zastosujemy regresje jednowymiarową na `fertility`."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175,)\n",
"X shape: (175,)\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['fertility'].values\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Będziemy korzystać z gotowej implementacji regreji liniowej z pakietu sklearn. Żeby móc wykorzystać, musimy napierw zmienić shape na dwuwymiarowy."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175, 1)\n",
"X shape: (175, 1)\n"
]
}
],
"source": [
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jeszcze przed właściwą analizą, narysujmy wykres i zobaczny czy istnieje \"wizualny\" związek pomiędzy kolumnami."
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='gdp_log', ylabel='gdp'>"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.plot.scatter('gdp_log', 'gdp')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 3** Zaimportuj `LinearRegression` z pakietu `sklearn.linear_model`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tworzymy obiekt modelu regresji liniowej."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"model = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trening modelu ogranicza się do wywołania metodu `fit`, która przyjmuje dwa argumenty:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container 
{/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-r
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Współczynniki modelu:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [83.2025629]\n",
"Współczynniki cech: [[-4.41400624]]\n"
]
}
],
"source": [
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 4** Wytrenuj nowy model `model2`, który będzie jako X przyjmie kolumnę `gdp_log`. Wyświetl parametry nowego modelu."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [20.24520889]\n",
"Współczynniki cech: [[5.47719379]]\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['gdp_log'].values\n",
"\n",
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"model.fit(X, y)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mając wytrenowany model możemy wykorzystać go do predykcji. Wystarczy wywołać metodę `predict`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input: 7.1785454837637\t predicted: 59.56349361537759\t expected: 52.8\n",
"input: 9.064620717626777\t predicted: 69.89389316839483\t expected: 76.8\n",
"input: 9.418492105471156\t predicted: 71.83211533534711\t expected: 75.5\n",
"input: 8.86827250899781\t predicted: 68.81845597997369\t expected: 56.7\n",
"input: 10.155646068918863\t predicted: 75.86965044411781\t expected: 75.5\n"
]
}
],
"source": [
"X_test = X[:5,:]\n",
"y_test = y[:5,:]\n",
"output = model.predict(X_test)\n",
"\n",
"for i in range(5):\n",
" print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
]
},
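{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick visual check of the fit (a minimal sketch; it assumes `X`, `y` and `model2` from the cells above are still defined), we can overlay the regression line on the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the observations and the line predicted by the trained model\n",
"order = X[:, 0].argsort()  # sort by x so the line is drawn left to right\n",
"plt.scatter(X, y, alpha=0.5, label='data')\n",
"plt.plot(X[order], model2.predict(X)[order], color='red', label='fitted line')\n",
"plt.xlabel('gdp_log')\n",
"plt.ylabel('life_expectancy')\n",
"plt.legend()\n",
"plt.show()"
]
},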
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sprawdzenie jakości modelu - metryki: $MSE$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Istnieją 3 metryki, które określają jak dobry jest nasz model:\n",
" * $MSE$: [błąd średnio-kwadratowy](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy) \n",
" * $RMSE = \\sqrt{MSE}$"
]
},
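{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, for $n$ observations with true values $y_i$ and predictions $\\hat{y}_i$:\n",
"\n",
"$$MSE = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2, \\qquad RMSE = \\sqrt{MSE}$$"
]
},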
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Mean Squared Error: 5.542126033117308\n"
]
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
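{
"cell_type": "markdown",
"metadata": {},
"source": [
"The RMSE above is computed on the same data the model was trained on, which tends to be optimistic. A more honest estimate comes from holding out part of the data with `train_test_split` and evaluating on examples the model has never seen, as in the cell below (the 30% test size and the fixed `random_state` are arbitrary choices)."
]
},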
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R^2: 0.768485708231896\n",
"Root Mean Squared Error: 4.108807300711791\n"
]
}
],
"source": [
"# Import necessary modules\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"# Create the regressor: reg_all\n",
"reg_all = LinearRegression() \n",
"\n",
"# Fit the regressor to the training data\n",
"reg_all.fit(X_train, y_train)\n",
"\n",
"# Predict on the test data: y_pred\n",
"y_pred = reg_all.predict(X_test)\n",
"\n",
"# Compute and print R^2 and RMSE\n",
"print(\"R^2: {}\".format(reg_all.score(X_test, y_test)))\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regresja wielu zmiennych"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model regresji liniowej wielu zmiennych nie różni się istotnie od modelu jednej zmiennej. Np. chcąc zbudować model oparty o dwie kolumny: `fertility` i `gdp` wystarczy zmienić X (cechy wejściowe):"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(175, 2)\n",
"Wyraz wolny (bias): [78.39388437]\n",
"Współczynniki cech: [[-3.68816683e+00 1.38298454e-04]]\n",
"Root Mean Squared Error: 4.347105512793037\n"
]
}
],
"source": [
"X = df[['fertility', 'gdp']]\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"print(X.shape)\n",
"\n",
"model_mv = LinearRegression()\n",
"model_mv.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
"print(\"Współczynniki cech:\", model_mv.coef_)\n",
"\n",
"y_pred = model_mv.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 7** \n",
" * Zbuduj model regresji liniowej, która oszacuje wartność kolumny `life_expectancy` na podstawie pozostałych kolumn.\n",
2023-11-24 20:16:16 +01:00
" * Wyświetl współczynniki modelu.\n",
" * Oblicz wartości metryki rmse na zbiorze trenującym.\n",
2023-11-19 12:33:16 +01:00
" "
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): 51.77966340645703\n",
"Współczynniki cech: [-1.26558913e+00 1.58457647e+00 -1.19465585e-05 8.99682207e-10\n",
" -1.32027358e-01 3.09413223e-01 1.74214537e+00]\n",
"Root Mean Sqared Error: 3.42188778846474\n"
]
}
],
"source": [
"from sklearn.linear_model import LogisticRegression, LinearRegression, LogisticRegressionCV\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df.drop('life_expectancy', axis='columns')\n",
"y = df['life_expectancy']\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)\n",
"\n",
"model = LinearRegression()\n",
"model.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)\n",
"\n",
"y_pred = model.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(f\"Root Mean Sqared Error: {rmse}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 6**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Zaimplementuj metrykę $RMSE$ jako fukcję rmse (szablon poniżej). Fukcja rmse przyjmuje dwa parametry typu list i ma zwrócić wartość metryki $RMSE$ ."
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "Found input variables with inconsistent numbers of samples: [175, 176]",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32mj:\\Python\\2023-programowanie-w-pythonie\\zajecia3\\sklearn cz. 1.ipynb Cell 39\u001b[0m line \u001b[0;36m2\n\u001b[0;32m <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=25'>26</a>\u001b[0m expected\u001b[39m.\u001b[39mappend(\u001b[39m1\u001b[39m)\n\u001b[0;32m <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=27'>28</a>\u001b[0m \u001b[39m# print(rmse(predicted,expected))\u001b[39;00m\n\u001b[1;32m---> <a href='vscode-notebook-cell:/j%3A/Python/2023-programowanie-w-pythonie/zajecia3/sklearn%20cz.%201.ipynb#X53sZmlsZQ%3D%3D?line=28'>29</a>\u001b[0m \u001b[39mprint\u001b[39m(np\u001b[39m.\u001b[39msqrt(mean_squared_error(predicted, expected)))\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\_param_validation.py:211\u001b[0m, in \u001b[0;36mvalidate_params.<locals>.decorator.<locals>.wrapper\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m 205\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m 206\u001b[0m \u001b[39mwith\u001b[39;00m config_context(\n\u001b[0;32m 207\u001b[0m skip_parameter_validation\u001b[39m=\u001b[39m(\n\u001b[0;32m 208\u001b[0m prefer_skip_nested_validation \u001b[39mor\u001b[39;00m global_skip_validation\n\u001b[0;32m 209\u001b[0m )\n\u001b[0;32m 210\u001b[0m ):\n\u001b[1;32m--> 211\u001b[0m \u001b[39mreturn\u001b[39;00m func(\u001b[39m*\u001b[39margs, \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mkwargs)\n\u001b[0;32m 212\u001b[0m \u001b[39mexcept\u001b[39;00m InvalidParameterError \u001b[39mas\u001b[39;00m e:\n\u001b[0;32m 213\u001b[0m \u001b[39m# When the function is just a wrapper around an estimator, we allow\u001b[39;00m\n\u001b[0;32m 214\u001b[0m \u001b[39m# the function to delegate validation to the estimator, but we replace\u001b[39;00m\n\u001b[0;32m 215\u001b[0m \u001b[39m# the name of the estimator by the name of the function in the error\u001b[39;00m\n\u001b[0;32m 216\u001b[0m \u001b[39m# message to avoid confusion.\u001b[39;00m\n\u001b[0;32m 217\u001b[0m msg \u001b[39m=\u001b[39m re\u001b[39m.\u001b[39msub(\n\u001b[0;32m 218\u001b[0m \u001b[39mr\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m\\\u001b[39m\u001b[39mw+ must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m 219\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mparameter of \u001b[39m\u001b[39m{\u001b[39;00mfunc\u001b[39m.\u001b[39m\u001b[39m__qualname__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m must be\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m 220\u001b[0m \u001b[39mstr\u001b[39m(e),\n\u001b[0;32m 221\u001b[0m )\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:474\u001b[0m, in \u001b[0;36mmean_squared_error\u001b[1;34m(y_true, y_pred, sample_weight, multioutput, squared)\u001b[0m\n\u001b[0;32m 404\u001b[0m \u001b[39m@validate_params\u001b[39m(\n\u001b[0;32m 405\u001b[0m {\n\u001b[0;32m 406\u001b[0m \u001b[39m\"\u001b[39m\u001b[39my_true\u001b[39m\u001b[39m\"\u001b[39m: [\u001b[39m\"\u001b[39m\u001b[39marray-like\u001b[39m\u001b[39m\"\u001b[39m],\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 415\u001b[0m y_true, y_pred, \u001b[39m*\u001b[39m, sample_weight\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, multioutput\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39muniform_average\u001b[39m\u001b[39m\"\u001b[39m, squared\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m\n\u001b[0;32m 416\u001b[0m ):\n\u001b[0;32m 417\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Mean squared error regression loss.\u001b[39;00m\n\u001b[0;32m 418\u001b[0m \n\u001b[0;32m 419\u001b[0m \u001b[39m Read more in the :ref:`User Guide <mean_squared_error>`.\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 472\u001b[0m \u001b[39m 0.825...\u001b[39;00m\n\u001b[0;32m 473\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[1;32m--> 474\u001b[0m y_type, y_true, y_pred, multioutput \u001b[39m=\u001b[39m _check_reg_targets(\n\u001b[0;32m 475\u001b[0m y_true, y_pred, multioutput\n\u001b[0;32m 476\u001b[0m )\n\u001b[0;32m 477\u001b[0m check_consistent_length(y_true, y_pred, sample_weight)\n\u001b[0;32m 478\u001b[0m output_errors \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39maverage((y_true \u001b[39m-\u001b[39m y_pred) \u001b[39m*\u001b[39m\u001b[39m*\u001b[39m \u001b[39m2\u001b[39m, axis\u001b[39m=\u001b[39m\u001b[39m0\u001b[39m, weights\u001b[39m=\u001b[39msample_weight)\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\metrics\\_regression.py:99\u001b[0m, in \u001b[0;36m_check_reg_targets\u001b[1;34m(y_true, y_pred, multioutput, dtype)\u001b[0m\n\u001b[0;32m 65\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_check_reg_targets\u001b[39m(y_true, y_pred, multioutput, dtype\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mnumeric\u001b[39m\u001b[39m\"\u001b[39m):\n\u001b[0;32m 66\u001b[0m \u001b[39m \u001b[39m\u001b[39m\"\"\"Check that y_true and y_pred belong to the same regression task.\u001b[39;00m\n\u001b[0;32m 67\u001b[0m \n\u001b[0;32m 68\u001b[0m \u001b[39m Parameters\u001b[39;00m\n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 97\u001b[0m \u001b[39m correct keyword.\u001b[39;00m\n\u001b[0;32m 98\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[1;32m---> 99\u001b[0m check_consistent_length(y_true, y_pred)\n\u001b[0;32m 100\u001b[0m y_true \u001b[39m=\u001b[39m check_array(y_true, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n\u001b[0;32m 101\u001b[0m y_pred \u001b[39m=\u001b[39m check_array(y_pred, ensure_2d\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m, dtype\u001b[39m=\u001b[39mdtype)\n",
"File \u001b[1;32mc:\\software\\python3\\lib\\site-packages\\sklearn\\utils\\validation.py:409\u001b[0m, in \u001b[0;36mcheck_consistent_length\u001b[1;34m(*arrays)\u001b[0m\n\u001b[0;32m 407\u001b[0m uniques \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39munique(lengths)\n\u001b[0;32m 408\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mlen\u001b[39m(uniques) \u001b[39m>\u001b[39m \u001b[39m1\u001b[39m:\n\u001b[1;32m--> 409\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m 410\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[39m%r\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[0;32m 411\u001b[0m \u001b[39m%\u001b[39m [\u001b[39mint\u001b[39m(l) \u001b[39mfor\u001b[39;00m l \u001b[39min\u001b[39;00m lengths]\n\u001b[0;32m 412\u001b[0m )\n",
"\u001b[1;31mValueError\u001b[0m: Found input variables with inconsistent numbers of samples: [175, 176]"
]
}
],
"source": [
"def rmse(expected, predicted):\n",
" \"\"\"\n",
" argumenty:\n",
" expected (type: list): poprawne wartości\n",
" predicted (type: list): oszacowanie z modelu\n",
" \"\"\"\n",
"\n",
" if len(expected) != len(predicted):\n",
" raise ValueError(\"Lists have to be equal length, can't proceed.\")\n",
"\n",
" mse = 0\n",
" for i in range(len(expected)):\n",
" mse += pow((expected[i] - predicted[i]),2)\n",
" return np.sqrt(mse/len(expected))\n",
" \n",
" \n",
"\n",
"y = df['life_expectancy'].values\n",
"X = df[['fertility', 'gdp']].values\n",
"\n",
"test_model = LinearRegression()\n",
"test_model.fit(X, y)\n",
"\n",
"predicted = list(test_model.predict(X))\n",
"expected = list(y)\n",
"# expected.append(1)\n",
"\n",
"print(rmse(predicted,expected))\n",
"print(np.sqrt(mean_squared_error(predicted, expected)))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}