2024-programowanie-w-python.../zajecia4/sklearn cz. 1-ODPOWIEDZI.ipynb

2024-12-07 11:54:47 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kkolejna część zajęć będzie wprowadzeniem do szeroko używanej biblioteki w Pythonie: `sklearn`. Zajęcia będą miały charaktere case-study poprzeplatane zadaniami do wykonania. Zacznijmy od załadowania odpowiednich bibliotek."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# ! pip install matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zacznijmy od załadowania danych. Na dzisiejszych zajęciach będziemy korzystać z danych z portalu [gapminder.org](https://www.gapminder.org/data/)."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('gapminder.csv', index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dane zawierają różne informacje z większość państw świata (z roku 2008). Poniżej znajduje się opis kolumn:\n",
" * female_BMI - średnie BMI u kobiet\n",
" * male_BMI - średnie BMI u mężczyzn\n",
" * gdp - PKB na obywatela\n",
" * population - wielkość populacji\n",
" * under5mortality - wskaźnik śmiertelności dzieni pon. 5 roku życia (na 1000 urodzonych dzieci)\n",
" * life_expectancy - średnia długość życia\n",
" * fertility - wskaźnik dzietności"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 1**\n",
"Na podstawie danych zawartych w `df` odpowiedz na następujące pytania:\n",
" * Jaki był współczynniki dzietności w Polsce w 2018?\n",
" * W którym kraju ludzie żyją najdłużej?\n",
" * Z ilu krajów zostały zebrane dane?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.33"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Poland', 'fertility']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Japan</th>\n",
" <td>21.87088</td>\n",
" <td>23.50004</td>\n",
" <td>34800.0</td>\n",
" <td>127317900.0</td>\n",
" <td>3.4</td>\n",
" <td>82.5</td>\n",
" <td>1.34</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI gdp population under5mortality \\\n",
"Country \n",
"Japan 21.87088 23.50004 34800.0 127317900.0 3.4 \n",
"\n",
" life_expectancy fertility \n",
"Country \n",
"Japan 82.5 1.34 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['life_expectancy'].max() == df['life_expectancy']]"
]
},
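{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the label of the row with the highest value can probably also be read off directly with `idxmax`. The snippet below is a minimal sketch on made-up data: the `toy` frame and its labels `A`, `B`, `C` are hypothetical and do not come from the Gapminder file.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame indexed by country-like labels.\n",
"toy = pd.DataFrame({'life_expectancy': [70.1, 82.5, 64.3]}, index=['A', 'B', 'C'])\n",
"\n",
"# idxmax returns the index label of the row holding the maximum value.\n",
"longest = toy['life_expectancy'].idxmax()\n",
"print(longest)\n",
"```\n",
"\n",
"Unlike the boolean mask used in the solution, `idxmax` returns only the first label when several rows share the maximum."
]
},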
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"175"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 2** Stwórz kolumnę `gdp_log`, która powstanie z kolumny `gdp` poprzez zastowanie funkcji `log` (logarytm). \n",
"\n",
"Hint 1: Wykorzystaj funkcję `apply` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
"\n",
"Hint 2: Wykorzystaj fukcję `log` z pakietu `np`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df['gdp_log'] = df['gdp'].apply(np.log)"
]
},
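{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same column can most likely be computed without `apply`, because NumPy functions work element-wise on a whole pandas Series. A minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical stand-ins for the real `gdp` column):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the real `gdp` column.\n",
"toy = pd.DataFrame({'gdp': [1000.0, 5000.0, 20000.0]})\n",
"\n",
"# Element-wise natural logarithm via apply, as in the exercise ...\n",
"via_apply = toy['gdp'].apply(np.log)\n",
"# ... and the vectorized call on the whole Series.\n",
"vectorized = np.log(toy['gdp'])\n",
"\n",
"# Both approaches should agree up to floating-point error.\n",
"print(np.allclose(via_apply, vectorized))\n",
"```\n",
"\n",
"On large columns the vectorized form is usually the faster of the two, since it avoids calling a Python function once per element."
]
},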
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naszym zadaniem będzie oszacowanie długości życia (kolumna `life_expectancy`) na podstawie pozostałych zmiennych. Na samym początku, zastosujemy regresje jednowymiarową na `fertility`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175,)\n",
"X shape: (175,)\n"
]
}
],
"source": [
"y = df['life_expectancy'].values\n",
"X = df['fertility'].values\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Będziemy korzystać z gotowej implementacji regreji liniowej z pakietu sklearn. Żeby móc wykorzystać, musimy napierw zmienić shape na dwuwymiarowy."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175, 1)\n",
"X shape: (175, 1)\n"
]
}
],
"source": [
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
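{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reshape step can be illustrated in isolation. scikit-learn estimators expect the feature matrix in the shape `(n_samples, n_features)`, so a single feature becomes a column with one entry per sample. A minimal sketch on a made-up vector (`x` and `x_2d` are illustrative names only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical one-dimensional feature vector with 4 samples.\n",
"x = np.array([1.5, 2.0, 3.1, 4.7])\n",
"print(x.shape)\n",
"\n",
"# -1 lets NumPy infer the number of rows; 1 fixes a single column.\n",
"x_2d = x.reshape(-1, 1)\n",
"print(x_2d.shape)\n",
"```\n",
"\n",
"The values themselves are unchanged; only the shape goes from `(4,)` to `(4, 1)`."
]
},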
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jeszcze przed właściwą analizą, narysujmy wykres i zobaczny czy istnieje \"wizualny\" związek pomiędzy kolumnami."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='fertility', ylabel='life_expectancy'>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAGwCAYAAABcnuQpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABO2UlEQVR4nO3deViVdfo/8PdhVRAQWUQUWZRyQQ1zx9yXzK/Z5FiR5ZpNRZo6NmaTmZqiXlmOTFk2hjkmTos5Lb+cUQcxzYVcMqcZZRFxQ9FYBBKJc35/OOfIgbM+51nPeb+ui+uS5znnOTcH4bn5fO7789EZDAYDiIiIiDTKS+kAiIiIiFzBZIaIiIg0jckMERERaRqTGSIiItI0JjNERESkaUxmiIiISNOYzBAREZGm+SgdgNT0ej0uXbqEoKAg6HQ6pcMhIiIiBxgMBty4cQPR0dHw8rI99uL2ycylS5cQExOjdBhEREQkwPnz59GuXTubj3H7ZCYoKAjA7TcjODhY4WiIiIjIEZWVlYiJiTHdx21x+2TGOLUUHBzMZIaIiEhjHCkRYQEwERERaRqTGSIiItI0JjNERESkaUxmiIiISNOYzBAREZGmMZkhIiIiTWMyQ0RERJrGZIaIiIg0jckMERERaRqTGSIiItI0t9/OQAmFpVU493MN4sICER8eqNprEhERuQMmMyIqr7mF2VknsC+v1HRsUGIEMlKTERLgq5prqgUTNCIiEoPOYDAYlA5CSpWVlQgJCUFFRYXkG01O3ngEB/Kvob7BW+qt0yGlYzg2z+ijmmsqzZ0TNCIiEocz92/WzIiksLQK+/JKzZIOAKg3GLAvrxRnr1Wr4ppqMDvrBA7kXzM7diD/GmZlHVcoIiIi0jImMyI593ONzfNF151PPJy9ZmFpFbJPX1V1kuOuCRoRESmHNTMiiW0VYPN8XJjzNSGOXlNL0zaOJGisnyEiImdwZEYkCREtMCgxAt46ndlxb50OgxIjBN2gHb2mlqZtpEj6iIjIszGZEVFGajJSOoabHUvpGI6M1GTJrqm1aRspkj4iIvJsnGYSUUiALzbP6IOz16pRdL1alJZje9dU27SNI+3WGanJmJV13GxazNWkj4iIPBeTGQnEh4u/bkrDazZMGNQybeNM3Y4USR8REXkuJjMaYi1hGNAhDIcLf7a4Fo1cSYKtuh1r6+FIkfQREZHnYc2MhlhLGAwGiF6r4wyt1e0QEZF74ciMRhgThsbqDQYcLLyO7PlDAECRaZvDZ3+2eZ7t1kREJCUmMxphr9B3VtYxfDSjn0NJg1h7Ilma9rKE7dZERCQlJjMaYa/Q96dLlTbrUwDxF9ezNO3VWGiAL0dliIhIUqyZ0Qjj+izWvmF6A+zWp1iruXlqc67T2yBYq5NprKymjjUzREQkKSYzGpKRmowOkS1sPsbaHlC2inRzi8owLTMXQ9/Yi8kbj6Cips5uLPamvRyJiYiISAxMZlSq8aaR5TW3MCvrOPKuVtl8nrX6FEeTD0e3QbA37eVITO5KCxt+EhG5E9bMqIy1upa6ej2O2OgasreujKPJR8N2alu1LsZprwP51+xONXkKLW34SUTkTjgyozKW6lr255fiYOF1m0lD34RWNteVsbYnkjVF16vtjjBY2jfK2rUscbcRDC1t+ElE5E44MqMi1taS0dsZ+NABuPWrHsfOlzm9J5I172TnI7eozPS5pREG47YE+85cxeQPcq1eq/E0kzuOYNhaB8iRkS4iIhKOIzMq4kxRbUMGAN+fs1/Ea0w+sucPQea03ugdG2px9+rQAF8cO1dudtzWCMOguyItjvp46YDecaFNbuLuOILhyIafREQkDUWTmfr6eixatAjx8fFo3rw5OnTogGXLlsHQYDrFYDDg1VdfRZs2bdC8eXOMGDECeXl5CkYtHXt1LY5NEN2elnpqs/WRkvjwQAy9OxJ/mdK7yTRRcvuWKKupc3prAktTTnoDkFtUZp
ZcuevWB2rZ8JOIyBMpmsysWrUK69evx5///Gf85z//wapVq7B69WpkZGSYHrN69WqsW7cO7777Lg4fPozAwECMHj0aN2/eVDByaVira/HW6dAnLhRBzRybFTQmERPXf2ezzbrxSE32/CFIG9bR5rWtjTAYr9U7NrTJf6qGoy7uOoJh63s3KDGCU0xERBJSNJn57rvvMH78eIwdOxZxcXH47W9/i1GjRuHIkSMAbo/KrF27Fq+88grGjx+P7t27Y/Pmzbh06RJ27NihZOiSsTTCkdIxHN5eXqiurXfqWkfPlTk0dWMcqYkPD3R6hKFhEW9haRVyz5VB3+g5DUdd3HkEw9r3Tq4NP4mIPJWiBcADBgzAhg0bcObMGdx111344YcfsH//frz55psAgLNnz6KkpAQjRowwPSckJAR9+/bFwYMH8dhjjzW5Zm1tLWpra02fV1ZWSv+FiMg4wnH2WrVp00iDwYBha3KcvpYecLr41FrLdePW7x/Ol+GPn5/CqUt33t+k6GCb1y66Xo2hd0c6dH0tsvS90/LXQ0SkFYomMy+99BIqKyvRqVMneHt7o76+HsuXL8ekSZMAACUlJQCA1q1bmz2vdevWpnONpaenY8mSJdIGLoP48Ds3wuzTV20+VofbRcDWOLtrtaWup56xLfFI73b4fz9exjt783HqYtMk8adLthNH46iLpeu70whGw+8dERFJT9Fk5uOPP8ZHH32ErVu3omvXrjhx4gTmzJmD6OhoTJkyRdA1Fy5ciHnz5pk+r6ysRExMjFghK8Le1ExS22D8aCG5MHJ26qbhCMO/L1bgw++KkFtUZtaqbYlxeslLZ95O3njUhSMYREQkJkVrZl588UW89NJLeOyxx9CtWzc8+eSTmDt3LtLT0wEAUVFRAIArV66YPe/KlSumc435+/sjODjY7EMqci361irQD6EW1l/xwu31Wb6cdZ/FwltXi0/jwwPx8fcXcKy43KnndWk03WRt1KVhrQ4REZFQio7M1NTUwMvL/Bbs7e0Nvf723/jx8fGIiorCnj17cM899wC4PdJy+PBhPPvss3KHayL3om+zs06g8hfL68YYk4S/TOkt+tSNtYXg7MlI7QkAHHUhIiJZKJrMjBs3DsuXL0f79u3RtWtXHD9+HG+++SamT58OANDpdJgzZw5ef/11JCYmIj4+HosWLUJ0dDQeeughxeK2tejb5hl9RH0tWwlFWU0dfq65hZAAX0mmbpxdxM8LwMAGI0FMYoiISA6KJjMZGRlYtGgRnnvuOVy9ehXR0dH43e9+h1dffdX0mD/84Q+orq7G008/jfLycgwcOBA7d+5Es2bNFIlZ7mXrHVmXpeHriVl86szO2MDtRMZdiniJiEg7FE1mgoKCsHbtWqxdu9bqY3Q6HZYuXYqlS5fKF5gNziYXrlJyXRZHd8ZOjGyBNY/0QPd2LSWLhYiIyBruzeQkuZMLpVeWdWRn7LyrVXjjH2dsrjZMREQkFSYzTlIiubC1sqzUHVWNtzz464w+SIoOtrllARERkZx0BoON+QM3UFlZiZCQEFRUVIjWpl1RU9ekc0jKbiajhsW9oQG+snZUGRWWVtlcjTh7/hBZC38LS6tw7ucadk0REbkZZ+7fitbMaJVSi741LO6dvPGIbB1VDcldM2SN3O3xRESkXpxmcoFSi74ZO6oaF+U27KiSiis1Q2JOidlqj1cTuRZWJCLyZByZ0SC1jI44SuxRFDHa46WenuLIERGRfDgyo0FKtms7kkg1JvYoipAYjMprbmHyxiMYtiYH0zJzMfSNvZi88YjonVhaGTkiInIHTGY0SMl2bWcTKSmmxFxJ5uRIMpScBiQi8kRMZjTKVrt2Y2LWbTibSDkyitIwPkdiFZrMyZVkuDJyREREzmPNjEY50lElVd1GRmqyw5ta2htFeedf+cg9V2bxnK1YnYnBSK5aIyWnAYmIPBHXmXFDxuLWd7LzcexcudlIhLdOh5SO4aK0bzvamm
5sI28cR3BzH1T+8qvVrRK8dMC9saH45JkBLscAyLtGjrWvWaz3nojI3Tlz/+Y0k0qIMRXUuLg1t6hM0ikVR1vTLU2
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.plot.scatter('fertility', 'life_expectancy')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 3** Zaimportuj `LinearRegression` z pakietu `sklearn.linear_model`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tworzymy obiekt modelu regresji liniowej."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Trening modelu ogranicza się do wywołania metodu `fit`, która przyjmuje dwa argumenty:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-1 {color: black;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label 
{background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-r
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X,y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Współczynniki modelu:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [83.2025629]\n",
"Współczynniki cech: [[-4.41400624]]\n"
]
}
],
"source": [
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
"print(\"Współczynniki cech:\", model.coef_)"
]
},
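{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two printed values are the parameters of the fitted line: a prediction is the intercept plus the coefficient times the feature value. The sketch below checks this on made-up data that lie exactly on the line $y = 3 - 2x$ (the names `X_toy`, `y_toy` and `toy_model` are hypothetical and unrelated to `df`):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Made-up data lying exactly on the line y = 3 - 2x.\n",
"X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])\n",
"y_toy = 3.0 - 2.0 * X_toy[:, 0]\n",
"\n",
"toy_model = LinearRegression().fit(X_toy, y_toy)\n",
"\n",
"# A prediction is intercept + coefficient * feature value.\n",
"manual = toy_model.intercept_ + toy_model.coef_[0] * X_toy[:, 0]\n",
"print(np.allclose(manual, toy_model.predict(X_toy)))\n",
"```\n",
"\n",
"Here `y_toy` is one-dimensional, so `intercept_` is a scalar and `coef_` has one entry per feature; with a two-dimensional `y`, as in the cells above, both gain an extra dimension."
]
},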
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 4** Wytrenuj nowy model `model2`, który będzie jako X przyjmie kolumnę `gdp_log`. Wyświetl parametry nowego modelu."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Y shape: (175,)\n",
"X shape: (175,)\n",
"Y shape: (175, 1)\n",
"X shape: (175, 1)\n"
]
}
],
"source": [
"y = df['gdp_log'].values\n",
"X = df['fertility'].values\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)\n",
"\n",
"y = y.reshape(-1, 1)\n",
"X = X.reshape(-1, 1)\n",
"\n",
"print(\"Y shape:\", y.shape)\n",
"print(\"X shape:\", X.shape)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Axes: xlabel='gdp_log', ylabel='life_expectancy'>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjoAAAGxCAYAAABr1xxGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8g+/7EAAAACXBIWXMAAA9hAAAPYQGoP6dpAABOa0lEQVR4nO3de1hVZdo/8O8CEQEBFZCDIQc1D6llecbMU5o5ZuV0IOfVlA6aZWY6ZpN5SDN9tRzNdOplKPPQVGONNW81YmRpKozH/OUgchBFRTzAFkgk2L8/fPeODfuw9tpr7XXY3891cV2y1mLvZ2/AdfM8930/gtlsNoOIiIjIgPzUHgARERGRUhjoEBERkWEx0CEiIiLDYqBDREREhsVAh4iIiAyLgQ4REREZFgMdIiIiMiwGOkRERGRYzdQegNLq6+tx9uxZhIaGQhAEtYdDREREIpjNZly9ehVxcXHw85M+L2P4QOfs2bOIj49XexhEREQkwenTp3HTTTdJ/nrDBzqhoaEAbrxRYWFhKo+GiIiIxDCZTIiPj7fex6UyfKBjWa4KCwtjoENERKQznqadMBmZiIiIDIuBDhERERkWAx0iIiIyLAY6REREZFgMdIiIiMiwGOgQERGRYTHQISIiIsNioENERESGxUCHiIiIDIuBDhERERmW4beAICIi/Sgoq8Spy9VIjAhBUmSIbp+DtIOBDhERqa68+jpmbD2M7/PKrMcGd4rC2tReCA8OcOuxHAUycj4H6YdgNpvNag9CSSaTCeHh4aioqOCmnkREGjUxPRt7Tl5EXYNbkr8gIKVjJDam9RX1GK4CGTmeg7xHrvs3c3SIiEhVBWWV+D6vzCYAAYA6sxnf55Wh8GKVqMeZsfUw9py8aHNsz8mLeG7rIdmeg/SHgQ4REanq1OVqp+eLLrkOQlwFMvsLL8vyHFm5FxgU6QxzdIiISFUJbYKdnk+McJ0w7CpYApxnaTh7Dub26BtndIiISFXJUS0xuFMU/AXB5ri/IGBwpyhRlVGugqX+yZGSn8PZkhhpHwMdIiJS3drUXkjpGGlzLKVjJNam9hL19WKCJSnPwdwe/ePSFRERqS48OAAb0/qi8GIVii5VSepx8+LITrhcXYNjJSbrsYaBjJTnEJM/xF482sZAh4iINCMp0nHw4U5/nO5xYXj9gR7oGd/KredoTI78IVIXAx0iItI0V8nA9nJojp+7ipX/OuFxfxzLkpij/juczdE+5ugQEZGmqd0fx9P8IVIXZ3SIiEgzGi9PWQKZxsT2x9l+pAT33drOo5kXOfKHSD0MdIiISHWOlqce7nOTi6903h/nrR15eGtHnix9b9zJ7dESX9/ElIEOERGpztHy1C+1vzr9Okt/nMY5NI1Zlrp8aU8rNjq8gTk6RESkKmd5NjlFV1x+vb0cmsbU7Huj1tYRbHR4A2d0iIjIY54sj7jevsGxoktVaO3G7IQ3+96oOaPiKrep8KLv9P9hoENERJLZ7WHT7v962NzUStRjuOpV40xiRIjdmQtn13uLvXHtzivzyhIaGx3+hktXREQkmb2b+bESE+57ew8mpmejorrW5WO42r7B2Tnz/81QOMvPAQA/AaL3zZKDo+W4egDf55Xh6JlyRZ+fjQ5/w0CHiIgkcXQzt7DMXjj62oZ5K8561Tg7J3bZq94M1NbVWwMvpfNmXI3r5c9+UuR5LeTYKNUouHRFRESSuLqZW2YvGuaDOMtbcdarxtE5d5a9sgsvY+qmAwjw91M8b8bVuI6VmBTPk1mb2gvPbT1k81p9sdGhqjM6dXV1mD9/PpKSkhAUFIQOHTrgtddeg7nBXwdmsxmvvvoqYmNjERQUhBEjRiAvL0/FURMRESA+yCi69NusyTObDzZJkv0+rwzTNh8AcKNXzdDObe0GAPbOOZq5sKfObMbegkvYfdL2+XefdDzzJFVyVEt0jwtzek3D90UJlkaHWbOHIGNyH2TNHoKNaX19qr
QcUDnQWb58OdavX4+3334bx48fx/Lly7FixQqsXbvWes2KFSuwZs0abNiwAfv370dISAhGjRqFa9euqThyIiKyBBl+LmIMSz5IQVklfsy/ZPeaH/MvNVlGEru8JKa8vKF6c9PPv88rw9HT5aIfQ4ylD/Rwet5beTLOgkdfoGqg8+OPP2LcuHEYM2YMEhMT8fvf/x4jR45EdnY2gBuzOatXr8Yrr7yCcePGoWfPnti4cSPOnj2Lzz//XM2hExERbgQZgzpG2T3XOB9kf6H9IMdif8GN8+XV1zExPRvDVu3C5IwcDF35ndPEZrOL7shiyZ03c2t8K7uBoC/myahJ1UBn4MCB2LlzJ06cOAEAOHLkCHbv3o3Ro0cDAAoLC3H+/HmMGDHC+jXh4eHo168f9u7dq8qYici3qdX8TassyyPbp6c0Wappmg/ifOrHEq642+hObHm5qxvesbMm2b+v9gJBX8yTUZOqycgvvfQSTCYTunTpAn9/f9TV1WHp0qWYMGECAOD8+fMAgOjoaJuvi46Otp5rrKamBjU1NdbPTSaTQqMnIl/CdvrO9YxvhS9n3Ol048t+SW2cPkb/5AiXje4+yi5Gv+QI62M7ut6eQZ2iUHb1Go6fv+rwGrn7y3BDUPWpGuh8/PHH2Lx5M7Zs2YJbbrkFhw8fxsyZMxEXF4dJkyZJesxly5Zh0aJFMo+UiHyds1kGX9o/yRVnG18mR7XEgOQI7C1ouoQ14P+Cl6zcC04f/6VtN5aXLEGmq8qvBWO7ITEyxBpgHDl9BePW/ejweqXyZvS6IagRqLp0NWfOHLz00kt49NFH0aNHD/zXf/0XXnjhBSxbtgwAEBMTAwAoLS21+brS0lLrucbmzZuHiooK68fp06eVfRFEZHjO9mJSa/8kLXFnOW/DH+7A4E62SzmDO0Vhwx/uACC+kssSZLq6/n+PnrNJxL01vvWNvJlG1zFvxrhUndGprq6Gn5/tj5u/vz/q6+sBAElJSYiJicHOnTtx2223AbixFLV//35MmzbN7mMGBgYiMDBQ0XETkW9hO337pCznuVrKsVRyudqN3BJknrni/HuTc+pKk341euwv48leYr5O1UBn7NixWLp0Kdq3b49bbrkFhw4dwptvvokpU6YAAARBwMyZM7FkyRJ06tQJSUlJmD9/PuLi4nD//ferOXQi8iFsp2+fJ8t5zpZy7AUijhwSURLeOBDVU94Mc8M8p2qgs3btWsyfPx/PPPMMLly4gLi4ODz99NN49dVXrdf88Y9/RFVVFZ566imUl5dj0KBB+Prrr9GiRQsVR05EvsTRLIO/ICClY6Rmb5JKUnJ37IaByL6CS5i3zXHZd6/4Vi4fz1Egqoe8GeaGeU7VHJ3Q0FCsXr0ap06dwi+//IL8/HwsWbIEzZs3t14jCAIWL16M8+fP49q1a8jMzMTNN9+s4qiJyBc522/JF4lZzvNUUmQIUvu2d77h581tm+T8NKTnvBvmhsmDe10REYmgp+UOb/Dmcp6rnJq1qb0wddOBJtVcAztEWK/RY44Lc8PkwUCHiMgNelju8AZ3l/M8CTRcBZnhwQHY+lR/61KXAFh77Vi6LOsxx4W5YfIQzGYnae0GYDKZEB4ejoqKCoSFOd9gjYiIxKuorm0y09I4iFA7mXZierbDYEwPOS56H78n5Lp/M9AhIhJJi8sfWhiTs+U8NW/UBWWVGLZql8PzWbOHaOb76IiYYNKo5Lp/c+mKiMgFtWcltD4mR8t5SlZmiWGEHBfmhnlO1aorIiI9cHeTSW/Q4pga80ZlljPu5rhoecPWpMgQmw7PJB5ndIjIcORczlF7VkIvY7JHzWTa8urrWLj9Z7vnGidMa2l2jOTHQIeIDEOJG5YWlz+0OCZ71Gy0aG/Gy6Jx/yM25TM2Ll0RkWEosZyjxRJfLY7JETUaLTpqtGexaNwt1sCXTfk8o+XlPgvO6BCRISi1nCPHrIS7S2murtfTlhRqJNO6M+Oll9kxrd
HTch8DHSIyBCVvWFJ3u3b3ZuDO9XrbgdubjRbdmfHS0+yYluhpuY+BDhEZgpI3LKmzEu7eDMRc33C2h2XHjnVvF4a
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df.plot.scatter('gdp_log', 'life_expectancy')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-2 {color: black;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label 
{background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-r
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model2 = LinearRegression()\n",
"model2.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz wolny (bias): [10.97412729]\n",
"Współczynniki cech: [[-0.63200209]]\n"
]
}
],
"source": [
"print(\"Wyraz wolny (bias):\", model2.intercept_)\n",
"print(\"Współczynniki cech:\", model2.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Mając wytrenowany model możemy wykorzystać go do predykcji. Wystarczy wywołać metodę `predict`."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input: 6.2\t predicted: 55.83572421482946\t expected: 7.1785454837637\n",
"input: 1.76\t predicted: 75.43391191760766\t expected: 9.064620717626777\n",
"input: 2.73\t predicted: 71.15232586542413\t expected: 9.418492105471156\n",
"input: 6.43\t predicted: 54.82050277977564\t expected: 8.86827250899781\n",
"input: 2.16\t predicted: 73.66830942186188\t expected: 10.155646068918863\n"
]
}
],
"source": [
"X_test = X[:5,:]\n",
"y_test = y[:5,:]\n",
"output = model.predict(X_test)\n",
"\n",
"for i in range(5):\n",
" print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sprawdzenie jakości modelu - metryki: $MSE$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Istnieją 3 metryki, które określają jak dobry jest nasz model:\n",
" * $MSE$: [błąd średnio-kwadratowy](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy) \n",
" * $RMSE = \\sqrt{MSE}$"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Mean Squared Error: 61.20258121223673\n"
]
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Mean Squared Error: 0.8330994741525843\n"
]
}
],
"source": [
"# Import necessary modules\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Create training and test sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"# Create the regressor: reg_all\n",
"reg_all = LinearRegression()\n",
"\n",
"# Fit the regressor to the training data\n",
"reg_all.fit(X_train, y_train)\n",
"\n",
"# Predict on the test data: y_pred\n",
"y_pred = reg_all.predict(X_test)\n",
"\n",
"# Compute and print RMSE\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))\n"
]
},
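{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split used above can be looked at on its own. A minimal sketch, assuming ten made-up samples (`X_toy`, `y_toy` and the `_tr`/`_te` names are hypothetical), that shows how `test_size=0.30` divides the data and that every sample lands in exactly one of the two parts:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# 10 made-up samples with one feature each; the target equals the sample id.\n",
"X_toy = np.arange(10).reshape(-1, 1)\n",
"y_toy = np.arange(10)\n",
"\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)\n",
"\n",
"# With test_size=0.30, 3 of the 10 samples are held out for testing.\n",
"print(X_tr.shape, X_te.shape)\n",
"\n",
"# No sample is lost or duplicated by the split.\n",
"print(sorted(np.concatenate([y_tr, y_te]).tolist()) == list(range(10)))\n",
"```\n",
"\n",
"Fixing `random_state` makes the shuffle reproducible, which is why the same rows end up in the test set on every run."
]
},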
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regresja wielu zmiennych"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Model regresji liniowej wielu zmiennych nie różni się istotnie od modelu jednej zmiennej. Np. chcąc zbudować model oparty o dwie kolumny: `fertility` i `gdp` wystarczy zmienić X (cechy wejściowe):"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(175, 2)\n",
"Wyraz wolny (bias): [9.47431285]\n",
"Współczynniki cech: [[-3.58540438e-01 4.05443491e-05]]\n",
"Root Mean Squared Error: 0.5039206253337853\n"
]
}
],
"source": [
"X = df[['fertility', 'gdp']]\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"print(X.shape)\n",
"\n",
"model_mv = LinearRegression()\n",
"model_mv.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
"print(\"Współczynniki cech:\", model_mv.coef_)\n",
"\n",
"y_pred = model_mv.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 6** \n",
" * Zbuduj model regresji liniowej, która oszacuje wartność kolumny `life_expectancy` na podstawie pozostałych kolumn.\n",
"* Wyświetl współczynniki modelu.\n",
"* Oblicz wartości metryki rmse na zbiorze trenującym.\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['female_BMI', 'male_BMI', 'gdp', 'population', 'under5mortality',\n",
" 'life_expectancy', 'fertility', 'gdp_log'],\n",
" dtype='object')"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(175, 7)\n",
"Wyraz wolny (bias): [-2.48689958e-14]\n",
"Współczynniki cech: [[-4.53155263e-16 4.57243814e-16 5.81045637e-19 3.74348839e-26\n",
" 4.40441174e-16 -1.32227302e-16 1.00000000e+00]]\n",
"Root Mean Squared Error: 1.854651242181147e-14\n"
]
}
],
"source": [
"X = df[['female_BMI', 'male_BMI', 'gdp', 'population', 'under5mortality', 'fertility', 'gdp_log']]\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
"\n",
"print(X.shape)\n",
"\n",
"model_mv = LinearRegression()\n",
"model_mv.fit(X_train, y_train)\n",
"\n",
"print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
"print(\"Współczynniki cech:\", model_mv.coef_)\n",
"\n",
"y_pred = model_mv.predict(X_test)\n",
"\n",
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
"print(\"Root Mean Squared Error: {}\".format(rmse))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**zad. 7**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Zaimplementuj metrykę $RMSE$ jako fukcję rmse (szablon poniżej). Fukcja rmse przyjmuje dwa parametry typu list i ma zwrócić wartość metryki $RMSE$ ."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.234841906276239\n",
"5.234841906276239\n"
]
}
],
"source": [
"def rmse(expected, predicted):\n",
" \"\"\"\n",
" argumenty:\n",
" expected (type: list): poprawne wartości\n",
" predicted (type: list): oszacowanie z modelu\n",
" \"\"\"\n",
" return np.sqrt(sum([(e-p)**2 for e,p in zip(expected,predicted)])/len(expected))\n",
" \n",
"\n",
"y = df['life_expectancy'].values\n",
"X = df[['fertility', 'gdp']].values\n",
"\n",
"test_model = LinearRegression()\n",
"test_model.fit(X, y)\n",
"\n",
"predicted = list(test_model.predict(X))\n",
"expected = list(y)\n",
"\n",
"print(rmse(predicted,expected))\n",
"print(np.sqrt(mean_squared_error(predicted, expected)))"
]
},
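{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the same metric can be written without an explicit Python loop. The version below is only a sketch of one possible alternative (`rmse_np` is a hypothetical name, not part of the exercise template); it converts both inputs to NumPy arrays and averages the squared differences:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def rmse_np(expected, predicted):\n",
"    # Accept lists or arrays of equal length and work on float arrays.\n",
"    expected = np.asarray(expected, dtype=float)\n",
"    predicted = np.asarray(predicted, dtype=float)\n",
"    # Mean of squared differences, then the square root.\n",
"    return np.sqrt(np.mean((expected - predicted) ** 2))\n",
"\n",
"# Errors are 0, 0 and 3, so MSE = 9 / 3 = 3 and RMSE = sqrt(3).\n",
"value = rmse_np([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])\n",
"print(value)\n",
"```\n",
"\n",
"Because the differences are squared, swapping the two arguments does not change the result."
]
},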
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}