431 lines
11 KiB
Plaintext
431 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Kkolejna część zajęć będzie wprowadzeniem do szeroko używanej biblioteki w Pythonie: `sklearn`. Zajęcia będą miały charaktere case-study poprzeplatane zadaniami do wykonania. Zacznijmy od załadowania odpowiednich bibliotek."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# ! pip install matplotlib"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"\n",
|
|
"%matplotlib inline"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Zacznijmy od załadowania danych. Na dzisiejszych zajęciach będziemy korzystać z danych z portalu [gapminder.org](https://www.gapminder.org/data/)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.read_csv('gapminder.csv', index_col=0)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Dane zawierają różne informacje z większość państw świata (z roku 2008). Poniżej znajduje się opis kolumn:\n",
|
|
" * female_BMI - średnie BMI u kobiet\n",
|
|
" * male_BMI - średnie BMI u mężczyzn\n",
|
|
" * gdp - PKB na obywatela\n",
|
|
" * population - wielkość populacji\n",
|
|
" * under5mortality - wskaźnik śmiertelności dzieni pon. 5 roku życia (na 1000 urodzonych dzieci)\n",
|
|
" * life_expectancy - średnia długość życia\n",
|
|
" * fertility - wskaźnik dzietności"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 1**\n",
|
|
"Na podstawie danych zawartych w `df` odpowiedz na następujące pytania:\n",
|
|
" * Jaki był współczynniki dzietności w Polsce w 2018?\n",
|
|
" * W którym kraju ludzie żyją najdłużej?\n",
|
|
" * Z ilu krajów zostały zebrane dane?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 2** Stwórz kolumnę `gdp_log`, która powstanie z kolumny `gdp` poprzez zastowanie funkcji `log` (logarytm). \n",
|
|
"\n",
|
|
"Hint 1: Wykorzystaj funkcję `apply` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html#pandas.Series.apply).\n",
|
|
"\n",
|
|
"Hint 2: Wykorzystaj fukcję `log` z pakietu `np`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Naszym zadaniem będzie oszacowanie długości życia (kolumna `life_expectancy`) na podstawie pozostałych zmiennych. Na samym początku, zastosujemy regresje jednowymiarową na `fertility`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"y = df['life_expectancy'].values\n",
|
|
"X = df['fertility'].values\n",
|
|
"\n",
|
|
"print(\"Y shape:\", y.shape)\n",
|
|
"print(\"X shape:\", X.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Będziemy korzystać z gotowej implementacji regreji liniowej z pakietu sklearn. Żeby móc wykorzystać, musimy napierw zmienić shape na dwuwymiarowy."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"y = y.reshape(-1, 1)\n",
|
|
"X = X.reshape(-1, 1)\n",
|
|
"\n",
|
|
"print(\"Y shape:\", y.shape)\n",
|
|
"print(\"X shape:\", X.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Jeszcze przed właściwą analizą, narysujmy wykres i zobaczny czy istnieje \"wizualny\" związek pomiędzy kolumnami."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.plot.scatter('fertility', 'life_expectancy')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 3** Zaimportuj `LinearRegression` z pakietu `sklearn.linear_model`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Tworzymy obiekt modelu regresji liniowej."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"model = LinearRegression()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Trening modelu ogranicza się do wywołania metodu `fit`, która przyjmuje dwa argumenty:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"model.fit(X, y)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Współczynniki modelu:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"Wyraz wolny (bias):\", model.intercept_)\n",
|
|
"print(\"Współczynniki cech:\", model.coef_)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 4** Wytrenuj nowy model `model2`, który będzie jako X przyjmie kolumnę `gdp_log`. Wyświetl parametry nowego modelu."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Mając wytrenowany model możemy wykorzystać go do predykcji. Wystarczy wywołać metodę `predict`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X_test = X[:5,:]\n",
|
|
"y_test = y[:5,:]\n",
|
|
"output = model.predict(X_test)\n",
|
|
"\n",
|
|
"for i in range(5):\n",
|
|
" print(\"input: {}\\t predicted: {}\\t expected: {}\".format(X_test[i,0], output[i,0], y_test[i,0]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Sprawdzenie jakości modelu - metryki: $MSE$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Istnieją 3 metryki, które określają jak dobry jest nasz model:\n",
|
|
" * $MSE$: [błąd średnio-kwadratowy](https://pl.wikipedia.org/wiki/B%C5%82%C4%85d_%C5%9Bredniokwadratowy) \n",
|
|
" * $RMSE = \\sqrt{MSE}$"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.metrics import mean_squared_error\n",
|
|
"\n",
|
|
"rmse = np.sqrt(mean_squared_error(y, model.predict(X)))\n",
|
|
"print(\"Root Mean Squared Error: {}\".format(rmse))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Import necessary modules\n",
|
|
"from sklearn.linear_model import LinearRegression\n",
|
|
"from sklearn.metrics import mean_squared_error\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"\n",
|
|
"# Create training and test sets\n",
|
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
|
|
"\n",
|
|
"# Create the regressor: reg_all\n",
|
|
"reg_all = LinearRegression()\n",
|
|
"\n",
|
|
"# Fit the regressor to the training data\n",
|
|
"reg_all.fit(X_train, y_train)\n",
|
|
"\n",
|
|
"# Predict on the test data: y_pred\n",
|
|
"y_pred = reg_all.predict(X_test)\n",
|
|
"\n",
|
|
"# Compute and print R^2 and RMSE\n",
|
|
"print(\"R^2: {}\".format(reg_all.score(X_test, y_test)))\n",
|
|
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
|
|
"print(\"Root Mean Squared Error: {}\".format(rmse))\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Regresja wielu zmiennych"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Model regresji liniowej wielu zmiennych nie różni się istotnie od modelu jednej zmiennej. Np. chcąc zbudować model oparty o dwie kolumny: `fertility` i `gdp` wystarczy zmienić X (cechy wejściowe):"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"X = df[['fertility', 'gdp']]\n",
|
|
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=42)\n",
|
|
"\n",
|
|
"print(X.shape)\n",
|
|
"\n",
|
|
"model_mv = LinearRegression()\n",
|
|
"model_mv.fit(X_train, y_train)\n",
|
|
"\n",
|
|
"print(\"Wyraz wolny (bias):\", model_mv.intercept_)\n",
|
|
"print(\"Współczynniki cech:\", model_mv.coef_)\n",
|
|
"\n",
|
|
"y_pred = model_mv.predict(X_test)\n",
|
|
"\n",
|
|
"rmse = np.sqrt(mean_squared_error(y_test, y_pred))\n",
|
|
"print(\"Root Mean Squared Error: {}\".format(rmse))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 6** \n",
|
|
" * Zbuduj model regresji liniowej, która oszacuje wartność kolumny `life_expectancy` na podstawie pozostałych kolumn.\n",
|
|
" * Wyświetl współczynniki modelu.\n",
|
|
" * Oblicz wartości metryki rmse na zbiorze trenującym.\n",
|
|
" "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"**zad. 7**\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
" Zaimplementuj metrykę $RMSE$ jako fukcję rmse (szablon poniżej). Fukcja rmse przyjmuje dwa parametry typu list i ma zwrócić wartość metryki $RMSE$ ."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def rmse(expected, predicted):\n",
|
|
" \"\"\"\n",
|
|
" argumenty:\n",
|
|
" expected (type: list): poprawne wartości\n",
|
|
" predicted (type: list): oszacowanie z modelu\n",
|
|
" \"\"\"\n",
|
|
" pass\n",
|
|
" \n",
|
|
"\n",
|
|
"y = df['life_expectancy'].values\n",
|
|
"X = df[['fertility', 'gdp']].values\n",
|
|
"\n",
|
|
"test_model = LinearRegression()\n",
|
|
"test_model.fit(X, y)\n",
|
|
"\n",
|
|
"predicted = list(test_model.predict(X))\n",
|
|
"expected = list(y)\n",
|
|
"\n",
|
|
"print(rmse(predicted,expected))\n",
|
|
"print(np.sqrt(mean_squared_error(predicted, expected)))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.11.7"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|