IUM_s464980/lab1.ipynb

!pip install opendatasets
!pip install pandas
!pip install seaborn
!pip install scikit-learn
Requirement already satisfied: opendatasets in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (0.1.22)
Requirement already satisfied: tqdm in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (4.66.2)
Requirement already satisfied: kaggle in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (1.6.6)
Requirement already satisfied: click in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (8.1.7)
Requirement already satisfied: colorama in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from click->opendatasets) (0.4.6)
Requirement already satisfied: six>=1.10 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (1.16.0)
Requirement already satisfied: certifi in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2024.2.2)
Requirement already satisfied: python-dateutil in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.9.0.post0)
Requirement already satisfied: requests in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.31.0)
Requirement already satisfied: python-slugify in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (8.0.4)
Requirement already satisfied: urllib3 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.2.1)
Requirement already satisfied: bleach in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (6.1.0)
Requirement already satisfied: webencodings in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from bleach->kaggle->opendatasets) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from python-slugify->kaggle->opendatasets) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from requests->kaggle->opendatasets) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from requests->kaggle->opendatasets) (3.6)
[notice] A new release of pip available: 22.3.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip
Requirement already satisfied: pandas in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (2.2.1)
Requirement already satisfied: numpy<2,>=1.23.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (0.13.2)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (2.2.1)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (3.8.3)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0)
Requirement already satisfied: cycler>=0.10 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.50.0)
Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5)
Requirement already satisfied: packaging>=20.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.0)
Requirement already satisfied: pillow>=8 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: six>=1.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Requirement already satisfied: scikit-learn in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (1.4.1.post1)
Requirement already satisfied: numpy<2.0,>=1.19.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.12.0)
Requirement already satisfied: joblib>=1.2.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (3.3.0)
import opendatasets as od
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
od.download("https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/code")
data = pd.read_csv("student-performance-multiple-linear-regression/Student_Performance.csv")
data.head()
   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  Sample Question Papers Practiced  Performance Index
0              7               99                         Yes            9                                 1               91.0
1              4               82                          No            4                                 2               65.0
2              8               51                         Yes            7                                 2               45.0
3              5               52                         Yes            5                                 2               36.0
4              7               75                          No            8                                 5               66.0
data.shape
(10000, 6)

Remove duplicates

data.drop_duplicates(inplace=True)
data.shape
(9873, 6)
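
The shapes above imply that 10000 - 9873 = 127 exact duplicate rows were dropped. A minimal sketch of counting duplicates before removing them is shown below. The frame `toy` and its values are made up for illustration and merely stand in for `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 7, 8],
    "Previous Scores": [99, 82, 99, 51],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# Without inplace=True, drop_duplicates() returns a new frame and leaves `toy` untouched.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)        # 1
print(deduplicated.shape)  # (3, 2)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which may be convenient while exploring.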

Encode the Extracurricular Activities column as integers (Yes = 1, No = 0)

# map() avoids the FutureWarning that replace() raises for silent downcasting in pandas 2.2
data["Extracurricular Activities"] = data["Extracurricular Activities"].map({'Yes': 1, 'No': 0})

Data exploration

hours_studied = data["Hours Studied"]
hours_studied.dtype
dtype('int64')
sns.histplot(hours_studied)
<Axes: xlabel='Hours Studied', ylabel='Count'>
hours_studied.describe()
count    9873.000000
mean        4.992100
std         2.589081
min         1.000000
25%         3.000000
50%         5.000000
75%         7.000000
max         9.000000
Name: Hours Studied, dtype: float64
previous_score = data["Previous Scores"]
previous_score.dtype
dtype('int64')
sns.histplot(previous_score)
<Axes: xlabel='Previous Scores', ylabel='Count'>
extra_activities = data['Extracurricular Activities']
extra_activities.dtype
dtype('int64')
sns.histplot(extra_activities)
<Axes: xlabel='Extracurricular Activities', ylabel='Count'>
sleep_hours = data['Sleep Hours']
sleep_hours.dtype
dtype('int64')
sns.histplot(sleep_hours)
<Axes: xlabel='Sleep Hours', ylabel='Count'>
samples_practised = data["Sample Question Papers Practiced"]
samples_practised.dtype
dtype('int64')
sns.histplot(samples_practised)
<Axes: xlabel='Sample Question Papers Practiced', ylabel='Count'>
performance = data["Performance Index"]
performance.dtype
dtype('float64')
sns.histplot(performance)
<Axes: xlabel='Performance Index', ylabel='Count'>
sns.scatterplot(x=previous_score,y=performance)
<Axes: xlabel='Previous Scores', ylabel='Performance Index'>
data.describe().T
                                   count       mean        std   min   25%   50%   75%    max
Hours Studied                     9873.0   4.992100   2.589081   1.0   3.0   5.0   7.0    9.0
Previous Scores                   9873.0  69.441102  17.325601  40.0  54.0  69.0  85.0   99.0
Extracurricular Activities        9873.0   0.494986   0.500000   0.0   0.0   0.0   1.0    1.0
Sleep Hours                       9873.0   6.531652   1.697683   4.0   5.0   7.0   8.0    9.0
Sample Question Papers Practiced  9873.0   4.583004   2.867202   0.0   2.0   5.0   7.0    9.0
Performance Index                 9873.0  55.216651  19.208570  10.0  40.0  55.0  70.0  100.0
correlation_matrix = data[['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index']].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
<Axes: >
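
The heatmap shows all pairwise correlations. When only the relationship with the target matters, a sorted column of the correlation matrix can be easier to read. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the notebook's data) and ranks features by absolute Pearson correlation with `Performance Index`.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Hours Studied": [1, 2, 3, 4, 5, 6],
    "Sleep Hours": [7, 5, 8, 6, 9, 4],
    "Performance Index": [12.0, 19.0, 33.0, 41.0, 52.0, 58.0],
})

target = "Performance Index"

# Take the target's column of the correlation matrix, drop the trivial
# self-correlation, and sort by magnitude so strong negatives rank high too.
corr_with_target = (
    toy.corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr_with_target)
```

Applied to the notebook's `data`, the same expression should work unchanged, provided every column is numeric at that point.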

Data standardization (zero mean, unit variance)

X = data.drop("Performance Index", axis=1)
y = data["Performance Index"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled
array([[ 0.77824248,  0.02642469, -0.98691768, -1.48058947,  0.84637542],
       [-0.77049682, -1.58829235,  1.01325574,  0.87108386,  1.54382162],
       [-0.77049682,  0.48777241, -0.98691768,  1.45900219,  0.14892922],
       ...,
       [ 0.00387283,  1.179794  , -0.98691768, -0.89267114, -1.24596317],
       [ 1.16542731, -1.70362928, -0.98691768,  0.28316553,  0.84637542],
       [-0.77049682,  0.89145167,  1.01325574, -0.89267114, -0.89724007]])
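
`StandardScaler` shifts and rescales each column using statistics learned from the training split. It does not change the shape of a distribution. The sketch below illustrates this on made-up uniform integer features. The names `X_train_toy`, `X_test_toy` and `scaler_toy` are hypothetical and the data is random, not the notebook's.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up, deliberately non-normal features (uniform integers from 1 to 9).
X_train_toy = rng.integers(1, 10, size=(200, 2)).astype(float)
X_test_toy = rng.integers(1, 10, size=(50, 2)).astype(float)

# Fit on the training split only, then apply the same transform to both splits.
scaler_toy = StandardScaler().fit(X_train_toy)
X_train_std = scaler_toy.transform(X_train_toy)
X_test_std = scaler_toy.transform(X_test_toy)

# Training columns now have mean ~0 and standard deviation ~1.
train_means = X_train_std.mean(axis=0)
train_stds = X_train_std.std(axis=0)

# The data is still a handful of discrete levels, so it is not normally distributed.
levels_before = len(np.unique(X_train_toy[:, 0]))
levels_after = len(np.unique(X_train_std[:, 0]))

print(train_means, train_stds, levels_before, levels_after)
```

The test split is transformed with the training statistics, so its column means and standard deviations will generally be close to, but not exactly, 0 and 1.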

Model training and evaluation

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)
LinearRegression()
import numpy

# Predicting Test Set Results
y_pred = regressor.predict(X_test_scaled)
y_pred = numpy.round(y_pred, decimals = 2)
from sklearn.metrics import r2_score, mean_squared_error
r2 = r2_score(y_test, y_pred)
mean_er = mean_squared_error(y_test, y_pred)
print('Mean Squared Error : ', mean_er)
print('R Square : ', r2)
Mean Squared Error :  4.213380000000001
R Square :  0.9888499650063397
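
For intuition, the reported MSE of about 4.21 corresponds to a root mean squared error of roughly 2.05 points on the Performance Index scale. The sketch below recomputes both metrics by hand and checks them against scikit-learn. It uses only the first five actual/predicted pairs from this notebook's prediction table, so its numbers will differ from the full test-set values. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five actual/predicted pairs from the notebook's prediction table.
y_true = np.array([31.0, 62.0, 16.0, 73.0, 44.0])
y_hat = np.array([29.19, 59.85, 16.21, 73.85, 45.14])

# MSE: the average squared residual.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# RMSE is in the same units as the target, which makes it easier to interpret.
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```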
pd.DataFrame({'Actual Performance': y_test, 'Predicted Performance': y_pred})
      Actual Performance  Predicted Performance
4288                31.0                  29.19
5077                62.0                  59.85
3955                16.0                  16.21
9149                73.0                  73.85
3089                44.0                  45.14
...                  ...                    ...
4791                35.0                  39.65
1750                48.0                  48.94
8441                75.0                  74.65
9143                67.0                  65.49
2522                51.0                  47.46

1975 rows × 2 columns