!pip install opendatasets
!pip install pandas
!pip install seaborn
!pip install scikit-learn
Requirement already satisfied: opendatasets in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (0.1.22) Requirement already satisfied: tqdm in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (4.66.2) Requirement already satisfied: kaggle in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (1.6.6) Requirement already satisfied: click in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from opendatasets) (8.1.7) Requirement already satisfied: colorama in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from click->opendatasets) (0.4.6) Requirement already satisfied: six>=1.10 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (1.16.0) Requirement already satisfied: certifi in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2024.2.2) Requirement already satisfied: python-dateutil in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.9.0.post0) Requirement already satisfied: requests in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.31.0) Requirement already satisfied: python-slugify in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (8.0.4) Requirement already satisfied: urllib3 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (2.2.1) Requirement already satisfied: bleach in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from kaggle->opendatasets) (6.1.0) Requirement already satisfied: webencodings in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from bleach->kaggle->opendatasets) (0.5.1) Requirement already satisfied: text-unidecode>=1.3 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from 
python-slugify->kaggle->opendatasets) (1.3) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from requests->kaggle->opendatasets) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from requests->kaggle->opendatasets) (3.6)
[notice] A new release of pip available: 22.3.1 -> 24.0 [notice] To update, run: python.exe -m pip install --upgrade pip
Requirement already satisfied: pandas in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (2.2.1) Requirement already satisfied: numpy<2,>=1.23.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (1.26.4) Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas) (2024.1) Requirement already satisfied: six>=1.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: seaborn in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (0.13.2) Requirement already satisfied: numpy!=1.24.0,>=1.20 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (1.26.4) Requirement already satisfied: pandas>=1.2 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (2.2.1) Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from seaborn) (3.8.3) Requirement already satisfied: contourpy>=1.0.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.2.0) Requirement already satisfied: cycler>=0.10 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.50.0) Requirement already satisfied: kiwisolver>=1.3.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.5) Requirement already satisfied: packaging>=20.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.0) Requirement already satisfied: pillow>=8 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.2.0) Requirement already satisfied: pyparsing>=2.3.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.1.2) Requirement already satisfied: python-dateutil>=2.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from 
pandas>=1.2->seaborn) (2024.1) Requirement already satisfied: tzdata>=2022.7 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from pandas>=1.2->seaborn) (2024.1) Requirement already satisfied: six>=1.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Requirement already satisfied: scikit-learn in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (1.4.1.post1) Requirement already satisfied: numpy<2.0,>=1.19.5 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.26.4) Requirement already satisfied: scipy>=1.6.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.12.0) Requirement already satisfied: joblib>=1.2.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (1.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\arden\pycharmprojects\ium_s464980\venv\lib\site-packages (from scikit-learn) (3.3.0)
import opendatasets as od
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
od.download("https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression/code")
data = pd.read_csv("student-performance-multiple-linear-regression/Student_Performance.csv")
data.head()
| | Hours Studied | Previous Scores | Extracurricular Activities | Sleep Hours | Sample Question Papers Practiced | Performance Index |
|---|---|---|---|---|---|---|
| 0 | 7 | 99 | Yes | 9 | 1 | 91.0 |
| 1 | 4 | 82 | No | 4 | 2 | 65.0 |
| 2 | 8 | 51 | Yes | 7 | 2 | 45.0 |
| 3 | 5 | 52 | Yes | 5 | 2 | 36.0 |
| 4 | 7 | 75 | No | 8 | 5 | 66.0 |
data.shape
(10000, 6)
Remove duplicates
data.drop_duplicates(inplace=True)
data.shape
(9873, 6)
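The drop from 10000 to 9873 rows means 127 rows were exact copies of an earlier row. A minimal sketch of how that count could be checked before dropping, using a hypothetical toy frame (`toy` is not the Kaggle data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, 7, 8],
    "Previous Scores": [99, 82, 99, 51],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```

Whether identical rows are genuine errors or simply students with the same values is a judgment call that depends on how the data was collected.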
Encode the Extracurricular Activities column as integers (Yes = 1, No = 0)
data["Extracurricular Activities"] = data["Extracurricular Activities"].replace({'Yes': 1, 'No': 0})
C:\Users\Arden\AppData\Local\Temp\ipykernel_9200\3312621466.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` data["Extracurricular Activities"] = data["Extracurricular Activities"].replace({'Yes': 1, 'No': 0})
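The FutureWarning is raised because `replace` silently downcasts an object column to integers. One possible way to avoid it, sketched on a hypothetical toy Series, is `Series.map`, which builds the integer column directly. This assumes the column contains only 'Yes' and 'No': `map` turns any value missing from the dictionary into NaN.

```python
import pandas as pd

# Hypothetical toy column standing in for data["Extracurricular Activities"].
toy = pd.Series(["Yes", "No", "Yes"], name="Extracurricular Activities")

# map() looks each value up in the dictionary; unmapped values become NaN.
encoded = toy.map({"Yes": 1, "No": 0})

print(encoded.tolist(), encoded.dtype)
```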
Data exploration
hours_studied = data["Hours Studied"]
hours_studied.dtype
dtype('int64')
sns.histplot(hours_studied)
<Axes: xlabel='Hours Studied', ylabel='Count'>
hours_studied.describe()
count    9873.000000
mean        4.992100
std         2.589081
min         1.000000
25%         3.000000
50%         5.000000
75%         7.000000
max         9.000000
Name: Hours Studied, dtype: float64
previous_score = data["Previous Scores"]
previous_score.dtype
dtype('int64')
sns.histplot(previous_score)
<Axes: xlabel='Previous Scores', ylabel='Count'>
extra_activities = data['Extracurricular Activities']
extra_activities.dtype
dtype('int64')
sns.histplot(extra_activities)
<Axes: xlabel='Extracurricular Activities', ylabel='Count'>
sleep_hours = data['Sleep Hours']
sleep_hours.dtype
dtype('int64')
sns.histplot(sleep_hours)
<Axes: xlabel='Sleep Hours', ylabel='Count'>
samples_practised = data["Sample Question Papers Practiced"]
samples_practised.dtype
dtype('int64')
sns.histplot(samples_practised)
<Axes: xlabel='Sample Question Papers Practiced', ylabel='Count'>
performance = data["Performance Index"]
performance.dtype
dtype('float64')
sns.histplot(performance)
<Axes: xlabel='Performance Index', ylabel='Count'>
sns.scatterplot(x=previous_score,y=performance)
<Axes: xlabel='Previous Scores', ylabel='Performance Index'>
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Hours Studied | 9873.0 | 4.992100 | 2.589081 | 1.0 | 3.0 | 5.0 | 7.0 | 9.0 |
| Previous Scores | 9873.0 | 69.441102 | 17.325601 | 40.0 | 54.0 | 69.0 | 85.0 | 99.0 |
| Extracurricular Activities | 9873.0 | 0.494986 | 0.500000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Sleep Hours | 9873.0 | 6.531652 | 1.697683 | 4.0 | 5.0 | 7.0 | 8.0 | 9.0 |
| Sample Question Papers Practiced | 9873.0 | 4.583004 | 2.867202 | 0.0 | 2.0 | 5.0 | 7.0 | 9.0 |
| Performance Index | 9873.0 | 55.216651 | 19.208570 | 10.0 | 40.0 | 55.0 | 70.0 | 100.0 |
correlation_matrix = data[['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
<Axes: >
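The heatmap shows every pairwise correlation, but for modelling the column belonging to the target is usually the part of interest. A minimal sketch of ranking features by their Pearson correlation with `Performance Index`, using a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Hours Studied": [1, 2, 3, 4, 5],
    "Sleep Hours": [7, 5, 8, 6, 7],
    "Performance Index": [20, 35, 50, 62, 80],
})

# Take the target's column of the correlation matrix, drop the target's
# correlation with itself, and sort from strongest positive downwards.
ranked = (
    toy.corr()["Performance Index"]
    .drop("Performance Index")
    .sort_values(ascending=False)
)

print(ranked)
```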
Data standardization (rescaling each feature to zero mean and unit variance)
X = data.drop("Performance Index", axis=1)
y = data["Performance Index"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train_scaled
array([[ 0.77824248,  0.02642469, -0.98691768, -1.48058947,  0.84637542],
       [-0.77049682, -1.58829235,  1.01325574,  0.87108386,  1.54382162],
       [-0.77049682,  0.48777241, -0.98691768,  1.45900219,  0.14892922],
       ...,
       [ 0.00387283,  1.179794  , -0.98691768, -0.89267114, -1.24596317],
       [ 1.16542731, -1.70362928, -0.98691768,  0.28316553,  0.84637542],
       [-0.77049682,  0.89145167,  1.01325574, -0.89267114, -0.89724007]])
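`StandardScaler` subtracts each column's mean and divides by its standard deviation; it rescales the features but does not change the shape of their distributions. Fitting it on `X_train` only, as above, keeps test-set statistics out of the transformation. A minimal sketch of what the transform computes, on a hypothetical toy matrix (`X_toy` and the other names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two features on very different scales.
X_toy = np.array([[1.0, 40.0], [5.0, 70.0], [9.0, 100.0]])

scaler_toy = StandardScaler().fit(X_toy)
X_toy_scaled = scaler_toy.transform(X_toy)

# Equivalent manual computation: (x - mean) / population std (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# After scaling, each column has mean 0 and standard deviation 1.
col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)

print(col_means, col_stds)
```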
Model training and evaluation
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train_scaled, y_train)
LinearRegression()
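Because the model is fitted on standardized features, each coefficient describes the change in the target per one standard deviation of that feature, and the intercept equals the mean of the training target. A minimal sketch of this relationship on hypothetical, noise-free toy data (the slopes 2.0 and 0.5 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: y = 2*x1 + 0.5*x2 + 1 exactly, with no noise.
X_toy = np.array([[1.0, 40.0], [2.0, 90.0], [3.0, 60.0], [4.0, 50.0], [5.0, 80.0]])
y_toy = 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + 1.0

X_toy_scaled = StandardScaler().fit_transform(X_toy)
model = LinearRegression().fit(X_toy_scaled, y_toy)

# Each coefficient equals the raw-unit slope times that feature's std.
expected_coefs = np.array([2.0, 0.5]) * X_toy.std(axis=0)

print(model.coef_, model.intercept_)
```

In the notebook, the same inspection would use `regressor.coef_` together with the column names of `X`.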
import numpy
# Predicting Test Set Results
y_pred = regressor.predict(X_test_scaled)
y_pred = numpy.round(y_pred, decimals = 2)
from sklearn.metrics import r2_score, mean_squared_error
r2 = r2_score(y_test, y_pred)
mean_er = mean_squared_error(y_test, y_pred)
print('Mean Squared Error : ', mean_er)
print('R Square : ', r2)
Mean Squared Error : 4.213380000000001
R Square : 0.9888499650063397
pd.DataFrame({'Actual Performance': y_test, 'Predicted Performance': y_pred})
| | Actual Performance | Predicted Performance |
|---|---|---|
| 4288 | 31.0 | 29.19 |
| 5077 | 62.0 | 59.85 |
| 3955 | 16.0 | 16.21 |
| 9149 | 73.0 | 73.85 |
| 3089 | 44.0 | 45.14 |
| ... | ... | ... |
| 4791 | 35.0 | 39.65 |
| 1750 | 48.0 | 48.94 |
| 8441 | 75.0 | 74.65 |
| 9143 | 67.0 | 65.49 |
| 2522 | 51.0 | 47.46 |
1975 rows × 2 columns
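The reported MSE of about 4.21 corresponds to a root mean squared error of about 2.05 index points, which is easier to read against the 10 to 100 range of the target. A minimal sketch of how both metrics are defined, computed by hand on only the first five actual/predicted pairs from the table above (so the numbers will differ from the full test-set results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five rows of the results table; not the full test set.
y_true = np.array([31.0, 62.0, 16.0, 73.0, 44.0])
y_hat = np.array([29.19, 59.85, 16.21, 73.85, 45.14])

# MSE: average squared residual. RMSE: same quantity in target units.
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
```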