ium_470607/ium01.ipynb


Notebook for the first subtask of the Inżynieria Uczenia Maszynowego (Machine Learning Engineering) class project.

This workbook downloads, standardizes, and prints a short summary of the dataset I will be working on and of its subsets.

Link to the dataset at Kaggle.com:

https://www.kaggle.com/pcbreviglieri/smart-grid-stability

from google.colab import drive
drive.mount('drive')
Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).
  • Click through the prompt in the Colab GUI to allow Colab to access and modify Google Drive files
!mkdir -p ~/.kaggle
!cp drive/MyDrive/kaggle.json ~/.kaggle/.
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q kaggle

script for lab IUM-01

download data

!kaggle datasets download -d 'pcbreviglieri/smart-grid-stability' >>/dev/null 2>&1
!unzip smart-grid-stability.zip >>/dev/null 2>&1

read the data as pandas data frame

import pandas as pd

df = pd.read_csv('smart_grid_stability_augmented.csv')

standardize the feature values so each column has zero mean and unit variance (note: StandardScaler does z-score standardization, not 0–1 min-max scaling; the stabf label column is left unscaled)

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(df.iloc[:, 0:-1])
df_norm_array = scaler.transform(df.iloc[:, 0:-1])
df_norm = pd.DataFrame(data=df_norm_array,
                       columns=df.columns[:-1])
df_norm['stabf'] = df['stabf']
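Note that StandardScaler performs z-score standardization (zero mean, unit variance) rather than mapping values into [0, 1]. If a strict 0–1 range were actually needed, MinMaxScaler could be used instead; a minimal sketch on a toy frame (the column names are just placeholders mimicking the dataset):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the real dataframe: numeric features, 'stabf' label last.
df = pd.DataFrame({'tau1': [1.0, 2.0, 3.0],
                   'p1': [10.0, 20.0, 30.0],
                   'stabf': ['stable', 'unstable', 'stable']})

# MinMaxScaler maps each feature column onto [0, 1] (min -> 0, max -> 1).
scaler = MinMaxScaler().fit(df.iloc[:, :-1])
df_01 = pd.DataFrame(scaler.transform(df.iloc[:, :-1]),
                     columns=df.columns[:-1])
df_01['stabf'] = df['stabf']
print(df_01['tau1'].min(), df_01['tau1'].max())  # -> 0.0 1.0
```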

divide the data into train (80%), test (10%) and validation (10%) subsets, stratified on the stabf label

from sklearn.model_selection import train_test_split

train, testAndValid = train_test_split(
    df_norm,
    test_size=0.2,
    random_state=42,
    stratify=df_norm['stabf'])

test, valid = train_test_split(
    testAndValid,
    test_size=0.5,
    random_state=42,
    stratify=testAndValid['stabf'])
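A quick sanity check of this two-stage split (a sketch on a synthetic frame with the same stabf label, not the real data) confirms the 80/10/10 proportions and that stratification preserves the class ratio in every subset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_norm: 60% unstable / 40% stable, like the dataset.
df_norm = pd.DataFrame({'tau1': range(100),
                        'stabf': ['unstable'] * 60 + ['stable'] * 40})

# First split off 20%, then halve that 20% into test and validation.
train, test_and_valid = train_test_split(
    df_norm, test_size=0.2, random_state=42, stratify=df_norm['stabf'])
test, valid = train_test_split(
    test_and_valid, test_size=0.5, random_state=42,
    stratify=test_and_valid['stabf'])

print(len(train), len(test), len(valid))  # -> 80 10 10
for part in (train, test, valid):
    # stratification keeps the 60% unstable share in each subset
    print((part['stabf'] == 'unstable').mean())  # -> 0.6
```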

print short summary of the dataset and its subsets

def namestr(obj, namespace):
  """Return all variable names in namespace that are bound to obj."""
  return [name for name in namespace if namespace[name] is obj]

dataset = df_norm
for x in [dataset, train, test, valid]:
  # print the last of the longest names bound to this object
  # (e.g. 'dataset' rather than the single-letter loop variable 'x')
  names = namestr(x, globals())
  longest = max(len(n) for n in names)
  print([n for n in names if len(n) == longest][-1])
  print("size:", len(x))
  print(x.describe(include='all'))
  print("class distribution", x.value_counts('stabf'))
  print('===============================================================')
dataset
size: 60000
                tau1          tau2  ...          stab     stabf
count   6.000000e+04  6.000000e+04  ...  6.000000e+04     60000
unique           NaN           NaN  ...           NaN         2
top              NaN           NaN  ...           NaN  unstable
freq             NaN           NaN  ...           NaN     38280
mean    1.476245e-16 -1.998105e-16  ...  3.981075e-17       NaN
std     1.000008e+00  1.000008e+00  ...  1.000008e+00       NaN
min    -1.731763e+00 -1.731999e+00  ... -2.613709e+00       NaN
25%    -8.660657e-01 -8.660215e-01  ... -8.475133e-01       NaN
50%     1.437170e-06 -7.028730e-06  ...  3.821538e-02       NaN
75%     8.659131e-01  8.659873e-01  ...  7.895385e-01       NaN
max     1.731859e+00  1.731991e+00  ...  2.537363e+00       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    38280
stable      21720
dtype: int64
===============================================================
train
size: 48000
                tau1          tau2  ...          stab     stabf
count   48000.000000  48000.000000  ...  48000.000000     48000
unique           NaN           NaN  ...           NaN         2
top              NaN           NaN  ...           NaN  unstable
freq             NaN           NaN  ...           NaN     30624
mean       -0.001546     -0.001068  ...     -0.000873       NaN
std         1.000934      0.999107  ...      0.999578       NaN
min        -1.731763     -1.731999  ...     -2.613709       NaN
25%        -0.868796     -0.864317  ...     -0.847686       NaN
50%        -0.001740     -0.005136  ...      0.036743       NaN
75%         0.868335      0.861387  ...      0.788993       NaN
max         1.731859      1.731991  ...      2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    30624
stable      17376
dtype: int64
===============================================================
test
size: 6000
               tau1         tau2  ...         stab     stabf
count   6000.000000  6000.000000  ...  6000.000000      6000
unique          NaN          NaN  ...          NaN         2
top             NaN          NaN  ...          NaN  unstable
freq            NaN          NaN  ...          NaN      3828
mean       0.023917     0.012911  ...     0.003546       NaN
std        0.998552     1.001761  ...     0.998815       NaN
min       -1.731763    -1.731184  ...    -2.613709       NaN
25%       -0.839910    -0.855393  ...    -0.847835       NaN
50%        0.042499     0.020595  ...     0.049834       NaN
75%        0.889110     0.902355  ...     0.794568       NaN
max        1.731686     1.731427  ...     2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    3828
stable      2172
dtype: int64
===============================================================
valid
size: 6000
               tau1         tau2  ...         stab     stabf
count   6000.000000  6000.000000  ...  6000.000000      6000
unique          NaN          NaN  ...          NaN         2
top             NaN          NaN  ...          NaN  unstable
freq            NaN          NaN  ...          NaN      3828
mean      -0.011551    -0.004364  ...     0.003435       NaN
std        0.993842     1.005519  ...     1.004786       NaN
min       -1.731763    -1.731999  ...    -2.613709       NaN
25%       -0.874471    -0.887753  ...    -0.844789       NaN
50%       -0.017244     0.017840  ...     0.039665       NaN
75%        0.825347     0.868048  ...     0.787678       NaN
max        1.731859     1.731991  ...     2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    3828
stable      2172
dtype: int64
===============================================================

script for lab IUM-03

download data

!kaggle datasets download -d 'pcbreviglieri/smart-grid-stability' >>/dev/null 2>&1
!unzip smart-grid-stability.zip >>/dev/null 2>&1

check how many data entries are in the dataset (the line count below includes the CSV header, so there are 60,000 entries)

!wc -l smart_grid_stability_augmented.csv
60001 smart_grid_stability_augmented.csv

take a look at the dataset to choose columns to keep

import pandas as pd
df = pd.read_csv('smart_grid_stability_augmented.csv')
df.head()
tau1 tau2 tau3 tau4 p1 p2 p3 p4 g1 g2 g3 g4 stab stabf
0 2.959060 3.079885 8.381025 9.780754 3.763085 -0.782604 -1.257395 -1.723086 0.650456 0.859578 0.887445 0.958034 0.055347 unstable
1 9.304097 4.902524 3.047541 1.369357 5.067812 -1.940058 -1.872742 -1.255012 0.413441 0.862414 0.562139 0.781760 -0.005957 stable
2 8.971707 8.848428 3.046479 1.214518 3.405158 -1.207456 -1.277210 -0.920492 0.163041 0.766689 0.839444 0.109853 0.003471 unstable
3 0.716415 7.669600 4.486641 2.340563 3.963791 -1.027473 -1.938944 -0.997374 0.446209 0.976744 0.929381 0.362718 0.028871 unstable
4 3.134112 7.608772 4.943759 9.857573 3.525811 -1.125531 -1.845975 -0.554305 0.797110 0.455450 0.656947 0.820923 0.049860 unstable

discard some of the columns; shuffle the data; divide it into train, test and validation subsets and print the number of rows in each subset

!sed 1d smart_grid_stability_augmented.csv | cut -f 1,5,9,13,14 -d "," | shuf | split -l 48000
!mv xaa train.csv
!mv xab toDivide
!split -l 6000 toDivide
!mv xaa test.csv
!mv xab valid.csv
!wc -l train.csv
!wc -l test.csv
!wc -l valid.csv
48000 train.csv
6000 test.csv
6000 valid.csv
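Since sed 1d stripped the header before splitting, train.csv, test.csv and valid.csv have no header row; when they are read back with pandas, the column names must be supplied manually. A sketch with an in-memory stand-in for one of the files (the five names follow from fields 1, 5, 9, 13 and 14 of the original header):

```python
import io
import pandas as pd

# Fields 1,5,9,13,14 of tau1..tau4, p1..p4, g1..g4, stab, stabf:
cols = ['tau1', 'p1', 'g1', 'stab', 'stabf']

# Stand-in for train.csv: header-less rows, as produced by the shell pipeline.
sample = io.StringIO('2.959,3.763,0.650,0.055,unstable\n'
                     '9.304,5.067,0.413,-0.006,stable\n')
# header=None tells pandas not to consume the first data row as a header.
train = pd.read_csv(sample, header=None, names=cols)
print(list(train.columns), len(train))  # -> ['tau1', 'p1', 'g1', 'stab', 'stabf'] 2
```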