Notebook for the first subtask of the Inżynieria Uczenia Maszynowego (Machine Learning Engineering) class project.
This notebook downloads and standardizes the dataset I will be working on, then prints a short summary of the dataset and its subsets.
Link to the dataset at Kaggle.com:
Google Colab setup
from google.colab import drive
drive.mount('drive')
Drive already mounted at drive; to attempt to forcibly remount, call drive.mount("drive", force_remount=True).
- In the Colab GUI, click to allow Colab to access and modify Google Drive files
!mkdir -p ~/.kaggle
!cp drive/MyDrive/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q kaggle
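The credential setup above can also be sketched in plain Python, for example when the notebook runs outside Colab. This is a minimal sketch under assumptions: the helper name `install_kaggle_token` and its parameters are hypothetical, and it relies only on the Kaggle CLI reading `kaggle.json` from `~/.kaggle` and warning when that file is readable by other users.

```python
import os
import shutil
import stat
import tempfile


def install_kaggle_token(token_path, home_dir):
    """Copy an API token into <home_dir>/.kaggle with owner-only permissions.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    config_dir = os.path.join(home_dir, ".kaggle")
    os.makedirs(config_dir, exist_ok=True)   # same effect as `mkdir -p`
    target = os.path.join(config_dir, "kaggle.json")
    shutil.copyfile(token_path, target)
    os.chmod(target, 0o600)                  # read/write for the owner only
    return target


# Demonstration in a throwaway directory with a placeholder token.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "kaggle.json")
    with open(source, "w") as handle:
        handle.write('{"username": "placeholder", "key": "placeholder"}')
    fake_home = os.path.join(tmp, "home")
    os.makedirs(fake_home)
    installed = install_kaggle_token(source, fake_home)
    installed_mode = stat.S_IMODE(os.stat(installed).st_mode)
    print(installed, oct(installed_mode))
```

In a real session the call would likely be `install_kaggle_token('drive/MyDrive/kaggle.json', os.path.expanduser('~'))`.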
script for lab IUM-01
download data
!kaggle datasets download -d 'pcbreviglieri/smart-grid-stability' >>/dev/null 2>&1
!unzip smart-grid-stability.zip >>/dev/null 2>&1
read the data as pandas data frame
import pandas as pd
df = pd.read_csv('smart_grid_stability_augmented.csv')
standardize the values so that each numeric column has zero mean and unit variance
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(df.iloc[:, 0:-1])
df_norm_array = scaler.transform(df.iloc[:, 0:-1])
df_norm = pd.DataFrame(data=df_norm_array,
                       columns=df.columns[:-1])
df_norm['stabf'] = df['stabf']
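`StandardScaler` centers each column and divides it by its population standard deviation, so the result is not confined to the range 0 to 1. The sketch below illustrates the difference from min-max scaling in plain Python. The helper names `standardize` and `min_max_scale` are hypothetical, and the sample is the five `tau1` values that `df.head()` prints later in this notebook.

```python
import statistics


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population std (ddof=0)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# First five tau1 values from the df.head() output in this notebook.
tau1_sample = [2.959060, 9.304097, 8.971707, 0.716415, 3.134112]

z_scores = standardize(tau1_sample)
unit_range = min_max_scale(tau1_sample)

print("z-scores:", [round(z, 3) for z in z_scores])
print("min-max: ", [round(u, 3) for u in unit_range])
```

Because the scaler uses the population standard deviation while `DataFrame.describe()` reports the sample one, a standardized column of 60000 rows shows `std` as roughly 1.000008 in a summary instead of exactly 1.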
divide the data into train, test and validation subsets
from sklearn.model_selection import train_test_split
train, testAndValid = train_test_split(
    df_norm,
    test_size=0.2,
    random_state=42,
    stratify=df_norm['stabf'])
test, valid = train_test_split(
    testAndValid,
    test_size=0.5,
    random_state=42,
    stratify=testAndValid['stabf'])
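Passing `stratify=` makes `train_test_split` keep the class proportions of `stabf` in every subset. The idea behind the two-step 80/10/10 split can be sketched in plain Python as follows. This is an illustration, not scikit-learn's implementation: `stratified_split` is a hypothetical helper and the 640/360 label list is synthetic.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction, seed):
    """Return (kept, held_out) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept, held_out = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * holdout_fraction)
        held_out.extend(indices[:cut])
        kept.extend(indices[cut:])
    return kept, held_out


# Synthetic labels: 64% "unstable", 36% "stable".
labels = ["unstable"] * 640 + ["stable"] * 360

# Step 1: hold out 20% of the rows.
train_idx, rest_idx = stratified_split(labels, 0.2, seed=42)

# Step 2: split the held-out 20% in half, giving test and validation.
rest_labels = [labels[i] for i in rest_idx]
valid_pos, test_pos = stratified_split(rest_labels, 0.5, seed=42)
test_idx = [rest_idx[p] for p in test_pos]
valid_idx = [rest_idx[p] for p in valid_pos]

for name, idx in [("train", train_idx), ("test", test_idx), ("valid", valid_idx)]:
    unstable = sum(labels[i] == "unstable" for i in idx)
    print(name, len(idx), unstable / len(idx))
```

Splitting each class separately is what keeps the share of `unstable` rows the same in all three subsets.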
print short summary of the dataset and its subsets
# pair each frame with the label printed at the top of its summary
subsets = {'dataset': df_norm, 'train': train, 'test': test, 'valid': valid}
for name, x in subsets.items():
    print(name)
    print("size:", len(x))
    print(x.describe(include='all'))
    print("class distribution", x.value_counts('stabf'))
    print('===============================================================')
dataset
size: 60000
                tau1          tau2  ...          stab     stabf
count   6.000000e+04  6.000000e+04  ...  6.000000e+04     60000
unique           NaN           NaN  ...           NaN         2
top              NaN           NaN  ...           NaN  unstable
freq             NaN           NaN  ...           NaN     38280
mean    1.476245e-16 -1.998105e-16  ...  3.981075e-17       NaN
std     1.000008e+00  1.000008e+00  ...  1.000008e+00       NaN
min    -1.731763e+00 -1.731999e+00  ... -2.613709e+00       NaN
25%    -8.660657e-01 -8.660215e-01  ... -8.475133e-01       NaN
50%     1.437170e-06 -7.028730e-06  ...  3.821538e-02       NaN
75%     8.659131e-01  8.659873e-01  ...  7.895385e-01       NaN
max     1.731859e+00  1.731991e+00  ...  2.537363e+00       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    38280
stable      21720
dtype: int64
===============================================================
train
size: 48000
                tau1          tau2  ...          stab     stabf
count   48000.000000  48000.000000  ...  48000.000000     48000
unique           NaN           NaN  ...           NaN         2
top              NaN           NaN  ...           NaN  unstable
freq             NaN           NaN  ...           NaN     30624
mean       -0.001546     -0.001068  ...     -0.000873       NaN
std         1.000934      0.999107  ...      0.999578       NaN
min        -1.731763     -1.731999  ...     -2.613709       NaN
25%        -0.868796     -0.864317  ...     -0.847686       NaN
50%        -0.001740     -0.005136  ...      0.036743       NaN
75%         0.868335      0.861387  ...      0.788993       NaN
max         1.731859      1.731991  ...      2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    30624
stable      17376
dtype: int64
===============================================================
test
size: 6000
               tau1         tau2  ...         stab     stabf
count   6000.000000  6000.000000  ...  6000.000000      6000
unique          NaN          NaN  ...          NaN         2
top             NaN          NaN  ...          NaN  unstable
freq            NaN          NaN  ...          NaN      3828
mean       0.023917     0.012911  ...     0.003546       NaN
std        0.998552     1.001761  ...     0.998815       NaN
min       -1.731763    -1.731184  ...    -2.613709       NaN
25%       -0.839910    -0.855393  ...    -0.847835       NaN
50%        0.042499     0.020595  ...     0.049834       NaN
75%        0.889110     0.902355  ...     0.794568       NaN
max        1.731686     1.731427  ...     2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    3828
stable      2172
dtype: int64
===============================================================
valid
size: 6000
               tau1         tau2  ...         stab     stabf
count   6000.000000  6000.000000  ...  6000.000000      6000
unique          NaN          NaN  ...          NaN         2
top             NaN          NaN  ...          NaN  unstable
freq            NaN          NaN  ...          NaN      3828
mean      -0.011551    -0.004364  ...     0.003435       NaN
std        0.993842     1.005519  ...     1.004786       NaN
min       -1.731763    -1.731999  ...    -2.613709       NaN
25%       -0.874471    -0.887753  ...    -0.844789       NaN
50%       -0.017244     0.017840  ...     0.039665       NaN
75%        0.825347     0.868048  ...     0.787678       NaN
max        1.731859     1.731991  ...     2.537363       NaN

[11 rows x 14 columns]
class distribution stabf
unstable    3828
stable      2172
dtype: int64
===============================================================
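The class counts in the summaries can be used to confirm that the stratified split preserved the class balance. A small arithmetic check, using only the counts printed above (the dictionary name `unstable_counts` is illustrative):

```python
from fractions import Fraction

# (unstable rows, total rows) as reported in the summaries
unstable_counts = {
    "dataset": (38280, 60000),
    "train": (30624, 48000),
    "test": (3828, 6000),
    "valid": (3828, 6000),
}

# Exact fractions avoid any floating-point rounding in the comparison.
unstable_share = {
    name: Fraction(unstable, total)
    for name, (unstable, total) in unstable_counts.items()
}

for name, share in unstable_share.items():
    print(name, share, float(share))
```

Every subset reduces to the same fraction, 319/500, so 63.8% of the rows are `unstable` in the full dataset and in each of its three subsets.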
script for lab IUM-03
download data
!kaggle datasets download -d 'pcbreviglieri/smart-grid-stability' >>/dev/null 2>&1
!unzip smart-grid-stability.zip >>/dev/null 2>&1
check how many data entries are in the dataset (the line count includes the header row)
!wc -l smart_grid_stability_augmented.csv
60001 smart_grid_stability_augmented.csv
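`wc -l` counts every line, so the 60001 above is one header line plus 60000 data rows. A minimal sketch of counting only the data rows in Python, shown on a tiny in-memory sample rather than the real file (the helper name `count_data_rows` is hypothetical):

```python
import csv
import io


def count_data_rows(csv_file):
    """Count the rows of an open CSV file, excluding the header line."""
    reader = csv.reader(csv_file)
    next(reader, None)   # skip the header if the file is not empty
    return sum(1 for _ in reader)


# Tiny stand-in for the real CSV: one header line and two data rows.
sample = io.StringIO("tau1,stabf\n2.959060,unstable\n9.304097,stable\n")
n_rows = count_data_rows(sample)
print(n_rows)
```

For the real file, the same function would be called on `open('smart_grid_stability_augmented.csv', newline='')`.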
take a look at the dataset to choose columns to keep
import pandas as pd
df = pd.read_csv('smart_grid_stability_augmented.csv')
df.head()
| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.959060 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.781760 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.277210 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.669600 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.797110 | 0.455450 | 0.656947 | 0.820923 | 0.049860 | unstable |
discard some of the columns; shuffle the data; divide it into train, test and validation subsets; and print the number of rows in each subset
!sed 1d smart_grid_stability_augmented.csv | cut -f 1,5,9,13,14 -d "," | shuf | split -l 48000
!mv xaa train.csv
!mv xab toDivide
!split -l 6000 toDivide
!mv xaa test.csv
!mv xab valid.csv
!wc -l train.csv
!wc -l test.csv
!wc -l valid.csv
48000 train.csv
6000 test.csv
6000 valid.csv
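The shell pipeline above drops the header (`sed 1d`), keeps fields 1, 5, 9, 13 and 14 (`cut`), shuffles the lines (`shuf`) and cuts them into chunks (`split`). With the header shown by `df.head()`, those fields are `tau1`, `p1`, `g1`, `stab` and `stabf`. The same steps can be sketched in plain Python. The function name `select_shuffle_split` and the synthetic ten-row input are assumptions made for illustration, and the sketch splits in a single pass instead of running `split` twice.

```python
import random


def select_shuffle_split(lines, keep_fields, sizes, seed=None):
    """Mimic `sed 1d | cut | shuf | split` on a list of CSV lines."""
    body = lines[1:]                                   # sed 1d: drop the header
    picked = [
        ",".join(line.split(",")[i - 1] for i in keep_fields)  # cut is 1-based
        for line in body
    ]
    rng = random.Random(seed)
    rng.shuffle(picked)                                # shuf
    parts, start = [], 0
    for size in sizes:                                 # split into chunks
        parts.append(picked[start:start + size])
        start += size
    return parts


keep = (1, 5, 9, 13, 14)
header = "tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf"
kept_names = [header.split(",")[i - 1] for i in keep]

# Synthetic rows: the cell in row r, column c holds the text "r<r>c<c>".
rows = [",".join(f"r{r}c{c}" for c in range(1, 15)) for r in range(10)]

train_part, test_part, valid_part = select_shuffle_split(
    [header] + rows, keep, sizes=(8, 1, 1), seed=0
)
print(kept_names)
print(len(train_part), len(test_part), len(valid_part))
```

On the real file the sizes would be `(48000, 6000, 6000)`. As in the shell version, the resulting subsets carry no header row and the split is not stratified.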