ium_z487175/02_Dane-Zadanie01.ipynb
2023-04-03 21:27:41 +02:00

211 KiB

%pip install --user pandas
Requirement already satisfied: pandas in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2023.2)
Requirement already satisfied: numpy>=1.21.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (1.24.2)
Requirement already satisfied: six>=1.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
%pip install --user kaggle
Requirement already satisfied: kaggle in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.13)
Requirement already satisfied: six>=1.10 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.28.2)
Requirement already satisfied: tqdm in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (4.65.0)
Requirement already satisfied: python-slugify in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (8.0.1)
Requirement already satisfied: urllib3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.26.15)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: colorama in c:\users\admin\appdata\roaming\python\python311\site-packages (from tqdm->kaggle) (0.4.6)
Note: you may need to restart the kernel to use updated packages.
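# Failed attempt: IPython has no `%python` line magic (see the error below); the `!kaggle ...` shell command in the next cell works instead.
%python -m kaggle datasets download -d ulrikthygepedersen/diamonds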
UsageError: Line magic function `%python` not found (But cell magic `%%python` exists, did you mean that instead?).
!kaggle datasets download -d shivam2503/diamonds
Downloading diamonds.zip to c:\Users\admin\ium_z487175

  0%|          | 0.00/733k [00:00<?, ?B/s]
100%|██████████| 733k/733k [00:00<00:00, 1.35MB/s]
100%|██████████| 733k/733k [00:00<00:00, 1.33MB/s]
!tar -xf diamonds.zip
## unpack the .zip archive on Windows
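The `tar -xf` call above works on Windows, but GNU tar on Linux generally cannot read .zip archives. A portable alternative is Python's standard `zipfile` module. The sketch below is a minimal version; the helper name `extract_archive` and its default target directory are illustrative assumptions, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a .zip archive into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage in the notebook (assumes diamonds.zip is in the working directory):
# extract_archive("diamonds.zip")
```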
import pandas as pd
diamonds = pd.read_csv('diamonds.csv')
# Display the dataset
diamonds
       Unnamed: 0  carat        cut  color  clarity  depth  table  price     x     y     z
0               1   0.23      Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1               2   0.21    Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2               3   0.23       Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3               4   0.29    Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4               5   0.31       Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
...           ...    ...        ...    ...      ...    ...    ...    ...   ...   ...   ...
53935       53936   0.72      Ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50
53936       53937   0.72       Good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61
53937       53938   0.70  Very Good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56
53938       53939   0.86    Premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74
53939       53940   0.75      Ideal      D      SI2   62.2   55.0   2757  5.83  5.87  3.64

53940 rows × 11 columns

# Rename the first (unnamed) column to 'id'
diamonds = diamonds.rename(columns={diamonds.columns[0]: 'id'})
diamonds
          id  carat        cut  color  clarity  depth  table  price     x     y     z
0          1   0.23      Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1          2   0.21    Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2          3   0.23       Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3          4   0.29    Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4          5   0.31       Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
...      ...    ...        ...    ...      ...    ...    ...    ...   ...   ...   ...
53935  53936   0.72      Ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50
53936  53937   0.72       Good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61
53937  53938   0.70  Very Good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56
53938  53939   0.86    Premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74
53939  53940   0.75      Ideal      D      SI2   62.2   55.0   2757  5.83  5.87  3.64

53940 rows × 11 columns
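The `Unnamed: 0` column appears because the CSV stores a row counter under an empty header. As an alternative to renaming it afterwards, `pd.read_csv` can absorb that column into the index while loading. The sketch below uses a tiny in-memory CSV (three rows copied from the output above, with an assumed empty first header field) instead of the real file.

```python
import io

import pandas as pd

# Three rows copied from the output above; the first header field is empty,
# which is why pandas would otherwise name the column "Unnamed: 0".
sample_csv = io.StringIO(
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
    "3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31\n"
)

# index_col=0 turns the unnamed counter into the index instead of a data column.
sample = pd.read_csv(sample_csv, index_col=0)
sample.index.name = "id"
print(sample.columns.tolist())
```

Whether the counter belongs in the index or in an explicit `id` column depends on how later steps refer to rows; both carry the same information.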

# Convert the 'cut' values to lowercase

diamonds['cut'] = diamonds['cut'].str.lower()
diamonds

       Unnamed: 0  carat        cut  color  clarity  depth  table  price     x     y     z
0               1   0.23      ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1               2   0.21    premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2               3   0.23       good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3               4   0.29    premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4               5   0.31       good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
...           ...    ...        ...    ...      ...    ...    ...    ...   ...   ...   ...
53935       53936   0.72      ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50
53936       53937   0.72       good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61
53937       53938   0.70  very good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56
53938       53939   0.86    premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74
53939       53940   0.75      ideal      D      SI2   62.2   55.0   2757  5.83  5.87  3.64

53940 rows × 11 columns

%pip install scikit-learn
Requirement already satisfied: scikit-learn in c:\users\admin\appdata\roaming\python\python311\site-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.24.2)
Requirement already satisfied: scipy>=1.3.2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.10.1)
Requirement already satisfied: joblib>=1.1.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (3.1.0)
Note: you may need to restart the kernel to use updated packages.
import sklearn
from sklearn.model_selection import train_test_split
# Split the data into train/test/dev in an 8:1:1 ratio (80% / 10% / 10%)
# Random seed set to 10

# 1. Split into the training set (80%) and the remaining data
diamonds_train, diamonds_test_dev = train_test_split(diamonds, test_size=0.2, random_state=10)

# 2. Split the remaining data into the test set (10%) and the validation (dev) set (10%)
diamonds_test, diamonds_dev = train_test_split(diamonds_test_dev, test_size=0.5, random_state=10)
# Display the sizes of the train/test/dev datasets
print("Size of diamonds: ", diamonds.shape)
print("Size of diamonds_train: ", diamonds_train.shape)
print("Size of diamonds_test: ", diamonds_test.shape)
print("Size of diamonds_dev: ", diamonds_dev.shape)
Size of diamonds:  (53940, 11)
Size of diamonds_train:  (43152, 11)
Size of diamonds_test:  (5394, 11)
Size of diamonds_dev:  (5394, 11)
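The two-stage split works because half of the held-out 20% is 10% of the full data. A plain random split does not guarantee that class proportions match across the subsets; passing `stratify` to `train_test_split` keeps them aligned. The sketch below is one possible variant with the same proportions and seed. The helper `split_train_test_dev` and the synthetic frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_test_dev(df, label_col, seed=10):
    """Split df 80/10/10 into train/test/dev, stratified on label_col."""
    train, rest = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[label_col]
    )
    # Half of the remaining 20% goes to each of test and dev (10% each).
    test, dev = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest[label_col]
    )
    return train, test, dev


# Synthetic stand-in for the diamonds frame: 1000 rows, imbalanced labels.
toy = pd.DataFrame({
    "cut": ["ideal"] * 600 + ["premium"] * 300 + ["fair"] * 100,
    "price": range(1000),
})
train, test, dev = split_train_test_dev(toy, "cut")
print(len(train), len(test), len(dev))
```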
# Mean, minimum, maximum, standard deviation and median (the 50% row) of each numeric parameter
print(diamonds.describe())
         Unnamed: 0         carat         depth         table         price  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000   
mean   26970.500000      0.797940     61.749405     57.457184   3932.799722   
std    15571.281097      0.474011      1.432621      2.234491   3989.439738   
min        1.000000      0.200000     43.000000     43.000000    326.000000   
25%    13485.750000      0.400000     61.000000     56.000000    950.000000   
50%    26970.500000      0.700000     61.800000     57.000000   2401.000000   
75%    40455.250000      1.040000     62.500000     59.000000   5324.250000   
max    53940.000000      5.010000     79.000000     95.000000  18823.000000   

                  x             y             z  
count  53940.000000  53940.000000  53940.000000  
mean       5.731157      5.734526      3.538734  
std        1.121761      1.142135      0.705699  
min        0.000000      0.000000      0.000000  
25%        4.710000      4.720000      2.910000  
50%        5.700000      5.710000      3.530000  
75%        6.540000      6.540000      4.040000  
max       10.740000     58.900000     31.800000  
print(diamonds_train.describe())
         Unnamed: 0         carat         depth         table         price  \
count  43152.000000  43152.000000  43152.000000  43152.000000  43152.000000   
mean   26971.712111      0.795979     61.748241     57.448355   3920.786939   
std    15565.585777      0.472184      1.426394      2.224297   3975.894633   
min        3.000000      0.200000     43.000000     44.000000    327.000000   
25%    13469.750000      0.400000     61.000000     56.000000    946.000000   
50%    27019.500000      0.700000     61.800000     57.000000   2400.000000   
75%    40439.250000      1.040000     62.500000     59.000000   5313.250000   
max    53938.000000      5.010000     79.000000     76.000000  18823.000000   

                  x             y             z  
count  43152.000000  43152.000000  43152.000000  
mean       5.726933      5.731011      3.535791  
std        1.119635      1.147069      0.693846  
min        0.000000      0.000000      0.000000  
25%        4.710000      4.720000      2.910000  
50%        5.690000      5.710000      3.520000  
75%        6.540000      6.530000      4.030000  
max       10.740000     58.900000      8.060000  
print(diamonds_test.describe())
         Unnamed: 0        carat        depth        table         price  \
count   5394.000000  5394.000000  5394.000000  5394.000000   5394.000000   
mean   26951.351316     0.802666    61.760808    57.470189   3970.308676   
std    15565.740253     0.482062     1.464893     2.309900   4083.195823   
min        1.000000     0.210000    52.300000    43.000000    326.000000   
25%    13519.750000     0.400000    61.000000    56.000000    958.000000   
50%    27013.500000     0.700000    61.900000    57.000000   2375.500000   
75%    40342.250000     1.050000    62.500000    59.000000   5273.750000   
max    53930.000000     3.510000    78.200000    95.000000  18806.000000   

                 x            y            z  
count  5394.000000  5394.000000  5394.000000  
mean      5.738817     5.739106     3.542097  
std       1.132069     1.123925     0.701446  
min       3.840000     3.780000     0.000000  
25%       4.710000     4.710000     2.900000  
50%       5.690000     5.700000     3.530000  
75%       6.550000     6.540000     4.040000  
max       9.660000     9.630000     6.030000  
print(diamonds_dev.describe())
         Unnamed: 0        carat        depth        table         price  \
count   5394.000000  5394.000000  5394.000000  5394.000000   5394.000000   
mean   26979.951798     0.808901    61.747312    57.514813   3991.393029   
std    15625.161644     0.480344     1.449816     2.238671   4002.742530   
min        2.000000     0.200000    53.200000    51.000000    326.000000   
25%    13525.500000     0.400000    61.000000    56.000000    961.000000   
50%    26529.500000     0.710000    61.850000    57.000000   2484.500000   
75%    40665.500000     1.050000    62.500000    59.000000   5465.250000   
max    53940.000000     3.040000    73.600000    68.000000  18779.000000   

                 x            y            z  
count  5394.000000  5394.000000  5394.000000  
mean      5.757290     5.758066     3.558910  
std       1.128191     1.120344     0.797759  
min       3.790000     3.750000     0.000000  
25%       4.730000     4.740000     2.930000  
50%       5.710000     5.730000     3.540000  
75%       6.560000     6.540000     4.040000  
max       9.510000     9.460000    31.800000  
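`describe()` reports the median only as the `50%` row. Where the statistics are wanted under their own labels, `DataFrame.agg` can compute them directly. A minimal sketch on five `carat` and `price` values copied from the first rows displayed earlier:

```python
import pandas as pd

# Five rows copied from the head of the dataset shown earlier.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "price": [326, 326, 327, 334, 335],
})

# One row per statistic, one column per parameter.
summary = sample.agg(["mean", "min", "max", "std", "median"])
print(summary)
```

The same call applies unchanged to `diamonds_train`, `diamonds_test` and `diamonds_dev` once restricted to their numeric columns.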
# Display the number of examples for each diamond class (cut)
diamonds_train["cut"].value_counts()

Ideal        17292
Premium      10954
Very Good     9708
Good          3929
Fair          1269
Name: cut, dtype: int64
diamonds_test["cut"].value_counts()
Ideal        2184
Premium      1385
Very Good    1183
Good          473
Fair          169
Name: cut, dtype: int64
diamonds_dev["cut"].value_counts()
Ideal        2075
Premium      1452
Very Good    1191
Good          504
Fair          172
Name: cut, dtype: int64
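Raw counts are hard to compare across subsets of different sizes, so relative frequencies are more informative here. The sketch below recomputes them from the counts printed above; on the real frames, `value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() outputs above.
counts = pd.DataFrame(
    {
        "train": [17292, 10954, 9708, 3929, 1269],
        "test": [2184, 1385, 1183, 473, 169],
        "dev": [2075, 1452, 1191, 504, 172],
    },
    index=["Ideal", "Premium", "Very Good", "Good", "Fair"],
)

# Divide each column by its total to get the share of every cut per subset.
shares = counts / counts.sum()
print(shares.round(3))
```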
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the full diamonds dataset')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_train['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds training set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_test['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds test set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_dev['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds validation set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
diamonds[["cut","carat"]].groupby("cut").std()
              carat
cut
fair       0.516404
good       0.454054
ideal      0.432876
premium    0.515262
very good  0.459435
diamonds[["cut","carat"]].groupby("cut").mean().plot(kind="bar")
<Axes: xlabel='cut'>
# Normalize the numeric columns to the range 0.0 - 1.0
# (the string values were already converted to lowercase above)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
numeric_cols = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
diamonds[numeric_cols] = scaler.fit_transform(diamonds[numeric_cols])

# Display the dataset
diamonds
       Unnamed: 0     carat        cut  color  clarity     depth     table     price         x         y         z
0               1  0.006237      ideal      E      SI2  0.513889  0.230769  0.000000  0.367784  0.067572  0.076415
1               2  0.002079    premium      E      SI1  0.466667  0.346154  0.000000  0.362197  0.065195  0.072642
2               3  0.006237       good      E      VS1  0.386111  0.423077  0.000054  0.377095  0.069100  0.072642
3               4  0.018711    premium      I      VS2  0.538889  0.288462  0.000433  0.391061  0.071817  0.082704
4               5  0.022869       good      J      SI2  0.563889  0.288462  0.000487  0.404097  0.073854  0.086478
...           ...       ...        ...    ...      ...       ...       ...       ...       ...       ...       ...
53935       53936  0.108108      ideal      D      SI1  0.494444  0.269231  0.131427  0.535382  0.097793  0.110063
53936       53937  0.108108       good      D      SI1  0.558333  0.230769  0.131427  0.529795  0.097623  0.113522
53937       53938  0.103950  very good      D      SI1  0.550000  0.326923  0.131427  0.527002  0.096435  0.111950
53938       53939  0.137214    premium      H      SI2  0.500000  0.288462  0.131427  0.572626  0.103905  0.117610
53939       53940  0.114345      ideal      D      SI2  0.533333  0.230769  0.131427  0.542831  0.099660  0.114465

53940 rows × 11 columns
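`MinMaxScaler` maps each column with x' = (x - min) / (max - min). The very small scaled `y` and `z` values above follow from the outliers visible in the summary statistics (a maximum `y` of 58.9 and a maximum `z` of 31.8), which stretch the range. The sketch below checks the formula against scikit-learn on a toy column. As an assumption about how the scaled features would later be used for modelling, it fits the scaler on the training part only, so that test statistics do not leak into it.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for a training and a test column (values are illustrative).
train_col = np.array([[0.2], [0.7], [1.0], [5.0]])
test_col = np.array([[0.5], [6.2]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min and max from train only
test_scaled = scaler.transform(test_col)        # reuses the training min and max

# Manual formula: (x - min) / (max - min)
col_min, col_max = train_col.min(), train_col.max()
manual = (train_col - col_min) / (col_max - col_min)

print(np.allclose(train_scaled, manual))
print(test_scaled.ravel())  # a test value above the training max maps above 1.0
```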

# Remove artifacts
diamonds = diamonds.dropna()  # drop rows that contain at least one NULL or NaN value
diamonds
       Unnamed: 0  carat        cut  color  clarity  depth  table  price     x     y     z
0               1   0.23      Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1               2   0.21    Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2               3   0.23       Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3               4   0.29    Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4               5   0.31       Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75
...           ...    ...        ...    ...      ...    ...    ...    ...   ...   ...   ...
53935       53936   0.72      Ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50
53936       53937   0.72       Good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61
53937       53938   0.70  Very Good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56
53938       53939   0.86    Premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74
53939       53940   0.75      Ideal      D      SI2   62.2   55.0   2757  5.83  5.87  3.64

53940 rows × 11 columns
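`dropna()` removes only rows with missing values, and the row count stays at 53940, so the file apparently contains none. The summary statistics earlier do show a minimum of 0.0 for `x`, `y` and `z`, though, which cannot be a physical dimension. Below is a hedged sketch of how such rows could be counted and filtered. Treating zeros as invalid is an assumption, and the small frame stands in for the real data.

```python
import numpy as np
import pandas as pd

# Small stand-in frame: one row with a NaN and one with zero dimensions.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, np.nan, 1.00],
    "x": [3.95, 3.89, 4.05, 0.00],
    "y": [3.98, 3.84, 4.07, 6.10],
    "z": [2.43, 2.31, 2.31, 0.00],
})

print(sample.isna().sum())  # missing values per column

# Drop rows with NaN, then rows where any dimension is zero (assumed invalid).
cleaned = sample.dropna()
cleaned = cleaned[(cleaned[["x", "y", "z"]] > 0).all(axis=1)]
print(len(sample), "->", len(cleaned))
```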