ium_z487175/02_Dane-Zadanie01.ipynb at bf3b0944fdffb5dd77690b84a6d1398d3e49ead8

%pip install --user pandas

Requirement already satisfied: pandas in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2023.2)
Requirement already satisfied: numpy>=1.21.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (1.24.2)
Requirement already satisfied: six>=1.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.

%pip install --user kaggle

Requirement already satisfied: kaggle in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.13)
Requirement already satisfied: six>=1.10 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.28.2)
Requirement already satisfied: tqdm in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (4.65.0)
Requirement already satisfied: python-slugify in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (8.0.1)
Requirement already satisfied: urllib3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.26.15)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: colorama in c:\users\admin\appdata\roaming\python\python311\site-packages (from tqdm->kaggle) (0.4.6)
Note: you may need to restart the kernel to use updated packages.

%python -m kaggle datasets download -d ulrikthygepedersen/diamonds

UsageError: Line magic function `%python` not found (But cell magic `%%python` exists, did you mean that instead?).

!kaggle datasets download -d shivam2503/diamonds

Downloading diamonds.zip to c:\Users\admin\ium_z487175

  0%|          | 0.00/733k [00:00<?, ?B/s]
100%|██████████| 733k/733k [00:00<00:00, 1.35MB/s]
100%|██████████| 733k/733k [00:00<00:00, 1.33MB/s]

!tar -xf diamonds.zip
## Unpack the .zip archive on Windows
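
As the comment notes, the `tar -xf` call is tied to Windows. A portable alternative is Python's standard `zipfile` module. The sketch below is only an illustration: the helper name `extract_zip` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_zip(archive_path, target_dir="."):
    """Extract every member of a .zip archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# In this notebook the call would be: extract_zip("diamonds.zip")
```

Because it uses only the standard library, the same cell should behave the same way on Windows, Linux, and macOS.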

import pandas as pd
diamonds = pd.read_csv('diamonds.csv')
# Display the dataset
diamonds

	Unnamed: 0	carat	cut	color	clarity	depth	table	price	x	y	z
0	1	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	2	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	3	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	4	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	5	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...	...
53935	53936	0.72	Ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	53937	0.72	Good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	53938	0.70	Very Good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	53939	0.86	Premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	53940	0.75	Ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53940 rows × 11 columns

# Rename the first (unnamed) column, which holds the row id, to 'id'
diamonds = diamonds.rename(columns={diamonds.columns[0]: 'id'})
diamonds

	id	carat	cut	color	clarity	depth	table	price	x	y	z
0	1	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	2	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	3	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	4	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	5	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...	...
53935	53936	0.72	Ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	53937	0.72	Good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	53938	0.70	Very Good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	53939	0.86	Premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	53940	0.75	Ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53940 rows × 11 columns
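
The `Unnamed: 0` column appears because the first header field in the CSV is empty. Renaming it afterwards works. An alternative, sketched here on a tiny in-memory CSV that only mimics the shape of `diamonds.csv`, is to read that column as the index with `index_col=0`. The sample rows are copied from the output above; the variable names are illustrative.

```python
import io

import pandas as pd

# A tiny CSV shaped like diamonds.csv: the first header field is empty.
sample_csv = io.StringIO(
    ",carat,cut\n"
    "1,0.23,Ideal\n"
    "2,0.21,Premium\n"
)

# Approach used in the notebook: read normally, then rename the auto-named column.
renamed = pd.read_csv(sample_csv)
renamed = renamed.rename(columns={renamed.columns[0]: "id"})

# Alternative: treat the first column as the index while reading.
sample_csv.seek(0)
indexed = pd.read_csv(sample_csv, index_col=0)
indexed.index.name = "id"

print(list(renamed.columns))
print(list(indexed.columns))
```

With the second approach the id no longer shows up as a numeric feature in `describe()` or in later scaling steps, which may or may not be what the task requires.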

# Convert the 'cut' values to lowercase

diamonds['cut'] = diamonds['cut'].str.lower()
diamonds

	Unnamed: 0	carat	cut	color	clarity	depth	table	price	x	y	z
0	1	0.23	ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	2	0.21	premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	3	0.23	good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	4	0.29	premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	5	0.31	good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...	...
53935	53936	0.72	ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	53937	0.72	good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	53938	0.70	very good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	53939	0.86	premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	53940	0.75	ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53940 rows × 11 columns

%pip install scikit-learn

Requirement already satisfied: scikit-learn in c:\users\admin\appdata\roaming\python\python311\site-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.24.2)
Requirement already satisfied: scipy>=1.3.2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.10.1)
Requirement already satisfied: joblib>=1.1.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (3.1.0)
Note: you may need to restart the kernel to use updated packages.

import sklearn
from sklearn.model_selection import train_test_split

# Split the data into train/test/dev sets in an 8:1:1 ratio (80%/10%/10%)
# The random seed (random_state) is set to 10

# 1. Split into a training set (80%) and the remaining data (20%)
diamonds_train, diamonds_test_dev = train_test_split(diamonds, test_size=0.2, random_state=10)

# 2. Split the remaining data into a test set (10%) and a validation (dev) set (10%)
diamonds_test, diamonds_dev = train_test_split(diamonds_test_dev, test_size=0.5, random_state=10)

# Display the sizes of the full dataset and of the train/test/dev sets
print("Size of diamonds: ", diamonds.shape)
print("Size of diamonds_train: ", diamonds_train.shape)
print("Size of diamonds_test: ", diamonds_test.shape)
print("Size of diamonds_dev: ", diamonds_dev.shape)

Size of diamonds:  (53940, 11)
Size of diamonds_train:  (43152, 11)
Size of diamonds_test:  (5394, 11)
Size of diamonds_dev:  (5394, 11)
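
The two-step split above is purely random. If the class balance of `cut` should be preserved in every subset, `train_test_split` accepts a `stratify` argument. The following is a minimal sketch on toy data, not the notebook's actual method; the helper `split_train_test_dev` and the toy frame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_test_dev(df, label_col, seed=10):
    """Two-step 80/10/10 split that keeps the label proportions in each part."""
    train, rest = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[label_col]
    )
    test, dev = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest[label_col]
    )
    return train, test, dev


# Toy data: 100 rows with an imbalanced label, loosely mimicking 'cut'.
toy = pd.DataFrame(
    {
        "cut": ["ideal"] * 60 + ["premium"] * 30 + ["fair"] * 10,
        "carat": [i / 100 for i in range(100)],
    }
)

train, test, dev = split_train_test_dev(toy, "cut")
print(len(train), len(test), len(dev))
```

Taking 20% first and then halving it gives 0.2 × 0.5 = 10% each for test and dev, which matches the sizes printed above (43152 / 5394 / 5394).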

# Mean, minimum, maximum, standard deviation, and median (the 50% row) of each numeric parameter
print(diamonds.describe())

         Unnamed: 0         carat         depth         table         price  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000   
mean   26970.500000      0.797940     61.749405     57.457184   3932.799722   
std    15571.281097      0.474011      1.432621      2.234491   3989.439738   
min        1.000000      0.200000     43.000000     43.000000    326.000000   
25%    13485.750000      0.400000     61.000000     56.000000    950.000000   
50%    26970.500000      0.700000     61.800000     57.000000   2401.000000   
75%    40455.250000      1.040000     62.500000     59.000000   5324.250000   
max    53940.000000      5.010000     79.000000     95.000000  18823.000000   

                  x             y             z  
count  53940.000000  53940.000000  53940.000000  
mean       5.731157      5.734526      3.538734  
std        1.121761      1.142135      0.705699  
min        0.000000      0.000000      0.000000  
25%        4.710000      4.720000      2.910000  
50%        5.700000      5.710000      3.530000  
75%        6.540000      6.540000      4.040000  
max       10.740000     58.900000     31.800000
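
`describe()` reports the median only implicitly, as the `50%` row. If a table with exactly the requested statistics is preferred, `DataFrame.agg` can list them by name. A small sketch, using the first five `carat` and `price` values from the dataset preview as toy input:

```python
import pandas as pd

# First five rows of 'carat' and 'price', copied from the dataset preview.
toy = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
        "price": [326, 326, 327, 334, 335],
    }
)

# One row per statistic, one column per numeric parameter.
summary = toy.agg(["mean", "min", "max", "std", "median"])
print(summary)
```

On the full frame the same call would need the non-numeric columns (`cut`, `color`, `clarity`) excluded first, for example with `select_dtypes("number")`.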

print(diamonds_train.describe())

         Unnamed: 0         carat         depth         table         price  \
count  43152.000000  43152.000000  43152.000000  43152.000000  43152.000000   
mean   26971.712111      0.795979     61.748241     57.448355   3920.786939   
std    15565.585777      0.472184      1.426394      2.224297   3975.894633   
min        3.000000      0.200000     43.000000     44.000000    327.000000   
25%    13469.750000      0.400000     61.000000     56.000000    946.000000   
50%    27019.500000      0.700000     61.800000     57.000000   2400.000000   
75%    40439.250000      1.040000     62.500000     59.000000   5313.250000   
max    53938.000000      5.010000     79.000000     76.000000  18823.000000   

                  x             y             z  
count  43152.000000  43152.000000  43152.000000  
mean       5.726933      5.731011      3.535791  
std        1.119635      1.147069      0.693846  
min        0.000000      0.000000      0.000000  
25%        4.710000      4.720000      2.910000  
50%        5.690000      5.710000      3.520000  
75%        6.540000      6.530000      4.030000  
max       10.740000     58.900000      8.060000

print(diamonds_test.describe())

         Unnamed: 0        carat        depth        table         price  \
count   5394.000000  5394.000000  5394.000000  5394.000000   5394.000000   
mean   26951.351316     0.802666    61.760808    57.470189   3970.308676   
std    15565.740253     0.482062     1.464893     2.309900   4083.195823   
min        1.000000     0.210000    52.300000    43.000000    326.000000   
25%    13519.750000     0.400000    61.000000    56.000000    958.000000   
50%    27013.500000     0.700000    61.900000    57.000000   2375.500000   
75%    40342.250000     1.050000    62.500000    59.000000   5273.750000   
max    53930.000000     3.510000    78.200000    95.000000  18806.000000   

                 x            y            z  
count  5394.000000  5394.000000  5394.000000  
mean      5.738817     5.739106     3.542097  
std       1.132069     1.123925     0.701446  
min       3.840000     3.780000     0.000000  
25%       4.710000     4.710000     2.900000  
50%       5.690000     5.700000     3.530000  
75%       6.550000     6.540000     4.040000  
max       9.660000     9.630000     6.030000

print(diamonds_dev.describe())

         Unnamed: 0        carat        depth        table         price  \
count   5394.000000  5394.000000  5394.000000  5394.000000   5394.000000   
mean   26979.951798     0.808901    61.747312    57.514813   3991.393029   
std    15625.161644     0.480344     1.449816     2.238671   4002.742530   
min        2.000000     0.200000    53.200000    51.000000    326.000000   
25%    13525.500000     0.400000    61.000000    56.000000    961.000000   
50%    26529.500000     0.710000    61.850000    57.000000   2484.500000   
75%    40665.500000     1.050000    62.500000    59.000000   5465.250000   
max    53940.000000     3.040000    73.600000    68.000000  18779.000000   

                 x            y            z  
count  5394.000000  5394.000000  5394.000000  
mean      5.757290     5.758066     3.558910  
std       1.128191     1.120344     0.797759  
min       3.790000     3.750000     0.000000  
25%       4.730000     4.740000     2.930000  
50%       5.710000     5.730000     3.540000  
75%       6.560000     6.540000     4.040000  
max       9.510000     9.460000    31.800000

# Display the number of examples for each diamond cut class
diamonds_train["cut"].value_counts()

Ideal        17292
Premium      10954
Very Good     9708
Good          3929
Fair          1269
Name: cut, dtype: int64

diamonds_test["cut"].value_counts()

Ideal        2184
Premium      1385
Very Good    1183
Good          473
Fair          169
Name: cut, dtype: int64

diamonds_dev["cut"].value_counts()

Ideal        2075
Premium      1452
Very Good    1191
Good          504
Fair          172
Name: cut, dtype: int64
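
Raw counts are hard to compare across subsets of different sizes. Converting them to percentages makes the class balance directly comparable. The sketch below re-enters the counts printed above by hand. In the notebook itself, `value_counts(normalize=True)` on each subset would give the same shares.

```python
import pandas as pd

# Counts copied from the three value_counts() outputs above.
counts = pd.DataFrame(
    {
        "train": [17292, 10954, 9708, 3929, 1269],
        "test": [2184, 1385, 1183, 473, 169],
        "dev": [2075, 1452, 1191, 504, 172],
    },
    index=["Ideal", "Premium", "Very Good", "Good", "Fair"],
)

# Share of each cut within its subset, in percent.
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

On these numbers, Ideal makes up roughly 40% of the training and test sets and roughly 38.5% of the dev set, so the random split keeps the classes approximately, but not exactly, balanced.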

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the full diamonds dataset')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_train['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds training set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_test['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds test set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
diamonds_dev['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the diamonds validation set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()

diamonds[["cut","carat"]].groupby("cut").std()

	carat
cut
fair	0.516404
good	0.454054
ideal	0.432876
premium	0.515262
very good	0.459435

diamonds[["cut","carat"]].groupby("cut").mean().plot(kind="bar")

<Axes: xlabel='cut'>

# Normalize the numeric columns to the range 0.0 - 1.0
# The string values in 'cut' were already converted to lowercase above

from sklearn.preprocessing import MinMaxScaler
numeric_cols = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
scaler = MinMaxScaler()
diamonds[numeric_cols] = scaler.fit_transform(diamonds[numeric_cols])

# Display the dataset
diamonds

	Unnamed: 0	carat	cut	color	clarity	depth	table	price	x	y	z
0	1	0.006237	ideal	E	SI2	0.513889	0.230769	0.000000	0.367784	0.067572	0.076415
1	2	0.002079	premium	E	SI1	0.466667	0.346154	0.000000	0.362197	0.065195	0.072642
2	3	0.006237	good	E	VS1	0.386111	0.423077	0.000054	0.377095	0.069100	0.072642
3	4	0.018711	premium	I	VS2	0.538889	0.288462	0.000433	0.391061	0.071817	0.082704
4	5	0.022869	good	J	SI2	0.563889	0.288462	0.000487	0.404097	0.073854	0.086478
...	...	...	...	...	...	...	...	...	...	...	...
53935	53936	0.108108	ideal	D	SI1	0.494444	0.269231	0.131427	0.535382	0.097793	0.110063
53936	53937	0.108108	good	D	SI1	0.558333	0.230769	0.131427	0.529795	0.097623	0.113522
53937	53938	0.103950	very good	D	SI1	0.550000	0.326923	0.131427	0.527002	0.096435	0.111950
53938	53939	0.137214	premium	H	SI2	0.500000	0.288462	0.131427	0.572626	0.103905	0.117610
53939	53940	0.114345	ideal	D	SI2	0.533333	0.230769	0.131427	0.542831	0.099660	0.114465

53940 rows × 11 columns
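
`MinMaxScaler` maps each value with (x - min) / (max - min). For the first row above, carat 0.23 with a column minimum of 0.20 and maximum of 5.01 gives 0.03 / 4.81 ≈ 0.006237. The cell above fits the scaler on the full dataset. A common alternative, sketched here on toy frames and not taken from the notebook, is to fit on the training subset only and reuse those parameters for the other subsets. The frame and variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the training and test subsets (values taken from describe()).
train = pd.DataFrame({"carat": [0.20, 0.70, 5.01], "price": [326.0, 2401.0, 18823.0]})
test = pd.DataFrame({"carat": [0.23], "price": [326.0]})
numeric_cols = ["carat", "price"]

scaler = MinMaxScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn min and max from the training data only ...
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# ... and apply the same transformation to unseen data.
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])

# Manual check of the formula for carat = 0.23.
manual = (0.23 - 0.20) / (5.01 - 0.20)
print(round(manual, 6), round(test_scaled.loc[0, "carat"], 6))
```

With this variant, values in the test or dev set can fall slightly outside the 0.0 to 1.0 range if they exceed the training minimum or maximum.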

# Remove artifacts
diamonds = diamonds.dropna()  # drop rows that contain at least one NULL or NaN value
diamonds

	Unnamed: 0	carat	cut	color	clarity	depth	table	price	x	y	z
0	1	0.23	Ideal	E	SI2	61.5	55.0	326	3.95	3.98	2.43
1	2	0.21	Premium	E	SI1	59.8	61.0	326	3.89	3.84	2.31
2	3	0.23	Good	E	VS1	56.9	65.0	327	4.05	4.07	2.31
3	4	0.29	Premium	I	VS2	62.4	58.0	334	4.20	4.23	2.63
4	5	0.31	Good	J	SI2	63.3	58.0	335	4.34	4.35	2.75
...	...	...	...	...	...	...	...	...	...	...	...
53935	53936	0.72	Ideal	D	SI1	60.8	57.0	2757	5.75	5.76	3.50
53936	53937	0.72	Good	D	SI1	63.1	55.0	2757	5.69	5.75	3.61
53937	53938	0.70	Very Good	D	SI1	62.8	60.0	2757	5.66	5.68	3.56
53938	53939	0.86	Premium	H	SI2	61.0	58.0	2757	6.15	6.12	3.74
53939	53940	0.75	Ideal	D	SI2	62.2	55.0	2757	5.83	5.87	3.64

53940 rows × 11 columns
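
The row count is still 53940 after `dropna()`, so no rows were removed. The `describe()` outputs also show a minimum of 0.0 for `x`, `y`, and `z`, which is not plausible for a physical dimension. Treating those zeros as artifacts is an assumption and not something the original notebook does. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one zero-sized dimension.
toy = pd.DataFrame(
    {
        "carat": [0.23, np.nan, 0.31, 1.00],
        "x": [3.95, 3.89, 0.00, 6.10],
        "y": [3.98, 3.84, 4.35, 6.05],
        "z": [2.43, 2.31, 2.75, 3.70],
    }
)

# Inspect how many values are missing per column before dropping anything.
missing_per_column = toy.isna().sum()

# Step 1: drop rows with at least one NaN (as in the notebook).
cleaned = toy.dropna()

# Step 2 (assumption): also drop rows where any dimension is exactly zero.
has_zero_dim = (cleaned[["x", "y", "z"]] == 0).any(axis=1)
cleaned = cleaned[~has_zero_dim]

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```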
