211 KiB
%pip install --user pandas
Requirement already satisfied: pandas in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (2023.2)
Requirement already satisfied: numpy>=1.21.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from pandas) (1.24.2)
Requirement already satisfied: six>=1.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
%pip install --user kaggle
Requirement already satisfied: kaggle in c:\users\admin\appdata\roaming\python\python311\site-packages (1.5.13)
Requirement already satisfied: six>=1.10 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-dateutil in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (2.28.2)
Requirement already satisfied: tqdm in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (4.65.0)
Requirement already satisfied: python-slugify in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (8.0.1)
Requirement already satisfied: urllib3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from kaggle) (1.26.15)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in c:\users\admin\appdata\roaming\python\python311\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: colorama in c:\users\admin\appdata\roaming\python\python311\site-packages (from tqdm->kaggle) (0.4.6)
Note: you may need to restart the kernel to use updated packages.
%python -m kaggle datasets download -d ulrikthygepedersen/diamonds
UsageError: Line magic function `%python` not found (But cell magic `%%python` exists, did you mean that instead?).
!kaggle datasets download -d shivam2503/diamonds
Downloading diamonds.zip to c:\Users\admin\ium_z487175
100%|██████████| 733k/733k [00:00<00:00, 1.33MB/s]
!tar -xf diamonds.zip
## Unpack the .zip archive (on Windows)
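The `tar -xf` call above depends on the `tar` build available on the machine, and not every build reads `.zip` files. As a portable alternative, here is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_archive` and its default destination are illustrative assumptions, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a .zip archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

Calling `extract_archive("diamonds.zip")` in the notebook's working directory should leave `diamonds.csv` next to the archive, assuming the download step succeeded.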
import pandas as pd
diamonds = pd.read_csv('diamonds.csv')
# Display the dataset
diamonds
 | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 53936 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 53937 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 53938 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 53939 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 53940 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 11 columns
# Rename the first (unnamed) column to 'id'
diamonds = diamonds.rename(columns={diamonds.columns[0]: 'id'})
diamonds
 | id | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 53936 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 53937 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 53938 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 53939 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 53940 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 11 columns
# Convert the 'cut' values to lowercase
diamonds['cut'] = diamonds['cut'].str.lower()
diamonds
 | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.23 | ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 2 | 0.21 | premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 3 | 0.23 | good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 4 | 0.29 | premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 5 | 0.31 | good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 53936 | 0.72 | ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 53937 | 0.72 | good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 53938 | 0.70 | very good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 53939 | 0.86 | premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 53940 | 0.75 | ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 11 columns
%pip install scikit-learn
Requirement already satisfied: scikit-learn in c:\users\admin\appdata\roaming\python\python311\site-packages (1.2.2) Requirement already satisfied: numpy>=1.17.3 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.24.2) Requirement already satisfied: scipy>=1.3.2 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.10.1) Requirement already satisfied: joblib>=1.1.1 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\admin\appdata\roaming\python\python311\site-packages (from scikit-learn) (3.1.0) Note: you may need to restart the kernel to use updated packages.
import sklearn
from sklearn.model_selection import train_test_split
# Split the data into train/test/dev sets in an 8:1:1 ratio (80% / 10% / 10%)
# The random seed is fixed at 10 for reproducibility
# 1. Split off the training set (80%) from the rest of the data (20%)
diamonds_train, diamonds_test_dev = train_test_split(diamonds, test_size=0.2, random_state=10)
# 2. Split the rest evenly into the test set (10%) and the validation set (10%)
diamonds_test, diamonds_dev = train_test_split(diamonds_test_dev, test_size=0.5, random_state=10)
# Print the sizes of the full dataset and of the train/test/dev splits
print("diamonds size:", diamonds.shape)
print("diamonds_train size:", diamonds_train.shape)
print("diamonds_test size:", diamonds_test.shape)
print("diamonds_dev size:", diamonds_dev.shape)
diamonds size: (53940, 11)
diamonds_train size: (43152, 11)
diamonds_test size: (5394, 11)
diamonds_dev size: (5394, 11)
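The printed shapes can be turned into shares of the full dataset to confirm what the two-step split actually produces. The sketch below uses only the row counts shown above; the helper name `split_shares` is an illustrative assumption.

```python
def split_shares(sizes):
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# Row counts taken from the shapes printed above.
sizes = {"train": 43152, "test": 5394, "dev": 5394}
shares = split_shares(sizes)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With these counts the shares come out as 80%, 10% and 10%, that is an 8:1:1 ratio: `test_size=0.2` removes one fifth of the rows, and `test_size=0.5` halves that remainder.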
# Mean, minimum, maximum, standard deviation and median (the 50% row) of each numeric column
print(diamonds.describe())
         Unnamed: 0         carat         depth         table         price  \
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean   26970.500000      0.797940     61.749405     57.457184   3932.799722
std    15571.281097      0.474011      1.432621      2.234491   3989.439738
min        1.000000      0.200000     43.000000     43.000000    326.000000
25%    13485.750000      0.400000     61.000000     56.000000    950.000000
50%    26970.500000      0.700000     61.800000     57.000000   2401.000000
75%    40455.250000      1.040000     62.500000     59.000000   5324.250000
max    53940.000000      5.010000     79.000000     95.000000  18823.000000

                  x             y             z
count  53940.000000  53940.000000  53940.000000
mean       5.731157      5.734526      3.538734
std        1.121761      1.142135      0.705699
min        0.000000      0.000000      0.000000
25%        4.710000      4.720000      2.910000
50%        5.700000      5.710000      3.530000
75%        6.540000      6.540000      4.040000
max       10.740000     58.900000     31.800000
print(diamonds_train.describe())
         Unnamed: 0         carat         depth         table         price  \
count  43152.000000  43152.000000  43152.000000  43152.000000  43152.000000
mean   26971.712111      0.795979     61.748241     57.448355   3920.786939
std    15565.585777      0.472184      1.426394      2.224297   3975.894633
min        3.000000      0.200000     43.000000     44.000000    327.000000
25%    13469.750000      0.400000     61.000000     56.000000    946.000000
50%    27019.500000      0.700000     61.800000     57.000000   2400.000000
75%    40439.250000      1.040000     62.500000     59.000000   5313.250000
max    53938.000000      5.010000     79.000000     76.000000  18823.000000

                  x             y             z
count  43152.000000  43152.000000  43152.000000
mean       5.726933      5.731011      3.535791
std        1.119635      1.147069      0.693846
min        0.000000      0.000000      0.000000
25%        4.710000      4.720000      2.910000
50%        5.690000      5.710000      3.520000
75%        6.540000      6.530000      4.030000
max       10.740000     58.900000      8.060000
print(diamonds_test.describe())
         Unnamed: 0         carat         depth         table         price  \
count   5394.000000   5394.000000   5394.000000   5394.000000   5394.000000
mean   26951.351316      0.802666     61.760808     57.470189   3970.308676
std    15565.740253      0.482062      1.464893      2.309900   4083.195823
min        1.000000      0.210000     52.300000     43.000000    326.000000
25%    13519.750000      0.400000     61.000000     56.000000    958.000000
50%    27013.500000      0.700000     61.900000     57.000000   2375.500000
75%    40342.250000      1.050000     62.500000     59.000000   5273.750000
max    53930.000000      3.510000     78.200000     95.000000  18806.000000

                  x             y             z
count   5394.000000   5394.000000   5394.000000
mean       5.738817      5.739106      3.542097
std        1.132069      1.123925      0.701446
min        3.840000      3.780000      0.000000
25%        4.710000      4.710000      2.900000
50%        5.690000      5.700000      3.530000
75%        6.550000      6.540000      4.040000
max        9.660000      9.630000      6.030000
print(diamonds_dev.describe())
         Unnamed: 0         carat         depth         table         price  \
count   5394.000000   5394.000000   5394.000000   5394.000000   5394.000000
mean   26979.951798      0.808901     61.747312     57.514813   3991.393029
std    15625.161644      0.480344      1.449816      2.238671   4002.742530
min        2.000000      0.200000     53.200000     51.000000    326.000000
25%    13525.500000      0.400000     61.000000     56.000000    961.000000
50%    26529.500000      0.710000     61.850000     57.000000   2484.500000
75%    40665.500000      1.050000     62.500000     59.000000   5465.250000
max    53940.000000      3.040000     73.600000     68.000000  18779.000000

                  x             y             z
count   5394.000000   5394.000000   5394.000000
mean       5.757290      5.758066      3.558910
std        1.128191      1.120344      0.797759
min        3.790000      3.750000      0.000000
25%        4.730000      4.740000      2.930000
50%        5.710000      5.730000      3.540000
75%        6.560000      6.540000      4.040000
max        9.510000      9.460000     31.800000
# Show the number of examples for each diamond cut class
diamonds_train["cut"].value_counts()
Ideal        17292
Premium      10954
Very Good     9708
Good          3929
Fair          1269
Name: cut, dtype: int64
diamonds_test["cut"].value_counts()
Ideal        2184
Premium      1385
Very Good    1183
Good          473
Fair          169
Name: cut, dtype: int64
diamonds_dev["cut"].value_counts()
Ideal        2075
Premium      1452
Very Good    1191
Good          504
Fair          172
Name: cut, dtype: int64
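Absolute counts are hard to compare across splits of different sizes, so it can help to convert them to shares. The sketch below reuses the counts printed above; the helper name `class_shares` is an illustrative assumption.

```python
def class_shares(class_counts):
    """Convert absolute class counts into shares of the split."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}


# Counts copied from the value_counts() outputs above.
cut_counts = {
    "train": {"Ideal": 17292, "Premium": 10954, "Very Good": 9708, "Good": 3929, "Fair": 1269},
    "test": {"Ideal": 2184, "Premium": 1385, "Very Good": 1183, "Good": 473, "Fair": 169},
    "dev": {"Ideal": 2075, "Premium": 1452, "Very Good": 1191, "Good": 504, "Fair": 172},
}

for split_name, class_counts in cut_counts.items():
    shares = class_shares(class_counts)
    print(split_name, {label: round(share, 3) for label, share in shares.items()})
```

The share of "Ideal" is roughly 40.1% in train, 40.5% in test and 38.5% in dev, so the random split keeps the classes close but not identical. If tighter agreement is wanted, `train_test_split` accepts a `stratify` argument; the notebook does not use it.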
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
diamonds['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the full diamonds dataset')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
diamonds_train['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the training set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
diamonds_test['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the test set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
diamonds_dev['cut'].value_counts().plot(kind='bar')
plt.title('Frequency distribution of diamond cuts in the validation set')
plt.xlabel('Cut')
plt.ylabel('Number of occurrences')
plt.show()
diamonds[["cut","carat"]].groupby("cut").std()
cut | carat |
---|---|
fair | 0.516404 |
good | 0.454054 |
ideal | 0.432876 |
premium | 0.515262 |
very good | 0.459435 |
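To make explicit what `groupby("cut").std()` computes, here is a plain-Python sketch of a per-group sample standard deviation (pandas uses the n - 1 denominator by default, as does `statistics.stdev`). The helper name `grouped_std` is an illustrative assumption, and the input is only the ten rows visible in the table previews, so the numbers will not match the full-dataset values above.

```python
from collections import defaultdict
from statistics import stdev


def grouped_std(rows):
    """Sample standard deviation (n - 1 denominator) of carat per cut."""
    groups = defaultdict(list)
    for cut, carat in rows:
        groups[cut].append(carat)
    # stdev needs at least two values; pandas reports NaN for such groups,
    # None stands in for that here.
    return {
        cut: stdev(values) if len(values) > 1 else None
        for cut, values in groups.items()
    }


# The ten rows visible in the table previews, NOT the full dataset.
sample_rows = [
    ("ideal", 0.23), ("premium", 0.21), ("good", 0.23), ("premium", 0.29),
    ("good", 0.31), ("ideal", 0.72), ("good", 0.72), ("very good", 0.70),
    ("premium", 0.86), ("ideal", 0.75),
]
result = grouped_std(sample_rows)
for cut, value in sorted(result.items()):
    print(cut, value)
```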
diamonds[["cut","carat"]].groupby("cut").mean().plot(kind="bar")
<Axes: xlabel='cut'>
# Normalize the numeric columns to the range 0.0 - 1.0
# (the string values in 'cut' were already converted to lowercase above)
from sklearn.preprocessing import MinMaxScaler
numeric_columns = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
scaler = MinMaxScaler()
diamonds[numeric_columns] = scaler.fit_transform(diamonds[numeric_columns])
# Display the dataset
diamonds
 | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.006237 | ideal | E | SI2 | 0.513889 | 0.230769 | 0.000000 | 0.367784 | 0.067572 | 0.076415 |
1 | 2 | 0.002079 | premium | E | SI1 | 0.466667 | 0.346154 | 0.000000 | 0.362197 | 0.065195 | 0.072642 |
2 | 3 | 0.006237 | good | E | VS1 | 0.386111 | 0.423077 | 0.000054 | 0.377095 | 0.069100 | 0.072642 |
3 | 4 | 0.018711 | premium | I | VS2 | 0.538889 | 0.288462 | 0.000433 | 0.391061 | 0.071817 | 0.082704 |
4 | 5 | 0.022869 | good | J | SI2 | 0.563889 | 0.288462 | 0.000487 | 0.404097 | 0.073854 | 0.086478 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 53936 | 0.108108 | ideal | D | SI1 | 0.494444 | 0.269231 | 0.131427 | 0.535382 | 0.097793 | 0.110063 |
53936 | 53937 | 0.108108 | good | D | SI1 | 0.558333 | 0.230769 | 0.131427 | 0.529795 | 0.097623 | 0.113522 |
53937 | 53938 | 0.103950 | very good | D | SI1 | 0.550000 | 0.326923 | 0.131427 | 0.527002 | 0.096435 | 0.111950 |
53938 | 53939 | 0.137214 | premium | H | SI2 | 0.500000 | 0.288462 | 0.131427 | 0.572626 | 0.103905 | 0.117610 |
53939 | 53940 | 0.114345 | ideal | D | SI2 | 0.533333 | 0.230769 | 0.131427 | 0.542831 | 0.099660 | 0.114465 |
53940 rows × 11 columns
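With its default feature range of 0 to 1, `MinMaxScaler` maps each value to `(value - min) / (max - min)` per column. The plain-Python sketch below reproduces the first scaled `carat` and the third scaled `price` from the table, using the minima and maxima from the `describe()` output. The helper names `fit_min_max` and `min_max_scale` are illustrative assumptions.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("cannot scale a constant column")
    return lo, hi


def min_max_scale(value, lo, hi):
    """Map value linearly so that lo becomes 0.0 and hi becomes 1.0."""
    return (value - lo) / (hi - lo)


# Column minimum and maximum taken from the describe() output for the full dataset.
carat_lo, carat_hi = fit_min_max([0.20, 5.01])
price_lo, price_hi = fit_min_max([326, 18823])

print(round(min_max_scale(0.23, carat_lo, carat_hi), 6))  # 0.006237
print(round(min_max_scale(327, price_lo, price_hi), 6))   # 5.4e-05
```

A common variation, which the cell above does not do, is to compute the minimum and maximum on the training split only and reuse them for the test and dev splits, so that no information from held-out rows enters the scaling.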
# Remove artifacts
diamonds = diamonds.dropna()  # drop rows that contain at least one NULL or NaN value
diamonds
 | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 53936 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 53937 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 53938 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 53939 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 53940 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 11 columns
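The row count stays at 53940 after `dropna()`, which suggests the dataset has no missing values. The `describe()` output does show a minimum of 0.0 for `x`, `y` and `z`, though, and a zero dimension is not physically meaningful, so such rows may be the more relevant artifacts. The plain-Python sketch below illustrates both filters on a few rows; all rows except the first are hypothetical, and the helper names are illustrative assumptions.

```python
import math


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows):
    """Keep rows without missing values (the analogue of DataFrame.dropna())."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


def drop_zero_dimensions(rows, columns=("x", "y", "z")):
    """Keep rows whose dimensions are all strictly positive."""
    return [row for row in rows if all(row[c] > 0 for c in columns)]


rows = [
    {"carat": 0.23, "x": 3.95, "y": 3.98, "z": 2.43},          # first row of the preview
    {"carat": 1.00, "x": 0.0, "y": 6.50, "z": 4.00},           # hypothetical zero dimension
    {"carat": 0.50, "x": 5.10, "y": float("nan"), "z": 3.10},  # hypothetical missing value
    {"carat": None, "x": 4.00, "y": 4.00, "z": 2.50},          # hypothetical missing value
]

complete = drop_incomplete(rows)        # removes the two rows with missing values
clean = drop_zero_dimensions(complete)  # additionally removes the zero-dimension row
print(len(rows), len(complete), len(clean))
```

In pandas the second filter could be written as `diamonds[(diamonds[['x', 'y', 'z']] > 0).all(axis=1)]`; whether to drop or repair such rows is a modelling decision the notebook does not make.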