Installing and importing libraries
!pip install kaggle
!pip install pandas
Requirement already satisfied: kaggle in c:\users\krzys\anaconda3\lib\site-packages (1.6.6)
Requirement already satisfied: bleach in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (6.1.0)
Requirement already satisfied: python-dateutil in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.9.0.post0)
Requirement already satisfied: six>=1.10 in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: requests in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.31.0)
Requirement already satisfied: certifi in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2024.2.2)
Requirement already satisfied: tqdm in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (4.66.2)
Requirement already satisfied: python-slugify in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (8.0.4)
Requirement already satisfied: urllib3 in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.2.1)
Requirement already satisfied: webencodings in c:\users\krzys\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\krzys\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\krzys\anaconda3\lib\site-packages (from requests->kaggle) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\krzys\anaconda3\lib\site-packages (from requests->kaggle) (3.6)
Requirement already satisfied: colorama in c:\users\krzys\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Requirement already satisfied: pandas in c:\users\krzys\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.18.5 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in c:\users\krzys\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
Downloading the dataset
!kaggle datasets download -d syedanwarafridi/vehicle-sales-data
vehicle-sales-data.zip: Skipping, found more recently modified local copy (use --force to force download)
#conda install git pip
#!pip install unzip
!unzip -o vehicle-sales-data.zip
Archive: vehicle-sales-data.zip
  inflating: car_prices.csv
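The `unzip` command is not available on every system (the commented-out install attempts above hint at that). As a hedged alternative, the archive can be extracted with Python's standard-library `zipfile` module. This is only a sketch: the helper name `extract_archive` is hypothetical, and the usage line assumes the archive sits in the current working directory.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Existing files are overwritten, similar to `unzip -o`
        archive.extractall(target)
        return archive.namelist()


# Usage (assumes the downloaded archive is in the current directory):
# extract_archive("vehicle-sales-data.zip")
```

Because it relies only on the standard library, this variant behaves the same on Windows, Linux, and macOS.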
Describing and cleaning the data
df = pd.read_csv('car_prices.csv')
df.head()
| | year | make | model | trim | body | transmission | vin | state | condition | odometer | color | interior | seller | mmr | sellingprice | saledate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015 | Kia | Sorento | LX | SUV | automatic | 5xyktca69fg566472 | ca | 5.0 | 16639.0 | white | black | kia motors america inc | 20500.0 | 21500.0 | Tue Dec 16 2014 12:30:00 GMT-0800 (PST) |
1 | 2015 | Kia | Sorento | LX | SUV | automatic | 5xyktca69fg561319 | ca | 5.0 | 9393.0 | white | beige | kia motors america inc | 20800.0 | 21500.0 | Tue Dec 16 2014 12:30:00 GMT-0800 (PST) |
2 | 2014 | BMW | 3 Series | 328i SULEV | Sedan | automatic | wba3c1c51ek116351 | ca | 45.0 | 1331.0 | gray | black | financial services remarketing (lease) | 31900.0 | 30000.0 | Thu Jan 15 2015 04:30:00 GMT-0800 (PST) |
3 | 2015 | Volvo | S60 | T5 | Sedan | automatic | yv1612tb4f1310987 | ca | 41.0 | 14282.0 | white | black | volvo na rep/world omni | 27500.0 | 27750.0 | Thu Jan 29 2015 04:30:00 GMT-0800 (PST) |
4 | 2014 | BMW | 6 Series Gran Coupe | 650i | Sedan | automatic | wba6b2c57ed129731 | ca | 43.0 | 2641.0 | gray | black | financial services remarketing (lease) | 66000.0 | 67000.0 | Thu Dec 18 2014 12:30:00 GMT-0800 (PST) |
df.shape
(558837, 16)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558837 entries, 0 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   year          558837 non-null  int64
 1   make          548536 non-null  object
 2   model         548438 non-null  object
 3   trim          548186 non-null  object
 4   body          545642 non-null  object
 5   transmission  493485 non-null  object
 6   vin           558833 non-null  object
 7   state         558837 non-null  object
 8   condition     547017 non-null  float64
 9   odometer      558743 non-null  float64
 10  color         558088 non-null  object
 11  interior      558088 non-null  object
 12  seller        558837 non-null  object
 13  mmr           558799 non-null  float64
 14  sellingprice  558825 non-null  float64
 15  saledate      558825 non-null  object
dtypes: float64(4), int64(1), object(11)
memory usage: 68.2+ MB
df.describe(include='all')
| | year | make | model | trim | body | transmission | vin | state | condition | odometer | color | interior | seller | mmr | sellingprice | saledate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 558837.000000 | 548536 | 548438 | 548186 | 545642 | 493485 | 558833 | 558837 | 547017.000000 | 558743.000000 | 558088 | 558088 | 558837 | 558799.000000 | 558825.000000 | 558825 |
unique | NaN | 96 | 973 | 1963 | 87 | 4 | 550297 | 64 | NaN | NaN | 46 | 17 | 14263 | NaN | NaN | 3766 |
top | NaN | Ford | Altima | Base | Sedan | automatic | automatic | fl | NaN | NaN | black | black | nissan-infiniti lt | NaN | NaN | Tue Feb 10 2015 01:30:00 GMT-0800 (PST) |
freq | NaN | 93554 | 19349 | 55817 | 199437 | 475915 | 22 | 82945 | NaN | NaN | 110970 | 244329 | 19693 | NaN | NaN | 5334 |
mean | 2010.038927 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.672365 | 68320.017767 | NaN | NaN | NaN | 13769.377495 | 13611.358810 | NaN |
std | 3.966864 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.402832 | 53398.542821 | NaN | NaN | NaN | 9679.967174 | 9749.501628 | NaN |
min | 1982.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 1.000000 | NaN | NaN | NaN | 25.000000 | 1.000000 | NaN |
25% | 2007.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 23.000000 | 28371.000000 | NaN | NaN | NaN | 7100.000000 | 6900.000000 | NaN |
50% | 2012.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35.000000 | 52254.000000 | NaN | NaN | NaN | 12250.000000 | 12100.000000 | NaN |
75% | 2013.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 42.000000 | 99109.000000 | NaN | NaN | NaN | 18300.000000 | 18200.000000 | NaN |
max | 2015.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 49.000000 | 999999.000000 | NaN | NaN | NaN | 182000.000000 | 230000.000000 | NaN |
df = df.dropna()
df.shape
(472325, 16)
df.isna().sum()
year            0
make            0
model           0
trim            0
body            0
transmission    0
vin             0
state           0
condition       0
odometer        0
color           0
interior        0
seller          0
mmr             0
sellingprice    0
saledate        0
dtype: int64
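`dropna()` reduced the frame from 558837 to 472325 rows, so 86512 rows (about 15.5%) were discarded. According to the `df.info()` output, most of the gaps come from `transmission` (493485 non-null values out of 558837). Before dropping rows it can help to see how the missing values are distributed across columns. The sketch below is one way to do that; `missing_report` is a hypothetical helper and the frame is toy data, not the real file.

```python
import pandas as pd


def missing_report(frame):
    """Return per-column counts and shares of missing values, largest first."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),   # absolute number of NaN values
        "share": frame.isna().mean(),    # fraction of rows that are NaN
    })
    return report.sort_values("missing", ascending=False)


# Toy frame standing in for the raw data (values are made up)
toy = pd.DataFrame({
    "transmission": ["automatic", None, "manual", None],
    "body": ["Sedan", "SUV", None, "Sedan"],
    "year": [2014, 2015, 2013, 2012],
})
print(missing_report(toy))
```

On the real data this would be called on the frame as loaded by `pd.read_csv`, before `df.dropna()`.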
df['body'] = df['body'].replace({'sedan': 'Sedan', 'Suv': 'SUV', 'suv': 'SUV'})
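The replacements above cover three known spellings. If other columns or values also differ only by letter case, a more general approach is to map each group of case variants to its most frequent spelling. The following is a sketch under that assumption; `unify_case_variants` is a hypothetical helper, the sample values are made up, and the function expects a string column without missing values (as is the case after `dropna()`).

```python
import pandas as pd


def unify_case_variants(series):
    """Map spellings that differ only by letter case to the most frequent one."""
    counts = series.value_counts()  # sorted by frequency, descending
    canonical = {}
    for spelling in counts.index:
        # The first (most frequent) spelling seen for a lowercase key wins
        canonical.setdefault(spelling.lower(), spelling)
    return series.map(lambda value: canonical[value.lower()])


# Toy data, not taken from the real file
body = pd.Series(["Sedan", "sedan", "Sedan", "SUV", "suv", "Suv", "SUV", "Wagon"])
print(unify_case_variants(body).value_counts())
```

Unlike `str.title()` or `str.upper()`, this keeps mixed-case labels such as `SUV` and `Sedan` in whatever form already dominates the column.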
numeric_columns = df.select_dtypes(include=['int', 'float']).columns
scaler = MinMaxScaler(feature_range=(0, 1))
df_scaled = df.copy()
df_scaled[numeric_columns] = scaler.fit_transform(df[numeric_columns])
df_scaled.head()
| | year | make | model | trim | body | transmission | vin | state | condition | odometer | color | interior | seller | mmr | sellingprice | saledate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.00 | Kia | Sorento | LX | SUV | automatic | 5xyktca69fg566472 | ca | 0.083333 | 0.016638 | white | black | kia motors america inc | 0.112515 | 0.093474 | Tue Dec 16 2014 12:30:00 GMT-0800 (PST) |
1 | 1.00 | Kia | Sorento | LX | SUV | automatic | 5xyktca69fg561319 | ca | 0.083333 | 0.009392 | white | beige | kia motors america inc | 0.114164 | 0.093474 | Tue Dec 16 2014 12:30:00 GMT-0800 (PST) |
2 | 0.96 | BMW | 3 Series | 328i SULEV | Sedan | automatic | wba3c1c51ek116351 | ca | 0.916667 | 0.001330 | gray | black | financial services remarketing (lease) | 0.175161 | 0.130431 | Thu Jan 15 2015 04:30:00 GMT-0800 (PST) |
3 | 1.00 | Volvo | S60 | T5 | Sedan | automatic | yv1612tb4f1310987 | ca | 0.833333 | 0.014281 | white | black | volvo na rep/world omni | 0.150982 | 0.120648 | Thu Jan 29 2015 04:30:00 GMT-0800 (PST) |
4 | 0.96 | BMW | 6 Series Gran Coupe | 650i | Sedan | automatic | wba6b2c57ed129731 | ca | 0.875000 | 0.002640 | gray | black | financial services remarketing (lease) | 0.362550 | 0.291301 | Thu Dec 18 2014 12:30:00 GMT-0800 (PST) |
Splitting the data into subsets
# 80% train; the remaining 20% is split evenly into dev and test (10% each)
car_train, car_dev_test = train_test_split(df, random_state=0, train_size=0.8)
car_dev, car_test = train_test_split(car_dev_test, random_state=0, train_size=0.5)
print(car_train.shape)
print(car_dev.shape)
print(car_test.shape)
(377860, 16)
(47232, 16)
(47233, 16)
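Note that the cells above fit `MinMaxScaler` on the full DataFrame, while the split is taken from the unscaled `df`. A common alternative, shown here only as a sketch and not as the notebook's method, is to split first, fit the scaler on the training part, and reuse the fitted scaler for the other parts, so that dev and test values do not influence the scaling. The helper `scale_splits` is hypothetical and the frame is toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def scale_splits(train, *others):
    """Fit MinMaxScaler on `train` only and apply it to every split."""
    numeric_columns = train.select_dtypes(include="number").columns
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(train[numeric_columns])
    scaled = []
    for part in (train, *others):
        part = part.copy()  # leave the caller's frames untouched
        part[numeric_columns] = scaler.transform(part[numeric_columns])
        scaled.append(part)
    return scaled


# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2009, 2008, 2007, 2006],
    "odometer": [90000.0, 80000.0, 70000.0, 60000.0, 50000.0,
                 40000.0, 95000.0, 99000.0, 120000.0, 130000.0],
    "make": ["Kia", "BMW", "Ford", "Kia", "BMW", "Ford", "Kia", "BMW", "Ford", "Kia"],
})
toy_train, toy_rest = train_test_split(toy, random_state=0, train_size=0.8)
toy_train_scaled, toy_rest_scaled = scale_splits(toy_train, toy_rest)
```

With this approach the training columns span exactly [0, 1], while values in the other splits may fall slightly outside that range when they exceed the training minimum or maximum.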
Dataset statistics
df['make'].value_counts().head(10)
Ford         81013
Chevrolet    54150
Nissan       44043
Toyota       35313
Dodge        27181
Honda        24781
Hyundai      18659
BMW          17509
Kia          15828
Chrysler     15133
Name: make, dtype: int64
df['body'].value_counts().head(10)
Sedan          211298
SUV            120968
Hatchback       19351
Minivan         18305
Coupe           13121
Wagon           12023
Crew Cab        11508
Convertible      7725
SuperCrew        6195
G Sedan          5644
Name: body, dtype: int64
df['transmission'].value_counts()
automatic    455963
manual        16362
Name: transmission, dtype: int64