ium_464962/ium_01.ipynb
Krzysztof Raczyński 3dbe94f3bc added ium01
2024-03-19 15:42:16 +01:00

Installing and importing libraries

!pip install kaggle
!pip install pandas
Requirement already satisfied: kaggle in c:\users\krzys\anaconda3\lib\site-packages (1.6.6)
Requirement already satisfied: bleach in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (6.1.0)
Requirement already satisfied: python-dateutil in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.9.0.post0)
Requirement already satisfied: six>=1.10 in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: requests in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.31.0)
Requirement already satisfied: certifi in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2024.2.2)
Requirement already satisfied: tqdm in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (4.66.2)
Requirement already satisfied: python-slugify in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (8.0.4)
Requirement already satisfied: urllib3 in c:\users\krzys\anaconda3\lib\site-packages (from kaggle) (2.2.1)
Requirement already satisfied: webencodings in c:\users\krzys\anaconda3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\krzys\anaconda3\lib\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\krzys\anaconda3\lib\site-packages (from requests->kaggle) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in c:\users\krzys\anaconda3\lib\site-packages (from requests->kaggle) (3.6)
Requirement already satisfied: colorama in c:\users\krzys\anaconda3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Requirement already satisfied: pandas in c:\users\krzys\anaconda3\lib\site-packages (1.4.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.18.5 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (1.21.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\krzys\anaconda3\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in c:\users\krzys\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

Downloading the dataset

!kaggle datasets download -d syedanwarafridi/vehicle-sales-data
vehicle-sales-data.zip: Skipping, found more recently modified local copy (use --force to force download)
#conda install git pip
#!pip install unzip
!unzip -o vehicle-sales-data.zip
Archive:  vehicle-sales-data.zip
  inflating: car_prices.csv          
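
The commented-out install lines above suggest that the `unzip` command was not available out of the box on this machine. As a hedged alternative, the archive can be extracted with Python's standard `zipfile` module, which needs no external tool. The helper name `extract_archive` is made up for this sketch and is not part of the notebook.

```python
import zipfile


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper: a stand-in for `!unzip -o <archive>`; existing files
    with the same names are overwritten, as with the -o flag.
    """
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

In the notebook this would be called as `extract_archive('vehicle-sales-data.zip')`, after which `car_prices.csv` should be present in the working directory.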

Describing and cleaning the data

df = pd.read_csv('car_prices.csv')
df.head()
year make model trim body transmission vin state condition odometer color interior seller mmr sellingprice saledate
0 2015 Kia Sorento LX SUV automatic 5xyktca69fg566472 ca 5.0 16639.0 white black kia motors america inc 20500.0 21500.0 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
1 2015 Kia Sorento LX SUV automatic 5xyktca69fg561319 ca 5.0 9393.0 white beige kia motors america inc 20800.0 21500.0 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
2 2014 BMW 3 Series 328i SULEV Sedan automatic wba3c1c51ek116351 ca 45.0 1331.0 gray black financial services remarketing (lease) 31900.0 30000.0 Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
3 2015 Volvo S60 T5 Sedan automatic yv1612tb4f1310987 ca 41.0 14282.0 white black volvo na rep/world omni 27500.0 27750.0 Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
4 2014 BMW 6 Series Gran Coupe 650i Sedan automatic wba6b2c57ed129731 ca 43.0 2641.0 gray black financial services remarketing (lease) 66000.0 67000.0 Thu Dec 18 2014 12:30:00 GMT-0800 (PST)
df.shape
(558837, 16)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558837 entries, 0 to 558836
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   year          558837 non-null  int64  
 1   make          548536 non-null  object 
 2   model         548438 non-null  object 
 3   trim          548186 non-null  object 
 4   body          545642 non-null  object 
 5   transmission  493485 non-null  object 
 6   vin           558833 non-null  object 
 7   state         558837 non-null  object 
 8   condition     547017 non-null  float64
 9   odometer      558743 non-null  float64
 10  color         558088 non-null  object 
 11  interior      558088 non-null  object 
 12  seller        558837 non-null  object 
 13  mmr           558799 non-null  float64
 14  sellingprice  558825 non-null  float64
 15  saledate      558825 non-null  object 
dtypes: float64(4), int64(1), object(11)
memory usage: 68.2+ MB
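
According to `df.info()`, `saledate` is stored as `object`, that is, as plain strings such as `Tue Dec 16 2014 12:30:00 GMT-0800 (PST)`. The sketch below shows one way such a string could be turned into a timezone-aware `datetime` using only the standard library. It assumes every value follows the layout visible in `df.head()`, which the notebook does not verify, and the function name `parse_saledate` is hypothetical.

```python
from datetime import datetime


def parse_saledate(raw):
    """Parse a string like 'Tue Dec 16 2014 12:30:00 GMT-0800 (PST)'.

    Assumption: the value ends with a parenthesised zone abbreviation,
    which is dropped because the numeric offset already carries the zone.
    """
    core = raw.split(" (")[0]  # 'Tue Dec 16 2014 12:30:00 GMT-0800'
    return datetime.strptime(core, "%a %b %d %Y %H:%M:%S GMT%z")


example_date = parse_saledate("Tue Dec 16 2014 12:30:00 GMT-0800 (PST)")
```

The function could be applied row by row with `df['saledate'].map(parse_saledate)`; rows that deviate from the assumed layout would raise a `ValueError` and need separate handling.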
df.describe(include='all')
year make model trim body transmission vin state condition odometer color interior seller mmr sellingprice saledate
count 558837.000000 548536 548438 548186 545642 493485 558833 558837 547017.000000 558743.000000 558088 558088 558837 558799.000000 558825.000000 558825
unique NaN 96 973 1963 87 4 550297 64 NaN NaN 46 17 14263 NaN NaN 3766
top NaN Ford Altima Base Sedan automatic automatic fl NaN NaN black black nissan-infiniti lt NaN NaN Tue Feb 10 2015 01:30:00 GMT-0800 (PST)
freq NaN 93554 19349 55817 199437 475915 22 82945 NaN NaN 110970 244329 19693 NaN NaN 5334
mean 2010.038927 NaN NaN NaN NaN NaN NaN NaN 30.672365 68320.017767 NaN NaN NaN 13769.377495 13611.358810 NaN
std 3.966864 NaN NaN NaN NaN NaN NaN NaN 13.402832 53398.542821 NaN NaN NaN 9679.967174 9749.501628 NaN
min 1982.000000 NaN NaN NaN NaN NaN NaN NaN 1.000000 1.000000 NaN NaN NaN 25.000000 1.000000 NaN
25% 2007.000000 NaN NaN NaN NaN NaN NaN NaN 23.000000 28371.000000 NaN NaN NaN 7100.000000 6900.000000 NaN
50% 2012.000000 NaN NaN NaN NaN NaN NaN NaN 35.000000 52254.000000 NaN NaN NaN 12250.000000 12100.000000 NaN
75% 2013.000000 NaN NaN NaN NaN NaN NaN NaN 42.000000 99109.000000 NaN NaN NaN 18300.000000 18200.000000 NaN
max 2015.000000 NaN NaN NaN NaN NaN NaN NaN 49.000000 999999.000000 NaN NaN NaN 182000.000000 230000.000000 NaN
df = df.dropna()
df.shape
(472325, 16)
df.isna().sum()
year            0
make            0
model           0
trim            0
body            0
transmission    0
vin             0
state           0
condition       0
odometer        0
color           0
interior        0
seller          0
mmr             0
sellingprice    0
saledate        0
dtype: int64
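
`dropna()` removes 86,512 rows here (558,837 before, 472,325 after), and the `df.info()` output indicates that `transmission` alone is missing in 65,352 rows. A small per-column report makes that kind of imbalance visible before deciding to drop rows. The following is a minimal sketch on a toy frame; `missing_report` and the toy data are illustrative and not part of the notebook.

```python
import pandas as pd


def missing_report(frame):
    """Return missing-value counts and shares for columns that have any gaps."""
    missing = frame.isna().sum()
    report = pd.DataFrame({"missing": missing, "share": missing / len(frame)})
    # Keep only columns with gaps, worst offenders first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the raw car_prices.csv frame.
toy_cars = pd.DataFrame({
    "year": [2015, 2014, 2014, 2013],
    "make": ["Kia", None, "BMW", "Ford"],
    "transmission": ["automatic", None, None, "manual"],
})
toy_report = missing_report(toy_cars)
```

Run against the raw frame (before `df = df.dropna()`), such a report would show which columns drive the row loss and whether dropping a single column might preserve more data than dropping rows.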
df['body'] = df['body'].replace({'sedan': 'Sedan', 'Suv': 'SUV', 'suv': 'SUV'})
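
The replacements above cover only the `Sedan` and `SUV` spellings. Other labels in `body` (and in columns such as `make`) may have case variants as well; the notebook does not check this. The sketch below shows one possible way to find labels that differ only by letter case and map them to their most frequent spelling. Both function names and the toy series are hypothetical.

```python
import pandas as pd


def find_case_variants(values):
    """Group labels that are equal when lowercased; keep groups with >1 spelling."""
    groups = {}
    # value_counts() sorts by frequency, so the most common spelling comes first.
    for label in values.dropna().value_counts().index:
        groups.setdefault(str(label).lower(), []).append(label)
    return {key: labels for key, labels in groups.items() if len(labels) > 1}


def unify_case_variants(values):
    """Replace every rarer spelling with the most frequent one in its group."""
    mapping = {}
    for labels in find_case_variants(values).values():
        canonical = labels[0]
        for other in labels[1:]:
            mapping[other] = canonical
    return values.replace(mapping) if mapping else values.copy()


toy_body = pd.Series(["Sedan", "sedan", "SUV", "suv", "Suv", "SUV", "Sedan", "Coupe"])
unified_body = unify_case_variants(toy_body)
```

Picking the most frequent spelling as canonical is a design choice of this sketch; a fixed casing rule would be an equally reasonable alternative.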
numeric_columns = df.select_dtypes(include=['int', 'float']).columns
scaler = MinMaxScaler(feature_range=(0, 1))

df_scaled = df.copy()
df_scaled[numeric_columns] = scaler.fit_transform(df[numeric_columns])
df_scaled.head()
year make model trim body transmission vin state condition odometer color interior seller mmr sellingprice saledate
0 1.00 Kia Sorento LX SUV automatic 5xyktca69fg566472 ca 0.083333 0.016638 white black kia motors america inc 0.112515 0.093474 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
1 1.00 Kia Sorento LX SUV automatic 5xyktca69fg561319 ca 0.083333 0.009392 white beige kia motors america inc 0.114164 0.093474 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
2 0.96 BMW 3 Series 328i SULEV Sedan automatic wba3c1c51ek116351 ca 0.916667 0.001330 gray black financial services remarketing (lease) 0.175161 0.130431 Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
3 1.00 Volvo S60 T5 Sedan automatic yv1612tb4f1310987 ca 0.833333 0.014281 white black volvo na rep/world omni 0.150982 0.120648 Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
4 0.96 BMW 6 Series Gran Coupe 650i Sedan automatic wba6b2c57ed129731 ca 0.875000 0.002640 gray black financial services remarketing (lease) 0.362550 0.291301 Thu Dec 18 2014 12:30:00 GMT-0800 (PST)

Splitting the data into subsets

# 80% of the rows go to the training set
car_train, car_dev_test = train_test_split(df, random_state=0, train_size=0.8)
# the remaining 20% is split evenly into dev and test sets
car_dev, car_test = train_test_split(car_dev_test, random_state=0, train_size=0.5)
print(car_train.shape)
print(car_dev.shape)
print(car_test.shape)
(377860, 16)
(47232, 16)
(47233, 16)
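
Two details of the cells above are worth noting: the `MinMaxScaler` is fitted on the whole frame before splitting, and the split itself is performed on the unscaled `df`, so `df_scaled` is not used afterwards. A common alternative is to split first and fit the scaler on the training part only, so that the dev and test ranges do not influence the scaling. The sketch below illustrates that order on toy data; `split_and_scale` is a hypothetical helper, not the notebook's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_and_scale(frame, numeric_columns, seed=0):
    """Split 80/10/10, then scale numeric columns with a scaler fitted on train only."""
    train, rest = train_test_split(frame, random_state=seed, train_size=0.8)
    dev, test = train_test_split(rest, random_state=seed, train_size=0.5)

    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(train[numeric_columns])  # statistics come from the training rows only

    scaled_parts = []
    for part in (train, dev, test):
        part = part.copy()
        part[numeric_columns] = scaler.transform(part[numeric_columns])
        scaled_parts.append(part)
    return tuple(scaled_parts)


# Toy frame standing in for the cleaned car data.
toy_frame = pd.DataFrame({
    "odometer": [float(v) for v in range(10, 110, 10)],
    "make": ["Kia"] * 10,
})
toy_train, toy_dev, toy_test = split_and_scale(toy_frame, ["odometer"])
```

With this order, values in the dev and test parts can fall slightly outside the 0 to 1 range when they exceed the training minimum or maximum, which is expected behaviour rather than an error.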

Dataset statistics

df['make'].value_counts().head(10)
Ford         81013
Chevrolet    54150
Nissan       44043
Toyota       35313
Dodge        27181
Honda        24781
Hyundai      18659
BMW          17509
Kia          15828
Chrysler     15133
Name: make, dtype: int64
df['body'].value_counts().head(10)
Sedan          211298
SUV            120968
Hatchback       19351
Minivan         18305
Coupe           13121
Wagon           12023
Crew Cab        11508
Convertible      7725
SuperCrew        6195
G Sedan          5644
Name: body, dtype: int64
df['transmission'].value_counts()
automatic    455963
manual        16362
Name: transmission, dtype: int64