ium_464914/IUM_2.ipynb

%pip install --user kaggle 
%pip install --user pandas
%pip install --user scikit-learn
%pip install --user matplotlib
%pip install --user geopandas
Requirement already satisfied: kaggle in \\\\files\students\s464914\.appdata\python\python310\site-packages (1.6.6)
Requirement already satisfied: six>=1.10 in c:\software\python3\lib\site-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in c:\software\python3\lib\site-packages (from kaggle) (2023.7.22)
Requirement already satisfied: python-dateutil in c:\software\python3\lib\site-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in c:\software\python3\lib\site-packages (from kaggle) (2.31.0)
Requirement already satisfied: tqdm in c:\software\python3\lib\site-packages (from kaggle) (4.66.1)
Requirement already satisfied: python-slugify in \\\\files\students\s464914\.appdata\python\python310\site-packages (from kaggle) (8.0.4)
Requirement already satisfied: urllib3 in c:\software\python3\lib\site-packages (from kaggle) (1.26.16)
Requirement already satisfied: bleach in c:\software\python3\lib\site-packages (from kaggle) (6.0.0)
Requirement already satisfied: webencodings in c:\software\python3\lib\site-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\software\python3\lib\site-packages (from requests->kaggle) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in c:\software\python3\lib\site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: colorama in c:\software\python3\lib\site-packages (from tqdm->kaggle) (0.4.6)
Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip is available: 23.0.1 -> 24.0
[notice] To update, run: python3.exe -m pip install --upgrade pip
Requirement already satisfied: pandas in c:\software\python3\lib\site-packages (2.0.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\software\python3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\software\python3\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\software\python3\lib\site-packages (from pandas) (2023.3)
Requirement already satisfied: numpy>=1.21.0 in c:\software\python3\lib\site-packages (from pandas) (1.24.3)
Requirement already satisfied: six>=1.5 in c:\software\python3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: scikit-learn in c:\software\python3\lib\site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in c:\software\python3\lib\site-packages (from scikit-learn) (1.24.3)
Requirement already satisfied: scipy>=1.5.0 in c:\software\python3\lib\site-packages (from scikit-learn) (1.11.2)
Requirement already satisfied: joblib>=1.1.1 in c:\software\python3\lib\site-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\software\python3\lib\site-packages (from scikit-learn) (3.2.0)
Requirement already satisfied: matplotlib in c:\software\python3\lib\site-packages (3.7.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\software\python3\lib\site-packages (from matplotlib) (1.1.0)
Requirement already satisfied: cycler>=0.10 in c:\software\python3\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\software\python3\lib\site-packages (from matplotlib) (4.42.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\software\python3\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\software\python3\lib\site-packages (from matplotlib) (1.24.3)
Requirement already satisfied: packaging>=20.0 in c:\software\python3\lib\site-packages (from matplotlib) (23.1)
Requirement already satisfied: pillow>=6.2.0 in c:\software\python3\lib\site-packages (from matplotlib) (10.0.0)
Requirement already satisfied: pyparsing<3.1,>=2.3.1 in c:\software\python3\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\software\python3\lib\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\software\python3\lib\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: geopandas in \\\\files\students\s464914\.appdata\python\python310\site-packages (0.14.3)
Requirement already satisfied: fiona>=1.8.21 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from geopandas) (1.9.6)
Requirement already satisfied: packaging in c:\software\python3\lib\site-packages (from geopandas) (23.1)
Requirement already satisfied: pandas>=1.4.0 in c:\software\python3\lib\site-packages (from geopandas) (2.0.3)
Requirement already satisfied: pyproj>=3.3.0 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from geopandas) (3.6.1)
Requirement already satisfied: shapely>=1.8.0 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from geopandas) (2.0.3)
Requirement already satisfied: attrs>=19.2.0 in c:\software\python3\lib\site-packages (from fiona>=1.8.21->geopandas) (23.1.0)
Requirement already satisfied: certifi in c:\software\python3\lib\site-packages (from fiona>=1.8.21->geopandas) (2023.7.22)
Requirement already satisfied: click~=8.0 in c:\software\python3\lib\site-packages (from fiona>=1.8.21->geopandas) (8.1.7)
Requirement already satisfied: click-plugins>=1.0 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from fiona>=1.8.21->geopandas) (1.1.1)
Requirement already satisfied: cligj>=0.5 in \\\\files\students\s464914\.appdata\python\python310\site-packages (from fiona>=1.8.21->geopandas) (0.7.2)
Requirement already satisfied: six in c:\software\python3\lib\site-packages (from fiona>=1.8.21->geopandas) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\software\python3\lib\site-packages (from pandas>=1.4.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\software\python3\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in c:\software\python3\lib\site-packages (from pandas>=1.4.0->geopandas) (2023.3)
Requirement already satisfied: numpy>=1.21.0 in c:\software\python3\lib\site-packages (from pandas>=1.4.0->geopandas) (1.24.3)
Requirement already satisfied: colorama in c:\software\python3\lib\site-packages (from click~=8.0->fiona>=1.8.21->geopandas) (0.4.6)
import matplotlib.pyplot as plt 
import pandas as pd
!kaggle datasets download -d uciml/forest-cover-type-dataset
Downloading forest-cover-type-dataset.zip to J:\PycharmProjects\ium_464914
100%|##########| 11.2M/11.2M [00:01<00:00, 9.30MB/s]
!unzip -o forest-cover-type-dataset.zip 
Archive:  forest-cover-type-dataset.zip
  inflating: covtype.csv             

Dataset

data = pd.read_csv("covtype.csv")
data = data.sample(frac=1)  # shuffle all rows (no seed, so the order differs between runs)
data.head(10)
        Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  Hillshade_9am  Hillshade_Noon  Hillshade_3pm  Horizontal_Distance_To_Fire_Points  ...  Soil_Type32  Soil_Type33  Soil_Type34  Soil_Type35  Soil_Type36  Soil_Type37  Soil_Type38  Soil_Type39  Soil_Type40  Cover_Type
318054       2517     271     12                               272                              84                              484            189             244            193                                 162  ...            0            0            0            0            0            0            0            0            0           2
 30504       2959       0      1                               180                              20                             5960            217             236            156                                3960  ...            0            0            0            0            0            0            0            0            0           2
349520       3093      54     19                                42                              -3                              797            227             196             94                                1318  ...            0            0            0            0            0            0            0            0            0           1
365645       2502     330     17                               150                              52                              738            177             216            178                                 510  ...            0            0            0            0            0            0            0            0            0           6
131114       2962       4     13                                95                               7                             4270            202             214            148                                1999  ...            0            0            0            0            0            0            0            0            0           2
385769       3181     119      5                               170                              -1                             2416            228             235            141                                 999  ...            0            0            0            0            0            0            0            0            0           1
161626       2950     270      4                               108                              15                             2053            210             241            170                                2037  ...            0            0            0            0            0            0            0            0            0           2
394880       3051     155     22                               390                              70                             1871            239             236            114                                1510  ...            0            0            0            0            0            0            0            0            0           1
389492       3024     191     16                               785                             110                             3000            218             251            162                                1961  ...            0            1            0            0            0            0            0            0            0           2
 52507       2714     349     18                                67                              20                             1599            184             207            160                                3234  ...            0            0            0            0            0            0            0            0            0           2

10 rows × 55 columns
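
The shuffle above calls `data.sample(frac=1)` without a seed, so the row order (and therefore the index values shown in the preview) changes on every run. The later `train_test_split` calls are seeded, but they still depend on the order the shuffle produced. A minimal sketch of a repeatable shuffle follows; the small frame stands in for `covtype.csv`, and the seed value 1 is an arbitrary assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
toy = pd.DataFrame({
    "Elevation": [2517, 2959, 3093, 2502, 2962],
    "Cover_Type": [2, 2, 1, 6, 2],
})

# Passing random_state makes the permutation repeatable across runs.
shuffled_a = toy.sample(frac=1, random_state=1)
shuffled_b = toy.sample(frac=1, random_state=1)

# Both shuffles contain every row exactly once, in the same order.
print(shuffled_a.index.tolist() == shuffled_b.index.tolist())
```

With a fixed seed here and in both `train_test_split` calls, the three subsets would contain the same rows on every run.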

Splitting into subsets

from sklearn.model_selection import train_test_split
# Hold out 20% of all rows for testing.
forest_train, forest_test = train_test_split(data, test_size=0.2, random_state=1)
# Take 25% of the remaining 80% (20% of all rows) for validation.
forest_train, forest_val = train_test_split(forest_train, test_size=0.25, random_state=1)
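
The two chained calls give a 60/20/20 split, but the second fraction (0.25) is relative to what is left after the first call, which is easy to get wrong when the proportions change. One possible way to make that explicit is a small helper that takes both fractions relative to the whole frame. This is only a sketch: `three_way_split` and its parameter names are hypothetical and not part of the notebook or of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, test_frac=0.2, val_frac=0.2, seed=1):
    """Hypothetical helper: split df into train/val/test by overall fractions."""
    rest, test = train_test_split(df, test_size=test_frac, random_state=seed)
    # val_frac refers to the whole frame, so rescale it to the remaining rows.
    val_share = val_frac / (1.0 - test_frac)
    train, val = train_test_split(rest, test_size=val_share, random_state=seed)
    return train, val, test


# Toy frame used only to demonstrate the helper.
toy = pd.DataFrame({"x": range(100)})
train, val, test = three_way_split(toy)
print(len(train), len(val), len(test))
```

With the default arguments the helper performs the same two calls as the cell above (`test_size=0.2`, then `test_size=0.25`).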

Statistics

Subset sizes

print(f'dataset size: {data.shape}')
print(f'training set size: {forest_train.shape}')
print(f'test set size: {forest_test.shape}')
print(f'validation set size: {forest_val.shape}')
dataset size: (581012, 55)
training set size: (348606, 55)
test set size: (116203, 55)
validation set size: (116203, 55)
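
The row counts printed above follow from how the two fractions compose: 20% of the rows go to the test set, and 25% of the remaining 80% (again 20% of the total) go to validation. The sketch below reproduces the counts with plain arithmetic. It assumes that a fractional test size is rounded up to the next whole row, which is consistent with the numbers in the output.

```python
import math

n_total = 581012                     # rows in covtype.csv, taken from the output above

n_test = math.ceil(0.2 * n_total)    # first split: 20% held out for testing
n_rest = n_total - n_test
n_val = math.ceil(0.25 * n_rest)     # second split: 25% of the remaining rows
n_train = n_rest - n_val

print(n_train, n_val, n_test)
```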
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    581012 non-null  int64
 11  Wilderness_Area2                    581012 non-null  int64
 12  Wilderness_Area3                    581012 non-null  int64
 13  Wilderness_Area4                    581012 non-null  int64
 14  Soil_Type1                          581012 non-null  int64
 15  Soil_Type2                          581012 non-null  int64
 16  Soil_Type3                          581012 non-null  int64
 17  Soil_Type4                          581012 non-null  int64
 18  Soil_Type5                          581012 non-null  int64
 19  Soil_Type6                          581012 non-null  int64
 20  Soil_Type7                          581012 non-null  int64
 21  Soil_Type8                          581012 non-null  int64
 22  Soil_Type9                          581012 non-null  int64
 23  Soil_Type10                         581012 non-null  int64
 24  Soil_Type11                         581012 non-null  int64
 25  Soil_Type12                         581012 non-null  int64
 26  Soil_Type13                         581012 non-null  int64
 27  Soil_Type14                         581012 non-null  int64
 28  Soil_Type15                         581012 non-null  int64
 29  Soil_Type16                         581012 non-null  int64
 30  Soil_Type17                         581012 non-null  int64
 31  Soil_Type18                         581012 non-null  int64
 32  Soil_Type19                         581012 non-null  int64
 33  Soil_Type20                         581012 non-null  int64
 34  Soil_Type21                         581012 non-null  int64
 35  Soil_Type22                         581012 non-null  int64
 36  Soil_Type23                         581012 non-null  int64
 37  Soil_Type24                         581012 non-null  int64
 38  Soil_Type25                         581012 non-null  int64
 39  Soil_Type26                         581012 non-null  int64
 40  Soil_Type27                         581012 non-null  int64
 41  Soil_Type28                         581012 non-null  int64
 42  Soil_Type29                         581012 non-null  int64
 43  Soil_Type30                         581012 non-null  int64
 44  Soil_Type31                         581012 non-null  int64
 45  Soil_Type32                         581012 non-null  int64
 46  Soil_Type33                         581012 non-null  int64
 47  Soil_Type34                         581012 non-null  int64
 48  Soil_Type35                         581012 non-null  int64
 49  Soil_Type36                         581012 non-null  int64
 50  Soil_Type37                         581012 non-null  int64
 51  Soil_Type38                         581012 non-null  int64
 52  Soil_Type39                         581012 non-null  int64
 53  Soil_Type40                         581012 non-null  int64
 54  Cover_Type                          581012 non-null  int64
dtypes: int64(55)
memory usage: 243.8 MB
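
Every column is stored as `int64`, which accounts for the reported memory: 581012 rows × 55 columns × 8 bytes, or about 243.8 MB when a megabyte is counted as 1024² bytes. The `Wilderness_Area*` and `Soil_Type*` columns are 0/1 indicators, so a narrower integer type would hold them just as well. A minimal sketch on a made-up frame, assuming those columns contain only 0 and 1 as in the rows previewed earlier:

```python
import pandas as pd

# Size of the full frame if all 55 columns are int64 (8 bytes per value).
full_bytes = 581012 * 55 * 8
print(round(full_bytes / 1024**2, 1))

# Made-up miniature of the real frame: one continuous and two indicator columns.
toy = pd.DataFrame({
    "Elevation": [2517, 2959, 3093, 2502],
    "Wilderness_Area1": [1, 0, 0, 1],
    "Soil_Type33": [0, 0, 1, 0],
}, dtype="int64")

binary_cols = [c for c in toy.columns
               if c.startswith(("Wilderness_Area", "Soil_Type"))]

before = toy.memory_usage(index=False).sum()
# Store each indicator in one byte instead of eight.
compact = toy.astype({c: "uint8" for c in binary_cols})
after = compact.memory_usage(index=False).sum()

print(before, after)
```

Whether the saving matters depends on the machine; the values themselves are unchanged by the cast.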

Slope

print(f'Mean slope: {data["Slope"].mean()}')
print(f'Maximum slope: {data["Slope"].max()}')
print(f'Minimum slope: {data["Slope"].min()}')
Mean slope: 14.103703537964792
Maximum slope: 66
Minimum slope: 0
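
The same three statistics can be collected in a single call, and grouping by `Cover_Type` shows how they differ between classes. The sketch below uses a made-up five-row frame rather than the real data, so its numbers are illustrative only.

```python
import pandas as pd

# Made-up rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "Slope": [12, 1, 19, 17, 13],
    "Cover_Type": [2, 2, 1, 6, 2],
})

# Mean, maximum and minimum of one column in a single call.
summary = toy["Slope"].agg(["mean", "max", "min"])

# The same statistics computed separately for each cover type.
per_class = toy.groupby("Cover_Type")["Slope"].agg(["mean", "max", "min"])

print(summary)
print(per_class)
```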
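import seaborn as sns

# The ten continuous features (every column before the indicator columns).
features = data.loc[:, 'Elevation':'Horizontal_Distance_To_Fire_Points']

# One boxplot per feature, grouped by cover type, laid out on a 5x2 grid.
plt.figure(figsize=(30, 50))
for i, col in enumerate(features.columns):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x='Cover_Type', y=col, data=data)
    plt.title(col, fontsize=20)

plt.show()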

Normalization

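from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Standardize every column except the Soil_Type* indicators and the target.
# The binary Wilderness_Area* columns are part of this selection.
columns_to_normalize = data.columns[~data.columns.str.startswith('Soil_Type')]
columns_to_normalize = columns_to_normalize.to_list()
columns_to_normalize.remove('Cover_Type')
# The scaler is fitted on the full frame; forest_train, forest_val and forest_test are not modified.
data[columns_to_normalize] = scaler.fit_transform(data[columns_to_normalize])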

data.head(10)
        Elevation     Aspect      Slope  Horizontal_Distance_To_Hydrology  Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  Hillshade_9am  Hillshade_Noon  Hillshade_3pm  Horizontal_Distance_To_Fire_Points  ...  Soil_Type32  Soil_Type33  Soil_Type34  Soil_Type35  Soil_Type36  Soil_Type37  Soil_Type38  Soil_Type39  Soil_Type40  Cover_Type
318054  -1.579964   1.030645  -0.280934                          0.012100                        0.644670                        -1.196821      -0.864631        1.046164       1.318678                           -1.373130  ...            0            0            0            0            0            0            0            0            0           2
 30504  -0.001305  -1.390866  -1.749905                         -0.420741                       -0.453191                         2.315116       0.181321        0.641484       0.351977                            1.495029  ...            0            0            0            0            0            0            0            0            0           2
349520   0.477293  -0.908351   0.653865                         -1.070003                       -0.847735                        -0.996083       0.554876       -1.381919      -1.267901                           -0.500147  ...            0            0            0            0            0            0            0            0            0           1
365645  -1.633538   1.557837   0.386780                         -0.561885                        0.095739                        -1.033922      -1.312896       -0.370218       0.926772                           -1.110329  ...            0            0            0            0            0            0            0            0            0           6
131114   0.009410  -1.355124  -0.147392                         -0.820649                       -0.676194                         1.231264      -0.379010       -0.471388       0.142960                            0.014128  ...            0            0            0            0            0            0            0            0            0           2
385769   0.791596  -0.327546  -1.215734                         -0.467789                       -0.813427                         0.042234       0.592231        0.590899      -0.039929                           -0.741048  ...            0            0            0            0            0            0            0            0            0           1
161626  -0.033449   1.021709  -1.349277                         -0.759486                       -0.538961                        -0.190570      -0.080167        0.894409       0.717756                            0.042825  ...            0            0            0            0            0            0            0            0            0           2
394880   0.327285  -0.005869   1.054494                          0.567265                        0.404513                        -0.307292       1.003141        0.641484      -0.745360                           -0.355153  ...            0            0            0            0            0            0            0            0            0           1
389492   0.230851   0.315808   0.253237                          2.425659                        1.090676                         0.416772       0.218677        1.400260       0.508739                           -0.014568  ...            0            1            0            0            0            0            0            0            0           2
 52507  -0.876353   1.727611   0.520322                         -0.952383                       -0.453191                        -0.481735      -1.051408       -0.825483       0.456485                            0.946771  ...            0            0            0            0            0            0            0            0            0           2

10 rows × 55 columns
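
The normalization cell fits `StandardScaler` on the full `data` frame after the subsets were created, so `forest_train`, `forest_val` and `forest_test` keep their original scale, and the fitted mean and variance include rows that are later used for evaluation. A common alternative is to fit the scaler on the training subset only and reuse its statistics for the other subsets. The sketch below illustrates that on synthetic data; the toy frame, the column subset and the variable names are assumptions for illustration, not the notebook's code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real frame.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Elevation": rng.normal(2950, 280, size=200),
    "Slope": rng.normal(14, 7, size=200),
    "Soil_Type33": rng.integers(0, 2, size=200),   # indicator, left unscaled
    "Cover_Type": rng.integers(1, 8, size=200),    # target, left unscaled
})
continuous = ["Elevation", "Slope"]

train, test = train_test_split(toy, test_size=0.2, random_state=1)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only.
train[continuous] = scaler.fit_transform(train[continuous])
# Apply the same shift and scale to the held-out rows.
test[continuous] = scaler.transform(test[continuous])

print(train[continuous].mean().round(6).tolist())
```

After this step the training columns have mean 0 and unit variance, while the held-out columns are only approximately standardized, since their own statistics were never used.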