ium_487197/ium_lab2.ipynb
2023-03-21 11:45:41 +01:00


# Install packages
!pip install kaggle
!pip install pandas
!pip install unzip
!pip install scikit-learn
!pip install seaborn
Requirement already satisfied: kaggle in ./jupyter_env/lib/python3.10/site-packages (1.5.13)
Requirement already satisfied: requests in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (2.28.2)
Requirement already satisfied: six>=1.10 in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (1.16.0)
Requirement already satisfied: tqdm in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (4.65.0)
Requirement already satisfied: urllib3 in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (1.26.15)
Requirement already satisfied: certifi in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (2022.12.7)
Requirement already satisfied: python-slugify in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (8.0.1)
Requirement already satisfied: python-dateutil in ./jupyter_env/lib/python3.10/site-packages (from kaggle) (2.8.2)
Requirement already satisfied: text-unidecode>=1.3 in ./jupyter_env/lib/python3.10/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: idna<4,>=2.5 in ./jupyter_env/lib/python3.10/site-packages (from requests->kaggle) (3.4)
Requirement already satisfied: charset-normalizer<4,>=2 in ./jupyter_env/lib/python3.10/site-packages (from requests->kaggle) (3.1.0)
Requirement already satisfied: pandas in ./jupyter_env/lib/python3.10/site-packages (1.5.3)
Requirement already satisfied: numpy>=1.21.0 in ./jupyter_env/lib/python3.10/site-packages (from pandas) (1.24.2)
Requirement already satisfied: python-dateutil>=2.8.1 in ./jupyter_env/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./jupyter_env/lib/python3.10/site-packages (from pandas) (2022.7.1)
Requirement already satisfied: six>=1.5 in ./jupyter_env/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Requirement already satisfied: unzip in ./jupyter_env/lib/python3.10/site-packages (1.0.0)
Requirement already satisfied: scikit-learn in ./jupyter_env/lib/python3.10/site-packages (1.2.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./jupyter_env/lib/python3.10/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: numpy>=1.17.3 in ./jupyter_env/lib/python3.10/site-packages (from scikit-learn) (1.24.2)
Requirement already satisfied: joblib>=1.1.1 in ./jupyter_env/lib/python3.10/site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: scipy>=1.3.2 in ./jupyter_env/lib/python3.10/site-packages (from scikit-learn) (1.10.1)
Requirement already satisfied: seaborn in ./jupyter_env/lib/python3.10/site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in ./jupyter_env/lib/python3.10/site-packages (from seaborn) (1.24.2)
Requirement already satisfied: pandas>=0.25 in ./jupyter_env/lib/python3.10/site-packages (from seaborn) (1.5.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in ./jupyter_env/lib/python3.10/site-packages (from seaborn) (3.7.1)
Requirement already satisfied: pillow>=6.2.0 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: fonttools>=4.22.0 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.39.2)
Requirement already satisfied: pyparsing>=2.3.1 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: contourpy>=1.0.1 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.7)
Requirement already satisfied: cycler>=0.10 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.0)
Requirement already satisfied: python-dateutil>=2.7 in ./jupyter_env/lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./jupyter_env/lib/python3.10/site-packages (from pandas>=0.25->seaborn) (2022.7.1)
Requirement already satisfied: six>=1.5 in ./jupyter_env/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
# Download the dataset
!kaggle datasets download -d sohier/crime-in-baltimore
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/user/.kaggle/kaggle.json'
crime-in-baltimore.zip: Skipping, found more recently modified local copy (use --force to force download)
!unzip -o crime-in-baltimore.zip
Archive:  crime-in-baltimore.zip
  inflating: BPD_Part_1_Victim_Based_Crime_Data.csv  
# Look for empty lines in the CSV; grep prints nothing when there are none
!grep -P "^$" -n BPD_Part_1_Victim_Based_Crime_Data.csv
import pandas as pd
baltimore=pd.read_csv('BPD_Part_1_Victim_Based_Crime_Data.csv')
baltimore
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/02/2017 | 23:30:00 | 3JK | 4200 AUDREY AVE | ROBBERY - RESIDENCE | I | KNIFE | 913.0 | SOUTHERN | Brooklyn | -76.60541 | 39.22951 | (39.2295100000, -76.6054100000) | ROW/TOWNHO | 1 |
| 1 | 09/02/2017 | 23:00:00 | 7A | 800 NEWINGTON AVE | AUTO THEFT | O | NaN | 133.0 | CENTRAL | Reservoir Hill | -76.63217 | 39.31360 | (39.3136000000, -76.6321700000) | STREET | 1 |
| 2 | 09/02/2017 | 22:53:00 | 9S | 600 RADNOR AV | SHOOTING | Outside | FIREARM | 524.0 | NORTHERN | Winston-Govans | -76.60697 | 39.34768 | (39.3476800000, -76.6069700000) | Street | 1 |
| 3 | 09/02/2017 | 22:50:00 | 4C | 1800 RAMSAY ST | AGG. ASSAULT | I | OTHER | 934.0 | SOUTHERN | Carrollton Ridge | -76.64526 | 39.28315 | (39.2831500000, -76.6452600000) | ROW/TOWNHO | 1 |
| 4 | 09/02/2017 | 22:31:00 | 4E | 100 LIGHT ST | COMMON ASSAULT | O | HANDS | 113.0 | CENTRAL | Downtown West | -76.61365 | 39.28756 | (39.2875600000, -76.6136500000) | STREET | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 276524 | 01/01/2012 | 00:00:00 | 6J | 1400 JOH AVE | LARCENY | I | NaN | 832.0 | SOUTHWESTERN | Violetville | -76.67195 | 39.26132 | (39.2613200000, -76.6719500000) | OTHER - IN | 1 |
| 276525 | 01/01/2012 | 00:00:00 | 6J | 5500 SINCLAIR LN | LARCENY | O | NaN | 444.0 | NORTHEASTERN | Frankford | -76.53829 | 39.32493 | (39.3249300000, -76.5382900000) | OTHER - OU | 1 |
| 276526 | 01/01/2012 | 00:00:00 | 6E | 400 N PATTERSON PK AV | LARCENY | O | NaN | 321.0 | EASTERN | CARE | -76.58497 | 39.29573 | (39.2957300000, -76.5849700000) | STREET | 1 |
| 276527 | 01/01/2012 | 00:00:00 | 5A | 5800 LILLYAN AV | BURGLARY | I | NaN | 425.0 | NORTHEASTERN | Glenham-Belhar | -76.54578 | 39.34701 | (39.3470100000, -76.5457800000) | APT. LOCKE | 1 |
| 276528 | 01/01/2012 | 00:00:00 | 5A | 1900 GRINNALDS AV | BURGLARY | I | NaN | 831.0 | SOUTHWESTERN | Morrell Park | -76.65094 | 39.26698 | (39.2669800000, -76.6509400000) | ROW/TOWNHO | 1 |

276529 rows × 15 columns

baltimore.isnull().sum()
CrimeDate               0
CrimeTime               0
CrimeCode               0
Location             2207
Description             0
Inside/Outside      10279
Weapon             180952
Post                  224
District               80
Neighborhood         2740
Longitude            2204
Latitude             2204
Location 1           2204
Premise             10757
Total Incidents         0
dtype: int64
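The raw counts above are easier to compare as shares of the table: Weapon is missing in 180952 of 276529 rows, roughly 65%, while every other column is missing in under 4% of rows. A minimal sketch of that view, using a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the crime table; the values are invented for illustration.
toy = pd.DataFrame({
    "Weapon": [None, "FIREARM", None, None],
    "District": ["CENTRAL", "NORTHERN", None, "SOUTHERN"],
    "Post": [113.0, 524.0, 934.0, 913.0],
})

# isnull() gives a boolean frame; the column-wise mean is the share of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

Applied to `baltimore`, the same expression would show the missing share per column instead of the absolute count.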
# Most crimes do not involve a weapon, so we replace
# the empty fields with the string "None"
baltimore["Weapon"] = baltimore["Weapon"].fillna("None")
baltimore.isnull().sum()
CrimeDate              0
CrimeTime              0
CrimeCode              0
Location            2207
Description            0
Inside/Outside     10279
Weapon                 0
Post                 224
District              80
Neighborhood        2740
Longitude           2204
Latitude            2204
Location 1          2204
Premise            10757
Total Incidents        0
dtype: int64
# Clean the dataset of artifacts: drop every row that still has a missing value
baltimore.dropna(inplace=True)
baltimore.isnull().sum()
CrimeDate          0
CrimeTime          0
CrimeCode          0
Location           0
Description        0
Inside/Outside     0
Weapon             0
Post               0
District           0
Neighborhood       0
Longitude          0
Latitude           0
Location 1         0
Premise            0
Total Incidents    0
dtype: int64
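The two cleaning steps behave differently: the missing weapons are filled in and kept, whereas a gap in any other column removes the whole row. In this notebook that takes the table from 276529 rows to the 263118 rows reported further down, so 13411 rows are dropped. A minimal sketch of the same sequence on a made-up toy frame (the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for the crime table; the values are invented for illustration.
toy = pd.DataFrame({
    "Description": ["LARCENY", "SHOOTING", "BURGLARY", "AUTO THEFT"],
    "Weapon": [None, "FIREARM", None, None],
    "District": ["CENTRAL", "NORTHERN", None, "SOUTHERN"],
})

# Step 1: a missing weapon is read as "no weapon used", so the row is kept.
toy["Weapon"] = toy["Weapon"].fillna("None")

# Step 2: rows with a gap in any remaining column are dropped entirely.
cleaned = toy.dropna()

rows_dropped = len(toy) - len(cleaned)
print(cleaned)
print(f"dropped {rows_dropped} of {len(toy)} rows")
```

Only the row with the missing district is removed here; the rows whose weapon was filled in survive.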
from sklearn.model_selection import train_test_split
# Normalization: scale Post by its largest absolute value, lowercase the text columns
baltimore['Post'] = baltimore['Post'] / baltimore['Post'].abs().max()
baltimore['Location'] = baltimore['Location'].str.lower()
baltimore['Description'] = baltimore['Description'].str.lower()
baltimore['Weapon'] = baltimore['Weapon'].str.lower()
baltimore['Premise'] = baltimore['Premise'].str.lower()
baltimore['District'] = baltimore['District'].str.lower()
baltimore['CrimeCode'] = baltimore['CrimeCode'].str.lower()
baltimore['Neighborhood'] = baltimore['Neighborhood'].str.lower()
baltimore['Inside/Outside'] = baltimore['Inside/Outside'].str.lower()
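The normalization above combines two operations: dividing a numeric column by its largest absolute value, which maps it into the range [-1, 1], and lowercasing the text columns. A compact sketch of both on a made-up toy frame; the column list `text_columns` is a name introduced here for illustration, and looping over it is just one possible way to avoid repeating the assignment per column:

```python
import pandas as pd

# Toy stand-in with one numeric and two text columns; values are illustrative.
toy = pd.DataFrame({
    "Post": [913.0, 133.0, 524.0],
    "District": ["SOUTHERN", "CENTRAL", "NORTHERN"],
    "Neighborhood": ["Brooklyn", "Reservoir Hill", "Winston-Govans"],
})

# Max-abs scaling: the largest absolute value becomes 1.0.
toy["Post"] = toy["Post"] / toy["Post"].abs().max()

# Lowercase each listed text column (hypothetical helper list, not from the notebook).
text_columns = ["District", "Neighborhood"]
for column in text_columns:
    toy[column] = toy[column].str.lower()

print(toy)
```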
baltimore['District'].value_counts().plot(kind="bar")
<Axes: >
import seaborn as sns
sns.set_theme()
sns.relplot(data=baltimore[:20], x='Longitude', y='Latitude', hue='Weapon')
<seaborn.axisgrid.FacetGrid at 0x7f9756fab6a0>
# Split into subsets: hold out 10% of the rows as the test set
baltimore_train, baltimore_test = train_test_split(baltimore, test_size=0.1, random_state=1)
baltimore_test
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20700 | 04/10/2017 | 22:26:00 | 4e | 4900 eastern av | common assault | o | hands | 0.256628 | southeastern | greektown | -76.55422 | 39.28706 | (39.2870600000, -76.5542200000) | alley | 1 |
| 63746 | 06/05/2016 | 20:44:00 | 4e | 3000 s hanover st | common assault | o | hands | 0.977731 | southern | middle branch/reedbird pa | -76.61504 | 39.25134 | (39.2513400000, -76.6150400000) | street | 1 |
| 169854 | 03/10/2014 | 20:00:00 | 4e | 4100 parkside dr | common assault | o | hands | 0.447508 | northeastern | belair-parkside | -76.56605 | 39.32783 | (39.3278300000, -76.5660500000) | street | 1 |
| 42473 | 10/31/2016 | 09:30:00 | 4e | 5600 loch raven blvd | common assault | i | hands | 0.440085 | northeastern | loch raven | -76.58856 | 39.35952 | (39.3595200000, -76.5885600000) | hotel/mote | 1 |
| 86103 | 12/05/2015 | 08:15:00 | 4e | 1100 guilford ave | common assault | i | hands | 0.149523 | central | mid-town belvedere | -76.61194 | 39.30319 | (39.3031900000, -76.6119400000) | apt/condo | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 182763 | 11/20/2013 | 20:00:00 | 6d | 3800 dolfield av | larceny from auto | o | none | 0.681866 | northwestern | dolfield | -76.68090 | 39.33938 | (39.3393800000, -76.6809000000) | street | 1 |
| 14972 | 05/22/2017 | 03:30:00 | 4c | 3000 w garrison ave | agg. assault | i | other | 0.651113 | northwestern | central park heights | -76.67146 | 39.34863 | (39.3486300000, -76.6714600000) | row/townho | 1 |
| 44956 | 10/15/2016 | 23:30:00 | 7a | 500 jack st | auto theft | o | none | 0.968187 | southern | brooklyn | -76.60582 | 39.23265 | (39.2326500000, -76.6058200000) | street | 1 |
| 36873 | 12/08/2016 | 18:30:00 | 4e | 3800 cedarhurst rd | common assault | o | hands | 0.451750 | northeastern | waltherson | -76.56315 | 39.33720 | (39.3372000000, -76.5631500000) | street | 1 |
| 230084 | 12/06/2012 | 14:00:00 | 4e | 800 s highland av | common assault | i | hands | 0.246023 | southeastern | canton | -76.56878 | 39.28342 | (39.2834200000, -76.5687800000) | school | 1 |

26312 rows × 15 columns

# Split the remaining 90% again: 25% of it becomes the validation set
baltimore_train, baltimore_val = train_test_split(baltimore_train, test_size=0.25, random_state=1)
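Because the second split is applied to what remains after the first, the effective proportions are 67.5% training, 22.5% validation and 10% test. The arithmetic can be checked against the row counts reported in this notebook. The sketch below is plain Python and does not call scikit-learn; the helper `split_sizes` is hypothetical, and its rounding rule (test count rounded up, the rest goes to training) is an assumption that happens to reproduce the counts shown here:

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out count is rounded up and the remainder is kept,
    which matches the row counts printed in this notebook.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_total = 263118                                  # rows left after cleaning
n_rest, n_test = split_sizes(n_total, 0.1)        # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.25)        # second split: hold out the validation set

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```

The three counts sum back to 263118, so no row ends up in more than one subset or is lost.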
baltimore.describe(include='all')
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 263118 | 263118 | 263118 | 263118 | 263118 | 263118 | 263118 | 263118.000000 | 263118 | 263118 | 263118.000000 | 263118.000000 | 263118 | 263118 | 263118.0 |
| unique | 2072 | 2935 | 80 | 25276 | 15 | 4 | 5 | NaN | 9 | 278 | NaN | NaN | 93543 | 118 | NaN |
| top | 04/27/2015 | 18:00:00 | 4e | 200 e pratt st | larceny | i | none | NaN | northeastern | downtown | NaN | NaN | (39.3180000000, -76.6582100000) | street | NaN |
| freq | 407 | 6483 | 43093 | 632 | 58246 | 131015 | 173175 | NaN | 40842 | 8701 | NaN | NaN | 503 | 102544 | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.536416 | NaN | NaN | -76.617469 | 39.307456 | NaN | NaN | 1.0 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.276554 | NaN | NaN | 0.042220 | 0.029537 | NaN | NaN | 0.0 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.117709 | NaN | NaN | -76.711280 | 39.200410 | NaN | NaN | 1.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.256628 | NaN | NaN | -76.648420 | 39.288340 | NaN | NaN | 1.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.541888 | NaN | NaN | -76.614010 | 39.303680 | NaN | NaN | 1.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.775186 | NaN | NaN | -76.587490 | 39.327890 | NaN | NaN | 1.0 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | -76.529770 | 39.371980 | NaN | NaN | 1.0 |
baltimore_test.describe(include='all')
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26312 | 26312 | 26312 | 26312 | 26312 | 26312 | 26312 | 26312.000000 | 26312 | 26312 | 26312.000000 | 26312.000000 | 26312 | 26312 | 26312.0 |
| unique | 2071 | 1513 | 71 | 11180 | 15 | 4 | 5 | NaN | 9 | 276 | NaN | NaN | 18843 | 104 | NaN |
| top | 04/27/2015 | 18:00:00 | 4e | 1500 russell st | larceny | i | none | NaN | northeastern | downtown | NaN | NaN | (39.3180000000, -76.6582100000) | street | NaN |
| freq | 28 | 650 | 4357 | 56 | 5740 | 13248 | 17358 | NaN | 4137 | 853 | NaN | NaN | 49 | 10075 | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.535663 | NaN | NaN | -76.617518 | 39.307771 | NaN | NaN | 1.0 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.275572 | NaN | NaN | 0.042479 | 0.029477 | NaN | NaN | 0.0 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.117709 | NaN | NaN | -76.711220 | 39.200470 | NaN | NaN | 1.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.257688 | NaN | NaN | -76.648905 | 39.288490 | NaN | NaN | 1.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.541888 | NaN | NaN | -76.614170 | 39.303850 | NaN | NaN | 1.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.766702 | NaN | NaN | -76.587170 | 39.328290 | NaN | NaN | 1.0 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | -76.529770 | 39.371970 | NaN | NaN | 1.0 |
baltimore_train.describe(include='all')
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 177604 | 177604 | 177604 | 177604 | 177604 | 177604 | 177604 | 177604.000000 | 177604 | 177604 | 177604.000000 | 177604.000000 | 177604 | 177604 | 177604.0 |
| unique | 2072 | 2435 | 79 | 22781 | 15 | 4 | 5 | NaN | 9 | 278 | NaN | NaN | 74417 | 116 | NaN |
| top | 04/27/2015 | 18:00:00 | 4e | 200 e pratt st | larceny | i | none | NaN | northeastern | downtown | NaN | NaN | (39.3180000000, -76.6582100000) | street | NaN |
| freq | 298 | 4340 | 29065 | 440 | 39287 | 88319 | 116884 | NaN | 27451 | 5877 | NaN | NaN | 337 | 69325 | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.536132 | NaN | NaN | -76.617452 | 39.307395 | NaN | NaN | 1.0 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.276695 | NaN | NaN | 0.042192 | 0.029526 | NaN | NaN | 0.0 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.117709 | NaN | NaN | -76.711280 | 39.200410 | NaN | NaN | 1.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.256628 | NaN | NaN | -76.648290 | 39.288330 | NaN | NaN | 1.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.541888 | NaN | NaN | -76.613990 | 39.303580 | NaN | NaN | 1.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.775186 | NaN | NaN | -76.587500 | 39.327742 | NaN | NaN | 1.0 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | -76.529770 | 39.371970 | NaN | NaN | 1.0 |
baltimore_val.describe(include='all')
| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | Total Incidents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 59202 | 59202 | 59202 | 59202 | 59202 | 59202 | 59202 | 59202.000000 | 59202 | 59202 | 59202.000000 | 59202.000000 | 59202 | 59202 | 59202.0 |
| unique | 2070 | 1804 | 77 | 16050 | 15 | 4 | 5 | NaN | 9 | 276 | NaN | NaN | 35435 | 112 | NaN |
| top | 04/27/2015 | 18:00:00 | 4e | 200 e pratt st | larceny | i | none | NaN | northeastern | downtown | NaN | NaN | (39.3180000000, -76.6582100000) | street | NaN |
| freq | 81 | 1493 | 9671 | 140 | 13219 | 29448 | 38933 | NaN | 9254 | 1971 | NaN | NaN | 117 | 23144 | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.537601 | NaN | NaN | -76.617499 | 39.307502 | NaN | NaN | 1.0 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.276567 | NaN | NaN | 0.042191 | 0.029595 | NaN | NaN | 0.0 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.117709 | NaN | NaN | -76.711270 | 39.202540 | NaN | NaN | 1.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.257688 | NaN | NaN | -76.648500 | 39.288340 | NaN | NaN | 1.0 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.541888 | NaN | NaN | -76.614020 | 39.303930 | NaN | NaN | 1.0 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.775186 | NaN | NaN | -76.587592 | 39.328030 | NaN | NaN | 1.0 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | NaN | NaN | -76.529770 | 39.371980 | NaN | NaN | 1.0 |