{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Machine Learning Engineering </h1>\n",
"<h2> 2. <i>Data</i> [labs]</h2> \n",
"<h3> Tomasz Ziętkiewicz (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Today's plan\n",
"1. Motivation\n",
"2. Splitting the data\n",
"3. Where to get data?\n",
"4. Data preparation\n",
"5. Assignment"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Motivation\n",
"- The \"garbage in, garbage out\" principle\n",
"- The better the quality of the data, the better the model\n",
"- The best architecture, the most powerful compute resources, and the most sophisticated methods will not help if the data used to develop the model does not match the data it will be used on, or if the data contains no dependencies to learn\n",
"- If the data is poorly chosen, we can waste a lot of time, energy, and resources optimizing our model in the wrong direction"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data sources\n",
"- Ready-made datasets:\n",
" - Open challenges\n",
" - Repositories of open datasets\n",
" - Data released by companies\n",
" - Repositories of commercial datasets\n",
" - Internal data (e.g. a company's own)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data sources\n",
"- Creating data:\n",
" - Synthetic generation\n",
" - e.g. generating speech corpora with TTS (speech synthesis)\n",
" - Crowdsourcing\n",
" - Data scraping"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Open challenges (shared tasks)\n",
"- Kaggle: https://www.kaggle.com/datasets\n",
"- Gonito: https://gonito.net/list-challenges - a Polish counterpart of Kaggle (from Poznań, developed at UAM)\n",
"- SemEval: https://semeval.github.io/ - semantics tasks\n",
"- PolEval: http://poleval.pl/ - Polish language processing\n",
"- WMT: http://www.statmt.org/wmt20/ (machine translation)\n",
"- IWSLT: https://iwslt.org/2021/#shared-tasks (speech translation)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Repositories and search engines for open datasets\n",
"- Papers with Code: https://paperswithcode.com/datasets\n",
"- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/ (University of California)\n",
"- Google Dataset Search: https://datasetsearch.research.google.com/\n",
"- Google datasets: https://research.google/tools/datasets/\n",
"- Open datasets on Amazon AWS: https://registry.opendata.aws/\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Open datasets\n",
"- Speech recognition:\n",
" - https://www.openslr.org/ - LibriSpeech, TED-LIUM\n",
" - Mozilla Common Voice: https://commonvoice.mozilla.org/\n",
"- NLP:\n",
" - CLARIN-PL: https://clarin-pl.eu/index.php/zasoby/\n",
" - NKJP: http://nkjp.pl/\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Crowdsourcing\n",
"- reCAPTCHA\n",
"<img src=\"img/ReCAPTCHA_idea.jpg\">\n",
"<img src=\"img/cat_captcha.png\">\n",
"\n",
"<sub>Source: https://pl.wikipedia.org/wiki/ReCAPTCHA#/media/Plik:ReCAPTCHA_idea.jpg</sub>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Amazon Mechanical Turk: https://www.mturk.com/\n",
"<img src=\"img/Tuerkischer_schachspieler_windisch4.jpg\">\n",
"\n",
"<sub>Source: https://en.wikipedia.org/wiki/Mechanical_Turk#/media/File:Tuerkischer_schachspieler_windisch4.jpg</sub>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Licenses\n",
"- Before deciding to use a dataset, be sure to check its license!\n",
"- Many datasets available online are released under open licenses\n",
"- Usually, however, using them requires meeting certain conditions, e.g. crediting the source\n",
"- Many publicly available datasets cannot be used free of charge for commercial purposes!\n",
"- Some licenses even require that any derivative work created with the dataset be released under the same license (e.g. GPL). This is a \"danger\" when the resources are used by a commercial company!\n",
"- How CC licenses work: https://creativecommons.pl/\n",
"- The most popular licenses:\n",
" - Friendly to commercial use as well: MIT, BSD, Apache, CC (without the NC clause)\n",
" - GPL (GNU General Public License) - a \"viral\" open-source license"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Example \n",
"- Using standard bash tools, we will do an initial inspection and split of the data\n",
"- As an example we will use the classic Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris\n",
"- The dataset contains the lengths and widths of the sepals and petals of three iris species:\n",
" - Iris Setosa\n",
" - Iris Versicolour\n",
" - Iris Virginica\n",
" \n",
"<img src=\"IUM_02/iris.png\"/>\n",
"\n",
"<sub>Source: https://www.kaggle.com/vinayshaw/iris-species-100-accuracy-using-naive-bayes<br>\n",
"License: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)</sub>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting kaggle\n",
" Using cached kaggle-1.5.12.tar.gz (58 kB)\n",
"Requirement already satisfied: six>=1.10 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.15.0)\n",
"Requirement already satisfied: certifi in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2021.5.30)\n",
"Requirement already satisfied: python-dateutil in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.8.1)\n",
"Requirement already satisfied: requests in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.25.1)\n",
"Requirement already satisfied: tqdm in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (4.59.0)\n",
"Requirement already satisfied: python-slugify in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (5.0.2)\n",
"Requirement already satisfied: urllib3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.26.4)\n",
"Requirement already satisfied: text-unidecode>=1.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)\n",
"Requirement already satisfied: idna<3,>=2.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (2.10)\n",
"Requirement already satisfied: chardet<5,>=3.0.2 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (4.0.0)\n",
"Building wheels for collected packages: kaggle\n",
" Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=1e6240d540651324d97a9772ad1ced30da7d7b5dc5956dc974eeeddf7c48844b\n",
" Stored in directory: /home/tomek/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e\n",
"Successfully built kaggle\n",
"Installing collected packages: kaggle\n",
"Successfully installed kaggle-1.5.12\n",
"Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
"Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
]
}
],
"source": [
"# Install the required libraries\n",
"!pip install --user kaggle  # Kaggle API, used to download the dataset\n",
"!pip install --user pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - We will download the Iris dataset from Kaggle: https://www.kaggle.com/uciml/iris\n",
" - Its license is \"Public Domain\", so we can use it without restrictions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium\n",
" 0%| | 0.00/3.60k [00:00<?, ?B/s]\n",
"100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 1.63MB/s]\n"
]
}
],
"source": [
"# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
"# Instructions: https://www.kaggle.com/docs/api\n",
"!kaggle datasets download -d uciml/iris"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: iris.zip\r\n",
" inflating: Iris.csv \r\n",
" inflating: database.sqlite \r\n"
]
}
],
"source": [
"!unzip -o iris.zip"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Inspection\n",
"- Before we start training a model on the data, we should get to know its characteristics\n",
"- This will allow us to:\n",
" - remove or fix invalid examples\n",
" - select the features we will use in our model\n",
" - choose an appropriate learning algorithm\n",
" - decide how to split the dataset and whether to normalize it\n"
]
},
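{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the first point, removing or fixing invalid examples. It assumes a tiny hand-made table (the values and the `clean` name are made up for illustration, this is not the real Iris.csv) and uses only basic pandas calls:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with three typical problems:\n",
"# a missing value, a negative measurement and a duplicated row.\n",
"df = pd.DataFrame({\n",
"    'PetalLengthCm': [1.4, None, 4.7, -5.1, 1.4],\n",
"    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',\n",
"                'Iris-virginica', 'Iris-setosa'],\n",
"})\n",
"\n",
"print(df.isna().sum())        # missing values per column\n",
"print(df.duplicated().sum())  # number of fully duplicated rows\n",
"\n",
"clean = df.dropna()                        # drop rows with missing values\n",
"clean = clean[clean['PetalLengthCm'] > 0]  # a length must be positive\n",
"clean = clean.drop_duplicates()            # keep one copy of each row\n",
"print(len(df), '->', len(clean))\n",
"```\n",
"\n",
"Whether a suspicious row should be dropped or repaired depends on the dataset, so the rules above are an example rather than a recipe."
]
},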
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Inspection\n",
"- To inspect the data we will use the popular Python library pandas: https://pandas.pydata.org/\n",
"- pandas is used to analyze and manipulate tabular data as well as time series\n",
"- For visualization we will use the Seaborn library: https://seaborn.pydata.org/index.html"
]
},
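{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"pandas alone is enough to sketch a dataset split and a normalization step. The example below is only an illustration under stated assumptions: the toy table, the 80/20 ratio, the fixed random seed and min-max scaling are choices made here, not something prescribed by the course.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for a real dataset.\n",
"df = pd.DataFrame({\n",
"    'PetalLengthCm': [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9, 5.6],\n",
"    'Species': ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 3 + ['Iris-virginica'] * 4,\n",
"})\n",
"\n",
"# Shuffle the rows, then take the first 80% as the training set.\n",
"shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)\n",
"n_train = int(0.8 * len(shuffled))\n",
"train = shuffled.iloc[:n_train].copy()\n",
"test = shuffled.iloc[n_train:].copy()\n",
"\n",
"# Min-max normalization: compute the statistics on the training set only\n",
"# and reuse them for the test set, so no test information leaks into training.\n",
"lo = train['PetalLengthCm'].min()\n",
"hi = train['PetalLengthCm'].max()\n",
"train['PetalLengthNorm'] = (train['PetalLengthCm'] - lo) / (hi - lo)\n",
"test['PetalLengthNorm'] = (test['PetalLengthCm'] - lo) / (hi - lo)\n",
"\n",
"print(len(train), len(test))\n",
"```\n",
"\n",
"A plain random split like this does not preserve class proportions, and normalized test values can fall slightly outside the 0-1 range. For small or imbalanced datasets a stratified split may be the better choice."
]
},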
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
"Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
"Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n",
"Collecting seaborn\n",
" Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)\n",
"\u001b[K |████████████████████████████████| 292 kB 1.1 MB/s eta 0:00:01\n",
"\u001b[?25hCollecting matplotlib>=2.2\n",
" Downloading matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n",
"\u001b[K |████████████████████████████████| 11.2 MB 10.8 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pandas>=0.23 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.2.4)\n",
"Requirement already satisfied: numpy>=1.15 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.20.2)\n",
"Requirement already satisfied: scipy>=1.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.6.3)\n",
"Requirement already satisfied: packaging>=20.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (20.9)\n",
"Requirement already satisfied: python-dateutil>=2.7 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)\n",
"Collecting cycler>=0.10\n",
" Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n",
"Requirement already satisfied: pyparsing>=2.2.1 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)\n",
"Collecting fonttools>=4.22.0\n",
" Downloading fonttools-4.30.0-py3-none-any.whl (898 kB)\n",
"\u001b[K |████████████████████████████████| 898 kB 4.9 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pillow>=6.2.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)\n",
"Collecting kiwisolver>=1.0.1\n",
" Downloading kiwisolver-1.3.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)\n",
"\u001b[K |████████████████████████████████| 1.6 MB 7.7 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)\n",
"Installing collected packages: kiwisolver, fonttools, cycler, matplotlib, seaborn\n",
"Successfully installed cycler-0.11.0 fonttools-4.30.0 kiwisolver-1.3.2 matplotlib-3.5.1 seaborn-0.11.2\n"
]
}
],
"source": [
"!pip install --user pandas\n",
"!pip install --user seaborn"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\r\n",
"1,5.1,3.5,1.4,0.2,Iris-setosa\r\n",
"2,4.9,3.0,1.4,0.2,Iris-setosa\r\n",
"3,4.7,3.2,1.3,0.2,Iris-setosa\r\n",
"4,4.6,3.1,1.5,0.2,Iris-setosa\r\n"
]
}
],
"source": [
"!head -n 5 Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>SepalLengthCm</th>\n",
" <th>SepalWidthCm</th>\n",
" <th>PetalLengthCm</th>\n",
" <th>PetalWidthCm</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" <td>Iris-setosa</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>145</th>\n",
" <td>146</td>\n",
" <td>6.7</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.3</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>146</th>\n",
" <td>147</td>\n",
" <td>6.3</td>\n",
" <td>2.5</td>\n",
" <td>5.0</td>\n",
" <td>1.9</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>147</th>\n",
" <td>148</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" <td>5.2</td>\n",
" <td>2.0</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>148</th>\n",
" <td>149</td>\n",
" <td>6.2</td>\n",
" <td>3.4</td>\n",
" <td>5.4</td>\n",
" <td>2.3</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>149</th>\n",
" <td>150</td>\n",
" <td>5.9</td>\n",
" <td>3.0</td>\n",
" <td>5.1</td>\n",
" <td>1.8</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>150 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \\\n",
"0 1 5.1 3.5 1.4 0.2 \n",
"1 2 4.9 3.0 1.4 0.2 \n",
"2 3 4.7 3.2 1.3 0.2 \n",
"3 4 4.6 3.1 1.5 0.2 \n",
"4 5 5.0 3.6 1.4 0.2 \n",
".. ... ... ... ... ... \n",
"145 146 6.7 3.0 5.2 2.3 \n",
"146 147 6.3 2.5 5.0 1.9 \n",
"147 148 6.5 3.0 5.2 2.0 \n",
"148 149 6.2 3.4 5.4 2.3 \n",
"149 150 5.9 3.0 5.1 1.8 \n",
"\n",
" Species \n",
"0 Iris-setosa \n",
"1 Iris-setosa \n",
"2 Iris-setosa \n",
"3 Iris-setosa \n",
"4 Iris-setosa \n",
".. ... \n",
"145 Iris-virginica \n",
"146 Iris-virginica \n",
"147 Iris-virginica \n",
"148 Iris-virginica \n",
"149 Iris-virginica \n",
"\n",
"[150 rows x 6 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"iris=pd.read_csv('Iris.csv')\n",
"iris"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>SepalLengthCm</th>\n",
" <th>SepalWidthCm</th>\n",
" <th>PetalLengthCm</th>\n",
" <th>PetalWidthCm</th>\n",
" <th>Species</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Iris-virginica</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>75.500000</td>\n",
" <td>5.843333</td>\n",
" <td>3.054000</td>\n",
" <td>3.758667</td>\n",
" <td>1.198667</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>43.445368</td>\n",
" <td>0.828066</td>\n",
" <td>0.433594</td>\n",
" <td>1.764420</td>\n",
" <td>0.763161</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>4.300000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.100000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>38.250000</td>\n",
" <td>5.100000</td>\n",
" <td>2.800000</td>\n",
" <td>1.600000</td>\n",
" <td>0.300000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>75.500000</td>\n",
" <td>5.800000</td>\n",
" <td>3.000000</td>\n",
" <td>4.350000</td>\n",
" <td>1.300000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>112.750000</td>\n",
" <td>6.400000</td>\n",
" <td>3.300000</td>\n",
" <td>5.100000</td>\n",
" <td>1.800000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>150.000000</td>\n",
" <td>7.900000</td>\n",
" <td>4.400000</td>\n",
" <td>6.900000</td>\n",
" <td>2.500000</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \\\n",
"count 150.000000 150.000000 150.000000 150.000000 150.000000 \n",
"unique NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN \n",
"mean 75.500000 5.843333 3.054000 3.758667 1.198667 \n",
"std 43.445368 0.828066 0.433594 1.764420 0.763161 \n",
"min 1.000000 4.300000 2.000000 1.000000 0.100000 \n",
"25% 38.250000 5.100000 2.800000 1.600000 0.300000 \n",
"50% 75.500000 5.800000 3.000000 4.350000 1.300000 \n",
"75% 112.750000 6.400000 3.300000 5.100000 1.800000 \n",
"max 150.000000 7.900000 4.400000 6.900000 2.500000 \n",
"\n",
" Species \n",
"count 150 \n",
"unique 3 \n",
"top Iris-virginica \n",
"freq 50 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris.describe(include='all')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-virginica 50\n",
"Iris-setosa 50\n",
"Iris-versicolor 50\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris[\"Species\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"iris[\"Species\"].value_counts().plot(kind=\"bar\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PetalLengthCm</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Species</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Iris-setosa</th>\n",
" <td>1.464</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Iris-versicolor</th>\n",
" <td>4.260</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Iris-virginica</th>\n",
" <td>5.552</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PetalLengthCm\n",
"Species \n",
"Iris-setosa 1.464\n",
"Iris-versicolor 4.260\n",
"Iris-virginica 5.552"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris[[\"Species\",\"PetalLengthCm\"]].groupby(\"Species\").mean()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot:xlabel='Species'>"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"iris[[\"Species\",\"PetalLengthCm\"]].groupby(\"Species\").mean().plot(kind=\"bar\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.FacetGrid at 0x7f97eed545b0>"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAdoAAAFtCAYAAACgK6tiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAABg1ElEQVR4nO3deXwU9f348dfMnrnvC8KNHMohEO5LARUBgYIHXtSq0NYDa2ulnigiX1FbWlGLtrX+rEetiiKIioCK3CAooIDIkQC5T5JNstfM74/AwrIhB2R3s+H9fDx4PNhP5j3zzhLy3pn5zPuj6LquI4QQQgi/UIOdgBBCCNGSSaEVQggh/EgKrRBCCOFHUmiFEEIIP5JCK4QQQviRFFohhBDCj4zBTqCpFRVVoGmNe2IpLi6ckpJKP2XkP6GaN0juwRCqeUPLzz0pKSpA2YhgkDNawGg0BDuFcxKqeYPkHgyhmjdI7iK0SaEVQggh/EgKrRBCCOFHUmiFEEIIP5JCK4QQQviRFFohhBDCj6TQCiGEEH4khVYIIYTwIym0QgghhB8FpDNUSUkJDz74IFlZWZjNZtq1a8fcuXOJj4/32m7RokW8/fbbJCcnA9C3b1/mzJkTiBSFEH5mMCiAgtutnUOcN6NRRdP0RneBEyIYAlJoFUXhzjvvZODAgQAsWLCA559/nvnz5/tsO3nyZGbPnh2ItIQQAaAokO/KY8PhrdhcVQxrM4B0azqqXnfHJEWBPGceGw9vpXJfFUPbDCA1LJnD5VlsOPotaZHJDGjVh3g1AV3qrWjGAlJoY2NjPUUW4NJLL+Wdd94JxKGFEEFW4Mpn3jd/w6W5AFiftZU/DP41HcM61Rv39LpTcbvz9zGhy2je3PmhZ5vVh9bx6NDfEa3E+i1/Ic5XwO/RaprGO++8w6hRo2r9+ieffMI111zD7bffzo4dOwKcnRCiKSmKwq6CPZ5iedKyn74Aw9kvISuKws5877hBbfry0d6VXtvZHJUcrchu2qSFaGIBX73nqaeeIjw8nFtuucXna9OmTeM3v/kNJpOJ9evXc9ddd7FixQri4uIavP+EhMhzyitUV88I1bxBcg+GYOSt5PqO6ehER4dhMZrPHpjjfT1YQUGv5RqxwaA2+3+P5p6f8K+AFtoFCxaQmZnJ4sWLUVXfk+mkpCTP34cOHUpaWhr79+9nwIABDT7GuSyTl5QURUFBeaNimoNQzRsk92AIVt6XJHRlifIpbv3UGez4zmM4XmIH7GeN65HYjY+Uzzxxm4/u4JquY/jv7o8924SZrKSGpTTrf4+GvO9SiFu2gBXahQsXsnv3bl599VXM5to/xebl5ZGSkgLAnj17OHbsGB06dAhUikIIP0gypfDI8PtYc3g9Nmclo9sPo214W6jn83DyaXGVzkpGtR9Gq/AUEsLi+TpzI62iUhneZiCxapxMhhLNmqLXdi2mie3fv58JEybQvn17rFYrAOnp6bz00kvMmDGDWbNm0bNnT2bPns0PP/yAqqqYTCZmzZrFyJEjG3UsOaMNDZJ74AU7b4Oh5ipW4x/vUYmLC6ewsMIzZjSq6LqO2938K6yc0YqAFNpAkkIbGiT3wAvVvKHl5y6FtmWTzlBCCCGEH0mhFUIIIfxICq0QQgjhR1JohbhAqQZQDP6ZomEwKGByYzTKrxghAt6wQggRZIpOtuMYn+75Epujkqs6jaRTVCeMuqlJdl+k5/PNwc38XHyYnindGdiqD7EkNMm+hQhFUmiFuMDkOfOYv26Rp8vSvqID3NP/NrpHXXze+65Uj/PSxtfJsxUCcKjkCAeLM5nZ+1YM7jq6QAnRgsl1HSEuIIqisLtgr08rw0/2r0ZX3ee9/2xbnqfInrQ7fx/59oLz3rcQoUoKrRAXFB2zwfcSscVoQfFd9rXRDIrvrxQFBbWWcSEuFPLTL8QFRNfh4oQuPsX2mi5XgLvu9WEbIi08lc7x7b3GhrbNIMmSfN
77FiJUyT1aIS4wicZkHh32O3bk7cLmrCIjrRdpltb19h5uCKsWwe2XTuPHwp84WJJJt8TOdI3vjOo6/yIuRKiSQivEBUbXdRIMSVyZPhqgpmVpEz7lE0M8Q5MHMSJtKE6nu0n3LUQokkIrxAWqsT3BG7dv0LTzn1wlREsg92iFEEIIP5JCK4QQQviRFFohhBDCj6TQCnGB0gwu3KrD8/ysqio4VQe64bR7q6qGU7WjqPpZ4871eLVS9ZrjKWe/f1xrns2AprpwncP7Ilo+mQwlxAVGUzQyKw+zZO8KbM4qxl80mm7xndmes5PVh9aRGB7PlO7jiDRFsGzvSvYWHaBfWk+u6DCS/MpCluxdQeWJuJ7xl2DSLXUeT1c0Dlce5oO9K6iqI65EL+KTvV+wr+gg/dJ6Mrr9CKKI8dqmChtbcraz5tB6kiMSmNptPKnmNNCDV900xc0h2yGW7F2B3eVgQpcxXBLXvd73RVw4FP3MXmwhrqiootGzKZOSoigoKPdTRv4TqnmD5B4MJ/POcR5j3jd/84ynRiaR0bo3y/et8owZVAM39ZrEf75b4hm7o+8N/Gv7u177nNnvFi6N7UVdv0WyHUd5et0LXmO/7ncLvU+Lq1JszFu/kNLq455tuiR05J4+d2DQTCQlRVFYVM5nWav4+KeVXnk+MfwPxBsSG/VeNKVjjiPMX7fIa+y3GdPpEd0DaNjPS1JSlN/yE8Enl46FuICoqsKu/L1eY31b9WT1wXVeY27NTaWj2vM6PiyWgyVHfPb32c9fotXRI1lVFXbm7/GNO+Adl19V4FVkAX4qOkiJs8Tzulqv5PMDX/nkecyWc9bj+5uqKmzP3e0zvvLA12DQgpCRaI6k0ApxAdF1nUhzhNdYlbOKSFO4z7an9ye2uxyEmaw+28RYolA4+2VbXdeJskT4jMdYolFO+/VjVn37L6uKium0cVUxEGH2zdOsBm9VIF2HaEukz3iMNbrO90VcWKTQCnEB0XW4OLGLV8Hacux7rusxwWu7pPAEok4rIDZnJZ3j2xNxWkFWFZVrul4J7rP/GtF1uCShq29clyvAfaoQJVmT6JvW0yt23EWjiTHEel6bdQs39ZzstU1yRCLpka3q/qb9SNd1eiVd7PUhxKCojOs8Ct0thVbUkHu0hP49t1AkuQfeybwVBUrdxewvPYTdZeei+I4kmhPJrc5lf8khYi3RdIptjxETmRVZZFfk0Ta6NW0j21DptrG/5FRcsiml3olIigIlWjE/n4jrEt+JZFMy+hlxVdjIrDhCTkUebaPTaRuR7plQdDJ3TXGTY8/hZ0+eHYgk2m/vWUMoChS7i/i59BBOt5OL4juSZEz2vC9yj1ZIoSX0f3GGIsk98GrLW1Hwmsh05uvGjDVEQ+Jq2+bM3M/1+P7WkNxrI4W2ZZNLx0JcwM4sCrUVr4aOncvx/LlNMDTXvERwSaEVQggh/EgKrRBCCOFHUmiFEEIIP5JCK4RocprBRZViQ6+jmYWi6lQrlbhURwAzazxFJSTyFM2X9DoWQjQZRYF8Vx7vfP8RB0oy6ZXSjWu7XUOMEue1nY1yVh38mq8yNxIfFsetvabQPqxDUHsW18ZGOV8c/IqvMzeREBbHLb2m0j6sfbPLUzRvckYrhGgyNr2c5zb+nX1FB3BpLrbn7Obv2/8fztPOBhVV54vDX7Py4Focbie5Ffk8v/EVCpz5Qcy8FqrO54fW8MXBb3C4neRU5PP8xsUUugqCnZkIMVJohRBNJr+qCJuj0mvsSFk2pY5TPYur9Cq+ytzotY2u62Tb8gKSY0NV6ZV8lbnJa0zXdXKaWZ6i+ZNCK4RoMmFG337IBtWAxXBqyTijYiTBGuuzXXgtscFkVIzEW2N8xmv7HoWoixRaIUSTSTQncFn7wV5j13UfT4x6qmAZNTO39J7q1XS/Q2wb0iOC17O4Nmbdwq29r/XKs2NcW1o3szxF8yctGGlZLfVCheQeeIHK20E12VW5lNpLSQ
xLINWailE/Y3UeRafAmU+2LY9wo5X0iFaE4bsKzklBe88VjXxnPjm2/Jo8I1sTpvuuRlQXacEoZNaxEKJJmbHWzMw
"text/plain": [
"<Figure size 474.35x360 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAdoAAAFtCAYAAACgK6tiAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAB0yElEQVR4nO3dd5hU5fnw8e8502d2ts82ekekShOwgcQuIJZgosYSkVjwF2PktSJGMZZoFDXYExM1VlSqBrECIkiR3qVt77uzu9POef8YGBhmYWfZndkF7s91cV3sc9q9Z87Ofcpz7kfRdV1HCCGEEDGhtnQAQgghxIlMEq0QQggRQ5JohRBCiBiSRCuEEELEkCRaIYQQIoYk0QohhBAxFPdE+8ILL9CjRw+2bNkSMW3GjBkMGzaMsWPHMnbsWKZNmxbv8IQQQohmZYznxtavX8/q1avJyck54jzjxo1jypQpx7yNkpJqNC1+rwanpNgpK6uJ2/ai1VrjgtYbm8TVOBJX4x0pNpfL2QLRiHiJ2xWt1+vlkUceYerUqSiKEq/NxpzRaGjpEOrVWuOC1hubxNU4ElfjtebYROzELdE+99xzjBkzhnbt2h11vrlz53LppZdy4403smrVqjhFJ4QQQsSGEo8SjKtWreLZZ5/lX//6F4qiMGrUKGbOnEn37t3D5isqKiI5ORmTycTixYu5++67mTdvHikpKbEOUQghhIiJuDyjXb58OTt27ODcc88FID8/n5tuuonHH3+cM844IzSfy+UK/X/EiBFkZ2ezdetWhgwZEvW24v2M1uVyUlRUFbftRau1xgWtNzaJq3EkrsY7UmzyjPbEFpdEO3HiRCZOnBj6+UhXtAUFBWRmZgKwceNG9u3bR6dOneIRohBCCBETce11XJ+bb76ZyZMn06dPH5555hnWr1+PqqqYTCaefPLJsKtcIYQQ4njTIol20aJFof+/+uqrof8/8cQTLRGOEEIIETNSGUoIIYSIIUm0QgghRAxJohVCCCFiSBKtOC5oQHGVh/IaX0uHIoQQjdLivY6FaEhVnZ83521g9ZZiVFVh7JmdOX9wO8xGOU8UQrR+8k0lWjVFVfhq5V5WbykGQNN0Zn2znZ0FrbMggRBCHE4SrWjVfH6dH9bnR7Rv3VOOqp44g1MIIU5ckmhFq2YyKPRoH1nrun2mM66lNoUQ4lhJohWtmq7rXHpGJ1KcllBbn65pdGmT1IJRCSFE9KQzlGj1Uh1m/nLz6eSX1mAyGshMsWI2yDmiEOL4IIlWHBfsZgOds2SEEyHE8UcuC4QQQogYkkQrhBBCxJAkWiGEECKGJNEKIYQQMSSJVgghhIghSbRCCCFEDEmiFUIIIWJIEq0QQggRQ5JohRBCiBiSRCuEEELEkCRaIYQQIoYk0QohhBAxJIlWCCGEiCFJtEIIIUQMSaIVQgghYkgSrRBCCBFDMvC7iIuArlNYUUdRWS2uijrSE8yYDHKeJ4Q48UmiFTGnKPDT5mJmzlobart4eEfGntEJo6q0YGRCCBF7ckkhYq6i1s+bczaEtc1d8gvFlXUtFJEQQsSPJFoRc3VePx5fIKK9utbXAtEIIUR8SaIVMZeSYKFtRkJYm8VsICPF3kIRCSFE/EiiFTFnUhX+79f96dkhBYCcdAf3/24wSTZTC0cmhBCxJ52hRFykOszcNaE/tZ4AaSl2vLVedF1v6bCEECLm5IpWxI1RUXBajSQlWFo6FCGEiBtJtEIIIUQMSaIVQgghYkgSrRBCCBFDkmhFBEWKNQkhRLORXscixKfp7C1ys31fBZmpdjrnJOIwG1o6LCGEOK5JohUAKCosWZvPv+ZuDLX17JDC5Cv7YTXKjQ8hhDhW8g0qAKis9fPfL7aEtW3aVUZeSU0LRSSEECcGSbQCgEBAr7cesbeeNiGEENGTRCsASLSbGNYnO6zNbjWSne5ooYiEEOLEIM9oBRA847p6dDcyU2x8/3MuHbOTuG
JkV5JsRqRSohBCHDtJtCIkwWJk7BmduGBoB8wGFdAlyQohRBPJrWMRRtd0zAYFkAwrhBDNQRKtEEIIEUOSaIUQQogYkkQrhBBCxFDcE+0LL7xAjx492LJlS8S0QCDAtGnTGD16NL/61a/44IMP4h2eOE4pChiNKooUahZCtDJx7XW8fv16Vq9eTU5OTr3TZ8+eze7du/niiy8oLy9n3LhxDBs2jLZt28YzTHGccXsCrN5ezMpNhfTvls6A7i4SLNKhXgjROsTtitbr9fLII48wderUI151zJs3jyuvvBJVVUlNTWX06NEsWLAgXiGK45Bf13lz3kZe/2w9q7YU8ebcjcyctQ6fJr2mhRCtQ9wS7XPPPceYMWNo167dEefJy8sLu9rNzs4mPz8/HuGJ41RJpYeVmwvD2jb8UkpRRV0LRSSEEOHicn9t1apVrF27lrvvvjvm20pLS4j5Ng7ncjnjvs1otNa4oPliK6v119tuMhmOaRutdZ9JXI3TWuOC1h2biI24JNrly5ezY8cOzj33XADy8/O56aabePzxxznjjDNC82VnZ5Obm0vfvn2ByCvcaJSUVKPF8bahy+WkqKgqbtuLVmuNC5o3NqfVSP/uLlZvKQq19eyQQrLN1OhttNZ9JnE1TmuNC44cmyTfE1tcEu3EiROZOHFi6OdRo0Yxc+ZMunfvHjbfBRdcwAcffMB5551HeXk5Cxcu5O23345HiOI4ZVTgxotPYVV3Fys3FdKvm4uBPVyYDNL7WAjROrR418ybb76ZyZMn06dPH8aOHcuaNWs477zzALjtttuO+kxXCAjWaD6rTzYj+7chENDQpUCzEKIVaZFEu2jRotD/X3311dD/DQYD06ZNa4mQxHFO13X8fhk7VwjR+khlKCGEECKGJNEKIYQQMSSJVgghhIghSbSi2agq+HSQJ6VCCHFQi/c6FieGOr/Gmu0lzF28E5NR5bKzu9KzXSJGVc7lhBAnN/kWFM1i/a4yXp61lr2F1ezMreSZd1eys8Dd0mEJIUSLk0Qrmkw1qiz8cXdE+7L1+ZhMcogJIU5u8i0omkxRwGk3R7Q77SY0rQUCEkKIVkQSrWiygE/j4uEdMagHyx7aLEYG98okEJBMK4Q4uUlnKNEsOmQ6mPr709mwswSjUeWUjqm0SbHKFa0Q4qQniVY0Dw3aptpon942+KOGJFkhhEASrWhmklyFECKcPKMVQgghYkgSrRBCCBFDkmiFEEKIGJJEe5JRjVBe66e81o/BqDS8QGuhgNsboM6noarHUdzi+KHquKnCo9SiKI07xlRVoU5xU6NUo8i3qjiMdIY6iVR5Any7bB9zvt8JwCVndOLMfjkkWlv3YVDjDTB/2S4WLN2F1WLkmgt6MrBbOkZJuKKZuKnisy2fs3jPcpzmBK7rezk9E3ug6IYGl/UrXn4qXMUHG+bi0/xc0PUcRrY9Ayv2OEQujgdy7nUS2fBLKR99tQ2PL4DHF+Cjr7axcVdZS4d1VKqqsHR9PnMX/0JA03HX+nh51lr2FEkdZdE8FFXnf798w/e7f0TXdSo9Vbyw/J8UeAuiWn63ew9v/fwRtf46/JqfOVsWsqZkXaOvisWJSxLtScJuN/Pj+vyI9h/W5WOvp3xia+ENaCz6aW9E+8ZfSuUWsmgWtXot3+/5MaJ9X3Xk38vhVFXh58INEe3f7PoBXfU3S3zi+CeJ9iTh9wfIcSVEtLdxJeD3t94RZI2qQpt64s5ItaNpegtEJE40JsVEpiM9oj3RHHncHU7XdbITMiPa2yXmRHXbWZwcJNGeJLzeACP65pCUcPDqNdFhZkS/bLze1pto0eHyc7pgMR/80mrjSqBHu+SWi0mcUAyaid/0vgyDevAY65raibYJbRpcVtehV3oP0u2poTab0cp5nc9Gl+ItYj9F1/UT6rKgpKQ6rlc6LpeToqKquG0vWkeKq7jay56CanRdp32Wk/SE+N82buw+UxSFUreX3KJqTEYDbVwOHO
bmv1o43j7LlnZCxaXolAZKyHcXYjVYyHFkY9Wj78zkpopcdz4BPUCOI4skNZn6vlmPFJvL5WxcvOK40rq7m4pml55
"text/plain": [
"<Figure size 474.35x360 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"sns.set_theme()\n",
"sns.relplot(data=iris, x=\"PetalLengthCm\", y=\"PetalWidthCm\", hue=\"Species\")\n",
"sns.relplot(data=iris, x=\"SepalLengthCm\", y=\"SepalWidthCm\", hue=\"Species\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.FacetGrid at 0x7f97ef942eb0>"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAdgAAAFtCAYAAACk3ntfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAABjbUlEQVR4nO3deXwU9f348dfMnjk29yYECKfcgtwICMpRFQWJKC22HvXAIiL8bClUrSAeVNBqK6VfvFtbsRYUFIhoERREQBALQhERlCPkPsi12Wvm90dkw5KbZDdZ8n4+Hj7MfuazM+8ddve9M/OZz1vRdV1HCCGEEE1Kbe4AhBBCiIuRJFghhBAiACTBCiGEEAEgCVYIIYQIAEmwQgghRABIghVCCCECwNjcAQRTXl4Jmta0dyXFxoZTUFDWpOsMtFCLOdTihdCLOdTihdCLubp47XZbM0UjgkGOYBvJaDQ0dwgNFmoxh1q8EHoxh1q8EHoxh1q8ovEkwQohhBABIAlWCCGECABJsEIIIUQASIIVQgghAkASrBBCCBEAkmCFEEKIAJAEK4QQQgSAJFghhBAiACTBCiFaJUUBo1FFURq3HlVVMBrlq1RU1aqmShRCCACLpxDPiX2U/7Afa6f+GDr0w2mMbvB6rOXZOL/7gvLcE4T3GIGe1AO3GhaAiEUokgQrhGhVTLqT4k0v4zxxEADHd19i7TKA8PH34cZc7/VY3QXkrX4SrayoYj3ffkHMVbei9BiP3rRTnosQJec1hBCtilqS5UuuZ5Uf+wq1NLtB69HyTviS61lnPl+N2VPc6BjFxUESrBCidanp8LKBR516dU/Q9YavSFy0JMEKIVoV3ZaEuV0PvzZLxz5okfYGrccQl4IaFunXFnV5Km5TVKNjFBcHuQYrhGhVXIqVqKtn4D62G+ex/2LpOhBT58E4sTRoPeXmeOJv/j3lh7bhzjlOeJ8roW0f3FqAAhchRxKsEKLVcZpiUXpeQ1ifa/F6dZwXOCqpPKwNhiE/xagouD2SWYW/oCXYmTNncurUKVRVJTw8nEcffZRevXr59Zk3bx6HDx/2PT58+DDLly9n3LhxLFu2jJUrV5KYmAjAwIEDWbhwYbDCF0JcZHRdx+Np/PVSr1euu4rqBS3BLlmyBJvNBsCmTZt4+OGHWbNmjV+fpUuX+v7+5ptvuOOOOxg1apSvLTU1lfnz5wcnYCGEEKIRgjbI6WxyBSgpKUGpY/qU1atXM2nSJMzm+t+XJoQQQrQUQb0G+8gjj7B9+3Z0XeeVV16psZ/L5WLdunX87W9/82vfsGEDn332GXa7nQceeIABAwYEOGIhhBDiwii6Hvw5R9auXcuGDRt4+eWXq12elpbGyy+/7HcKOScnh5iYGEwmE9u3b2fu3LmkpaURGxsbrLCFEEKIemuWUcSpqaksWLCAgoKCahPkO++8w0033eTXZrdX3qM2cuRIkpOTOXLkCEOHDq33dvPyStC0pv09YbfbyMkJrZlbQi3mUIsXQi/mUIsXQi/m6uK122019BYXg6Bcgy0tLSUjI8P3ePPmzURHRxMTE1Olb2ZmJl9++SUTJ070a8/KyvL9fejQIdLT0+ncuXPAYhZCCCEaIyhHsA6Hgzlz5uBwOFBVlejoaFasWIGiKEyfPp3Zs2fTt29fANasWcOYMWOqJN/nnnuOgwcPoqoqJpOJpUuX+h3VCiGEEC1Js1yDbS5yirhCqMUcavFC6MUcavFC6MUsp4hbH5mLWAghhAgASbBCCCFEAEiCFUIIIQJAEqwQQggRAJJghRBCiACQBCuEEEIEgCRYIYQQIgAkwQohhBABIAlWCCGECABJsEIIIUQASIIVQgghAkASrBBCCBEAkmCFEEKIAJAEK4QQQgSAJFghhBAiACTBCiEaTVEULLoDi16GoijNHY4QLYKxuQMQQoQ2o1aOcuorCravAiBq+E3QcRAexdrMkQnRvOQIVgjRKGrOEfI3vo
i3OB9vcT4FH72Mmv1tc4clRLOTBCuEuGBGo0rZoa1V2su+3ozRKF8vonWTT4AQ4oJpmo4xOqlKuyEmCV3XmyEiIVoOSbBCiAumaTrWniNRzGG+NsVsJaz3lXi9kmBF6yaDnIQQjeIMTybhlsfx5vwAuo7B3glnWCJIfhWtnCRYIUSj6DqUW+zQ3g6AGyS5CoGcIhZCCCECQhKsEEIIEQCSYIUQQogAkAQrhBBCBIAkWCGEECIAJMEKIYQQASAJVgghhAgASbBCCCFEAARtoomZM2dy6tQpVFUlPDycRx99lF69evn1WbZsGStXriQxMRGAgQMHsnDhQgC8Xi9PPvkk27ZtQ1EU7r33XqZOnRqs8IUQQogGCVqCXbJkCTabDYBNmzbx8MMPs2bNmir9UlNTmT9/fpX2devWceLECT766CMKCwtJTU1l+PDhtG/fPuCxCyGEEA0VtFPEZ5MrQElJCYqiNOj5aWlpTJ06FVVViYuLY/z48WzcuLGpwxRCCCGaRFDnIn7kkUfYvn07uq7zyiuvVNtnw4YNfPbZZ9jtdh544AEGDBgAQEZGBm3btvX1S05OJjMzMyhxCyGEEA0V1AT71FNPAbB27VqWLl3Kyy+/7Ld82rRpzJgxA5PJxPbt25k5cyZpaWnExsY2yfbj4yObZD3ns9ttdXdqYUIt5lCLF0Iv5lCLF0Iv5lCLVzROs1TTSU1NZcGCBRQUFPglT7vd7vt75MiRJCcnc+TIEYYOHUpycjKnT5+mX79+QNUj2vrIyytB05q2zIfdbiMnp7hJ1xlooRZzqMULoRdzqMULoRdzdfFKwr24BeUabGlpKRkZGb7HmzdvJjo6mpiYGL9+WVlZvr8PHTpEeno6nTt3BuDaa69l1apVaJpGfn4+mzZt4pprrglG+EIIIUSDBeUI1uFwMGfOHBwOB6qqEh0dzYoVK1AUhenTpzN79mz69u3Lc889x8GDB1FVFZPJxNKlS31HtZMnT2bfvn1cffXVANx///2kpKQEI3whhBCiwRRd11tNaWQ5RVwh1GIOtXgh9GIOtXgh9GKWU8Stj8zkJIQQQgSAJFghhBAiACTBCiGEEAEgCVYIIYQIAEmwQgghRABIghVCCCECQBKsEEIIEQDNMlWiEPWhKAqFZW4KThQQaTFgUhtWgekso+LFUJYLuoYWnoAbU5U+JtyoZbmgqHjDE/DohsaGL4Ro5STBihbJq+l8/r9M/vHBN3i8Gu0TI/l/PxtAXETV5Fgbs6eE8j3vUrp/C6Bj7TqIyNG34TTF+PpYPIWUbv0Hju++BBQi+l2FdcgUXAaZBEAIceHkFLFokTILHby+/n94vBoAp7JLeOODQ2gNXI+eeYjS/ZuBihm8yo9+ifvoLtQfj4YVRcF9dPePyRVAp3T/FrSMQ03zQoQQrZYkWNEiZRc4qrTt/y4Xh8tb73UYDCrO4weqtDuOfIFBqViPUdUoP7KrSh/nD/sxGOTjIYS4cPINIlqkWJulSluXdtFYTPV/y2qahrntJVXaLSl98P54jdWrq5hT+lTt07Y7mtbQ42UhhKgkCVa0SG3jI7h2eEff43Crkbsn9cGo1H+gk66DoX1fzG17+NqMcW2x9h7tK/qgaTrWnqMwxlXWFja37YYhpR+tpwyGECIQZJCTaJHMBoUpo7ow+rJ2uLw6sZEmoqzGBic9pykG2/X/D6UoE13zokQnU65G+PUptyQQPeUROJOBoqroUck4lbAmfDVCiNZIEqxosYyqQpsYq6/M14UeUbqUMIjuXHsfNQJiq55OFkKICyWniIUQQogAkAQrhBBCBIAkWCGEECIAJMEKIYQQASAJVgghhAgASbBCCCFEAEiCFUIIIQJAEqwQQggRADLRhBCAwVuKJ79iJidDTDJeQ/PN5KSiY3WcRivKRbXF4YpIxqPLR1WIUCOfWtHqqY4cij9Yhjf3BACGtr2I/Mm9aJbY4MeigunkbrI2vgReD6gG4sb/EkPn4XglyQ
oRUuQUsWjVjEaV8kOf+ZIrgPf0IVzHv0Zthk+H1ZFJ3kevViRXAM1L/qa/YSnNDH4wQohGkQQrWjVV0eH0wSrt3tO
"text/plain": [
"<Figure size 474.35x360 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"irisv = iris[iris[\"Species\"] != \"Iris-setosa\"]\n",
"sns.relplot(data=irisv, x=\"SepalLengthCm\", y=\"SepalWidthCm\", hue=\"Species\")"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<seaborn.axisgrid.PairGrid at 0x7f97f2ad3550>"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAzwAAALDCAYAAADQRQWWAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAEAAElEQVR4nOydd3gcV7m435nZ3nelVW+2Zcu23HtN7z1OQggtEEJNIEBCgAvcGzoXuPy4l8ClBgiXmp6Q3uPe495lq3dppV1t35nfHyutvJZsy/aqOed9Hj+P59szM9+Mzsyc75yvSJqmaQgEAoFAIBAIBALBeYg82goIBAKBQCAQCAQCwXAhDB6BQCAQCAQCgUBw3iIMHoFAIBAIBAKBQHDeIgwegUAgEAgEAoFAcN4iDB6BQCAQCAQCgUBw3iIMHoFAIBAIBAKBQHDeohttBU7kzTff5L//+7/RNA1VVfnc5z7HFVdcMeT929sDqOrQMm273RY6O4Nnq+q44Hy/xtG4Pq/XPuS2Z9Ifz4bx8PcdDzrC+NVzqP1xuPviqRgP91boeO5k8t041q/1dAj9Rxe324JOp4y2GoIxxJgyeDRN48EHH+Qvf/kLU6ZMYf/+/dxxxx1cdtllyHLmF6PeCw/D+X6N5/v1nY7xcP3jQUcQeg4n40FnoePYYrxfq9B/dBnv+gsyz5hzaZNlGb/fD4Df7ycnJ2dYjB2BQCAQCAQCgUBw/iNpmjY6Pg4nYf369XzhC1/AYrHQ09PDr3/9a+bOnTvaagkEAoFAIBAIBIJxyJhyaYvH4/z617/ml7/8JfPnz2fr1q188Ytf5Pnnn8dqtQ7pGGfip+712mlt9Z+LymOe8/0aR+P6xlIMz3j4+46mjpIEXaqPxp4mFFmh0FqAWRv8XTIe7iUM1HM8xPCMh3s7XnXs6+MNPY3oZT0F1vyT9vGR0G+onK4/joe/x6kYr/r34Ke+pxFFkck15WDDMdoqnRVn0hcF7w3GlMGzb98+WlpamD9/PgDz58/HbDZz5MgRZs2aNcraCQSC8UZrooUfrnmYUDwMQL4th/sWfRL7OP2ICwQn0hJv5gdrHyYSjwBQaM/j8wvvHrcDVcHo0aV18J/rf0lXuBsAh9HOV5Z+FpecNcqaCQTnzpgKjsnLy6OpqYmqqioAjhw5QltbGyUlJaOsmUAgGG9IssaLh99IGTsAjYEWDnQcQpJGUTGBIFPIGs8dfCVl7ADU+5s47KsaRaUE4xFZltjc+G7K2AHojvhZ37ANWRYvTMH4Z0yt8Hi9Xh566CHuu+8+pN4RyQ9+8ANcLtfoKpYhQpE4+6o7mV2ehSISMQgEw0pCSlDb3TBA3tTTipQtMcbCFwWCMyZBnDp/0wB5S08bskcaNRdGwfhDliWqu+oGyI/5apDLZFQ1MQpaCQSZY0wZPAA33HADN9xww2irMSz87l972XOsgysWFLPqwkmjrY5AcF6jqHouLFnCX3c/nSavzJ4iBoKC8wI9Bi4oWcxje/+VJq/IKhd9XHBGxOMqS4vms61xd5p8RfEi4nFh7AjGP2KZYYRo84U4WOvjQ5dP4Y3t9cQT6mirJBCc12iaxrycOVxVfhGKrGDWmfjI7FspthSNtmoCQUZQVY1FefO4fOJKFEnGojfzsTm3U2guHG3VBOOQcsckbp1+HQZFj17Rs2rq1UxxTh5ttQSCjDDmVnjOV7YdamNykYtspxmn1cCR+i4qStyjrZZAcF5jxsINZddwaelKJGSskk3MfAvOKyzYuGnidVwx4SJkZCyijwvOEoNm4pKCC1iSPw+TSY8cMaKJuVnBeYJY4Rkhdle1U5JrA6Akx8a+6s5R1kggeG+gqWDR7Jg1qxgICs5PVAmLZsck+rjgHNFUMGs2cmzZwtgRnFeIFZ4RQNM0qhq6uWB2AQAF2VYO1PpGVymB4FRIGq3xFhoDzVj05kHr1/TQTV2gkZgao9CWj0fJTksEEJMiNIaa2H
20G5feRZ4pF51mGNnLkCQ6Em00BJJ1eIpthVgR9RkE4xBJ43D7MWq6GjDrTRRa8jFjO2lzWU6mZa/prkdGosRZiB4jtYF6NE2l0FaAS3YjcncI+pAk6FZ91AUaORCUyDfn4hykjySkOM2RZlqDbTiNDgos+cSJURuopzvix2vNotBciF41js6FCASDIAyeEaDVF0Kvk7GZ9QDkeSy8umVgNhSBYCwgSVAVOsp/rf91yoAp95Tx6bl3poweP138ZP3/0hbqAECv6Pm35feSo8sHQJUTvHzsDV48/GbquLdOu5aLCi9AUkcuxWlzrJEfrP050UQMgCyzmweWfAaH5BoxHQSCTFATruHH6/4XtXfavcxZxD0L7sKiDW70NMYa+PG6XxGKJdOy2w1Wbpx2Jf+340kAzDoTX1v+ObIU78hcgGDM06G288O1DxOI9gBg0Zv52vLP4ZGzU20kGba0buPRHY+nZPcu+igb6razpWFHSvbBWTexImcZqlglEowRhEvbCFDdHCDXY0lt28x6EqpGV090FLUSCAYnJkX4847H01ZrDncco74nmeJZkmB/+6GUsQMQS8T416HXkOTkPh2x9jRjB+DJ/S/SFR85V05J0Xjh8OspYwegPdTJflGHRzDOSMhR/rLzyZSxA3Csq476wMC06wB6vcLb1RtTxg6AP9pDo7+FLHMydjQUD/N2zToURQwDBMm01Jvqt6WMHYBgLMS6us0oSv8L06928bfdz6TvLJFm7AA8vucFOtT2YdVZIDgTxJtuBKhr8ZPtNKW2JUkix2WmoTUwiloJBIMT1+K0hQYaJsFYCEj23/ZBfm8KtJIgmb70+IFWH6qmEo4PlA8XKipNPa0D5G3B9lSdL4FgPBDT4mkTDH30xIKD7yBrtPS0DRB3hnw4TP0rQvX+ZkD4tAmSBk9DoHmAvM7fiCT1DxUjiSix4yaRAMLHFb49vt1gcoFgtBAGzwhQ0xxIM3gA3HYjDe0n+VgJBKOISbJwQeniNJkkSeTbcoFkKtzp3ikD9rt4wjIULem2mW3OwmFMj5XJsWbjMXqGSeuByKqOi8uWDZBXeqeKwG7BuMIsWbiobGmaTEKiwJY3oK0kSbSF2llaPG/Ab+VZZdT46lPbF5UuJZEQz4IgWYdnWdGCAfILSpbQFGlih28HBwL7MetNlLmK09rYDVaMuvR4nQnuYrJG8H0vEJwOYfCMAI3twZMYPD0n2UMgGEVUiasnXJo0YGSFXGs29y/5JF5dTqpJoamQT8//ME6jHaNi4OapVzE3e1bKDc4oGblr7vuY5ClFkiQqsifxodmrMDBySQs0TWNO9gxWTbsao86I02jnU/M/RJGoUSIYZ2gqXFKykismXYBO1uG1ePjCkk/g1ecMaNsab+Kh1T8hFA2xatrVmPUmrAYLH5x1M3m2HMx6M2adiffPuJEpLlFjRdDPRPsEPjLrFqx6Cxa9mQ/MvJksi4uH3vkJv9n2F/5n0yP896bf8fF572de/gwkSWKCq4Qsk4cvLb2bEmchkiQxO286d819P/qE6fQnFQhGCEnTzq8cLe3tgSHP3nq9dlpb/cOqTzyh8tmfvs3nb5mF7jhf6cP1Xeyv6eSB988d1vOPxDWOJqNxfV7v0LN8nUl/PFtdhu36ZY2wFkQn6dANkm1HkiQiUghVUzFLlrQUpm3xFr61+qfMy59BkTOfY5117Gzex7cv/DJuOWt49D0JsgwhLYgkSRg1Cyd75Y2XZ+VEPYfaH4e7L56K8XBvx4OO7iwLjR1tKJKCQTMOyJ4lK/CXA4+xpmYzkFzRuar8IjwmJ4XGElRVJUQQ0DCT+RTWmXw3joe/x6kYr/rLskSIIBazgXhE48ebfkm1Lz3J0r0LP8ZU5xQihNFjQFGTK/tRJUxUi2CVrEiJ0c2JdSZ9UfDeQGRpG2ZafSEcVkOasQPgthlp6QyNklYCwRBQJUxYT+rir2kaBs3U+//030LxMKqmsqVhJ1sadqbkkU
R0xNeVVRWMWEADTcQrCMYxOlnBpPX15YFoaDQF+uPWDrcf4+H2P3Jx2TJun1yMqmoYMQOgimdBMAh9fcRjsVMXbKE
"text/plain": [
"<Figure size 834.35x720 with 20 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.pairplot(data=iris.drop(columns=[\"Id\"]), hue=\"Species\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data splits\n",
"- ### Training set\n",
"  - Used to fit the model's parameters (e.g. the weights of a neural network).\n",
"  - During training, the algorithm minimizes a cost function computed on the training set.\n",
"- ### Validation set (also called the \"dev set\")\n",
"  - Used to compare models built with different hyperparameters (e.g. network architecture, number of training iterations).\n",
"  - Helps avoid overfitting the model to the training set, for example through early stopping.\n",
"- ### Test set\n",
"  - Used to evaluate the final model that was selected and trained with the training and validation sets."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data splits\n",
"- The training, validation and test sets should be independent of one another, but drawn from the same distribution.\n",
"- For classification, the class distribution should be similar across the sets.\n",
"- It is very important that the validation and test sets reflect our business goals and the real data the model will run on.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Splitting methods:\n",
"- Use a ready-made split of the data :)\n",
"- If we split the dataset ourselves:\n",
"  - The \"classic\" approach: a Train:Dev:Test ratio of 6:2:2 or 8:1:1\n",
"  - Deep learning:\n",
"    - \"deep\" methods are very data-hungry, with datasets on the order of > 1,000,000 examples\n",
"    - Suppose the whole dataset has 1,000,000 examples\n",
"    - we set the sizes of the dev and test sets in absolute terms, e.g. 1,000 or 10,000 examples each\n",
"    - 10,000 examples is (more than) enough, even though it is only 1% of the whole dataset\n",
"    - with an 8:1:1 split, dev and test would take 100,000 examples each; it would be a shame to \"waste\" the extra 180,000 examples on them when the training set could be larger instead\n"
]
},
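{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of the arithmetic on the previous slide, assuming a hypothetical dataset of 1,000,000 examples. The helper names `ratio_split_sizes` and `fixed_split_sizes` are made up for this illustration and do not come from any library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# Hypothetical helpers, written only to illustrate the slide above.\n",
"def ratio_split_sizes(n, ratio):\n",
"    # Sizes of (train, dev, test) for a proportional split such as (8, 1, 1).\n",
"    total = sum(ratio)\n",
"    dev = n * ratio[1] // total\n",
"    test = n * ratio[2] // total\n",
"    return n - dev - test, dev, test\n",
"\n",
"def fixed_split_sizes(n, dev, test):\n",
"    # Sizes of (train, dev, test) when dev and test have fixed absolute sizes.\n",
"    return n - dev - test, dev, test\n",
"\n",
"n = 1_000_000\n",
"proportional = ratio_split_sizes(n, (8, 1, 1))\n",
"fixed = fixed_split_sizes(n, 10_000, 10_000)\n",
"print(\"8:1:1 split:      \", proportional)\n",
"print(\"fixed dev/test:   \", fixed)\n",
"# Number of examples that move to the training set with fixed-size dev/test sets.\n",
"print(\"extra train rows: \", fixed[0] - proportional[0])"
]
},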
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Example split using standard Bash tools"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2021-03-15 11:16:36-- https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\n",
"Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n",
"Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n",
"HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable\n",
"\n",
" The file is already fully retrieved; nothing to do.\n",
"\n"
]
}
],
"source": [
"# Download the dataset file from the repository\n",
"!cd IUM_02; wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"151 IUM_02/iris.data\r\n"
]
}
],
"source": [
"# Check the size of the dataset\n",
"!wc -l IUM_02/iris.data"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.1,3.5,1.4,0.2,Iris-setosa\r\n",
"4.9,3.0,1.4,0.2,Iris-setosa\r\n",
"4.7,3.2,1.3,0.2,Iris-setosa\r\n",
"4.6,3.1,1.5,0.2,Iris-setosa\r\n",
"5.0,3.6,1.4,0.2,Iris-setosa\r\n"
]
}
],
"source": [
"# Check the structure\n",
"!head -n 5 IUM_02/iris.data"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1 \r\n",
" 50 Iris-setosa\r\n",
" 50 Iris-versicolor\r\n",
" 50 Iris-virginica\r\n"
]
}
],
"source": [
"# Check which classes there are and how many examples each one has:\n",
"!cut -f 5 -d \",\" IUM_02/iris.data | sort | uniq -c"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"151:\r\n"
]
}
],
"source": [
"# Find the empty line:\n",
"! grep -P \"^$\" -n IUM_02/iris.data"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 151 IUM_02/iris.data\r\n",
" 25 IUM_02/iris.data.dev\r\n",
" 25 IUM_02/iris.data.test\r\n",
" 100 IUM_02/iris.data.train\r\n",
" 301 total\r\n"
]
}
],
"source": [
"# Remove the empty last line and shuffle the file:\n",
"! head -n -1 IUM_02/iris.data | shuf > IUM_02/iris.data.shuf\n",
"# Split the dataset in a 4:1:1 ratio (100 train, 25 dev, 25 test)\n",
"!head -n 25 IUM_02/iris.data.shuf > IUM_02/iris.data.test\n",
"!head -n 50 IUM_02/iris.data.shuf | tail -n 25 > IUM_02/iris.data.dev\n",
"!tail -n +51 IUM_02/iris.data.shuf > IUM_02/iris.data.train\n",
"!rm IUM_02/iris.data.shuf\n",
"# Check that the sizes add up:\n",
"!wc -l IUM_02/iris.data*"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 33 Iris-setosa\r\n",
" 36 Iris-versicolor\r\n",
" 31 Iris-virginica\r\n"
]
}
],
"source": [
"!cut -f 5 -d \",\" IUM_02/iris.data.train | sort | uniq -c"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 7 Iris-setosa\r\n",
" 9 Iris-versicolor\r\n",
" 9 Iris-virginica\r\n"
]
}
],
"source": [
"!cut -f 5 -d \",\" IUM_02/iris.data.dev | sort | uniq -c"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 10 Iris-setosa\r\n",
" 5 Iris-versicolor\r\n",
" 10 Iris-virginica\r\n"
]
}
],
"source": [
"!cut -f 5 -d \",\" IUM_02/iris.data.test | sort | uniq -c"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Splitting with scikit-learn\n",
"- We can also split the data with the scikit-learn library: https://scikit-learn.org/"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-virginica 36\n",
"Iris-setosa 33\n",
"Iris-versicolor 31\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"iris_train, iris_test = train_test_split(iris, test_size=50, random_state=1)\n",
"iris_train[\"Species\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-versicolor 19\n",
"Iris-setosa 17\n",
"Iris-virginica 14\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_test[\"Species\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-setosa 34\n",
"Iris-virginica 33\n",
"Iris-versicolor 33\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"iris_train, iris_test = train_test_split(iris, test_size=50, random_state=1, stratify=iris[\"Species\"])\n",
"iris_train[\"Species\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Iris-virginica 17\n",
"Iris-versicolor 17\n",
"Iris-setosa 16\n",
"Name: Species, dtype: int64"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_test[\"Species\"].value_counts()"
]
},
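{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The calls above produce only two subsets. A minimal sketch of a three-way train/dev/test split, assuming the `iris` DataFrame loaded earlier in this notebook, is to call `train_test_split` twice. The 100/25/25 sizes mirror the Bash example; `iris_rest` and `iris_dev` are names introduced here only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# First split off the test set, keeping class proportions similar (stratify).\n",
"iris_rest, iris_test = train_test_split(\n",
"    iris, test_size=25, random_state=1, stratify=iris[\"Species\"])\n",
"# Then split the remainder into the training and dev sets.\n",
"iris_train, iris_dev = train_test_split(\n",
"    iris_rest, test_size=25, random_state=1, stratify=iris_rest[\"Species\"])\n",
"len(iris_train), len(iris_dev), len(iris_test)"
]
},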
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data preprocessing\n",
"- Cleaning\n",
"  - removing invalid examples from the dataset\n",
"  - correcting invalid values\n",
"- Normalization\n",
"  - Numerical data: scaling to a range, e.g. \\[0.0, 1.0\\] (https://scikit-learn.org/stable/modules/preprocessing.html)\n",
"  - Text data: lowercasing, unifying spelling variants, normalizing numerical expressions\n",
"  - Image data: normalizing resolution and color palette\n",
"  - Audio data: normalizing volume, resolution, sampling rate and number of channels\n",
"- Data augmentation\n",
"  - Generating new examples by adding noise or applying transformations to the original data\n",
"  - e.g. adding echo to an audio recording, adding noise to an image\n",
"  - changing feature values by relatively small random amounts\n",
"- Over/under-sampling\n",
"  - Learning algorithms and metrics can be sensitive to imbalanced classes in the dataset\n",
"  - E.g. if the dataset has 2 classes in a 9:1 ratio, a trivial \"classifier\" that always predicts the majority class will easily reach 90% accuracy\n",
"  - The simplest remedy: copy (or remove) some of the examples, enlarging (or shrinking) the given set"
]
},
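{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A minimal sketch of two of the steps above, assuming the `iris_train` and `iris_test` DataFrames from the earlier cells and the column names of this particular CSV. Scaling uses scikit-learn's `MinMaxScaler`; over-sampling is done naively with pandas by drawing rows with replacement. The names `feature_cols`, `train_scaled`, `test_scaled` and `balanced` are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"feature_cols = [\"SepalLengthCm\", \"SepalWidthCm\", \"PetalLengthCm\", \"PetalWidthCm\"]\n",
"\n",
"# Fit the scaler on the training set only and reuse it for the test set,\n",
"# so that no information from the test set leaks into preprocessing.\n",
"# Test values may therefore fall slightly outside [0.0, 1.0].\n",
"scaler = MinMaxScaler()\n",
"train_scaled = iris_train.copy()\n",
"test_scaled = iris_test.copy()\n",
"train_scaled[feature_cols] = scaler.fit_transform(iris_train[feature_cols])\n",
"test_scaled[feature_cols] = scaler.transform(iris_test[feature_cols])\n",
"\n",
"# Naive over-sampling: draw rows with replacement until every class\n",
"# has as many examples as the largest class in the training set.\n",
"target = train_scaled[\"Species\"].value_counts().max()\n",
"balanced = pd.concat([\n",
"    group.sample(n=target, replace=True, random_state=1)\n",
"    for _, group in train_scaled.groupby(\"Species\")\n",
"])\n",
"balanced[\"Species\"].value_counts()"
]
},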
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Assignment [5 pts]\n",
"- Choose one of the publicly available datasets. You will work with it until the end of the year (you can of course change the dataset along the way, but that will mean repeating some steps, which are not very costly, but still).\n",
"- The dataset should be:\n",
"  - not too large (max 50 MB)\n",
"  - not too small (e.g. IRIS is too small ;))\n",
"  - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you have chosen here: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n",
"  - preferably a dataset with data in text form that can be used for a classification or regression task; such a dataset is easier to work with than, say, a set of images or audio recordings. This will let you focus on the substance of this course.\n",
"\n",
"- Write a script that:\n",
"1. Downloads the dataset you have chosen\n",
"2. If the dataset has no ready-made train/dev/test split, creates one\n",
"3. Collects and prints statistics for the dataset and its subsets, such as:\n",
"    - the size of the dataset and its subsets\n",
"    - the mean, minimum, maximum, standard deviation and median of the individual parameters\n",
"    - the frequency distribution of examples across classes\n",
"4. Normalizes the data in the dataset (e.g. scaling float values to the range 0.0 - 1.0)\n",
"5. Cleans the dataset of artifacts (e.g. empty lines, examples with invalid values)\n",
"\n",
"- You can write the script in your favorite language. The only restriction: it has to run on Linux\n",
"- It will be convenient to create a Jupyter notebook for this (although it is not required)\n",
"- On the faculty git server (http://git.wmi.amu.edu.pl/), create a repository named \"ium_nrindeksu\" (where nrindeksu is your student ID number) and put the script in it\n",
"- Paste the link to the repository into the spreadsheet with the datasets: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Bibliography\n",
" - https://www.coursera.org/learn/machine-learning-projects \n",
" - https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf\n"
]
}
],
"metadata": {
"author": "Tomasz Ziętkiewicz",
"celltoolbar": "Slideshow",
"email": "tomasz.zietkiewicz@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
},
"slideshow": {
"slide_type": "slide"
},
"subtitle": "2.Dane[laboratoria]",
"title": "Inżynieria uczenia maszynowego",
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": false,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": false,
"toc_window_display": false
},
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}