{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data Analysis in Python: `pandas`\n", "\n" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### `pandas`\n", "The `pandas` library is the core data-analysis tool in the Python ecosystem:\n", " * it provides two basic data types: \n", " * `Series` (a series, 1D)\n", " * `DataFrame` (a data frame, 2D)\n", " * it defines operations on these objects, such as handling missing values and joining data;\n", " * it handles data of various kinds, e.g. time series;\n", " * it is built on `numpy`, a library for numerical computing;\n", " * it also supports simple data visualization;\n", " * it covers ETL: extract, transform, load." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "To import the `pandas` library, it is enough to run:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### __Exercise 0__: check whether the `pandas` library is installed." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (`pd.Series`)\n", "\n", " A series represents homogeneous one-dimensional data; it is the counterpart of a vector in R.\n", " * A series can be created in several ways (more on that shortly), e.g. from objects such as lists and dictionaries.\n", " * The data must be homogeneous; otherwise it is converted automatically to a common type.\n", " * When creating a series we normally pass the `data` argument (the values); as the signature below shows, it defaults to `None`.\n", " * We can also pass an index (`index`), a data type (`dtype`) or a name (`name`).\n", " \n", " \n", " ```\n", " class pandas.Series(data=None, index=None, dtype=None, name=None)\n", " ```" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When creating a series, the data can be given as a list or as a dictionary.\n", "\n", "The example below creates a series from data stored in a list:" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 211819\n", "1 682758\n", "2 737011\n", "3 779511\n", "4 673790\n", "5 673790\n", "6 444177\n", "7 136791\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "data = [211819, 682758, 737011, 779511, 673790, 673790, 444177, 136791]\n", "\n", "s = pd.Series(data)\n", "\n", "s" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When the data comes from a list and no index is given, pandas adds an automatic integer index starting at 0." 
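, "\n", "\n", "The automatic type conversion and the default index described above can be observed directly. The following is only a minimal sketch; the variable names `ints`, `mixed_numbers` and `mixed_types` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Integers only: pandas keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# Integers mixed with a float: every value is upcast to float64.
mixed_numbers = pd.Series([1, 2.5, 3])

# Numbers mixed with text: pandas falls back to the generic object dtype.
mixed_types = pd.Series([1, 'two', 3.0])

print(ints.dtype, mixed_numbers.dtype, mixed_types.dtype)

# Without an explicit index the labels are consecutive integers starting at 0.
print(list(ints.index))
```

Checking `dtype` right after creating a series is a cheap way to notice that the input was not as homogeneous as expected."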
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When a dictionary is passed as the data of a series, pandas uses its keys to build the index:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "s" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When creating a series we can also set the index, the data type and the name of the series:" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819.0\n", "May 682758.0\n", "June 737011.0\n", "July 779511.0\n", "Name: Rides, dtype: float64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "months = ['April', 'May', 'June', 'July']\n", "\n", "data = [211819, 682758, 737011, 779511]\n", "\n", "s = pd.Series(data=data, index=months, dtype=float, name='Rides')\n", "\n", "s" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "An individual element is accessed by its key from the index." 
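, "\n", "\n", "Because a lookup with a key that is not in the index raises `KeyError`, it can help to test for the key first or to ask for a default. A small sketch, assuming a hypothetical series named `rides` that does not appear in the original notebook:

```python
import pandas as pd

rides = pd.Series({'April': 211819, 'May': 682758})

# Membership tests look at the index (the keys), not at the values.
print('April' in rides)
print('December' in rides)

# get() returns the given default instead of raising KeyError for a missing key.
print(rides.get('December', 0))
```

This mirrors how `in` and `get` behave for an ordinary Python dictionary."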
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "211819\n" ] }, { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "August 673790\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "print(s['April'])\n", "\n", "s['August'] = 673790\n", "s" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "An element is added to a series by assigning a value to a new key:" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "August 673790\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "s['August'] = 673790\n", "\n", "s" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Indexing of series is covered in more detail later in the course." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "A key feature of a series is that operations are vectorized. When two series are added, this works as follows:\n", " * when both series contain the same key, their values are added;\n", " * otherwise the value for that key in the resulting series is `NaN`.\n", " * Equivalently, we can use the `pandas.Series.add` method, which also lets us pass a default value (`fill_value`) to use when a key is missing." 
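, "\n", "\n", "The default value mentioned in the last bullet can be sketched as follows. The two short series below are a trimmed-down, illustrative version of the data used in this notebook, and `plain` and `filled` are hypothetical names.

```python
import pandas as pd

members = pd.Series({'September': 673790, 'October': 444177})
occasionals = pd.Series({'September': 140492})

# Plain addition: October has no partner in occasionals, so the result is NaN.
plain = members + occasionals

# fill_value=0 treats the missing October entry as 0 before adding.
filled = members.add(occasionals, fill_value=0)

print(plain)
print(filled)
```

Whether 0 is a sensible default depends on the data; here it reads a missing month as no rides rather than as an unknown value."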
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "August 880599.0\n", "July 973827.0\n", "June 908505.0\n", "May 830656.0\n", "October NaN\n", "September 814282.0\n", "dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'August': 673790, 'July': 779511,\n", "'September': 673790, 'October': 444177})\n", "\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n", "'September': 140492})\n", "\n", "all_data = members + occasionals\n", "# Equivalently\n", "all_data = members.add(occasionals)\n", "all_data" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We can also perform arithmetic between a series and a scalar:" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "May 683758\n", "June 738011\n", "July 780511\n", "August 674790\n", "September 674790\n", "October 445177\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n", "'September': 673790, 'October': 444177})\n", "\n", "members += 1000\n", "\n", "members" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "May 683758\n", "June 738011\n", "July 780511\n", "August 674790\n", "September 674790\n", "October 445177\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "May 10000000000683758\n", "June 10000000000738011\n", "July 10000000000780511\n", 
"August 10000000000674790\n", "September 10000000000674790\n", "October 10000000000445177\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members + 10000000000000000" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Podsumowanie\n", " * Szeregi działają podobnie do słowników, z tą różnicą, że wartości muszą być jednorodne (tego samego typu).\n", " * Odwołanie do poszczególnych elementów odbywa się poprzez nawiasy `[]` i podanie klucza.\n", " * W przeciwieństwie do słowników, możemy w prosty sposób wykonywać operacje arytmetyczne." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie 1\n", " * Stwórz szereg `n`, który będzie zawierać liczby od 0 do 10 (włącznie).\n", " * Stwórz szereg `n2`, który będzie zawierać kwadraty liczb od 0 do 10 (włącznie).\n", " * Następnie stwórz szereg `trojkatne`, który będzie sumą powyższych szeregów podzieloną przez 2." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "n = list(range(10 + 1))" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "n = pd.Series(n)" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 3\n", "4 4\n", "5 5\n", "6 6\n", "7 7\n", "8 8\n", "9 9\n", "10 10\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "n2 = n**2" ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 4\n", "3 9\n", "4 16\n", "5 25\n", "6 36\n", "7 49\n", "8 64\n", "9 81\n", "10 100\n", "dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n2" ] },
{ "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "trojkatne = (n + n2) / 2" ] },
{ "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 1.0\n", "2 3.0\n", "3 6.0\n", "4 10.0\n", "5 15.0\n", "6 21.0\n", "7 28.0\n", "8 36.0\n", "9 45.0\n", "10 55.0\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trojkatne" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### [Data frame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) (`pd.DataFrame`)\n", "\n", "A data frame is the basic data structure in the `pandas` library for storing and representing tabular (two-dimensional) data.\n", " * It has columns (features) and rows (observations, examples).\n", " * We can also think of it as a dictionary whose values are series.\n", "\n", "```\n", "class pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n", "```\n", " " ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "A data frame can be created in several ways.\n", "\n", "The first one (\"column-wise\") defines the frame by passing series as its columns:" ] },
{ "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "May 682758\n", "June 737011\n", "July 779511\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members" ] },
{ "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "May 147898\n", "June 171494\n", "July 194316\n", "dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "occasionals" ] },
{ "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " members occasionals\n", "July 779511.0 194316.0\n", "June 737011.0 171494.0\n", "May NaN 147898.0\n", "Maydfdsgfdg 682758.0 NaN" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'Maydfdsgfdg': 682758, 'June': 737011, 'July': 779511})\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Drugim popularnym sposobem jest przekazanie listy słowników. Wtedy `pandas` zinterpretuje to jako listę przykładów:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " members occasionals\n", "0 682758 147898\n", "1 737011 171494\n", "2 779511 194316" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [\n", " {'members': 682758, 'occasionals': 147898},\n", " {'occasionals': 171494,'members': 737011},\n", " {'members': 779511, 'occasionals': 194316},\n", "]\n", "\n", "df = pd.DataFrame(data)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Możemy też wykorzystać metodę `from_dict` ([doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html)), która pozwala zdefiniować czy podane dane są w podane w postaci kolumnowej lub wierszowej:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index\n", " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "\n", "columns\n", " May June July\n", "members 682758 737011 779511\n", "occasionals 147898 171494 194316\n" ] } ], "source": [ "data = {\n", " 'May': {'members': 682758, 'occasionals': 147898},\n", " 'June': {'members': 737011, 'occasionals': 171494},\n", " 'July': {'members': 779511, 'occasionals': 194316}\n", "}\n", "\n", "df = pd.DataFrame.from_dict(data, orient='index')\n", "print('index\\n', df)\n", "print()\n", "df = pd.DataFrame.from_dict(data, orient='columns')\n", "print('columns\\n', df)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wczytywanie danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Biblioteka `pandas` pozwala na wczytanie i zapis danych z różnych formatów:\n", " * formaty tekstowe, np. 
`csv`, `json`\n", " * spreadsheet files: Excel (xls, xlsx)\n", " * databases\n", " * other: `sas`, `spss`\n", "\n", "\n", "Loading data produces a corresponding data frame (`DataFrame`)." ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "One of the simplest data formats is `csv`, in which consecutive values are separated by commas.\n", "\n", "Data in this format is loaded with the `pandas.read_csv` function.\n", "\n", "The function accepts many parameters (e.g. the separator or the quote character). See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details." ] },
{ "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Country female_BMI male_BMI gdp population \\\n", "0 Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "1 Albania 25.65726 26.44657 8644.0 2968026.0 \n", "2 Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "3 Angola 23.48431 22.25083 7103.0 19842251.0 \n", "4 Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", ".. ... ... ... ... ... \n", "170 Venezuela 28.13408 27.44500 17911.0 28116716.0 \n", "171 Vietnam 21.06500 20.91630 4085.0 86589342.0 \n", "172 Palestine 29.02643 26.57750 3564.0 3854667.0 \n", "173 Zambia 23.05436 20.68321 3039.0 13114579.0 \n", "174 Zimbabwe 24.64522 22.02660 1286.0 13495462.0 \n", "\n", " under5mortality life_expectancy fertility \n", "0 110.4 52.8 6.20 \n", "1 17.9 76.8 1.76 \n", "2 29.5 75.5 2.73 \n", "3 192.0 56.7 6.43 \n", "4 10.9 75.5 2.16 \n", ".. ... ... ... \n", "170 17.1 74.2 2.53 \n", "171 26.2 74.1 1.86 \n", "172 24.7 74.1 4.38 \n", "173 94.9 51.1 5.88 \n", "174 98.3 47.3 3.85 \n", "\n", "[175 rows x 8 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('gapminder.csv')\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38 \n", "3 Heikkinen\\t Miss. Laina female 26 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35 \n", "5 Allen\\t Mr. William Henry male 35 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', delimiter='\\t', index_col=0, nrows=5)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Do wczytania danych z arkusza kalkulacyjnego służy funkcja `pandas.read_excel`. Do otworzenia pliku `xlsx` może być koniecnze ustawienie parametru: `engine='openpyxl`. Więcej opcji w [dokumentacji](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_excel('./bikes.xlsx', engine='openpyxl', nrows=5)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Innym ważnym źródłem informacji są bazy danych. Pandas potrafi komunikować się z bazą danych za pomocą biblioteki [SQLAlchemy](https://pypi.org/project/SQLAlchemy/) i dostarcza odpowiedną funkcję:\n", " * `pandas.read_sql` - wczytanie całej tabeli lub zapytania do bazy danych" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title ArtistId\n", "AlbumId \n", "1 For Those About To Rock We Salute You 1\n", "2 Balls to the Wall 2\n", "3 Restless and Wild 2\n", "4 Let There Be Rock 1\n", "5 Big Ones 3\n", "... ... ...\n", "343 Respighi:Pines of Rome 226\n", "344 Schubert: The Late String Quartets & String Qu... 272\n", "345 Monteverdi: L'Orfeo 273\n", "346 Mozart: Chamber Music 274\n", "347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n", "\n", "[347 rows x 2 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql('Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title ArtistId\n", "AlbumId \n", "1 For Those About To Rock We Salute You 1\n", "2 Balls to the Wall 2\n", "3 Restless and Wild 2\n", "4 Let There Be Rock 1\n", "5 Big Ones 3\n", "... ... ...\n", "343 Respighi:Pines of Rome 226\n", "344 Schubert: The Late String Quartets & String Qu... 272\n", "345 Monteverdi: L'Orfeo 273\n", "346 Mozart: Chamber Music 274\n", "347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n", "\n", "[347 rows x 2 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlalchemy\n", "\n", "engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite', echo=True)\n", "connection = engine.raw_connection()\n", "\n", "df = pd.read_sql('SELECT * FROM Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Podsumowanie\n", "\n", "\n", " * Biblioteka `pandas` wspiera pobieranie danych z różnych formatów i źródeł.\n", " * Każda funkcja ma listę argumentów, które pozwalają na ustawić poszczególne parametry (np. [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zapis i eksport danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas pozwala w prosty sposób na zapisywanie ramki danych do pliku. 
" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# zapis do formatu CSV\n", "df.to_csv('tmp.csv')\n", "# zapis do arkusza kalkulacyjnego \n", "df.to_excel('tmp.xlsx')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Ponadto możemy przekonwertować ramkę danych do JSONa lub Pythonowego słownika:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"members\":{\"May\":682758,\"June\":737011,\"July\":779511},\"occasionals\":{\"May\":147898,\"June\":171494,\"July\":194316}}\n" ] } ], "source": [ "print(df.to_json())" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'members': {'May': 682758, 'June': 737011, 'July': 779511}, 'occasionals': {'May': 147898, 'June': 171494, 'July': 194316}}\n" ] } ], "source": [ "print(df.to_dict())\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Lub przekopiować dane do schowka:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.to_clipboard()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie\n", "\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": 
"slide" } }, "source": [ " * Przekonwertuj tabele `Customer` z bazy `Chinook.sqlite` do arkusza kalkulacyjnego. Plik wynikowy nazwij `customers.xlsx`.\n", " * Tabela `Employee` zawiera informacje o pracownikach firmy Chinook. Wyswietl dane na ekranie i podaj miasta, w których mieszkają pracownicy.\n", " * Tabela `Invoice` zawiera informacje o fakturach. Przekonwertuj kolumnę `BillingCountry` do pythonowego słownika, a następnie podaj najcześciej występującą wartość. Ile razy pojawiła się?\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Ramka danych - podstawy" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Kolumny\n", "\n", "Na ramkę danych możemy patrzeć jak na swego rodzaju słownik, którego wartościami są szeregi. Pozwoli to na uzyskanie lepszej intuicji.\n", "\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gdp population life_expectancy\n", "Country \n", "Afghanistan 1311.0 26528741.0 52.8\n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7\n", "Antigua and Barbuda 25736.0 85350.0 75.5\n", "Argentina 14646.0 40381860.0 75.4\n", "Armenia 7383.0 2975029.0 72.3\n", "Australia 41312.0 21370348.0 81.6" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=8, usecols=['Country', 'gdp', 'population','life_expectancy'])\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dostęp do poszczególnej kolumny możemy uzystać na dwa sposoby:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan 26528741.0\n", "Albania 2968026.0\n", "Algeria 34811059.0\n", "Angola 19842251.0\n", "Antigua and Barbuda 85350.0\n", "Argentina 40381860.0\n", "Armenia 2975029.0\n", "Australia 21370348.0\n", "Name: population, dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# notacja z kropką\n", "df.population" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan 26528741.0\n", "Albania 2968026.0\n", "Algeria 34811059.0\n", "Angola 19842251.0\n", "Antigua and Barbuda 85350.0\n", "Argentina 40381860.0\n", "Armenia 2975029.0\n", "Australia 21370348.0\n", "Name: population, dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Operator []\n", "df['population']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Do operatora `[]` możemy też podać listę nazw kolumn:" 
] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gdppopulation
Country
Afghanistan1311.026528741.0
Albania8644.02968026.0
Algeria12314.034811059.0
Angola7103.019842251.0
Antigua and Barbuda25736.085350.0
Argentina14646.040381860.0
Armenia7383.02975029.0
Australia41312.021370348.0
\n", "
" ], "text/plain": [ " gdp population\n", "Country \n", "Afghanistan 1311.0 26528741.0\n", "Albania 8644.0 2968026.0\n", "Algeria 12314.0 34811059.0\n", "Angola 7103.0 19842251.0\n", "Antigua and Barbuda 25736.0 85350.0\n", "Argentina 14646.0 40381860.0\n", "Armenia 7383.0 2975029.0\n", "Australia 41312.0 21370348.0" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['gdp','population']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Listę kolumn możemy pobrać za pomocą:" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Index(['gdp', 'population', 'life_expectancy'], dtype='object')" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gdppopulationlife_expectancy
Country
Afghanistan1311.026528741.052.8
Albania8644.02968026.076.8
Algeria12314.034811059.075.5
Angola7103.019842251.056.7
Antigua and Barbuda25736.085350.075.5
Argentina14646.040381860.075.4
Armenia7383.02975029.072.3
Australia41312.021370348.081.6
\n", "
" ], "text/plain": [ " gdp population life_expectancy\n", "Country \n", "Afghanistan 1311.0 26528741.0 52.8\n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7\n", "Antigua and Barbuda 25736.0 85350.0 75.5\n", "Argentina 14646.0 40381860.0 75.4\n", "Armenia 7383.0 2975029.0 72.3\n", "Australia 41312.0 21370348.0 81.6" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Afghanistan1311.026528741.052.8
Albania8644.02968026.076.8
Algeria12314.034811059.075.5
Angola7103.019842251.056.7
Antigua and Barbuda25736.085350.075.5
Argentina14646.040381860.075.4
Armenia7383.02975029.072.3
Australia41312.021370348.081.6
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Afghanistan 1311.0 26528741.0 52.8\n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7\n", "Antigua and Barbuda 25736.0 85350.0 75.5\n", "Argentina 14646.0 40381860.0 75.4\n", "Armenia 7383.0 2975029.0 72.3\n", "Australia 41312.0 21370348.0 81.6" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns = ['PKB', 'Populacja', 'ODŻ']\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby odwołać się do poszczególnych wierszy należy wykorzystać metodę `loc`:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PKB 14646.0\n", "Populacja 40381860.0\n", "ODŻ 75.4\n", "Name: Argentina, dtype: float64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['Argentina']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Metoda `loc` również może przyjąć listę wierszy: " ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Albania8644.02968026.076.8
Angola7103.019842251.056.7
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Albania 8644.0 2968026.0 76.8\n", "Angola 7103.0 19842251.0 56.7" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[['Albania', 'Angola']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Możemy również podać drugi parametr: nazwy kolumn:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacja
Country
Albania8644.02968026.0
Angola7103.019842251.0
\n", "
" ], "text/plain": [ " PKB Populacja\n", "Country \n", "Albania 8644.0 2968026.0\n", "Angola 7103.0 19842251.0" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df.loc[['Albania', 'Angola'], ['PKB', 'Populacja']]\n", "\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Albo wykorzystać tzw. _slicing_, cyzli operator `:`:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Albania8644.02968026.076.8
Algeria12314.034811059.075.5
Angola7103.019842251.056.7
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['Albania': 'Angola', 'PKB': 'ODŻ']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby odwołać się do pojedyńczej wartości możemy użyć metody `at`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.at['Angola', 'PKB']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Dostęp do indeksu:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda',\n", " 'Argentina', 'Armenia', 'Australia'],\n", " dtype='object', name='Country')" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Podstawowe metody `pd.Series` i `pd.DataFrame`" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n", "'September': 673790, 'October': 444177})\n", "\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n", "'September': 140492, 'October': 53596})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `head` pozwala tworzy nową ramkę danych z pierwszymi 5 przykładami:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `tail` robi to samo, ale z 5 ostatnymi przykładami:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `sample` pozwala na stworzenie nowej ramki danych z wylosowanymi `n` przykładami:" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
September673790140492
May682758147898
June737011171494
\n", "
" ], "text/plain": [ " members occasionals\n", "September 673790 140492\n", "May 682758 147898\n", "June 737011 171494" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `describe` zwraca podstawowe statystyki m.in.: liczebność, średnią, wartości skrajne: " ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
count6.0000006.000000
mean665172.833333152434.166667
std116216.04545654783.506738
min444177.00000053596.000000
25%673790.000000142343.500000
50%678274.000000159696.000000
75%723447.750000188610.500000
max779511.000000206809.000000
\n", "
" ], "text/plain": [ " members occasionals\n", "count 6.000000 6.000000\n", "mean 665172.833333 152434.166667\n", "std 116216.045456 54783.506738\n", "min 444177.000000 53596.000000\n", "25% 673790.000000 142343.500000\n", "50% 678274.000000 159696.000000\n", "75% 723447.750000 188610.500000\n", "max 779511.000000 206809.000000" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `info` zwraca informacje techniczne o kolumnach: np. typ danych:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6 entries, May to October\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 members 6 non-null int64\n", " 1 occasionals 6 non-null int64\n", "dtypes: int64(2)\n", "memory usage: 144.0+ bytes\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Podstawową informacją o ramce danych to liczba przykładów w ramce danych. 
Możemy wykorzystać to tego funkcję `len`:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Natomiast atrybut `shape` zwraca nam krotkę z liczbą przykładów i liczbą kolumn:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(6, 2)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operacja arytmetyczne\n", "\n", " * `max`, `idxmax`\n", " * `min`, `idxmin`\n", " * `mean`\n", " * `count`" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Zbiór wartości i zliczanie wartości:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 3 2]\n", "3 4\n", "1 3\n", "2 3\n", "Name: count, dtype: int64\n" ] } ], "source": [ "dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n", "\n", "print(dane.unique())\n", "\n", "dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n", "\n", "print(dane.value_counts())" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 3\n", "2 2\n", "3 3\n", "4 1\n", "5 1\n", "6 2\n", "7 3\n", "8 2\n", "9 3\n", "dtype: int64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dane" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3 4\n", "1 3\n", "2 3\n", "Name: count, dtype: int64\n" ] } ], "source": [ "print(dane.value_counts())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Sprawdzanie czy brakuje danych:" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", " ... 
\n", "887 False\n", "888 False\n", "889 True\n", "890 False\n", "891 False\n", "Name: Age, Length: 891, dtype: bool" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "df.Age.isnull()\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dodawanie i modyfikowanie danych" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "conts = pd.Series({\n", " 'Afghanistan': 'Asia', 'Albania': 'Europe', 'Algeria':' Africa', 'Angola': 'Africa', 'Antigua and Barbuda': 'Americas'})\n", "\n", "df['continent'] = conts\n", "\n", "df['tmp'] = 1\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.loc['Argentina'] = {\n", " 'female_BMI': 27.46523,\n", " 'male_BMI': 27.5017,\n", " 'gdp': 14646.0,\n", " 'population': 40381860.0,\n", " 'under5mortality': 15.4,\n", " 'life_expectancy': 75.4,\n", " 'fertility': 2.24\n", "}\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.drop('gdp', axis='columns')\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Filtrowanie danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Biblioteka pandas posiada 2 sposoby na filtrowanie danych zawartych w ramce danych:\n", " * operator `[]` -- najbardziej rozpowszechniony;\n", " * metoda `query()`.\n", "Oba sposoby mają różną 
składnię.\n", " " ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "scrolled": true, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund\\t Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
313Heikkinen\\t Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen\\t Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "3 Heikkinen\\t Miss. Laina female 26.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen\\t Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 0\n", "2 1\n", "3 1\n", "4 1\n", "5 0\n", " ..\n", "887 0\n", "888 1\n", "889 0\n", "890 1\n", "891 0\n", "Name: Survived, Length: 891, dtype: int64" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Survived']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df['Survived']" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 False\n", "2 True\n", "3 True\n", "4 True\n", "5 False\n", " ... 
\n", "887 False\n", "888 True\n", "889 False\n", "890 True\n", "891 False\n", "Name: Survived, Length: 891, dtype: bool" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Survived'] == 1" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df_survived = df[df['Pclass'] == 1]" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "891" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "216" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_survived)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
701McCarthy\\t Mr. Timothy Jmale54.0001746351.8625E46S
1211Bonnell\\t Miss. Elizabethfemale58.00011378326.5500C103S
2411Sloper\\t Mr. William Thompsonmale28.00011378835.5000A6S
....................................
87211Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
87301Carlsson\\t Mr. Frans Olofmale33.0006955.0000B51 B53 B55S
88011Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88811Graham\\t Miss. Margaret Edithfemale19.00011205330.0000B42S
89011Behr\\t Mr. Karl Howellmale26.00011136930.0000C148C
\n", "

216 rows × 11 columns

\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "7 0 1 \n", "12 1 1 \n", "24 1 1 \n", "... ... ... \n", "872 1 1 \n", "873 0 1 \n", "880 1 1 \n", "888 1 1 \n", "890 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "7 McCarthy\\t Mr. Timothy J male 54.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "24 Sloper\\t Mr. William Thompson male 28.0 \n", "... ... ... ... \n", "872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n", "873 Carlsson\\t Mr. Frans Olof male 33.0 \n", "880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n", "888 Graham\\t Miss. Margaret Edith female 19.0 \n", "890 Behr\\t Mr. Karl Howell male 26.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "7 0 0 17463 51.8625 E46 S \n", "12 0 0 113783 26.5500 C103 S \n", "24 0 0 113788 35.5000 A6 S \n", "... ... ... ... ... ... ... \n", "872 1 1 11751 52.5542 D35 S \n", "873 0 0 695 5.0000 B51 B53 B55 S \n", "880 0 1 11767 83.1583 C50 C \n", "888 0 0 112053 30.0000 B42 S \n", "890 0 0 111369 30.0000 C148 C \n", "\n", "[216 rows x 11 columns]" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_survived" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operatory\n", "\n", "* `&` - koniukcja (i)\n", "* `|` - alternatywa (lub)\n", "* `~` - negacja (nie)\n", "* `()` - jeżeli mamy kilka warunków to warto je uporządkować w nawiasy" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
1211Bonnell\\t Miss. Elizabethfemale58.00011378326.5500C103S
3211Spencer\\t Mrs. William Augustus (Marie Eugenie)femaleNaN10PC 17569146.5208B78C
5311Harper\\t Mrs. Henry Sleeper (Myna Haxtun)female49.010PC 1757276.7292D33C
....................................
85711Wick\\t Mrs. George Dennick (Mary Hitchcock)female45.01136928164.8667NaNS
86311Swift\\t Mrs. Frederick Joel (Margaret Welles B...female48.0001746625.9292D17S
87211Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
88011Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88811Graham\\t Miss. Margaret Edithfemale19.00011205330.0000B42S
\n", "

94 rows × 11 columns

\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "12 1 1 \n", "32 1 1 \n", "53 1 1 \n", "... ... ... \n", "857 1 1 \n", "863 1 1 \n", "872 1 1 \n", "880 1 1 \n", "888 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n", "53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n", "... ... ... ... \n", "857 Wick\\t Mrs. George Dennick (Mary Hitchcock) female 45.0 \n", "863 Swift\\t Mrs. Frederick Joel (Margaret Welles B... female 48.0 \n", "872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n", "880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n", "888 Graham\\t Miss. Margaret Edith female 19.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "12 0 0 113783 26.5500 C103 S \n", "32 1 0 PC 17569 146.5208 B78 C \n", "53 1 0 PC 17572 76.7292 D33 C \n", "... ... ... ... ... ... ... \n", "857 1 1 36928 164.8667 NaN S \n", "863 0 0 17466 25.9292 D17 S \n", "872 1 1 11751 52.5542 D35 S \n", "880 0 1 11767 83.1583 C50 C \n", "888 0 0 112053 30.0000 B42 S \n", "\n", "[94 rows x 11 columns]" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pierwsza_klasa = df['Pclass'] == 1\n", "kobiety = df['Sex'] == 'female'\n", "\n", "df[pierwsza_klasa & kobiety]\n" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund\\t Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
803Palsson\\t Master. Gosta Leonardmale2.03134990921.0750NaNS
1012Nasser\\t Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
....................................
86103Hansen\\t Mr. Claus Petermale41.02035002614.1083NaNS
86202Giles\\t Mr. Frederick Edwardmale21.0102813411.5000NaNS
86403Sage\\t Miss. Dorothy Edith \"Dolly\"femaleNaN82CA. 234369.5500NaNS
86712Duran y More\\t Miss. Asuncionfemale27.010SC/PARIS 214913.8583NaNC
87512Abelson\\t Mrs. Samuel (Hannah Wizosky)female28.010P/PP 338124.0000NaNC
\n", "

192 rows × 11 columns

\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "4 1 1 \n", "8 0 3 \n", "10 1 2 \n", "... ... ... \n", "861 0 3 \n", "862 0 2 \n", "864 0 3 \n", "867 1 2 \n", "875 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "8 Palsson\\t Master. Gosta Leonard male 2.0 \n", "10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n", "... ... ... ... \n", "861 Hansen\\t Mr. Claus Peter male 41.0 \n", "862 Giles\\t Mr. Frederick Edward male 21.0 \n", "864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n", "867 Duran y More\\t Miss. Asuncion female 27.0 \n", "875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "8 3 1 349909 21.0750 NaN S \n", "10 1 0 237736 30.0708 NaN C \n", "... ... ... ... ... ... ... \n", "861 2 0 350026 14.1083 NaN S \n", "862 1 0 28134 11.5000 NaN S \n", "864 8 2 CA. 2343 69.5500 NaN S \n", "867 1 0 SC/PARIS 2149 13.8583 NaN C \n", "875 1 0 P/PP 3381 24.0000 NaN C \n", "\n", "[192 rows x 11 columns]" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['SibSp'] > df['Parch']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### `pd.DataFrame.query`\n", "\n", "Innym sposobem na filtrowanie danych jest metoda `query`, która jako argument przyjmuje wyrażenie:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
701McCarthy\\t Mr. Timothy Jmale54.0001746351.8625E46S
1211Bonnell\\t Miss. Elizabethfemale58.00011378326.5500C103S
2411Sloper\\t Mr. William Thompsonmale28.00011378835.5000A6S
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "7 0 1 \n", "12 1 1 \n", "24 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "7 McCarthy\\t Mr. Timothy J male 54.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "24 Sloper\\t Mr. William Thompson male 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "7 0 0 17463 51.8625 E46 S \n", "12 0 0 113783 26.5500 C103 S \n", "24 0 0 113788 35.5000 A6 S " ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('Pclass == 1').head()" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
1211Bonnell\\t Miss. Elizabethfemale58.00011378326.5500C103S
3211Spencer\\t Mrs. William Augustus (Marie Eugenie)femaleNaN10PC 17569146.5208B78C
5311Harper\\t Mrs. Henry Sleeper (Myna Haxtun)female49.010PC 1757276.7292D33C
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "12 1 1 \n", "32 1 1 \n", "53 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n", "53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "12 0 0 113783 26.5500 C103 S \n", "32 1 0 PC 17569 146.5208 B78 C \n", "53 1 0 PC 17572 76.7292 D33 C " ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('(Pclass == 1) and (Sex == \"female\")').head()" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund\\t Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
803Palsson\\t Master. Gosta Leonardmale2.03134990921.0750NaNS
1012Nasser\\t Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
....................................
86103Hansen\\t Mr. Claus Petermale41.02035002614.1083NaNS
86202Giles\\t Mr. Frederick Edwardmale21.0102813411.5000NaNS
86403Sage\\t Miss. Dorothy Edith \"Dolly\"femaleNaN82CA. 234369.5500NaNS
86712Duran y More\\t Miss. Asuncionfemale27.010SC/PARIS 214913.8583NaNC
87512Abelson\\t Mrs. Samuel (Hannah Wizosky)female28.010P/PP 338124.0000NaNC
\n", "

192 rows × 11 columns

\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "4 1 1 \n", "8 0 3 \n", "10 1 2 \n", "... ... ... \n", "861 0 3 \n", "862 0 2 \n", "864 0 3 \n", "867 1 2 \n", "875 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "8 Palsson\\t Master. Gosta Leonard male 2.0 \n", "10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n", "... ... ... ... \n", "861 Hansen\\t Mr. Claus Peter male 41.0 \n", "862 Giles\\t Mr. Frederick Edward male 21.0 \n", "864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n", "867 Duran y More\\t Miss. Asuncion female 27.0 \n", "875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "8 3 1 349909 21.0750 NaN S \n", "10 1 0 237736 30.0708 NaN C \n", "... ... ... ... ... ... ... \n", "861 2 0 350026 14.1083 NaN S \n", "862 1 0 28134 11.5000 NaN S \n", "864 8 2 CA. 
2343 69.5500 NaN S \n", "867 1 0 SC/PARIS 2149 13.8583 NaN C \n", "875 1 0 P/PP 3381 24.0000 NaN C \n", "\n", "[192 rows x 11 columns]" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('SibSp > Parch')" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(113, 11)" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "young = 18\n", "df.query('Age < @young').shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operations on rows and columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Iterating over a data frame means iterating over its column names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "for column_name in df:\n", "    print(column_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col_name, series in df.items():\n", "    print(col_name, series)\n", "    break" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "for idx, row in df.iterrows():\n", "    print(idx, '\\n', row)\n", "    break" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def bmi_level(bmi):\n", "    if bmi <= 18.5:\n", "        level = 'underweight'\n", "    elif 
bmi < 25:\n", "        level = 'normal'\n", "    elif bmi < 30:\n", "        level = 'overweight'\n", "    else:\n", "        level = 'obese'\n", "    return level\n", "\n", "s = df['male_BMI'].map(bmi_level)\n", " \n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def bmi_level(row_data):\n", "    bmi = row_data['male_BMI']\n", "    if bmi <= 18.5:\n", "        return 'underweight'\n", "    elif bmi < 25:\n", "        return 'normal'\n", "    elif bmi < 30:\n", "        return 'overweight'\n", "    return 'obese'\n", "\n", "df.apply(bmi_level, axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.transpose()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping (`groupby`)\n", "\n", "We often need to split the data by the values in a given column and then compute an aggregate within each group. The `groupby` method does this." ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./nba.csv')\n", "\n", "#df.sample(5)" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameTeamNumberPositionAgeHeightWeightCollegeSalary
0Avery BradleyBoston Celtics0.0PG25.06-2180.0Texas7730337.0
1Jae CrowderBoston Celtics99.0SF25.06-6235.0Marquette6796117.0
2John HollandBoston Celtics30.0SG27.06-5205.0Boston UniversityNaN
3R.J. HunterBoston Celtics28.0SG22.06-5185.0Georgia State1148640.0
4Jonas JerebkoBoston Celtics8.0PF29.06-10231.0NaN5000000.0
..............................
453Shelvin MackUtah Jazz8.0PG26.06-3203.0Butler2433333.0
454Raul NetoUtah Jazz25.0PG24.06-1179.0NaN900000.0
455Tibor PleissUtah Jazz21.0C26.07-3256.0NaN2900000.0
456Jeff WitheyUtah Jazz24.0C26.07-0231.0Kansas947276.0
457NaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "

458 rows × 9 columns

\n", "
" ], "text/plain": [ " Name Team Number Position Age Height Weight \\\n", "0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 \n", "1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 \n", "2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 \n", "3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 \n", "4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 \n", ".. ... ... ... ... ... ... ... \n", "453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 \n", "454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 \n", "455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 \n", "456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 \n", "457 NaN NaN NaN NaN NaN NaN NaN \n", "\n", " College Salary \n", "0 Texas 7730337.0 \n", "1 Marquette 6796117.0 \n", "2 Boston University NaN \n", "3 Georgia State 1148640.0 \n", "4 NaN 5000000.0 \n", ".. ... ... \n", "453 Butler 2433333.0 \n", "454 NaN 900000.0 \n", "455 NaN 2900000.0 \n", "456 Kansas 947276.0 \n", "457 NaN NaN \n", "\n", "[458 rows x 9 columns]" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "_Przykład_: chcemy obliczyć średnią wypłatę dla każdej z drużyn." ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamSalary
0Boston Celtics7730337.0
1Boston Celtics6796117.0
2Boston CelticsNaN
3Boston Celtics1148640.0
4Boston Celtics5000000.0
.........
453Utah Jazz2433333.0
454Utah Jazz900000.0
455Utah Jazz2900000.0
456Utah Jazz947276.0
457NaNNaN
\n", "

458 rows × 2 columns

\n", "
" ], "text/plain": [ " Team Salary\n", "0 Boston Celtics 7730337.0\n", "1 Boston Celtics 6796117.0\n", "2 Boston Celtics NaN\n", "3 Boston Celtics 1148640.0\n", "4 Boston Celtics 5000000.0\n", ".. ... ...\n", "453 Utah Jazz 2433333.0\n", "454 Utah Jazz 900000.0\n", "455 Utah Jazz 2900000.0\n", "456 Utah Jazz 947276.0\n", "457 NaN NaN\n", "\n", "[458 rows x 2 columns]" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Team', 'Salary']]" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Salary
Team
Atlanta Hawks2854940.0
Boston Celtics3021242.5
Brooklyn Nets1335480.0
Charlotte Hornets4204200.0
Chicago Bulls2380440.0
\n", "
" ], "text/plain": [ " Salary\n", "Team \n", "Atlanta Hawks 2854940.0\n", "Boston Celtics 3021242.5\n", "Brooklyn Nets 1335480.0\n", "Charlotte Hornets 4204200.0\n", "Chicago Bulls 2380440.0" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Team', 'Salary']].groupby('Team').median().h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Możemy też podać listę nazw kolumn. Wtedy wartości zostaną obliczone dla każdej z wytworzonych grup:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.groupby(['Team', 'Position'])['Salary'].mean()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " * `sum()`\n", " * `min()`\n", " * `max()`\n", " * `mean()`\n", " * `size()`\n", " * `describe()`\n", " * `first()`\n", " * `last()`\n", " * `count()`\n", " * `std()`\n", " * `var()`\n", " * `sem()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df[['Position', 'Salary']].groupby('Position').agg(['mean', 'std', 'count'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "def group_range(x):\n", " return x.max() - x.min()\n", "\n", "df[['Position', 'Salary']].groupby('Position').apply(group_range)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "gb = df.groupby(['Position'])\n", "\n", "print('Liczba grup:', gb.ngroups)\n", "print(gb.groups.keys())\n", "\n", "print(gb.get_group('C').head())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "df.Height.str.split('-').str[0].astype('Int64') * 2.56" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" 
} }, "source": [ "### Pivot\n", "The `pivot` method creates a new data frame whose index and column names are taken from the values of the original data frame.\n", "\n", "_Example_: consider the data frame below, which holds translation-quality scores for the Hausa-English language pair. The `system` column contains the system name, the `metric` column the metric name, and the `score` column the metric value. We want to present these data with the system name as the index and the metrics as columns. The `pivot` method can do this; we pass it 3 arguments:\n", " * `index`: the name of the column used to build the index;\n", " * `columns`: the name of the column whose values become the column names of the new data frame;\n", " * `values`: the name of the column that holds the data of interest." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/wmt-conference/wmt21-news-systems/main/scores/automatic-scores.tsv', sep='\\t')\n", "df = df[df.pair == 'ha-en']\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.pivot(index='system', columns='metric', values='score')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Text data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "`pandas` provides conveniences for working with text values:\n", " * they are accessed through the `str` attribute;\n", " * functions:\n", "    * formatting: `lower()`, `upper()`;\n", "    * regular expressions: `contains()`, `match()`;\n", "    * other: `split()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.Name.str.upper()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "print(df.Name.head())\n", "df.Name.str.contains('Miss|Mrs').head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.Name.str.split('\\t', expand=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.Name.str.split('\\t')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.Name.str.split('\\t').str[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.Name.str.split('\\t').str[1].str.strip().str.split(' ').str[0]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie\n", "Zestaw `nba.csv` zawiera informaję o wysokości zawodników. Oblicz wzrost każdego z zawodników w systemie metrycznym przyjmując, że stop to `30.48` cm., a cal to `2.54` cm." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 4 }