{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data Analysis in Python: `pandas`\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### `pandas`\n", "The `pandas` library is the core data-analysis tool in the Python ecosystem:\n", " * it provides two basic data types:\n", "     * `Series` (1D)\n", "     * `DataFrame` (2D)\n", " * it provides operations on these objects, such as handling missing values and combining data;\n", " * it handles data of various kinds, e.g. time series;\n", " * it is built on `numpy`, a library for numerical computing;\n", " * it also supports simple data visualization;\n", " * ETL: extract, transform, load." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "To import the `pandas` library, it is enough to write:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### __Exercise 0__: check whether you have the `pandas` library installed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (`pd.Series`)\n", "\n", " A series represents homogeneous one-dimensional data; it is the counterpart of a vector in R.\n", " * Series can be created in several ways (more on this shortly), e.g. from objects such as lists and dictionaries.\n", " * The data must be homogeneous. 
Otherwise, the values are automatically converted to a common type.\n", " * When creating a series we usually pass the data through the `data` argument (in the signature below it defaults to `None`).\n", " * We can also pass an index (`index`), a data type (`dtype`), or a name (`name`).\n", " \n", " \n", " ```\n", " class pandas.Series(data=None, index=None, dtype=None, name=None)\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When creating a series, we can pass the data as a list or as a dictionary.\n", "\n", "The example below creates a series from data stored in a list:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0 211819\n", "1 682758\n", "2 737011\n", "3 779511\n", "4 673790\n", "5 673790\n", "6 444177\n", "7 136791\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "data = [211819, 682758, 737011, 779511, 673790, 673790, 444177, 136791]\n", "\n", "s = pd.Series(data)\n", "\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When the data comes from a list and no index is given, pandas adds an automatic integer index starting at 0."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When a dictionary is passed as the data for a series, pandas uses its keys to build the index:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When creating a series, we can define both the index and the name of the series:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819.0\n", "May 682758.0\n", "June 737011.0\n", "July 779511.0\n", "Name: Rides, dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "months = ['April', 'May', 'June', 'July']\n", "\n", "data = [211819, 682758, 737011, 779511]\n", "\n", "s = pd.Series(data=data, index=months, dtype=float, name='Rides')\n", "\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "An individual element is accessed using its key from the index."
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "211819\n" ] }, { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "August 673790\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "print(s['April'])\n", "\n", "s['August'] = 673790\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "An element is added to a series by assigning to a new key:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "April 211819\n", "May 682758\n", "June 737011\n", "July 779511\n", "August 673790\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = {'April': 211819, 'May': 682758, 'June': 737011, 'July': 779511}\n", "\n", "s = pd.Series(members)\n", "\n", "s['August'] = 673790\n", "\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Indexing in series is covered in more detail later in the course." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "A key feature of a series is that operations are vectorized. For example, when two series are added:\n", " * if a key is present in both series, the corresponding values are summed;\n", " * otherwise, the value for that key in the resulting series is `NaN`.\n", " * Equivalently, we can use the `pandas.Series.add` method, which additionally lets us supply a default value (`fill_value`) for missing keys."
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "August 880599.0\n", "July 973827.0\n", "June 908505.0\n", "May 830656.0\n", "October NaN\n", "September 814282.0\n", "dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'August': 673790, 'July': 779511,\n", "'September': 673790, 'October': 444177})\n", "\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n", "'September': 140492})\n", "\n", "all_data = members + occasionals\n", "# Equivalently\n", "all_data = members.add(occasionals)\n", "all_data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Arithmetic operations can also be applied to a whole series at once:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "May 683758\n", "June 738011\n", "July 780511\n", "August 674790\n", "September 674790\n", "October 445177\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n", "'September': 673790, 'October': 444177})\n", "\n", "members += 1000\n", "\n", "members" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Summary\n", " * Series behave much like dictionaries, except that the values must be homogeneous (of the same type).\n", " * Individual elements are accessed with square brackets `[]` and a key.\n", " * Unlike dictionaries, series make arithmetic operations straightforward."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise 1\n", " * Create a series `n` containing the numbers from 0 to 10 (inclusive).\n", " * Create a series `n2` containing the squares of the numbers from 0 to 10 (inclusive).\n", " * Then create a series `trojkatne` (triangular numbers) equal to the sum of the two series above divided by 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) (`pd.DataFrame`)\n", "\n", "A data frame is the primary data structure in the `pandas` library for storing and representing tabular (two-dimensional) data.\n", " * It has columns (features) and rows (observations, examples).\n", " * It can also be viewed as a dictionary whose values are series.\n", "\n", "```\n", "class pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n", "```\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "A data frame can be created in several ways.\n", "\n", "The first one (\"column-wise\") defines the frame by passing series as columns:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Drugim popularnym sposobem jest przekazanie listy słowników. Wtedy `pandas` zinterpretuje to jako listę przykładów:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " members occasionals\n", "0 682758 147898\n", "1 737011 171494\n", "2 779511 194316" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [\n", " {'members': 682758, 'occasionals': 147898},\n", " {'occasionals': 171494,'members': 737011},\n", " {'members': 779511, 'occasionals': 194316},\n", "]\n", "\n", "df = pd.DataFrame(data)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Możemy też wykorzystać metodę `from_dict` ([doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html)), która pozwala zdefiniować czy podane dane są w podane w postaci kolumnowej lub wierszowej:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index\n", " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "\n", "columns\n", " May June July\n", "members 682758 737011 779511\n", "occasionals 147898 171494 194316\n" ] } ], "source": [ "data = {\n", " 'May': {'members': 682758, 'occasionals': 147898},\n", " 'June': {'members': 737011, 'occasionals': 171494},\n", " 'July': {'members': 779511, 'occasionals': 194316}\n", "}\n", "\n", "df = pd.DataFrame.from_dict(data, orient='index')\n", "print('index\\n', df)\n", "print()\n", "df = pd.DataFrame.from_dict(data, orient='columns')\n", "print('columns\\n', df)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wczytywanie danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Biblioteka `pandas` pozwala na wczytanie i zapis danych z różnych formatów:\n", " * formaty tekstowe, np. 
`csv`, `json`\n", " * spreadsheet files: Excel (xls, xlsx)\n", " * databases\n", " * others: `sas`, `spss`\n", "\n", "\n", "Loading data produces a corresponding data frame (`DataFrame`)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "One of the simplest data formats is `csv`, in which consecutive values are separated by commas.\n", "\n", "To load data in this format, use the `pandas.read_csv` function.\n", "\n", "Pandas lets us set many parameters (e.g. the separator or the quote character). See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Country female_BMI male_BMI gdp population \\\n", "0 Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "1 Albania 25.65726 26.44657 8644.0 2968026.0 \n", "2 Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "3 Angola 23.48431 22.25083 7103.0 19842251.0 \n", "4 Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", ".. ... ... ... ... ... \n", "170 Venezuela 28.13408 27.44500 17911.0 28116716.0 \n", "171 Vietnam 21.06500 20.91630 4085.0 86589342.0 \n", "172 Palestine 29.02643 26.57750 3564.0 3854667.0 \n", "173 Zambia 23.05436 20.68321 3039.0 13114579.0 \n", "174 Zimbabwe 24.64522 22.02660 1286.0 13495462.0 \n", "\n", " under5mortality life_expectancy fertility \n", "0 110.4 52.8 6.20 \n", "1 17.9 76.8 1.76 \n", "2 29.5 75.5 2.73 \n", "3 192.0 56.7 6.43 \n", "4 10.9 75.5 2.16 \n", ".. ... ... ... \n", "170 17.1 74.2 2.53 \n", "171 26.2 74.1 1.86 \n", "172 24.7 74.1 4.38 \n", "173 94.9 51.1 5.88 \n", "174 98.3 47.3 3.85 \n", "\n", "[175 rows x 8 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('gapminder.csv')\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38 \n", "3 Heikkinen\\t Miss. Laina female 26 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35 \n", "5 Allen\\t Mr. William Henry male 35 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', delimiter='\\t', index_col=0, nrows=5)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Do wczytania danych z arkusza kalkulacyjnego służy funkcja `pandas.read_excel`. Do otworzenia pliku `xlsx` może być koniecnze ustawienie parametru: `engine='openpyxl`. Więcej opcji w [dokumentacji](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " start_date start_station_code end_date \\\n", "0 2019-04-14 07:55:22 6001 2019-04-14 08:07:16 \n", "1 2019-04-14 07:59:31 6411 2019-04-14 08:09:18 \n", "2 2019-04-14 07:59:55 6097 2019-04-14 08:12:11 \n", "3 2019-04-14 07:59:57 6310 2019-04-14 08:27:58 \n", "4 2019-04-14 08:00:37 7029 2019-04-14 08:14:12 \n", "\n", " end_station_code duration_sec is_member \n", "0 6132 713 1 \n", "1 6411 587 1 \n", "2 6036 736 1 \n", "3 6345 1680 1 \n", "4 6250 814 0 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('./bikes.xlsx', engine='openpyxl', nrows=5)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Innym ważnym źródłem informacji są bazy danych. Pandas potrafi komunikować się z bazą danych za pomocą biblioteki [SQLAlchemy](https://pypi.org/project/SQLAlchemy/) i dostarcza odpowiedną funkcję:\n", " * `pandas.read_sql` - wczytanie całej tabeli lub zapytania do bazy danych" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Title ArtistId\n", "AlbumId \n", "1 For Those About To Rock We Salute You 1\n", "2 Balls to the Wall 2\n", "3 Restless and Wild 2\n", "4 Let There Be Rock 1\n", "5 Big Ones 3\n", "... ... ...\n", "343 Respighi:Pines of Rome 226\n", "344 Schubert: The Late String Quartets & String Qu... 272\n", "345 Monteverdi: L'Orfeo 273\n", "346 Mozart: Chamber Music 274\n", "347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n", "\n", "[347 rows x 2 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql('Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023-11-17 15:53:28,542 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1\n", "2023-11-17 15:53:28,543 INFO sqlalchemy.engine.base.Engine ()\n", "2023-11-17 15:53:28,544 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1\n", "2023-11-17 15:53:28,545 INFO sqlalchemy.engine.base.Engine ()\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Title ArtistId\n", "AlbumId \n", "1 For Those About To Rock We Salute You 1\n", "2 Balls to the Wall 2\n", "3 Restless and Wild 2\n", "4 Let There Be Rock 1\n", "5 Big Ones 3\n", "... ... ...\n", "343 Respighi:Pines of Rome 226\n", "344 Schubert: The Late String Quartets & String Qu... 272\n", "345 Monteverdi: L'Orfeo 273\n", "346 Mozart: Chamber Music 274\n", "347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n", "\n", "[347 rows x 2 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sqlalchemy\n", "\n", "engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite', echo=True)\n", "connection = engine.raw_connection()\n", "\n", "df = pd.read_sql('SELECT * FROM Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Podsumowanie\n", "\n", "\n", " * Biblioteka `pandas` wspiera pobieranie danych z różnych formatów i źródeł.\n", " * Każda funkcja ma listę argumentów, które pozwalają na ustawić poszczególne parametry (np. [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv))." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zapis i eksport danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pandas pozwala w prosty sposób na zapisywanie ramki danych do pliku. 
" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# zapis do formatu CSV\n", "df.to_csv('tmp.csv')\n", "# zapis do arkusza kalkulacyjnego \n", "df.to_excel('tmp.xlsx')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Ponadto możemy przekonwertować ramkę danych do JSONa lub Pythonowego słownika:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"members\":{\"May\":682758,\"June\":737011,\"July\":779511},\"occasionals\":{\"May\":147898,\"June\":171494,\"July\":194316}}\n" ] } ], "source": [ "print(df.to_json())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'members': {'May': 682758, 'June': 737011, 'July': 779511}, 'occasionals': {'May': 147898, 'June': 171494, 'July': 194316}}\n" ] } ], "source": [ "print(df.to_dict())\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Lub przekopiować dane do schowka:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df.to_clipboard()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie\n", "\n", "\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": 
"slide" } }, "source": [ " * Przekonwertuj tabele `Customer` z bazy `Chinook.sqlite` do arkusza kalkulacyjnego. Plik wynikowy nazwij `customers.xlsx`.\n", " * Tabela `Employee` zawiera informacje o pracownikach firmy Chinook. Wyswietl dane na ekranie i podaj miasta, w których mieszkają pracownicy.\n", " * Tabela `Invoice` zawiera informacje o fakturach. Przekonwertuj kolumnę `BillingCountry` do pythonowego słownika, a następnie podaj najcześciej występującą wartość. Ile razy pojawiła się?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Ramka danych - podstawy" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "#### Kolumny\n", "\n", "Na ramkę danych możemy patrzeć jak na swego rodzaju słownik, którego wartościami są szeregi. Pozwoli to na uzyskanie lepszej intuicji.\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " gdp population life_expectancy\n", "Country \n", "Afghanistan 1311.0 26528741.0 52.8\n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7\n", "Antigua and Barbuda 25736.0 85350.0 75.5\n", "Argentina 14646.0 40381860.0 75.4\n", "Armenia 7383.0 2975029.0 72.3\n", "Australia 41312.0 21370348.0 81.6" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=8, usecols=['Country', 'gdp', 'population','life_expectancy'])\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dostęp do poszczególnej kolumny możemy uzystać na dwa sposoby:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan 26528741.0\n", "Albania 2968026.0\n", "Algeria 34811059.0\n", "Angola 19842251.0\n", "Antigua and Barbuda 85350.0\n", "Argentina 40381860.0\n", "Armenia 2975029.0\n", "Australia 21370348.0\n", "Name: population, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# notacja z kropką\n", "df.population" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan 26528741.0\n", "Albania 2968026.0\n", "Algeria 34811059.0\n", "Angola 19842251.0\n", "Antigua and Barbuda 85350.0\n", "Argentina 40381860.0\n", "Armenia 2975029.0\n", "Australia 21370348.0\n", "Name: population, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Operator []\n", "df['population']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Do operatora `[]` możemy też podać listę nazw kolumn:" 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gdppopulation
Country
Afghanistan1311.026528741.0
Albania8644.02968026.0
Algeria12314.034811059.0
Angola7103.019842251.0
Antigua and Barbuda25736.085350.0
Argentina14646.040381860.0
Armenia7383.02975029.0
Australia41312.021370348.0
\n", "
" ], "text/plain": [ " gdp population\n", "Country \n", "Afghanistan 1311.0 26528741.0\n", "Albania 8644.0 2968026.0\n", "Algeria 12314.0 34811059.0\n", "Angola 7103.0 19842251.0\n", "Antigua and Barbuda 25736.0 85350.0\n", "Argentina 14646.0 40381860.0\n", "Armenia 7383.0 2975029.0\n", "Australia 41312.0 21370348.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['gdp','population']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Listę kolumn możemy pobrać za pomocą:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Index(['gdp', 'population', 'life_expectancy'], dtype='object')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Afghanistan1311.026528741.052.8
Albania8644.02968026.076.8
Algeria12314.034811059.075.5
Angola7103.019842251.056.7
Antigua and Barbuda25736.085350.075.5
Argentina14646.040381860.075.4
Armenia7383.02975029.072.3
Australia41312.021370348.081.6
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Afghanistan 1311.0 26528741.0 52.8\n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7\n", "Antigua and Barbuda 25736.0 85350.0 75.5\n", "Argentina 14646.0 40381860.0 75.4\n", "Armenia 7383.0 2975029.0 72.3\n", "Australia 41312.0 21370348.0 81.6" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns = ['PKB', 'Populacja', 'ODŻ']\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby odwołać się do poszczególnych wierszy należy wykorzystać metodę `loc`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PKB 14646.0\n", "Populacja 40381860.0\n", "ODŻ 75.4\n", "Name: Argentina, dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['Argentina']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Metoda `loc` również może przyjąć listę wierszy: " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Albania8644.02968026.076.8
Angola7103.019842251.056.7
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Albania 8644.0 2968026.0 76.8\n", "Angola 7103.0 19842251.0 56.7" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[['Albania', 'Angola']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Możemy również podać drugi parametr: nazwy kolumn:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacja
Country
Albania8644.02968026.0
Angola7103.019842251.0
\n", "
" ], "text/plain": [ " PKB Populacja\n", "Country \n", "Albania 8644.0 2968026.0\n", "Angola 7103.0 19842251.0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = df.loc[['Albania', 'Angola'], ['PKB', 'Populacja']]\n", "\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Albo wykorzystać tzw. _slicing_, cyzli operator `:`:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PKBPopulacjaODŻ
Country
Albania8644.02968026.076.8
Algeria12314.034811059.075.5
Angola7103.019842251.056.7
\n", "
" ], "text/plain": [ " PKB Populacja ODŻ\n", "Country \n", "Albania 8644.0 2968026.0 76.8\n", "Algeria 12314.0 34811059.0 75.5\n", "Angola 7103.0 19842251.0 56.7" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['Albania': 'Angola', 'PKB': 'ODŻ']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby odwołać się do pojedyńczej wartości możemy użyć metody `at`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "7103.0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.at['Angola', 'PKB']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Dostęp do indeksu:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda',\n", " 'Argentina', 'Armenia', 'Australia'],\n", " dtype='object', name='Country')" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Podstawowe metody `pd.Series` i `pd.DataFrame`" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n", "'September': 673790, 'October': 444177})\n", "\n", "occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n", "'September': 140492, 'October': 53596})\n", "\n", "df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `head` pozwala tworzy nową ramkę danych z pierwszymi 5 przykładami:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
May682758147898
June737011171494
July779511194316
August673790206809
September673790140492
\n", "
" ], "text/plain": [ " members occasionals\n", "May 682758 147898\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `tail` robi to samo, ale z 5 ostatnymi przykładami:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
June737011171494
July779511194316
August673790206809
September673790140492
October44417753596
\n", "
" ], "text/plain": [ " members occasionals\n", "June 737011 171494\n", "July 779511 194316\n", "August 673790 206809\n", "September 673790 140492\n", "October 444177 53596" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `sample` pozwala na stworzenie nowej ramki danych z wylosowanymi `n` przykładami:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
October44417753596
June737011171494
May682758147898
\n", "
" ], "text/plain": [ " members occasionals\n", "October 444177 53596\n", "June 737011 171494\n", "May 682758 147898" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `describe` zwraca podstawowe statystyki m.in.: liczebność, średnią, wartości skrajne: " ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
membersoccasionals
count6.0000006.000000
mean665172.833333152434.166667
std116216.04545654783.506738
min444177.00000053596.000000
25%673790.000000142343.500000
50%678274.000000159696.000000
75%723447.750000188610.500000
max779511.000000206809.000000
\n", "
" ], "text/plain": [ " members occasionals\n", "count 6.000000 6.000000\n", "mean 665172.833333 152434.166667\n", "std 116216.045456 54783.506738\n", "min 444177.000000 53596.000000\n", "25% 673790.000000 142343.500000\n", "50% 678274.000000 159696.000000\n", "75% 723447.750000 188610.500000\n", "max 779511.000000 206809.000000" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Metoda `info` zwraca informacje techniczne o kolumnach: np. typ danych:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6 entries, May to October\n", "Data columns (total 2 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 members 6 non-null int64\n", " 1 occasionals 6 non-null int64\n", "dtypes: int64(2)\n", "memory usage: 144.0+ bytes\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Podstawową informacją o ramce danych to liczba przykładów w ramce danych. 
Możemy wykorzystać to tego funkcję `len`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "6" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Natomiast atrybut `shape` zwraca nam krotkę z liczbą przykładów i liczbą kolumn:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(6, 2)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operacja arytmetyczne\n", "\n", " * `max`, `idxmax`\n", " * `min`, `idxmin`\n", " * `mean`\n", " * `count`" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "members 665172.833333\n", "occasionals 152434.166667\n", "dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Zbiór wartości i zliczanie wartości:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 3 2]\n", "3 4\n", "1 3\n", "2 3\n", "dtype: int64\n" ] } ], "source": [ "dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n", "\n", "print(dane.unique())\n", "\n", "dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n", "\n", "print(dane.value_counts())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Sprawdzanie czy brakuje danych:" ] }, { "cell_type": "code", "execution_count": 44, 
"metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", " ... \n", "887 False\n", "888 False\n", "889 True\n", "890 False\n", "891 False\n", "Name: Age, Length: 891, dtype: bool" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "df.Age.isnull()\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dodawanie i modyfikowanie danych" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
female_BMImale_BMIgdppopulationunder5mortalitylife_expectancyfertility
Country
Afghanistan21.0740220.620581311.026528741.0110.452.86.20
Albania25.6572626.446578644.02968026.017.976.81.76
Algeria26.3684124.5962012314.034811059.029.575.52.73
Angola23.4843122.250837103.019842251.0192.056.76.43
Antigua and Barbuda27.5054525.7660225736.085350.010.975.52.16
\n", "
" ], "text/plain": [ " female_BMI male_BMI gdp population \\\n", "Country \n", "Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "Albania 25.65726 26.44657 8644.0 2968026.0 \n", "Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "Angola 23.48431 22.25083 7103.0 19842251.0 \n", "Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", "\n", " under5mortality life_expectancy fertility \n", "Country \n", "Afghanistan 110.4 52.8 6.20 \n", "Albania 17.9 76.8 1.76 \n", "Algeria 29.5 75.5 2.73 \n", "Angola 192.0 56.7 6.43 \n", "Antigua and Barbuda 10.9 75.5 2.16 " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
female_BMImale_BMIgdppopulationunder5mortalitylife_expectancyfertilitycontinenttmp
Country
Afghanistan21.0740220.620581311.026528741.0110.452.86.20Asia1
Albania25.6572626.446578644.02968026.017.976.81.76Europe1
Algeria26.3684124.5962012314.034811059.029.575.52.73Africa1
Angola23.4843122.250837103.019842251.0192.056.76.43Africa1
Antigua and Barbuda27.5054525.7660225736.085350.010.975.52.16Americas1
\n", "
" ], "text/plain": [ " female_BMI male_BMI gdp population \\\n", "Country \n", "Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "Albania 25.65726 26.44657 8644.0 2968026.0 \n", "Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "Angola 23.48431 22.25083 7103.0 19842251.0 \n", "Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", "\n", " under5mortality life_expectancy fertility continent \\\n", "Country \n", "Afghanistan 110.4 52.8 6.20 Asia \n", "Albania 17.9 76.8 1.76 Europe \n", "Algeria 29.5 75.5 2.73 Africa \n", "Angola 192.0 56.7 6.43 Africa \n", "Antigua and Barbuda 10.9 75.5 2.16 Americas \n", "\n", " tmp \n", "Country \n", "Afghanistan 1 \n", "Albania 1 \n", "Algeria 1 \n", "Angola 1 \n", "Antigua and Barbuda 1 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conts = pd.Series({\n", " 'Afghanistan': 'Asia', 'Albania': 'Europe', 'Algeria':' Africa', 'Angola': 'Africa', 'Antigua and Barbuda': 'Americas'})\n", "\n", "df['continent'] = conts\n", "\n", "df['tmp'] = 1\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
female_BMImale_BMIgdppopulationunder5mortalitylife_expectancyfertilitycontinenttmp
Country
Afghanistan21.0740220.620581311.026528741.0110.452.86.20Asia1.0
Albania25.6572626.446578644.02968026.017.976.81.76Europe1.0
Algeria26.3684124.5962012314.034811059.029.575.52.73Africa1.0
Angola23.4843122.250837103.019842251.0192.056.76.43Africa1.0
Antigua and Barbuda27.5054525.7660225736.085350.010.975.52.16Americas1.0
Argentina27.4652327.5017014646.040381860.015.475.42.24NaNNaN
\n", "
" ], "text/plain": [ " female_BMI male_BMI gdp population \\\n", "Country \n", "Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "Albania 25.65726 26.44657 8644.0 2968026.0 \n", "Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "Angola 23.48431 22.25083 7103.0 19842251.0 \n", "Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", "Argentina 27.46523 27.50170 14646.0 40381860.0 \n", "\n", " under5mortality life_expectancy fertility continent \\\n", "Country \n", "Afghanistan 110.4 52.8 6.20 Asia \n", "Albania 17.9 76.8 1.76 Europe \n", "Algeria 29.5 75.5 2.73 Africa \n", "Angola 192.0 56.7 6.43 Africa \n", "Antigua and Barbuda 10.9 75.5 2.16 Americas \n", "Argentina 15.4 75.4 2.24 NaN \n", "\n", " tmp \n", "Country \n", "Afghanistan 1.0 \n", "Albania 1.0 \n", "Algeria 1.0 \n", "Angola 1.0 \n", "Antigua and Barbuda 1.0 \n", "Argentina NaN " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['Argentina'] = {\n", " 'female_BMI': 27.46523,\n", " 'male_BMI': 27.5017,\n", " 'gdp': 14646.0,\n", " 'population': 40381860.0,\n", " 'under5mortality': 15.4,\n", " 'life_expectancy': 75.4,\n", " 'fertility': 2.24\n", "}\n", "df" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
female_BMImale_BMIpopulationunder5mortalitylife_expectancyfertilitycontinenttmp
Country
Afghanistan21.0740220.6205826528741.0110.452.86.20Asia1.0
Albania25.6572626.446572968026.017.976.81.76Europe1.0
Algeria26.3684124.5962034811059.029.575.52.73Africa1.0
Angola23.4843122.2508319842251.0192.056.76.43Africa1.0
Antigua and Barbuda27.5054525.7660285350.010.975.52.16Americas1.0
Argentina27.4652327.5017040381860.015.475.42.24NaNNaN
\n", "
" ], "text/plain": [ " female_BMI male_BMI population under5mortality \\\n", "Country \n", "Afghanistan 21.07402 20.62058 26528741.0 110.4 \n", "Albania 25.65726 26.44657 2968026.0 17.9 \n", "Algeria 26.36841 24.59620 34811059.0 29.5 \n", "Angola 23.48431 22.25083 19842251.0 192.0 \n", "Antigua and Barbuda 27.50545 25.76602 85350.0 10.9 \n", "Argentina 27.46523 27.50170 40381860.0 15.4 \n", "\n", " life_expectancy fertility continent tmp \n", "Country \n", "Afghanistan 52.8 6.20 Asia 1.0 \n", "Albania 76.8 1.76 Europe 1.0 \n", "Algeria 75.5 2.73 Africa 1.0 \n", "Angola 56.7 6.43 Africa 1.0 \n", "Antigua and Barbuda 75.5 2.16 Americas 1.0 \n", "Argentina 75.4 2.24 NaN NaN " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop('gdp', axis='columns')\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Filtrowanie danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Biblioteka pandas posiada 2 sposoby na filtrowanie danych zawartych w ramce danych:\n", " * operator `[]` -- najbardziej rozpowszechniony;\n", " * metoda `query()`.\n", "Oba sposoby mają różną składnię.\n", " " ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund\\t Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
313Heikkinen\\t Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen\\t Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "3 Heikkinen\\t Miss. Laina female 26.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen\\t Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 0\n", "2 1\n", "3 1\n", "4 1\n", "5 0\n", " ..\n", "887 0\n", "888 1\n", "889 0\n", "890 1\n", "891 0\n", "Name: Survived, Length: 891, dtype: int64" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Survived']" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 False\n", "2 True\n", "3 True\n", "4 True\n", "5 False\n", " ... \n", "887 False\n", "888 True\n", "889 False\n", "890 True\n", "891 False\n", "Name: Survived, Length: 891, dtype: bool" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Survived'] == 1" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
701McCarthy\\t Mr. Timothy Jmale54.0001746351.8625E46S
1211Bonnell\\t Miss. Elizabethfemale58.00011378326.5500C103S
2411Sloper\\t Mr. William Thompsonmale28.00011378835.5000A6S
....................................
87211Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
87301Carlsson\\t Mr. Frans Olofmale33.0006955.0000B51 B53 B55S
88011Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88811Graham\\t Miss. Margaret Edithfemale19.00011205330.0000B42S
89011Behr\\t Mr. Karl Howellmale26.00011136930.0000C148C
\n", "

216 rows × 11 columns

\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "7 0 1 \n", "12 1 1 \n", "24 1 1 \n", "... ... ... \n", "872 1 1 \n", "873 0 1 \n", "880 1 1 \n", "888 1 1 \n", "890 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "7 McCarthy\\t Mr. Timothy J male 54.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "24 Sloper\\t Mr. William Thompson male 28.0 \n", "... ... ... ... \n", "872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n", "873 Carlsson\\t Mr. Frans Olof male 33.0 \n", "880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n", "888 Graham\\t Miss. Margaret Edith female 19.0 \n", "890 Behr\\t Mr. Karl Howell male 26.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "7 0 0 17463 51.8625 E46 S \n", "12 0 0 113783 26.5500 C103 S \n", "24 0 0 113788 35.5000 A6 S \n", "... ... ... ... ... ... ... \n", "872 1 1 11751 52.5542 D35 S \n", "873 0 0 695 5.0000 B51 B53 B55 S \n", "880 0 1 11767 83.1583 C50 C \n", "888 0 0 112053 30.0000 B42 S \n", "890 0 0 111369 30.0000 C148 C \n", "\n", "[216 rows x 11 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Pclass'] == 1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operatory\n", "\n", "* `&` - koniukcja (i)\n", "* `|` - alternatywa (lub)\n", "* `~` - negacja (nie)\n", "* `()` - jeżeli mamy kilka warunków to warto je uporządkować w nawiasy" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "12 1 1 \n", "32 1 1 \n", "53 1 1 \n", "... ... ... \n", "857 1 1 \n", "863 1 1 \n", "872 1 1 \n", "880 1 1 \n", "888 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n", "53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n", "... ... ... ... \n", "857 Wick\\t Mrs. George Dennick (Mary Hitchcock) female 45.0 \n", "863 Swift\\t Mrs. Frederick Joel (Margaret Welles B... female 48.0 \n", "872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n", "880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n", "888 Graham\\t Miss. Margaret Edith female 19.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "12 0 0 113783 26.5500 C103 S \n", "32 1 0 PC 17569 146.5208 B78 C \n", "53 1 0 PC 17572 76.7292 D33 C \n", "... ... ... ... ... ... ... \n", "857 1 1 36928 164.8667 NaN S \n", "863 0 0 17466 25.9292 D17 S \n", "872 1 1 11751 52.5542 D35 S \n", "880 0 1 11767 83.1583 C50 C \n", "888 0 0 112053 30.0000 B42 S \n", "\n", "[94 rows x 11 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pierwsza_klasa = df['Pclass'] == 1\n", "kobiety = df['Sex'] == 'female'\n", "\n", "df[pierwsza_klasa & kobiety]\n" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "4 1 1 \n", "8 0 3 \n", "10 1 2 \n", "... ... ... \n", "861 0 3 \n", "862 0 2 \n", "864 0 3 \n", "867 1 2 \n", "875 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "8 Palsson\\t Master. Gosta Leonard male 2.0 \n", "10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n", "... ... ... ... \n", "861 Hansen\\t Mr. Claus Peter male 41.0 \n", "862 Giles\\t Mr. Frederick Edward male 21.0 \n", "864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n", "867 Duran y More\\t Miss. Asuncion female 27.0 \n", "875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "8 3 1 349909 21.0750 NaN S \n", "10 1 0 237736 30.0708 NaN C \n", "... ... ... ... ... ... ... \n", "861 2 0 350026 14.1083 NaN S \n", "862 1 0 28134 11.5000 NaN S \n", "864 8 2 CA. 2343 69.5500 NaN S \n", "867 1 0 SC/PARIS 2149 13.8583 NaN C \n", "875 1 0 P/PP 3381 24.0000 NaN C \n", "\n", "[192 rows x 11 columns]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "df[df['SibSp'] > df['Parch']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### `pd.DataFrame.query`\n", "\n", "Innym sposobem na filtrowanie danych jest metoda `query`, która jako argument przyjmuje wyrażenie:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "7 0 1 \n", "12 1 1 \n", "24 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "7 McCarthy\\t Mr. Timothy J male 54.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "24 Sloper\\t Mr. William Thompson male 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "7 0 0 17463 51.8625 E46 S \n", "12 0 0 113783 26.5500 C103 S \n", "24 0 0 113788 35.5000 A6 S " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('Pclass == 1').head()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "4 1 1 \n", "12 1 1 \n", "32 1 1 \n", "53 1 1 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "12 Bonnell\\t Miss. Elizabeth female 58.0 \n", "32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n", "53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "12 0 0 113783 26.5500 C103 S \n", "32 1 0 PC 17569 146.5208 B78 C \n", "53 1 0 PC 17572 76.7292 D33 C " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('(Pclass == 1) and (Sex == \"female\")').head()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "4 1 1 \n", "8 0 3 \n", "10 1 2 \n", "... ... ... \n", "861 0 3 \n", "862 0 2 \n", "864 0 3 \n", "867 1 2 \n", "875 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "8 Palsson\\t Master. Gosta Leonard male 2.0 \n", "10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n", "... ... ... ... \n", "861 Hansen\\t Mr. Claus Peter male 41.0 \n", "862 Giles\\t Mr. Frederick Edward male 21.0 \n", "864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n", "867 Duran y More\\t Miss. Asuncion female 27.0 \n", "875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "4 1 0 113803 53.1000 C123 S \n", "8 3 1 349909 21.0750 NaN S \n", "10 1 0 237736 30.0708 NaN C \n", "... ... ... ... ... ... ... \n", "861 2 0 350026 14.1083 NaN S \n", "862 1 0 28134 11.5000 NaN S \n", "864 8 2 CA. 
2343 69.5500 NaN S \n", "867 1 0 SC/PARIS 2149 13.8583 NaN C \n", "875 1 0 P/PP 3381 24.0000 NaN C \n", "\n", "[192 rows x 11 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.query('SibSp > Parch')" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(113, 11)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "young = 18\n", "df.query('Age < @young').shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Operations on rows and columns" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " female_BMI male_BMI gdp population \\\n", "Country \n", "Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n", "Albania 25.65726 26.44657 8644.0 2968026.0 \n", "Algeria 26.36841 24.59620 12314.0 34811059.0 \n", "Angola 23.48431 22.25083 7103.0 19842251.0 \n", "Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n", "\n", " under5mortality life_expectancy fertility \n", "Country \n", "Afghanistan 110.4 52.8 6.20 \n", "Albania 17.9 76.8 1.76 \n", "Algeria 29.5 75.5 2.73 \n", "Angola 192.0 56.7 6.43 \n", "Antigua and Barbuda 10.9 75.5 2.16 " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n", "\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Iterowanie po ramce danych oznacza oznacza przejście po nazwach kolumn:" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "female_BMI\n", "male_BMI\n", "gdp\n", "population\n", "under5mortality\n", "life_expectancy\n", "fertility\n" ] } ], "source": [ "for column_name in df:\n", " print(column_name)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "female_BMI Country\n", "Afghanistan 21.07402\n", "Albania 25.65726\n", "Algeria 26.36841\n", "Angola 23.48431\n", "Antigua and Barbuda 27.50545\n", "Name: female_BMI, dtype: float64\n" ] } ], "source": [ "for col_name, series in df.items():\n", " print(col_name, series)\n", " break" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Afghanistan \n", " female_BMI 2.107402e+01\n", "male_BMI 2.062058e+01\n", "gdp 1.311000e+03\n", "population 2.652874e+07\n", 
"under5mortality 1.104000e+02\n", "life_expectancy 5.280000e+01\n", "fertility 6.200000e+00\n", "Name: Afghanistan, dtype: float64\n" ] } ], "source": [ "for idx, row in df.iterrows():\n", " print(idx, '\\n', row)\n", " break" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan normal\n", "Albania overweight\n", "Algeria normal\n", "Angola normal\n", "Antigua and Barbuda overweight\n", "Name: male_BMI, dtype: object" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def bmi_level(bmi):\n", " if bmi <= 18.5:\n", " level = 'underweight'\n", " elif bmi < 25:\n", " level = 'normal'\n", " elif bmi < 30:\n", " level = 'overweight'\n", " else:\n", " level = 'obese'\n", " return level\n", "\n", "s = df['male_BMI'].map(bmi_level)\n", " \n", "s" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Country\n", "Afghanistan normal\n", "Albania overweight\n", "Algeria normal\n", "Angola normal\n", "Antigua and Barbuda overweight\n", "dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def bmi_level(row_data):\n", " bmi = row_data['male_BMI']\n", " if bmi <= 18.5:\n", " return 'underweight'\n", " elif bmi < 25:\n", " return 'normal'\n", " elif bmi < 30:\n", " return 'overweight'\n", " return 'obese'\n", "\n", "df.apply(bmi_level, axis=1)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Country Afghanistan Albania Algeria Angola \\\n", "female_BMI 2.107402e+01 2.565726e+01 2.636841e+01 2.348431e+01 \n", "male_BMI 2.062058e+01 2.644657e+01 2.459620e+01 2.225083e+01 \n", "gdp 1.311000e+03 8.644000e+03 1.231400e+04 7.103000e+03 \n", "population 2.652874e+07 2.968026e+06 3.481106e+07 1.984225e+07 \n", "under5mortality 1.104000e+02 1.790000e+01 2.950000e+01 1.920000e+02 \n", "life_expectancy 5.280000e+01 7.680000e+01 7.550000e+01 5.670000e+01 \n", "fertility 6.200000e+00 1.760000e+00 2.730000e+00 6.430000e+00 \n", "\n", "Country Antigua and Barbuda \n", "female_BMI 27.50545 \n", "male_BMI 25.76602 \n", "gdp 25736.00000 \n", "population 85350.00000 \n", "under5mortality 10.90000 \n", "life_expectancy 75.50000 \n", "fertility 2.16000 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.transpose()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grupowanie (`groupby`)\n", "\n", "Często zdarza się, gdy potrzebujemy podzielić dane ze względu na wartości w zadanej kolumnie, a następnie obliczenie zebranie danych w każdej z grup. Do tego służy metody `groupby`." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name Team Number Position Age Height \\\n", "401 Tyus Jones Minnesota Timberwolves 1.0 PG 20.0 6-2 \n", "342 Gerald Green Miami Heat 14.0 SF 30.0 6-7 \n", "143 DeMarcus Cousins Sacramento Kings 15.0 C 25.0 6-11 \n", "267 P.J. Hairston Memphis Grizzlies 19.0 SF 23.0 6-6 \n", "335 Jeremy Lin Charlotte Hornets 7.0 PG 27.0 6-3 \n", "\n", " Weight College Salary \n", "401 195.0 Duke 1282080.0 \n", "342 205.0 NaN 947276.0 \n", "143 270.0 Kentucky 15851950.0 \n", "267 230.0 North Carolina 1201440.0 \n", "335 200.0 Harvard 2139000.0 " ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('./nba.csv')\n", "\n", "df.sample(5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "_Przykład_: chcemy obliczyć średnią wypłatę dla każdej z drużyn." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Salary\n", "Team \n", "Atlanta Hawks 4.860197e+06\n", "Boston Celtics 4.181505e+06\n", "Brooklyn Nets 3.501898e+06\n", "Charlotte Hornets 5.222728e+06\n", "Chicago Bulls 5.785559e+06" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Team', 'Salary']].groupby('Team').mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Możemy też podać listę nazw kolumn. Wtedy wartości zostaną obliczone dla każdej z wytworzonych grup:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Team Position\n", "Atlanta Hawks C 7.585417e+06\n", " PF 5.988067e+06\n", " PG 4.881700e+06\n", " SF 3.000000e+06\n", " SG 2.607758e+06\n", " ... \n", "Washington Wizards C 8.163476e+06\n", " PF 5.650000e+06\n", " PG 9.011208e+06\n", " SF 2.789700e+06\n", " SG 2.839248e+06\n", "Name: Salary, Length: 149, dtype: float64" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['Team', 'Position'])['Salary'].mean()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " * `sum()`\n", " * `min()`\n", " * `max()`\n", " * `mean()`\n", " * `size()`\n", " * `describe()`\n", " * `first()`\n", " * `last()`\n", " * `count()`\n", " * `std()`\n", " * `var()`\n", " * `sem()`" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Salary \n", " mean std count\n", "Position \n", "C 5.967052e+06 5.787989e+06 78\n", "PF 4.562483e+06 4.800054e+06 97\n", "PG 5.077829e+06 5.051809e+06 88\n", "SF 4.857393e+06 6.011889e+06 84\n", "SG 4.009861e+06 4.491609e+06 99" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Position', 'Salary']].groupby('Position').agg(['mean', 'std', 'count'])" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Salary\n", "Position \n", "C 22275967.0\n", "PF 22081286.0\n", "PG 21412973.0\n", "SF 24969112.0\n", "SG 19944278.0" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def group_range(x):\n", " return x.max() - x.min()\n", "\n", "df[['Position', 'Salary']].groupby('Position').apply(group_range)\n" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Liczba grup: 5\n", "dict_keys(['C', 'PF', 'PG', 'SF', 'SG'])\n", " Name Team Number Position Age Height Weight \\\n", "7 Kelly Olynyk Boston Celtics 41.0 C 25.0 7-0 238.0 \n", "10 Jared Sullinger Boston Celtics 7.0 C 24.0 6-9 260.0 \n", "14 Tyler Zeller Boston Celtics 44.0 C 26.0 7-0 253.0 \n", "23 Brook Lopez Brooklyn Nets 11.0 C 28.0 7-0 275.0 \n", "27 Henry Sims Brooklyn Nets 14.0 C 26.0 6-10 248.0 \n", "\n", " College Salary \n", "7 Gonzaga 2165160.0 \n", "10 Ohio State 2569260.0 \n", "14 North Carolina 2616975.0 \n", "23 Stanford 19689000.0 \n", "27 Georgetown 947276.0 \n" ] } ], "source": [ "gb = df.groupby(['Position'])\n", "\n", "print('Liczba grup:', gb.ngroups)\n", "print(gb.groups.keys())\n", "\n", "print(gb.get_group('C').head())" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 15.36\n", "1 15.36\n", "2 15.36\n", "3 15.36\n", "4 15.36\n", " ... \n", "453 15.36\n", "454 15.36\n", "455 17.92\n", "456 17.92\n", "457 \n", "Name: Height, Length: 458, dtype: Float64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "df.Height.str.split('-').str[0].astype('Int64') * 2.56" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Pivot\n", "Metoda `pivot` pozwala na stworzenie nowej ramki danych, gdzie indeks i nazwy kolumn są wartościami początkowej ranki danych. 
\n", "\n", "_Example_: consider the data frame below, which contains translation quality scores for the Hausa-English language pair. The `system` column holds the system name, the `metric` column holds the metric name, and the `score` column holds the metric value. We want to present the data with the system name as the index and the metrics as columns. The `pivot` method does this; it takes 3 arguments:\n", " * `index`: the name of the column used to build the index;\n", " * `columns`: the name of the column whose values become the column names of the new data frame;\n", " * `values`: the name of the column that holds the data of interest." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " pair system id is_constrained metric score\n", "1214 ha-en NiuTrans 382 True bleu-all 16.512243\n", "1215 ha-en NiuTrans 382 True chrf-all 44.724766\n", "1216 ha-en NiuTrans 382 True bleu-A 16.512243\n", "1217 ha-en NiuTrans 382 True chrf-A 44.724766\n", "1218 ha-en Facebook-AI 181 False bleu-all 20.982704\n", "1219 ha-en Facebook-AI 181 False chrf-all 48.653770\n", "1220 ha-en Facebook-AI 181 False bleu-A 20.982704\n", "1221 ha-en Facebook-AI 181 False chrf-A 48.653770\n", "1222 ha-en TRANSSION 336 False bleu-all 18.834851\n", "1223 ha-en TRANSSION 336 False chrf-all 47.238279\n", "1224 ha-en TRANSSION 336 False bleu-A 18.834851\n", "1225 ha-en TRANSSION 336 False chrf-A 47.238279\n", "1226 ha-en AMU 628 True bleu-all 14.132845\n", "1227 ha-en AMU 628 True chrf-all 41.256570\n", "1228 ha-en AMU 628 True bleu-A 14.132845\n", "1229 ha-en AMU 628 True chrf-A 41.256570\n", "1230 ha-en P3AI 715 True bleu-all 17.793617\n", "1231 ha-en P3AI 715 True chrf-all 46.307402\n", "1232 ha-en P3AI 715 True bleu-A 17.793617\n", "1233 ha-en P3AI 715 True chrf-A 46.307402\n", "1234 ha-en Online-B 1356 False bleu-all 18.655658\n", "1235 ha-en Online-B 1356 False chrf-all 46.658216\n", "1236 ha-en Online-B 1356 False bleu-A 18.655658\n", "1237 ha-en Online-B 1356 False chrf-A 46.658216\n", "1238 ha-en TWB 1335 False bleu-all 12.326443\n", "1239 ha-en TWB 1335 False chrf-all 40.282629\n", "1240 ha-en TWB 1335 False bleu-A 12.326443\n", "1241 ha-en TWB 1335 False chrf-A 40.282629\n", "1242 ha-en ZMT 553 False bleu-all 18.837023\n", "1243 ha-en ZMT 553 False chrf-all 47.231474\n", "1244 ha-en ZMT 553 False bleu-A 18.837023\n", "1245 ha-en ZMT 553 False chrf-A 47.231474\n", "1246 ha-en Manifold 437 True bleu-all 16.943915\n", "1247 ha-en Manifold 437 True chrf-all 45.638356\n", "1248 ha-en Manifold 437 True bleu-A 16.943915\n", "1249 ha-en Manifold 437 True chrf-A 45.638356\n", "1250 ha-en Online-Y 1374 False bleu-all 13.898531\n", "1251 ha-en Online-Y 1374 False 
chrf-all 44.842874\n", "1252 ha-en Online-Y 1374 False bleu-A 13.898531\n", "1253 ha-en Online-Y 1374 False chrf-A 44.842874\n", "1254 ha-en HuaweiTSC 758 True bleu-all 17.492440\n", "1255 ha-en HuaweiTSC 758 True chrf-all 46.795737\n", "1256 ha-en HuaweiTSC 758 True bleu-A 17.492440\n", "1257 ha-en HuaweiTSC 758 True chrf-A 46.795737\n", "1258 ha-en MS-EgDC 896 True bleu-all 17.133350\n", "1259 ha-en MS-EgDC 896 True chrf-all 45.266274\n", "1260 ha-en MS-EgDC 896 True bleu-A 17.133350\n", "1261 ha-en MS-EgDC 896 True chrf-A 45.266274\n", "1262 ha-en GTCOM 1298 False bleu-all 17.794272\n", "1263 ha-en GTCOM 1298 False chrf-all 46.714831\n", "1264 ha-en GTCOM 1298 False bleu-A 17.794272\n", "1265 ha-en GTCOM 1298 False chrf-A 46.714831\n", "1266 ha-en UEdin 1149 True bleu-all 14.887836\n", "1267 ha-en UEdin 1149 True chrf-all 42.247415\n", "1268 ha-en UEdin 1149 True bleu-A 14.887836\n", "1269 ha-en UEdin 1149 True chrf-A 42.247415" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/wmt-conference/wmt21-news-systems/main/scores/automatic-scores.tsv', sep='\\t')\n", "df = df[df.pair == 'ha-en']\n", "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "metric bleu-A bleu-all chrf-A chrf-all\n", "system \n", "AMU 14.132845 14.132845 41.256570 41.256570\n", "Facebook-AI 20.982704 20.982704 48.653770 48.653770\n", "GTCOM 17.794272 17.794272 46.714831 46.714831\n", "HuaweiTSC 17.492440 17.492440 46.795737 46.795737\n", "MS-EgDC 17.133350 17.133350 45.266274 45.266274\n", "Manifold 16.943915 16.943915 45.638356 45.638356\n", "NiuTrans 16.512243 16.512243 44.724766 44.724766\n", "Online-B 18.655658 18.655658 46.658216 46.658216\n", "Online-Y 13.898531 13.898531 44.842874 44.842874\n", "P3AI 17.793617 17.793617 46.307402 46.307402\n", "TRANSSION 18.834851 18.834851 47.238279 47.238279\n", "TWB 12.326443 12.326443 40.282629 40.282629\n", "UEdin 14.887836 14.887836 42.247415 42.247415\n", "ZMT 18.837023 18.837023 47.231474 47.231474" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pivot(index='system', columns='metric', values='score')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zadanie" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Dane tekstowe" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "`pandas` posiada udogodnienia do pracy z wartościami tekstowymi:\n", " * dostęp następuje przez atrybut `str`;\n", " * funkcje:\n", " * formatujące: `lower()`, `upper()`;\n", " * wyrażenia regularne: `contains()`, `match()`;\n", " * inne: `split()`" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "3 Heikkinen\\t Miss. Laina female 26.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen\\t Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 BRAUND\\t MR. OWEN HARRIS\n", "2 CUMINGS\\t MRS. JOHN BRADLEY (FLORENCE BRIGGS T...\n", "3 HEIKKINEN\\t MISS. LAINA\n", "4 FUTRELLE\\t MRS. JACQUES HEATH (LILY MAY PEEL)\n", "5 ALLEN\\t MR. WILLIAM HENRY\n", " ... \n", "887 MONTVILA\\t REV. JUOZAS\n", "888 GRAHAM\\t MISS. MARGARET EDITH\n", "889 JOHNSTON\\t MISS. CATHERINE HELEN \"CARRIE\"\n", "890 BEHR\\t MR. KARL HOWELL\n", "891 DOOLEY\\t MR. PATRICK\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name.str.upper()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PassengerId\n", "1 Braund\\t Mr. Owen Harris\n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T...\n", "3 Heikkinen\\t Miss. Laina\n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel)\n", "5 Allen\\t Mr. 
William Henry\n", "Name: Name, dtype: object\n" ] }, { "data": { "text/plain": [ "PassengerId\n", "1 False\n", "2 True\n", "3 True\n", "4 True\n", "5 False\n", "Name: Name, dtype: bool" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(df.Name.head())\n", "df.Name.str.contains('Miss|Mrs').head()" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1\n", "PassengerId \n", "1 Braund Mr. Owen Harris\n", "2 Cumings Mrs. John Bradley (Florence Briggs Thayer)\n", "3 Heikkinen Miss. Laina\n", "4 Futrelle Mrs. Jacques Heath (Lily May Peel)\n", "5 Allen Mr. William Henry\n", "... ... ...\n", "887 Montvila Rev. Juozas\n", "888 Graham Miss. Margaret Edith\n", "889 Johnston Miss. Catherine Helen \"Carrie\"\n", "890 Behr Mr. Karl Howell\n", "891 Dooley Mr. Patrick\n", "\n", "[891 rows x 2 columns]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name.str.split('\\t', expand=True)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 [Braund, Mr. Owen Harris]\n", "2 [Cumings, Mrs. John Bradley (Florence Briggs ...\n", "3 [Heikkinen, Miss. Laina]\n", "4 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]\n", "5 [Allen, Mr. William Henry]\n", " ... \n", "887 [Montvila, Rev. Juozas]\n", "888 [Graham, Miss. Margaret Edith]\n", "889 [Johnston, Miss. Catherine Helen \"Carrie\"]\n", "890 [Behr, Mr. Karl Howell]\n", "891 [Dooley, Mr. Patrick]\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name.str.split('\\t')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 Mr. Owen Harris\n", "2 Mrs. John Bradley (Florence Briggs Thayer)\n", "3 Miss. Laina\n", "4 Mrs. Jacques Heath (Lily May Peel)\n", "5 Mr. William Henry\n", " ... \n", "887 Rev. Juozas\n", "888 Miss. Margaret Edith\n", "889 Miss. Catherine Helen \"Carrie\"\n", "890 Mr. Karl Howell\n", "891 Mr. 
Patrick\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name.str.split('\\t').str[1]" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 Mr.\n", "2 Mrs.\n", "3 Miss.\n", "4 Mrs.\n", "5 Mr.\n", " ... \n", "887 Rev.\n", "888 Miss.\n", "889 Miss.\n", "890 Mr.\n", "891 Mr.\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Name.str.split('\\t').str[1].str.strip().str.split(' ').str[0]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise\n", "The `nba.csv` dataset contains each player's height. Compute every player's height in metric units, assuming that one foot is `30.48` cm and one inch is `2.54` cm." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "interpreter": { "hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }