{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data Analysis in Python: `pandas`\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `pandas`\n",
"The `pandas` library is the core data-analysis tool in the Python ecosystem:\n",
" * it provides two basic data types:\n",
"   * `Series` (one-dimensional)\n",
"   * `DataFrame` (two-dimensional)\n",
" * it offers operations on these objects, such as handling missing values and combining data;\n",
" * it supports many kinds of data, e.g. time series;\n",
" * it is built on `numpy`, a library for numerical computing;\n",
" * it also allows simple data visualization;\n",
" * it covers the ETL workflow: extract, transform, load."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"To import the `pandas` library, run:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### __Exercise 0__: check whether the `pandas` library is installed."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (`pd.Series`)\n",
"\n",
" A series represents homogeneous one-dimensional data; it is the counterpart of a vector in R.\n",
" * A series can be created in several ways (more on this shortly), e.g. from objects such as lists and dictionaries.\n",
" * The data must be homogeneous; otherwise the values are automatically converted to a common type.\n",
" * The main argument when creating a series is `data`, the values to store.\n",
" * We can also pass an index (`index`), a data type (`dtype`), or a name (`name`).\n",
" \n",
" \n",
" ```\n",
" class pandas.Series(data=None, index=None, dtype=None, name=None)\n",
" ```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When creating a series, we can pass the data as a list or a dictionary.\n",
"\n",
"The example below creates a series from data stored in a list:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 211819\n",
"1 682758\n",
"2 737011\n",
"3 779511\n",
"4 673790\n",
"5 673790\n",
"6 444177\n",
"7 136791\n",
"dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"data = [211819, 682758, 737011, 779511, 673790, 673790, 444177, 136791]\n",
"\n",
"s = pd.Series(data)\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When the data comes from a list and no index is given, pandas adds an automatic integer index starting at 0."
]
},
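{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The automatic conversion of non-homogeneous data mentioned earlier can be observed through the `dtype` attribute. The cell below is a minimal sketch; the variable names `mixed_numbers` and `mixed_types` are illustrative and not part of the original material:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Integers mixed with a float are upcast to a common numeric type.\n",
"mixed_numbers = pd.Series([1, 2, 3.5])\n",
"print(mixed_numbers.dtype)\n",
"\n",
"# Numbers mixed with a string fall back to the generic object dtype.\n",
"mixed_types = pd.Series([1, 'two', 3.0])\n",
"print(mixed_types.dtype)"
]
},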
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When a dictionary is passed as the data, pandas uses its keys to build the index:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819\n",
"May 682758\n",
"June 737011\n",
"July 779511\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When creating a series, we can specify both the index and the name of the series:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819.0\n",
"May 682758.0\n",
"June 737011.0\n",
"July 779511.0\n",
"Name: Rides, dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"months = ['April', 'May', 'June', 'July']\n",
"\n",
"data = [211819, 682758, 737011, 779511]\n",
"\n",
"s = pd.Series(data=data, index=months, dtype=float, name='Rides')\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"An individual element is accessed using its key from the index:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"211819\n"
]
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"print(s['April'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"An element is added to a series by assigning a value to a new key:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819\n",
"May 682758\n",
"June 737011\n",
"July 779511\n",
"August 673790\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"s['August'] = 673790\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Indexing in series is covered in more detail later in the course."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A key feature of a series is that operations are vectorized. When two series are added, their values are aligned by index key:\n",
" * when a key is present in both series, the corresponding values are summed;\n",
" * otherwise the value for that key in the resulting series is `NaN` (a missing value).\n",
" * Equivalently, we can use the `pandas.Series.add` method. In that case we can pass a default value (the `fill_value` parameter) to use when a key is missing."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"August 880599.0\n",
"July 973827.0\n",
"June 908505.0\n",
"May 830656.0\n",
"October NaN\n",
"September 814282.0\n",
"dtype: float64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'August': 673790, 'July': 779511,\n",
"                     'September': 673790, 'October': 444177})\n",
"\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n",
"                         'September': 140492})\n",
"\n",
"all_data = members + occasionals\n",
"# Equivalently, using the method form\n",
"all_data = members.add(occasionals)\n",
"all_data"
]
},
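{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The default value mentioned above is passed through the `fill_value` parameter of `Series.add`. The cell below is a minimal sketch on a shortened version of the data; the variable name `total` is illustrative. With `fill_value=0`, a key missing from one series is treated as 0 instead of producing `NaN`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"members = pd.Series({'May': 682758, 'June': 737011, 'October': 444177})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494})\n",
"\n",
"# 'October' is missing from occasionals, so it is treated as 0 there\n",
"# and the result keeps the members value instead of NaN.\n",
"total = members.add(occasionals, fill_value=0)\n",
"total"
]
},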
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Arithmetic operations on a series are applied to every element:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"May 683758\n",
"June 738011\n",
"July 780511\n",
"August 674790\n",
"September 674790\n",
"October 445177\n",
"dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n",
"'September': 673790, 'October': 444177})\n",
"\n",
"members += 1000\n",
"\n",
"members"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Summary\n",
" * Series work much like dictionaries, except that the values must be homogeneous (of the same type).\n",
" * Individual elements are accessed with square brackets `[]` and a key.\n",
" * Unlike dictionaries, series make arithmetic operations straightforward."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise 1\n",
" * Create a series `n` containing the numbers from 0 to 10 (inclusive).\n",
" * Create a series `n2` containing the squares of the numbers from 0 to 10 (inclusive).\n",
" * Then create a series `trojkatne` (triangular numbers) equal to the sum of the two series above divided by 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### [Data frame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) (`pd.DataFrame`)\n",
"\n",
"The data frame is the core data structure of the `pandas` library; it stores and represents tabular (two-dimensional) data.\n",
" * It has columns (features) and rows (observations, examples).\n",
" * It can also be viewed as a dictionary whose values are series.\n",
"\n",
"```\n",
"class pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n",
"```\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A data frame can be created in several ways.\n",
"\n",
"The first (\"column-wise\") approach defines the frame by passing series as its columns:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>May</th>\n",
" <td>682758</td>\n",
" <td>147898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>June</th>\n",
" <td>737011</td>\n",
" <td>171494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>July</th>\n",
" <td>779511</td>\n",
" <td>194316</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n",
"df"
]
},
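{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The view of a data frame as a dictionary of series can be checked directly: selecting a column by its name returns a `Series`. The cell below is a minimal sketch that rebuilds the frame from the previous example; the variable name `column` is illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n",
"\n",
"# Selecting a column by name returns a Series that shares the frame's index.\n",
"column = df['members']\n",
"print(type(column).__name__)\n",
"print(column['June'])"
]
},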
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The second common approach is to pass a list of dictionaries. `pandas` then interprets it as a list of examples (rows):"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>682758</td>\n",
" <td>147898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>737011</td>\n",
" <td>171494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>779511</td>\n",
" <td>194316</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"0 682758 147898\n",
"1 737011 171494\n",
"2 779511 194316"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [\n",
" {'members': 682758, 'occasionals': 147898},\n",
" {'occasionals': 171494,'members': 737011},\n",
" {'members': 779511, 'occasionals': 194316},\n",
"]\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also use the `from_dict` method ([doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html)), which lets us specify whether the data is given column-wise or row-wise:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"index\n",
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"\n",
"columns\n",
" May June July\n",
"members 682758 737011 779511\n",
"occasionals 147898 171494 194316\n"
]
}
],
"source": [
"data = {\n",
" 'May': {'members': 682758, 'occasionals': 147898},\n",
" 'June': {'members': 737011, 'occasionals': 171494},\n",
" 'July': {'members': 779511, 'occasionals': 194316}\n",
"}\n",
"\n",
"df = pd.DataFrame.from_dict(data, orient='index')\n",
"print('index\\n', df)\n",
"print()\n",
"df = pd.DataFrame.from_dict(data, orient='columns')\n",
"print('columns\\n', df)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Loading data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `pandas` library can read and write data in various formats:\n",
" * text formats, e.g. `csv`, `json`\n",
" * spreadsheet files: Excel (xls, xlsx)\n",
" * databases\n",
" * others: `sas`, `spss`\n",
"\n",
"\n",
"The result of loading data is a correspondingly constructed data frame (`DataFrame`)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"One of the simplest data formats is `csv`, in which consecutive values are separated by commas.\n",
"\n",
"To load data in this format, use the `pandas.read_csv` function.\n",
"\n",
"Pandas lets us set many parameters (e.g. the separator or the quote character). See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Afghanistan</td>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Albania</td>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Algeria</td>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Angola</td>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Antigua and Barbuda</td>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>170</th>\n",
" <td>Venezuela</td>\n",
" <td>28.13408</td>\n",
" <td>27.44500</td>\n",
" <td>17911.0</td>\n",
" <td>28116716.0</td>\n",
" <td>17.1</td>\n",
" <td>74.2</td>\n",
" <td>2.53</td>\n",
" </tr>\n",
" <tr>\n",
" <th>171</th>\n",
" <td>Vietnam</td>\n",
" <td>21.06500</td>\n",
" <td>20.91630</td>\n",
" <td>4085.0</td>\n",
" <td>86589342.0</td>\n",
" <td>26.2</td>\n",
" <td>74.1</td>\n",
" <td>1.86</td>\n",
" </tr>\n",
" <tr>\n",
" <th>172</th>\n",
" <td>Palestine</td>\n",
" <td>29.02643</td>\n",
" <td>26.57750</td>\n",
" <td>3564.0</td>\n",
" <td>3854667.0</td>\n",
" <td>24.7</td>\n",
" <td>74.1</td>\n",
" <td>4.38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>173</th>\n",
" <td>Zambia</td>\n",
" <td>23.05436</td>\n",
" <td>20.68321</td>\n",
" <td>3039.0</td>\n",
" <td>13114579.0</td>\n",
" <td>94.9</td>\n",
" <td>51.1</td>\n",
" <td>5.88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>174</th>\n",
" <td>Zimbabwe</td>\n",
" <td>24.64522</td>\n",
" <td>22.02660</td>\n",
" <td>1286.0</td>\n",
" <td>13495462.0</td>\n",
" <td>98.3</td>\n",
" <td>47.3</td>\n",
" <td>3.85</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>175 rows × 8 columns</p>\n",
"</div>"
],
"text/plain": [
" Country female_BMI male_BMI gdp population \\\n",
"0 Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"1 Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"2 Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"3 Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"4 Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
".. ... ... ... ... ... \n",
"170 Venezuela 28.13408 27.44500 17911.0 28116716.0 \n",
"171 Vietnam 21.06500 20.91630 4085.0 86589342.0 \n",
"172 Palestine 29.02643 26.57750 3564.0 3854667.0 \n",
"173 Zambia 23.05436 20.68321 3039.0 13114579.0 \n",
"174 Zimbabwe 24.64522 22.02660 1286.0 13495462.0 \n",
"\n",
" under5mortality life_expectancy fertility \n",
"0 110.4 52.8 6.20 \n",
"1 17.9 76.8 1.76 \n",
"2 29.5 75.5 2.73 \n",
"3 192.0 56.7 6.43 \n",
"4 10.9 75.5 2.16 \n",
".. ... ... ... \n",
"170 17.1 74.2 2.53 \n",
"171 26.2 74.1 1.86 \n",
"172 24.7 74.1 4.38 \n",
"173 94.9 51.1 5.88 \n",
"174 98.3 47.3 3.85 \n",
"\n",
"[175 rows x 8 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('gapminder.csv')\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund\\t Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen\\t Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen\\t Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38 \n",
"3 Heikkinen\\t Miss. Laina female 26 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35 \n",
"5 Allen\\t Mr. William Henry male 35 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', delimiter='\\t', index_col=0, nrows=5)\n",
"df"
]
},
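{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The effect of the `read_csv` parameters can be tried without any file on disk. The cell below is a minimal sketch: the semicolon-separated text in `raw` is made-up sample data (an assumption, not one of the course files), read through `io.StringIO` with a custom separator, an index column, and a row limit:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical sample data kept in memory so the example is self-contained.\n",
"raw = 'city;rides;members\\nPoznan;120;80\\nGdansk;95;60\\nKrakow;150;110\\n'\n",
"\n",
"# sep sets the separator, index_col picks the index column,\n",
"# nrows limits how many data rows are read.\n",
"df = pd.read_csv(io.StringIO(raw), sep=';', index_col='city', nrows=2)\n",
"df"
]
},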
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"To load data from a spreadsheet, use the `pandas.read_excel` function. Opening an `xlsx` file may require setting the parameter `engine='openpyxl'`. More options are described in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>start_date</th>\n",
" <th>start_station_code</th>\n",
" <th>end_date</th>\n",
" <th>end_station_code</th>\n",
" <th>duration_sec</th>\n",
" <th>is_member</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2019-04-14 07:55:22</td>\n",
" <td>6001</td>\n",
" <td>2019-04-14 08:07:16</td>\n",
" <td>6132</td>\n",
" <td>713</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2019-04-14 07:59:31</td>\n",
" <td>6411</td>\n",
" <td>2019-04-14 08:09:18</td>\n",
" <td>6411</td>\n",
" <td>587</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2019-04-14 07:59:55</td>\n",
" <td>6097</td>\n",
" <td>2019-04-14 08:12:11</td>\n",
" <td>6036</td>\n",
" <td>736</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2019-04-14 07:59:57</td>\n",
" <td>6310</td>\n",
" <td>2019-04-14 08:27:58</td>\n",
" <td>6345</td>\n",
" <td>1680</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2019-04-14 08:00:37</td>\n",
" <td>7029</td>\n",
" <td>2019-04-14 08:14:12</td>\n",
" <td>6250</td>\n",
" <td>814</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" start_date start_station_code end_date \\\n",
"0 2019-04-14 07:55:22 6001 2019-04-14 08:07:16 \n",
"1 2019-04-14 07:59:31 6411 2019-04-14 08:09:18 \n",
"2 2019-04-14 07:59:55 6097 2019-04-14 08:12:11 \n",
"3 2019-04-14 07:59:57 6310 2019-04-14 08:27:58 \n",
"4 2019-04-14 08:00:37 7029 2019-04-14 08:14:12 \n",
"\n",
" end_station_code duration_sec is_member \n",
"0 6132 713 1 \n",
"1 6411 587 1 \n",
"2 6036 736 1 \n",
"3 6345 1680 1 \n",
"4 6250 814 0 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_excel('./bikes.xlsx', engine='openpyxl', nrows=5)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Databases are another important source of data. Pandas can communicate with a database through the [SQLAlchemy](https://pypi.org/project/SQLAlchemy/) library and provides a suitable function:\n",
" * `pandas.read_sql` - loads an entire table or the result of a database query"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Title</th>\n",
" <th>ArtistId</th>\n",
" </tr>\n",
" <tr>\n",
" <th>AlbumId</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>For Those About To Rock We Salute You</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Balls to the Wall</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Restless and Wild</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Let There Be Rock</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Big Ones</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>343</th>\n",
" <td>Respighi:Pines of Rome</td>\n",
" <td>226</td>\n",
" </tr>\n",
" <tr>\n",
" <th>344</th>\n",
" <td>Schubert: The Late String Quartets &amp; String Qu...</td>\n",
" <td>272</td>\n",
" </tr>\n",
" <tr>\n",
" <th>345</th>\n",
" <td>Monteverdi: L'Orfeo</td>\n",
" <td>273</td>\n",
" </tr>\n",
" <tr>\n",
" <th>346</th>\n",
" <td>Mozart: Chamber Music</td>\n",
" <td>274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>347</th>\n",
" <td>Koyaanisqatsi (Soundtrack from the Motion Pict...</td>\n",
" <td>275</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>347 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" Title ArtistId\n",
"AlbumId \n",
"1 For Those About To Rock We Salute You 1\n",
"2 Balls to the Wall 2\n",
"3 Restless and Wild 2\n",
"4 Let There Be Rock 1\n",
"5 Big Ones 3\n",
"... ... ...\n",
"343 Respighi:Pines of Rome 226\n",
"344 Schubert: The Late String Quartets & String Qu... 272\n",
"345 Monteverdi: L'Orfeo 273\n",
"346 Mozart: Chamber Music 274\n",
"347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n",
"\n",
"[347 rows x 2 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_sql('Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Title</th>\n",
" <th>ArtistId</th>\n",
" </tr>\n",
" <tr>\n",
" <th>AlbumId</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>For Those About To Rock We Salute You</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Balls to the Wall</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Restless and Wild</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Let There Be Rock</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Big Ones</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>343</th>\n",
" <td>Respighi:Pines of Rome</td>\n",
" <td>226</td>\n",
" </tr>\n",
" <tr>\n",
" <th>344</th>\n",
" <td>Schubert: The Late String Quartets &amp; String Qu...</td>\n",
" <td>272</td>\n",
" </tr>\n",
" <tr>\n",
" <th>345</th>\n",
" <td>Monteverdi: L'Orfeo</td>\n",
" <td>273</td>\n",
" </tr>\n",
" <tr>\n",
" <th>346</th>\n",
" <td>Mozart: Chamber Music</td>\n",
" <td>274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>347</th>\n",
" <td>Koyaanisqatsi (Soundtrack from the Motion Pict...</td>\n",
" <td>275</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>347 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" Title ArtistId\n",
"AlbumId \n",
"1 For Those About To Rock We Salute You 1\n",
"2 Balls to the Wall 2\n",
"3 Restless and Wild 2\n",
"4 Let There Be Rock 1\n",
"5 Big Ones 3\n",
"... ... ...\n",
"343 Respighi:Pines of Rome 226\n",
"344 Schubert: The Late String Quartets & String Qu... 272\n",
"345 Monteverdi: L'Orfeo 273\n",
"346 Mozart: Chamber Music 274\n",
"347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n",
"\n",
"[347 rows x 2 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sqlalchemy\n",
"\n",
"# Create the engine once and pass it to pandas instead of repeating the connection string.\n",
"engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite')\n",
"\n",
"df = pd.read_sql('SELECT * FROM Album', con=engine, index_col='AlbumId')\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Summary\n",
"\n",
"\n",
" * The `pandas` library supports loading data from various formats and sources.\n",
" * Each function takes a list of arguments that let us set individual parameters (e.g. [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv))."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Saving and exporting data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Pandas makes it easy to save a data frame to a file."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# write to a CSV file\n",
"df.to_csv('tmp.csv')\n",
"# write to an Excel spreadsheet (needs an Excel writer package such as openpyxl)\n",
"df.to_excel('tmp.xlsx')"
]
},
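{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A quick way to check an export is to read it back. The sketch below is only an illustration: it uses a small hand-made frame and an in-memory buffer (`io.StringIO`) in place of `tmp.csv`, so nothing is written to disk.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Small example frame (illustrative values, same shape as the one above)\n",
"df = pd.DataFrame(\n",
"    {'members': [682758, 737011], 'occasionals': [147898, 171494]},\n",
"    index=['May', 'June'],\n",
")\n",
"\n",
"# Write the CSV text into a buffer instead of a file\n",
"buffer = io.StringIO()\n",
"df.to_csv(buffer)\n",
"\n",
"# The index was written as the first column; index_col=0 turns it back into the index\n",
"buffer.seek(0)\n",
"restored = pd.read_csv(buffer, index_col=0)\n",
"\n",
"print(restored)\n",
"```\n",
"\n",
"Without `index_col=0` the month labels would come back as an ordinary column named `Unnamed: 0`."
]
},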
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also convert a data frame to JSON or to a Python dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"members\":{\"May\":682758,\"June\":737011,\"July\":779511},\"occasionals\":{\"May\":147898,\"June\":171494,\"July\":194316}}\n"
]
}
],
"source": [
"print(df.to_json())"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'members': {'May': 682758, 'June': 737011, 'July': 779511}, 'occasionals': {'May': 147898, 'June': 171494, 'July': 194316}}\n"
]
}
],
"source": [
"print(df.to_dict())\n"
]
},
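{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The dictionary printed above is nested column-first. `to_dict` also takes an `orient` argument that changes this layout. A minimal sketch with a hand-made two-row frame (the variable names are only for illustration):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
"    {'members': [682758, 737011], 'occasionals': [147898, 171494]},\n",
"    index=['May', 'June'],\n",
")\n",
"\n",
"# Default layout: {column -> {index label -> value}}\n",
"by_column = df.to_dict()\n",
"\n",
"# orient='index': {index label -> {column -> value}}\n",
"by_row = df.to_dict(orient='index')\n",
"\n",
"# orient='records': a list with one dictionary per row; index labels are dropped\n",
"records = df.to_dict(orient='records')\n",
"\n",
"print(by_row['May'])\n",
"print(records[1])\n",
"```\n",
"\n",
"The `records` layout is often the most convenient one when the rows are passed on to code that expects a list of plain dictionaries."
]
},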
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Or copy the data to the clipboard:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.to_clipboard()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" * Convert the `Customer` table from the `Chinook.sqlite` database to a spreadsheet. Name the output file `customers.xlsx`.\n",
" * The `Employee` table contains information about the employees of the Chinook company. Display the data and list the cities where the employees live.\n",
" * The `Invoice` table contains invoice information. Convert the `BillingCountry` column to a Python dictionary, then find the most frequent value. How many times does it occur?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data frames: the basics"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"#### Columns\n",
"\n",
"A data frame can be viewed as a kind of dictionary whose values are series, one per column. This picture helps build intuition for how column access works."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>life_expectancy</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>52.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>76.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>75.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>56.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>75.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>14646.0</td>\n",
" <td>40381860.0</td>\n",
" <td>75.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Armenia</th>\n",
" <td>7383.0</td>\n",
" <td>2975029.0</td>\n",
" <td>72.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Australia</th>\n",
" <td>41312.0</td>\n",
" <td>21370348.0</td>\n",
" <td>81.6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" gdp population life_expectancy\n",
"Country \n",
"Afghanistan 1311.0 26528741.0 52.8\n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7\n",
"Antigua and Barbuda 25736.0 85350.0 75.5\n",
"Argentina 14646.0 40381860.0 75.4\n",
"Armenia 7383.0 2975029.0 72.3\n",
"Australia 41312.0 21370348.0 81.6"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=8, usecols=['Country', 'gdp', 'population','life_expectancy'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A single column can be accessed in two ways:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan 26528741.0\n",
"Albania 2968026.0\n",
"Algeria 34811059.0\n",
"Angola 19842251.0\n",
"Antigua and Barbuda 85350.0\n",
"Argentina 40381860.0\n",
"Armenia 2975029.0\n",
"Australia 21370348.0\n",
"Name: population, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dot (attribute) notation\n",
"df.population"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan 26528741.0\n",
"Albania 2968026.0\n",
"Algeria 34811059.0\n",
"Angola 19842251.0\n",
"Antigua and Barbuda 85350.0\n",
"Argentina 40381860.0\n",
"Armenia 2975029.0\n",
"Australia 21370348.0\n",
"Name: population, dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# the [] operator\n",
"df['population']"
]
},
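{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The two notations are not fully interchangeable. The sketch below uses a made-up frame (its column names are chosen only to show the problem cases) to illustrate when dot notation stops working:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
"    {'population': [2968026, 19842251], 'life expectancy': [76.8, 56.7], 'count': [1, 2]},\n",
"    index=['Albania', 'Angola'],\n",
")\n",
"\n",
"# A plain name that is a valid Python identifier works with both notations\n",
"same = df.population.equals(df['population'])\n",
"\n",
"# A name containing a space can only be reached with []\n",
"life = df['life expectancy']\n",
"\n",
"# 'count' is also the name of a DataFrame method, so df.count is the method, not the column\n",
"is_method = callable(df.count)\n",
"counts = df['count']\n",
"\n",
"print(same, is_method)\n",
"```\n",
"\n",
"For this reason `[]` is the safer default in scripts, while dot notation is mostly a convenience for interactive work."
]
},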
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `[]` operator also accepts a list of column names:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>14646.0</td>\n",
" <td>40381860.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Armenia</th>\n",
" <td>7383.0</td>\n",
" <td>2975029.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Australia</th>\n",
" <td>41312.0</td>\n",
" <td>21370348.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" gdp population\n",
"Country \n",
"Afghanistan 1311.0 26528741.0\n",
"Albania 8644.0 2968026.0\n",
"Algeria 12314.0 34811059.0\n",
"Angola 7103.0 19842251.0\n",
"Antigua and Barbuda 25736.0 85350.0\n",
"Argentina 14646.0 40381860.0\n",
"Armenia 7383.0 2975029.0\n",
"Australia 41312.0 21370348.0"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['gdp','population']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The column labels are available through the `columns` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['gdp', 'population', 'life_expectancy'], dtype='object')"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PKB</th>\n",
" <th>Populacja</th>\n",
" <th>ODŻ</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>52.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>76.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>75.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>56.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>75.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>14646.0</td>\n",
" <td>40381860.0</td>\n",
" <td>75.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Armenia</th>\n",
" <td>7383.0</td>\n",
" <td>2975029.0</td>\n",
" <td>72.3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Australia</th>\n",
" <td>41312.0</td>\n",
" <td>21370348.0</td>\n",
" <td>81.6</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Afghanistan 1311.0 26528741.0 52.8\n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7\n",
"Antigua and Barbuda 25736.0 85350.0 75.5\n",
"Argentina 14646.0 40381860.0 75.4\n",
"Armenia 7383.0 2975029.0 72.3\n",
"Australia 41312.0 21370348.0 81.6"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns = ['PKB', 'Populacja', 'ODŻ']\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"To select individual rows by label, use the `loc` indexer:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PKB 14646.0\n",
"Populacja 40381860.0\n",
"ODŻ 75.4\n",
"Name: Argentina, dtype: float64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Argentina']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"`loc` also accepts a list of row labels:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PKB</th>\n",
" <th>Populacja</th>\n",
" <th>ODŻ</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>76.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>56.7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Albania 8644.0 2968026.0 76.8\n",
"Angola 7103.0 19842251.0 56.7"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[['Albania', 'Angola']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also pass a second argument, the column names:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PKB</th>\n",
" <th>Populacja</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PKB Populacja\n",
"Country \n",
"Albania 8644.0 2968026.0\n",
"Angola 7103.0 19842251.0"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.loc[['Albania', 'Angola'], ['PKB', 'Populacja']]\n",
"\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Or use _slicing_, i.e. the `:` operator. Note that label-based slices with `loc` include both endpoints:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PKB</th>\n",
" <th>Populacja</th>\n",
" <th>ODŻ</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>76.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>75.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>56.7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Albania': 'Angola', 'PKB': 'ODŻ']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"To access a single value, we can use the `at` accessor:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"7103.0"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.at['Angola', 'PKB']"
]
},
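{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"`loc` and `at` work with labels. Their position-based counterparts, `iloc` and `iat`, are not used elsewhere in this section and are shown here only for comparison. A minimal sketch on a three-row excerpt of the frame above:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(\n",
"    {'PKB': [8644.0, 12314.0, 7103.0], 'Populacja': [2968026.0, 34811059.0, 19842251.0]},\n",
"    index=['Albania', 'Algeria', 'Angola'],\n",
")\n",
"\n",
"# Label-based slice: both endpoints are included\n",
"by_label = df.loc['Albania':'Algeria']\n",
"\n",
"# Position-based slice: the end position is excluded, as with Python lists\n",
"by_position = df.iloc[0:1]\n",
"\n",
"# A single value, by label (at) and by position (iat)\n",
"value_by_label = df.at['Angola', 'PKB']\n",
"value_by_position = df.iat[2, 0]\n",
"\n",
"print(len(by_label), len(by_position))\n",
"```\n",
"\n",
"The difference in endpoint handling between `loc` and `iloc` slices is a common source of off-by-one errors."
]
},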
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Accessing the index:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda',\n",
" 'Argentina', 'Armenia', 'Australia'],\n",
" dtype='object', name='Country')"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Basic methods of `pd.Series` and `pd.DataFrame`"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>May</th>\n",
" <td>682758</td>\n",
" <td>147898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>June</th>\n",
" <td>737011</td>\n",
" <td>171494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>July</th>\n",
" <td>779511</td>\n",
" <td>194316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>August</th>\n",
" <td>673790</td>\n",
" <td>206809</td>\n",
" </tr>\n",
" <tr>\n",
" <th>September</th>\n",
" <td>673790</td>\n",
" <td>140492</td>\n",
" </tr>\n",
" <tr>\n",
" <th>October</th>\n",
" <td>444177</td>\n",
" <td>53596</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n",
"'September': 673790, 'October': 444177})\n",
"\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n",
"'September': 140492, 'October': 53596})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `head` method returns a new data frame with the first 5 rows (the default):"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>May</th>\n",
" <td>682758</td>\n",
" <td>147898</td>\n",
" </tr>\n",
" <tr>\n",
" <th>June</th>\n",
" <td>737011</td>\n",
" <td>171494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>July</th>\n",
" <td>779511</td>\n",
" <td>194316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>August</th>\n",
" <td>673790</td>\n",
" <td>206809</td>\n",
" </tr>\n",
" <tr>\n",
" <th>September</th>\n",
" <td>673790</td>\n",
" <td>140492</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tail` method does the same for the last 5 rows:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>June</th>\n",
" <td>737011</td>\n",
" <td>171494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>July</th>\n",
" <td>779511</td>\n",
" <td>194316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>August</th>\n",
" <td>673790</td>\n",
" <td>206809</td>\n",
" </tr>\n",
" <tr>\n",
" <th>September</th>\n",
" <td>673790</td>\n",
" <td>140492</td>\n",
" </tr>\n",
" <tr>\n",
" <th>October</th>\n",
" <td>444177</td>\n",
" <td>53596</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `sample` method returns a new data frame with `n` randomly selected rows:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>September</th>\n",
" <td>673790</td>\n",
" <td>140492</td>\n",
" </tr>\n",
" <tr>\n",
" <th>August</th>\n",
" <td>673790</td>\n",
" <td>206809</td>\n",
" </tr>\n",
" <tr>\n",
" <th>May</th>\n",
" <td>682758</td>\n",
" <td>147898</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"September 673790 140492\n",
"August 673790 206809\n",
"May 682758 147898"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `describe` method returns basic summary statistics, including the count, mean, standard deviation, quartiles, and extreme values:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>members</th>\n",
" <th>occasionals</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>6.000000</td>\n",
" <td>6.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>665172.833333</td>\n",
" <td>152434.166667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>116216.045456</td>\n",
" <td>54783.506738</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>444177.000000</td>\n",
" <td>53596.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>673790.000000</td>\n",
" <td>142343.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>678274.000000</td>\n",
" <td>159696.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>723447.750000</td>\n",
" <td>188610.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>779511.000000</td>\n",
" <td>206809.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" members occasionals\n",
"count 6.000000 6.000000\n",
"mean 665172.833333 152434.166667\n",
"std 116216.045456 54783.506738\n",
"min 444177.000000 53596.000000\n",
"25% 673790.000000 142343.500000\n",
"50% 678274.000000 159696.000000\n",
"75% 723447.750000 188610.500000\n",
"max 779511.000000 206809.000000"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `info` method prints technical information about the columns, such as their data types:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 6 entries, May to October\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 members 6 non-null int64\n",
" 1 occasionals 6 non-null int64\n",
"dtypes: int64(2)\n",
"memory usage: 144.0+ bytes\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A basic piece of information about a data frame is its number of rows, which we can get with the built-in `len` function:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `shape` attribute, in turn, returns a tuple with the number of rows and the number of columns:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(6, 2)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Arithmetic operations\n",
"\n",
" * `max`, `idxmax`\n",
" * `min`, `idxmin`\n",
" * `mean`\n",
" * `count`"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"members 665172.833333\n",
"occasionals 152434.166667\n",
"dtype: float64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean()"
]
},
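{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The remaining operations from the list work the same way as `mean`. A short sketch on a single series (a subset of the `members` data, chosen only for illustration):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'October': 444177})\n",
"\n",
"largest = members.max()            # the highest value\n",
"largest_month = members.idxmax()   # the index label at which it occurs\n",
"smallest = members.min()           # the lowest value\n",
"smallest_month = members.idxmin()  # the index label at which it occurs\n",
"n_values = members.count()         # the number of non-missing values\n",
"\n",
"print(largest_month, smallest_month, n_values)\n",
"```\n",
"\n",
"Called on a data frame instead of a series, each of these returns one result per column, as `df.mean()` does above."
]
},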
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Unique values and value counts:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 3 2]\n",
"3 4\n",
"1 3\n",
"2 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n",
"\n",
"# distinct values, in order of first appearance\n",
"print(dane.unique())\n",
"\n",
"# how many times each value occurs, most frequent first\n",
"print(dane.value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Checking for missing data:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 False\n",
" ... \n",
"887 False\n",
"888 False\n",
"889 True\n",
"890 False\n",
"891 False\n",
"Name: Age, Length: 891, dtype: bool"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"df.Age.isnull()\n"
]
},
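{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The boolean mask is rarely the end goal; usually we want to know how much data is missing. A minimal sketch, assuming a small made-up `Age` series rather than the Titanic file:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"age = pd.Series([22.0, None, 26.0, None, 35.0], name='Age')\n",
"\n",
"missing_mask = age.isnull()          # True where the value is missing\n",
"n_missing = missing_mask.sum()       # True counts as 1, so the sum is the number of gaps\n",
"share_missing = missing_mask.mean()  # the fraction of missing values\n",
"known = age.dropna()                 # a copy without the missing entries\n",
"\n",
"print(n_missing, share_missing, len(known))\n",
"```\n",
"\n",
"The same calls work on a whole data frame, where `isnull().sum()` gives the number of missing values in each column."
]
},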
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Adding and modifying data"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI gdp population \\\n",
"Country \n",
"Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
"\n",
" under5mortality life_expectancy fertility \n",
"Country \n",
"Afghanistan 110.4 52.8 6.20 \n",
"Albania 17.9 76.8 1.76 \n",
"Algeria 29.5 75.5 2.73 \n",
"Angola 192.0 56.7 6.43 \n",
"Antigua and Barbuda 10.9 75.5 2.16 "
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" <th>continent</th>\n",
" <th>tmp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" <td>Asia</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" <td>Europe</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" <td>Africa</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" <td>Africa</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" <td>Americas</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI gdp population \\\n",
"Country \n",
"Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
"\n",
" under5mortality life_expectancy fertility continent \\\n",
"Country \n",
"Afghanistan 110.4 52.8 6.20 Asia \n",
"Albania 17.9 76.8 1.76 Europe \n",
"Algeria 29.5 75.5 2.73 Africa \n",
"Angola 192.0 56.7 6.43 Africa \n",
"Antigua and Barbuda 10.9 75.5 2.16 Americas \n",
"\n",
" tmp \n",
"Country \n",
"Afghanistan 1 \n",
"Albania 1 \n",
"Algeria 1 \n",
"Angola 1 \n",
"Antigua and Barbuda 1 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conts = pd.Series({\n",
" 'Afghanistan': 'Asia', 'Albania': 'Europe', 'Algeria':' Africa', 'Angola': 'Africa', 'Antigua and Barbuda': 'Americas'})\n",
"\n",
"df['continent'] = conts\n",
"\n",
"df['tmp'] = 1\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" <th>continent</th>\n",
" <th>tmp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" <td>Asia</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" <td>Europe</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" <td>Africa</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" <td>Africa</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" <td>Americas</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>27.46523</td>\n",
" <td>27.50170</td>\n",
" <td>14646.0</td>\n",
" <td>40381860.0</td>\n",
" <td>15.4</td>\n",
" <td>75.4</td>\n",
" <td>2.24</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI gdp population \\\n",
"Country \n",
"Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
"Argentina 27.46523 27.50170 14646.0 40381860.0 \n",
"\n",
" under5mortality life_expectancy fertility continent \\\n",
"Country \n",
"Afghanistan 110.4 52.8 6.20 Asia \n",
"Albania 17.9 76.8 1.76 Europe \n",
"Algeria 29.5 75.5 2.73 Africa \n",
"Angola 192.0 56.7 6.43 Africa \n",
"Antigua and Barbuda 10.9 75.5 2.16 Americas \n",
"Argentina 15.4 75.4 2.24 NaN \n",
"\n",
" tmp \n",
"Country \n",
"Afghanistan 1.0 \n",
"Albania 1.0 \n",
"Algeria 1.0 \n",
"Angola 1.0 \n",
"Antigua and Barbuda 1.0 \n",
"Argentina NaN "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Argentina'] = {\n",
" 'female_BMI': 27.46523,\n",
" 'male_BMI': 27.5017,\n",
" 'gdp': 14646.0,\n",
" 'population': 40381860.0,\n",
" 'under5mortality': 15.4,\n",
" 'life_expectancy': 75.4,\n",
" 'fertility': 2.24\n",
"}\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" <th>continent</th>\n",
" <th>tmp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" <td>Asia</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" <td>Europe</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" <td>Africa</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" <td>Africa</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" <td>Americas</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Argentina</th>\n",
" <td>27.46523</td>\n",
" <td>27.50170</td>\n",
" <td>40381860.0</td>\n",
" <td>15.4</td>\n",
" <td>75.4</td>\n",
" <td>2.24</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI population under5mortality \\\n",
"Country \n",
"Afghanistan 21.07402 20.62058 26528741.0 110.4 \n",
"Albania 25.65726 26.44657 2968026.0 17.9 \n",
"Algeria 26.36841 24.59620 34811059.0 29.5 \n",
"Angola 23.48431 22.25083 19842251.0 192.0 \n",
"Antigua and Barbuda 27.50545 25.76602 85350.0 10.9 \n",
"Argentina 27.46523 27.50170 40381860.0 15.4 \n",
"\n",
" life_expectancy fertility continent tmp \n",
"Country \n",
"Afghanistan 52.8 6.20 Asia 1.0 \n",
"Albania 76.8 1.76 Europe 1.0 \n",
"Algeria 75.5 2.73 Africa 1.0 \n",
"Angola 56.7 6.43 Africa 1.0 \n",
"Antigua and Barbuda 75.5 2.16 Americas 1.0 \n",
"Argentina 75.4 2.24 NaN NaN "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.drop('gdp', axis='columns')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Filtrowanie danych"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Biblioteka pandas posiada 2 sposoby na filtrowanie danych zawartych w ramce danych:\n",
" * operator `[]` -- najbardziej rozpowszechniony;\n",
" * metoda `query()`.\n",
"Oba sposoby mają różną składnię.\n",
" "
]
},
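  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "As a minimal sketch of how the two syntaxes differ, the snippet below applies the same condition with both approaches. The small `demo` frame and its values are made up for illustration and are not part of the course data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, used only to compare the two syntaxes\n",
    "demo = pd.DataFrame({'Pclass': [1, 3, 1], 'Age': [38.0, 22.0, 54.0]})\n",
    "\n",
    "# Boolean mask inside []\n",
    "by_mask = demo[demo['Pclass'] == 1]\n",
    "\n",
    "# The same condition written as a string expression for query()\n",
    "by_query = demo.query('Pclass == 1')\n",
    "\n",
    "print(by_mask.equals(by_query))\n",
    "```\n",
    "\n",
    "Both calls select the same rows; the choice between them is mostly a matter of readability."
   ]
  },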
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund\\t Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen\\t Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen\\t Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"3 Heikkinen\\t Miss. Laina female 26.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen\\t Mr. William Henry male 35.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 0\n",
"2 1\n",
"3 1\n",
"4 1\n",
"5 0\n",
" ..\n",
"887 0\n",
"888 1\n",
"889 0\n",
"890 1\n",
"891 0\n",
"Name: Survived, Length: 891, dtype: int64"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Survived']"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 False\n",
"2 True\n",
"3 True\n",
"4 True\n",
"5 False\n",
" ... \n",
"887 False\n",
"888 True\n",
"889 False\n",
"890 True\n",
"891 False\n",
"Name: Survived, Length: 891, dtype: bool"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Survived'] == 1"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy\\t Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Bonnell\\t Miss. Elizabeth</td>\n",
" <td>female</td>\n",
" <td>58.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113783</td>\n",
" <td>26.5500</td>\n",
" <td>C103</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Sloper\\t Mr. William Thompson</td>\n",
" <td>male</td>\n",
" <td>28.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113788</td>\n",
" <td>35.5000</td>\n",
" <td>A6</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>872</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
" <td>female</td>\n",
" <td>47.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11751</td>\n",
" <td>52.5542</td>\n",
" <td>D35</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>873</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>Carlsson\\t Mr. Frans Olof</td>\n",
" <td>male</td>\n",
" <td>33.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>695</td>\n",
" <td>5.0000</td>\n",
" <td>B51 B53 B55</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
" <td>female</td>\n",
" <td>56.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11767</td>\n",
" <td>83.1583</td>\n",
" <td>C50</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham\\t Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr\\t Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>216 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"7 0 1 \n",
"12 1 1 \n",
"24 1 1 \n",
"... ... ... \n",
"872 1 1 \n",
"873 0 1 \n",
"880 1 1 \n",
"888 1 1 \n",
"890 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"7 McCarthy\\t Mr. Timothy J male 54.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"24 Sloper\\t Mr. William Thompson male 28.0 \n",
"... ... ... ... \n",
"872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n",
"873 Carlsson\\t Mr. Frans Olof male 33.0 \n",
"880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n",
"888 Graham\\t Miss. Margaret Edith female 19.0 \n",
"890 Behr\\t Mr. Karl Howell male 26.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"7 0 0 17463 51.8625 E46 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"24 0 0 113788 35.5000 A6 S \n",
"... ... ... ... ... ... ... \n",
"872 1 1 11751 52.5542 D35 S \n",
"873 0 0 695 5.0000 B51 B53 B55 S \n",
"880 0 1 11767 83.1583 C50 C \n",
"888 0 0 112053 30.0000 B42 S \n",
"890 0 0 111369 30.0000 C148 C \n",
"\n",
"[216 rows x 11 columns]"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['Pclass'] == 1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Operatory\n",
"\n",
"* `&` - koniukcja (i)\n",
"* `|` - alternatywa (lub)\n",
"* `~` - negacja (nie)\n",
"* `()` - jeżeli mamy kilka warunków to warto je uporządkować w nawiasy"
]
},
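  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The operators listed above can be sketched on a tiny example. The `demo` frame below is hypothetical and only borrows two column names from the Titanic data; it is meant as an illustration, not as part of the analysis.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, not the Titanic file used in the notebook\n",
    "demo = pd.DataFrame({\n",
    "    'Pclass': [1, 2, 3, 3],\n",
    "    'Sex': ['female', 'male', 'female', 'male'],\n",
    "})\n",
    "\n",
    "# | keeps rows that satisfy at least one of the conditions\n",
    "first_or_women = demo[(demo['Pclass'] == 1) | (demo['Sex'] == 'female')]\n",
    "\n",
    "# ~ inverts a mask: everyone who is not in third class\n",
    "not_third = demo[~(demo['Pclass'] == 3)]\n",
    "\n",
    "print(len(first_or_women), len(not_third))\n",
    "```\n",
    "\n",
    "Without the parentheses, `demo['Pclass'] == 1 | demo['Sex'] == 'female'` would be parsed as `demo['Pclass'] == (1 | demo['Sex']) == 'female'`, which is not the intended condition."
   ]
  },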
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Bonnell\\t Miss. Elizabeth</td>\n",
" <td>female</td>\n",
" <td>58.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113783</td>\n",
" <td>26.5500</td>\n",
" <td>C103</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Spencer\\t Mrs. William Augustus (Marie Eugenie)</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17569</td>\n",
" <td>146.5208</td>\n",
" <td>B78</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Harper\\t Mrs. Henry Sleeper (Myna Haxtun)</td>\n",
" <td>female</td>\n",
" <td>49.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17572</td>\n",
" <td>76.7292</td>\n",
" <td>D33</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>857</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Wick\\t Mrs. George Dennick (Mary Hitchcock)</td>\n",
" <td>female</td>\n",
" <td>45.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>36928</td>\n",
" <td>164.8667</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>863</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Swift\\t Mrs. Frederick Joel (Margaret Welles B...</td>\n",
" <td>female</td>\n",
" <td>48.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17466</td>\n",
" <td>25.9292</td>\n",
" <td>D17</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>872</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
" <td>female</td>\n",
" <td>47.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11751</td>\n",
" <td>52.5542</td>\n",
" <td>D35</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
" <td>female</td>\n",
" <td>56.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11767</td>\n",
" <td>83.1583</td>\n",
" <td>C50</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham\\t Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>94 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"12 1 1 \n",
"32 1 1 \n",
"53 1 1 \n",
"... ... ... \n",
"857 1 1 \n",
"863 1 1 \n",
"872 1 1 \n",
"880 1 1 \n",
"888 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n",
"53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n",
"... ... ... ... \n",
"857 Wick\\t Mrs. George Dennick (Mary Hitchcock) female 45.0 \n",
"863 Swift\\t Mrs. Frederick Joel (Margaret Welles B... female 48.0 \n",
"872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n",
"880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n",
"888 Graham\\t Miss. Margaret Edith female 19.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"32 1 0 PC 17569 146.5208 B78 C \n",
"53 1 0 PC 17572 76.7292 D33 C \n",
"... ... ... ... ... ... ... \n",
"857 1 1 36928 164.8667 NaN S \n",
"863 0 0 17466 25.9292 D17 S \n",
"872 1 1 11751 52.5542 D35 S \n",
"880 0 1 11767 83.1583 C50 C \n",
"888 0 0 112053 30.0000 B42 S \n",
"\n",
"[94 rows x 11 columns]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pierwsza_klasa = df['Pclass'] == 1\n",
"kobiety = df['Sex'] == 'female'\n",
"\n",
"df[pierwsza_klasa & kobiety]\n"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund\\t Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson\\t Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser\\t Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>861</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Hansen\\t Mr. Claus Peter</td>\n",
" <td>male</td>\n",
" <td>41.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>350026</td>\n",
" <td>14.1083</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>862</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Giles\\t Mr. Frederick Edward</td>\n",
" <td>male</td>\n",
" <td>21.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>28134</td>\n",
" <td>11.5000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>864</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Sage\\t Miss. Dorothy Edith \"Dolly\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>CA. 2343</td>\n",
" <td>69.5500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>867</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Duran y More\\t Miss. Asuncion</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>SC/PARIS 2149</td>\n",
" <td>13.8583</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>875</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Abelson\\t Mrs. Samuel (Hannah Wizosky)</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>P/PP 3381</td>\n",
" <td>24.0000</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>192 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"4 1 1 \n",
"8 0 3 \n",
"10 1 2 \n",
"... ... ... \n",
"861 0 3 \n",
"862 0 2 \n",
"864 0 3 \n",
"867 1 2 \n",
"875 1 2 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"8 Palsson\\t Master. Gosta Leonard male 2.0 \n",
"10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n",
"... ... ... ... \n",
"861 Hansen\\t Mr. Claus Peter male 41.0 \n",
"862 Giles\\t Mr. Frederick Edward male 21.0 \n",
"864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n",
"867 Duran y More\\t Miss. Asuncion female 27.0 \n",
"875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"8 3 1 349909 21.0750 NaN S \n",
"10 1 0 237736 30.0708 NaN C \n",
"... ... ... ... ... ... ... \n",
"861 2 0 350026 14.1083 NaN S \n",
"862 1 0 28134 11.5000 NaN S \n",
"864 8 2 CA. 2343 69.5500 NaN S \n",
"867 1 0 SC/PARIS 2149 13.8583 NaN C \n",
"875 1 0 P/PP 3381 24.0000 NaN C \n",
"\n",
"[192 rows x 11 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"df[df['SibSp'] > df['Parch']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### `pd.DataFrame.query`\n",
"\n",
"Innym sposobem na filtrowanie danych jest metoda `query`, która jako argument przyjmuje wyrażenie:"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy\\t Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Bonnell\\t Miss. Elizabeth</td>\n",
" <td>female</td>\n",
" <td>58.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113783</td>\n",
" <td>26.5500</td>\n",
" <td>C103</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Sloper\\t Mr. William Thompson</td>\n",
" <td>male</td>\n",
" <td>28.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113788</td>\n",
" <td>35.5000</td>\n",
" <td>A6</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"7 0 1 \n",
"12 1 1 \n",
"24 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"7 McCarthy\\t Mr. Timothy J male 54.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"24 Sloper\\t Mr. William Thompson male 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"7 0 0 17463 51.8625 E46 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"24 0 0 113788 35.5000 A6 S "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('Pclass == 1').head()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Bonnell\\t Miss. Elizabeth</td>\n",
" <td>female</td>\n",
" <td>58.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>113783</td>\n",
" <td>26.5500</td>\n",
" <td>C103</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Spencer\\t Mrs. William Augustus (Marie Eugenie)</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17569</td>\n",
" <td>146.5208</td>\n",
" <td>B78</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Harper\\t Mrs. Henry Sleeper (Myna Haxtun)</td>\n",
" <td>female</td>\n",
" <td>49.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17572</td>\n",
" <td>76.7292</td>\n",
" <td>D33</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"12 1 1 \n",
"32 1 1 \n",
"53 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n",
"53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"32 1 0 PC 17569 146.5208 B78 C \n",
"53 1 0 PC 17572 76.7292 D33 C "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('(Pclass == 1) and (Sex == \"female\")').head()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund\\t Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson\\t Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser\\t Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>861</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Hansen\\t Mr. Claus Peter</td>\n",
" <td>male</td>\n",
" <td>41.0</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>350026</td>\n",
" <td>14.1083</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>862</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Giles\\t Mr. Frederick Edward</td>\n",
" <td>male</td>\n",
" <td>21.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>28134</td>\n",
" <td>11.5000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>864</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Sage\\t Miss. Dorothy Edith \"Dolly\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>CA. 2343</td>\n",
" <td>69.5500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>867</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Duran y More\\t Miss. Asuncion</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>SC/PARIS 2149</td>\n",
" <td>13.8583</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>875</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Abelson\\t Mrs. Samuel (Hannah Wizosky)</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>P/PP 3381</td>\n",
" <td>24.0000</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>192 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"4 1 1 \n",
"8 0 3 \n",
"10 1 2 \n",
"... ... ... \n",
"861 0 3 \n",
"862 0 2 \n",
"864 0 3 \n",
"867 1 2 \n",
"875 1 2 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"8 Palsson\\t Master. Gosta Leonard male 2.0 \n",
"10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n",
"... ... ... ... \n",
"861 Hansen\\t Mr. Claus Peter male 41.0 \n",
"862 Giles\\t Mr. Frederick Edward male 21.0 \n",
"864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n",
"867 Duran y More\\t Miss. Asuncion female 27.0 \n",
"875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"8 3 1 349909 21.0750 NaN S \n",
"10 1 0 237736 30.0708 NaN C \n",
"... ... ... ... ... ... ... \n",
"861 2 0 350026 14.1083 NaN S \n",
"862 1 0 28134 11.5000 NaN S \n",
"864 8 2 CA. 2343 69.5500 NaN S \n",
"867 1 0 SC/PARIS 2149 13.8583 NaN C \n",
"875 1 0 P/PP 3381 24.0000 NaN C \n",
"\n",
"[192 rows x 11 columns]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('SibSp > Parch')"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(113, 11)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"young = 18\n",
"df.query('Age < @young').shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Zadanie"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Operacje na wierszach i kolumnach"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>female_BMI</th>\n",
" <th>male_BMI</th>\n",
" <th>gdp</th>\n",
" <th>population</th>\n",
" <th>under5mortality</th>\n",
" <th>life_expectancy</th>\n",
" <th>fertility</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>21.07402</td>\n",
" <td>20.62058</td>\n",
" <td>1311.0</td>\n",
" <td>26528741.0</td>\n",
" <td>110.4</td>\n",
" <td>52.8</td>\n",
" <td>6.20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>25.65726</td>\n",
" <td>26.44657</td>\n",
" <td>8644.0</td>\n",
" <td>2968026.0</td>\n",
" <td>17.9</td>\n",
" <td>76.8</td>\n",
" <td>1.76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.36841</td>\n",
" <td>24.59620</td>\n",
" <td>12314.0</td>\n",
" <td>34811059.0</td>\n",
" <td>29.5</td>\n",
" <td>75.5</td>\n",
" <td>2.73</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>23.48431</td>\n",
" <td>22.25083</td>\n",
" <td>7103.0</td>\n",
" <td>19842251.0</td>\n",
" <td>192.0</td>\n",
" <td>56.7</td>\n",
" <td>6.43</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Antigua and Barbuda</th>\n",
" <td>27.50545</td>\n",
" <td>25.76602</td>\n",
" <td>25736.0</td>\n",
" <td>85350.0</td>\n",
" <td>10.9</td>\n",
" <td>75.5</td>\n",
" <td>2.16</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" female_BMI male_BMI gdp population \\\n",
"Country \n",
"Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
"\n",
" under5mortality life_expectancy fertility \n",
"Country \n",
"Afghanistan 110.4 52.8 6.20 \n",
"Albania 17.9 76.8 1.76 \n",
"Algeria 29.5 75.5 2.73 \n",
"Angola 192.0 56.7 6.43 \n",
"Antigua and Barbuda 10.9 75.5 2.16 "
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Iterowanie po ramce danych oznacza oznacza przejście po nazwach kolumn:"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"female_BMI\n",
"male_BMI\n",
"gdp\n",
"population\n",
"under5mortality\n",
"life_expectancy\n",
"fertility\n"
]
}
],
"source": [
"for column_name in df:\n",
" print(column_name)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"female_BMI Country\n",
"Afghanistan 21.07402\n",
"Albania 25.65726\n",
"Algeria 26.36841\n",
"Angola 23.48431\n",
"Antigua and Barbuda 27.50545\n",
"Name: female_BMI, dtype: float64\n"
]
}
],
"source": [
"for col_name, series in df.items():\n",
" print(col_name, series)\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Afghanistan \n",
" female_BMI 2.107402e+01\n",
"male_BMI 2.062058e+01\n",
"gdp 1.311000e+03\n",
"population 2.652874e+07\n",
"under5mortality 1.104000e+02\n",
"life_expectancy 5.280000e+01\n",
"fertility 6.200000e+00\n",
"Name: Afghanistan, dtype: float64\n"
]
}
],
"source": [
"for idx, row in df.iterrows():\n",
" print(idx, '\\n', row)\n",
" break"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan normal\n",
"Albania overweight\n",
"Algeria normal\n",
"Angola normal\n",
"Antigua and Barbuda overweight\n",
"Name: male_BMI, dtype: object"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def bmi_level(bmi):\n",
" if bmi <= 18.5:\n",
" level = 'underweight'\n",
" elif bmi < 25:\n",
" level = 'normal'\n",
" elif bmi < 30:\n",
" level = 'overweight'\n",
" else:\n",
" level = 'obese'\n",
" return level\n",
"\n",
"s = df['male_BMI'].map(bmi_level)\n",
" \n",
"s"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan normal\n",
"Albania overweight\n",
"Algeria normal\n",
"Angola normal\n",
"Antigua and Barbuda overweight\n",
"dtype: object"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def bmi_level(row_data):\n",
" bmi = row_data['male_BMI']\n",
" if bmi <= 18.5:\n",
" return 'underweight'\n",
" elif bmi < 25:\n",
" return 'normal'\n",
" elif bmi < 30:\n",
" return 'overweight'\n",
" return 'obese'\n",
"\n",
"df.apply(bmi_level, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>Country</th>\n",
" <th>Afghanistan</th>\n",
" <th>Albania</th>\n",
" <th>Algeria</th>\n",
" <th>Angola</th>\n",
" <th>Antigua and Barbuda</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>female_BMI</th>\n",
" <td>2.107402e+01</td>\n",
" <td>2.565726e+01</td>\n",
" <td>2.636841e+01</td>\n",
" <td>2.348431e+01</td>\n",
" <td>27.50545</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male_BMI</th>\n",
" <td>2.062058e+01</td>\n",
" <td>2.644657e+01</td>\n",
" <td>2.459620e+01</td>\n",
" <td>2.225083e+01</td>\n",
" <td>25.76602</td>\n",
" </tr>\n",
" <tr>\n",
" <th>gdp</th>\n",
" <td>1.311000e+03</td>\n",
" <td>8.644000e+03</td>\n",
" <td>1.231400e+04</td>\n",
" <td>7.103000e+03</td>\n",
" <td>25736.00000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>population</th>\n",
" <td>2.652874e+07</td>\n",
" <td>2.968026e+06</td>\n",
" <td>3.481106e+07</td>\n",
" <td>1.984225e+07</td>\n",
" <td>85350.00000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>under5mortality</th>\n",
" <td>1.104000e+02</td>\n",
" <td>1.790000e+01</td>\n",
" <td>2.950000e+01</td>\n",
" <td>1.920000e+02</td>\n",
" <td>10.90000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>life_expectancy</th>\n",
" <td>5.280000e+01</td>\n",
" <td>7.680000e+01</td>\n",
" <td>7.550000e+01</td>\n",
" <td>5.670000e+01</td>\n",
" <td>75.50000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>fertility</th>\n",
" <td>6.200000e+00</td>\n",
" <td>1.760000e+00</td>\n",
" <td>2.730000e+00</td>\n",
" <td>6.430000e+00</td>\n",
" <td>2.16000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Country Afghanistan Albania Algeria Angola \\\n",
"female_BMI 2.107402e+01 2.565726e+01 2.636841e+01 2.348431e+01 \n",
"male_BMI 2.062058e+01 2.644657e+01 2.459620e+01 2.225083e+01 \n",
"gdp 1.311000e+03 8.644000e+03 1.231400e+04 7.103000e+03 \n",
"population 2.652874e+07 2.968026e+06 3.481106e+07 1.984225e+07 \n",
"under5mortality 1.104000e+02 1.790000e+01 2.950000e+01 1.920000e+02 \n",
"life_expectancy 5.280000e+01 7.680000e+01 7.550000e+01 5.670000e+01 \n",
"fertility 6.200000e+00 1.760000e+00 2.730000e+00 6.430000e+00 \n",
"\n",
"Country Antigua and Barbuda \n",
"female_BMI 27.50545 \n",
"male_BMI 25.76602 \n",
"gdp 25736.00000 \n",
"population 85350.00000 \n",
"under5mortality 10.90000 \n",
"life_expectancy 75.50000 \n",
"fertility 2.16000 "
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.transpose()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Zadanie"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grupowanie (`groupby`)\n",
"\n",
"Często zdarza się, gdy potrzebujemy podzielić dane ze względu na wartości w zadanej kolumnie, a następnie obliczenie zebranie danych w każdej z grup. Do tego służy metody `groupby`."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Team</th>\n",
" <th>Number</th>\n",
" <th>Position</th>\n",
" <th>Age</th>\n",
" <th>Height</th>\n",
" <th>Weight</th>\n",
" <th>College</th>\n",
" <th>Salary</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>202</th>\n",
" <td>Solomon Hill</td>\n",
" <td>Indiana Pacers</td>\n",
" <td>44.0</td>\n",
" <td>SF</td>\n",
" <td>25.0</td>\n",
" <td>6-7</td>\n",
" <td>225.0</td>\n",
" <td>Arizona</td>\n",
" <td>1358880.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>286</th>\n",
" <td>Tim Frazier</td>\n",
" <td>New Orleans Pelicans</td>\n",
" <td>2.0</td>\n",
" <td>PG</td>\n",
" <td>25.0</td>\n",
" <td>6-1</td>\n",
" <td>170.0</td>\n",
" <td>Penn State</td>\n",
" <td>845059.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>210</th>\n",
" <td>Joe Young</td>\n",
" <td>Indiana Pacers</td>\n",
" <td>1.0</td>\n",
" <td>PG</td>\n",
" <td>23.0</td>\n",
" <td>6-2</td>\n",
" <td>180.0</td>\n",
" <td>Oregon</td>\n",
" <td>1007026.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>420</th>\n",
" <td>Nazr Mohammed</td>\n",
" <td>Oklahoma City Thunder</td>\n",
" <td>13.0</td>\n",
" <td>C</td>\n",
" <td>38.0</td>\n",
" <td>6-10</td>\n",
" <td>250.0</td>\n",
" <td>Kentucky</td>\n",
" <td>222888.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258</th>\n",
" <td>Tony Allen</td>\n",
" <td>Memphis Grizzlies</td>\n",
" <td>9.0</td>\n",
" <td>SG</td>\n",
" <td>34.0</td>\n",
" <td>6-4</td>\n",
" <td>213.0</td>\n",
" <td>Oklahoma State</td>\n",
" <td>5158539.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Team Number Position Age Height \\\n",
"202 Solomon Hill Indiana Pacers 44.0 SF 25.0 6-7 \n",
"286 Tim Frazier New Orleans Pelicans 2.0 PG 25.0 6-1 \n",
"210 Joe Young Indiana Pacers 1.0 PG 23.0 6-2 \n",
"420 Nazr Mohammed Oklahoma City Thunder 13.0 C 38.0 6-10 \n",
"258 Tony Allen Memphis Grizzlies 9.0 SG 34.0 6-4 \n",
"\n",
" Weight College Salary \n",
"202 225.0 Arizona 1358880.0 \n",
"286 170.0 Penn State 845059.0 \n",
"210 180.0 Oregon 1007026.0 \n",
"420 250.0 Kentucky 222888.0 \n",
"258 213.0 Oklahoma State 5158539.0 "
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('./nba.csv')\n",
"\n",
"df.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"_Przykład_: chcemy obliczyć średnią wypłatę dla każdej z drużyn."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Team</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Atlanta Hawks</th>\n",
" <td>4.860197e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Boston Celtics</th>\n",
" <td>4.181505e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Brooklyn Nets</th>\n",
" <td>3.501898e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Charlotte Hornets</th>\n",
" <td>5.222728e+06</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Chicago Bulls</th>\n",
" <td>5.785559e+06</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Salary\n",
"Team \n",
"Atlanta Hawks 4.860197e+06\n",
"Boston Celtics 4.181505e+06\n",
"Brooklyn Nets 3.501898e+06\n",
"Charlotte Hornets 5.222728e+06\n",
"Chicago Bulls 5.785559e+06"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Team', 'Salary']].groupby('Team').mean().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Możemy też podać listę nazw kolumn. Wtedy wartości zostaną obliczone dla każdej z wytworzonych grup:"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Team Position\n",
"Atlanta Hawks C 7.585417e+06\n",
" PF 5.988067e+06\n",
" PG 4.881700e+06\n",
" SF 3.000000e+06\n",
" SG 2.607758e+06\n",
" ... \n",
"Washington Wizards C 8.163476e+06\n",
" PF 5.650000e+06\n",
" PG 9.011208e+06\n",
" SF 2.789700e+06\n",
" SG 2.839248e+06\n",
"Name: Salary, Length: 149, dtype: float64"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.groupby(['Team', 'Position'])['Salary'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" * `sum()`\n",
" * `min()`\n",
" * `max()`\n",
" * `mean()`\n",
" * `size()`\n",
" * `describe()`\n",
" * `first()`\n",
" * `last()`\n",
" * `count()`\n",
" * `std()`\n",
" * `var()`\n",
" * `sem()`"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe thead tr:last-of-type th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">Salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>count</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Position</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>C</th>\n",
" <td>5.967052e+06</td>\n",
" <td>5.787989e+06</td>\n",
" <td>78</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PF</th>\n",
" <td>4.562483e+06</td>\n",
" <td>4.800054e+06</td>\n",
" <td>97</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PG</th>\n",
" <td>5.077829e+06</td>\n",
" <td>5.051809e+06</td>\n",
" <td>88</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SF</th>\n",
" <td>4.857393e+06</td>\n",
" <td>6.011889e+06</td>\n",
" <td>84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SG</th>\n",
" <td>4.009861e+06</td>\n",
" <td>4.491609e+06</td>\n",
" <td>99</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Salary \n",
" mean std count\n",
"Position \n",
"C 5.967052e+06 5.787989e+06 78\n",
"PF 4.562483e+06 4.800054e+06 97\n",
"PG 5.077829e+06 5.051809e+06 88\n",
"SF 4.857393e+06 6.011889e+06 84\n",
"SG 4.009861e+06 4.491609e+06 99"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Position', 'Salary']].groupby('Position').agg(['mean', 'std', 'count'])"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Salary</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Position</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>C</th>\n",
" <td>22275967.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PF</th>\n",
" <td>22081286.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>PG</th>\n",
" <td>21412973.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SF</th>\n",
" <td>24969112.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>SG</th>\n",
" <td>19944278.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Salary\n",
"Position \n",
"C 22275967.0\n",
"PF 22081286.0\n",
"PG 21412973.0\n",
"SF 24969112.0\n",
"SG 19944278.0"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def group_range(x):\n",
" return x.max() - x.min()\n",
"\n",
"df[['Position', 'Salary']].groupby('Position').apply(group_range)\n"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Liczba grup: 5\n",
"dict_keys(['C', 'PF', 'PG', 'SF', 'SG'])\n",
" Name Team Number Position Age Height Weight \\\n",
"7 Kelly Olynyk Boston Celtics 41.0 C 25.0 7-0 238.0 \n",
"10 Jared Sullinger Boston Celtics 7.0 C 24.0 6-9 260.0 \n",
"14 Tyler Zeller Boston Celtics 44.0 C 26.0 7-0 253.0 \n",
"23 Brook Lopez Brooklyn Nets 11.0 C 28.0 7-0 275.0 \n",
"27 Henry Sims Brooklyn Nets 14.0 C 26.0 6-10 248.0 \n",
"\n",
" College Salary \n",
"7 Gonzaga 2165160.0 \n",
"10 Ohio State 2569260.0 \n",
"14 North Carolina 2616975.0 \n",
"23 Stanford 19689000.0 \n",
"27 Georgetown 947276.0 \n"
]
}
],
"source": [
"gb = df.groupby(['Position'])\n",
"\n",
"print('Liczba grup:', gb.ngroups)\n",
"print(gb.groups.keys())\n",
"\n",
"print(gb.get_group('C').head())"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 15.36\n",
"1 15.36\n",
"2 15.36\n",
"3 15.36\n",
"4 15.36\n",
" ... \n",
"453 15.36\n",
"454 15.36\n",
"455 17.92\n",
"456 17.92\n",
"457 <NA>\n",
"Name: Height, Length: 458, dtype: Float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"df.Height.str.split('-').str[0].astype('Int64') * 2.56"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Pivot\n",
"The `pivot` method builds a new DataFrame whose index and column names are taken from the values of the original DataFrame.\n",
"\n",
"_Example_: consider the DataFrame below, which holds translation quality scores for the Hausa-English language pair. The `system` column contains the system name, the `metric` column the metric name, and the `score` column the metric value. We want one row per system and one column per metric. The `pivot` method does this; it takes 3 arguments:\n",
" * `index`: the name of the column whose values become the new index;\n",
" * `columns`: the name of the column whose values become the column names of the new DataFrame;\n",
" * `values`: the name of the column that holds the data we are interested in."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>pair</th>\n",
" <th>system</th>\n",
" <th>id</th>\n",
" <th>is_constrained</th>\n",
" <th>metric</th>\n",
" <th>score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1214</th>\n",
" <td>ha-en</td>\n",
" <td>NiuTrans</td>\n",
" <td>382</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>16.512243</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1215</th>\n",
" <td>ha-en</td>\n",
" <td>NiuTrans</td>\n",
" <td>382</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>44.724766</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1216</th>\n",
" <td>ha-en</td>\n",
" <td>NiuTrans</td>\n",
" <td>382</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>16.512243</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1217</th>\n",
" <td>ha-en</td>\n",
" <td>NiuTrans</td>\n",
" <td>382</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>44.724766</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1218</th>\n",
" <td>ha-en</td>\n",
" <td>Facebook-AI</td>\n",
" <td>181</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>20.982704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1219</th>\n",
" <td>ha-en</td>\n",
" <td>Facebook-AI</td>\n",
" <td>181</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>48.653770</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1220</th>\n",
" <td>ha-en</td>\n",
" <td>Facebook-AI</td>\n",
" <td>181</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>20.982704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1221</th>\n",
" <td>ha-en</td>\n",
" <td>Facebook-AI</td>\n",
" <td>181</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>48.653770</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1222</th>\n",
" <td>ha-en</td>\n",
" <td>TRANSSION</td>\n",
" <td>336</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>18.834851</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1223</th>\n",
" <td>ha-en</td>\n",
" <td>TRANSSION</td>\n",
" <td>336</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>47.238279</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1224</th>\n",
" <td>ha-en</td>\n",
" <td>TRANSSION</td>\n",
" <td>336</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>18.834851</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1225</th>\n",
" <td>ha-en</td>\n",
" <td>TRANSSION</td>\n",
" <td>336</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>47.238279</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1226</th>\n",
" <td>ha-en</td>\n",
" <td>AMU</td>\n",
" <td>628</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>14.132845</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1227</th>\n",
" <td>ha-en</td>\n",
" <td>AMU</td>\n",
" <td>628</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>41.256570</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1228</th>\n",
" <td>ha-en</td>\n",
" <td>AMU</td>\n",
" <td>628</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>14.132845</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1229</th>\n",
" <td>ha-en</td>\n",
" <td>AMU</td>\n",
" <td>628</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>41.256570</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1230</th>\n",
" <td>ha-en</td>\n",
" <td>P3AI</td>\n",
" <td>715</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>17.793617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1231</th>\n",
" <td>ha-en</td>\n",
" <td>P3AI</td>\n",
" <td>715</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>46.307402</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1232</th>\n",
" <td>ha-en</td>\n",
" <td>P3AI</td>\n",
" <td>715</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>17.793617</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1233</th>\n",
" <td>ha-en</td>\n",
" <td>P3AI</td>\n",
" <td>715</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>46.307402</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1234</th>\n",
" <td>ha-en</td>\n",
" <td>Online-B</td>\n",
" <td>1356</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>18.655658</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1235</th>\n",
" <td>ha-en</td>\n",
" <td>Online-B</td>\n",
" <td>1356</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>46.658216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1236</th>\n",
" <td>ha-en</td>\n",
" <td>Online-B</td>\n",
" <td>1356</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>18.655658</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1237</th>\n",
" <td>ha-en</td>\n",
" <td>Online-B</td>\n",
" <td>1356</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>46.658216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1238</th>\n",
" <td>ha-en</td>\n",
" <td>TWB</td>\n",
" <td>1335</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>12.326443</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1239</th>\n",
" <td>ha-en</td>\n",
" <td>TWB</td>\n",
" <td>1335</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>40.282629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1240</th>\n",
" <td>ha-en</td>\n",
" <td>TWB</td>\n",
" <td>1335</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>12.326443</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1241</th>\n",
" <td>ha-en</td>\n",
" <td>TWB</td>\n",
" <td>1335</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>40.282629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1242</th>\n",
" <td>ha-en</td>\n",
" <td>ZMT</td>\n",
" <td>553</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>18.837023</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1243</th>\n",
" <td>ha-en</td>\n",
" <td>ZMT</td>\n",
" <td>553</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>47.231474</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1244</th>\n",
" <td>ha-en</td>\n",
" <td>ZMT</td>\n",
" <td>553</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>18.837023</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1245</th>\n",
" <td>ha-en</td>\n",
" <td>ZMT</td>\n",
" <td>553</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>47.231474</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1246</th>\n",
" <td>ha-en</td>\n",
" <td>Manifold</td>\n",
" <td>437</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>16.943915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1247</th>\n",
" <td>ha-en</td>\n",
" <td>Manifold</td>\n",
" <td>437</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>45.638356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1248</th>\n",
" <td>ha-en</td>\n",
" <td>Manifold</td>\n",
" <td>437</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>16.943915</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1249</th>\n",
" <td>ha-en</td>\n",
" <td>Manifold</td>\n",
" <td>437</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>45.638356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1250</th>\n",
" <td>ha-en</td>\n",
" <td>Online-Y</td>\n",
" <td>1374</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>13.898531</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1251</th>\n",
" <td>ha-en</td>\n",
" <td>Online-Y</td>\n",
" <td>1374</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>44.842874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1252</th>\n",
" <td>ha-en</td>\n",
" <td>Online-Y</td>\n",
" <td>1374</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>13.898531</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1253</th>\n",
" <td>ha-en</td>\n",
" <td>Online-Y</td>\n",
" <td>1374</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>44.842874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1254</th>\n",
" <td>ha-en</td>\n",
" <td>HuaweiTSC</td>\n",
" <td>758</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>17.492440</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1255</th>\n",
" <td>ha-en</td>\n",
" <td>HuaweiTSC</td>\n",
" <td>758</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>46.795737</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1256</th>\n",
" <td>ha-en</td>\n",
" <td>HuaweiTSC</td>\n",
" <td>758</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>17.492440</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1257</th>\n",
" <td>ha-en</td>\n",
" <td>HuaweiTSC</td>\n",
" <td>758</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>46.795737</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1258</th>\n",
" <td>ha-en</td>\n",
" <td>MS-EgDC</td>\n",
" <td>896</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>17.133350</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1259</th>\n",
" <td>ha-en</td>\n",
" <td>MS-EgDC</td>\n",
" <td>896</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>45.266274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1260</th>\n",
" <td>ha-en</td>\n",
" <td>MS-EgDC</td>\n",
" <td>896</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>17.133350</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1261</th>\n",
" <td>ha-en</td>\n",
" <td>MS-EgDC</td>\n",
" <td>896</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>45.266274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1262</th>\n",
" <td>ha-en</td>\n",
" <td>GTCOM</td>\n",
" <td>1298</td>\n",
" <td>False</td>\n",
" <td>bleu-all</td>\n",
" <td>17.794272</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1263</th>\n",
" <td>ha-en</td>\n",
" <td>GTCOM</td>\n",
" <td>1298</td>\n",
" <td>False</td>\n",
" <td>chrf-all</td>\n",
" <td>46.714831</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1264</th>\n",
" <td>ha-en</td>\n",
" <td>GTCOM</td>\n",
" <td>1298</td>\n",
" <td>False</td>\n",
" <td>bleu-A</td>\n",
" <td>17.794272</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1265</th>\n",
" <td>ha-en</td>\n",
" <td>GTCOM</td>\n",
" <td>1298</td>\n",
" <td>False</td>\n",
" <td>chrf-A</td>\n",
" <td>46.714831</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1266</th>\n",
" <td>ha-en</td>\n",
" <td>UEdin</td>\n",
" <td>1149</td>\n",
" <td>True</td>\n",
" <td>bleu-all</td>\n",
" <td>14.887836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1267</th>\n",
" <td>ha-en</td>\n",
" <td>UEdin</td>\n",
" <td>1149</td>\n",
" <td>True</td>\n",
" <td>chrf-all</td>\n",
" <td>42.247415</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1268</th>\n",
" <td>ha-en</td>\n",
" <td>UEdin</td>\n",
" <td>1149</td>\n",
" <td>True</td>\n",
" <td>bleu-A</td>\n",
" <td>14.887836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1269</th>\n",
" <td>ha-en</td>\n",
" <td>UEdin</td>\n",
" <td>1149</td>\n",
" <td>True</td>\n",
" <td>chrf-A</td>\n",
" <td>42.247415</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pair system id is_constrained metric score\n",
"1214 ha-en NiuTrans 382 True bleu-all 16.512243\n",
"1215 ha-en NiuTrans 382 True chrf-all 44.724766\n",
"1216 ha-en NiuTrans 382 True bleu-A 16.512243\n",
"1217 ha-en NiuTrans 382 True chrf-A 44.724766\n",
"1218 ha-en Facebook-AI 181 False bleu-all 20.982704\n",
"1219 ha-en Facebook-AI 181 False chrf-all 48.653770\n",
"1220 ha-en Facebook-AI 181 False bleu-A 20.982704\n",
"1221 ha-en Facebook-AI 181 False chrf-A 48.653770\n",
"1222 ha-en TRANSSION 336 False bleu-all 18.834851\n",
"1223 ha-en TRANSSION 336 False chrf-all 47.238279\n",
"1224 ha-en TRANSSION 336 False bleu-A 18.834851\n",
"1225 ha-en TRANSSION 336 False chrf-A 47.238279\n",
"1226 ha-en AMU 628 True bleu-all 14.132845\n",
"1227 ha-en AMU 628 True chrf-all 41.256570\n",
"1228 ha-en AMU 628 True bleu-A 14.132845\n",
"1229 ha-en AMU 628 True chrf-A 41.256570\n",
"1230 ha-en P3AI 715 True bleu-all 17.793617\n",
"1231 ha-en P3AI 715 True chrf-all 46.307402\n",
"1232 ha-en P3AI 715 True bleu-A 17.793617\n",
"1233 ha-en P3AI 715 True chrf-A 46.307402\n",
"1234 ha-en Online-B 1356 False bleu-all 18.655658\n",
"1235 ha-en Online-B 1356 False chrf-all 46.658216\n",
"1236 ha-en Online-B 1356 False bleu-A 18.655658\n",
"1237 ha-en Online-B 1356 False chrf-A 46.658216\n",
"1238 ha-en TWB 1335 False bleu-all 12.326443\n",
"1239 ha-en TWB 1335 False chrf-all 40.282629\n",
"1240 ha-en TWB 1335 False bleu-A 12.326443\n",
"1241 ha-en TWB 1335 False chrf-A 40.282629\n",
"1242 ha-en ZMT 553 False bleu-all 18.837023\n",
"1243 ha-en ZMT 553 False chrf-all 47.231474\n",
"1244 ha-en ZMT 553 False bleu-A 18.837023\n",
"1245 ha-en ZMT 553 False chrf-A 47.231474\n",
"1246 ha-en Manifold 437 True bleu-all 16.943915\n",
"1247 ha-en Manifold 437 True chrf-all 45.638356\n",
"1248 ha-en Manifold 437 True bleu-A 16.943915\n",
"1249 ha-en Manifold 437 True chrf-A 45.638356\n",
"1250 ha-en Online-Y 1374 False bleu-all 13.898531\n",
"1251 ha-en Online-Y 1374 False chrf-all 44.842874\n",
"1252 ha-en Online-Y 1374 False bleu-A 13.898531\n",
"1253 ha-en Online-Y 1374 False chrf-A 44.842874\n",
"1254 ha-en HuaweiTSC 758 True bleu-all 17.492440\n",
"1255 ha-en HuaweiTSC 758 True chrf-all 46.795737\n",
"1256 ha-en HuaweiTSC 758 True bleu-A 17.492440\n",
"1257 ha-en HuaweiTSC 758 True chrf-A 46.795737\n",
"1258 ha-en MS-EgDC 896 True bleu-all 17.133350\n",
"1259 ha-en MS-EgDC 896 True chrf-all 45.266274\n",
"1260 ha-en MS-EgDC 896 True bleu-A 17.133350\n",
"1261 ha-en MS-EgDC 896 True chrf-A 45.266274\n",
"1262 ha-en GTCOM 1298 False bleu-all 17.794272\n",
"1263 ha-en GTCOM 1298 False chrf-all 46.714831\n",
"1264 ha-en GTCOM 1298 False bleu-A 17.794272\n",
"1265 ha-en GTCOM 1298 False chrf-A 46.714831\n",
"1266 ha-en UEdin 1149 True bleu-all 14.887836\n",
"1267 ha-en UEdin 1149 True chrf-all 42.247415\n",
"1268 ha-en UEdin 1149 True bleu-A 14.887836\n",
"1269 ha-en UEdin 1149 True chrf-A 42.247415"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wmt-conference/wmt21-news-systems/main/scores/automatic-scores.tsv', sep='\\t')\n",
"df = df[df.pair == 'ha-en']\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>metric</th>\n",
" <th>bleu-A</th>\n",
" <th>bleu-all</th>\n",
" <th>chrf-A</th>\n",
" <th>chrf-all</th>\n",
" </tr>\n",
" <tr>\n",
" <th>system</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>AMU</th>\n",
" <td>14.132845</td>\n",
" <td>14.132845</td>\n",
" <td>41.256570</td>\n",
" <td>41.256570</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Facebook-AI</th>\n",
" <td>20.982704</td>\n",
" <td>20.982704</td>\n",
" <td>48.653770</td>\n",
" <td>48.653770</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GTCOM</th>\n",
" <td>17.794272</td>\n",
" <td>17.794272</td>\n",
" <td>46.714831</td>\n",
" <td>46.714831</td>\n",
" </tr>\n",
" <tr>\n",
" <th>HuaweiTSC</th>\n",
" <td>17.492440</td>\n",
" <td>17.492440</td>\n",
" <td>46.795737</td>\n",
" <td>46.795737</td>\n",
" </tr>\n",
" <tr>\n",
" <th>MS-EgDC</th>\n",
" <td>17.133350</td>\n",
" <td>17.133350</td>\n",
" <td>45.266274</td>\n",
" <td>45.266274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Manifold</th>\n",
" <td>16.943915</td>\n",
" <td>16.943915</td>\n",
" <td>45.638356</td>\n",
" <td>45.638356</td>\n",
" </tr>\n",
" <tr>\n",
" <th>NiuTrans</th>\n",
" <td>16.512243</td>\n",
" <td>16.512243</td>\n",
" <td>44.724766</td>\n",
" <td>44.724766</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Online-B</th>\n",
" <td>18.655658</td>\n",
" <td>18.655658</td>\n",
" <td>46.658216</td>\n",
" <td>46.658216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Online-Y</th>\n",
" <td>13.898531</td>\n",
" <td>13.898531</td>\n",
" <td>44.842874</td>\n",
" <td>44.842874</td>\n",
" </tr>\n",
" <tr>\n",
" <th>P3AI</th>\n",
" <td>17.793617</td>\n",
" <td>17.793617</td>\n",
" <td>46.307402</td>\n",
" <td>46.307402</td>\n",
" </tr>\n",
" <tr>\n",
" <th>TRANSSION</th>\n",
" <td>18.834851</td>\n",
" <td>18.834851</td>\n",
" <td>47.238279</td>\n",
" <td>47.238279</td>\n",
" </tr>\n",
" <tr>\n",
" <th>TWB</th>\n",
" <td>12.326443</td>\n",
" <td>12.326443</td>\n",
" <td>40.282629</td>\n",
" <td>40.282629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>UEdin</th>\n",
" <td>14.887836</td>\n",
" <td>14.887836</td>\n",
" <td>42.247415</td>\n",
" <td>42.247415</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ZMT</th>\n",
" <td>18.837023</td>\n",
" <td>18.837023</td>\n",
" <td>47.231474</td>\n",
" <td>47.231474</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"metric bleu-A bleu-all chrf-A chrf-all\n",
"system \n",
"AMU 14.132845 14.132845 41.256570 41.256570\n",
"Facebook-AI 20.982704 20.982704 48.653770 48.653770\n",
"GTCOM 17.794272 17.794272 46.714831 46.714831\n",
"HuaweiTSC 17.492440 17.492440 46.795737 46.795737\n",
"MS-EgDC 17.133350 17.133350 45.266274 45.266274\n",
"Manifold 16.943915 16.943915 45.638356 45.638356\n",
"NiuTrans 16.512243 16.512243 44.724766 44.724766\n",
"Online-B 18.655658 18.655658 46.658216 46.658216\n",
"Online-Y 13.898531 13.898531 44.842874 44.842874\n",
"P3AI 17.793617 17.793617 46.307402 46.307402\n",
"TRANSSION 18.834851 18.834851 47.238279 47.238279\n",
"TWB 12.326443 12.326443 40.282629 40.282629\n",
"UEdin 14.887836 14.887836 42.247415 42.247415\n",
"ZMT 18.837023 18.837023 47.231474 47.231474"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.pivot(index='system', columns='metric', values='score')"
]
},
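{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what `pivot` expects, on made-up data (`long_df`, its system names and its scores are illustrative assumptions). `pivot` only reshapes, so every (`index`, `columns`) pair must occur at most once; a repeated pair raises `ValueError`. In that case `pivot_table` can be used instead, because it aggregates the repeated values (here with the mean)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical long-format scores; system names and values are made up.\n",
"long_df = pd.DataFrame({\n",
"    'system': ['A', 'A', 'B', 'B'],\n",
"    'metric': ['bleu', 'chrf', 'bleu', 'chrf'],\n",
"    'score': [10.0, 40.0, 20.0, 50.0],\n",
"})\n",
"\n",
"# One row per system, one column per metric.\n",
"wide = long_df.pivot(index='system', columns='metric', values='score')\n",
"\n",
"# Repeat the first row so that the pair (A, bleu) appears twice.\n",
"with_duplicate = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)\n",
"try:\n",
"    with_duplicate.pivot(index='system', columns='metric', values='score')\n",
"    pivot_failed = False\n",
"except ValueError:\n",
"    # pivot cannot decide which of the repeated values to keep.\n",
"    pivot_failed = True\n",
"\n",
"# pivot_table aggregates the repeated pair instead of failing.\n",
"wide_agg = with_duplicate.pivot_table(index='system', columns='metric', values='score', aggfunc='mean')\n",
"wide_agg"
]
},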
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Text data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`pandas` provides conveniences for working with text values:\n",
" * they are accessed through the `str` attribute;\n",
" * functions:\n",
"     * formatting: `lower()`, `upper()`;\n",
"     * regular expressions: `contains()`, `match()`;\n",
"     * other: `split()`"
]
},
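{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accessor methods listed above can be sketched on a tiny, made-up Series (`names` and its values are illustrative assumptions). The sketch shows the difference between `contains()`, which searches anywhere in the string, and `match()`, which only matches at the beginning. Missing values stay missing unless the `na` argument says otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up names in a 'Surname<TAB> Title. Given name' layout, plus one missing value.\n",
"names = pd.Series(['Smith\\t Mr. John', 'Jones\\t Miss. Anna', None])\n",
"\n",
"# Formatting: every string is upper-cased, the missing value stays missing.\n",
"upper = names.str.upper()\n",
"\n",
"# contains() looks for the pattern anywhere; na=False treats missing values as no match.\n",
"has_miss = names.str.contains('Miss', na=False)\n",
"\n",
"# match() anchors the pattern at the start of the string, so 'Miss' is not found here.\n",
"starts_with_miss = names.str.match('Miss', na=False)\n",
"\n",
"# split() returns a list per row; .str[0] picks the first element of each list.\n",
"surname = names.str.split('\\t').str[0]\n",
"surname"
]
},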
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund\\t Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings\\t Mrs. John Bradley (Florence Briggs T...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen\\t Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle\\t Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen\\t Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"3 Heikkinen\\t Miss. Laina female 26.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen\\t Mr. William Henry male 35.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 BRAUND\\t MR. OWEN HARRIS\n",
"2 CUMINGS\\t MRS. JOHN BRADLEY (FLORENCE BRIGGS T...\n",
"3 HEIKKINEN\\t MISS. LAINA\n",
"4 FUTRELLE\\t MRS. JACQUES HEATH (LILY MAY PEEL)\n",
"5 ALLEN\\t MR. WILLIAM HENRY\n",
" ... \n",
"887 MONTVILA\\t REV. JUOZAS\n",
"888 GRAHAM\\t MISS. MARGARET EDITH\n",
"889 JOHNSTON\\t MISS. CATHERINE HELEN \"CARRIE\"\n",
"890 BEHR\\t MR. KARL HOWELL\n",
"891 DOOLEY\\t MR. PATRICK\n",
"Name: Name, Length: 891, dtype: object"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Name.str.upper()"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PassengerId\n",
"1 Braund\\t Mr. Owen Harris\n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T...\n",
"3 Heikkinen\\t Miss. Laina\n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel)\n",
"5 Allen\\t Mr. William Henry\n",
"Name: Name, dtype: object\n"
]
},
{
"data": {
"text/plain": [
"PassengerId\n",
"1 False\n",
"2 True\n",
"3 True\n",
"4 True\n",
"5 False\n",
"Name: Name, dtype: bool"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(df.Name.head())\n",
"df.Name.str.contains('Miss|Mrs').head()"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>PassengerId</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Braund</td>\n",
" <td>Mr. Owen Harris</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Cumings</td>\n",
" <td>Mrs. John Bradley (Florence Briggs Thayer)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Heikkinen</td>\n",
" <td>Miss. Laina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Futrelle</td>\n",
" <td>Mrs. Jacques Heath (Lily May Peel)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Allen</td>\n",
" <td>Mr. William Henry</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>Montvila</td>\n",
" <td>Rev. Juozas</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>Graham</td>\n",
" <td>Miss. Margaret Edith</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>Johnston</td>\n",
" <td>Miss. Catherine Helen \"Carrie\"</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>Behr</td>\n",
" <td>Mr. Karl Howell</td>\n",
" </tr>\n",
" <tr>\n",
" <th>891</th>\n",
" <td>Dooley</td>\n",
" <td>Mr. Patrick</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"PassengerId \n",
"1 Braund Mr. Owen Harris\n",
"2 Cumings Mrs. John Bradley (Florence Briggs Thayer)\n",
"3 Heikkinen Miss. Laina\n",
"4 Futrelle Mrs. Jacques Heath (Lily May Peel)\n",
"5 Allen Mr. William Henry\n",
"... ... ...\n",
"887 Montvila Rev. Juozas\n",
"888 Graham Miss. Margaret Edith\n",
"889 Johnston Miss. Catherine Helen \"Carrie\"\n",
"890 Behr Mr. Karl Howell\n",
"891 Dooley Mr. Patrick\n",
"\n",
"[891 rows x 2 columns]"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Name.str.split('\\t', expand=True)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 [Braund, Mr. Owen Harris]\n",
"2 [Cumings, Mrs. John Bradley (Florence Briggs ...\n",
"3 [Heikkinen, Miss. Laina]\n",
"4 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]\n",
"5 [Allen, Mr. William Henry]\n",
" ... \n",
"887 [Montvila, Rev. Juozas]\n",
"888 [Graham, Miss. Margaret Edith]\n",
"889 [Johnston, Miss. Catherine Helen \"Carrie\"]\n",
"890 [Behr, Mr. Karl Howell]\n",
"891 [Dooley, Mr. Patrick]\n",
"Name: Name, Length: 891, dtype: object"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Name.str.split('\\t')"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 Mr. Owen Harris\n",
"2 Mrs. John Bradley (Florence Briggs Thayer)\n",
"3 Miss. Laina\n",
"4 Mrs. Jacques Heath (Lily May Peel)\n",
"5 Mr. William Henry\n",
" ... \n",
"887 Rev. Juozas\n",
"888 Miss. Margaret Edith\n",
"889 Miss. Catherine Helen \"Carrie\"\n",
"890 Mr. Karl Howell\n",
"891 Mr. Patrick\n",
"Name: Name, Length: 891, dtype: object"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Name.str.split('\\t').str[1]"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 Mr.\n",
"2 Mrs.\n",
"3 Miss.\n",
"4 Mrs.\n",
"5 Mr.\n",
" ... \n",
"887 Rev.\n",
"888 Miss.\n",
"889 Miss.\n",
"890 Mr.\n",
"891 Mr.\n",
"Name: Name, Length: 891, dtype: object"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.Name.str.split('\\t').str[1].str.strip().str.split(' ').str[0]"
]
},
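{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to the chain of `split` and `strip` calls above, the title can also be captured with a single regular expression through `str.extract`. This is only a sketch: the two names are copied from the first rows shown earlier, and the pattern assumes that the title is the first whitespace-free token after the tab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Two names in the same layout as the Name column.\n",
"names = pd.Series(['Braund\\t Mr. Owen Harris', 'Heikkinen\\t Miss. Laina'])\n",
"\n",
"# Skip the tab and any following whitespace, then capture the next run of non-space characters.\n",
"# expand=False returns a Series because the pattern has exactly one capture group.\n",
"title = names.str.extract(r'\\t\\s*(\\S+)', expand=False)\n",
"title"
]
},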
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise\n",
"The `nba.csv` dataset contains information about the players' heights. Compute the height of each player in metric units, assuming that a foot is `30.48` cm and an inch is `2.54` cm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"interpreter": {
"hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}