{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Data Analysis in Python: `pandas`\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### `pandas`\n",
"The `pandas` library is the core tool for data analysis in the Python ecosystem:\n",
" * it provides two basic data types:\n",
"     * `Series` (one-dimensional)\n",
"     * `DataFrame` (two-dimensional)\n",
" * it supports operations on these objects, such as handling missing values and joining data;\n",
" * it handles data of various kinds, e.g. time series;\n",
" * it is built on `numpy`, a library for numerical computing;\n",
" * it also offers simple data visualization;\n",
" * it supports ETL workflows: extract, transform, load."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"To import the `pandas` library, it is enough to run:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### __Exercise 0__: check whether you have the `pandas` library installed."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) (`pd.Series`)\n",
"\n",
" A series represents homogeneous one-dimensional data; it is the counterpart of a vector in R.\n",
" * A series can be created in several ways (more on this shortly), e.g. from objects such as lists and dictionaries.\n",
" * The data must be homogeneous; otherwise pandas converts the values automatically to a common type.\n",
" * When creating a series we normally pass the `data` argument. The signature defaults it to `None`, which yields an empty series.\n",
" * We can also pass an index (`index`), a data type (`dtype`), or a name (`name`).\n",
" \n",
" \n",
" ```\n",
" class pandas.Series(data=None, index=None, dtype=None, name=None)\n",
" ```"
]
},
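{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A minimal sketch of the automatic conversion mentioned above. The values and variable names are made up for illustration; the only point is how the resulting `dtype` depends on the input:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Integers only: pandas keeps an integer dtype\n",
"ints = pd.Series([1, 2, 3])\n",
"\n",
"# Integers mixed with a float: every value is upcast to float64\n",
"mixed_numbers = pd.Series([1, 2.5, 3])\n",
"\n",
"# Numbers mixed with text: pandas falls back to the generic object dtype\n",
"mixed_types = pd.Series([1, 'two', 3.0])\n",
"\n",
"print(ints.dtype, mixed_numbers.dtype, mixed_types.dtype)"
]
},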
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When creating a series, we can pass the data as a list or a dictionary.\n",
"\n",
"The example below creates a series from data held in a list:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 211819\n",
"1 682758\n",
"2 737011\n",
"3 779511\n",
"4 673790\n",
"5 673790\n",
"6 444177\n",
"7 136791\n",
"dtype: int64"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"data = [211819, 682758, 737011, 779511, 673790, 673790, 444177, 136791]\n",
"\n",
"s = pd.Series(data)\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When the data comes from a list and no index is given, pandas adds an automatic integer index starting at 0."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When a dictionary is passed as the data, pandas uses its keys to build the index:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819\n",
"May 682758\n",
"June 737011\n",
"July 779511\n",
"dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"When creating a series, we can set both the index and the name of the series:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819.0\n",
"May 682758.0\n",
"June 737011.0\n",
"July 779511.0\n",
"Name: Rides, dtype: float64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"months = ['April', 'May', 'June', 'July']\n",
"\n",
"data = [211819, 682758, 737011, 779511]\n",
"\n",
"s = pd.Series(data=data, index=months, dtype=float, name='Rides')\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"An individual element is accessed using its key from the index."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"211819\n"
]
},
{
"data": {
"text/plain": [
"April 211819\n",
"May 682758\n",
"June 737011\n",
"July 779511\n",
"August 673790\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"print(s['April'])\n",
"\n",
"s['August'] = 673790\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"An element is added to a series by assigning a value to a new key:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"April 211819\n",
"May 682758\n",
"June 737011\n",
"July 779511\n",
"August 673790\n",
"dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = {'April': 211819,'May': 682758, 'June': 737011, 'July': 779511}\n",
"\n",
"s = pd.Series(members)\n",
"\n",
"s['August'] = 673790\n",
"\n",
"s"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"More on indexing series later in the course."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A key feature of a series is that operations are vectorized. Adding two series, for example, works as follows:\n",
" * when both series contain the same key, their values are added;\n",
" * otherwise, the value for that key in the resulting series is `NaN`;\n",
" * equivalently, we can use the `pandas.Series.add` method, whose `fill_value` argument supplies a default for a missing key."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"August 880599.0\n",
"July 973827.0\n",
"June 908505.0\n",
"May 830656.0\n",
"October NaN\n",
"September 814282.0\n",
"dtype: float64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'August': 673790, 'July': 779511,\n",
"    'September': 673790, 'October': 444177})\n",
"\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n",
"    'September': 140492})\n",
"\n",
"all_data = members + occasionals\n",
"# Equivalently\n",
"all_data = members.add(occasionals)\n",
"all_data"
]
},
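{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The default value for a missing key is passed through the `fill_value` argument of `Series.add`. A minimal sketch on a shortened copy of the data above; the `_small` and `_sum` variable names are illustrative only:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"members_small = pd.Series({'May': 682758, 'June': 737011, 'October': 444177})\n",
"occasionals_small = pd.Series({'May': 147898, 'June': 171494})\n",
"\n",
"# Plain addition: 'October' is missing from occasionals_small, so the result is NaN\n",
"plain_sum = members_small + occasionals_small\n",
"\n",
"# With fill_value=0 the missing entry is treated as 0 before adding\n",
"filled_sum = members_small.add(occasionals_small, fill_value=0)\n",
"\n",
"print(plain_sum)\n",
"print(filled_sum)"
]
},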
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also perform arithmetic operations on a whole series:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"May 683758\n",
"June 738011\n",
"July 780511\n",
"August 674790\n",
"September 674790\n",
"October 445177\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n",
"'September': 673790, 'October': 444177})\n",
"\n",
"members += 1000\n",
"\n",
"members"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"May 683758\n",
"June 738011\n",
"July 780511\n",
"August 674790\n",
"September 674790\n",
"October 445177\n",
"dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"May 10000000000683758\n",
"June 10000000000738011\n",
"July 10000000000780511\n",
"August 10000000000674790\n",
"September 10000000000674790\n",
"October 10000000000445177\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members + 10000000000000000"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Summary\n",
" * Series work much like dictionaries, except that the values must be homogeneous (of the same type).\n",
" * Individual elements are accessed with square brackets `[]` and a key.\n",
" * Unlike with dictionaries, arithmetic operations are straightforward."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise 1\n",
" * Create a series `n` containing the numbers from 0 to 10 (inclusive).\n",
" * Create a series `n2` containing the squares of the numbers from 0 to 10 (inclusive).\n",
" * Then create a series `trojkatne` (triangular numbers) equal to the sum of the two series above divided by 2."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"n = list(range(10 + 1))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"n = pd.Series(n)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n",
"6 6\n",
"7 7\n",
"8 8\n",
"9 9\n",
"10 10\n",
"dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"n2 = n**2"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 4\n",
"3 9\n",
"4 16\n",
"5 25\n",
"6 36\n",
"7 49\n",
"8 64\n",
"9 81\n",
"10 100\n",
"dtype: int64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n2"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"trojkatne = (n + n2) / 2"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.0\n",
"1 1.0\n",
"2 3.0\n",
"3 6.0\n",
"4 10.0\n",
"5 15.0\n",
"6 21.0\n",
"7 28.0\n",
"8 36.0\n",
"9 45.0\n",
"10 55.0\n",
"dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trojkatne"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### [Data frame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) (`pd.DataFrame`)\n",
"\n",
"The data frame is the basic data structure in `pandas` for storing and representing tabular (two-dimensional) data.\n",
" * It has columns (features) and rows (observations, examples).\n",
" * We can also think of it as a dictionary whose values are series.\n",
"\n",
"```\n",
"class pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n",
"```\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A data frame can be created in several ways.\n",
"\n",
"The first (\"column-wise\") way is to define the frame by passing series as its columns:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"May 682758\n",
"June 737011\n",
"July 779511\n",
"dtype: int64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"May 147898\n",
"June 171494\n",
"July 194316\n",
"dtype: int64"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"occasionals"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" members occasionals\n",
"July 779511.0 194316.0\n",
"June 737011.0 171494.0\n",
"May NaN 147898.0\n",
"Maydfdsgfdg 682758.0 NaN"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'Maydfdsgfdg': 682758, 'June': 737011, 'July': 779511})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A second common way is to pass a list of dictionaries. `pandas` then interprets it as a list of rows (examples):"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" members occasionals\n",
"0 682758 147898\n",
"1 737011 171494\n",
"2 779511 194316"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = [\n",
" {'members': 682758, 'occasionals': 147898},\n",
" {'occasionals': 171494,'members': 737011},\n",
" {'members': 779511, 'occasionals': 194316},\n",
"]\n",
"\n",
"df = pd.DataFrame(data)\n",
"\n",
"df"
]
},
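{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The dictionaries in the list do not have to share the same keys. As a small sketch (the `rows` and `df_rows` names are illustrative, and one value is deliberately left out), a missing key simply becomes `NaN` in that row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"rows = [\n",
"    {'members': 682758, 'occasionals': 147898},\n",
"    {'members': 737011},  # no 'occasionals' value for this row\n",
"    {'members': 779511, 'occasionals': 194316},\n",
"]\n",
"\n",
"df_rows = pd.DataFrame(rows)\n",
"\n",
"# The missing entry becomes NaN, which turns the whole column into float64\n",
"print(df_rows)\n",
"print(df_rows.dtypes)"
]
},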
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also use the `from_dict` method ([doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html)), which lets us specify whether the data is given column-wise or row-wise:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"index\n",
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"\n",
"columns\n",
" May June July\n",
"members 682758 737011 779511\n",
"occasionals 147898 171494 194316\n"
]
}
],
"source": [
"data = {\n",
" 'May': {'members': 682758, 'occasionals': 147898},\n",
" 'June': {'members': 737011, 'occasionals': 171494},\n",
" 'July': {'members': 779511, 'occasionals': 194316}\n",
"}\n",
"\n",
"df = pd.DataFrame.from_dict(data, orient='index')\n",
"print('index\\n', df)\n",
"print()\n",
"df = pd.DataFrame.from_dict(data, orient='columns')\n",
"print('columns\\n', df)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Loading data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `pandas` library can read and write data in many formats:\n",
" * text formats, e.g. `csv`, `json`\n",
" * spreadsheet files: Excel (xls, xlsx)\n",
" * databases\n",
" * others: `sas`, `spss`\n",
"\n",
"\n",
"Loading data produces a data frame (`DataFrame`)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"One of the simplest data formats is `csv`, in which consecutive values are separated by commas.\n",
"\n",
"To load data in this format, use the `pandas.read_csv` function.\n",
"\n",
"Pandas lets us set many parameters (e.g. the separator or the quote character). See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Country female_BMI male_BMI gdp population \\\n",
"0 Afghanistan 21.07402 20.62058 1311.0 26528741.0 \n",
"1 Albania 25.65726 26.44657 8644.0 2968026.0 \n",
"2 Algeria 26.36841 24.59620 12314.0 34811059.0 \n",
"3 Angola 23.48431 22.25083 7103.0 19842251.0 \n",
"4 Antigua and Barbuda 27.50545 25.76602 25736.0 85350.0 \n",
".. ... ... ... ... ... \n",
"170 Venezuela 28.13408 27.44500 17911.0 28116716.0 \n",
"171 Vietnam 21.06500 20.91630 4085.0 86589342.0 \n",
"172 Palestine 29.02643 26.57750 3564.0 3854667.0 \n",
"173 Zambia 23.05436 20.68321 3039.0 13114579.0 \n",
"174 Zimbabwe 24.64522 22.02660 1286.0 13495462.0 \n",
"\n",
" under5mortality life_expectancy fertility \n",
"0 110.4 52.8 6.20 \n",
"1 17.9 76.8 1.76 \n",
"2 29.5 75.5 2.73 \n",
"3 192.0 56.7 6.43 \n",
"4 10.9 75.5 2.16 \n",
".. ... ... ... \n",
"170 17.1 74.2 2.53 \n",
"171 26.2 74.1 1.86 \n",
"172 24.7 74.1 4.38 \n",
"173 94.9 51.1 5.88 \n",
"174 98.3 47.3 3.85 \n",
"\n",
"[175 rows x 8 columns]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('gapminder.csv')\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38 \n",
"3 Heikkinen\\t Miss. Laina female 26 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35 \n",
"5 Allen\\t Mr. William Henry male 35 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', delimiter='\\t', index_col=0, nrows=5)\n",
"df"
]
},
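{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"To experiment with `read_csv` parameters without any file on disk, we can read from an in-memory buffer. The sketch below uses made-up data (the column names and values are invented) to show the `sep`, `quotechar` and `index_col` parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Made-up, semicolon-separated data; the quoted field contains the separator itself\n",
"raw = io.StringIO(\n",
"    'city;rides;note\\n'\n",
"    'Poznan;120;\"short; rainy day\"\\n'\n",
"    'Warsaw;340;sunny\\n'\n",
")\n",
"\n",
"# sep sets the field separator, quotechar the character that wraps quoted fields\n",
"df_csv = pd.read_csv(raw, sep=';', quotechar='\"', index_col='city')\n",
"df_csv"
]
},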
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"To load data from a spreadsheet, use the `pandas.read_excel` function. Opening an `xlsx` file may require setting the parameter `engine='openpyxl'`. More options are listed in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.read_excel('./bikes.xlsx', engine='openpyxl', nrows=5)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Databases are another important source of data. Pandas can communicate with a database through the [SQLAlchemy](https://pypi.org/project/SQLAlchemy/) library and provides a dedicated function:\n",
" * `pandas.read_sql`, which loads a whole table or the result of a query."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Title ArtistId\n",
"AlbumId \n",
"1 For Those About To Rock We Salute You 1\n",
"2 Balls to the Wall 2\n",
"3 Restless and Wild 2\n",
"4 Let There Be Rock 1\n",
"5 Big Ones 3\n",
"... ... ...\n",
"343 Respighi:Pines of Rome 226\n",
"344 Schubert: The Late String Quartets & String Qu... 272\n",
"345 Monteverdi: L'Orfeo 273\n",
"346 Mozart: Chamber Music 274\n",
"347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n",
"\n",
"[347 rows x 2 columns]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_sql('Album', con='sqlite:///Chinook.sqlite', index_col='AlbumId')\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Title ArtistId\n",
"AlbumId \n",
"1 For Those About To Rock We Salute You 1\n",
"2 Balls to the Wall 2\n",
"3 Restless and Wild 2\n",
"4 Let There Be Rock 1\n",
"5 Big Ones 3\n",
"... ... ...\n",
"343 Respighi:Pines of Rome 226\n",
"344 Schubert: The Late String Quartets & String Qu... 272\n",
"345 Monteverdi: L'Orfeo 273\n",
"346 Mozart: Chamber Music 274\n",
"347 Koyaanisqatsi (Soundtrack from the Motion Pict... 275\n",
"\n",
"[347 rows x 2 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sqlalchemy\n",
"\n",
"# Create the engine once and pass it to pandas instead of repeating the URL\n",
"engine = sqlalchemy.create_engine('sqlite:///Chinook.sqlite')\n",
"\n",
"df = pd.read_sql('SELECT * FROM Album', con=engine, index_col='AlbumId')\n",
"df"
]
},
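{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"`read_sql` also accepts a plain `sqlite3` connection from the standard library. The sketch below does not use `Chinook.sqlite`: it builds a throwaway in-memory table with invented rows and shows how a query parameter can be passed through `params` instead of being pasted into the SQL string:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import sqlite3\n",
"\n",
"import pandas as pd\n",
"\n",
"# Throwaway in-memory database with a tiny, made-up table\n",
"conn = sqlite3.connect(':memory:')\n",
"conn.execute('CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)')\n",
"conn.executemany(\n",
"    'INSERT INTO Album VALUES (?, ?, ?)',\n",
"    [(1, 'First', 10), (2, 'Second', 10), (3, 'Third', 20)],\n",
")\n",
"\n",
"# sqlite3 uses the ? placeholder style; the value is supplied via params\n",
"df_albums = pd.read_sql(\n",
"    'SELECT * FROM Album WHERE ArtistId = ? ORDER BY AlbumId',\n",
"    con=conn,\n",
"    params=(10,),\n",
"    index_col='AlbumId',\n",
")\n",
"conn.close()\n",
"df_albums"
]
},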
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Summary\n",
"\n",
"\n",
" * The `pandas` library can load data from many formats and sources.\n",
" * Each function takes a list of arguments for setting individual parameters (e.g. [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv))."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Saving and exporting data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Pandas makes it easy to save a data frame to a file."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511})\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# save to CSV\n",
"df.to_csv('tmp.csv')\n",
"# save to a spreadsheet\n",
"df.to_excel('tmp.xlsx')"
]
},
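{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"A quick way to check what ends up in the file is a round trip through memory. In this sketch (the `frame`, `csv_text` and `restored` names are illustrative) `to_csv` is called without a path, so it returns the CSV text, which is then read back with `read_csv`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"frame = pd.DataFrame(\n",
"    {'members': [682758, 737011], 'occasionals': [147898, 171494]},\n",
"    index=['May', 'June'],\n",
")\n",
"\n",
"# Without a path, to_csv returns the CSV text instead of writing a file\n",
"csv_text = frame.to_csv()\n",
"\n",
"# index_col=0 turns the first column back into the index\n",
"restored = pd.read_csv(io.StringIO(csv_text), index_col=0)\n",
"\n",
"print(csv_text)\n",
"print(restored)"
]
},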
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also convert a data frame to JSON or to a Python dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\"members\":{\"May\":682758,\"June\":737011,\"July\":779511},\"occasionals\":{\"May\":147898,\"June\":171494,\"July\":194316}}\n"
]
}
],
"source": [
"print(df.to_json())"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'members': {'May': 682758, 'June': 737011, 'July': 779511}, 'occasionals': {'May': 147898, 'June': 171494, 'July': 194316}}\n"
]
}
],
"source": [
"print(df.to_dict())\n"
]
},
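{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"The shape of the resulting dictionary can be changed with the `orient` argument of `to_dict`. A minimal sketch on a two-row frame (the `rides` and `by_*` names are illustrative) comparing three orientations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"rides = pd.DataFrame(\n",
"    {'members': [682758, 737011], 'occasionals': [147898, 171494]},\n",
"    index=['May', 'June'],\n",
")\n",
"\n",
"# Default orientation: {column -> {index -> value}}\n",
"by_column = rides.to_dict()\n",
"\n",
"# orient='records': one dictionary per row (the index is dropped)\n",
"by_row = rides.to_dict(orient='records')\n",
"\n",
"# orient='index': {index -> {column -> value}}\n",
"by_index = rides.to_dict(orient='index')\n",
"\n",
"print(by_column)\n",
"print(by_row)\n",
"print(by_index)"
]
},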
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Or copy the data to the clipboard:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.to_clipboard()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise\n",
"\n",
"\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" * Convert the `Customer` table from the `Chinook.sqlite` database to a spreadsheet. Name the output file `customers.xlsx`.\n",
" * The `Employee` table contains information about the employees of the Chinook company. Display the data on screen and list the cities in which the employees live.\n",
" * The `Invoice` table contains information about invoices. Convert the `BillingCountry` column to a Python dictionary, then find the most frequent value. How many times does it occur?\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Data frame: basics"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"#### Columns\n",
"\n",
"We can think of a data frame as a kind of dictionary whose values are series. This view helps build intuition.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" gdp population life_expectancy\n",
"Country \n",
"Afghanistan 1311.0 26528741.0 52.8\n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7\n",
"Antigua and Barbuda 25736.0 85350.0 75.5\n",
"Argentina 14646.0 40381860.0 75.4\n",
"Armenia 7383.0 2975029.0 72.3\n",
"Australia 41312.0 21370348.0 81.6"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=8, usecols=['Country', 'gdp', 'population','life_expectancy'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Dostęp do poszczególnej kolumny możemy uzystać na dwa sposoby:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan 26528741.0\n",
"Albania 2968026.0\n",
"Algeria 34811059.0\n",
"Angola 19842251.0\n",
"Antigua and Barbuda 85350.0\n",
"Argentina 40381860.0\n",
"Armenia 2975029.0\n",
"Australia 21370348.0\n",
"Name: population, dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# notacja z kropką\n",
"df.population"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Country\n",
"Afghanistan 26528741.0\n",
"Albania 2968026.0\n",
"Algeria 34811059.0\n",
"Angola 19842251.0\n",
"Antigua and Barbuda 85350.0\n",
"Argentina 40381860.0\n",
"Armenia 2975029.0\n",
"Australia 21370348.0\n",
"Name: population, dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Operator []\n",
"df['population']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Do operatora `[]` możemy też podać listę nazw kolumn:"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" gdp | \n",
" population | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Afghanistan | \n",
" 1311.0 | \n",
" 26528741.0 | \n",
"
\n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
"
\n",
" \n",
" Algeria | \n",
" 12314.0 | \n",
" 34811059.0 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
"
\n",
" \n",
" Antigua and Barbuda | \n",
" 25736.0 | \n",
" 85350.0 | \n",
"
\n",
" \n",
" Argentina | \n",
" 14646.0 | \n",
" 40381860.0 | \n",
"
\n",
" \n",
" Armenia | \n",
" 7383.0 | \n",
" 2975029.0 | \n",
"
\n",
" \n",
" Australia | \n",
" 41312.0 | \n",
" 21370348.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" gdp population\n",
"Country \n",
"Afghanistan 1311.0 26528741.0\n",
"Albania 8644.0 2968026.0\n",
"Algeria 12314.0 34811059.0\n",
"Angola 7103.0 19842251.0\n",
"Antigua and Barbuda 25736.0 85350.0\n",
"Argentina 14646.0 40381860.0\n",
"Armenia 7383.0 2975029.0\n",
"Australia 41312.0 21370348.0"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['gdp','population']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Listę kolumn możemy pobrać za pomocą:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['gdp', 'population', 'life_expectancy'], dtype='object')"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" gdp | \n",
" population | \n",
" life_expectancy | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Afghanistan | \n",
" 1311.0 | \n",
" 26528741.0 | \n",
" 52.8 | \n",
"
\n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
" 76.8 | \n",
"
\n",
" \n",
" Algeria | \n",
" 12314.0 | \n",
" 34811059.0 | \n",
" 75.5 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
" 56.7 | \n",
"
\n",
" \n",
" Antigua and Barbuda | \n",
" 25736.0 | \n",
" 85350.0 | \n",
" 75.5 | \n",
"
\n",
" \n",
" Argentina | \n",
" 14646.0 | \n",
" 40381860.0 | \n",
" 75.4 | \n",
"
\n",
" \n",
" Armenia | \n",
" 7383.0 | \n",
" 2975029.0 | \n",
" 72.3 | \n",
"
\n",
" \n",
" Australia | \n",
" 41312.0 | \n",
" 21370348.0 | \n",
" 81.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" gdp population life_expectancy\n",
"Country \n",
"Afghanistan 1311.0 26528741.0 52.8\n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7\n",
"Antigua and Barbuda 25736.0 85350.0 75.5\n",
"Argentina 14646.0 40381860.0 75.4\n",
"Armenia 7383.0 2975029.0 72.3\n",
"Australia 41312.0 21370348.0 81.6"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PKB | \n",
" Populacja | \n",
" ODŻ | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Afghanistan | \n",
" 1311.0 | \n",
" 26528741.0 | \n",
" 52.8 | \n",
"
\n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
" 76.8 | \n",
"
\n",
" \n",
" Algeria | \n",
" 12314.0 | \n",
" 34811059.0 | \n",
" 75.5 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
" 56.7 | \n",
"
\n",
" \n",
" Antigua and Barbuda | \n",
" 25736.0 | \n",
" 85350.0 | \n",
" 75.5 | \n",
"
\n",
" \n",
" Argentina | \n",
" 14646.0 | \n",
" 40381860.0 | \n",
" 75.4 | \n",
"
\n",
" \n",
" Armenia | \n",
" 7383.0 | \n",
" 2975029.0 | \n",
" 72.3 | \n",
"
\n",
" \n",
" Australia | \n",
" 41312.0 | \n",
" 21370348.0 | \n",
" 81.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Afghanistan 1311.0 26528741.0 52.8\n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7\n",
"Antigua and Barbuda 25736.0 85350.0 75.5\n",
"Argentina 14646.0 40381860.0 75.4\n",
"Armenia 7383.0 2975029.0 72.3\n",
"Australia 41312.0 21370348.0 81.6"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns = ['PKB', 'Populacja', 'ODŻ']\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Żeby odwołać się do poszczególnych wierszy należy wykorzystać metodę `loc`:"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PKB 14646.0\n",
"Populacja 40381860.0\n",
"ODŻ 75.4\n",
"Name: Argentina, dtype: float64"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Argentina']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Metoda `loc` również może przyjąć listę wierszy: "
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PKB | \n",
" Populacja | \n",
" ODŻ | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
" 76.8 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
" 56.7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Albania 8644.0 2968026.0 76.8\n",
"Angola 7103.0 19842251.0 56.7"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[['Albania', 'Angola']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Możemy również podać drugi parametr: nazwy kolumn:"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PKB | \n",
" Populacja | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PKB Populacja\n",
"Country \n",
"Albania 8644.0 2968026.0\n",
"Angola 7103.0 19842251.0"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.loc[['Albania', 'Angola'], ['PKB', 'Populacja']]\n",
"\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Albo wykorzystać tzw. _slicing_, cyzli operator `:`:"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PKB | \n",
" Populacja | \n",
" ODŻ | \n",
"
\n",
" \n",
" Country | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Albania | \n",
" 8644.0 | \n",
" 2968026.0 | \n",
" 76.8 | \n",
"
\n",
" \n",
" Algeria | \n",
" 12314.0 | \n",
" 34811059.0 | \n",
" 75.5 | \n",
"
\n",
" \n",
" Angola | \n",
" 7103.0 | \n",
" 19842251.0 | \n",
" 56.7 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PKB Populacja ODŻ\n",
"Country \n",
"Albania 8644.0 2968026.0 76.8\n",
"Algeria 12314.0 34811059.0 75.5\n",
"Angola 7103.0 19842251.0 56.7"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['Albania': 'Angola', 'PKB': 'ODŻ']"
]
},
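{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of how label slices behave, assuming a made-up frame (the name `toy` is illustrative). A `loc` slice includes both endpoints, whereas a positional `iloc` slice, shown here only for contrast, excludes the end position like ordinary Python slicing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame indexed by string labels.\n",
"toy = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])\n",
"\n",
"by_label = toy.loc['b':'c']      # label slice: both 'b' and 'c' are included\n",
"by_position = toy.iloc[1:2]      # positional slice: the end position is excluded\n",
"\n",
"print(list(by_label.index))\n",
"print(list(by_position.index))"
]
},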
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Żeby odwołać się do pojedyńczej wartości możemy użyć metody `at`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.at['Angola', 'PKB']"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Dostęp do indeksu:"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda',\n",
" 'Argentina', 'Armenia', 'Australia'],\n",
" dtype='object', name='Country')"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.index"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Podstawowe metody `pd.Series` i `pd.DataFrame`"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" May | \n",
" 682758 | \n",
" 147898 | \n",
"
\n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
" July | \n",
" 779511 | \n",
" 194316 | \n",
"
\n",
" \n",
" August | \n",
" 673790 | \n",
" 206809 | \n",
"
\n",
" \n",
" September | \n",
" 673790 | \n",
" 140492 | \n",
"
\n",
" \n",
" October | \n",
" 444177 | \n",
" 53596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"members = pd.Series({'May': 682758, 'June': 737011, 'July': 779511, 'August': 673790,\n",
"'September': 673790, 'October': 444177})\n",
"\n",
"occasionals = pd.Series({'May': 147898, 'June': 171494, 'July': 194316, 'August': 206809,\n",
"'September': 140492, 'October': 53596})\n",
"\n",
"df = pd.DataFrame({'members': members, 'occasionals': occasionals})\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda `head` pozwala tworzy nową ramkę danych z pierwszymi 5 przykładami:"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" May | \n",
" 682758 | \n",
" 147898 | \n",
"
\n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda `tail` robi to samo, ale z 5 ostatnymi przykładami:"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
" July | \n",
" 779511 | \n",
" 194316 | \n",
"
\n",
" \n",
" August | \n",
" 673790 | \n",
" 206809 | \n",
"
\n",
" \n",
" September | \n",
" 673790 | \n",
" 140492 | \n",
"
\n",
" \n",
" October | \n",
" 444177 | \n",
" 53596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda `sample` pozwala na stworzenie nowej ramki danych z wylosowanymi `n` przykładami:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" September | \n",
" 673790 | \n",
" 140492 | \n",
"
\n",
" \n",
" May | \n",
" 682758 | \n",
" 147898 | \n",
"
\n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"September 673790 140492\n",
"May 682758 147898\n",
"June 737011 171494"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda `describe` zwraca podstawowe statystyki m.in.: liczebność, średnią, wartości skrajne: "
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" May | \n",
" 682758 | \n",
" 147898 | \n",
"
\n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
" July | \n",
" 779511 | \n",
" 194316 | \n",
"
\n",
" \n",
" August | \n",
" 673790 | \n",
" 206809 | \n",
"
\n",
" \n",
" September | \n",
" 673790 | \n",
" 140492 | \n",
"
\n",
" \n",
" October | \n",
" 444177 | \n",
" 53596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 6.000000 | \n",
" 6.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 665172.833333 | \n",
" 152434.166667 | \n",
"
\n",
" \n",
" std | \n",
" 116216.045456 | \n",
" 54783.506738 | \n",
"
\n",
" \n",
" min | \n",
" 444177.000000 | \n",
" 53596.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 673790.000000 | \n",
" 142343.500000 | \n",
"
\n",
" \n",
" 50% | \n",
" 678274.000000 | \n",
" 159696.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 723447.750000 | \n",
" 188610.500000 | \n",
"
\n",
" \n",
" max | \n",
" 779511.000000 | \n",
" 206809.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"count 6.000000 6.000000\n",
"mean 665172.833333 152434.166667\n",
"std 116216.045456 54783.506738\n",
"min 444177.000000 53596.000000\n",
"25% 673790.000000 142343.500000\n",
"50% 678274.000000 159696.000000\n",
"75% 723447.750000 188610.500000\n",
"max 779511.000000 206809.000000"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Metoda `info` zwraca informacje techniczne o kolumnach: np. typ danych:"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Index: 6 entries, May to October\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 members 6 non-null int64\n",
" 1 occasionals 6 non-null int64\n",
"dtypes: int64(2)\n",
"memory usage: 144.0+ bytes\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Podstawową informacją o ramce danych to liczba przykładów w ramce danych. Możemy wykorzystać to tego funkcję `len`:"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Natomiast atrybut `shape` zwraca nam krotkę z liczbą przykładów i liczbą kolumn:"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(6, 2)"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Operacja arytmetyczne\n",
"\n",
" * `max`, `idxmax`\n",
" * `min`, `idxmin`\n",
" * `mean`\n",
" * `count`"
]
},
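{
"cell_type": "markdown",
"metadata": {},
"source": [
"The methods listed above can be sketched as follows. The frame repeats the monthly counts used earlier in this notebook; the variable name `rides` is an illustrative choice. The `idx*` variants return the index label at which the extreme value occurs rather than the value itself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Same monthly counts as in the members/occasionals frame built earlier.\n",
"rides = pd.DataFrame(\n",
"    {'members': [682758, 737011, 779511, 673790, 673790, 444177],\n",
"     'occasionals': [147898, 171494, 194316, 206809, 140492, 53596]},\n",
"    index=['May', 'June', 'July', 'August', 'September', 'October'])\n",
"\n",
"print(rides.max())       # largest value in each column\n",
"print(rides.idxmax())    # index label at which each maximum occurs\n",
"print(rides.members.min(), rides.members.idxmin())\n",
"print(rides.mean())      # column means\n",
"print(rides.count())     # number of non-missing values per column"
]
},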
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" members | \n",
" occasionals | \n",
"
\n",
" \n",
" \n",
" \n",
" May | \n",
" 682758 | \n",
" 147898 | \n",
"
\n",
" \n",
" June | \n",
" 737011 | \n",
" 171494 | \n",
"
\n",
" \n",
" July | \n",
" 779511 | \n",
" 194316 | \n",
"
\n",
" \n",
" August | \n",
" 673790 | \n",
" 206809 | \n",
"
\n",
" \n",
" September | \n",
" 673790 | \n",
" 140492 | \n",
"
\n",
" \n",
" October | \n",
" 444177 | \n",
" 53596 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" members occasionals\n",
"May 682758 147898\n",
"June 737011 171494\n",
"July 779511 194316\n",
"August 673790 206809\n",
"September 673790 140492\n",
"October 444177 53596"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Zbiór wartości i zliczanie wartości:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 3 2]\n",
"3 4\n",
"1 3\n",
"2 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n",
"\n",
"print(dane.unique())\n",
"\n",
"dane = pd.Series([1, 3, 2, 3, 1, 1, 2, 3, 2, 3])\n",
"\n",
"print(dane.value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 1\n",
"1 3\n",
"2 2\n",
"3 3\n",
"4 1\n",
"5 1\n",
"6 2\n",
"7 3\n",
"8 2\n",
"9 3\n",
"dtype: int64"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dane"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 4\n",
"1 3\n",
"2 3\n",
"Name: count, dtype: int64\n"
]
}
],
"source": [
"print(dane.value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Sprawdzanie czy brakuje danych:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
"5 False\n",
" ... \n",
"887 False\n",
"888 False\n",
"889 True\n",
"890 False\n",
"891 False\n",
"Name: Age, Length: 891, dtype: bool"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"df.Age.isnull()\n"
]
},
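{
"cell_type": "markdown",
"metadata": {},
"source": [
"A boolean mask like the one above is usually a starting point rather than a result. The sketch below, which assumes a small made-up series (the name `ages` and its values are illustrative, not taken from the Titanic file), counts the gaps and keeps only the rows that have a value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical ages with two missing entries (None is stored as NaN).\n",
"ages = pd.Series([22.0, None, 26.0, None, 35.0], name='Age')\n",
"\n",
"mask = ages.isnull()             # True where a value is missing\n",
"print(mask.sum())                # True counts as 1, so this is the number of gaps\n",
"print(ages[~mask].tolist())      # keep only the rows that have a value\n",
"print(ages.dropna().tolist())    # dropna() gives the same result directly"
]
},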
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Dodawanie i modyfikowanie danych"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"conts = pd.Series({\n",
" 'Afghanistan': 'Asia', 'Albania': 'Europe', 'Algeria':' Africa', 'Angola': 'Africa', 'Antigua and Barbuda': 'Americas'})\n",
"\n",
"df['continent'] = conts\n",
"\n",
"df['tmp'] = 1\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.loc['Argentina'] = {\n",
" 'female_BMI': 27.46523,\n",
" 'male_BMI': 27.5017,\n",
" 'gdp': 14646.0,\n",
" 'population': 40381860.0,\n",
" 'under5mortality': 15.4,\n",
" 'life_expectancy': 75.4,\n",
" 'fertility': 2.24\n",
"}\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.drop('gdp', axis='columns')\n"
]
},
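{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `drop` returns a new data frame and leaves the original untouched. A minimal sketch, assuming a made-up two-row frame (the names `toy` and `without_gdp` are illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame with two columns.\n",
"toy = pd.DataFrame({'gdp': [1311.0, 8644.0], 'population': [26528741.0, 2968026.0]},\n",
"                   index=['Afghanistan', 'Albania'])\n",
"\n",
"without_gdp = toy.drop('gdp', axis='columns')    # returns a new frame\n",
"\n",
"print(list(toy.columns))            # the original still has both columns\n",
"print(list(without_gdp.columns))\n",
"\n",
"toy = toy.drop('gdp', axis='columns')            # reassign to keep the change\n",
"print(list(toy.columns))"
]
},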
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Filtrowanie danych"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Biblioteka pandas posiada 2 sposoby na filtrowanie danych zawartych w ramce danych:\n",
" * operator `[]` -- najbardziej rozpowszechniony;\n",
" * metoda `query()`.\n",
"Oba sposoby mają różną składnię.\n",
" "
]
},
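{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two syntaxes can be compared side by side. This is a sketch on made-up data (the frame `toy` only borrows the column names `Pclass` and `Age`): the `[]` operator takes a boolean mask, while `query()` takes the condition as a string that refers to columns by name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical passengers, loosely modelled on the Titanic columns.\n",
"toy = pd.DataFrame({'Pclass': [3, 1, 3, 1], 'Age': [22.0, 38.0, 26.0, 35.0]})\n",
"\n",
"with_brackets = toy[toy['Pclass'] == 1]    # boolean mask inside []\n",
"with_query = toy.query('Pclass == 1')      # condition written as a string\n",
"\n",
"# Both approaches select the same rows.\n",
"print(with_brackets.equals(with_query))"
]
},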
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" PassengerId | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund\\t Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings\\t Mrs. John Bradley (Florence Briggs T... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen\\t Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle\\t Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen\\t Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"3 1 3 \n",
"4 1 1 \n",
"5 0 3 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"3 Heikkinen\\t Miss. Laina female 26.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"5 Allen\\t Mr. William Henry male 35.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"3 0 0 STON/O2. 3101282 7.9250 NaN S \n",
"4 1 0 113803 53.1000 C123 S \n",
"5 0 0 373450 8.0500 NaN S "
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 0\n",
"2 1\n",
"3 1\n",
"4 1\n",
"5 0\n",
" ..\n",
"887 0\n",
"888 1\n",
"889 0\n",
"890 1\n",
"891 0\n",
"Name: Survived, Length: 891, dtype: int64"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Survived']"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId\n",
"1 False\n",
"2 True\n",
"3 True\n",
"4 True\n",
"5 False\n",
" ... \n",
"887 False\n",
"888 True\n",
"889 False\n",
"890 True\n",
"891 False\n",
"Name: Survived, Length: 891, dtype: bool"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Survived'] == 1"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df_survived = df[df['Pclass'] == 1]"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"891"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"216"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_survived)"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" PassengerId | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings\\t Mrs. John Bradley (Florence Briggs T... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle\\t Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" 7 | \n",
" 0 | \n",
" 1 | \n",
" McCarthy\\t Mr. Timothy J | \n",
" male | \n",
" 54.0 | \n",
" 0 | \n",
" 0 | \n",
" 17463 | \n",
" 51.8625 | \n",
" E46 | \n",
" S | \n",
"
\n",
" \n",
" 12 | \n",
" 1 | \n",
" 1 | \n",
" Bonnell\\t Miss. Elizabeth | \n",
" female | \n",
" 58.0 | \n",
" 0 | \n",
" 0 | \n",
" 113783 | \n",
" 26.5500 | \n",
" C103 | \n",
" S | \n",
"
\n",
" \n",
" 24 | \n",
" 1 | \n",
" 1 | \n",
" Sloper\\t Mr. William Thompson | \n",
" male | \n",
" 28.0 | \n",
" 0 | \n",
" 0 | \n",
" 113788 | \n",
" 35.5000 | \n",
" A6 | \n",
" S | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 872 | \n",
" 1 | \n",
" 1 | \n",
" Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) | \n",
" female | \n",
" 47.0 | \n",
" 1 | \n",
" 1 | \n",
" 11751 | \n",
" 52.5542 | \n",
" D35 | \n",
" S | \n",
"
\n",
" \n",
" 873 | \n",
" 0 | \n",
" 1 | \n",
" Carlsson\\t Mr. Frans Olof | \n",
" male | \n",
" 33.0 | \n",
" 0 | \n",
" 0 | \n",
" 695 | \n",
" 5.0000 | \n",
" B51 B53 B55 | \n",
" S | \n",
"
\n",
" \n",
" 880 | \n",
" 1 | \n",
" 1 | \n",
" Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) | \n",
" female | \n",
" 56.0 | \n",
" 0 | \n",
" 1 | \n",
" 11767 | \n",
" 83.1583 | \n",
" C50 | \n",
" C | \n",
"
\n",
" \n",
" 888 | \n",
" 1 | \n",
" 1 | \n",
" Graham\\t Miss. Margaret Edith | \n",
" female | \n",
" 19.0 | \n",
" 0 | \n",
" 0 | \n",
" 112053 | \n",
" 30.0000 | \n",
" B42 | \n",
" S | \n",
"
\n",
" \n",
" 890 | \n",
" 1 | \n",
" 1 | \n",
" Behr\\t Mr. Karl Howell | \n",
" male | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 111369 | \n",
" 30.0000 | \n",
" C148 | \n",
" C | \n",
"
\n",
" \n",
"
\n",
"
216 rows × 11 columns
\n",
"
"
],
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"7 0 1 \n",
"12 1 1 \n",
"24 1 1 \n",
"... ... ... \n",
"872 1 1 \n",
"873 0 1 \n",
"880 1 1 \n",
"888 1 1 \n",
"890 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"7 McCarthy\\t Mr. Timothy J male 54.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"24 Sloper\\t Mr. William Thompson male 28.0 \n",
"... ... ... ... \n",
"872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n",
"873 Carlsson\\t Mr. Frans Olof male 33.0 \n",
"880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n",
"888 Graham\\t Miss. Margaret Edith female 19.0 \n",
"890 Behr\\t Mr. Karl Howell male 26.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"7 0 0 17463 51.8625 E46 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"24 0 0 113788 35.5000 A6 S \n",
"... ... ... ... ... ... ... \n",
"872 1 1 11751 52.5542 D35 S \n",
"873 0 0 695 5.0000 B51 B53 B55 S \n",
"880 0 1 11767 83.1583 C50 C \n",
"888 0 0 112053 30.0000 B42 S \n",
"890 0 0 111369 30.0000 C148 C \n",
"\n",
"[216 rows x 11 columns]"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_survived"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Operatory\n",
"\n",
"* `&` - koniukcja (i)\n",
"* `|` - alternatywa (lub)\n",
"* `~` - negacja (nie)\n",
"* `()` - jeżeli mamy kilka warunków to warto je uporządkować w nawiasy"
]
},
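{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The operators listed above can be tried out on a small frame. This is only a sketch: the `people` frame and its values are made up for illustration and are not part of the Titanic data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example.\n",
"people = pd.DataFrame(\n",
"    {\n",
"        'Pclass': [1, 3, 1, 2],\n",
"        'Sex': ['female', 'male', 'male', 'female'],\n",
"        'Age': [38.0, 22.0, 54.0, 14.0],\n",
"    }\n",
")\n",
"\n",
"first_class = people['Pclass'] == 1\n",
"women = people['Sex'] == 'female'\n",
"\n",
"# Element-wise OR: first-class passengers or women (or both).\n",
"first_or_women = people[first_class | women]\n",
"\n",
"# Element-wise NOT combined with AND: women outside first class.\n",
"women_not_first = people[women & ~first_class]\n",
"\n",
"# Inline comparisons need their own parentheses.\n",
"adult_women = people[(people['Sex'] == 'female') & (people['Age'] >= 18)]\n",
"\n",
"print(len(first_or_women), len(women_not_first), len(adult_women))\n",
"```\n",
"\n",
"Naming the masks first, as in `first_class` and `women`, avoids most precedence mistakes, since each comparison is evaluated before the masks are combined."
]
},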
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"12 1 1 \n",
"32 1 1 \n",
"53 1 1 \n",
"... ... ... \n",
"857 1 1 \n",
"863 1 1 \n",
"872 1 1 \n",
"880 1 1 \n",
"888 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n",
"53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n",
"... ... ... ... \n",
"857 Wick\\t Mrs. George Dennick (Mary Hitchcock) female 45.0 \n",
"863 Swift\\t Mrs. Frederick Joel (Margaret Welles B... female 48.0 \n",
"872 Beckwith\\t Mrs. Richard Leonard (Sallie Monypeny) female 47.0 \n",
"880 Potter\\t Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 \n",
"888 Graham\\t Miss. Margaret Edith female 19.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"32 1 0 PC 17569 146.5208 B78 C \n",
"53 1 0 PC 17572 76.7292 D33 C \n",
"... ... ... ... ... ... ... \n",
"857 1 1 36928 164.8667 NaN S \n",
"863 0 0 17466 25.9292 D17 S \n",
"872 1 1 11751 52.5542 D35 S \n",
"880 0 1 11767 83.1583 C50 C \n",
"888 0 0 112053 30.0000 B42 S \n",
"\n",
"[94 rows x 11 columns]"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pierwsza_klasa = df['Pclass'] == 1\n",
"kobiety = df['Sex'] == 'female'\n",
"\n",
"df[pierwsza_klasa & kobiety]\n"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"4 1 1 \n",
"8 0 3 \n",
"10 1 2 \n",
"... ... ... \n",
"861 0 3 \n",
"862 0 2 \n",
"864 0 3 \n",
"867 1 2 \n",
"875 1 2 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"8 Palsson\\t Master. Gosta Leonard male 2.0 \n",
"10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n",
"... ... ... ... \n",
"861 Hansen\\t Mr. Claus Peter male 41.0 \n",
"862 Giles\\t Mr. Frederick Edward male 21.0 \n",
"864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n",
"867 Duran y More\\t Miss. Asuncion female 27.0 \n",
"875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"8 3 1 349909 21.0750 NaN S \n",
"10 1 0 237736 30.0708 NaN C \n",
"... ... ... ... ... ... ... \n",
"861 2 0 350026 14.1083 NaN S \n",
"862 1 0 28134 11.5000 NaN S \n",
"864 8 2 CA. 2343 69.5500 NaN S \n",
"867 1 0 SC/PARIS 2149 13.8583 NaN C \n",
"875 1 0 P/PP 3381 24.0000 NaN C \n",
"\n",
"[192 rows x 11 columns]"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df['SibSp'] > df['Parch']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### `pd.DataFrame.query`\n",
"\n",
"Another way to filter data is the `query` method, which takes the filter expression as a string argument:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"7 0 1 \n",
"12 1 1 \n",
"24 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"7 McCarthy\\t Mr. Timothy J male 54.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"24 Sloper\\t Mr. William Thompson male 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"7 0 0 17463 51.8625 E46 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"24 0 0 113788 35.5000 A6 S "
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('Pclass == 1').head()"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"2 1 1 \n",
"4 1 1 \n",
"12 1 1 \n",
"32 1 1 \n",
"53 1 1 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"12 Bonnell\\t Miss. Elizabeth female 58.0 \n",
"32 Spencer\\t Mrs. William Augustus (Marie Eugenie) female NaN \n",
"53 Harper\\t Mrs. Henry Sleeper (Myna Haxtun) female 49.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"12 0 0 113783 26.5500 C103 S \n",
"32 1 0 PC 17569 146.5208 B78 C \n",
"53 1 0 PC 17572 76.7292 D33 C "
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('(Pclass == 1) and (Sex == \"female\")').head()"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Survived Pclass \\\n",
"PassengerId \n",
"1 0 3 \n",
"2 1 1 \n",
"4 1 1 \n",
"8 0 3 \n",
"10 1 2 \n",
"... ... ... \n",
"861 0 3 \n",
"862 0 2 \n",
"864 0 3 \n",
"867 1 2 \n",
"875 1 2 \n",
"\n",
" Name Sex Age \\\n",
"PassengerId \n",
"1 Braund\\t Mr. Owen Harris male 22.0 \n",
"2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n",
"4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n",
"8 Palsson\\t Master. Gosta Leonard male 2.0 \n",
"10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n",
"... ... ... ... \n",
"861 Hansen\\t Mr. Claus Peter male 41.0 \n",
"862 Giles\\t Mr. Frederick Edward male 21.0 \n",
"864 Sage\\t Miss. Dorothy Edith \"Dolly\" female NaN \n",
"867 Duran y More\\t Miss. Asuncion female 27.0 \n",
"875 Abelson\\t Mrs. Samuel (Hannah Wizosky) female 28.0 \n",
"\n",
" SibSp Parch Ticket Fare Cabin Embarked \n",
"PassengerId \n",
"1 1 0 A/5 21171 7.2500 NaN S \n",
"2 1 0 PC 17599 71.2833 C85 C \n",
"4 1 0 113803 53.1000 C123 S \n",
"8 3 1 349909 21.0750 NaN S \n",
"10 1 0 237736 30.0708 NaN C \n",
"... ... ... ... ... ... ... \n",
"861 2 0 350026 14.1083 NaN S \n",
"862 1 0 28134 11.5000 NaN S \n",
"864 8 2 CA. 2343 69.5500 NaN S \n",
"867 1 0 SC/PARIS 2149 13.8583 NaN C \n",
"875 1 0 P/PP 3381 24.0000 NaN C \n",
"\n",
"[192 rows x 11 columns]"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.query('SibSp > Parch')"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(113, 11)"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"young = 18\n",
"df.query('Age < @young').shape"
]
},
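{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `query` calls above have direct boolean-mask equivalents. The sketch below, on a made-up `passengers` frame (an assumption for illustration, not the Titanic file), checks that both forms select the same rows and shows how `@` pulls in a Python variable.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example.\n",
"passengers = pd.DataFrame(\n",
"    {\n",
"        'Pclass': [1, 3, 1, 2],\n",
"        'Sex': ['female', 'male', 'male', 'female'],\n",
"        'Age': [38.0, 22.0, 54.0, 14.0],\n",
"    }\n",
")\n",
"\n",
"age_limit = 30  # plain Python variable, referenced in the expression as @age_limit\n",
"\n",
"by_query = passengers.query('Pclass == 1 and Age > @age_limit')\n",
"by_mask = passengers[(passengers['Pclass'] == 1) & (passengers['Age'] > age_limit)]\n",
"\n",
"# Both approaches select the same rows.\n",
"print(by_query.equals(by_mask))\n",
"```\n",
"\n",
"Inside the expression, column names are written bare and `and`, `or`, `not` can be used in place of `&`, `|`, `~`."
]
},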
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Operations on rows and columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.read_csv('./gapminder.csv', index_col='Country', nrows=5)\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Iterating over a data frame means iterating over its column names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"for column_name in df:\n",
"    print(column_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for col_name, series in df.items():\n",
"    print(col_name, series)\n",
"    break"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"for idx, row in df.iterrows():\n",
"    print(idx, '\\n', row)\n",
"    break"
]
},
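{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The three iteration styles above can be compared side by side on a tiny frame. This is a sketch with made-up numbers; the name `frame` and its columns `x` and `y` are assumptions and have nothing to do with `gapminder.csv`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example.\n",
"frame = pd.DataFrame(\n",
"    {'x': [1.0, 2.0], 'y': [10.0, 20.0]},\n",
"    index=['a', 'b'],\n",
")\n",
"\n",
"# Iterating over the frame itself yields column names.\n",
"column_names = [name for name in frame]\n",
"\n",
"# items() yields (column name, Series) pairs.\n",
"column_sums = {name: column.sum() for name, column in frame.items()}\n",
"\n",
"# iterrows() yields (index label, Series) pairs, one per row.\n",
"ratios = {label: row['x'] / row['y'] for label, row in frame.iterrows()}\n",
"\n",
"# The same per-row result computed without a Python-level loop.\n",
"vectorized = frame['x'] / frame['y']\n",
"\n",
"print(column_names, column_sums, ratios)\n",
"```\n",
"\n",
"`iterrows()` builds a new `Series` for every row, so on large frames column arithmetic such as `frame['x'] / frame['y']` is usually the better choice."
]
},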
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def bmi_level(bmi):\n",
"    # Map a single BMI value to a descriptive category.\n",
"    if bmi <= 18.5:\n",
"        level = 'underweight'\n",
"    elif bmi < 25:\n",
"        level = 'normal'\n",
"    elif bmi < 30:\n",
"        level = 'overweight'\n",
"    else:\n",
"        level = 'obese'\n",
"    return level\n",
"\n",
"# Series.map calls the function once per value in the column.\n",
"s = df['male_BMI'].map(bmi_level)\n",
"\n",
"s"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
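"def bmi_level(row_data):\n",
"    # Receives a whole row (a Series) and reads the BMI from it.\n",
"    bmi = row_data['male_BMI']\n",
"    if bmi <= 18.5:\n",
"        return 'underweight'\n",
"    elif bmi < 25:\n",
"        return 'normal'\n",
"    elif bmi < 30:\n",
"        return 'overweight'\n",
"    return 'obese'\n",
"\n",
"# axis=1 applies the function to each row rather than to each column.\n",
"df.apply(bmi_level, axis=1)"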
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.transpose()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grouping (`groupby`)\n",
"\n",
"We often need to split the data by the values in a given column and then compute a summary within each group. The `groupby` method does exactly that."
]
},
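{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The split, compute, and combine steps described above can be sketched on a tiny frame. The `players` frame below is made up for illustration (it is not `nba.csv`), and the hand-written `manual` dictionary is only there to show what `groupby` computes.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example; one salary is missing on purpose.\n",
"players = pd.DataFrame(\n",
"    {\n",
"        'Team': ['A', 'A', 'B', 'B', 'B'],\n",
"        'Salary': [10.0, 30.0, 5.0, None, 15.0],\n",
"    }\n",
")\n",
"\n",
"# Split rows by Team, take the mean of Salary in each group, combine into one Series.\n",
"mean_by_team = players.groupby('Team')['Salary'].mean()\n",
"\n",
"# The same computation written out by hand.\n",
"manual = {\n",
"    team: players.loc[players['Team'] == team, 'Salary'].mean()\n",
"    for team in players['Team'].unique()\n",
"}\n",
"\n",
"print(mean_by_team)\n",
"```\n",
"\n",
"Missing salaries are skipped when the mean is computed, so team `B` is averaged over two values rather than three."
]
},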
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('./nba.csv')\n",
"\n",
"#df.sample(5)"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Name Team Number Position Age Height Weight \\\n",
"0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 \n",
"1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 \n",
"2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 \n",
"3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 \n",
"4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 \n",
".. ... ... ... ... ... ... ... \n",
"453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 \n",
"454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 \n",
"455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 \n",
"456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 \n",
"457 NaN NaN NaN NaN NaN NaN NaN \n",
"\n",
" College Salary \n",
"0 Texas 7730337.0 \n",
"1 Marquette 6796117.0 \n",
"2 Boston University NaN \n",
"3 Georgia State 1148640.0 \n",
"4 NaN 5000000.0 \n",
".. ... ... \n",
"453 Butler 2433333.0 \n",
"454 NaN 900000.0 \n",
"455 NaN 2900000.0 \n",
"456 Kansas 947276.0 \n",
"457 NaN NaN \n",
"\n",
"[458 rows x 9 columns]"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"_Example_: we want to compute the typical (median) salary for each team."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Team Salary\n",
"0 Boston Celtics 7730337.0\n",
"1 Boston Celtics 6796117.0\n",
"2 Boston Celtics NaN\n",
"3 Boston Celtics 1148640.0\n",
"4 Boston Celtics 5000000.0\n",
".. ... ...\n",
"453 Utah Jazz 2433333.0\n",
"454 Utah Jazz 900000.0\n",
"455 Utah Jazz 2900000.0\n",
"456 Utah Jazz 947276.0\n",
"457 NaN NaN\n",
"\n",
"[458 rows x 2 columns]"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Team', 'Salary']]"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" Salary\n",
"Team \n",
"Atlanta Hawks 2854940.0\n",
"Boston Celtics 3021242.5\n",
"Brooklyn Nets 1335480.0\n",
"Charlotte Hornets 4204200.0\n",
"Chicago Bulls 2380440.0"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['Team', 'Salary']].groupby('Team').median().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also pass a list of column names. The values are then computed for each combination of values in those columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.groupby(['Team', 'Position'])['Salary'].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" * `sum()`\n",
" * `min()`\n",
" * `max()`\n",
" * `mean()`\n",
" * `size()`\n",
" * `describe()`\n",
" * `first()`\n",
" * `last()`\n",
" * `count()`\n",
" * `std()`\n",
" * `var()`\n",
" * `sem()`"
]
},
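{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The methods listed above are aggregations that can be called on a grouped object. A minimal sketch on made-up data (the `players` frame is an assumption, not `nba.csv`) shows the difference between `size()` and `count()` and how several aggregations can be requested at once.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example; one salary is missing on purpose.\n",
"players = pd.DataFrame(\n",
"    {\n",
"        'Position': ['C', 'C', 'PG', 'PG', 'PG'],\n",
"        'Salary': [10.0, 30.0, 5.0, None, 15.0],\n",
"    }\n",
")\n",
"\n",
"grouped = players.groupby('Position')['Salary']\n",
"\n",
"# size() counts rows, count() counts non-missing values.\n",
"sizes = grouped.size()\n",
"counts = grouped.count()\n",
"\n",
"# Named aggregation: keyword is the output column name, value is the function name.\n",
"summary = grouped.agg(lowest='min', highest='max', average='mean')\n",
"\n",
"print(summary)\n",
"```\n",
"\n",
"For position `PG`, `size()` reports three rows while `count()` reports two, because one salary is missing."
]
},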
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df[['Position', 'Salary']].groupby('Position').agg(['mean', 'std', 'count'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
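"def group_range(x):\n",
"    # Spread between the largest and smallest value in a group.\n",
"    return x.max() - x.min()\n",
"\n",
"# Select the Salary column so the function only sees numeric values.\n",
"df.groupby('Position')['Salary'].apply(group_range)\n"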
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"gb = df.groupby('Position')\n",
"\n",
"print('Number of groups:', gb.ngroups)\n",
"print(gb.groups.keys())\n",
"\n",
"print(gb.get_group('C').head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Feet part of the 'feet-inches' height string, converted to centimetres (1 ft = 30.48 cm)\n",
"df.Height.str.split('-').str[0].astype('Int64') * 30.48"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Pivot\n",
"The `pivot` method creates a new data frame whose index and column names are taken from values of the original data frame. \n",
"\n",
"_Example_: consider the data frame below, which contains translation quality scores for the Hausa-English language pair. The `system` column holds the system name, the `metric` column holds the metric name, and the `score` column holds the metric value. We want to present the data with the system name as the key and the metrics as columns. We can use the `pivot` method for this, passing 3 arguments:\n",
" * `index`: name of the column used to build the index;\n",
" * `columns`: name of the column whose values become the column names of the new data frame;\n",
" * `values`: name of the column that holds the data of interest."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.read_csv('https://raw.githubusercontent.com/wmt-conference/wmt21-news-systems/main/scores/automatic-scores.tsv', sep='\\t')\n",
"df = df[df.pair == 'ha-en']\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.pivot(index='system', columns='metric', values='score')"
]
},
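{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The reshaping done by `pivot` is easier to follow on a frame small enough to read in full. The sketch below uses made-up system names and scores (assumptions for illustration, not values from the WMT21 file).\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data in long format: one row per (system, metric) pair.\n",
"scores = pd.DataFrame(\n",
"    {\n",
"        'system': ['sys-a', 'sys-a', 'sys-b', 'sys-b'],\n",
"        'metric': ['bleu', 'chrf', 'bleu', 'chrf'],\n",
"        'score': [10.0, 40.0, 12.0, 45.0],\n",
"    }\n",
")\n",
"\n",
"# Wide format: one row per system, one column per metric.\n",
"wide = scores.pivot(index='system', columns='metric', values='score')\n",
"\n",
"print(wide)\n",
"```\n",
"\n",
"`pivot` raises a `ValueError` when the same index and column pair occurs more than once; in that case `pivot_table`, which aggregates duplicates, is the usual alternative."
]
},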
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Text data"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"`pandas` provides conveniences for working with text values:\n",
" * they are accessed through the `str` attribute;\n",
" * functions:\n",
"     * formatting: `lower()`, `upper()`;\n",
"     * regular expressions: `contains()`, `match()`;\n",
"     * other: `split()`"
]
},
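{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `str` functions listed above can be sketched on a short, made-up `Series`. The names below are invented and use a comma as the separator, which is an assumption made to keep the example simple.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for this example; the last value is missing on purpose.\n",
"names = pd.Series(['Smith, Mr. John', 'Jones, Miss. Anna', None])\n",
"\n",
"# Formatting: missing values stay missing.\n",
"upper = names.str.upper()\n",
"\n",
"# Regular expression match; na=False maps missing values to False.\n",
"is_female_title = names.str.contains('Miss|Mrs', na=False)\n",
"\n",
"# split() returns lists; str[0] picks the first element of each list.\n",
"surnames = names.str.split(', ').str[0]\n",
"\n",
"# expand=True spreads the pieces over the columns of a DataFrame.\n",
"parts = names.str.split(', ', expand=True)\n",
"\n",
"print(is_female_title.tolist())\n",
"```\n",
"\n",
"Because every `str` method returns a new `Series` or `DataFrame`, calls can be chained, as in `names.str.split(', ').str[0]`."
]
},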
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df = pd.read_csv('./titanic_train.tsv', sep='\\t', index_col='PassengerId')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.Name.str.upper()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"print(df.Name.head())\n",
"df.Name.str.contains('Miss|Mrs').head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.Name.str.split('\\t', expand=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.Name.str.split('\\t')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.Name.str.split('\\t').str[1]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"df.Name.str.split('\\t').str[1].str.strip().str.split(' ').str[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Exercise\n",
"The `nba.csv` dataset contains information about the players' height. Compute each player's height in metric units, assuming that a foot is `30.48` cm and an inch is `2.54` cm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"interpreter": {
"hash": "d4d1e4263499bec80672ea0156c357c1ee493ec2b1c70f0acce89fc37c4a6abe"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}