{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Podstawy Analizy danych w Pythonie: pandas\n", "\n", "## 9 lutego 2019" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Ostatnia cześć kursu Pythona będzie dotyczyć biblioteki **pandas**, która służy do analizy danych. Zacznijmy zatem od importu. Przeważnie bibliotekę skraca się do *pd*:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import sys\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas posiada dwie podstawowe struktury danych: \n", " * szereg (Series),\n", " * ramka danych (DataFrame). \n", " \n", "Zaczniemy od szeregów. Szereg danych, mówiąc prościej jest to lista danych tego samego typu." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Żeby zobaczyć szeregi w akcji, stwórzmy listę losowych liczb:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 4 5 10 9 13 7 7 4 19 13 13 10 2 3 17 18 3 14 4 19 5 6 9 5\n", " 13 14]\n" ] } ], "source": [ "losowe = np.random.randint(1, 20, 26)\n", "print(losowe)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "A następnie stwórzmy szereg, korzystając z powyższych liczb:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 4\n", "1 5\n", "2 10\n", "3 9\n", "4 13\n", "5 7\n", "6 7\n", "7 4\n", "8 19\n", "9 13\n", "10 13\n", "11 10\n", "12 2\n", "13 3\n", "14 17\n", "15 18\n", "16 3\n", "17 14\n", "18 4\n", "19 19\n", "20 5\n", "21 6\n", "22 9\n", "23 5\n", "24 13\n", "25 14\n", "dtype: int64\n" ] } ], "source": [ "dane = pd.Series(losowe)\n", "print(dane)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Czym różni się szereg od listy? Szereg danych posiada indeks, czyli klucz, dzięki ktoremu możemy zindetyfikować dane. Domyślnie, indeks jest ciągiem liczb zaczynających się od zera. Nie musi tak być, możemy podczas tworzenia przekazać również indeks:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 1\n", "b 2\n", "c 3\n", "d 4\n", "e 5\n", "dtype: int64\n" ] } ], "source": [ "dane2 = pd.Series([1,2,3,4,5], index=['a', 'b', 'c', 'd', 'e'])\n", "print(dane2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Jak można domyśleć się, indeks służy nam do uzyskania dostępu do danego elementu:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "print(dane2['b'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Więcej o dostępnie do danych będzie w dalszej części kursu." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby uzyskać rozmiar danych możemy wykorzystać znaną już funkcję **len** lub wykorzystać polę **shape**." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26\n", "(26,)\n" ] } ], "source": [ "print(len(dane))\n", "print(dane.shape)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Przeważnie zbiory danych, na których pracujemy są duże. Stąd, próba ich wyświetlenia może okazać się karkołomna\n", "lub nawet niemożliwa. Czasem chcemy tylko zobaczyć pogląd. Do tego służą dwie metody: **head** i **tail**, które\n", " zwrócą nam kilka pierwszych lub ostatnich wierszy z szeregu:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 4\n", "1 5\n", "2 10\n", "3 9\n", "4 13\n", "dtype: int64\n" ] } ], "source": [ "print(dane.head())" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21 6\n", "22 9\n", "23 5\n", "24 13\n", "25 14\n", "dtype: int64\n" ] } ], "source": [ "print(dane.tail())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Szeregi są dostosowane do analizy danych. Np. udostępniają prosty sposób do uzyskania podstawowych statystyk:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Średnia: 9.461538461538462\n", "Mediana: 9.0\n" ] } ], "source": [ "print(\"Średnia:\", dane.mean())\n", "print(\"Mediana:\", dane.median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jak i inne przydatne funkcje:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Zbiór wartości: [14 6 16 7 17 13 15 2 12 3 4 1 9 19 5]\n", "Zliczanie 15 4\n", "13 3\n", "4 3\n", "12 2\n", "7 2\n", "2 2\n", "19 1\n", "17 1\n", "16 1\n", "14 1\n", "9 1\n", "6 1\n", "5 1\n", "3 1\n", "1 1\n", "dtype: int64\n" ] } ], "source": [ "print(\"Zbiór wartości:\", dane.unique())\n", "print(\"Zliczanie\", dane.value_counts())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Metoda ```value_counts``` zwraca nam szereg danych, który możemy wykorzystać do dalszych badań. Na przyklad, żeby wyświetlić 5 najczęściej występujących wartości, możemy napisać:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15 4\n", "13 3\n", "4 3\n", "12 2\n", "7 2\n", "dtype: int64\n" ] } ], "source": [ "print(dane.value_counts().head())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Żeby uzyskać wszystkie podstawowe statystyki, możmey wywołać metodę ```describe```:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 25.000000\n", "mean 9.720000\n", "std 5.556678\n", "min 1.000000\n", "25% 4.000000\n", "50% 12.000000\n", "75% 15.000000\n", "max 19.000000\n", "dtype: float64\n" ] } ], "source": [ "print(dane.describe())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A żeby wyświetlić je w postaci wykresu:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW0AAAEACAYAAAB4ayemAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEFdJREFUeJzt3V+IbedZx/Hfc3LS0jQmc1B7lMZmtKKBgIyigRI5rCq0\noaIVLyRViaNQvEhpICDW3hy8kV5VCuKNjTWGlqghaaqoSSW+lFQ0KU2maZNWA51oJOcYScyQhDY5\n5vFi78mZTGb22n/etZ/3Xe/3A8PsNbNnr9+8Z81v1jz7zzF3FwCgDieiAwAA5kdpA0BFKG0AqAil\nDQAVobQBoCKUNgBU5OQ8VzKzXUkvSHpN0qvuft2QoQAAR5urtDUp687dnx8yDABgtnnHI7bAdQEA\nA5m3iF3SfWb2sJl9eMhAAIDjzTseud7dnzGz75f0RTN7wt0fHDIYAODN5iptd39m+v5ZM7tH0nWS\n3lDaZsaLmADAgtzdFrl+73jEzC4zs8unl98u6X2Svn7MznnL8Hb27NnwDGN6i1rP6U9FwNvQ+z17\n7H6j/61re1vGPGfapyXdMz2TPinps+5+/1J7w1x2d3ejI4wK65nbbnSApvWWtrt/W9LWGrIAAHrw\nML4CbW9vR0cYFdYzt+3oAE2zZecqb7ohM891W8AYmJkuzpjXuuew/dIBizEzee47IrF+KaXoCKPC\neuaWogM0jdIGgIowHgEGwngEfRiPAMDIUdoFYgabF+uZW4oO0DRKGwAqwkwbGAgzbfRhpg0AI0dp\nF4gZbF6sZ24pOkDTKG0AqAgzbWAgzLTRh5k2AIwcpV0gZrB5sZ65pegATaO0AaAizLSBgTDTRh9m\n2gAwcpR2gZjB5sV65paiAzSN0gaAijDTBgbCTBt9mGkDwMhR2gViBpsX65lbig7QNEobACrCTBsY\nCDNt9GGmDQAjR2kXiBlsXqxnbik6QNMobQCoCDNtYCDMtNGHmTYAjBylXSBmsHmxnrml6ABNo7QB\noCLMtIGBMNNGH2baADBylHaBmMHmxXrmlqIDNI3SBoCKzD3TNrMTkr4i6Wl3/6UjPs9MGziAmTb6\nDD3TvkXS44tFAgDkNFdpm9lVkj4g6dPDxoHEDDY31jO3FB2gafOeaf+RpN9VzN9cAICpk31XMLNf\nkHTe3R81s06TgdmRtre3tbm5KUna2NjQ1taWuq6TdPFsp7TtG2/c1vnzTx33LQ3q1KnTeu65c2/I\n03Wduq4rZn3GsB25nhftb3cj2O5mfH66VdC/f0nb+5d3d3e1rN47Is3sDyX9hqQLkt4m6Xsk3e3u\nNx26XpV3RMbdWSRxx824cUck+gxyR6S7f9zd3+XuPyLpRkkPHC5s5MUMNi/WM7cUHaBpPE4bACrS\n/GuPMB7BUBiPoA+vPQIAI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZbF6s\nZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0A\nI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWba\nGAgzbfRhpg0AI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZbF6sZ24pOkDT\nKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZbF6sZ24pOkDTKG0AqAgzbWbaGAgzbfRhpg0AI0dpF4gZ\nbF6sZ24pOkDTKG0AqEjvTNvM3irpS5LeIumkpLvc/Q+OuB4z7cX3zgxwxJhpo88yM+2TfVdw9++a\n2Xvd/WUzu0TSl83s7939oaWTAgCWMtd4xN1fnl58qyZFz6/TATGDzYv1zC1FB2jaXKVtZifM7BFJ\n5yR90d0fHjYWAOAoCz1O28yukPR5SR9x98cPfW7pmfZLL72kW275mF544cWlvn4Vd93152KmjSEw\n00afQWbaB7n7npn9k6QbJD1++PPb29va3NyUJG1sbGhra0td10m6+CfqUdtPPvmk7rjjTr3yym9L\numZ6a9+cvh9y+7sH0qfp+27N29OtGevDdr3bF+1vdyPfnm4Vsv6lbe9f3t3d1bLmefTI90l61d1f\nMLO3SbpP0ifc/e8OXW/pM+2dnR2dOXOT9vZ2lvr65e1JulKlnWmnlF7/x8bqotZzvGfaSRfL+o37\n5Ux7MUOdaf+gpNvN7IQmM/C/PFzYAID1KOK1RzjTxhiN90z7+P1yPC+G1x4BgJGjtAvE44rzYj1z\nS9EBmkZpA0BFKO0C8ciRvFjP3LroAE2jtAGgIpR2gZjB5sV65paiAzSN0gaAilDaBWIGmxfrmVsX\nHaBplDYAVITSLhAz2LxYz9xSdICmUdoAUBFKu0DMYPNiPXProgM0jdIGgIpQ2gViBpsX65lbig7Q\nNEobACpCaReIGWxerGduXXSAplHaAFARSrtAzGDzYj1zS9EBmkZpA0BFKO0CMYPNi/XMrYsO0DRK\nGwAqQmkXiBlsXqxnbik6QNMobQCoCKVdIGawebGeuXXRAZpGaQNARSjtAjGDzYv1zC1FB2gapQ0A\nFaG0C8QMNi/WM7cuOkDTKG0AqAilXSBmsHmxnrml6ABNo7QBoCKUdoGYwebFeubWRQdoGqUNABWh\ntAvEDDYv1jO3FB2gaZQ2AFSE0i4QM9i8WM/cuugATestbTO7ysweMLNvmNljZvbRdQQDALzZPGfa\nFyTd6u7XSnqPpJvN7JphY7WNGWxerGduKTpA03pL293Pufuj08svSnpC0juHDgYAeLOFZtpmtilp\nS9K/DhEGE8xg82I9c+uiAzRt7tI2s8sl3SXplukZNwBgzU7OcyUzO6lJYd/h7vced73t7W1tbm5K\nkjY2NrS1tfX6Wc7+XPG47QsXXtRkVtZNby1N3w+5/dKB9OvY3+HtS2VmWrcTJy7Ta6+9vPb9njp1\nWs89d05S//GQc/vgTHsd+zu4fdH+djeC7f3Lhz8fczxLk2Pr7rvvXPu/7zLHQ0pJu7u7C31/B5m7\n91/J7C8k/Y+73zrjOj7PbR1lZ2dHZ87cpL29naW+fnl7kq6UtFzu1dkx+04a9k/Q4/Y7NNOyx8gq\nUkohI5JJgcWs87D7TTr6+Iz6fif7jji2VmVmcveFftPN85C/6yX9uqSfM7NHzOyrZnbDsiExjy46\nwKgw086tiw7QtN7xiLt/WdIla8gCAOjBMyKLlKIDjAqP084tRQdoGqUNABWhtIvURQcYFWbauXXR\nAZpGaQNARSjtIqXoAKPCTDu3FB2gaZQ2AFSE0i5SFx1gVJhp59ZFB2gapQ0AFaG0i5SiA4wKM+3c\nUnSAplHaAFARSrtIXXSAUWGmnVsXHaBplDYAVITSLlKKDjAqzLRzS9EBmkZpA0BFKO0iddEBRoWZ\ndm5ddICmUdoAUBFKu0gpOsCoMNPOLUUHaBqlDQAVobSL1EUHGBVm2rl10QGaRmkDQEUo7SKl6ACj\nwkw7txQdoGmUNgBUhNIuUhcdYFSYaefWRQdoGqUNABWhtIuUogOMCjPt3FJ0gKZR2gBQEUq7SF10\ngFFhpp1bFx2gaZQ2AFSE0i5Sig4wKsy0c0vRAZpGaQNARSjtInXRAUaFmXZuXXSAplHaAFARSrtI\nKTrAqDDTzi1FB2gapQ0AFaG0i9RFBxgVZtq5ddEBmkZpA0BFekvbzG4zs/Nm9rV1BILEzDAvZtq5\npegATZvnTPszkt4/dBAAQL/e0nb3ByU9v4YseF0XHWBUmGnn1kUHaBozbQCoyMmcN7a9va3NzU1J\n0sbGhra2tl4/y9mfKx63feHCi5rMyrrpraXp+yG3XzqQfh37O2r7qP0f/Ny68wy5fanMTOt26tRp\n3X33nZM0cx6PubYv2t/uRrC9f/moz+vQ9rryxRxbp09frXPndhc6HlJK2t3dXXqf5u79VzK7WtLf\nuPtPzLiOz3NbR9nZ2dGZMzdpb29nqa9f3p6kKyUtl3t1dsy+k4b9E/S4/Q4tbr/LHpsr7dXGus5J\nRx+fUd9v5L5XO7bMTO6+0G+beccjNn3DWnTRAYAZuugATZvnIX+fk/TPkn7MzP7DzH5r+FgAgKP0\nzrTd/dfWEQQHJXE2g3IlcXzG4dEjAFARSrtIXXQAYIYuOkDTKG0AqAilXaQUHQCYIUUHaBqlDQAV\nobSL1EUHAGboogM0jdIGgIpQ2kVK0QGAGVJ0gKZR2gBQEUq7SF10AGCGLjpA0yhtAKgIpV2kFB0A\nmCFFB2gapQ0AFaG0i9RFBwBm6KIDNI3SBoCKUNpFStEBgBlSdICmUdoAUBFKu0hddABghi46QNMo\nbQCoCKVdpBQdAJghRQdoGqUNABWhtIvURQcAZuiiAzSN0gaAilDaRUrRAYAZUnSAplHaAFARSrtI\nXXQAYIYuOkDTKG0AqAilXaQUHQCYIUUHaBqlDQAVobSL1EUHAGboogM0jdIGgIpQ2kVK0QGAGVJ0\ngKZR2gBQEUq7SF10AGCGLjpA0yhtAKjIXKVtZjeY2TfN7N/M7PeGDoUUHQCYIUUHaFpvaZvZCUl/\nLOn9kq6V9CEzu2boYG17NDoAMAPHZ6R5zrSvk/Tv7v6Uu78q6U5JHxw2Vuv+NzoAMAPHZ6R5Svud\nkv7zwPbT048BANbsZHQASbr00kv1ne98W1dc8Ytr3vOr2ttb8y7nshsdAJhhNzpA0+Yp7f+S9K4D\n21dNP/YmZrZSmFde+duVvn55q+UeZt+3B+13aDH7XfXYXGHPI93vccdniT9LA+91zceWufvsK5hd\nIulbkn5e0jOSHpL0IXd/Yvh4AICDes+03f3/zOwjku7XZAZ+G4UNADF6z7QBAOVY+RmRPPEmLzPb\nNbMdM3vEzB6KzlMbM7vNzM6b2dcOfOyUmd1vZt8ys/vM7MrIjLU4Zi3PmtnTZvbV6dsNkRlrYmZX\nmdkDZvYNM3vMzD46/fhCx+dKpc0TbwbxmqTO3X/S3a+LDlOhz2hyPB70MUn/6O4/LukBSb+/9lR1\nOmotJemT7v5T07d/WHeoil2QdKu7XyvpPZJunvblQsfnqmfaPPEmPxOvCbM0d39Q0vOHPvxBXXy4\nw+2SfnmtoSp1zFpKsQ8RqZa7n3P3R6eXX5T0hCaPxlvo+Fy1HHjiTX4u6T4ze9jMPhwdZiTe4e7n\npckPjqR3BOep3c1m9qiZfZpR03LMbFPSlqR/kXR6keOTM7ryXO/uPy3pA5r8cPxsdKAR4t735f2J\npHe7+5akc5I+GZynOmZ2uaS7JN0yPeM+fDzOPD5XLe25n3iD+bj7M9P3z0q6R5MRFFZz3sxOS5KZ\n/YCk/w7OUy13f9YvPuTsTyX9TGSe2pjZSU0K+w53v3f64YWOz1VL+2FJP2pmV5vZWyTdKOkLK95m\ns8zssulvYZnZ2yW9T9LXY1NVyfTGuesXJG1PL/+mpHsPfwGO9Ya1nJbKvl8Rx+ei/kzS4+7+qQMf\nW+j4XPlx2tOH/HxKF59484mVbrBhZvbDmpxduyZPfPos67kYM/ucJv+1yvdKOi/prKTPS/prST8k\n6SlJv+ruvFRdj2PW8r2azGJf0+RFSH5nfx6L2czseklfkvSYJj/jLunjmjzL/K805/HJk2sAoCLc\nEQkAFaG0AaAilDYAVITSBoCKUNoAUBFKGwAqQmkDQEUobQCoyP8DqnDI/zjyvZIAAAAASUVORK5C\nYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dane.hist()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "(Dane zostały wygenerowane w sposób losowy, stąd ich analiza jak na razie jest pozbawiona sensu.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Indeksowanie, czyli dostęp do danych" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Stwórzmy szereg danych, którego indeks będzie składać się z wielkich liter alfabetu:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A 11\n", "B 13\n", "C 6\n", "D 9\n", "E 18\n", "dtype: int64\n" ] } ], "source": [ "import string\n", "litery = list(string.ascii_uppercase)\n", "dane3 = pd.Series(losowe, index=litery)\n", "print(dane3.head())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Najprostszym sposobem dostępu do danych jest przez indeks:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18\n" ] } ], "source": [ "print(dane3['E'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Szeregi udostępniają wiele więcej. Jeżeli chcemy zobaczyć przykłady o kluczach *P*, *Y*, *T*, to możemy podać listę indeksów jako argument:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P 12\n", "Y 9\n", "T 7\n", "dtype: int64\n" ] } ], "source": [ "print(dane3[['P', 'Y', 'T']])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Możemy również podać zakres danych:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "B 13\n", "C 6\n", "D 9\n", "E 18\n", "dtype: int64\n" ] } ], "source": [ "print(dane3['B':'E'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Jeżeli zmienimy indeks szeregu, to cay czas mamy możliwość pracy na indeskach liczbowych:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C 6\n", "D 9\n", "E 18\n", "dtype: int64\n" ] } ], "source": [ "print(dane3[2:5])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Mapowanie" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Szeregi pozwalają zmieniać dane, które przechowują. Pojedyńcze wartości mozemy zmieniać przy pomocy odwołania się do konkretnego elementu:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "777\n" ] } ], "source": [ "dane3[2] = 777\n", "print(dane3[2])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Jeżeli chcemy zmienić cały szereg przy pomocy funkcji, możemy wykorzystać metodę ```map```:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A 1331\n", "B 2197\n", "C 469097433\n", "D 729\n", "E 5832\n", "F 1000\n", "G 343\n", "H 512\n", "I 125\n", "J 8\n", "K 3375\n", "L 4096\n", "M 216\n", "N 5832\n", "O 343\n", "P 1728\n", "Q 4913\n", "R 1728\n", "S 2197\n", "T 343\n", "U 4913\n", "V 3375\n", "W 2744\n", "X 1331\n", "Y 729\n", "Z 4096\n", "dtype: int64\n" ] } ], "source": [ "def cube(x):\n", " return x ** 3\n", "print(dane3.map(cube))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "*Uwaga:* w Pythonie istnieją funkcje lambda, które można tu wykorzystać." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Ramki danych" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ramka danych jest odpowiednikiem tabeli znanej z R lub sqla. Patrząc z innego punktu widzenia, jest lista szeregóœ danych, które są połącząne z sobą wspólnym indeksem. Stwórzmy ramkę danych składających się z małych i wielkich liter:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('a', 'A'), ('b', 'B'), ('c', 'C'), ('d', 'D'), ('e', 'E'), ('f', 'F'), ('g', 'G'), ('h', 'H'), ('i', 'I'), ('j', 'J'), ('k', 'K'), ('l', 'L'), ('m', 'M'), ('n', 'N'), ('o', 'O'), ('p', 'P'), ('q', 'Q'), ('r', 'R'), ('s', 'S'), ('t', 'T'), ('u', 'U'), ('v', 'V'), ('w', 'W'), ('x', 'X'), ('y', 'Y'), ('z', 'Z')]\n", " 0 1\n", "0 a A\n", "1 b B\n", "2 c C\n", "3 d D\n", "4 e E\n", "5 f F\n", "6 g G\n", "7 h H\n", "8 i I\n", "9 j J\n", "10 k K\n", "11 l L\n", "12 m M\n", "13 n N\n", "14 o O\n", "15 p P\n", "16 q Q\n", "17 r R\n", "18 s S\n", "19 t T\n", "20 u U\n", "21 v V\n", "22 w W\n", "23 x X\n", "24 y Y\n", "25 z Z\n" ] } ], "source": [ "wielkie = list(string.ascii_uppercase)\n", "male = list(string.ascii_lowercase)\n", "surowe = list(zip(male, wielkie))\n", "print(surowe)\n", "\n", "dane = pd.DataFrame(surowe)\n", "print(dane)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jak widzimy, ramkę danych tworzymy podając listę przykładów. W powyższej ramce mamy dwie kolumny nazwane *0* i *1*. Zmieńmy te nazwy na bardziej czytelne:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " małe wielkie\n", "0 a A\n", "1 b B\n", "2 c C\n", "3 d D\n", "4 e E\n" ] } ], "source": [ "dane.columns = [\"małe\", \"wielkie\"]\n", "print(dane.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obsługa ramki danych nie różni się za bardzo od obsługi szeregu, np. działaja metody head i tail, jak i inne:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "małe z\n", "wielkie Z\n", "dtype: object\n", " małe wielkie\n", "count 26 26\n", "unique 26 26\n", "top j M\n", "freq 1 1\n" ] } ], "source": [ "print(dane.max())\n", "print(dane.describe())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dodajmy trzecią kolumnę składającą się z losowych liczb:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "dane['losowe'] = np.random.randint(1, 20, 26)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "df = pd.read_csv(\"./titanic_train.tsv\", sep='\\t', index_col='PassengerId')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',\n", " 'Fare', 'Cabin', 'Embarked'],\n", " dtype='object')\n" ] } ], "source": [ "print(df.columns)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 891 entries, 1 to 891\n", "Data columns (total 11 columns):\n", "Survived 891 non-null int64\n", "Pclass 891 non-null int64\n", "Name 891 non-null object\n", "Sex 891 non-null object\n", "Age 714 non-null float64\n", "SibSp 891 non-null int64\n", "Parch 891 non-null int64\n", "Ticket 891 non-null object\n", "Fare 891 non-null float64\n", "Cabin 204 non-null object\n", "Embarked 889 non-null object\n", "dtypes: float64(2), int64(4), object(5)\n", "memory usage: 83.5+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
103Braund\\t Mr. Owen Harrismale22.010A/5 211717.2500NaNS
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
313Heikkinen\\t Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
503Allen\\t Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund\\t Mr. Owen Harris male 22.0 \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "3 Heikkinen\\t Miss. Laina female 26.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen\\t Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclass
PassengerId
103
211
313
411
503
\n", "
" ], "text/plain": [ " Survived Pclass\n", "PassengerId \n", "1 0 3\n", "2 1 1\n", "3 1 3\n", "4 1 1\n", "5 0 3" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Survived', 'Pclass']].head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(891, 11)\n" ] } ], "source": [ "print(df.shape)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "Survived 1\n", "Pclass 1\n", "Name Graham\\t Miss. Margaret Edith\n", "Sex female\n", "Age 19\n", "SibSp 0\n", "Parch 0\n", "Ticket 112053\n", "Fare 30\n", "Cabin B42\n", "Embarked S\n", "Name: 888, dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[888]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
88811Graham\\t Miss. Margaret Edithfemale19.00011205330.00B42S
88903Johnston\\t Miss. Catherine Helen \"Carrie\"femaleNaN12W./C. 660723.45NaNS
89011Behr\\t Mr. Karl Howellmale26.00011136930.00C148C
89103Dooley\\t Mr. Patrickmale32.0003703767.75NaNQ
\n", "
" ], "text/plain": [ " Survived Pclass Name \\\n", "PassengerId \n", "888 1 1 Graham\\t Miss. Margaret Edith \n", "889 0 3 Johnston\\t Miss. Catherine Helen \"Carrie\" \n", "890 1 1 Behr\\t Mr. Karl Howell \n", "891 0 3 Dooley\\t Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "888 female 19.0 0 0 112053 30.00 B42 S \n", "889 female NaN 1 2 W./C. 6607 23.45 NaN S \n", "890 male 26.0 0 0 111369 30.00 C148 C \n", "891 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[888:892]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
PassengerId
211Cumings\\t Mrs. John Bradley (Florence Briggs T...female38.010PC 1759971.2833C85C
313Heikkinen\\t Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
411Futrelle\\t Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
913Johnson\\t Mrs. Oscar W (Elisabeth Vilhelmina B...female27.00234774211.1333NaNS
1012Nasser\\t Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
\n", "
" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "9 1 3 \n", "10 1 2 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "2 Cumings\\t Mrs. John Bradley (Florence Briggs T... female 38.0 \n", "3 Heikkinen\\t Miss. Laina female 26.0 \n", "4 Futrelle\\t Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "9 Johnson\\t Mrs. Oscar W (Elisabeth Vilhelmina B... female 27.0 \n", "10 Nasser\\t Mrs. Nicholas (Adele Achem) female 14.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "9 0 2 347742 11.1333 NaN S \n", "10 1 0 237736 30.0708 NaN C " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.Survived == 1].head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "PassengerId\n", "1 2\n", "2 2\n", "3 1\n", "4 2\n", "5 1\n", "Name: FSize, dtype: int64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"FSize\"] = df.SibSp + df.Parch + 1\n", "df.FSize.head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Survived
Sex
female0.742038
male0.188908
\n", "
" ], "text/plain": [ " Survived\n", "Sex \n", "female 0.742038\n", "male 0.188908" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Sex', 'Survived']].groupby('Sex').mean()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Survived
FSize
10.303538
\n", "
" ], "text/plain": [ " Survived\n", "FSize \n", "1 0.303538" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['FSize', 'Survived']].loc[df['FSize'] == 1].groupby('FSize').mean()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "0.30353817504655495" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['FSize', 'Survived']].groupby('FSize').mean().loc[1]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Survived 0\n", "Pclass 3\n", "Name Braund\\t Mr. Owen Harris\n", "Sex male\n", "Age 22\n", "SibSp 1\n", "Parch 0\n", "Ticket A/5 21171\n", "Fare 7.25\n", "Cabin NaN\n", "Embarked S\n", "FSize 2\n", "Name: 1, dtype: object\n", "Survived 1\n", "Pclass 1\n", "Name Cumings\\t Mrs. John Bradley (Florence Briggs T...\n", "Sex female\n", "Age 38\n", "SibSp 1\n", "Parch 0\n", "Ticket PC 17599\n", "Fare 71.2833\n", "Cabin C85\n", "Embarked C\n", "FSize 2\n", "Name: 2, dtype: object\n", "Survived 1\n", "Pclass 3\n", "Name Heikkinen\\t Miss. Laina\n", "Sex female\n", "Age 26\n", "SibSp 0\n", "Parch 0\n", "Ticket STON/O2. 3101282\n", "Fare 7.925\n", "Cabin NaN\n", "Embarked S\n", "FSize 1\n", "Name: 3, dtype: object\n", "Survived 1\n", "Pclass 1\n", "Name Futrelle\\t Mrs. Jacques Heath (Lily May Peel)\n", "Sex female\n", "Age 35\n", "SibSp 1\n", "Parch 0\n", "Ticket 113803\n", "Fare 53.1\n", "Cabin C123\n", "Embarked S\n", "FSize 2\n", "Name: 4, dtype: object\n" ] } ], "source": [ "for idx, row in df.iterrows():\n", " if idx < 5:\n", " print(row)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 1 }