From 8f8a8f14f9c7154d4d99eec651d72bfb79ca3fcb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Tomasz=20Zi=C4=99tkiewicz?= Date: Mon, 14 Mar 2022 09:09:50 +0100 Subject: [PATCH] Update of class 2.t --- IUM_00.Organizacyjne.ipynb | 9 +- IUM_02.Dane.ipynb | 259 ++++++++++++++++++++++++------------- 2 files changed, 177 insertions(+), 91 deletions(-) diff --git a/IUM_00.Organizacyjne.ipynb b/IUM_00.Organizacyjne.ipynb index 88d834d..f79f68c 100644 --- a/IUM_00.Organizacyjne.ipynb +++ b/IUM_00.Organizacyjne.ipynb @@ -43,7 +43,6 @@ "## Course\n", "- Course code: 06-DIUMUI0\n", "- Name: Inżynieria Uczenia Maszynowego (Machine Learning Engineering)\n", - "- WMI UAM 2021\n", "- Syllabus: Sylabus-AITech-InzynieriaUczeniaMaszynowego.pdf" ] }, @@ -61,9 +60,11 @@ "- position:\tPhD student\n", "- [Department of Artificial Intelligence](https://ai.wmi.amu.edu.pl/pl/)\n", "- email: tomasz.zietkiewicz@amu.edu.pl\n", - "- www: http://tz47965.home.amu.edu.pl/\n", + "\n", "- https://git.wmi.amu.edu.pl/tzietkiewicz/aitech-ium\n", - "- office hours: via MS Teams, by prior appointment arranged by email or chat" + "- office hours: \n", + " - via MS Teams, by prior appointment\n", + " - room B2-36, Tuesdays 12:00 - 13:00, by prior appointment" ] }, { @@ -104,7 +105,7 @@ "10. Experiment tracking - DVC\n", "11. GitHub Actions and CML\n", "12. Jenkins management\n", - "13. Integration\n", + "13. Reporting\n", "14. Technology overview\n", "15. Technology overview, part 2" ] diff --git a/IUM_02.Dane.ipynb b/IUM_02.Dane.ipynb index 22cf3c5..9c5a5ad 100644 --- a/IUM_02.Dane.ipynb +++ b/IUM_02.Dane.ipynb @@ -77,9 +77,9 @@ "# Data sources\n", "- Creating data:\n", " - Synthetic generation\n", + " - e.g. 
generating speech corpora with TTS (speech synthesis)\n", " - Crowdsourcing\n", - " - Data scraping\n", - " - Extraction\n" + " - Data scraping" ] }, { @@ -109,10 +109,10 @@ "source": [ "## Repositories/search engines for open datasets\n", "- Papers with code: https://paperswithcode.com/datasets\n", - "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/\n", + "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/ (University of California)\n", "- Google dataset search: https://datasetsearch.research.google.com/\n", "- Google datasets: https://research.google/tools/datasets/\n", - "- https://registry.opendata.aws/\n", + "- Open datasets on Amazon AWS: https://registry.opendata.aws/\n", " " ] }, @@ -129,8 +129,8 @@ " - https://www.openslr.org/ - Libri Speech, TED Lium\n", " - Mozilla Common Voice: https://commonvoice.mozilla.org/\n", "- NLP:\n", - " - Clarin PL: https://lindat.cz/repository/xmlui/\n", " - Clarin: https://clarin-pl.eu/index.php/zasoby/\n", + " - NKJP: http://nkjp.pl/\n", " " ] }, @@ -143,9 +143,22 @@ }, "source": [ "## Crowdsourcing\n", - "- Amazon Mechanical Turk: https://www.mturk.com/\n", - "- Yandex Toloka\n", "- reCAPTCHA\n", + "\n", + "\n", + "\n", + "Source: https://pl.wikipedia.org/wiki/ReCAPTCHA#/media/Plik:ReCAPTCHA_idea.jpg" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "- Amazon Mechanical Turk: https://www.mturk.com/\n", "\n", "\n", "Source: https://en.wikipedia.org/wiki/Mechanical_Turk#/media/File:Tuerkischer_schachspieler_windisch4.jpg" @@ -193,6 +206,117 @@ "License: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Downloading the data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "scrolled": true, + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + 
"text": [ + "Collecting kaggle\n", + " Using cached kaggle-1.5.12.tar.gz (58 kB)\n", + "Requirement already satisfied: six>=1.10 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.15.0)\n", + "Requirement already satisfied: certifi in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2021.5.30)\n", + "Requirement already satisfied: python-dateutil in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.8.1)\n", + "Requirement already satisfied: requests in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.25.1)\n", + "Requirement already satisfied: tqdm in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (4.59.0)\n", + "Requirement already satisfied: python-slugify in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (5.0.2)\n", + "Requirement already satisfied: urllib3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.26.4)\n", + "Requirement already satisfied: text-unidecode>=1.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)\n", + "Requirement already satisfied: idna<3,>=2.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (2.10)\n", + "Requirement already satisfied: chardet<5,>=3.0.2 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (4.0.0)\n", + "Building wheels for collected packages: kaggle\n", + " Building wheel for kaggle (setup.py) ... 
\u001b[?25ldone\n", + "\u001b[?25h Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=1e6240d540651324d97a9772ad1ced30da7d7b5dc5956dc974eeeddf7c48844b\n", + " Stored in directory: /home/tomek/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e\n", + "Successfully built kaggle\n", + "Installing collected packages: kaggle\n", + "Successfully installed kaggle-1.5.12\n", + "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n", + "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n", + "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n", + "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n", + "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n" + ] + } + ], + "source": [ + "# Let's install the required libraries \n", + "!pip install --user kaggle # Kaggle API, used to download the dataset\n", + "!pip install --user pandas" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + " - We will download the Iris dataset from Kaggle: https://www.kaggle.com/uciml/iris\n", + " - The license is \"Public Domain\", so we can use it without restrictions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium\n", + " 0%| | 0.00/3.60k [00:00=1.10 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.15.0)\n", - "Requirement already satisfied: urllib3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.25.11)\n", - "Requirement already satisfied: python-slugify in /home/tomek/.local/lib/python3.8/site-packages (from kaggle) (4.0.1)\n", - "Requirement already satisfied: certifi in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2020.6.20)\n", - "Requirement already satisfied: tqdm in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (4.50.2)\n", - "Requirement already satisfied: requests in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.24.0)\n", - "Requirement already satisfied: text-unidecode>=1.3 in /home/tomek/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)\n", - "Requirement already satisfied: chardet<4,>=3.0.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)\n", - "Requirement already satisfied: idna<3,>=2.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (2.10)\n", - "Requirement already satisfied: pandas in /home/tomek/anaconda3/lib/python3.8/site-packages (1.1.3)\n", - "Requirement already satisfied: python-dateutil>=2.7.3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2.8.1)\n", - "Requirement already satisfied: numpy>=1.15.4 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (1.19.2)\n", - "Requirement already satisfied: pytz>=2017.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2020.1)\n", - "Requirement already satisfied: six>=1.5 in 
/home/tomek/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n" + "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n", + "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n", + "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n", + "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n", + "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n", + "Collecting seaborn\n", + " Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)\n", + "\u001b[K |████████████████████████████████| 292 kB 1.1 MB/s eta 0:00:01\n", + "\u001b[?25hCollecting matplotlib>=2.2\n", + " Downloading matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n", + "\u001b[K |████████████████████████████████| 11.2 MB 10.8 MB/s eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: pandas>=0.23 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.2.4)\n", + "Requirement already satisfied: numpy>=1.15 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.20.2)\n", + "Requirement already satisfied: scipy>=1.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.6.3)\n", + "Requirement already satisfied: packaging>=20.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (20.9)\n", + "Requirement already satisfied: python-dateutil>=2.7 in 
/media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)\n", + "Collecting cycler>=0.10\n", + " Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n", + "Requirement already satisfied: pyparsing>=2.2.1 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)\n", + "Collecting fonttools>=4.22.0\n", + " Downloading fonttools-4.30.0-py3-none-any.whl (898 kB)\n", + "\u001b[K |████████████████████████████████| 898 kB 4.9 MB/s eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: pillow>=6.2.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)\n", + "Collecting kiwisolver>=1.0.1\n", + " Downloading kiwisolver-1.3.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)\n", + "\u001b[K |████████████████████████████████| 1.6 MB 7.7 MB/s eta 0:00:01\n", + "\u001b[?25hRequirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)\n", + "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)\n", + "Installing collected packages: kiwisolver, fonttools, cycler, matplotlib, seaborn\n", + "Successfully installed cycler-0.11.0 fonttools-4.30.0 kiwisolver-1.3.2 matplotlib-3.5.1 seaborn-0.11.2\n" ] } ], "source": [ - "# Let's install the required libraries \n", - "!pip install --user kaggle # Kaggle API, used to download the dataset\n", - "!pip install --user pandas" + "!pip install --user pandas\n", + "!pip install --user seaborn" ] }, { "cell_type": "code", - "execution_count": 61, - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Warning: Your Kaggle API key is readable by other 
users 
on this system! To fix this, you can run 'chmod 600 /home/tomek/.kaggle/kaggle.json'\n", - "iris.zip: Skipping, found more recently modified local copy (use --force to force download)\n" - ] - } - ], - "source": [ - "# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n", - "# Instructions: https://www.kaggle.com/docs/api\n", - "!kaggle datasets download -d uciml/iris" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "slideshow": { - "slide_type": "slide" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Archive: iris.zip\r\n", - " inflating: Iris.csv \r\n", - " inflating: database.sqlite \r\n" - ] - } - ], - "source": [ - "!unzip -o iris.zip" - ] - }, - { - "cell_type": "code", - "execution_count": 15, + "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" @@ -337,7 +422,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 5, "metadata": { "slideshow": { "slide_type": "slide" @@ -508,7 +593,7 @@ "[150 rows x 6 columns]" ] }, - "execution_count": 18, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -1429,7 +1514,7 @@ " - Audio data: normalization of volume, resolution, sampling rate, number of channels\n", "- Data augmentation\n", " - Generating new examples by applying noise/transformations to the original data\n", - " - e.g. adding echo to an audio recording\n", + " - e.g. adding echo to an audio recording, adding noise to an image\n", " - changing feature values by relatively small random amounts \n", "- Over/under-sampling\n", " - Learning algorithms and metrics can be sensitive to imbalanced classes in the dataset\n", @@ -1448,9 +1533,9 @@ "# Assignment [5 pts]\n", "- Choose one of the publicly available datasets. 
You will work with it until the end of the year (you can of course change the dataset along the way, but that means repeating some steps; they are not very costly, but still).\n", "- The dataset should be:\n", - " - not too large (max 10-20 MB)\n", + " - not too large (max 50 MB)\n", " - not too small (e.g. IRIS is too small ;))\n", - " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O\n", + " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n", " - ideally, it should contain data in text form and be suitable for a classification or regression task - such a dataset is easier to work with than, say, a collection of images or audio. This lets you focus on the substance of this course.\n", "\n", "- Write a script that:\n", @@ -1466,7 +1551,7 @@ "- You can write the script in your favorite language. The only constraint: it must run on Linux\n", "- A Jupyter notebook is a convenient way to do this (though it is not required)\n", "- On the faculty git server (http://git.wmi.amu.edu.pl/), create a repository named \"ium_nrindeksu\" (where nrindeksu is your student ID number) and put your script in it\n", - "- Paste the link to the repository into the datasets spreadsheet (https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O)\n" + "- Paste the link to the repository into the datasets spreadsheet: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n" ] }, {