Update for class 2.t

2022-03-14 09:09:50 +01:00 · 2022-03-14 09:09:50 +01:00 · 8f8a8f14f9
commit 8f8a8f14f9
parent 0e2b626d0a
2 changed files with 177 additions and 91 deletions
--- a/IUM_00.Organizacyjne.ipynb
+++ b/IUM_00.Organizacyjne.ipynb
@ -43,7 +43,6 @@
    "## Course\n",
    "- Course code: 06-DIUMUI0\n",
    "- Name: Inżynieria Uczenia Maszynowego (Machine Learning Engineering)\n",
-    "- WMI UAM 2021\n",
    "- Syllabus: Sylabus-AITech-InzynieriaUczeniaMaszynowego.pdf"
   ]
  },
@ -61,9 +60,11 @@
    "- position:\tPhD student\n",
    "- [Zakład Sztucznej Inteligencji](https://ai.wmi.amu.edu.pl/pl/) (Department of Artificial Intelligence)\n",
    "- email: tomasz.zietkiewicz@amu.edu.pl\n",
-    "- www: http://tz47965.home.amu.edu.pl/\n",
+    "<!-- - www: http://tz47965.home.amu.edu.pl/ -->\n",
    "- https://git.wmi.amu.edu.pl/tzietkiewicz/aitech-ium\n",
-    "- office hours: via MS Teams, by prior arrangement by email or chat"
+    "- office hours: \n",
+    "    - via MS Teams, by prior arrangement\n",
+    "    - room B2-36, Tuesdays 12:00 - 13:00, by prior arrangement"
   ]
  },
  {
@ -104,7 +105,7 @@
    "10. Experiment control - DVC\n",
    "11. GitHub Actions and CML\n",
    "12. Managing Jenkins\n",
-    "13. Integration\n",
+    "13. Reporting\n",
    "14. Technology overview\n",
    "15. Technology overview, part 2"
   ]
--- a/IUM_02.Dane.ipynb
+++ b/IUM_02.Dane.ipynb
@ -77,9 +77,9 @@
    "# Data sources\n",
    "- Creating data:\n",
    " - Synthetic generation\n",
+    "   - e.g. generating speech corpora with TTS (speech synthesis)\n",
    " - Crowdsourcing\n",
-    " - Data scraping\n",
-    " - Extraction\n"
+    " - Data scraping"
   ]
  },
  {
@ -109,10 +109,10 @@
   "source": [
    "## Repositories/search engines for open datasets\n",
    "- Papers with code: https://paperswithcode.com/datasets\n",
-    "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/\n",
+    "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/ (University of California, Irvine)\n",
    "- Google dataset search: https://datasetsearch.research.google.com/\n",
    "- Google datasets: https://research.google/tools/datasets/\n",
-    "- https://registry.opendata.aws/\n",
+    "- Open datasets on Amazon AWS: https://registry.opendata.aws/\n",
    "    "
   ]
  },
@ -129,8 +129,8 @@
    " - https://www.openslr.org/ - Libri Speech, TED Lium\n",
    " - Mozilla Open Voice: https://commonvoice.mozilla.org/\n",
    "- NLP:\n",
-    " - Clarin PL: https://lindat.cz/repository/xmlui/\n",
    " - Clarin: https://clarin-pl.eu/index.php/zasoby/\n",
+    " - NKJP: http://nkjp.pl/\n",
    " "
   ]
  },
@ -143,9 +143,22 @@
   },
   "source": [
    "## Crowdsourcing\n",
-    "- Amazon Mechanical Turk: https://www.mturk.com/\n",
-    "- Yandex Toloka\n",
    "- reCAPTCHA\n",
+    "<img src=\"img/ReCAPTCHA_idea.jpg\">\n",
+    "<img src=\"img/cat_captcha.png\">\n",
+    "\n",
+    "<sub>Source: https://pl.wikipedia.org/wiki/ReCAPTCHA#/media/Plik:ReCAPTCHA_idea.jpg</sub>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "source": [
+    "- Amazon Mechanical Turk: https://www.mturk.com/\n",
    "<img src=\"img/Tuerkischer_schachspieler_windisch4.jpg\">\n",
    "\n",
    "<sub>Source: https://en.wikipedia.org/wiki/Mechanical_Turk#/media/File:Tuerkischer_schachspieler_windisch4.jpg</sub>"
@ -193,6 +206,117 @@
    "License: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)</sub>"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Downloading the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "scrolled": true,
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Collecting kaggle\n",
+      "  Using cached kaggle-1.5.12.tar.gz (58 kB)\n",
+      "Requirement already satisfied: six>=1.10 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.15.0)\n",
+      "Requirement already satisfied: certifi in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2021.5.30)\n",
+      "Requirement already satisfied: python-dateutil in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.8.1)\n",
+      "Requirement already satisfied: requests in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.25.1)\n",
+      "Requirement already satisfied: tqdm in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (4.59.0)\n",
+      "Requirement already satisfied: python-slugify in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (5.0.2)\n",
+      "Requirement already satisfied: urllib3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.26.4)\n",
+      "Requirement already satisfied: text-unidecode>=1.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)\n",
+      "Requirement already satisfied: idna<3,>=2.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (2.10)\n",
+      "Requirement already satisfied: chardet<5,>=3.0.2 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (4.0.0)\n",
+      "Building wheels for collected packages: kaggle\n",
+      "  Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
+      "\u001b[?25h  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=1e6240d540651324d97a9772ad1ced30da7d7b5dc5956dc974eeeddf7c48844b\n",
+      "  Stored in directory: /home/tomek/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e\n",
+      "Successfully built kaggle\n",
+      "Installing collected packages: kaggle\n",
+      "Successfully installed kaggle-1.5.12\n",
+      "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
+      "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
+      "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
+      "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
+      "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Let's install the required libraries\n",
+    "!pip install --user kaggle # Kaggle API, used to download the dataset\n",
+    "!pip install --user pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    " - We will download the Iris dataset from Kaggle: https://www.kaggle.com/uciml/iris\n",
+    " - It is licensed as \"Public Domain\", so we can use it without restrictions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium\n",
+      "  0%|                                               | 0.00/3.60k [00:00<?, ?B/s]\n",
+      "100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 1.63MB/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
+    "# Instructions: https://www.kaggle.com/docs/api\n",
+    "!kaggle datasets download -d uciml/iris"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "scrolled": true,
+    "slideshow": {
+     "slide_type": "slide"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Archive:  iris.zip\r\n",
+      "  inflating: Iris.csv                \r\n",
+      "  inflating: database.sqlite         \r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!unzip -o iris.zip"
+   ]
+  },
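The cells added above shell out to `kaggle` and `unzip`. As a minimal sketch, assuming one prefers to stay in Python for the extraction step, the standard-library `zipfile` and `csv` modules can do the equivalent of `unzip -o` and read the rows back. The helper name `extract_csv_rows` is hypothetical, and the stand-in archive below (with only a subset of illustrative columns) replaces the real `iris.zip` so the example runs without Kaggle credentials.

```python
import csv
import tempfile
import zipfile
from pathlib import Path


def extract_csv_rows(archive_path, member_name, dest_dir):
    """Extract one member of a zip archive and return its rows as dicts.

    Hypothetical helper, not part of the notebook.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # extract() overwrites an existing file, much like `unzip -o`
        extracted = archive.extract(member_name, path=dest_dir)
    with open(extracted, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Build a tiny stand-in archive (assumed, illustrative content only)
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "iris.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr(
        "Iris.csv",
        "Id,SepalLengthCm,Species\n1,5.1,Iris-setosa\n2,7.0,Iris-versicolor\n",
    )

rows = extract_csv_rows(demo_zip, "Iris.csv", workdir)
print(len(rows), rows[0]["Species"])
```

With the real download in place, the same helper could be pointed at the `iris.zip` file produced by the `kaggle datasets download` cell; values come back as strings and would still need converting to numbers.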
  {
   "cell_type": "markdown",
   "metadata": {
@ -226,93 +350,54 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
-   "metadata": {
-    "scrolled": true,
-    "slideshow": {
-     "slide_type": "slide"
-    }
-   },
+   "execution_count": 7,
+   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Requirement already satisfied: kaggle in /home/tomek/.local/lib/python3.8/site-packages (1.5.12)\n",
-      "Requirement already satisfied: python-dateutil in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.8.1)\n",
-      "Requirement already satisfied: six>=1.10 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.15.0)\n",
-      "Requirement already satisfied: urllib3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.25.11)\n",
-      "Requirement already satisfied: python-slugify in /home/tomek/.local/lib/python3.8/site-packages (from kaggle) (4.0.1)\n",
-      "Requirement already satisfied: certifi in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2020.6.20)\n",
-      "Requirement already satisfied: tqdm in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (4.50.2)\n",
-      "Requirement already satisfied: requests in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.24.0)\n",
-      "Requirement already satisfied: text-unidecode>=1.3 in /home/tomek/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)\n",
-      "Requirement already satisfied: chardet<4,>=3.0.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)\n",
-      "Requirement already satisfied: idna<3,>=2.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (2.10)\n",
-      "Requirement already satisfied: pandas in /home/tomek/anaconda3/lib/python3.8/site-packages (1.1.3)\n",
-      "Requirement already satisfied: python-dateutil>=2.7.3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2.8.1)\n",
-      "Requirement already satisfied: numpy>=1.15.4 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (1.19.2)\n",
-      "Requirement already satisfied: pytz>=2017.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2020.1)\n",
-      "Requirement already satisfied: six>=1.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
+      "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
+      "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
+      "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
+      "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
+      "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n",
+      "Collecting seaborn\n",
+      "  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)\n",
+      "\u001b[K     |████████████████████████████████| 292 kB 1.1 MB/s eta 0:00:01\n",
+      "\u001b[?25hCollecting matplotlib>=2.2\n",
+      "  Downloading matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n",
+      "\u001b[K     |████████████████████████████████| 11.2 MB 10.8 MB/s eta 0:00:01\n",
+      "\u001b[?25hRequirement already satisfied: pandas>=0.23 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.2.4)\n",
+      "Requirement already satisfied: numpy>=1.15 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.20.2)\n",
+      "Requirement already satisfied: scipy>=1.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.6.3)\n",
+      "Requirement already satisfied: packaging>=20.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (20.9)\n",
+      "Requirement already satisfied: python-dateutil>=2.7 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)\n",
+      "Collecting cycler>=0.10\n",
+      "  Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n",
+      "Requirement already satisfied: pyparsing>=2.2.1 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)\n",
+      "Collecting fonttools>=4.22.0\n",
+      "  Downloading fonttools-4.30.0-py3-none-any.whl (898 kB)\n",
+      "\u001b[K     |████████████████████████████████| 898 kB 4.9 MB/s eta 0:00:01\n",
+      "\u001b[?25hRequirement already satisfied: pillow>=6.2.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)\n",
+      "Collecting kiwisolver>=1.0.1\n",
+      "  Downloading kiwisolver-1.3.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)\n",
+      "\u001b[K     |████████████████████████████████| 1.6 MB 7.7 MB/s eta 0:00:01\n",
+      "\u001b[?25hRequirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)\n",
+      "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)\n",
+      "Installing collected packages: kiwisolver, fonttools, cycler, matplotlib, seaborn\n",
+      "Successfully installed cycler-0.11.0 fonttools-4.30.0 kiwisolver-1.3.2 matplotlib-3.5.1 seaborn-0.11.2\n"
     ]
    }
   ],
   "source": [
-    "# Let's install the required libraries\n",
-    "!pip install --user kaggle # Kaggle API, used to download the dataset\n",
-    "!pip install --user pandas"
+    "!pip install --user pandas\n",
+    "!pip install --user seaborn"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 61,
-   "metadata": {
-    "slideshow": {
-     "slide_type": "slide"
-    }
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/tomek/.kaggle/kaggle.json'\n",
-      "iris.zip: Skipping, found more recently modified local copy (use --force to force download)\n"
-     ]
-    }
-   ],
-   "source": [
-    "# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
-    "# Instructions: https://www.kaggle.com/docs/api\n",
-    "!kaggle datasets download -d uciml/iris"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 14,
-   "metadata": {
-    "slideshow": {
-     "slide_type": "slide"
-    }
-   },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Archive:  iris.zip\r\n",
-      "  inflating: Iris.csv                \r\n",
-      "  inflating: database.sqlite         \r\n"
-     ]
-    }
-   ],
-   "source": [
-    "!unzip -o iris.zip"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
@ -337,7 +422,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
@ -508,7 +593,7 @@
       "[150 rows x 6 columns]"
      ]
     },
-     "execution_count": 18,
+     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -1429,7 +1514,7 @@
    "  - Audio data: normalization of volume, resolution, sampling rate, number of channels\n",
    "- Data augmentation\n",
    "  - Generating new examples by applying noise/transformations to the original data\n",
-    "  - e.g. adding echo to an audio recording\n",
+    "  - e.g. adding echo to an audio recording, adding noise to an image\n",
    "  - changing feature values by relatively small, random amounts \n",
    "- Over/under-sampling\n",
    "  - Learning algorithms and metrics can be sensitive to imbalanced classes in the dataset\n",
@ -1448,9 +1533,9 @@
    "# Assignment [5 pts]\n",
    "- Choose one of the publicly available datasets. You will work with it until the end of the year (you can of course change the dataset along the way, but that will mean repeating some steps, which are not very costly, but still).\n",
    "- The dataset should be:\n",
-    " - not too large (max 10-20 MB)\n",
+    " - not too large (max 50 MB)\n",
    " - not too small (e.g. IRIS is too small ;))\n",
-    " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O\n",
+    " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n",
    " - preferably a dataset containing data in text form that can be used for a classification or regression task - such a dataset will be easier to work with than, e.g., a set of images or audio recordings. This will let you focus on the essence of this course.\n",
    "\n",
    "- Write a script that:\n",
@ -1466,7 +1551,7 @@
    "- You can write the script in your favorite language. The only restriction: it must run on Linux\n",
    "- It will be convenient to create a Jupyter notebook for this purpose (although it is not required)\n",
    "- On the faculty git server (http://git.wmi.amu.edu.pl/), create a repository named \"ium_nrindeksu\" (where nrindeksu is your student ID number) and put the script in it\n",
-    "- Paste the link to the repository into the datasets spreadsheet (https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O)\n"
+    "- Paste the link to the repository into the datasets spreadsheet: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n"
   ]
  },
  {