diff --git a/IUM_00.Organizacyjne.ipynb b/IUM_00.Organizacyjne.ipynb
index 88d834d..f79f68c 100644
--- a/IUM_00.Organizacyjne.ipynb
+++ b/IUM_00.Organizacyjne.ipynb
@@ -43,7 +43,6 @@
"## Course\n",
"- Course code: 06-DIUMUI0\n",
"- Name: Machine Learning Engineering (Inżynieria Uczenia Maszynowego)\n",
- "- WMI UAM 2021\n",
"- Syllabus: Sylabus-AITech-InzynieriaUczeniaMaszynowego.pdf"
]
},
@@ -61,9 +60,11 @@
"- position:\tPhD student\n",
"- [Zakład Sztucznej Inteligencji](https://ai.wmi.amu.edu.pl/pl/)\n",
"- email: tomasz.zietkiewicz@amu.edu.pl\n",
- "- www: http://tz47965.home.amu.edu.pl/\n",
+ "\n",
"- https://git.wmi.amu.edu.pl/tzietkiewicz/aitech-ium\n",
- "- office hours: via MS Teams, by prior appointment arranged by email or chat"
+ "- office hours:\n",
+ " - via MS Teams, by prior appointment\n",
+ " - room B2-36, Tuesdays 12:00 - 13:00, by prior appointment"
]
},
{
@@ -104,7 +105,7 @@
"10. Experiment control - DVC\n",
"11. GitHub Actions and CML\n",
"12. Managing Jenkins\n",
- "13. Integration\n",
+ "13. Reporting\n",
"14. Technology overview\n",
"15. Technology overview, part 2"
]
diff --git a/IUM_02.Dane.ipynb b/IUM_02.Dane.ipynb
index 22cf3c5..9c5a5ad 100644
--- a/IUM_02.Dane.ipynb
+++ b/IUM_02.Dane.ipynb
@@ -77,9 +77,9 @@
"# Data sources\n",
"- Creating data:\n",
" - Synthetic generation\n",
+ " - e.g. generating speech corpora with TTS (speech synthesis)\n",
" - Crowdsourcing\n",
- " - Data scraping\n",
- " - Extraction\n"
+ " - Data scraping"
]
},
{
@@ -109,10 +109,10 @@
"source": [
"## Repositories/search engines for open datasets\n",
"- Papers with code: https://paperswithcode.com/datasets\n",
- "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/\n",
+ "- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/ (University of California)\n",
"- Google dataset search: https://datasetsearch.research.google.com/\n",
"- Google datasets: https://research.google/tools/datasets/\n",
- "- https://registry.opendata.aws/\n",
+ "- Open datasets on Amazon AWS: https://registry.opendata.aws/\n",
" "
]
},
@@ -129,8 +129,8 @@
" - https://www.openslr.org/ - LibriSpeech, TED-LIUM\n",
" - Mozilla Common Voice: https://commonvoice.mozilla.org/\n",
"- NLP:\n",
- " - Clarin PL: https://lindat.cz/repository/xmlui/\n",
" - Clarin: https://clarin-pl.eu/index.php/zasoby/\n",
+ " - NKJP: http://nkjp.pl/\n",
" "
]
},
@@ -143,9 +143,22 @@
},
"source": [
"## Crowdsourcing\n",
- "- Amazon Mechanical Turk: https://www.mturk.com/\n",
- "- Yandex Toloka\n",
"- reCAPTCHA\n",
+ "\n",
+ "\n",
+ "\n",
+ "Source: https://pl.wikipedia.org/wiki/ReCAPTCHA#/media/Plik:ReCAPTCHA_idea.jpg"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "- Amazon Mechanical Turk: https://www.mturk.com/\n",
"\n",
"\n",
"Source: https://en.wikipedia.org/wiki/Mechanical_Turk#/media/File:Tuerkischer_schachspieler_windisch4.jpg"
@@ -193,6 +206,117 @@
"Licencja: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Downloading the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "scrolled": true,
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Collecting kaggle\n",
+ " Using cached kaggle-1.5.12.tar.gz (58 kB)\n",
+ "Requirement already satisfied: six>=1.10 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.15.0)\n",
+ "Requirement already satisfied: certifi in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2021.5.30)\n",
+ "Requirement already satisfied: python-dateutil in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.8.1)\n",
+ "Requirement already satisfied: requests in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.25.1)\n",
+ "Requirement already satisfied: tqdm in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (4.59.0)\n",
+ "Requirement already satisfied: python-slugify in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (5.0.2)\n",
+ "Requirement already satisfied: urllib3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.26.4)\n",
+ "Requirement already satisfied: text-unidecode>=1.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)\n",
+ "Requirement already satisfied: idna<3,>=2.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (2.10)\n",
+ "Requirement already satisfied: chardet<5,>=3.0.2 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (4.0.0)\n",
+ "Building wheels for collected packages: kaggle\n",
+ " Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
+ "\u001b[?25h Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=1e6240d540651324d97a9772ad1ced30da7d7b5dc5956dc974eeeddf7c48844b\n",
+ " Stored in directory: /home/tomek/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e\n",
+ "Successfully built kaggle\n",
+ "Installing collected packages: kaggle\n",
+ "Successfully installed kaggle-1.5.12\n",
+ "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
+ "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
+ "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
+ "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
+ "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Install the required libraries\n",
+ "!pip install --user kaggle # Kaggle API, used to download the dataset\n",
+ "!pip install --user pandas"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ " - We will download the Iris dataset from Kaggle: https://www.kaggle.com/uciml/iris\n",
+ " - The license is \"Public Domain\", so we can use it without restrictions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium\n",
+ " 0%| | 0.00/3.60k [00:00, ?B/s]\n",
+ "100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 1.63MB/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
+ "# Instructions: https://www.kaggle.com/docs/api\n",
+ "!kaggle datasets download -d uciml/iris"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "scrolled": true,
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Archive: iris.zip\r\n",
+ " inflating: Iris.csv \r\n",
+ " inflating: database.sqlite \r\n"
+ ]
+ }
+ ],
+ "source": [
+ "!unzip -o iris.zip"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
@@ -226,93 +350,54 @@
},
{
"cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "scrolled": true,
- "slideshow": {
- "slide_type": "slide"
- }
- },
+ "execution_count": 7,
+ "metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
- "Requirement already satisfied: kaggle in /home/tomek/.local/lib/python3.8/site-packages (1.5.12)\n",
- "Requirement already satisfied: python-dateutil in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.8.1)\n",
- "Requirement already satisfied: six>=1.10 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.15.0)\n",
- "Requirement already satisfied: urllib3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.25.11)\n",
- "Requirement already satisfied: python-slugify in /home/tomek/.local/lib/python3.8/site-packages (from kaggle) (4.0.1)\n",
- "Requirement already satisfied: certifi in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2020.6.20)\n",
- "Requirement already satisfied: tqdm in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (4.50.2)\n",
- "Requirement already satisfied: requests in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.24.0)\n",
- "Requirement already satisfied: text-unidecode>=1.3 in /home/tomek/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)\n",
- "Requirement already satisfied: chardet<4,>=3.0.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)\n",
- "Requirement already satisfied: idna<3,>=2.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (2.10)\n",
- "Requirement already satisfied: pandas in /home/tomek/anaconda3/lib/python3.8/site-packages (1.1.3)\n",
- "Requirement already satisfied: python-dateutil>=2.7.3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2.8.1)\n",
- "Requirement already satisfied: numpy>=1.15.4 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (1.19.2)\n",
- "Requirement already satisfied: pytz>=2017.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2020.1)\n",
- "Requirement already satisfied: six>=1.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
+ "Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
+ "Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
+ "Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
+ "Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
+ "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n",
+ "Collecting seaborn\n",
+ " Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)\n",
+ "\u001b[K |████████████████████████████████| 292 kB 1.1 MB/s eta 0:00:01\n",
+ "\u001b[?25hCollecting matplotlib>=2.2\n",
+ " Downloading matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n",
+ "\u001b[K |████████████████████████████████| 11.2 MB 10.8 MB/s eta 0:00:01\n",
+ "\u001b[?25hRequirement already satisfied: pandas>=0.23 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.2.4)\n",
+ "Requirement already satisfied: numpy>=1.15 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.20.2)\n",
+ "Requirement already satisfied: scipy>=1.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.6.3)\n",
+ "Requirement already satisfied: packaging>=20.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (20.9)\n",
+ "Requirement already satisfied: python-dateutil>=2.7 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)\n",
+ "Collecting cycler>=0.10\n",
+ " Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n",
+ "Requirement already satisfied: pyparsing>=2.2.1 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)\n",
+ "Collecting fonttools>=4.22.0\n",
+ " Downloading fonttools-4.30.0-py3-none-any.whl (898 kB)\n",
+ "\u001b[K |████████████████████████████████| 898 kB 4.9 MB/s eta 0:00:01\n",
+ "\u001b[?25hRequirement already satisfied: pillow>=6.2.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)\n",
+ "Collecting kiwisolver>=1.0.1\n",
+ " Downloading kiwisolver-1.3.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)\n",
+ "\u001b[K |████████████████████████████████| 1.6 MB 7.7 MB/s eta 0:00:01\n",
+ "\u001b[?25hRequirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)\n",
+ "Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)\n",
+ "Installing collected packages: kiwisolver, fonttools, cycler, matplotlib, seaborn\n",
+ "Successfully installed cycler-0.11.0 fonttools-4.30.0 kiwisolver-1.3.2 matplotlib-3.5.1 seaborn-0.11.2\n"
]
}
],
"source": [
- "# Install the required libraries\n",
- "!pip install --user kaggle # Kaggle API, used to download the dataset\n",
- "!pip install --user pandas"
+ "!pip install --user pandas\n",
+ "!pip install --user seaborn"
]
},
{
"cell_type": "code",
- "execution_count": 61,
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/tomek/.kaggle/kaggle.json'\n",
- "iris.zip: Skipping, found more recently modified local copy (use --force to force download)\n"
- ]
- }
- ],
- "source": [
- "# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
- "# Instructions: https://www.kaggle.com/docs/api\n",
- "!kaggle datasets download -d uciml/iris"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Archive: iris.zip\r\n",
- " inflating: Iris.csv \r\n",
- " inflating: database.sqlite \r\n"
- ]
- }
- ],
- "source": [
- "!unzip -o iris.zip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
+ "execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
@@ -337,7 +422,7 @@
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
@@ -508,7 +593,7 @@
"[150 rows x 6 columns]"
]
},
- "execution_count": 18,
+ "execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@@ -1429,7 +1514,7 @@
" - Audio data: normalization of volume, resolution (bit depth), sampling rate, and number of channels\n",
"- Data augmentation\n",
" - Generating new examples by applying noise/transformations to the original data\n",
- " - e.g. adding echo to an audio recording\n",
+ " - e.g. adding echo to an audio recording, adding noise to an image\n",
" - changing feature values by relatively small, random amounts\n",
"- Over/under-sampling\n",
" - Learning algorithms and metrics can be sensitive to imbalanced classes in the dataset\n",
@@ -1448,9 +1533,9 @@
"# Assignment [5 pts]\n",
"- Choose one of the publicly available datasets. You will work on it until the end of the year (of course, you can change the dataset along the way, but that will mean repeating some steps, which are not very costly, but still).\n",
"- The dataset should be:\n",
- " - not too large (max 10-20 MB)\n",
+ " - not too large (max 50 MB)\n",
" - not too small (e.g. IRIS is too small ;))\n",
- " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O\n",
+ " - unique (each person in the group works on a different dataset). To coordinate, please record the dataset you chose here: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n",
" - preferably a dataset with data in text form that can be used for a classification or regression task; such a dataset will be easier to work with than, e.g., a set of images or audio recordings. This will let you focus on the substance of this course.\n",
"\n",
"- Write a script that:\n",
@@ -1466,7 +1551,7 @@
"- You can write the script in your favorite language. The only restriction: it must run on Linux\n",
"- It will be convenient to create a Jupyter notebook for this purpose (though it is not required)\n",
"- On the faculty git server (http://git.wmi.amu.edu.pl/), create a repository named \"ium_nrindeksu\" (where nrindeksu is your student ID number) and put your script in it\n",
- "- Paste the link to the repository into the dataset spreadsheet (https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O)\n"
+ "- Paste the link to the repository into the dataset spreadsheet: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n"
]
},
{