pms/ium
forked from AITech/aitech-ium

Update class 2.t

Tomasz Ziętkiewicz 2022-03-14 09:09:50 +01:00
parent 0e2b626d0a
commit 8f8a8f14f9
2 changed files with 177 additions and 91 deletions

View File

@ -43,7 +43,6 @@
"## Course\n",
"- Course code: 06-DIUMUI0\n",
"- Name: Inżynieria Uczenia Maszynowego (Machine Learning Engineering)\n",
"- WMI UAM 2021\n",
"- Syllabus: Sylabus-AITech-InzynieriaUczeniaMaszynowego.pdf"
]
},
@ -61,9 +60,11 @@
"- position:\tPhD student\n",
"- [Zakład Sztucznej Inteligencji](https://ai.wmi.amu.edu.pl/pl/)\n",
"- email: tomasz.zietkiewicz@amu.edu.pl\n",
"- www: http://tz47965.home.amu.edu.pl/\n",
"<!-- - www: http://tz47965.home.amu.edu.pl/ -->\n",
"- https://git.wmi.amu.edu.pl/tzietkiewicz/aitech-ium\n",
"- office hours: via MS Teams, by prior appointment arranged by email or chat"
"- office hours: \n",
" - via MS Teams, by prior appointment\n",
" - room B2-36, Tuesdays 12:00 - 13:00, by prior appointment"
]
},
{
@ -104,7 +105,7 @@
"10. Experiment control - DVC\n",
"11. GitHub Actions and CML\n",
"12. Managing Jenkins\n",
"13. Integration\n",
"13. Reporting\n",
"14. Technology overview\n",
"15. Technology overview, part 2"
]

View File

@ -77,9 +77,9 @@
"# Data sources\n",
"- Creating data:\n",
" - Synthetic generation\n",
" - e.g. generating speech corpora with TTS (speech synthesis)\n",
" - Crowdsourcing\n",
" - Data scraping\n",
" - Extraction\n"
" - Data scraping"
]
},
{
@ -109,10 +109,10 @@
"source": [
"## Repositories/search engines for open datasets\n",
"- Papers with code: https://paperswithcode.com/datasets\n",
"- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/\n",
"- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/ (University of California)\n",
"- Google dataset search: https://datasetsearch.research.google.com/\n",
"- Google datasets: https://research.google/tools/datasets/\n",
"- https://registry.opendata.aws/\n",
"- Open datasets on Amazon AWS: https://registry.opendata.aws/\n",
" "
]
},
@ -129,8 +129,8 @@
" - https://www.openslr.org/ - LibriSpeech, TED-LIUM\n",
" - Mozilla Common Voice: https://commonvoice.mozilla.org/\n",
"- NLP:\n",
" - Clarin PL: https://lindat.cz/repository/xmlui/\n",
" - Clarin: https://clarin-pl.eu/index.php/zasoby/\n",
" - NKJP: http://nkjp.pl/\n",
" "
]
},
@ -143,9 +143,22 @@
},
"source": [
"## Crowdsourcing\n",
"- Amazon Mechanical Turk: https://www.mturk.com/\n",
"- Yandex Toloka\n",
"- reCAPTCHA\n",
"<img src=\"img/ReCAPTCHA_idea.jpg\">\n",
"<img src=\"img/cat_captcha.png\">\n",
"\n",
"<sub>Source: https://pl.wikipedia.org/wiki/ReCAPTCHA#/media/Plik:ReCAPTCHA_idea.jpg</sub>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Amazon Mechanical Turk: https://www.mturk.com/\n",
"<img src=\"img/Tuerkischer_schachspieler_windisch4.jpg\">\n",
"\n",
"<sub>Source: https://en.wikipedia.org/wiki/Mechanical_Turk#/media/File:Tuerkischer_schachspieler_windisch4.jpg</sub>"
@ -193,6 +206,117 @@
"Licencja: [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0)</sub>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting kaggle\n",
" Using cached kaggle-1.5.12.tar.gz (58 kB)\n",
"Requirement already satisfied: six>=1.10 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.15.0)\n",
"Requirement already satisfied: certifi in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2021.5.30)\n",
"Requirement already satisfied: python-dateutil in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.8.1)\n",
"Requirement already satisfied: requests in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (2.25.1)\n",
"Requirement already satisfied: tqdm in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (4.59.0)\n",
"Requirement already satisfied: python-slugify in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (5.0.2)\n",
"Requirement already satisfied: urllib3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from kaggle) (1.26.4)\n",
"Requirement already satisfied: text-unidecode>=1.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3)\n",
"Requirement already satisfied: idna<3,>=2.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (2.10)\n",
"Requirement already satisfied: chardet<5,>=3.0.2 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from requests->kaggle) (4.0.0)\n",
"Building wheels for collected packages: kaggle\n",
" Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73053 sha256=1e6240d540651324d97a9772ad1ced30da7d7b5dc5956dc974eeeddf7c48844b\n",
" Stored in directory: /home/tomek/.cache/pip/wheels/ac/b2/c3/fa4706d469b5879105991d1c8be9a3c2ef329ba9fe2ce5085e\n",
"Successfully built kaggle\n",
"Installing collected packages: kaggle\n",
"Successfully installed kaggle-1.5.12\n",
"Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
"Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
]
}
],
"source": [
"# Install the required libraries\n",
"!pip install --user kaggle  # Kaggle API, used to download the dataset\n",
"!pip install --user pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - We will download the Iris dataset from Kaggle: https://www.kaggle.com/uciml/iris\n",
" - The license is \"Public Domain\", so we can use it without restrictions."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium\n",
" 0%| | 0.00/3.60k [00:00<?, ?B/s]\n",
"100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 1.63MB/s]\n"
]
}
],
"source": [
"# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
"# Instructions: https://www.kaggle.com/docs/api\n",
"!kaggle datasets download -d uciml/iris"
]
},
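The download step depends on a token file at `~/.kaggle/kaggle.json`, and the Kaggle CLI warns when that file is readable by other users. A minimal sketch of a pre-flight check follows. The helper name `check_kaggle_token` is hypothetical, and the assumption that the token is a JSON object with `username` and `key` fields should be verified against the Kaggle API instructions linked above.

```python
import json
import stat
from pathlib import Path


def check_kaggle_token(path):
    """Return a list of problems found with the token file (empty if none).

    Hypothetical helper; the expected "username" and "key" fields are an
    assumption about the token file layout.
    """
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    # Flag any permission bits granted to group or others.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} has mode {oct(mode)}; consider: chmod 600 {path}")
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        problems.append(f"{path} is not valid JSON")
        return problems
    if not isinstance(data, dict):
        problems.append(f"{path} does not contain a JSON object")
        return problems
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"{path} is missing the '{field}' field")
    return problems


if __name__ == "__main__":
    for problem in check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json"):
        print(problem)
```

Running the check before `kaggle datasets download` gives a clearer message than a failed download, but it does not confirm that the token itself is valid.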
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: iris.zip\r\n",
" inflating: Iris.csv \r\n",
" inflating: database.sqlite \r\n"
]
}
],
"source": [
"!unzip -o iris.zip"
]
},
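Once `Iris.csv` has been unpacked, a quick sanity check is to count the examples per class. The sketch below uses only the standard library rather than pandas. The helper name `class_counts` is hypothetical, and the label column name `Species` is an assumption about the Kaggle version of the file.

```python
import csv
from collections import Counter
from pathlib import Path


def class_counts(csv_file, label_column="Species"):
    """Count rows per class label in an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


if __name__ == "__main__":
    path = Path("Iris.csv")  # file produced by the unzip step
    if path.exists():
        with path.open(newline="") as f:
            for label, count in class_counts(f).items():
                print(label, count)
```

A strongly uneven result here would be an early hint that the over/under-sampling step may be needed later.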
{
"cell_type": "markdown",
"metadata": {
@ -226,93 +350,54 @@
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: kaggle in /home/tomek/.local/lib/python3.8/site-packages (1.5.12)\n",
"Requirement already satisfied: python-dateutil in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.8.1)\n",
"Requirement already satisfied: six>=1.10 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.15.0)\n",
"Requirement already satisfied: urllib3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (1.25.11)\n",
"Requirement already satisfied: python-slugify in /home/tomek/.local/lib/python3.8/site-packages (from kaggle) (4.0.1)\n",
"Requirement already satisfied: certifi in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2020.6.20)\n",
"Requirement already satisfied: tqdm in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (4.50.2)\n",
"Requirement already satisfied: requests in /home/tomek/anaconda3/lib/python3.8/site-packages (from kaggle) (2.24.0)\n",
"Requirement already satisfied: text-unidecode>=1.3 in /home/tomek/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (3.0.4)\n",
"Requirement already satisfied: idna<3,>=2.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from requests->kaggle) (2.10)\n",
"Requirement already satisfied: pandas in /home/tomek/anaconda3/lib/python3.8/site-packages (1.1.3)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.15.4 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (1.19.2)\n",
"Requirement already satisfied: pytz>=2017.2 in /home/tomek/anaconda3/lib/python3.8/site-packages (from pandas) (2020.1)\n",
"Requirement already satisfied: six>=1.5 in /home/tomek/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n"
"Requirement already satisfied: pandas in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (1.2.4)\n",
"Requirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2021.1)\n",
"Requirement already satisfied: numpy>=1.16.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (1.20.2)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas) (2.8.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)\n",
"Collecting seaborn\n",
" Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)\n",
"\u001b[K |████████████████████████████████| 292 kB 1.1 MB/s eta 0:00:01\n",
"\u001b[?25hCollecting matplotlib>=2.2\n",
" Downloading matplotlib-3.5.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)\n",
"\u001b[K |████████████████████████████████| 11.2 MB 10.8 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pandas>=0.23 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.2.4)\n",
"Requirement already satisfied: numpy>=1.15 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.20.2)\n",
"Requirement already satisfied: scipy>=1.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from seaborn) (1.6.3)\n",
"Requirement already satisfied: packaging>=20.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (20.9)\n",
"Requirement already satisfied: python-dateutil>=2.7 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)\n",
"Collecting cycler>=0.10\n",
" Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n",
"Requirement already satisfied: pyparsing>=2.2.1 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)\n",
"Collecting fonttools>=4.22.0\n",
" Downloading fonttools-4.30.0-py3-none-any.whl (898 kB)\n",
"\u001b[K |████████████████████████████████| 898 kB 4.9 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pillow>=6.2.0 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)\n",
"Collecting kiwisolver>=1.0.1\n",
" Downloading kiwisolver-1.3.2-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)\n",
"\u001b[K |████████████████████████████████| 1.6 MB 7.7 MB/s eta 0:00:01\n",
"\u001b[?25hRequirement already satisfied: pytz>=2017.3 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)\n",
"Requirement already satisfied: six>=1.5 in /media/tomek/Linux_data/home/tomek/miniconda3/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib>=2.2->seaborn) (1.15.0)\n",
"Installing collected packages: kiwisolver, fonttools, cycler, matplotlib, seaborn\n",
"Successfully installed cycler-0.11.0 fonttools-4.30.0 kiwisolver-1.3.2 matplotlib-3.5.1 seaborn-0.11.2\n"
]
}
],
"source": [
"# Install the required libraries\n",
"!pip install --user kaggle  # Kaggle API, used to download the dataset\n",
"!pip install --user pandas"
"!pip install --user pandas\n",
"!pip install --user seaborn"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/tomek/.kaggle/kaggle.json'\n",
"iris.zip: Skipping, found more recently modified local copy (use --force to force download)\n"
]
}
],
"source": [
"# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.\n",
"# Instructions: https://www.kaggle.com/docs/api\n",
"!kaggle datasets download -d uciml/iris"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Archive: iris.zip\r\n",
" inflating: Iris.csv \r\n",
" inflating: database.sqlite \r\n"
]
}
],
"source": [
"!unzip -o iris.zip"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
@ -337,7 +422,7 @@
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "slide"
@ -508,7 +593,7 @@
"[150 rows x 6 columns]"
]
},
"execution_count": 18,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
@ -1429,7 +1514,7 @@
" - Audio data: normalization of volume, resolution, sampling rate, number of channels\n",
"- Data augmentation\n",
" - Generating new examples by applying noise/transformations to the original data\n",
" - e.g. adding echo to an audio recording\n",
" - e.g. adding echo to an audio recording, adding noise to an image\n",
" - changing feature values by relatively small, random amounts\n",
"- Over/under-sampling\n",
" - Learning algorithms and metrics can be sensitive to imbalanced classes in the dataset\n",
@ -1448,9 +1533,9 @@
"# Assignment [5 pts]\n",
"- Choose one of the publicly available datasets. You will work with it until the end of the year (you can change the dataset along the way, but that means repeating some steps, which are not very costly, but still take effort).\n",
"- The dataset should be:\n",
" - not too large (max 10-20 MB)\n",
" - not too large (max 50 MB)\n",
" - not too small (e.g. IRIS is too small ;))\n",
" - unique (each person in the group works on a different dataset). To coordinate, please record your chosen dataset here: https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O\n",
" - unique (each person in the group works on a different dataset). To coordinate, please record your chosen dataset here: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n",
" - ideally a dataset containing data in text form that can be used for a classification or regression task - such a dataset will be easier to work with than e.g. a set of images or audio recordings. This will let you focus on the substance of this course.\n",
"\n",
"- Write a script that:\n",
@ -1466,7 +1551,7 @@
"- You can write the script in your favorite language. The only constraint: it must run on Linux\n",
"- A Jupyter notebook is a convenient way to do this (though it is not required)\n",
"- On the faculty git server (http://git.wmi.amu.edu.pl/), create a repository named \"ium_nrindeksu\" (nrindeksu being your student ID number) and put your script in it\n",
"- Paste the link to the repository into the dataset spreadsheet (https://uam.sharepoint.com/:x:/s/2021SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EYhZK_aXp41BsIVS4K-L1V4B_vM2FjO5nJZMWv2QKXJolA?e=DKIS2O)\n"
"- Paste the link to the repository into the dataset spreadsheet: https://uam.sharepoint.com/:x:/s/2022SL06-DIUMUI0LABInynieriauczeniamaszynowego-Grupa11/EaID4LM20YhOhxxL-VHxPogBCTq4uXpLHQAzrCeDzPv1dw?e=HfXKqB\n"
]
},
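The preprocessing list earlier mentions augmenting data by changing feature values by small random amounts, and over-sampling to deal with imbalanced classes. A minimal standard-library sketch of both ideas follows. The function names `jitter` and `random_oversample`, the Gaussian noise model, and the default `scale` are illustrative assumptions, not part of the course material.

```python
import random
from collections import defaultdict


def jitter(features, scale=0.05, rng=random):
    """Return a copy of `features` with small Gaussian noise added to each value."""
    return [x + rng.gauss(0.0, scale) for x in features]


def random_oversample(rows, labels, rng=random):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same result
    rows = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [7.0, 3.2]]
    labels = ["setosa", "setosa", "setosa", "versicolor"]
    balanced_rows, balanced_labels = random_oversample(rows, labels, rng)
    print(balanced_labels)
    print(jitter(rows[0], scale=0.05, rng=rng))
```

The two can be combined by jittering the duplicated rows so the added examples are not exact copies. Over-sampling is usually applied to the training split only, so that duplicated examples do not leak into the evaluation data.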
{