{
"cells": [
{
"cell_type": "markdown",
"id": "7fe475ae",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Machine Learning Engineering </h1>\n",
"<h2> 10. <i>DVC</i> [lab]</h2> \n",
"<h3> Tomasz Ziętkiewicz (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"id": "0c6f27a5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"<img src=\"img/expcontrol/dvc-logo.png\">"
]
},
{
"cell_type": "markdown",
"id": "560eec71",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC - Data Version Control\n",
"- [dvc.org](https://dvc.org/)\n",
"- \"Version Control System for Machine Learning Projects\"\n",
"- Open Source\n",
"- Enables:\n",
" - versioning data and models: \"Git for data and models\"\n",
" - building pipelines that define how models are built, trained, and evaluated: \"Makefile for machine learning\"\n",
" - tracking and comparing metrics and parameters\n",
"- tightly integrated with Git\n",
"- works independently of the language, libraries, and operating system in use\n",
"- 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs&t=197s"
]
},
{
"cell_type": "markdown",
"id": "9bfb356e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Installation and initialization\n",
" - https://dvc.org/doc/install\n",
" - ```pip(x) install dvc``` or:\n",
" - ```conda install dvc```"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "054c7a11",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: failed with initial frozen solve. Retrying with flexible solve.\n",
"Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.\n",
"Collecting package metadata (repodata.json): done\n",
"Solving environment: done\n",
"\n",
"## Package Plan ##\n",
"\n",
" environment location: /home/tomek/miniconda3\n",
"\n",
" added / updated specs:\n",
" - dvc\n",
"\n",
"\n",
"The following packages will be downloaded:\n",
"\n",
" package | build\n",
" ---------------------------|-----------------\n",
" atpublic-1.0 | py_0 7 KB conda-forge\n",
" bzip2-1.0.8 | h7f98852_4 484 KB conda-forge\n",
" cached-property-1.5.2 | hd8ed1ab_1 4 KB conda-forge\n",
" cached_property-1.5.2 | pyha770c72_1 11 KB conda-forge\n",
" colorama-0.4.4 | pyh9f0ad1d_0 18 KB conda-forge\n",
" commonmark-0.9.1 | py_0 46 KB conda-forge\n",
" configobj-5.0.6 | py_0 31 KB conda-forge\n",
" dictdiffer-0.8.1 | pyhd8ed1ab_0 16 KB conda-forge\n",
" diskcache-5.2.1 | pyh44b312d_0 36 KB conda-forge\n",
" distro-1.5.0 | pyh9f0ad1d_0 20 KB conda-forge\n",
" dpath-2.0.1 | py39hf3d152e_0 23 KB conda-forge\n",
" dulwich-0.20.23 | py39h3811e60_0 721 KB conda-forge\n",
" dvc-2.1.0 | py39hf3d152e_0 551 KB conda-forge\n",
" flatten-dict-0.3.0 | pyh9f0ad1d_0 11 KB conda-forge\n",
" flufl.lock-3.2 | py_0 19 KB conda-forge\n",
" fsspec-0.9.0 | pyhd8ed1ab_2 75 KB conda-forge\n",
" ftfy-5.5.1 | py_0 47 KB conda-forge\n",
" funcy-1.16 | pyhd8ed1ab_0 30 KB conda-forge\n",
" future-0.18.2 | py39hf3d152e_3 718 KB conda-forge\n",
" grandalf-0.6 | py_0 42 KB conda-forge\n",
" jsonpath-ng-1.5.2 | pyh9f0ad1d_0 26 KB conda-forge\n",
" libgit2-1.1.0 | h0b03e73_0 693 KB conda-forge\n",
" libssh2-1.9.0 | ha56f1ee_6 226 KB conda-forge\n",
" mailchecker-4.0.7 | pyhd8ed1ab_0 206 KB conda-forge\n",
" nanotime-0.5.2 | py_0 6 KB conda-forge\n",
" networkx-2.5 | py_0 1.2 MB conda-forge\n",
" pathlib2-2.3.5 | py39hf3d152e_3 35 KB conda-forge\n",
" pathspec-0.8.1 | pyhd3deb0d_0 29 KB conda-forge\n",
" pcre2-10.35 | h032f7d1_2 693 KB conda-forge\n",
" phonenumbers-8.10.14 | py_0 1.5 MB conda-forge\n",
" ply-3.11 | py_1 44 KB conda-forge\n",
" pyasn1-0.4.8 | py_0 53 KB conda-forge\n",
" pydot-1.2.4 | py_0 20 KB conda-forge\n",
" pygit2-1.5.0 | py39h3811e60_0 213 KB conda-forge\n",
" pygtrie-2.3.2 | pyh8c360ce_0 24 KB conda-forge\n",
" python-benedict-0.24.0 | pyhd8ed1ab_0 30 KB conda-forge\n",
" python-fsutil-0.5.0 | pyhd8ed1ab_0 13 KB conda-forge\n",
" python-slugify-5.0.2 | pyhd8ed1ab_0 12 KB conda-forge\n",
" rich-10.2.2 | py39hf3d152e_0 337 KB conda-forge\n",
" ruamel.yaml-0.17.4 | py39h3811e60_0 160 KB conda-forge\n",
" ruamel.yaml.clib-0.2.2 | py39h3811e60_2 173 KB conda-forge\n",
" shortuuid-1.0.1 | py39hf3d152e_4 15 KB conda-forge\n",
" shtab-1.3.6 | pyhd8ed1ab_0 15 KB conda-forge\n",
" text-unidecode-1.3 | py_0 68 KB conda-forge\n",
" toml-0.10.2 | pyhd8ed1ab_0 18 KB conda-forge\n",
" unidecode-1.2.0 | pyhd8ed1ab_0 155 KB conda-forge\n",
" voluptuous-0.12.1 | pyhd3deb0d_0 28 KB conda-forge\n",
" zc.lockfile-2.0 | py_0 11 KB conda-forge\n",
" ------------------------------------------------------------\n",
" Total: 8.8 MB\n",
"\n",
"The following NEW packages will be INSTALLED:\n",
"\n",
" _openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_gnu\n",
" appdirs conda-forge/noarch::appdirs-1.4.4-pyh9f0ad1d_0\n",
" atpublic conda-forge/noarch::atpublic-1.0-py_0\n",
" bzip2 conda-forge/linux-64::bzip2-1.0.8-h7f98852_4\n",
" cached-property conda-forge/noarch::cached-property-1.5.2-hd8ed1ab_1\n",
" cached_property conda-forge/noarch::cached_property-1.5.2-pyha770c72_1\n",
" colorama conda-forge/noarch::colorama-0.4.4-pyh9f0ad1d_0\n",
" commonmark conda-forge/noarch::commonmark-0.9.1-py_0\n",
" configobj conda-forge/noarch::configobj-5.0.6-py_0\n",
" dictdiffer conda-forge/noarch::dictdiffer-0.8.1-pyhd8ed1ab_0\n",
" diskcache conda-forge/noarch::diskcache-5.2.1-pyh44b312d_0\n",
" distro conda-forge/noarch::distro-1.5.0-pyh9f0ad1d_0\n",
" dpath conda-forge/linux-64::dpath-2.0.1-py39hf3d152e_0\n",
" dulwich conda-forge/linux-64::dulwich-0.20.23-py39h3811e60_0\n",
" dvc conda-forge/linux-64::dvc-2.1.0-py39hf3d152e_0\n",
" flatten-dict conda-forge/noarch::flatten-dict-0.3.0-pyh9f0ad1d_0\n",
" flufl.lock conda-forge/noarch::flufl.lock-3.2-py_0\n",
" fsspec conda-forge/noarch::fsspec-0.9.0-pyhd8ed1ab_2\n",
" ftfy conda-forge/noarch::ftfy-5.5.1-py_0\n",
" funcy conda-forge/noarch::funcy-1.16-pyhd8ed1ab_0\n",
" future conda-forge/linux-64::future-0.18.2-py39hf3d152e_3\n",
" gitdb conda-forge/noarch::gitdb-4.0.7-pyhd8ed1ab_0\n",
" gitpython conda-forge/noarch::gitpython-3.1.17-pyhd8ed1ab_0\n",
" grandalf conda-forge/noarch::grandalf-0.6-py_0\n",
" jsonpath-ng conda-forge/noarch::jsonpath-ng-1.5.2-pyh9f0ad1d_0\n",
" libgit2 conda-forge/linux-64::libgit2-1.1.0-h0b03e73_0\n",
" libgomp conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19\n",
" libssh2 conda-forge/linux-64::libssh2-1.9.0-ha56f1ee_6\n",
" mailchecker conda-forge/noarch::mailchecker-4.0.7-pyhd8ed1ab_0\n",
" nanotime conda-forge/noarch::nanotime-0.5.2-py_0\n",
" networkx conda-forge/noarch::networkx-2.5-py_0\n",
" pathlib2 conda-forge/linux-64::pathlib2-2.3.5-py39hf3d152e_3\n",
" pathspec conda-forge/noarch::pathspec-0.8.1-pyhd3deb0d_0\n",
" pcre2 conda-forge/linux-64::pcre2-10.35-h032f7d1_2\n",
" phonenumbers conda-forge/noarch::phonenumbers-8.10.14-py_0\n",
" pip conda-forge/noarch::pip-21.1.2-pyhd8ed1ab_0\n",
" ply conda-forge/noarch::ply-3.11-py_1\n",
" pyasn1 conda-forge/noarch::pyasn1-0.4.8-py_0\n",
" pydot conda-forge/noarch::pydot-1.2.4-py_0\n",
" pygit2 conda-forge/linux-64::pygit2-1.5.0-py39h3811e60_0\n",
" pygtrie conda-forge/noarch::pygtrie-2.3.2-pyh8c360ce_0\n",
" python-benedict conda-forge/noarch::python-benedict-0.24.0-pyhd8ed1ab_0\n",
" python-fsutil conda-forge/noarch::python-fsutil-0.5.0-pyhd8ed1ab_0\n",
" python-slugify conda-forge/noarch::python-slugify-5.0.2-pyhd8ed1ab_0\n",
" rich conda-forge/linux-64::rich-10.2.2-py39hf3d152e_0\n",
" ruamel.yaml conda-forge/linux-64::ruamel.yaml-0.17.4-py39h3811e60_0\n",
" ruamel.yaml.clib conda-forge/linux-64::ruamel.yaml.clib-0.2.2-py39h3811e60_2\n",
" shortuuid conda-forge/linux-64::shortuuid-1.0.1-py39hf3d152e_4\n",
" shtab conda-forge/noarch::shtab-1.3.6-pyhd8ed1ab_0\n",
" smmap conda-forge/noarch::smmap-3.0.5-pyh44b312d_0\n",
" tabulate conda-forge/noarch::tabulate-0.8.9-pyhd8ed1ab_0\n",
" text-unidecode conda-forge/noarch::text-unidecode-1.3-py_0\n",
" toml conda-forge/noarch::toml-0.10.2-pyhd8ed1ab_0\n",
" typing_extensions conda-forge/noarch::typing_extensions-3.7.4.3-py_0\n",
" unidecode conda-forge/noarch::unidecode-1.2.0-pyhd8ed1ab_0\n",
" voluptuous conda-forge/noarch::voluptuous-0.12.1-pyhd3deb0d_0\n",
" wheel conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0\n",
" zc.lockfile conda-forge/noarch::zc.lockfile-2.0-py_0\n",
"\n",
"The following packages will be UPDATED:\n",
"\n",
" certifi pkgs/main::certifi-2020.12.5-py39h06a~ --> conda-forge::certifi-2020.12.5-py39hf3d152e_1\n",
" libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h2828fa1_19\n",
"\n",
"The following packages will be SUPERSEDED by a higher-priority channel:\n",
"\n",
" _libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge\n",
" ca-certificates pkgs/main::ca-certificates-2021.4.13-~ --> conda-forge::ca-certificates-2020.12.5-ha878542_0\n",
" conda pkgs/main::conda-4.10.1-py39h06a4308_1 --> conda-forge::conda-4.10.1-py39hf3d152e_0\n",
" openssl pkgs/main::openssl-1.1.1k-h27cfd23_0 --> conda-forge::openssl-1.1.1k-h7f98852_0\n",
"\n",
"\n",
"\n",
"Downloading and Extracting Packages\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"diskcache-5.2.1 | 36 KB | ##################################### | 100% \n",
"pathspec-0.8.1 | 29 KB | ##################################### | 100% \n",
"cached-property-1.5. | 4 KB | ##################################### | 100% \n",
"networkx-2.5 | 1.2 MB | ##################################### | 100% \n",
"commonmark-0.9.1 | 46 KB | ##################################### | 100% \n",
"configobj-5.0.6 | 31 KB | ##################################### | 100% \n",
"python-fsutil-0.5.0 | 13 KB | ##################################### | 100% \n",
"fsspec-0.9.0 | 75 KB | ##################################### | 100% \n",
"dulwich-0.20.23 | 721 KB | ##################################### | 100% \n",
"funcy-1.16 | 30 KB | ##################################### | 100% \n",
"bzip2-1.0.8 | 484 KB | ##################################### | 100% \n",
"ply-3.11 | 44 KB | ##################################### | 100% \n",
"libgit2-1.1.0 | 693 KB | ##################################### | 100% \n",
"ftfy-5.5.1 | 47 KB | ##################################### | 100% \n",
"nanotime-0.5.2 | 6 KB | ##################################### | 100% \n",
"pyasn1-0.4.8 | 53 KB | ##################################### | 100% \n",
"unidecode-1.2.0 | 155 KB | ##################################### | 100% \n",
"dvc-2.1.0 | 551 KB | ##################################### | 100% \n",
"pydot-1.2.4 | 20 KB | ##################################### | 100% \n",
"zc.lockfile-2.0 | 11 KB | ##################################### | 100% \n",
"dpath-2.0.1 | 23 KB | ##################################### | 100% \n",
"pcre2-10.35 | 693 KB | ##################################### | 100% \n",
"ruamel.yaml-0.17.4 | 160 KB | ##################################### | 100% \n",
"flatten-dict-0.3.0 | 11 KB | ##################################### | 100% \n",
"python-slugify-5.0.2 | 12 KB | ##################################### | 100% \n",
"shortuuid-1.0.1 | 15 KB | ##################################### | 100% \n",
"text-unidecode-1.3 | 68 KB | ##################################### | 100% \n",
"cached_property-1.5. | 11 KB | ##################################### | 100% \n",
"colorama-0.4.4 | 18 KB | ##################################### | 100% \n",
"flufl.lock-3.2 | 19 KB | ##################################### | 100% \n",
"libssh2-1.9.0 | 226 KB | ##################################### | 100% \n",
"python-benedict-0.24 | 30 KB | ##################################### | 100% \n",
"distro-1.5.0 | 20 KB | ##################################### | 100% \n",
"grandalf-0.6 | 42 KB | ##################################### | 100% \n",
"future-0.18.2 | 718 KB | ##################################### | 100% \n",
"ruamel.yaml.clib-0.2 | 173 KB | ##################################### | 100% \n",
"rich-10.2.2 | 337 KB | ##################################### | 100% \n",
"shtab-1.3.6 | 15 KB | ##################################### | 100% \n",
"pygtrie-2.3.2 | 24 KB | ##################################### | 100% \n",
"mailchecker-4.0.7 | 206 KB | ##################################### | 100% \n",
"voluptuous-0.12.1 | 28 KB | ##################################### | 100% \n",
"atpublic-1.0 | 7 KB | ##################################### | 100% \n",
"phonenumbers-8.10.14 | 1.5 MB | ##################################### | 100% \n",
"pathlib2-2.3.5 | 35 KB | ##################################### | 100% \n",
"pygit2-1.5.0 | 213 KB | ##################################### | 100% \n",
"dictdiffer-0.8.1 | 16 KB | ##################################### | 100% \n",
"toml-0.10.2 | 18 KB | ##################################### | 100% \n",
"jsonpath-ng-1.5.2 | 26 KB | ##################################### | 100% \n",
"Preparing transaction: done\n",
"Verifying transaction: done\n",
"Executing transaction: done\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda install dvc"
]
},
{
"cell_type": "markdown",
"id": "20975d62",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's create a directory in which to keep our project:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "aae59ec2",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!mkdir -p IUM_10/sample-ml-project"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "1e522a93",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project\n"
]
}
],
"source": [
"#Jupyter notebook magic https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd\n",
"%cd \"IUM_10/sample-ml-project\""
]
},
{
"cell_type": "markdown",
"id": "199c0d92",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We initialize an empty Git repository (this step can be skipped when working in an existing repository):"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "c13c525b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Initialized empty Git repository in /home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project/.git/\r\n"
]
}
],
"source": [
"!git init"
]
},
{
"cell_type": "markdown",
"id": "c7155369",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we initialize the DVC repository:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "44f28226",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Initialized DVC repository.\n",
"\n",
"You can now commit the changes to git.\n",
"\n",
"\u001b[31m+---------------------------------------------------------------------+\n",
"\u001b[0m\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n",
"\u001b[31m|\u001b[0m DVC has enabled anonymous aggregate usage analytics. \u001b[31m|\u001b[0m\n",
"\u001b[31m|\u001b[0m Read the analytics documentation (and how to opt-out) here: \u001b[31m|\u001b[0m\n",
"\u001b[31m|\u001b[0m <\u001b[36mhttps://dvc.org/doc/user-guide/analytics\u001b[39m> \u001b[31m|\u001b[0m\n",
"\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n",
"\u001b[31m+---------------------------------------------------------------------+\n",
"\u001b[0m\n",
"\u001b[33mWhat's next?\u001b[39m\n",
"\u001b[33m------------\u001b[39m\n",
"- Check out the documentation: <\u001b[36mhttps://dvc.org/doc\u001b[39m>\n",
"- Get help and share ideas: <\u001b[36mhttps://dvc.org/chat\u001b[39m>\n",
"- Star us on GitHub: <\u001b[36mhttps://github.com/iterative/dvc\u001b[39m>\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc init"
]
},
{
"cell_type": "markdown",
"id": "00bc72ed",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's see which files DVC added (they are also staged in the Git repository).\n",
"They are described here: https://dvc.org/doc/user-guide/project-structure/internal-files"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d1aefe16",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"On branch master\r\n",
"\r\n",
"No commits yet\r\n",
"\r\n",
"Changes to be committed:\r\n",
" (use \"git rm --cached <file>...\" to unstage)\r\n",
"\t\u001b[32mnew file: .dvc/.gitignore\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/config\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/confusion.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/confusion_normalized.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/default.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/linear.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/scatter.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvc/plots/smooth.json\u001b[m\r\n",
"\t\u001b[32mnew file: .dvcignore\u001b[m\r\n",
"\r\n"
]
}
],
"source": [
"!git status"
]
},
{
"cell_type": "markdown",
"id": "72e0a272",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can now commit the changes to Git:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "59780e99",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"On branch master\r\n",
"nothing to commit, working tree clean\r\n"
]
}
],
"source": [
"!git commit -m \"Initial commit\""
]
},
{
"cell_type": "markdown",
"id": "dd8e529b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tracking files with DVC\n",
" - large files, such as input data files or model files, are hard to manage with Git because of problems with:\n",
" - performance\n",
" - repository storage space\n",
" - Git has an extension, [LFS (Large File Storage)](https://git-lfs.github.com/), which offers a partial solution. The files themselves are stored on a dedicated remote server, while the repository holds only pointers to them and some metadata\n",
" - DVC takes a similar approach, but:\n",
" - files can be stored on almost any server, or locally\n",
" - there is no file size limit (with Git LFS the limit is most often 2 GB)\n",
" - DVC provides additional tooling for tracking files and their links to experiment results\n",
" - for more, see [here](https://dvc.org/doc/user-guide/related-technologies)"
]
},
{
"cell_type": "markdown",
"id": "a8861abe",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's prepare some sample data by downloading it from Kaggle:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "f05ece1b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project\n",
" 0%| | 0.00/3.60k [00:00<?, ?B/s]\n",
"100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 2.63MB/s]\n",
"Archive: iris.zip\n",
" inflating: Iris.csv \n",
" inflating: database.sqlite \n"
]
}
],
"source": [
"!kaggle datasets download -d uciml/iris\n",
"!unzip -o iris.zip\n",
"!rm database.sqlite iris.zip\n",
"!mkdir -p data\n",
"!mv Iris.csv data/"
]
},
{
"cell_type": "markdown",
"id": "adb9a522",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we add the data file(s) to DVC:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "74d182c7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Adding... \n",
"!\u001b[A\n",
" 0%| |.E8dZEGBYoRayYsJLdesNS4.tmp 0.00/5.11k [00:00<?, ?it/s]\u001b[A\n",
"100% Add|██████████████████████████████████████████████|1/1 [00:04, 4.71s/file]\u001b[A\n",
"\n",
"To track the changes with git, run:\n",
"\n",
"\tgit add data/Iris.csv.dvc data/.gitignore\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "markdown",
"id": "72c6b5d0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" - DVC created the file `data/Iris.csv.dvc` and added the original file to `.gitignore`\n",
" - Only the `*.dvc` file, which contains a reference to the actual file, will be present in the repository"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "74d54652",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"On branch master\r\n",
"Untracked files:\r\n",
" (use \"git add <file>...\" to include in what will be committed)\r\n",
"\t\u001b[31mdata/.gitignore\u001b[m\r\n",
"\t\u001b[31mdata/Iris.csv.dvc\u001b[m\r\n",
"\t\u001b[31miris.zip\u001b[m\r\n",
"\r\n",
"nothing added to commit but untracked files present (use \"git add\" to track)\r\n"
]
}
],
"source": [
"!git status -u"
]
},
{
"cell_type": "markdown",
"id": "8589fecf",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's add the files `data/Iris.csv.dvc` and `data/.gitignore` to the Git repository, as DVC suggests:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "460c4a17",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!git add data/Iris.csv.dvc data/.gitignore"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "80644077",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[master cc0821a] Dodano dane IRIS (DVC)\r\n",
" 2 files changed, 5 insertions(+)\r\n",
" create mode 100644 data/.gitignore\r\n",
" create mode 100644 data/Iris.csv.dvc\r\n"
]
}
],
"source": [
"!git commit -m \"Dodano dane IRIS (DVC)\""
]
},
{
"cell_type": "markdown",
"id": "03899863",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The `*.dvc` file contains, among other things, the file's hash. More about `*.dvc` files: [link](https://dvc.org/doc/user-guide/project-structure/dvc-files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cb2ba7c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# %load data/Iris.csv.dvc\n",
"outs:\n",
"- md5: 717820ef0af287ff346c5cabfb4c612c\n",
" size: 5107\n",
" path: Iris.csv\n"
]
},
{
"cell_type": "markdown",
"id": "0b421d45",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The original `Iris.csv` file was moved to the `.dvc/cache` directory, under a path derived from the file's hash (the first two characters of the hash name a subdirectory and the rest name the file), and linked back to its original location. How the links are created can [vary depending on the file system](https://dvc.org/doc/user-guide/large-dataset-optimization)."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "1d471f3a",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 8\r\n",
"-r--r--r-- 1 tomek tomek 5107 wrz 19 2019 7820ef0af287ff346c5cabfb4c612c\r\n"
]
}
],
"source": [
"!ls -l .dvc/cache/71"
]
},
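{
"cell_type": "markdown",
"id": "sketch-cache-path",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The path listed above can also be derived from the hash stored in the `*.dvc` file. A minimal sketch, assuming `awk` and `cut` are available and that the cache uses the `.dvc/cache/<first 2 hash characters>/<rest>` layout seen in this notebook (DVC 2.1.0); other DVC versions may arrange the cache differently:\n",
"\n",
"```shell\n",
"# Read the hash recorded by DVC (the md5 field of data/Iris.csv.dvc)\n",
"hash=$(awk '/md5:/ {print $NF}' data/Iris.csv.dvc)\n",
"\n",
"# Split it into the cache subdirectory (first 2 characters) and the file name (the rest)\n",
"dir=$(printf '%s' \"$hash\" | cut -c1-2)\n",
"file=$(printf '%s' \"$hash\" | cut -c3-)\n",
"\n",
"# Show the cached copy of the data file\n",
"ls -l \".dvc/cache/$dir/$file\"\n",
"```"
]
},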
{
"cell_type": "markdown",
"id": "901e8e90",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc remote\n",
" - to send the actual files tracked by DVC to a remote location (from which they can be fetched, for example, by a CI system or by other users), we first need to configure such a location\n",
" - the [`dvc remote add`](https://dvc.org/doc/command-reference/remote/add) command does this\n",
" - we will use a local \"remote\". Here it is simply a previously created directory, `/dvcstore`. The same directory also exists on our Jenkins server; it must of course be mounted in Docker\n",
" - in real-world use we would instead give the path to a resource reachable over the internet, such as an SFTP server or an AWS S3 path"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "731f6ea4",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setting 'my_local_remote' as a default remote.\r\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc remote add -d my_local_remote /dvcstore"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "9c3deeaf",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"On branch master\r\n",
"Changes not staged for commit:\r\n",
" (use \"git add <file>...\" to update what will be committed)\r\n",
" (use \"git restore <file>...\" to discard changes in working directory)\r\n",
"\t\u001b[31mmodified: .dvc/config\u001b[m\r\n",
"\r\n",
"no changes added to commit (use \"git add\" and/or \"git commit -a\")\r\n"
]
}
],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "899eac7d",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[master 3ff62b6] Added DVC remote\r\n",
" 1 file changed, 4 insertions(+)\r\n"
]
}
],
"source": [
"!git add .dvc/config\n",
"!git commit -m \"Added DVC remote\""
]
},
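{
"cell_type": "markdown",
"id": "sketch-network-remote",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"For a remote that is reachable over the network, the same command takes a URL in place of the local path. A sketch with placeholder names: the remote names, the bucket `my-bucket`, the host `example.com`, the user, and the paths are all hypothetical, and each storage type needs the matching DVC extra installed (for example `pip install 'dvc[s3]'` or `pip install 'dvc[ssh]'`):\n",
"\n",
"```shell\n",
"# An Amazon S3 bucket as the default remote (bucket name and prefix are placeholders)\n",
"dvc remote add -d my_s3_remote s3://my-bucket/dvcstore\n",
"\n",
"# Alternatively, a directory on an SSH/SFTP server (user, host, and path are placeholders)\n",
"dvc remote add -d my_ssh_remote ssh://user@example.com/srv/dvcstore\n",
"```\n",
"\n",
"Either way the change lands in `.dvc/config`, so it is committed to Git just like the local remote above."
]
},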
{
"cell_type": "markdown",
"id": "8c556c96",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc push\n",
"Once a \"remote\" is configured, we can push files to it with the `dvc push` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7f24f75",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc push"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "8a355575",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34;42m/dvcstore\u001b[00m\r\n",
"└── \u001b[01;34m71\u001b[00m\r\n",
" └── 7820ef0af287ff346c5cabfb4c612c\r\n",
"\r\n",
"1 directory, 1 file\r\n"
]
}
],
"source": [
"!tree /dvcstore"
]
},
{
"cell_type": "markdown",
"id": "af59ecb3",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc pull\n",
"To fetch data from DVC (e.g. in another location, or as another user), we need to:\n",
" - clone the Git repository (to get, among other things, the `*.dvc` files)\n",
" - run `dvc pull`"
]
},
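{
"cell_type": "markdown",
"id": "sketch-clone-and-pull",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The two steps can be sketched as follows. The repository URL is a placeholder, and the sketch assumes that the DVC remote configured in `.dvc/config` is reachable from the new location (for the local `/dvcstore` remote used here, that means the same machine or a mounted directory):\n",
"\n",
"```shell\n",
"# 1. Clone the Git repository; this brings the *.dvc files and .dvc/config, but not the data\n",
"git clone https://example.com/user/sample-ml-project.git\n",
"cd sample-ml-project\n",
"\n",
"# 2. Download the files referenced by the *.dvc files from the default remote\n",
"dvc pull\n",
"```"
]
},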
{
"cell_type": "markdown",
"id": "9fa914a7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Adding new files and modifying existing ones works much as it does for ordinary files tracked by Git, except that we use the `dvc` command instead of `git` and also remember to manage the `*.dvc` files with Git:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "dde39796",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!head -n -1 data/Iris.csv | sponge data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "7f14ec60",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"On branch master\r\n",
"nothing to commit, working tree clean\r\n"
]
}
],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "8a841039",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"data/Iris.csv.dvc: core\u001b[39m>\n",
"\tchanged outs:\n",
"\t\tmodified: data/Iris.csv\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc status"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "bf6c1067",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Adding... \n",
"!\u001b[A\n",
" 0%| |.TatTHknArFHCT9iDCtxHzh.tmp 0.00/5.07k [00:00<?, ?it/s]\u001b[A\n",
"100% Add|██████████████████████████████████████████████|1/1 [00:00, 2.68file/s]\u001b[A\n",
"\n",
"To track the changes with git, run:\n",
"\n",
"\tgit add data/Iris.csv.dvc\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "4a4865c9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[master e38c244] Removed last line from Iris dataset\r\n",
" 1 file changed, 2 insertions(+), 2 deletions(-)\r\n"
]
}
],
"source": [
"!git add data/Iris.csv.dvc\n",
"!git commit -m \"Removed last line from Iris dataset\"\n"
]
},
{
"cell_type": "markdown",
"id": "d710977c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### dvc checkout\n",
" - We use `dvc checkout` together with `git checkout` to switch the branch we are working on.\n",
" - DVC replaces the versions of the files it tracks with those from the other branch (provided the files, and therefore the `*.dvc` files, differ between the branches)\n",
" - switching branches with Git changes the `*.dvc` files (if they differ), and `dvc checkout` then copies or links the files from `.dvc/cache` whose hashes match those in the `*.dvc` files"
]
},
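{
"cell_type": "markdown",
"id": "sketch-branch-checkout",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A sketch of switching branches, assuming a hypothetical branch named `experiment` whose `data/Iris.csv.dvc` differs from the one on `master`:\n",
"\n",
"```shell\n",
"# Switch the Git branch; this updates the *.dvc files in the working tree\n",
"git checkout experiment\n",
"\n",
"# Bring the DVC-tracked files in line with the hashes in the *.dvc files now checked out\n",
"dvc checkout\n",
"```\n",
"\n",
"If the required version of a file is not in the local cache, `dvc checkout` has nothing to restore it from; running `dvc pull` fetches it from the remote first."
]
},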
{
"cell_type": "markdown",
"id": "5897e8eb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Sharing data between projects\n",
" - with the `dvc import` and `dvc update` commands we can add, and later update, files tracked by DVC in another repository"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "9b018146",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'\n",
" 0% Downloading| |0/1 [00:00<?, ?file/s]\n",
"!\u001b[A\n",
" 0%| |get-started/data.xml 0.00/37.9M [00:00<?, ?it/s]\u001b[A\n",
" 0%| |get-started/data.xml 64.0k/36.1M [00:00<02:12, 286kB/s]\u001b[A\n",
" 0%| |get-started/data.xml 128k/36.1M [00:00<01:33, 403kB/s]\u001b[A\n",
" 1%| |get-started/data.xml 256k/36.1M [00:00<00:57, 658kB/s]\u001b[A\n",
" 1%| |get-started/data.xml 384k/36.1M [00:00<00:45, 818kB/s]\u001b[A\n",
" 1%|▏ |get-started/data.xml 512k/36.1M [00:00<00:53, 693kB/s]\u001b[A\n",
" 2%|▏ |get-started/data.xml 640k/36.1M [00:01<00:57, 644kB/s]\u001b[A\n",
" 2%|▏ |get-started/data.xml 768k/36.1M [00:01<00:59, 619kB/s]\u001b[A\n",
" 2%|▏ |get-started/data.xml 896k/36.1M [00:01<00:51, 718kB/s]\u001b[A\n",
" 3%|▎ |get-started/data.xml 1.00M/36.1M [00:01<00:55, 666kB/s]\u001b[A\n",
" 3%|▎ |get-started/data.xml 1.12M/36.1M [00:01<00:57, 633kB/s]\u001b[A\n",
" 3%|▎ |get-started/data.xml 1.25M/36.1M [00:02<00:57, 638kB/s]\u001b[A\n",
" 4%|▍ |get-started/data.xml 1.38M/36.1M [00:02<00:52, 698kB/s]\u001b[A\n",
" 4%|▍ |get-started/data.xml 1.50M/36.1M [00:02<00:55, 656kB/s]\u001b[A\n",
" 4%|▍ |get-started/data.xml 1.62M/36.1M [00:02<00:57, 628kB/s]\u001b[A\n",
" 5%|▍ |get-started/data.xml 1.69M/36.1M [00:02<00:58, 618kB/s]\u001b[A\n",
" 5%|▌ |get-started/data.xml 1.81M/36.1M [00:02<00:53, 675kB/s]\u001b[A\n",
" 5%|▌ |get-started/data.xml 1.94M/36.1M [00:03<00:53, 672kB/s]\u001b[A\n",
" 6%|▌ |get-started/data.xml 2.06M/36.1M [00:03<00:55, 642kB/s]\u001b[A\n",
" 6%|▌ |get-started/data.xml 2.12M/36.1M [00:03<00:56, 628kB/s]\u001b[A\n",
" 6%|▌ |get-started/data.xml 2.19M/36.1M [00:03<00:57, 616kB/s]\u001b[A\n",
" 6%|▌ |get-started/data.xml 2.25M/36.1M [00:03<00:58, 606kB/s]\u001b[A\n",
" 7%|▋ |get-started/data.xml 2.38M/36.1M [00:03<00:48, 732kB/s]\u001b[A\n",
" 7%|▋ |get-started/data.xml 2.50M/36.1M [00:04<00:52, 666kB/s]\u001b[A\n",
" 7%|▋ |get-started/data.xml 2.62M/36.1M [00:04<00:55, 636kB/s]\u001b[A\n",
" 8%|▊ |get-started/data.xml 2.75M/36.1M [00:04<00:56, 614kB/s]\u001b[A\n",
" 8%|▊ |get-started/data.xml 2.88M/36.1M [00:04<00:49, 711kB/s]\u001b[A\n",
" 8%|▊ |get-started/data.xml 3.00M/36.1M [00:04<00:52, 663kB/s]\u001b[A\n",
" 9%|▊ |get-started/data.xml 3.12M/36.1M [00:05<00:54, 637kB/s]\u001b[A\n",
" 9%|▉ |get-started/data.xml 3.25M/36.1M [00:05<00:55, 623kB/s]\u001b[A\n",
" 9%|▉ |get-started/data.xml 3.38M/36.1M [00:05<00:48, 710kB/s]\u001b[A\n",
" 10%|▉ |get-started/data.xml 3.50M/36.1M [00:05<00:51, 664kB/s]\u001b[A\n",
" 10%|█ |get-started/data.xml 3.62M/36.1M [00:05<00:45, 751kB/s]\u001b[A\n",
" 10%|█ |get-started/data.xml 3.75M/36.1M [00:05<00:49, 691kB/s]\u001b[A\n",
" 11%|█ |get-started/data.xml 3.88M/36.1M [00:06<00:43, 777kB/s]\u001b[A\n",
" 11%|█ |get-started/data.xml 4.00M/36.1M [00:06<00:47, 705kB/s]\u001b[A\n",
" 11%|█▏ |get-started/data.xml 4.12M/36.1M [00:06<00:42, 790kB/s]\u001b[A\n",
" 12%|█▏ |get-started/data.xml 4.25M/36.1M [00:06<00:46, 716kB/s]\u001b[A\n",
" 12%|█▏ |get-started/data.xml 4.38M/36.1M [00:06<00:44, 749kB/s]\u001b[A\n",
" 12%|█▏ |get-started/data.xml 4.50M/36.1M [00:07<00:45, 734kB/s]\u001b[A\n",
" 13%|█▎ |get-started/data.xml 4.62M/36.1M [00:07<00:40, 810kB/s]\u001b[A\n",
" 13%|█▎ |get-started/data.xml 4.75M/36.1M [00:07<00:42, 773kB/s]\u001b[A\n",
" 13%|█▎ |get-started/data.xml 4.88M/36.1M [00:07<00:41, 795kB/s]\u001b[A\n",
" 14%|█▍ |get-started/data.xml 5.00M/36.1M [00:07<00:37, 870kB/s]\u001b[A\n",
" 14%|█▍ |get-started/data.xml 5.12M/36.1M [00:07<00:34, 932kB/s]\u001b[A\n",
" 15%|█▍ |get-started/data.xml 5.25M/36.1M [00:07<00:35, 916kB/s]\u001b[A\n",
" 15%|█▍ |get-started/data.xml 5.38M/36.1M [00:08<00:35, 898kB/s]\u001b[A\n",
" 15%|█▌ |get-started/data.xml 5.50M/36.1M [00:08<00:33, 962kB/s]\u001b[A\n",
" 16%|█▌ |get-started/data.xml 5.62M/36.1M [00:08<00:33, 949kB/s]\u001b[A\n",
" 16%|█▌ |get-started/data.xml 5.75M/36.1M [00:08<00:31, 1.00MB/s]\u001b[A\n",
" 16%|█▋ |get-started/data.xml 5.88M/36.1M [00:08<00:30, 1.04MB/s]\u001b[A\n",
" 17%|█▋ |get-started/data.xml 6.06M/36.1M [00:08<00:26, 1.19MB/s]\u001b[A\n",
" 17%|█▋ |get-started/data.xml 6.19M/36.1M [00:08<00:26, 1.19MB/s]\u001b[A\n",
" 17%|█▋ |get-started/data.xml 6.31M/36.1M [00:08<00:26, 1.19MB/s]\u001b[A\n",
" 18%|█▊ |get-started/data.xml 6.50M/36.1M [00:08<00:23, 1.31MB/s]\u001b[A\n",
" 18%|█▊ |get-started/data.xml 6.62M/36.1M [00:09<00:23, 1.30MB/s]\u001b[A\n",
" 19%|█▉ |get-started/data.xml 6.81M/36.1M [00:09<00:21, 1.41MB/s]\u001b[A\n",
" 19%|█▉ |get-started/data.xml 7.00M/36.1M [00:09<00:20, 1.48MB/s]\u001b[A\n",
" 20%|█▉ |get-started/data.xml 7.19M/36.1M [00:09<00:19, 1.54MB/s]\u001b[A\n",
" 20%|██ |get-started/data.xml 7.38M/36.1M [00:09<00:18, 1.60MB/s]\u001b[A\n",
" 21%|██ |get-started/data.xml 7.56M/36.1M [00:09<00:18, 1.62MB/s]\u001b[A\n",
" 21%|██▏ |get-started/data.xml 7.75M/36.1M [00:09<00:17, 1.68MB/s]\u001b[A\n",
" 22%|██▏ |get-started/data.xml 7.94M/36.1M [00:09<00:17, 1.70MB/s]\u001b[A\n",
" 22%|██▏ |get-started/data.xml 8.12M/36.1M [00:10<00:17, 1.72MB/s]\u001b[A\n",
" 23%|██▎ |get-started/data.xml 8.38M/36.1M [00:10<00:15, 1.88MB/s]\u001b[A\n",
" 24%|██▎ |get-started/data.xml 8.56M/36.1M [00:10<00:15, 1.84MB/s]\u001b[A\n",
" 24%|██▍ |get-started/data.xml 8.81M/36.1M [00:10<00:14, 1.96MB/s]\u001b[A\n",
" 25%|██▌ |get-started/data.xml 9.06M/36.1M [00:10<00:13, 2.06MB/s]\u001b[A\n",
" 26%|██▌ |get-started/data.xml 9.31M/36.1M [00:10<00:13, 2.14MB/s]\u001b[A\n",
" 27%|██▋ |get-started/data.xml 9.62M/36.1M [00:10<00:11, 2.32MB/s]\u001b[A\n",
" 27%|██▋ |get-started/data.xml 9.88M/36.1M [00:10<00:11, 2.33MB/s]\u001b[A\n",
" 28%|██▊ |get-started/data.xml 10.2M/36.1M [00:10<00:11, 2.46MB/s]\u001b[A\n",
" 29%|██▉ |get-started/data.xml 10.4M/36.1M [00:11<00:10, 2.45MB/s]\u001b[A\n",
" 30%|██▉ |get-started/data.xml 10.8M/36.1M [00:11<00:10, 2.57MB/s]\u001b[A\n",
" 31%|███ |get-started/data.xml 11.1M/36.1M [00:11<00:09, 2.67MB/s]\u001b[A\n",
" 32%|███▏ |get-started/data.xml 11.4M/36.1M [00:11<00:09, 2.84MB/s]\u001b[A\n",
" 33%|███▎ |get-started/data.xml 11.8M/36.1M [00:11<00:08, 2.85MB/s]\u001b[A\n",
" 34%|███▎ |get-started/data.xml 12.1M/36.1M [00:11<00:08, 3.01MB/s]\u001b[A\n",
" 35%|███▍ |get-started/data.xml 12.5M/36.1M [00:11<00:07, 3.12MB/s]\u001b[A\n",
" 36%|███▌ |get-started/data.xml 12.9M/36.1M [00:11<00:07, 3.22MB/s]\u001b[A\n",
" 37%|███▋ |get-started/data.xml 13.2M/36.1M [00:11<00:07, 3.31MB/s]\u001b[A\n",
" 38%|███▊ |get-started/data.xml 13.7M/36.1M [00:12<00:06, 3.49MB/s]\u001b[A\n",
" 39%|███▉ |get-started/data.xml 14.1M/36.1M [00:12<00:06, 3.62MB/s]\u001b[A\n",
" 40%|████ |get-started/data.xml 14.6M/36.1M [00:12<00:06, 3.74MB/s]\u001b[A\n",
" 42%|████▏ |get-started/data.xml 15.0M/36.1M [00:12<00:05, 3.82MB/s]\u001b[A\n",
" 43%|████▎ |get-started/data.xml 15.4M/36.1M [00:12<00:05, 3.97MB/s]\u001b[A\n",
" 44%|████▍ |get-started/data.xml 15.9M/36.1M [00:12<00:05, 4.08MB/s]\u001b[A\n",
" 45%|████▌ |get-started/data.xml 16.4M/36.1M [00:12<00:04, 4.23MB/s]\u001b[A\n",
" 47%|████▋ |get-started/data.xml 17.0M/36.1M [00:12<00:04, 4.44MB/s]\u001b[A\n",
" 48%|████▊ |get-started/data.xml 17.5M/36.1M [00:12<00:04, 4.52MB/s]\u001b[A\n",
" 50%|████▉ |get-started/data.xml 18.1M/36.1M [00:13<00:04, 4.69MB/s]\u001b[A\n",
" 52%|█████▏ |get-started/data.xml 18.6M/36.1M [00:13<00:03, 4.84MB/s]\u001b[A\n",
" 53%|█████▎ |get-started/data.xml 19.2M/36.1M [00:13<00:03, 5.05MB/s]\u001b[A\n",
" 55%|█████▍ |get-started/data.xml 19.8M/36.1M [00:13<00:03, 5.16MB/s]\u001b[A\n",
" 57%|█████▋ |get-started/data.xml 20.4M/36.1M [00:13<00:03, 5.35MB/s]\u001b[A\n",
" 58%|█████▊ |get-started/data.xml 21.1M/36.1M [00:13<00:02, 5.49MB/s]\u001b[A\n",
" 60%|██████ |get-started/data.xml 21.8M/36.1M [00:13<00:02, 5.66MB/s]\u001b[A\n",
" 62%|██████▏ |get-started/data.xml 22.4M/36.1M [00:13<00:02, 5.83MB/s]\u001b[A\n",
" 64%|██████▍ |get-started/data.xml 23.2M/36.1M [00:14<00:02, 6.05MB/s]\u001b[A\n",
" 66%|██████▌ |get-started/data.xml 23.9M/36.1M [00:14<00:02, 6.20MB/s]\u001b[A\n",
" 68%|██████▊ |get-started/data.xml 24.6M/36.1M [00:14<00:01, 6.40MB/s]\u001b[A\n",
" 70%|███████ |get-started/data.xml 25.4M/36.1M [00:14<00:01, 6.51MB/s]\u001b[A\n",
" 72%|███████▏ |get-started/data.xml 26.0M/36.1M [00:14<00:01, 5.75MB/s]\u001b[A\n",
" 74%|███████▎ |get-started/data.xml 26.6M/36.1M [00:14<00:02, 4.26MB/s]\u001b[A\n",
" 75%|███████▍ |get-started/data.xml 27.1M/36.1M [00:14<00:02, 3.53MB/s]\u001b[A\n",
" 76%|███████▌ |get-started/data.xml 27.5M/36.1M [00:15<00:02, 3.26MB/s]\u001b[A\n",
" 77%|███████▋ |get-started/data.xml 27.9M/36.1M [00:15<00:02, 3.00MB/s]\u001b[A\n",
" 78%|███████▊ |get-started/data.xml 28.2M/36.1M [00:15<00:02, 2.95MB/s]\u001b[A\n",
" 79%|███████▉ |get-started/data.xml 28.5M/36.1M [00:15<00:02, 2.91MB/s]\u001b[A\n",
" 80%|███████▉ |get-started/data.xml 28.8M/36.1M [00:15<00:02, 2.88MB/s]\u001b[A\n",
" 81%|████████ |get-started/data.xml 29.1M/36.1M [00:15<00:02, 2.86MB/s]\u001b[A\n",
" 81%|████████▏ |get-started/data.xml 29.4M/36.1M [00:15<00:02, 2.84MB/s]\u001b[A\n",
" 82%|████████▏ |get-started/data.xml 29.8M/36.1M [00:16<00:02, 2.83MB/s]\u001b[A\n",
" 83%|████████▎ |get-started/data.xml 30.1M/36.1M [00:16<00:02, 2.83MB/s]\u001b[A\n",
" 84%|████████▍ |get-started/data.xml 30.4M/36.1M [00:16<00:02, 2.83MB/s]\u001b[A\n",
" 85%|████████▍ |get-started/data.xml 30.7M/36.1M [00:16<00:02, 2.83MB/s]\u001b[A\n",
" 86%|████████▌ |get-started/data.xml 31.0M/36.1M [00:16<00:01, 2.83MB/s]\u001b[A\n",
" 87%|████████▋ |get-started/data.xml 31.3M/36.1M [00:16<00:01, 2.83MB/s]\u001b[A\n",
" 88%|████████▊ |get-started/data.xml 31.6M/36.1M [00:16<00:01, 2.83MB/s]\u001b[A\n",
" 88%|████████▊ |get-started/data.xml 31.9M/36.1M [00:16<00:01, 2.84MB/s]\u001b[A\n",
" 89%|████████▉ |get-started/data.xml 32.2M/36.1M [00:16<00:01, 2.85MB/s]\u001b[A\n",
" 90%|█████████ |get-started/data.xml 32.6M/36.1M [00:17<00:01, 2.85MB/s]\u001b[A\n",
" 91%|█████████ |get-started/data.xml 32.9M/36.1M [00:17<00:01, 2.86MB/s]\u001b[A\n",
" 92%|█████████▏|get-started/data.xml 33.2M/36.1M [00:17<00:01, 2.86MB/s]\u001b[A\n",
" 93%|█████████▎|get-started/data.xml 33.5M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 94%|█████████▎|get-started/data.xml 33.8M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 94%|█████████▍|get-started/data.xml 34.1M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 95%|█████████▌|get-started/data.xml 34.4M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 96%|█████████▌|get-started/data.xml 34.8M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 97%|█████████▋|get-started/data.xml 35.1M/36.1M [00:17<00:00, 2.87MB/s]\u001b[A\n",
" 98%|█████████▊|get-started/data.xml 35.4M/36.1M [00:18<00:00, 2.87MB/s]\u001b[A\n",
" 99%|█████████▉|get-started/data.xml 35.7M/36.1M [00:18<00:00, 2.88MB/s]\u001b[A\n",
"100%|█████████▉|get-started/data.xml 36.0M/36.1M [00:18<00:00, 2.87MB/s]\u001b[A\n",
" \u001b[A\n",
"To track the changes with git, run:\n",
"\n",
"\tgit add data/.gitignore data/data.xml.dvc\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc import https://github.com/iterative/dataset-registry \\\n",
" get-started/data.xml -o data/data.xml"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "be2c1a37",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data and pipelines are up to date. \n",
"\u001b[0m"
]
}
],
"source": [
"!dvc status"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "3306c5b7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 37020\r\n",
"-rw-rw-r-- 1 tomek tomek 37891850 maj 31 11:10 data.xml\r\n",
"-rw-rw-r-- 1 tomek tomek 284 maj 31 11:10 data.xml.dvc\r\n",
"-rw-rw-r-- 1 tomek tomek 5072 maj 31 11:01 Iris.csv\r\n",
"-rw-rw-r-- 1 tomek tomek 76 maj 31 11:01 Iris.csv.dvc\r\n"
]
}
],
"source": [
"ls -l data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b73c56ea",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# %load data/data.xml.dvc\n",
"md5: a7cd139231cc35ed63541ce3829b96db\n",
"frozen: true\n",
"deps:\n",
"- path: get-started/data.xml\n",
" repo:\n",
" url: https://github.com/iterative/dataset-registry\n",
" rev_lock: ba014f40e29670421a67cb1c47543f402348aa13\n",
"outs:\n",
"- md5: a304afb96060aad90176268345e10355\n",
" size: 37891850\n",
" path: data.xml\n"
]
},
{
"cell_type": "markdown",
"id": "db1063ac",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC pipelines\n",
" - introduction: https://youtu.be/71IGzyH95UY\n",
" - Getting started: https://dvc.org/doc/start/data-pipelines\n",
" - DVC pipelines let us build (with the `dvc run` command) or define (by editing the `dvc.yaml` file) a dependency graph between the steps performed in our project (such as \"data preparation\", \"training\", \"evaluation\")\n",
" - a pipeline defined this way can then be run with the `dvc repro` command"
]
},
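{
"cell_type": "markdown",
"id": "d4c7a9e1",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The dependency graph described above can be sketched as a `dvc.yaml` file like the one below. This is only a minimal illustration, not the configuration of any actual project: the stage names (`prepare`, `train`), the scripts (`prepare.py`, `train.py`) and the output paths (`data/prepared.csv`, `model.pkl`) are assumptions made for this example; only `data/Iris.csv` appears earlier in this notebook.\n",
"\n",
"```yaml\n",
"# dvc.yaml (hypothetical two-stage pipeline)\n",
"stages:\n",
"  prepare:\n",
"    # prepare.py is an assumed script that cleans the raw data\n",
"    cmd: python prepare.py data/Iris.csv data/prepared.csv\n",
"    deps:\n",
"      - prepare.py\n",
"      - data/Iris.csv\n",
"    outs:\n",
"      - data/prepared.csv\n",
"  train:\n",
"    # train.py is an assumed script that fits and saves a model\n",
"    cmd: python train.py data/prepared.csv model.pkl\n",
"    deps:\n",
"      - train.py\n",
"      - data/prepared.csv   # output of the prepare stage\n",
"    outs:\n",
"      - model.pkl\n",
"```\n",
"\n",
"Because `train` lists `data/prepared.csv` (an output of `prepare`) among its dependencies, DVC runs the stages in that order, and `dvc repro` re-runs only the stages whose dependencies have changed."
]
},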
{
"cell_type": "markdown",
"id": "e2939867",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tasks [15 pts]\n",
"1. Initialize a DVC repository inside your project repository [1 pt]\n",
"2. Add the data file(s) from your project to DVC [1 pt]\n",
"3. Configure a remote (the configuration details will be provided shortly) [1 pt]\n",
"4. Create/define a `dvc.yaml` file describing the steps performed in your project and add it to the repository. Define at least 2 steps (e.g. data preparation/training) linked to each other by dependencies (use the \"Getting started\" materials linked above) [6 pts]\n",
"5. Create a Jenkins project (`s1233456-dvc`) in which you clone the repository, download the DVC-tracked files (with `dvc pull`) and run the pipeline (with `dvc repro`) [6 pts]"
]
},
{
"cell_type": "markdown",
"id": "2f5a8590",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## SSH remote\n",
"One of the remote types supported by DVC is SFTP/SSH.\n",
"To make use of it, a user `ium-sftp` was created and an SFTP server was configured on tzietkiewicz.vm.wmi.amu.edu.pl.\n",
"An SSH key was also generated for this user and added as a \"Jenkins credential\" (see the description of the Jenkins configuration below)"
]
},
{
"cell_type": "markdown",
"id": "82a61107",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Locally\n",
"We will need the following dependencies:\n",
" \n",
" `conda install -c conda-forge dvc-ssh` \n",
"\n",
"or\n",
"\n",
"`pip install dvc[ssh] paramiko`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c48c5b8e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: done\n",
"\n",
"## Package Plan ##\n",
"\n",
" environment location: /home/tomek/miniconda3\n",
"\n",
" added / updated specs:\n",
" - dvc-ssh\n",
"\n",
"\n",
"The following packages will be downloaded:\n",
"\n",
" package | build\n",
" ---------------------------|-----------------\n",
" bcrypt-3.2.0 | py39h3811e60_1 44 KB conda-forge\n",
" ca-certificates-2021.5.30 | ha878542_0 136 KB conda-forge\n",
" certifi-2021.5.30 | py39hf3d152e_0 141 KB conda-forge\n",
" dvc-2.3.0 | py39hf3d152e_0 542 KB conda-forge\n",
" dvc-ssh-2.3.0 | py39hf3d152e_0 9 KB conda-forge\n",
" fsspec-2021.5.0 | pyhd8ed1ab_0 77 KB conda-forge\n",
" invoke-1.5.0 | pyhd3deb0d_0 137 KB conda-forge\n",
" paramiko-2.7.2 | pyh9f0ad1d_0 135 KB conda-forge\n",
" pynacl-1.4.0 | py39h3811e60_2 1.3 MB conda-forge\n",
" ------------------------------------------------------------\n",
" Total: 2.5 MB\n",
"\n",
"The following NEW packages will be INSTALLED:\n",
"\n",
" bcrypt conda-forge/linux-64::bcrypt-3.2.0-py39h3811e60_1\n",
" dvc-ssh conda-forge/linux-64::dvc-ssh-2.3.0-py39hf3d152e_0\n",
" invoke conda-forge/noarch::invoke-1.5.0-pyhd3deb0d_0\n",
" paramiko conda-forge/noarch::paramiko-2.7.2-pyh9f0ad1d_0\n",
" pynacl conda-forge/linux-64::pynacl-1.4.0-py39h3811e60_2\n",
"\n",
"The following packages will be UPDATED:\n",
"\n",
" ca-certificates 2020.12.5-ha878542_0 --> 2021.5.30-ha878542_0\n",
" certifi 2020.12.5-py39hf3d152e_1 --> 2021.5.30-py39hf3d152e_0\n",
" dvc 2.1.0-py39hf3d152e_0 --> 2.3.0-py39hf3d152e_0\n",
" fsspec 0.9.0-pyhd8ed1ab_2 --> 2021.5.0-pyhd8ed1ab_0\n",
"\n",
"\n",
"\n",
"Downloading and Extracting Packages\n",
"certifi-2021.5.30 | 141 KB | ##################################### | 100% \n",
"fsspec-2021.5.0 | 77 KB | ##################################### | 100% \n",
"dvc-2.3.0 | 542 KB | ##################################### | 100% \n",
"invoke-1.5.0 | 137 KB | ##################################### | 100% \n",
"paramiko-2.7.2 | 135 KB | ##################################### | 100% \n",
"bcrypt-3.2.0 | 44 KB | ##################################### | 100% \n",
"pynacl-1.4.0 | 1.3 MB | ##################################### | 100% \n",
"dvc-ssh-2.3.0 | 9 KB | ##################################### | 100% \n",
"ca-certificates-2021 | 136 KB | ##################################### | 100% \n",
"Preparing transaction: done\n",
"Verifying transaction: done\n",
"Executing transaction: done\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda install -c conda-forge dvc-ssh"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "e9a04876",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setting 'ium_ssh_remote' as a default remote.\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "e3f27bbb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"my_local_remote\t/dvcstore\n",
"ium_ssh_remote\tssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc remote list"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5b2fa175",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[0m"
]
}
],
"source": [
"!dvc remote modify --local ium_ssh_remote password [same password as for the MLflow server (see MS Teams)]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "ea6e16fa",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0% Uploading| |0/1 [00:00<?, ?file/s]\n",
"!\u001b[A\n",
" 0%| |data/Iris.csv 0.00/4.95k [00:00<?, ?B/s]\u001b[A\n",
"1 file pushed \u001b[A\n",
"\u001b[0m"
]
}
],
"source": [
"!dvc push"
]
},
{
"cell_type": "markdown",
"id": "1468c44c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Jenkins\n",
"\n",
"In Jenkins, the \"Credentials\" mechanism can be used to pass a password or a private key securely.\n",
"\n",
"The following credentials for the ium-sftp user have been created in Jenkins:\n",
"\n",
" - an SSH key: https://tzietkiewicz.vm.wmi.amu.edu.pl:8080/credentials/store/system/domain/_/credential/48ac7004-216e-4260-abba-1fe5db753e18/\n",
" - a \"secret text\" containing the password of the ium-sftp user: https://tzietkiewicz.vm.wmi.amu.edu.pl:8080/credentials/store/system/domain/_/credential/ium-sftp-password/\n",
"\n",
"A description of using \"Credentials\" in a Jenkinsfile: https://www.jenkins.io/doc/book/pipeline/jenkinsfile/#for-other-credential-types\n",
"\n",
"The SSH key can be used like this: \n",
"\n",
"```Jenkinsfile\n",
"withCredentials(\n",
" [sshUserPrivateKey(credentialsId: '48ac7004-216e-4260-abba-1fe5db753e18', keyFileVariable: 'IUM_SFTP_KEY', passphraseVariable: '', usernameVariable: '')]) {\n",
" sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
" sh 'dvc remote modify --local ium_ssh_remote keyfile $IUM_SFTP_KEY'\n",
" sh 'dvc pull'}\n",
"```\n",
"\n",
"A secret text credential can be used like this:\n",
"\n",
"```Jenkinsfile\n",
" withCredentials([string(credentialsId: 'ium-sftp-password', variable: 'IUM_SFTP_PASS')]) {\n",
" sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
" sh 'dvc remote modify --local ium_ssh_remote password $IUM_SFTP_PASS'\n",
" sh 'dvc pull'\n",
" }\n",
"```\n",
"\n",
"Example configuration: \n",
" - https://tzietkiewicz.vm.wmi.amu.edu.pl:8080/job/docker-test-mount/ \n",
" - https://git.wmi.amu.edu.pl/tzietkiewicz/ium-helloworld"
]
}
],
"metadata": {
"author": "Tomasz Ziętkiewicz",
"celltoolbar": "Slideshow",
"email": "tomasz.zietkiewicz@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.1"
},
"slideshow": {
"slide_type": "slide"
},
"subtitle": "10.DVC[laboratoria]",
"title": "Inżynieria uczenia maszynowego",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 5
}