{ "cells": [ { "cell_type": "markdown", "id": "7fe475ae", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Machine Learning Engineering

\n", "

10. DVC [labs]

\n", "

Tomasz Ziętkiewicz (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "id": "0c6f27a5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "id": "560eec71", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## DVC - Data Version Control\n", "- [dvc.org](https://dvc.org/)\n", "- \"Version Control System for Machine Learning Projects\"\n", "- Open Source\n", "- It enables:\n", " - versioning data and models: \"Git for data and models\"\n", " - building pipelines that define how models are built, trained, and evaluated: \"Makefile for machine learning\"\n", " - tracking and comparing metrics and parameters\n", "- tightly integrated with Git\n", "- works independently of the language, libraries, and operating system in use\n", "- 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs&t=197s" ] }, { "cell_type": "markdown", "id": "9bfb356e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Installation and initialization\n", " - https://dvc.org/doc/install\n", " - ```pip(x) install dvc``` or:\n", " - ```conda install dvc```" ] }, { "cell_type": "code", "execution_count": 10, "id": "054c7a11", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting package metadata (current_repodata.json): done\n", "Solving environment: failed with initial frozen solve. 
Retrying with flexible solve.\n", "Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.\n", "Collecting package metadata (repodata.json): done\n", "Solving environment: done\n", "\n", "## Package Plan ##\n", "\n", " environment location: /home/tomek/miniconda3\n", "\n", " added / updated specs:\n", " - dvc\n", "\n", "\n", "The following packages will be downloaded:\n", "\n", " package | build\n", " ---------------------------|-----------------\n", " atpublic-1.0 | py_0 7 KB conda-forge\n", " bzip2-1.0.8 | h7f98852_4 484 KB conda-forge\n", " cached-property-1.5.2 | hd8ed1ab_1 4 KB conda-forge\n", " cached_property-1.5.2 | pyha770c72_1 11 KB conda-forge\n", " colorama-0.4.4 | pyh9f0ad1d_0 18 KB conda-forge\n", " commonmark-0.9.1 | py_0 46 KB conda-forge\n", " configobj-5.0.6 | py_0 31 KB conda-forge\n", " dictdiffer-0.8.1 | pyhd8ed1ab_0 16 KB conda-forge\n", " diskcache-5.2.1 | pyh44b312d_0 36 KB conda-forge\n", " distro-1.5.0 | pyh9f0ad1d_0 20 KB conda-forge\n", " dpath-2.0.1 | py39hf3d152e_0 23 KB conda-forge\n", " dulwich-0.20.23 | py39h3811e60_0 721 KB conda-forge\n", " dvc-2.1.0 | py39hf3d152e_0 551 KB conda-forge\n", " flatten-dict-0.3.0 | pyh9f0ad1d_0 11 KB conda-forge\n", " flufl.lock-3.2 | py_0 19 KB conda-forge\n", " fsspec-0.9.0 | pyhd8ed1ab_2 75 KB conda-forge\n", " ftfy-5.5.1 | py_0 47 KB conda-forge\n", " funcy-1.16 | pyhd8ed1ab_0 30 KB conda-forge\n", " future-0.18.2 | py39hf3d152e_3 718 KB conda-forge\n", " grandalf-0.6 | py_0 42 KB conda-forge\n", " jsonpath-ng-1.5.2 | pyh9f0ad1d_0 26 KB conda-forge\n", " libgit2-1.1.0 | h0b03e73_0 693 KB conda-forge\n", " libssh2-1.9.0 | ha56f1ee_6 226 KB conda-forge\n", " mailchecker-4.0.7 | pyhd8ed1ab_0 206 KB conda-forge\n", " nanotime-0.5.2 | py_0 6 KB conda-forge\n", " networkx-2.5 | py_0 1.2 MB conda-forge\n", " pathlib2-2.3.5 | py39hf3d152e_3 35 KB conda-forge\n", " pathspec-0.8.1 | pyhd3deb0d_0 29 KB conda-forge\n", " pcre2-10.35 | h032f7d1_2 693 KB 
conda-forge\n", " phonenumbers-8.10.14 | py_0 1.5 MB conda-forge\n", " ply-3.11 | py_1 44 KB conda-forge\n", " pyasn1-0.4.8 | py_0 53 KB conda-forge\n", " pydot-1.2.4 | py_0 20 KB conda-forge\n", " pygit2-1.5.0 | py39h3811e60_0 213 KB conda-forge\n", " pygtrie-2.3.2 | pyh8c360ce_0 24 KB conda-forge\n", " python-benedict-0.24.0 | pyhd8ed1ab_0 30 KB conda-forge\n", " python-fsutil-0.5.0 | pyhd8ed1ab_0 13 KB conda-forge\n", " python-slugify-5.0.2 | pyhd8ed1ab_0 12 KB conda-forge\n", " rich-10.2.2 | py39hf3d152e_0 337 KB conda-forge\n", " ruamel.yaml-0.17.4 | py39h3811e60_0 160 KB conda-forge\n", " ruamel.yaml.clib-0.2.2 | py39h3811e60_2 173 KB conda-forge\n", " shortuuid-1.0.1 | py39hf3d152e_4 15 KB conda-forge\n", " shtab-1.3.6 | pyhd8ed1ab_0 15 KB conda-forge\n", " text-unidecode-1.3 | py_0 68 KB conda-forge\n", " toml-0.10.2 | pyhd8ed1ab_0 18 KB conda-forge\n", " unidecode-1.2.0 | pyhd8ed1ab_0 155 KB conda-forge\n", " voluptuous-0.12.1 | pyhd3deb0d_0 28 KB conda-forge\n", " zc.lockfile-2.0 | py_0 11 KB conda-forge\n", " ------------------------------------------------------------\n", " Total: 8.8 MB\n", "\n", "The following NEW packages will be INSTALLED:\n", "\n", " _openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-1_gnu\n", " appdirs conda-forge/noarch::appdirs-1.4.4-pyh9f0ad1d_0\n", " atpublic conda-forge/noarch::atpublic-1.0-py_0\n", " bzip2 conda-forge/linux-64::bzip2-1.0.8-h7f98852_4\n", " cached-property conda-forge/noarch::cached-property-1.5.2-hd8ed1ab_1\n", " cached_property conda-forge/noarch::cached_property-1.5.2-pyha770c72_1\n", " colorama conda-forge/noarch::colorama-0.4.4-pyh9f0ad1d_0\n", " commonmark conda-forge/noarch::commonmark-0.9.1-py_0\n", " configobj conda-forge/noarch::configobj-5.0.6-py_0\n", " dictdiffer conda-forge/noarch::dictdiffer-0.8.1-pyhd8ed1ab_0\n", " diskcache conda-forge/noarch::diskcache-5.2.1-pyh44b312d_0\n", " distro conda-forge/noarch::distro-1.5.0-pyh9f0ad1d_0\n", " dpath 
conda-forge/linux-64::dpath-2.0.1-py39hf3d152e_0\n", " dulwich conda-forge/linux-64::dulwich-0.20.23-py39h3811e60_0\n", " dvc conda-forge/linux-64::dvc-2.1.0-py39hf3d152e_0\n", " flatten-dict conda-forge/noarch::flatten-dict-0.3.0-pyh9f0ad1d_0\n", " flufl.lock conda-forge/noarch::flufl.lock-3.2-py_0\n", " fsspec conda-forge/noarch::fsspec-0.9.0-pyhd8ed1ab_2\n", " ftfy conda-forge/noarch::ftfy-5.5.1-py_0\n", " funcy conda-forge/noarch::funcy-1.16-pyhd8ed1ab_0\n", " future conda-forge/linux-64::future-0.18.2-py39hf3d152e_3\n", " gitdb conda-forge/noarch::gitdb-4.0.7-pyhd8ed1ab_0\n", " gitpython conda-forge/noarch::gitpython-3.1.17-pyhd8ed1ab_0\n", " grandalf conda-forge/noarch::grandalf-0.6-py_0\n", " jsonpath-ng conda-forge/noarch::jsonpath-ng-1.5.2-pyh9f0ad1d_0\n", " libgit2 conda-forge/linux-64::libgit2-1.1.0-h0b03e73_0\n", " libgomp conda-forge/linux-64::libgomp-9.3.0-h2828fa1_19\n", " libssh2 conda-forge/linux-64::libssh2-1.9.0-ha56f1ee_6\n", " mailchecker conda-forge/noarch::mailchecker-4.0.7-pyhd8ed1ab_0\n", " nanotime conda-forge/noarch::nanotime-0.5.2-py_0\n", " networkx conda-forge/noarch::networkx-2.5-py_0\n", " pathlib2 conda-forge/linux-64::pathlib2-2.3.5-py39hf3d152e_3\n", " pathspec conda-forge/noarch::pathspec-0.8.1-pyhd3deb0d_0\n", " pcre2 conda-forge/linux-64::pcre2-10.35-h032f7d1_2\n", " phonenumbers conda-forge/noarch::phonenumbers-8.10.14-py_0\n", " pip conda-forge/noarch::pip-21.1.2-pyhd8ed1ab_0\n", " ply conda-forge/noarch::ply-3.11-py_1\n", " pyasn1 conda-forge/noarch::pyasn1-0.4.8-py_0\n", " pydot conda-forge/noarch::pydot-1.2.4-py_0\n", " pygit2 conda-forge/linux-64::pygit2-1.5.0-py39h3811e60_0\n", " pygtrie conda-forge/noarch::pygtrie-2.3.2-pyh8c360ce_0\n", " python-benedict conda-forge/noarch::python-benedict-0.24.0-pyhd8ed1ab_0\n", " python-fsutil conda-forge/noarch::python-fsutil-0.5.0-pyhd8ed1ab_0\n", " python-slugify conda-forge/noarch::python-slugify-5.0.2-pyhd8ed1ab_0\n", " rich conda-forge/linux-64::rich-10.2.2-py39hf3d152e_0\n", " 
ruamel.yaml conda-forge/linux-64::ruamel.yaml-0.17.4-py39h3811e60_0\n", " ruamel.yaml.clib conda-forge/linux-64::ruamel.yaml.clib-0.2.2-py39h3811e60_2\n", " shortuuid conda-forge/linux-64::shortuuid-1.0.1-py39hf3d152e_4\n", " shtab conda-forge/noarch::shtab-1.3.6-pyhd8ed1ab_0\n", " smmap conda-forge/noarch::smmap-3.0.5-pyh44b312d_0\n", " tabulate conda-forge/noarch::tabulate-0.8.9-pyhd8ed1ab_0\n", " text-unidecode conda-forge/noarch::text-unidecode-1.3-py_0\n", " toml conda-forge/noarch::toml-0.10.2-pyhd8ed1ab_0\n", " typing_extensions conda-forge/noarch::typing_extensions-3.7.4.3-py_0\n", " unidecode conda-forge/noarch::unidecode-1.2.0-pyhd8ed1ab_0\n", " voluptuous conda-forge/noarch::voluptuous-0.12.1-pyhd3deb0d_0\n", " wheel conda-forge/noarch::wheel-0.36.2-pyhd3deb0d_0\n", " zc.lockfile conda-forge/noarch::zc.lockfile-2.0-py_0\n", "\n", "The following packages will be UPDATED:\n", "\n", " certifi pkgs/main::certifi-2020.12.5-py39h06a~ --> conda-forge::certifi-2020.12.5-py39hf3d152e_1\n", " libgcc-ng pkgs/main::libgcc-ng-9.1.0-hdf63c60_0 --> conda-forge::libgcc-ng-9.3.0-h2828fa1_19\n", "\n", "The following packages will be SUPERSEDED by a higher-priority channel:\n", "\n", " _libgcc_mutex pkgs/main::_libgcc_mutex-0.1-main --> conda-forge::_libgcc_mutex-0.1-conda_forge\n", " ca-certificates pkgs/main::ca-certificates-2021.4.13-~ --> conda-forge::ca-certificates-2020.12.5-ha878542_0\n", " conda pkgs/main::conda-4.10.1-py39h06a4308_1 --> conda-forge::conda-4.10.1-py39hf3d152e_0\n", " openssl pkgs/main::openssl-1.1.1k-h27cfd23_0 --> conda-forge::openssl-1.1.1k-h7f98852_0\n", "\n", "\n", "\n", "Downloading and Extracting Packages\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "diskcache-5.2.1 | 36 KB | ##################################### | 100% \n", "pathspec-0.8.1 | 29 KB | ##################################### | 100% \n", "cached-property-1.5. 
| 4 KB | ##################################### | 100% \n", "networkx-2.5 | 1.2 MB | ##################################### | 100% \n", "commonmark-0.9.1 | 46 KB | ##################################### | 100% \n", "configobj-5.0.6 | 31 KB | ##################################### | 100% \n", "python-fsutil-0.5.0 | 13 KB | ##################################### | 100% \n", "fsspec-0.9.0 | 75 KB | ##################################### | 100% \n", "dulwich-0.20.23 | 721 KB | ##################################### | 100% \n", "funcy-1.16 | 30 KB | ##################################### | 100% \n", "bzip2-1.0.8 | 484 KB | ##################################### | 100% \n", "ply-3.11 | 44 KB | ##################################### | 100% \n", "libgit2-1.1.0 | 693 KB | ##################################### | 100% \n", "ftfy-5.5.1 | 47 KB | ##################################### | 100% \n", "nanotime-0.5.2 | 6 KB | ##################################### | 100% \n", "pyasn1-0.4.8 | 53 KB | ##################################### | 100% \n", "unidecode-1.2.0 | 155 KB | ##################################### | 100% \n", "dvc-2.1.0 | 551 KB | ##################################### | 100% \n", "pydot-1.2.4 | 20 KB | ##################################### | 100% \n", "zc.lockfile-2.0 | 11 KB | ##################################### | 100% \n", "dpath-2.0.1 | 23 KB | ##################################### | 100% \n", "pcre2-10.35 | 693 KB | ##################################### | 100% \n", "ruamel.yaml-0.17.4 | 160 KB | ##################################### | 100% \n", "flatten-dict-0.3.0 | 11 KB | ##################################### | 100% \n", "python-slugify-5.0.2 | 12 KB | ##################################### | 100% \n", "shortuuid-1.0.1 | 15 KB | ##################################### | 100% \n", "text-unidecode-1.3 | 68 KB | ##################################### | 100% \n", "cached_property-1.5. 
| 11 KB | ##################################### | 100% \n", "colorama-0.4.4 | 18 KB | ##################################### | 100% \n", "flufl.lock-3.2 | 19 KB | ##################################### | 100% \n", "libssh2-1.9.0 | 226 KB | ##################################### | 100% \n", "python-benedict-0.24 | 30 KB | ##################################### | 100% \n", "distro-1.5.0 | 20 KB | ##################################### | 100% \n", "grandalf-0.6 | 42 KB | ##################################### | 100% \n", "future-0.18.2 | 718 KB | ##################################### | 100% \n", "ruamel.yaml.clib-0.2 | 173 KB | ##################################### | 100% \n", "rich-10.2.2 | 337 KB | ##################################### | 100% \n", "shtab-1.3.6 | 15 KB | ##################################### | 100% \n", "pygtrie-2.3.2 | 24 KB | ##################################### | 100% \n", "mailchecker-4.0.7 | 206 KB | ##################################### | 100% \n", "voluptuous-0.12.1 | 28 KB | ##################################### | 100% \n", "atpublic-1.0 | 7 KB | ##################################### | 100% \n", "phonenumbers-8.10.14 | 1.5 MB | ##################################### | 100% \n", "pathlib2-2.3.5 | 35 KB | ##################################### | 100% \n", "pygit2-1.5.0 | 213 KB | ##################################### | 100% \n", "dictdiffer-0.8.1 | 16 KB | ##################################### | 100% \n", "toml-0.10.2 | 18 KB | ##################################### | 100% \n", "jsonpath-ng-1.5.2 | 26 KB | ##################################### | 100% \n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n", "\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "conda install dvc" ] }, { "cell_type": "markdown", "id": "20975d62", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's create a directory in which we will keep our project:" ] }, { 
"cell_type": "code", "execution_count": 12, "id": "aae59ec2", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "!mkdir -p IUM_10/sample-ml-project" ] }, { "cell_type": "code", "execution_count": 2, "id": "1e522a93", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project\n" ] } ], "source": [ "#Jupyter notebook magic https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd\n", "%cd \"IUM_10/sample-ml-project\"" ] }, { "cell_type": "markdown", "id": "199c0d92", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Inicjalizujemy puste repozytorium Git (możemy też pominąć ten krok i działać w istniejącym już repozytorium)" ] }, { "cell_type": "code", "execution_count": 17, "id": "c13c525b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized empty Git repository in /home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project/.git/\r\n" ] } ], "source": [ "!git init" ] }, { "cell_type": "markdown", "id": "c7155369", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Teraz inicjalizujemy repozytorium DVC:" ] }, { "cell_type": "code", "execution_count": 18, "id": "44f28226", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized DVC repository.\n", "\n", "You can now commit the changes to git.\n", "\n", "\u001b[31m+---------------------------------------------------------------------+\n", "\u001b[0m\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m DVC has enabled anonymous aggregate usage analytics. 
\u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m Read the analytics documentation (and how to opt-out) here: \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m <\u001b[36mhttps://dvc.org/doc/user-guide/analytics\u001b[39m> \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n", "\u001b[31m+---------------------------------------------------------------------+\n", "\u001b[0m\n", "\u001b[33mWhat's next?\u001b[39m\n", "\u001b[33m------------\u001b[39m\n", "- Check out the documentation: <\u001b[36mhttps://dvc.org/doc\u001b[39m>\n", "- Get help and share ideas: <\u001b[36mhttps://dvc.org/chat\u001b[39m>\n", "- Star us on GitHub: <\u001b[36mhttps://github.com/iterative/dvc\u001b[39m>\n", "\u001b[0m" ] } ], "source": [ "!dvc init" ] }, { "cell_type": "markdown", "id": "00bc72ed", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's see which files DVC added (they were also staged in the Git repository).\n", "They are described here: https://dvc.org/doc/user-guide/project-structure/internal-files" ] }, { "cell_type": "code", "execution_count": 19, "id": "d1aefe16", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch master\r\n", "\r\n", "No commits yet\r\n", "\r\n", "Changes to be committed:\r\n", " (use \"git rm --cached ...\" to unstage)\r\n", "\t\u001b[32mnew file: .dvc/.gitignore\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/config\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/confusion.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/confusion_normalized.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/default.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/linear.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/scatter.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/plots/smooth.json\u001b[m\r\n", "\t\u001b[32mnew file: .dvcignore\u001b[m\r\n", "\r\n" ] } ], "source": [ "!git status" ] }, { "cell_type": "markdown", "id": "72e0a272", "metadata": { 
"slideshow": { "slide_type": "slide" } }, "source": [ "Możemy teraz zacommitować zmiany w git:" ] }, { "cell_type": "code", "execution_count": 5, "id": "59780e99", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch master\r\n", "nothing to commit, working tree clean\r\n" ] } ], "source": [ "!git commit -m \"Initial commit\"" ] }, { "cell_type": "markdown", "id": "dd8e529b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Śledzenie plików za pomocą DVC\n", " - dużymi plikami, takimi jak plikami z danymi wejściowymi czy plikami modeli, trudno zarządza się za pomocą gita, ze względu na problemy z:\n", " - wydajnością\n", " - przestrzenią w repozytorium\n", " - Git posiada rozszerzenie [lfs(Large File Storage)](https://git-lfs.github.com/), które stanowi pewne rozwiązanie tego problemu. Same pliki przechowywane są na specjalnym zdalnym serwerze, w repozytorium przechowywane są jedynie odnośniki do tych plików i pewne metadane\n", " - DVC proponuje podobne podejście, ale:\n", " - pliki mogą być przechowywane na niemal dowolnym serwerze, również lokalnie\n", " - brak limitu wielkości plików (w Git-LFS najczęściej limit 2GB)\n", " - DVC zapewnia dodatkowe narzędzie umożliwiające śledzenie plików i ich powiązań z wynikami eksperymentów\n", " - więcej, patrz [tutaj](https://dvc.org/doc/user-guide/related-technologies)" ] }, { "cell_type": "markdown", "id": "a8861abe", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Przygotujmy przykładowe dane, pobierając je z Kaggle:" ] }, { "cell_type": "code", "execution_count": 19, "id": "f05ece1b", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading iris.zip to /home/tomek/AITech/repo/aitech-ium-private/IUM_10/sample-ml-project\n", " 0%| | 0.00/3.60k [00:00...\" to include in what will be committed)\r\n", 
"\t\u001b[31mdata/.gitignore\u001b[m\r\n", "\t\u001b[31mdata/Iris.csv.dvc\u001b[m\r\n", "\t\u001b[31miris.zip\u001b[m\r\n", "\r\n", "nothing added to commit but untracked files present (use \"git add\" to track)\r\n" ] } ], "source": [ "!git status -u" ] }, { "cell_type": "markdown", "id": "8589fecf", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dodajmy pliki `data/Iris.csv.dvc data/.gitignore` do repozytorium git, zgodnie z sugestią DVC:" ] }, { "cell_type": "code", "execution_count": 21, "id": "460c4a17", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "!git add data/Iris.csv.dvc data/.gitignore" ] }, { "cell_type": "code", "execution_count": 22, "id": "80644077", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[master cc0821a] Dodano dane IRIS (DVC)\r\n", " 2 files changed, 5 insertions(+)\r\n", " create mode 100644 data/.gitignore\r\n", " create mode 100644 data/Iris.csv.dvc\r\n" ] } ], "source": [ "!git commit -m \"Dodano dane IRIS (DVC)\"" ] }, { "cell_type": "markdown", "id": "03899863", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Plik `*.dvc` zawiera m.in. hash pliku. Więcej o plikach `*.dvc`: [link](https://dvc.org/doc/user-guide/project-structure/dvc-files)" ] }, { "cell_type": "code", "execution_count": null, "id": "8cb2ba7c", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# %load data/Iris.csv.dvc\n", "outs:\n", "- md5: 717820ef0af287ff346c5cabfb4c612c\n", " size: 5107\n", " path: Iris.csv\n" ] }, { "cell_type": "markdown", "id": "0b421d45", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Oryginalny plik `Iris.csv` został przeniesiony do katalogu ./dvc/cache/{wartość hash pliku) i podlinkowany z powrotem do oryginalnej lokalizacji. 
The way the links are created may [differ depending on the file system](https://dvc.org/doc/user-guide/large-dataset-optimization)." ] }, { "cell_type": "code", "execution_count": 27, "id": "1d471f3a", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 8\r\n", "-r--r--r-- 1 tomek tomek 5107 wrz 19 2019 7820ef0af287ff346c5cabfb4c612c\r\n" ] } ], "source": [ "!ls -l .dvc/cache/71" ] }, { "cell_type": "markdown", "id": "901e8e90", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## dvc remote\n", " - to send the actual files tracked by DVC to a remote location (from which they can be downloaded, e.g. by a CI system or by other users), such a location must first be configured\n", " - the [`dvc remote add`](https://dvc.org/doc/command-reference/remote/add) command does this\n", " - we will use a local \"remote\". Here it is simply the previously created `/dvcstore` directory. The same directory exists on our Jenkins server; it must of course be mounted in the Docker container\n", " - in real-world use we would give the path to a resource reachable over the internet, e.g. an SFTP server or an AWS S3 path" 
] }, { "cell_type": "code", "execution_count": 28, "id": "731f6ea4", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting 'my_local_remote' as a default remote.\r\n", "\u001b[0m" ] } ], "source": [ "!dvc remote add -d my_local_remote /dvcstore" ] }, { "cell_type": "code", "execution_count": 39, "id": "9c3deeaf", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch master\r\n", "Changes not staged for commit:\r\n", " (use \"git add ...\" to update what will be committed)\r\n", " (use \"git restore ...\" to discard changes in working directory)\r\n", "\t\u001b[31mmodified: .dvc/config\u001b[m\r\n", "\r\n", "no changes added to commit (use \"git add\" and/or \"git commit -a\")\r\n" ] } ], "source": [ "!git status" ] }, { "cell_type": "code", "execution_count": 41, "id": "899eac7d", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[master 3ff62b6] Added DVC remote\r\n", " 1 file changed, 4 insertions(+)\r\n" ] } ], "source": [ "!git add .dvc/config\n", "!git commit -m \"Added DVC remote\"" ] }, { "cell_type": "markdown", "id": "8c556c96", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## dvc push\n", "Once a \"remote\" is configured, we can push files to it with the `dvc push` command:" ] }, { "cell_type": "code", "execution_count": null, "id": "c7f24f75", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "!dvc push" ] }, { "cell_type": "code", "execution_count": 33, "id": "8a355575", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34;42m/dvcstore\u001b[00m\r\n", "└── \u001b[01;34m71\u001b[00m\r\n", " └── 7820ef0af287ff346c5cabfb4c612c\r\n", "\r\n", "1 directory, 1 file\r\n" ] 
} ], "source": [ "!tree /dvcstore" ] }, { "cell_type": "markdown", "id": "af59ecb3", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## dvc pull\n", "To fetch data from DVC (e.g. in a different location or as a different user), we need to:\n", " - clone the Git repository (to obtain, among other things, the `*.dvc` files)\n", " - run `dvc pull`" ] }, { "cell_type": "markdown", "id": "9fa914a7", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Adding new files and modifying existing ones works much like it does for ordinary files tracked by Git, except that we use the `dvc` command instead of `git` and also remember to manage the `*.dvc` files with Git:" ] }, { "cell_type": "code", "execution_count": 37, "id": "dde39796", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "!head -n -1 data/Iris.csv | sponge data/Iris.csv" ] }, { "cell_type": "code", "execution_count": 42, "id": "7f14ec60", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch master\r\n", "nothing to commit, working tree clean\r\n" ] } ], "source": [ "!git status" ] }, { "cell_type": "code", "execution_count": 43, "id": "8a841039", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data/Iris.csv.dvc: core\u001b[39m>\n", "\tchanged outs:\n", "\t\tmodified: data/Iris.csv\n", "\u001b[0m" ] } ], "source": [ "!dvc status" ] }, { "cell_type": "code", "execution_count": 44, "id": "bf6c1067", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Adding... 
\n", "!\u001b[A\n", " 0%| |.TatTHknArFHCT9iDCtxHzh.tmp 0.00/5.07k [00:00 'data/data.xml'\n", " 0% Downloading| |0/1 [00:00 2021.5.30-ha878542_0\n", " certifi 2020.12.5-py39hf3d152e_1 --> 2021.5.30-py39hf3d152e_0\n", " dvc 2.1.0-py39hf3d152e_0 --> 2.3.0-py39hf3d152e_0\n", " fsspec 0.9.0-pyhd8ed1ab_2 --> 2021.5.0-pyhd8ed1ab_0\n", "\n", "\n", "\n", "Downloading and Extracting Packages\n", "certifi-2021.5.30 | 141 KB | ##################################### | 100% \n", "fsspec-2021.5.0 | 77 KB | ##################################### | 100% \n", "dvc-2.3.0 | 542 KB | ##################################### | 100% \n", "invoke-1.5.0 | 137 KB | ##################################### | 100% \n", "paramiko-2.7.2 | 135 KB | ##################################### | 100% \n", "bcrypt-3.2.0 | 44 KB | ##################################### | 100% \n", "pynacl-1.4.0 | 1.3 MB | ##################################### | 100% \n", "dvc-ssh-2.3.0 | 9 KB | ##################################### | 100% \n", "ca-certificates-2021 | 136 KB | ##################################### | 100% \n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n", "\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "conda install -c conda-forge dvc-ssh" ] }, { "cell_type": "code", "execution_count": 27, "id": "e9a04876", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting 'ium_ssh_remote' as a default remote.\n", "\u001b[0m" ] } ], "source": [ "!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp" ] }, { "cell_type": "code", "execution_count": 28, "id": "e3f27bbb", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my_local_remote\t/dvcstore\n", "ium_ssh_remote\tssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp\n", "\u001b[0m" ] } ], 
"source": [ "!dvc remote list" ] }, { "cell_type": "code", "execution_count": 32, "id": "5b2fa175", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0m" ] } ], "source": [ "!dvc remote modify --local ium_ssh_remote password [hasło takie jak do serwera MLflow (patrz MSTeams)]" ] }, { "cell_type": "code", "execution_count": 30, "id": "ea6e16fa", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0% Uploading| |0/1 [00:00