{ "cells": [ { "cell_type": "markdown", "id": "7fe475ae", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "# Machine Learning Engineering\n", "## 10. DVC [labs]\n", "### Tomasz Ziętkiewicz (2023)\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "id": "0c6f27a5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "id": "560eec71", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## DVC - Data Version Control\n", "- [dvc.org](https://dvc.org/)\n", "- \"Version Control System for Machine Learning Projects\"\n", "- Open Source\n", "- It lets you:\n", "    - version data and models: \"Git for data and models\"\n", "    - build pipelines that define how models are built, trained, and evaluated: \"Makefile for machine learning\"\n", "    - track and compare metrics and parameters\n", "- tightly integrated with Git\n", "- works independently of the language, libraries, and operating system you use\n", "- 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs&t=197s" ] }, { "cell_type": "markdown", "id": "3d4ce1cb", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tracking files with DVC\n", " - Large files, such as input data files or model files, are hard to manage with Git because of problems with:\n", "     - performance\n", "     - repository storage space\n", "     - limits imposed by the hosting service (e.g. the [100 MB per-file limit on GitHub](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github))\n", " - Git has an extension, [LFS (Large File Storage)](https://git-lfs.github.com/), which addresses this problem to some extent. 
\n", "     - The files themselves are stored on a dedicated remote server; the repository holds only pointers to those files and some metadata\n", "     - GitHub has built-in LFS support with a [1 GB limit for free accounts](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage)" ] }, { "cell_type": "markdown", "id": "dd8e529b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " - DVC takes an approach similar to LFS, but:\n", "     - files can be stored on almost any server, or locally\n", "     - there is no file size limit (Git LFS on GitHub has a [2 GB limit](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage))\n", "     - DVC provides additional tooling for tracking files and how they relate to experiment results\n", "     - for more, see [this comparison](https://dvc.org/doc/user-guide/related-technologies)" ] }, { "cell_type": "markdown", "id": "9bfb356e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Installation and initialization\n", " - https://dvc.org/doc/install\n", " - ```pip(x) install dvc``` or:\n", " - ```conda install -c conda-forge dvc```" ] }, { "cell_type": "code", "execution_count": 6, "id": "054c7a11", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting dvc\n", " Downloading dvc-2.55.0-py3-none-any.whl (419 kB)\n", "\u001b[K |████████████████████████████████| 419 kB 794 kB/s eta 0:00:01\n", "\u001b[?25hCollecting funcy>=1.14\n", " Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)\n", "Collecting voluptuous>=0.11.7\n", " Using cached voluptuous-0.13.1-py3-none-any.whl (29 kB)\n", "Collecting dvc-http>=2.29.0\n", " Downloading dvc_http-2.30.2-py3-none-any.whl (12 kB)\n", "Requirement already satisfied: colorama>=0.3.9 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (0.4.6)\n", "Collecting 
pathspec>=0.10.3\n", " Downloading pathspec-0.11.1-py3-none-any.whl (29 kB)\n", "Collecting pygtrie>=2.3.2\n", " Downloading pygtrie-2.5.0-py3-none-any.whl (25 kB)\n", "Requirement already satisfied: ruamel.yaml>=0.17.11 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (0.17.21)\n", "Requirement already satisfied: tabulate>=0.8.7 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (0.9.0)\n", "Collecting zc.lockfile>=1.2.1\n", " Downloading zc.lockfile-3.0.post1-py3-none-any.whl (9.8 kB)\n", "Collecting dpath<3,>=2.1.0\n", " Downloading dpath-2.1.5-py3-none-any.whl (17 kB)\n", "Collecting shtab<2,>=1.3.4\n", " Downloading shtab-1.6.1-py3-none-any.whl (13 kB)\n", "Requirement already satisfied: tqdm<5,>=4.63.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (4.64.0)\n", "Collecting pydot>=1.2.4\n", " Using cached pydot-1.4.2-py2.py3-none-any.whl (21 kB)\n", "Collecting scmrepo<2,>=1.0.0\n", " Downloading scmrepo-1.0.2-py3-none-any.whl (54 kB)\n", "\u001b[K |████████████████████████████████| 54 kB 1.8 MB/s eta 0:00:01\n", "\u001b[?25hCollecting flatten-dict<1,>=0.4.1\n", " Using cached flatten_dict-0.4.2-py2.py3-none-any.whl (9.7 kB)\n", "Collecting psutil>=5.8\n", " Downloading psutil-5.9.5-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)\n", "\u001b[K |████████████████████████████████| 282 kB 21.9 MB/s eta 0:00:01\n", "\u001b[?25hCollecting dvc-data<0.48,>=0.47.1\n", " Downloading dvc_data-0.47.2-py3-none-any.whl (59 kB)\n", "\u001b[K |████████████████████████████████| 59 kB 4.1 MB/s eta 0:00:01\n", "\u001b[?25hCollecting dvc-render<0.4.0,>=0.3.1\n", " Downloading dvc_render-0.3.1-py3-none-any.whl (18 kB)\n", "Collecting dvc-studio-client<1,>=0.6.1\n", " Downloading dvc_studio_client-0.8.0-py3-none-any.whl (10 kB)\n", "Collecting flufl.lock>=5\n", " Downloading flufl.lock-7.1.1-py3-none-any.whl (11 kB)\n", "Requirement already satisfied: platformdirs<4,>=3.1.1 
in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (3.1.1)\n", "Collecting networkx>=2.5\n", " Downloading networkx-3.1-py3-none-any.whl (2.1 MB)\n", "\u001b[K |████████████████████████████████| 2.1 MB 14.1 MB/s eta 0:00:01\n", "\u001b[?25hCollecting grandalf<1,>=0.7\n", " Downloading grandalf-0.8-py3-none-any.whl (41 kB)\n", "\u001b[K |████████████████████████████████| 41 kB 304 kB/s eta 0:00:01\n", "\u001b[?25hCollecting hydra-core>=1.1\n", " Downloading hydra_core-1.3.2-py3-none-any.whl (154 kB)\n", "\u001b[K |████████████████████████████████| 154 kB 14.3 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: pyparsing>=2.4.7 in /home/tomek/.local/lib/python3.9/site-packages (from dvc) (3.0.9)\n", "Collecting tomlkit>=0.11.1\n", " Downloading tomlkit-0.11.7-py3-none-any.whl (35 kB)\n", "Requirement already satisfied: requests>=2.22 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (2.27.1)\n", "Requirement already satisfied: packaging>=19 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc) (23.0)\n", "Collecting distro>=1.3\n", " Downloading distro-1.8.0-py3-none-any.whl (20 kB)\n", "Collecting shortuuid>=0.5\n", " Downloading shortuuid-1.0.11-py3-none-any.whl (10 kB)\n", "Collecting rich>=12\n", " Downloading rich-13.3.4-py3-none-any.whl (238 kB)\n", "\u001b[K |████████████████████████████████| 238 kB 11.6 MB/s eta 0:00:01\n", "\u001b[?25hCollecting dvc-task<1,>=0.2.0\n", " Downloading dvc_task-0.2.0-py3-none-any.whl (23 kB)\n", "Collecting configobj>=5.0.6\n", " Downloading configobj-5.0.8-py2.py3-none-any.whl (36 kB)\n", "Collecting iterative-telemetry>=0.0.7\n", " Downloading iterative_telemetry-0.0.8-py3-none-any.whl (10 kB)\n", "Requirement already satisfied: six in /home/tomek/miniconda3/lib/python3.9/site-packages (from configobj>=5.0.6->dvc) (1.16.0)\n", "Collecting dvc-objects<1,>=0.21.1\n", " Downloading dvc_objects-0.21.2-py3-none-any.whl (37 kB)\n", "Requirement already satisfied: attrs>=21.3.0 in 
/home/tomek/miniconda3/lib/python3.9/site-packages (from dvc-data<0.48,>=0.47.1->dvc) (22.2.0)\n", "Collecting dictdiffer>=0.8.1\n", " Using cached dictdiffer-0.9.0-py2.py3-none-any.whl (16 kB)\n", "Collecting nanotime>=0.5.2\n", " Using cached nanotime-0.5.2.tar.gz (3.2 kB)\n", "Collecting diskcache>=5.2.1\n", " Downloading diskcache-5.6.1-py3-none-any.whl (45 kB)\n", "\u001b[K |████████████████████████████████| 45 kB 905 kB/s eta 0:00:01\n", "\u001b[?25hCollecting sqltrie<1,>=0.3.1\n", " Downloading sqltrie-0.3.1-py3-none-any.whl (16 kB)\n", "Requirement already satisfied: fsspec[http] in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc-http>=2.29.0->dvc) (2023.3.0)\n", "Collecting aiohttp-retry>=2.5.0\n", " Downloading aiohttp_retry-2.8.3-py3-none-any.whl (9.8 kB)\n", "Requirement already satisfied: aiohttp in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (3.8.4)\n", "Requirement already satisfied: typing-extensions>=3.7.4 in /home/tomek/miniconda3/lib/python3.9/site-packages (from dvc-objects<1,>=0.21.1->dvc-data<0.48,>=0.47.1->dvc) (4.5.0)\n", "Collecting dulwich\n", " Downloading dulwich-0.21.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (505 kB)\n", "\u001b[K |████████████████████████████████| 505 kB 16.6 MB/s eta 0:00:01\n", "\u001b[?25hCollecting celery<6,>=5.2.0\n", " Downloading celery-5.2.7-py3-none-any.whl (405 kB)\n", "\u001b[K |████████████████████████████████| 405 kB 19.2 MB/s eta 0:00:01\n", "\u001b[?25hCollecting kombu<6,>=5.2.0\n", " Downloading kombu-5.2.4-py3-none-any.whl (189 kB)\n", "\u001b[K |████████████████████████████████| 189 kB 14.8 MB/s eta 0:00:01\n", "\u001b[?25hCollecting click-didyoumean>=0.0.3\n", " Downloading click_didyoumean-0.3.0-py3-none-any.whl (2.7 kB)\n", "Collecting billiard<4.0,>=3.6.4.0\n", " Downloading billiard-3.6.4.0-py3-none-any.whl (89 kB)\n", "\u001b[K |████████████████████████████████| 89 kB 3.8 MB/s eta 0:00:01\n", 
"\u001b[?25hCollecting vine<6.0,>=5.0.0\n", " Downloading vine-5.0.0-py2.py3-none-any.whl (9.4 kB)\n", "Collecting click-repl>=0.2.0\n", " Downloading click_repl-0.2.0-py3-none-any.whl (5.2 kB)\n", "Requirement already satisfied: click<9.0,>=8.0.3 in /home/tomek/miniconda3/lib/python3.9/site-packages (from celery<6,>=5.2.0->dvc-task<1,>=0.2.0->dvc) (8.1.3)\n", "Collecting click-plugins>=1.1.1\n", " Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)\n", "Requirement already satisfied: pytz>=2021.3 in /home/tomek/miniconda3/lib/python3.9/site-packages (from celery<6,>=5.2.0->dvc-task<1,>=0.2.0->dvc) (2022.7.1)\n", "Requirement already satisfied: prompt-toolkit in /home/tomek/miniconda3/lib/python3.9/site-packages (from click-repl>=0.2.0->celery<6,>=5.2.0->dvc-task<1,>=0.2.0->dvc) (3.0.38)\n", "Collecting atpublic>=2.3\n", " Downloading atpublic-3.1.1-py3-none-any.whl (4.8 kB)\n", "Collecting antlr4-python3-runtime==4.9.*\n", " Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)\n", "\u001b[K |████████████████████████████████| 117 kB 17.4 MB/s eta 0:00:01\n", "\u001b[?25hCollecting omegaconf<2.4,>=2.2\n", " Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)\n", "\u001b[K |████████████████████████████████| 79 kB 3.6 MB/s eta 0:00:01\n", "\u001b[?25hCollecting appdirs\n", " Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)\n", "Requirement already satisfied: filelock in /home/tomek/miniconda3/lib/python3.9/site-packages (from iterative-telemetry>=0.0.7->dvc) (3.9.1)\n", "Collecting amqp<6.0.0,>=5.0.9\n", " Downloading amqp-5.1.1-py3-none-any.whl (50 kB)\n", "\u001b[K |████████████████████████████████| 50 kB 2.7 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: PyYAML>=5.1.0 in /home/tomek/miniconda3/lib/python3.9/site-packages (from omegaconf<2.4,>=2.2->hydra-core>=1.1->dvc) (6.0)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from requests>=2.22->dvc) (1.26.9)\n", 
"Requirement already satisfied: charset-normalizer~=2.0.0 in /home/tomek/miniconda3/lib/python3.9/site-packages (from requests>=2.22->dvc) (2.0.4)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/tomek/miniconda3/lib/python3.9/site-packages (from requests>=2.22->dvc) (2022.12.7)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/tomek/miniconda3/lib/python3.9/site-packages (from requests>=2.22->dvc) (3.3)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting markdown-it-py<3.0.0,>=2.2.0\n", " Downloading markdown_it_py-2.2.0-py3-none-any.whl (84 kB)\n", "\u001b[K |████████████████████████████████| 84 kB 1.9 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/tomek/miniconda3/lib/python3.9/site-packages (from rich>=12->dvc) (2.14.0)\n", "Collecting mdurl~=0.1\n", " Downloading mdurl-0.1.2-py3-none-any.whl (10.0 kB)\n", "Requirement already satisfied: ruamel.yaml.clib>=0.2.6 in /home/tomek/miniconda3/lib/python3.9/site-packages (from ruamel.yaml>=0.17.11->dvc) (0.2.6)\n", "Collecting pygit2>=1.10.0\n", " Downloading pygit2-1.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB)\n", "\u001b[K |████████████████████████████████| 4.9 MB 13.6 MB/s eta 0:00:01\n", "\u001b[?25hCollecting asyncssh<3,>=2.13.1\n", " Downloading asyncssh-2.13.1-py3-none-any.whl (348 kB)\n", "\u001b[K |████████████████████████████████| 348 kB 38.4 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: gitpython>3 in /home/tomek/miniconda3/lib/python3.9/site-packages (from scmrepo<2,>=1.0.0->dvc) (3.1.31)\n", "Requirement already satisfied: cryptography>=3.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from asyncssh<3,>=2.13.1->scmrepo<2,>=1.0.0->dvc) (37.0.1)\n", "Requirement already satisfied: cffi>=1.12 in /home/tomek/miniconda3/lib/python3.9/site-packages (from cryptography>=3.1->asyncssh<3,>=2.13.1->scmrepo<2,>=1.0.0->dvc) (1.15.0)\n", "Requirement already satisfied: 
pycparser in /home/tomek/miniconda3/lib/python3.9/site-packages (from cffi>=1.12->cryptography>=3.1->asyncssh<3,>=2.13.1->scmrepo<2,>=1.0.0->dvc) (2.21)\n", "Requirement already satisfied: gitdb<5,>=4.0.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from gitpython>3->scmrepo<2,>=1.0.0->dvc) (4.0.10)\n", "Requirement already satisfied: smmap<6,>=3.0.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from gitdb<5,>=4.0.1->gitpython>3->scmrepo<2,>=1.0.0->dvc) (5.0.0)\n", "Collecting orjson\n", " Downloading orjson-3.8.10-cp39-cp39-manylinux_2_28_x86_64.whl (140 kB)\n", "\u001b[K |████████████████████████████████| 140 kB 39.5 MB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: setuptools in /home/tomek/miniconda3/lib/python3.9/site-packages (from zc.lockfile>=1.2.1->dvc) (61.2.0)\n", "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp->aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (4.0.2)\n", "Requirement already satisfied: multidict<7.0,>=4.5 in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp->aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (6.0.4)\n", "Requirement already satisfied: yarl<2.0,>=1.0 in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp->aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (1.8.2)\n", "Requirement already satisfied: aiosignal>=1.1.2 in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp->aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (1.3.1)\n", "Requirement already satisfied: frozenlist>=1.1.1 in /home/tomek/miniconda3/lib/python3.9/site-packages (from aiohttp->aiohttp-retry>=2.5.0->dvc-http>=2.29.0->dvc) (1.3.3)\n", "Requirement already satisfied: wcwidth in /home/tomek/miniconda3/lib/python3.9/site-packages (from prompt-toolkit->click-repl>=0.2.0->celery<6,>=5.2.0->dvc-task<1,>=0.2.0->dvc) (0.2.6)\n", "Building wheels for collected packages: antlr4-python3-runtime, nanotime\n", " Building wheel for 
antlr4-python3-runtime (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.3-py3-none-any.whl size=144575 sha256=94691fc7a4109d606872ddee3ae9575c3c9f9f945643a27b5514fce3176c552a\n", " Stored in directory: /home/tomek/.cache/pip/wheels/23/cf/80/f3efa822e6ab23277902ee9165fe772eeb1dfb8014f359020a\n", " Building wheel for nanotime (setup.py) ... \u001b[?25ldone\n", "\u001b[?25h Created wheel for nanotime: filename=nanotime-0.5.2-py3-none-any.whl size=2441 sha256=42933d16d8f6362832282dea6b0b44f2bdd41b0eb0d68de121660a8a0db1f96c\n", " Stored in directory: /home/tomek/.cache/pip/wheels/ee/1f/7c/610bdb7d5541b98d9743c5953e32681ef35dd54fadddd347e8\n", "Successfully built antlr4-python3-runtime nanotime\n", "Installing collected packages: vine, amqp, shortuuid, pygtrie, orjson, mdurl, kombu, funcy, click-repl, click-plugins, click-didyoumean, billiard, antlr4-python3-runtime, voluptuous, sqltrie, pygit2, psutil, pathspec, omegaconf, nanotime, markdown-it-py, dvc-objects, dulwich, distro, diskcache, dictdiffer, celery, atpublic, asyncssh, appdirs, aiohttp-retry, zc.lockfile, tomlkit, shtab, scmrepo, rich, pydot, networkx, iterative-telemetry, hydra-core, grandalf, flufl.lock, flatten-dict, dvc-task, dvc-studio-client, dvc-render, dvc-http, dvc-data, dpath, configobj, dvc\n", "Successfully installed aiohttp-retry-2.8.3 amqp-5.1.1 antlr4-python3-runtime-4.9.3 appdirs-1.4.4 asyncssh-2.13.1 atpublic-3.1.1 billiard-3.6.4.0 celery-5.2.7 click-didyoumean-0.3.0 click-plugins-1.1.1 click-repl-0.2.0 configobj-5.0.8 dictdiffer-0.9.0 diskcache-5.6.1 distro-1.8.0 dpath-2.1.5 dulwich-0.21.3 dvc-2.55.0 dvc-data-0.47.2 dvc-http-2.30.2 dvc-objects-0.21.2 dvc-render-0.3.1 dvc-studio-client-0.8.0 dvc-task-0.2.0 flatten-dict-0.4.2 flufl.lock-7.1.1 funcy-2.0 grandalf-0.8 hydra-core-1.3.2 iterative-telemetry-0.0.8 kombu-5.2.4 markdown-it-py-2.2.0 mdurl-0.1.2 nanotime-0.5.2 networkx-3.1 omegaconf-2.3.0 orjson-3.8.10 
pathspec-0.11.1 psutil-5.9.5 pydot-1.4.2 pygit2-1.12.0 pygtrie-2.5.0 rich-13.3.4 scmrepo-1.0.2 shortuuid-1.0.11 shtab-1.6.1 sqltrie-0.3.1 tomlkit-0.11.7 vine-5.0.0 voluptuous-0.13.1 zc.lockfile-3.0.post1\n" ] } ], "source": [ "!pip3 install dvc" ] }, { "cell_type": "markdown", "id": "20975d62", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's create a directory to hold our project:" ] }, { "cell_type": "code", "execution_count": 7, "id": "4d94e912", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "!rm -r -f IUM_10/sample-ml-project-2023\n", "!mkdir -p IUM_10/sample-ml-project-2023" ] }, { "cell_type": "code", "execution_count": 8, "id": "aae59ec2", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023\n" ] } ], "source": [ "#Jupyter notebook magic https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd\n", "%cd \"IUM_10/sample-ml-project-2023\"" ] }, { "cell_type": "markdown", "id": "199c0d92", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We initialize an empty Git repository (this step can be skipped when working in an existing repository):" ] }, { "cell_type": "code", "execution_count": 9, "id": "c13c525b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized empty Git repository in /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023/.git/\r\n" ] } ], "source": [ "!git init" ] }, { "cell_type": "markdown", "id": "c7155369", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now we initialize the DVC repository:" ] }, { "cell_type": "code", "execution_count": 10, "id": "44f28226", "metadata": { "slideshow": { "slide_type": "fragment" } }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized DVC repository.\n", "\n", "You can now commit the changes to git.\n", "\n", "\u001b[31m+---------------------------------------------------------------------+\n", "\u001b[0m\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m DVC has enabled anonymous aggregate usage analytics. \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m Read the analytics documentation (and how to opt-out) here: \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m <\u001b[36mhttps://dvc.org/doc/user-guide/analytics\u001b[39m> \u001b[31m|\u001b[0m\n", "\u001b[31m|\u001b[0m \u001b[31m|\u001b[0m\n", "\u001b[31m+---------------------------------------------------------------------+\n", "\u001b[0m\n", "\u001b[33mWhat's next?\u001b[39m\n", "\u001b[33m------------\u001b[39m\n", "- Check out the documentation: <\u001b[36mhttps://dvc.org/doc\u001b[39m>\n", "- Get help and share ideas: <\u001b[36mhttps://dvc.org/chat\u001b[39m>\n", "- Star us on GitHub: <\u001b[36mhttps://github.com/iterative/dvc\u001b[39m>\n", "\u001b[0m" ] } ], "source": [ "!dvc init" ] }, { "cell_type": "markdown", "id": "00bc72ed", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Zobaczmy jakie pliki dodał (również do repozytorium git) DVC.\n", "Ich opis znajdziemy tutaj: https://dvc.org/doc/user-guide/project-structure/internal-files" ] }, { "cell_type": "code", "execution_count": 11, "id": "d1aefe16", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch main\r\n", "\r\n", "No commits yet\r\n", "\r\n", "Changes to be committed:\r\n", " (use \"git rm --cached ...\" to unstage)\r\n", "\t\u001b[32mnew file: .dvc/.gitignore\u001b[m\r\n", "\t\u001b[32mnew file: .dvc/config\u001b[m\r\n", "\t\u001b[32mnew file: .dvcignore\u001b[m\r\n", "\r\n" ] } ], "source": [ "!git status" ] }, { "cell_type": "markdown", "id": "b16a62e6", "metadata": { 
"slideshow": { "slide_type": "fragment" } }, "source": [ "- `.dvc/config` - główny plik konfiguracyjny dvc\n", "- `.dvc/config.local` - nadpisuje wartości z `config`, do lokalnych zmian nie commitowanych do repo\n", "- `.dvc/.gitignore` - pliki dvc, które nie mają znaleźć się w repo\n", "- `.dvcignore` - dvc pomija pliki zdefiniowane w tym pliku (np. aby poprawić wydajność)" ] }, { "cell_type": "markdown", "id": "72e0a272", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Możemy teraz zacommitować zmiany w git:" ] }, { "cell_type": "code", "execution_count": 12, "id": "59780e99", "metadata": { "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[main (root-commit) 6b03a40] Initial commit\r\n", " 3 files changed, 6 insertions(+)\r\n", " create mode 100644 .dvc/.gitignore\r\n", " create mode 100644 .dvc/config\r\n", " create mode 100644 .dvcignore\r\n" ] } ], "source": [ "!git commit -m \"Initial commit\"" ] }, { "cell_type": "markdown", "id": "a8861abe", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Przygotujmy przykładowe dane, pobierając je z Kaggle:" ] }, { "cell_type": "code", "execution_count": 13, "id": "f05ece1b", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading iris.zip to /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023\n", " 0%| | 0.00/3.60k [00:00...\" to include in what will be committed)\r\n", "\t\u001b[31mdata/.gitignore\u001b[m\r\n", "\t\u001b[31mdata/Iris.csv.dvc\u001b[m\r\n", "\r\n", "nothing added to commit but untracked files present (use \"git add\" to track)\r\n" ] } ], "source": [ "!git status -u" ] }, { "cell_type": "markdown", "id": "8589fecf", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dodajmy pliki `data/Iris.csv.dvc data/.gitignore` do repozytorium git, zgodnie z 
sugestią DVC:" ] }, { "cell_type": "code", "execution_count": 17, "id": "460c4a17", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "!git add data/Iris.csv.dvc data/.gitignore" ] }, { "cell_type": "code", "execution_count": 18, "id": "80644077", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[main 812cb53] Dodano dane IRIS (DVC)\r\n", " 2 files changed, 5 insertions(+)\r\n", " create mode 100644 data/.gitignore\r\n", " create mode 100644 data/Iris.csv.dvc\r\n" ] } ], "source": [ "!git commit -m \"Dodano dane IRIS (DVC)\"" ] }, { "cell_type": "markdown", "id": "03899863", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Plik `*.dvc` zawiera m.in. hash pliku. Więcej o plikach `*.dvc`: [link](https://dvc.org/doc/user-guide/project-structure/dvc-files)" ] }, { "cell_type": "code", "execution_count": null, "id": "8cb2ba7c", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# %load data/Iris.csv.dvc\n", "outs:\n", "- md5: 717820ef0af287ff346c5cabfb4c612c\n", " size: 5107\n", " path: Iris.csv\n" ] }, { "cell_type": "markdown", "id": "0b421d45", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Oryginalny plik `Iris.csv` został przeniesiony do katalogu ./dvc/cache/{wartość hash pliku) i podlinkowany z powrotem do oryginalnej lokalizacji. Sposób tworzenia linków może być [różny w zależności od systemu plików](https://dvc.org/doc/user-guide/large-dataset-optimization)." 
] }, { "cell_type": "code", "execution_count": 25, "id": "1d471f3a", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 8\r\n", "-r--r--r-- 1 tomek tomek 5107 Sep 19 2019 7820ef0af287ff346c5cabfb4c612c\r\n" ] } ], "source": [ "!ls -l .dvc/cache/71" ] }, { "cell_type": "code", "execution_count": 33, "id": "32531aa8", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\r\n", "1,5.1,3.5,1.4,0.2,Iris-setosa\r\n", "2,4.9,3.0,1.4,0.2,Iris-setosa\r\n" ] } ], "source": [ "!head -n 3 .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c" ] }, { "cell_type": "code", "execution_count": 35, "id": "2396c762", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enumerating objects: 11, done.\n", "Counting objects: 100% (11/11), done.\n", "Delta compression using up to 4 threads\n", "Compressing objects: 100% (8/8), done.\n", "Writing objects: 100% (11/11), 889 bytes | 889.00 KiB/s, done.\n", "Total 11 (delta 1), reused 0 (delta 0), pack-reused 0\n", "remote: \n", "remote: Create a new pull request for 'main':\u001b[K\n", "remote: https://git.wmi.amu.edu.pl/tzietkiewicz/sample-ml-project/compare/master...main\u001b[K\n", "remote: \n", "remote: . 
Processing 1 references\u001b[K\n", "remote: Processed 1 references in total\u001b[K\n", "To git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git\n", " * [new branch] main -> main\n", "Branch 'main' set up to track remote branch 'main' from 'origin'.\n" ] } ], "source": [ "!git remote add origin git@git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git\n", "!git push --set-upstream origin main" ] }, { "cell_type": "markdown", "id": "901e8e90", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## dvc remote\n", " - to send the actual files tracked by DVC to a remote location (from which they can be fetched, e.g. by a CI system or by other users), such a location must first be configured\n", " - this is done with the [`dvc remote add`](https://dvc.org/doc/command-reference/remote/add) command\n", " - we will use a local \"remote\". Here it is simply the previously created `/dvcstore` directory. The same directory exists on our Jenkins server; it has to be mounted into the Docker container, of course\n", " - in real-world use we would instead give the location of a resource reachable over the internet, such as an SFTP server or an AWS S3 path" 
] }, { "cell_type": "markdown", "id": "53429521", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Supported remote storage types (remotes): https://dvc.org/doc/command-reference/remote/add#supported-storage-types\n", " - Amazon S3\n", " - S3-compatible storage\n", " - Microsoft Azure Blob Storage\n", " - Google Drive\n", " - Google Cloud Storage\n", " - Aliyun OSS\n", " - SSH\n", " - HDFS\n", " - WebHDFS\n", " - HTTP\n", " - WebDAV\n", " - local remote" ] }, { "cell_type": "markdown", "id": "507e3a09", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Adding a local remote" ] }, { "cell_type": "code", "execution_count": 71, "id": "a16f2bfa", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting 'my_local_remote' as a default remote.\n", "\u001b[0m" ] } ], "source": [ "!dvc remote add -d my_local_remote /dvcstore" ] }, { "cell_type": "code", "execution_count": 25, "id": "9c3deeaf", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch master\r\n", "nothing to commit, working tree clean\r\n" ] } ], "source": [ "!git status" ] }, { "cell_type": "code", "execution_count": 34, "id": "899eac7d", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "On branch main\r\n", "nothing to commit, working tree clean\r\n" ] } ], "source": [ "!git add .dvc/config\n", "!git commit -m \"Added DVC remote\"" ] }, { "cell_type": "markdown", "id": "8c556c96", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## dvc push\n", "Once a remote is configured, we can push files to it with the `dvc push` command:" ] }, { "cell_type": "code", "execution_count": 28, "id": "c7f24f75", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ " 0% Transferring| |0/1 [00:00\n", "\tchanged outs:\n", "\t\tmodified: data/Iris.csv\n", "\u001b[0m" ] } ], "source": [ "!dvc status" ] }, { "cell_type": "code", "execution_count": 15, "id": "bf6c1067", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K\u001b[32m⠹\u001b[0m Checking graph \u001b[32m⠋\u001b[0m Checking graph\n", "Adding... \n", "!\u001b[A\n", " 0% Checking cache in '/home/tomek/repos/aitech-ium/IUM_10/sample-ml-project/.d\u001b[A\n", " \u001b[A\n", "!\u001b[A\n", " 0%| |Transferring 0/1 [00:00 'data/data.xml'\n", " 0% Downloading| |0/1 [00:00 2021.5.30-ha878542_0\n", " certifi 2020.12.5-py39hf3d152e_1 --> 2021.5.30-py39hf3d152e_0\n", " dvc 2.1.0-py39hf3d152e_0 --> 2.3.0-py39hf3d152e_0\n", " fsspec 0.9.0-pyhd8ed1ab_2 --> 2021.5.0-pyhd8ed1ab_0\n", "\n", "\n", "\n", "Downloading and Extracting Packages\n", "certifi-2021.5.30 | 141 KB | ##################################### | 100% \n", "fsspec-2021.5.0 | 77 KB | ##################################### | 100% \n", "dvc-2.3.0 | 542 KB | ##################################### | 100% \n", "invoke-1.5.0 | 137 KB | ##################################### | 100% \n", "paramiko-2.7.2 | 135 KB | ##################################### | 100% \n", "bcrypt-3.2.0 | 44 KB | ##################################### | 100% \n", "pynacl-1.4.0 | 1.3 MB | ##################################### | 100% \n", "dvc-ssh-2.3.0 | 9 KB | ##################################### | 100% \n", "ca-certificates-2021 | 136 KB | ##################################### | 100% \n", "Preparing transaction: done\n", "Verifying transaction: done\n", "Executing transaction: done\n", "\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "conda install -c conda-forge dvc-ssh" ] }, { "cell_type": "markdown", "id": "04c41da0", "metadata": { "slideshow": { "slide_type": "slide" } }, 
"source": [ "Dodajemy remote:" ] }, { "cell_type": "code", "execution_count": 17, "id": "e9a04876", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting 'ium_ssh_remote' as a default remote.\n", "\u001b[0m" ] } ], "source": [ "!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl" ] }, { "cell_type": "code", "execution_count": 18, "id": "e3f27bbb", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ium_ssh_remote\tssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl\n", "\u001b[0m" ] } ], "source": [ "!dvc remote list" ] }, { "cell_type": "markdown", "id": "c92edd7b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Zapisujemy hasło:" ] }, { "cell_type": "code", "execution_count": 19, "id": "5b2fa175", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0m" ] } ], "source": [ "!dvc remote modify --local ium_ssh_remote password IUM@2021" ] }, { "cell_type": "markdown", "id": "8b83049b", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pushujemy do skonfigurowanego remote:" ] }, { "cell_type": "code", "execution_count": 20, "id": "ea6e16fa", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0% Transferring| |0/1 [00:00