aitech-ium/IUM_10.DVC.ipynb
Tomasz 425c9defb6
Updated DVC slides
2023-04-26 13:44:55 +02:00

Machine Learning Engineering

10. DVC [labs]

Tomasz Ziętkiewicz (2023)

DVC - Data Version Control

  • dvc.org
  • "Version Control System for Machine Learning Projects"
  • Open source
  • It provides:
    • versioning of data and models: "Git for data and models"
    • building pipelines that define how to build, train, and evaluate models: "Makefile for machine learning"
    • tracking and comparing metrics and parameters
  • tightly integrated with Git
  • works independently of the language, libraries, and operating system in use
  • 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs&t=197s

Tracking files with DVC

  • Large files, such as input data or model files, are hard to manage with Git.
  • Git has the LFS (Large File Storage) extension, which partly solves this problem.
    • The files themselves are stored on a dedicated remote server; the repository holds only pointers to them plus some metadata
    • GitHub has integrated LFS, with a 1 GB limit for free accounts
  • DVC takes an approach similar to LFS, but:
    • files can be stored on almost any server, or locally
    • there is no file size limit (Git LFS on GitHub has a 2 GB limit)
    • DVC provides additional tooling for tracking files and their links to experiment results
    • for more, see here

Installation and initialization

!pip3 install dvc
Collecting dvc
  Downloading dvc-2.55.0-py3-none-any.whl (419 kB)
[...]
Successfully installed aiohttp-retry-2.8.3 amqp-5.1.1 antlr4-python3-runtime-4.9.3 appdirs-1.4.4 asyncssh-2.13.1 atpublic-3.1.1 billiard-3.6.4.0 celery-5.2.7 click-didyoumean-0.3.0 click-plugins-1.1.1 click-repl-0.2.0 configobj-5.0.8 dictdiffer-0.9.0 diskcache-5.6.1 distro-1.8.0 dpath-2.1.5 dulwich-0.21.3 dvc-2.55.0 dvc-data-0.47.2 dvc-http-2.30.2 dvc-objects-0.21.2 dvc-render-0.3.1 dvc-studio-client-0.8.0 dvc-task-0.2.0 flatten-dict-0.4.2 flufl.lock-7.1.1 funcy-2.0 grandalf-0.8 hydra-core-1.3.2 iterative-telemetry-0.0.8 kombu-5.2.4 markdown-it-py-2.2.0 mdurl-0.1.2 nanotime-0.5.2 networkx-3.1 omegaconf-2.3.0 orjson-3.8.10 pathspec-0.11.1 psutil-5.9.5 pydot-1.4.2 pygit2-1.12.0 pygtrie-2.5.0 rich-13.3.4 scmrepo-1.0.2 shortuuid-1.0.11 shtab-1.6.1 sqltrie-0.3.1 tomlkit-0.11.7 vine-5.0.0 voluptuous-0.13.1 zc.lockfile-3.0.post1

Let's create a directory in which we will keep our project:

!rm -r -f IUM_10/sample-ml-project-2023
!mkdir -p IUM_10/sample-ml-project-2023
#Jupyter notebook magic https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd
%cd "IUM_10/sample-ml-project-2023"
/home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023

We initialize an empty Git repository (we can also skip this step and work in an existing repository):

!git init
Initialized empty Git repository in /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023/.git/

Now we initialize the DVC repository:

!dvc init
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


Let's see which files DVC added (they are also staged in the Git repository). They are described here: https://dvc.org/doc/user-guide/project-structure/internal-files

!git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

  • .dvc/config - the main DVC configuration file
  • .dvc/config.local - overrides values from config; meant for local changes that are not committed to the repository
  • .dvc/.gitignore - DVC files that should not end up in the repository
  • .dvcignore - DVC skips the files listed here (e.g. to improve performance)

We can now commit the changes to Git:

!git commit -m "Initial commit"
[main (root-commit) 6b03a40] Initial commit
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

Let's prepare some sample data by downloading it from Kaggle:

!kaggle datasets download -d uciml/iris
!unzip -o iris.zip
!rm database.sqlite iris.zip
!mkdir -p data
!mv Iris.csv data/
Downloading iris.zip to /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project-2023/IUM_10/sample-ml-project-2023
  0%|                                               | 0.00/3.60k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 3.38MB/s]
Archive:  iris.zip
  inflating: Iris.csv                
  inflating: database.sqlite         

Now we add the data file(s) to DVC:

!dvc add data/Iris.csv
100% Adding...|████████████████████████████████████████|1/1 [00:00, 13.53file/s]

To track the changes with git, run:

	git add data/Iris.csv.dvc data/.gitignore

To enable auto staging, run:

	dvc config core.autostage true

  • DVC created the file data/Iris.csv.dvc and added the original file to .gitignore
  • Only the *.dvc file, which contains a pointer to the actual file, will be present in the repository
!git status -u
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/.gitignore
	data/Iris.csv.dvc

nothing added to commit but untracked files present (use "git add" to track)

Let's add data/Iris.csv.dvc and data/.gitignore to the Git repository, as DVC suggests:

!git add data/Iris.csv.dvc data/.gitignore
!git commit -m "Dodano dane IRIS (DVC)"
[main 812cb53] Dodano dane IRIS (DVC)
 2 files changed, 5 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/Iris.csv.dvc

The *.dvc file contains, among other things, the file's hash. More about *.dvc files: link

# %load data/Iris.csv.dvc
outs:
- md5: 717820ef0af287ff346c5cabfb4c612c
  size: 5107
  path: Iris.csv

The original Iris.csv file was moved to the .dvc/cache directory, under a path derived from its hash (the first two hash characters name a subdirectory, the remaining characters name the file), and linked back to its original location. How the links are created may differ depending on the file system.
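
The mapping from a hash to a cache path can be sketched in a few lines of Python. This is only an illustration of the layout visible in this notebook (DVC 2.55); the helper names `md5_of_file` and `cache_path_for` are made up for the example, and DVC's own hashing may differ in details (for instance in how line endings are treated), so the digest computed here is not guaranteed to equal the one DVC records. Other DVC versions may also use a different cache layout.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Map a hash to <cache_dir>/<first two chars>/<remaining chars>."""
    return cache_dir / md5_hex[:2] / md5_hex[2:]


# Hash taken from data/Iris.csv.dvc above.
print(cache_path_for("717820ef0af287ff346c5cabfb4c612c").as_posix())
# .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c
```

Splitting off the first two characters keeps any single cache directory from accumulating too many entries.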

!ls -l .dvc/cache/71
total 8
-r--r--r-- 1 tomek tomek 5107 Sep 19  2019 7820ef0af287ff346c5cabfb4c612c
!head -n 3 .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
!git remote add origin git@git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git
!git push --set-upstream origin main
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 4 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (11/11), 889 bytes | 889.00 KiB/s, done.
Total 11 (delta 1), reused 0 (delta 0), pack-reused 0
remote: 
remote: Create a new pull request for 'main':
remote:   https://git.wmi.amu.edu.pl/tzietkiewicz/sample-ml-project/compare/master...main
remote: 
remote: . Processing 1 references
remote: Processed 1 references in total
To git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git
 * [new branch]      main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.

dvc remote

  • to send the actual files tracked by DVC to a remote location (from which they can be fetched, e.g. by a CI system or by other users), we need to have such a location configured
  • the dvc remote add command does this
  • we will use a local "remote". Here it is simply the previously created /dvcstore directory. The same directory exists on our Jenkins; it has to be mounted in Docker, of course
  • in real-world use we would give a path to a resource reachable over the internet, such as an SFTP server or an AWS S3 path

Supported remote storage types (remotes): https://dvc.org/doc/command-reference/remote/add#supported-storage-types

  • Amazon S3
  • S3-compatible storage
  • Microsoft Azure Blob Storage
  • Google Drive
  • Google Cloud Storage
  • Aliyun OSS
  • SSH
  • HDFS
  • WebHDFS
  • HTTP
  • WebDAV
  • local remote

Adding a local remote

!dvc remote add -d my_local_remote /dvcstore
Setting 'my_local_remote' as a default remote.

!git status
On branch master
nothing to commit, working tree clean
!git add .dvc/config
!git commit -m "Added DVC remote"
On branch main
nothing to commit, working tree clean

dvc push

Once a "remote" is configured, we can push files to it with the dvc push command:

!dvc push
1 file pushed                                                                   

!tree /dvcstore
/dvcstore
└── 71
    └── 7820ef0af287ff346c5cabfb4c612c

1 directory, 1 file
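
For a local remote, the result of the push looks like a mirror of the cache layout. The sketch below imitates that effect in plain Python under simplifying assumptions: `push_to_local_remote` is a hypothetical helper, it copies every object found in the cache, and it skips anything already present on the remote. The real dvc push is more selective and supports many storage backends, so treat this only as an illustration of the idea.

```python
import shutil
import tempfile
from pathlib import Path


def push_to_local_remote(cache_dir: Path, remote_dir: Path) -> int:
    """Copy cache objects missing from remote_dir; return how many were copied."""
    pushed = 0
    for obj in sorted(cache_dir.glob("*/*")):
        if not obj.is_file():
            continue
        target = remote_dir / obj.relative_to(cache_dir)
        if target.exists():
            continue  # an object with this hash is already on the remote
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        pushed += 1
    return pushed


# Demo in a temporary directory; the file content is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "dvcstore"
    (cache / "71").mkdir(parents=True)
    (cache / "71" / "7820ef0af287ff346c5cabfb4c612c").write_text("Id,SepalLengthCm\n")
    first = push_to_local_remote(cache, remote)
    second = push_to_local_remote(cache, remote)
    print(first, second)  # 1 0
```

The second call copies nothing because objects are addressed by their hash: if a file with that name already exists on the remote, its content is assumed to be the same.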

dvc pull

To fetch data from DVC (e.g. in another location, or as another user), we need to:

  • clone the Git repository (to get, among other things, the *.dvc files)
  • run dvc pull

Adding new files and modifying existing ones works much like with regular files tracked by Git, except that we use the dvc command instead of git and additionally remember to manage the *.dvc files with Git:

!head -n -1 data/Iris.csv | sponge data/Iris.csv
!git status
On branch master
nothing to commit, working tree clean
!dvc status
data/Iris.csv.dvc:
	changed outs:
		modified:           data/Iris.csv
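
Git sees no change here because data/Iris.csv is ignored, while DVC notices that the file no longer matches what the *.dvc file recorded. A rough sketch of such a check, assuming it boils down to comparing the recorded size and MD5 with the current file (`is_modified` is a hypothetical helper, not DVC's actual implementation):

```python
import hashlib
import tempfile
from pathlib import Path


def is_modified(data_file: Path, recorded_md5: str, recorded_size: int) -> bool:
    """Return True when the file no longer matches the recorded size or MD5."""
    if data_file.stat().st_size != recorded_size:
        return True  # cheap check first: a size change implies a content change
    return hashlib.md5(data_file.read_bytes()).hexdigest() != recorded_md5


# Demo on a tiny placeholder CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "Iris.csv"
    data.write_bytes(b"Id,Species\n1,Iris-setosa\n2,Iris-setosa\n")
    md5_hex = hashlib.md5(data.read_bytes()).hexdigest()
    size = data.stat().st_size
    before = is_modified(data, md5_hex, size)
    # Drop the last line, like `head -n -1 data/Iris.csv | sponge data/Iris.csv`.
    lines = data.read_bytes().splitlines(keepends=True)
    data.write_bytes(b"".join(lines[:-1]))
    after = is_modified(data, md5_hex, size)
    print(before, after)  # False True
```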

!dvc add data/Iris.csv
100% Adding...|████████████████████████████████████████|1/1 [00:00, 11.09file/s]

To track the changes with git, run:

    git add data/Iris.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

!git add data/Iris.csv.dvc
!git commit -m "Removed last line from Iris dataset"
[master 5379e3b] Removed last line from Iris dataset
 1 file changed, 2 insertions(+), 2 deletions(-)

dvc checkout

  • We use dvc checkout together with git checkout to switch the branch we are working on.
  • DVC replaces the versions of the files it tracks with those from the other branch (provided the files differ and the *.dvc files differ between the branches)
  • switching branches with Git changes the *.dvc files (if they differ), and dvc checkout copies/links files from the .dvc/cache directory whose hashes match those in the *.dvc files
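
The interplay between the pointer file and the cache can be imitated with a small Python sketch. It makes several simplifying assumptions: `read_md5` is a naive line scan rather than a YAML parser, `checkout` always copies instead of linking, the target name is derived by dropping the .dvc suffix rather than reading the path field, and the two hashes in the demo are made-up placeholders. It is not how DVC is implemented, only the idea behind it.

```python
import shutil
import tempfile
from pathlib import Path


def read_md5(dvc_file: Path) -> str:
    """Extract the md5 value from a minimal *.dvc file (naive line scan)."""
    for line in dvc_file.read_text().splitlines():
        key, _, value = line.strip().lstrip("- ").partition(":")
        if key == "md5":
            return value.strip()
    raise ValueError(f"no md5 entry in {dvc_file}")


def checkout(dvc_file: Path, cache_dir: Path) -> Path:
    """Restore the data file next to dvc_file from the cache entry named by its md5."""
    md5_hex = read_md5(dvc_file)
    source = cache_dir / md5_hex[:2] / md5_hex[2:]
    target = dvc_file.with_suffix("")  # Iris.csv.dvc -> Iris.csv
    shutil.copyfile(source, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / ".dvc" / "cache"
    # Two cached versions of the data under placeholder (not real) hashes.
    versions = {"aa" + "0" * 30: "full dataset\n", "bb" + "1" * 30: "trimmed dataset\n"}
    for md5_hex, content in versions.items():
        obj = cache / md5_hex[:2] / md5_hex[2:]
        obj.parent.mkdir(parents=True)
        obj.write_text(content)
    pointer = root / "Iris.csv.dvc"
    restored = []
    for md5_hex in versions:  # stands in for `git checkout` rewriting the pointer
        pointer.write_text(f"outs:\n- md5: {md5_hex}\n  path: Iris.csv\n")
        restored.append(checkout(pointer, cache).read_text())
    print(restored)  # ['full dataset\n', 'trimmed dataset\n']
```

Git only ever changes the small pointer file; the data in the workspace follows once the matching object is copied or linked out of the cache.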

Sharing data between projects

  • with the dvc import and dvc update commands we can add, and later update, files tracked by DVC in another repository
!dvc import https://github.com/iterative/dataset-registry \
             get-started/data.xml -o data/data.xml
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
 68%|██████▊   |get-started/data.xml      24.6M/36.1M [00:14<00:01,    6.40MB/s]
 70%|███████   |get-started/data.xml      25.4M/36.1M [00:14<00:01,    6.51MB/s]
 72%|███████▏  |get-started/data.xml      26.0M/36.1M [00:14<00:01,    5.75MB/s]
 74%|███████▎  |get-started/data.xml      26.6M/36.1M [00:14<00:02,    4.26MB/s]
 75%|███████▍  |get-started/data.xml      27.1M/36.1M [00:14<00:02,    3.53MB/s]
 76%|███████▌  |get-started/data.xml      27.5M/36.1M [00:15<00:02,    3.26MB/s]
 77%|███████▋  |get-started/data.xml      27.9M/36.1M [00:15<00:02,    3.00MB/s]
 78%|███████▊  |get-started/data.xml      28.2M/36.1M [00:15<00:02,    2.95MB/s]
 79%|███████▉  |get-started/data.xml      28.5M/36.1M [00:15<00:02,    2.91MB/s]
 80%|███████▉  |get-started/data.xml      28.8M/36.1M [00:15<00:02,    2.88MB/s]
 81%|████████  |get-started/data.xml      29.1M/36.1M [00:15<00:02,    2.86MB/s]
 81%|████████▏ |get-started/data.xml      29.4M/36.1M [00:15<00:02,    2.84MB/s]
 82%|████████▏ |get-started/data.xml      29.8M/36.1M [00:16<00:02,    2.83MB/s]
 83%|████████▎ |get-started/data.xml      30.1M/36.1M [00:16<00:02,    2.83MB/s]
 84%|████████▍ |get-started/data.xml      30.4M/36.1M [00:16<00:02,    2.83MB/s]
 85%|████████▍ |get-started/data.xml      30.7M/36.1M [00:16<00:02,    2.83MB/s]
 86%|████████▌ |get-started/data.xml      31.0M/36.1M [00:16<00:01,    2.83MB/s]
 87%|████████▋ |get-started/data.xml      31.3M/36.1M [00:16<00:01,    2.83MB/s]
 88%|████████▊ |get-started/data.xml      31.6M/36.1M [00:16<00:01,    2.83MB/s]
 88%|████████▊ |get-started/data.xml      31.9M/36.1M [00:16<00:01,    2.84MB/s]
 89%|████████▉ |get-started/data.xml      32.2M/36.1M [00:16<00:01,    2.85MB/s]
 90%|█████████ |get-started/data.xml      32.6M/36.1M [00:17<00:01,    2.85MB/s]
 91%|█████████ |get-started/data.xml      32.9M/36.1M [00:17<00:01,    2.86MB/s]
 92%|█████████▏|get-started/data.xml      33.2M/36.1M [00:17<00:01,    2.86MB/s]
 93%|█████████▎|get-started/data.xml      33.5M/36.1M [00:17<00:00,    2.87MB/s]
 94%|█████████▎|get-started/data.xml      33.8M/36.1M [00:17<00:00,    2.87MB/s]
 94%|█████████▍|get-started/data.xml      34.1M/36.1M [00:17<00:00,    2.87MB/s]
 95%|█████████▌|get-started/data.xml      34.4M/36.1M [00:17<00:00,    2.87MB/s]
 96%|█████████▌|get-started/data.xml      34.8M/36.1M [00:17<00:00,    2.87MB/s]
 97%|█████████▋|get-started/data.xml      35.1M/36.1M [00:17<00:00,    2.87MB/s]
 98%|█████████▊|get-started/data.xml      35.4M/36.1M [00:18<00:00,    2.87MB/s]
 99%|█████████▉|get-started/data.xml      35.7M/36.1M [00:18<00:00,    2.88MB/s]
100%|█████████▉|get-started/data.xml      36.0M/36.1M [00:18<00:00,    2.87MB/s]
                                                                                
To track the changes with git, run:

	git add data/.gitignore data/data.xml.dvc

!dvc status
Data and pipelines are up to date.                                              

ls -l data
total 37020
-rw-rw-r-- 1 tomek tomek 37891850 maj 31 11:10 data.xml
-rw-rw-r-- 1 tomek tomek      284 maj 31 11:10 data.xml.dvc
-rw-rw-r-- 1 tomek tomek     5072 maj 31 11:01 Iris.csv
-rw-rw-r-- 1 tomek tomek       76 maj 31 11:01 Iris.csv.dvc
# %load data/data.xml.dvc
md5: a7cd139231cc35ed63541ce3829b96db
frozen: true
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
    rev_lock: ba014f40e29670421a67cb1c47543f402348aa13
outs:
- md5: a304afb96060aad90176268345e10355
  size: 37891850
  path: data.xml

DVC pipelines

  • introduction: https://youtu.be/71IGzyH95UY
  • Getting started: https://dvc.org/doc/start/data-pipelines
  • DVC pipelines let us build (with the dvc run command; newer DVC versions use dvc stage add instead) or define (by editing the dvc.yaml file) a dependency graph between the steps performed in our project (such as "data preparation", "training", "evaluation")
  • a pipeline defined this way can then be run with the dvc repro command
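
As an illustration of the dependency graph described above, here is a minimal sketch of a two-stage dvc.yaml, written from the shell. The stage names (prepare, train), the scripts (prepare.py, train.py) and the paths data/prepared and model.pkl are assumptions made for this example and are not part of the course project. Note that the command overwrites any dvc.yaml already present in the current directory.

```shell
# Sketch only: stage names, script names and paths are hypothetical.
cat > dvc.yaml <<'EOF'
stages:
  prepare:
    cmd: python prepare.py data/data.xml data/prepared
    deps:
      - prepare.py
      - data/data.xml
    outs:
      - data/prepared
  train:
    cmd: python train.py data/prepared model.pkl
    deps:
      - train.py
      - data/prepared      # output of "prepare", so "train" runs after it
    outs:
      - model.pkl
EOF

# With the scripts in place, the whole graph would be run with:
#   dvc repro
```

Because train lists data/prepared among its deps and prepare lists it among its outs, DVC can infer the order of the stages and re-run only those whose dependencies have changed.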

Tasks [10+5 pts]

  1. Initialize a DVC repository inside your project repository [1 pt]
  2. Add your project's data file(s) to DVC [1 pt]
  3. Configure a remote (the configuration details are given below) [3 pts]
  4. Create/define a dvc.yaml file describing the steps performed in your project and add it to the repository. Separate out at least 2 steps (e.g. data preparation/training) linked to each other by dependencies (use the "Getting started" materials, link above) [5 pts (optional)]
  5. Create a Jenkins project (s1233456-dvc) in which you clone the repository, download the DVC-tracked files (with dvc pull) and run the pipeline (with dvc repro) [5 pts]

SSH remote

One of the remote types supported by DVC is SFTP/SSH. To make it available, a user ium-sftp was created on the server tzietkiewicz.vm.wmi.amu.edu.pl and an SFTP server was configured there. An SSH key was also generated for this user and added as a "Jenkins credential" (see the description of the Jenkins configuration below).

Locally

We will need additional dependencies (details)

conda install dvc-ssh

or

pip install dvc[ssh] paramiko

conda install -c conda-forge dvc-ssh
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/tomek/miniconda3

  added / updated specs:
    - dvc-ssh


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    bcrypt-3.2.0               |   py39h3811e60_1          44 KB  conda-forge
    ca-certificates-2021.5.30  |       ha878542_0         136 KB  conda-forge
    certifi-2021.5.30          |   py39hf3d152e_0         141 KB  conda-forge
    dvc-2.3.0                  |   py39hf3d152e_0         542 KB  conda-forge
    dvc-ssh-2.3.0              |   py39hf3d152e_0           9 KB  conda-forge
    fsspec-2021.5.0            |     pyhd8ed1ab_0          77 KB  conda-forge
    invoke-1.5.0               |     pyhd3deb0d_0         137 KB  conda-forge
    paramiko-2.7.2             |     pyh9f0ad1d_0         135 KB  conda-forge
    pynacl-1.4.0               |   py39h3811e60_2         1.3 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

  bcrypt             conda-forge/linux-64::bcrypt-3.2.0-py39h3811e60_1
  dvc-ssh            conda-forge/linux-64::dvc-ssh-2.3.0-py39hf3d152e_0
  invoke             conda-forge/noarch::invoke-1.5.0-pyhd3deb0d_0
  paramiko           conda-forge/noarch::paramiko-2.7.2-pyh9f0ad1d_0
  pynacl             conda-forge/linux-64::pynacl-1.4.0-py39h3811e60_2

The following packages will be UPDATED:

  ca-certificates                      2020.12.5-ha878542_0 --> 2021.5.30-ha878542_0
  certifi                          2020.12.5-py39hf3d152e_1 --> 2021.5.30-py39hf3d152e_0
  dvc                                  2.1.0-py39hf3d152e_0 --> 2.3.0-py39hf3d152e_0
  fsspec                                 0.9.0-pyhd8ed1ab_2 --> 2021.5.0-pyhd8ed1ab_0



Downloading and Extracting Packages
certifi-2021.5.30    | 141 KB    | ##################################### | 100% 
fsspec-2021.5.0      | 77 KB     | ##################################### | 100% 
dvc-2.3.0            | 542 KB    | ##################################### | 100% 
invoke-1.5.0         | 137 KB    | ##################################### | 100% 
paramiko-2.7.2       | 135 KB    | ##################################### | 100% 
bcrypt-3.2.0         | 44 KB     | ##################################### | 100% 
pynacl-1.4.0         | 1.3 MB    | ##################################### | 100% 
dvc-ssh-2.3.0        | 9 KB      | ##################################### | 100% 
ca-certificates-2021 | 136 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.

Add the remote:

!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl
Setting 'ium_ssh_remote' as a default remote.

!dvc remote list
ium_ssh_remote	ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl


Save the password (the --local option writes it to .dvc/config.local, which is not tracked by Git):

!dvc remote modify --local ium_ssh_remote password IUM@2021


Push to the configured remote:

!dvc push
  0% Transferring|                                   |0/1 [00:00<?,     ?file/s]
!
  0%|          |4705c4d470a4d9dd152808e5e9f56f     0.00/? [00:00<?,        ?B/s]
  0%|          |4705c4d470a4d9dd152808e5e9f56f 0.00/4.92k [00:00<?,        ?B/s]
1 file pushed                                                                   
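
The remote setup above can also be wrapped in a small shell function, so that the password is read from the environment instead of being typed into a notebook cell. This is only a sketch: the function name configure_ium_remote and the variables IUM_SFTP_PASS and DVC are assumptions introduced for the example, while the dvc remote add and dvc remote modify calls are the same ones used above.

```shell
# Sketch: configure the SSH remote without hard-coding the password.
# Assumptions: the function name and the IUM_SFTP_PASS variable are hypothetical;
# the DVC variable (default "dvc") can be overridden, e.g. DVC=echo for a dry run.
configure_ium_remote() {
    dvc_cmd="${DVC:-dvc}"
    remote_url="ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl"

    # Fail early if the password was not exported in the environment.
    if [ -z "${IUM_SFTP_PASS:-}" ]; then
        echo "IUM_SFTP_PASS is not set" >&2
        return 1
    fi

    # -f overwrites an existing remote of the same name, -d makes it the default.
    "$dvc_cmd" remote add -f -d ium_ssh_remote "$remote_url" || return 1
    # --local stores the secret in .dvc/config.local, which DVC keeps out of Git.
    "$dvc_cmd" remote modify --local ium_ssh_remote password "$IUM_SFTP_PASS" || return 1
}

# Usage (inside an initialized DVC repository):
#   export IUM_SFTP_PASS='...'   # set once in the shell, not saved in the notebook
#   configure_ium_remote && dvc push
```

Calling the function with DVC=echo prints the two commands instead of executing them, which is a cheap way to check the arguments before touching the real configuration.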


Jenkins

In Jenkins, the "Credentials" mechanism can be used to pass a password or a private key securely.

Such credentials for the ium-sftp user have been created on Jenkins:

How to use "Credentials" in a Jenkinsfile: https://www.jenkins.io/doc/book/pipeline/jenkinsfile/#for-other-credential-types

The SSH key can be used like this:

withCredentials(
    [sshUserPrivateKey(credentialsId: '48ac7004-216e-4260-abba-1fe5db753e18', keyFileVariable: 'IUM_SFTP_KEY', passphraseVariable: '', usernameVariable: '')]) {
                sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'
                sh 'dvc remote modify --local ium_ssh_remote keyfile $IUM_SFTP_KEY'
                sh 'dvc pull'}

Secret text can be used like this:

    withCredentials([string(credentialsId: 'ium-sftp-password', variable: 'IUM_SFTP_PASS')]) {
                sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'
                sh 'dvc remote modify --local ium_ssh_remote password $IUM_SFTP_PASS'
                sh 'dvc pull'
    }

Example configuration: