Machine Learning Engineering

10. DVC [labs]

Tomasz Ziętkiewicz (2022)

DVC - Data Version Control

  • "Version Control System for Machine Learning Projects"
  • Open source
  • It enables:
    • versioning data and models: "Git for data and models"
    • building pipelines that define how models are built, trained, and evaluated: "Makefile for machine learning"
    • tracking and comparing metrics and parameters
  • tightly integrated with git
  • works independently of the language, libraries, and operating system used
  • 5-minute introduction:

Installation and initialization

!pip3 install dvc
Collecting dvc
  Downloading (386kB)
[...]
Building wheels for collected packages: nanotime, flufl.lock, psutil, configobj, pygtrie, dulwich, pygit2, contextvars, atpublic, mailchecker, ftfy, future, idna-ssl
[...]
  Running bdist_wheel for pygit2 ... error
[...]
  building 'pygit2._pygit2' extension
  creating build/temp.linux-x86_64-3.6/src
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/include -I/usr/include/python3.6m -c src/blob.c -o build/temp.linux-x86_64-3.6/src/blob.o
  In file included from src/blob.c:30:0:
  src/blob.h:33:10: fatal error: git2.h: No such file or directory
   #include <git2.h>
  compilation terminated.
  error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
  Failed building wheel for pygit2
[...]
Successfully built nanotime flufl.lock psutil configobj pygtrie dulwich contextvars atpublic mailchecker ftfy future idna-ssl
Failed to build pygit2
[...]
  Running install for pygit2 ... error
[...]
    src/blob.h:33:10: fatal error: git2.h: No such file or directory
     #include <git2.h>
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hr8mtrcf/pygit2/';f=getattr(tokenize, 'open', open)(__file__);'\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-6tml5js1-record/install-record.txt --single-version-externally-managed --compile --user --prefix=" failed with error code 1 in /tmp/pip-build-hr8mtrcf/pygit2/

Let's create a directory to hold our project:

!mkdir -p IUM_10/sample-ml-project
#Jupyter notebook magic
%cd "IUM_10/sample-ml-project"

We initialize an empty Git repository (this step can be skipped if we are working in an existing repository):

!git init
Reinitialized existing Git repository in /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project/.git/

Now we initialize the DVC repository:

!dvc init
Initialized DVC repository.

You can now commit the changes to git.

|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <>              |
|                                                                     |

What's next?
- Check out the documentation: <>
- Get help and share ideas: <>
- Star us on GitHub: <>

Let's see which files DVC added (it also staged them in the git repository). Their description can be found here:

!git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   .dvc/.gitignore
	new file:   .dvc/config
	new file:   .dvcignore

We can now commit the changes in git:

!git commit -m "Initial commit"
[master (root-commit) d00d0ac] Initial commit
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore

Tracking files with DVC

  • large files, such as input data files or model files, are hard to manage with git because of problems with:
  • Git has an extension, LFS (Large File Storage), which partly addresses this problem. The files themselves are stored on a dedicated remote server; the repository holds only references to those files plus some metadata
  • DVC takes a similar approach, but:
    • files can be stored on almost any server, or locally
    • there is no file size limit (Git LFS on GitHub has a 2 GB limit)
    • DVC provides additional tooling for tracking files and their links to experiment results
    • for more, see here

Let's prepare some sample data by downloading it from Kaggle:

!kaggle datasets download -d uciml/iris
!unzip -o
!rm database.sqlite
!mkdir -p data
!mv Iris.csv data/
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/tomek/.kaggle/kaggle.json'
Downloading to /home/tomek/repos/aitech-ium/IUM_10/sample-ml-project
  0%|                                               | 0.00/3.60k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.60k/3.60k [00:00<00:00, 1.66MB/s]
  inflating: Iris.csv                
  inflating: database.sqlite         

Now we add the data file(s) to DVC:

!dvc add data/Iris.csv
⠧ Checking graph                                                   ⠋ Checking graph
  0% Checking cache in '/home/tomek/repos/aitech-ium/IUM_10/sample-ml-project/.d
  0%|          |Transferring                          0/1 [00:00<?,     ?file/s]
  0%|          |.oAL9GSGErYepJSZTnvkTL8.tmp        0.00/? [00:00<?,        ?B/s]
  0%|          |.oAL9GSGErYepJSZTnvkTL8.tmp     0.00/4.00 [00:00<?,        ?B/s]
  0%|          |7820ef0af287ff346c5cabfb4c612c     0.00/? [00:00<?,        ?B/s]
  0%|          |7820ef0af287ff346c5cabfb4c612c 0.00/4.99k [00:00<?,        ?B/s]
100% Adding...|████████████████████████████████████████|1/1 [00:00,  2.59file/s]

To track the changes with git, run:

    git add data/.gitignore data/Iris.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

  • DVC created the file data/Iris.csv.dvc and added the original file to .gitignore
  • Only the *.dvc file, which contains a reference to the actual file, will be present in the repository
!git status -u
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

nothing added to commit but untracked files present (use "git add" to track)

Let's add the files data/Iris.csv.dvc and data/.gitignore to the git repository, as DVC suggests:

!git add data/Iris.csv.dvc data/.gitignore
!git commit -m "Dodano dane IRIS (DVC)"
[master 67214ea] Dodano dane IRIS (DVC)
 2 files changed, 5 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/Iris.csv.dvc

The *.dvc file contains the hash of the file. More about *.dvc files: link

# %load data/Iris.csv.dvc
- md5: 717820ef0af287ff346c5cabfb4c612c
  size: 5107
  path: Iris.csv

The original Iris.csv file was moved to .dvc/cache/{file hash value} and linked back to its original location. How the link is created can vary depending on the file system.

!ls -l .dvc/cache/71
total 8
-r--r--r-- 1 tomek tomek 5107 Sep 19  2019 7820ef0af287ff346c5cabfb4c612c
!ls -l ./data
total 8
-rw-r--r-- 1 tomek tomek 5107 May 29 09:19 Iris.csv
-rw-r--r-- 1 tomek tomek   76 May 29 09:19 Iris.csv.dvc
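
The listing above suggests how the cache path relates to the hash: the first two hex characters name a subdirectory and the rest name the file. Below is a minimal Python sketch of that mapping. The helper names `file_md5` and `cache_path` are hypothetical and not part of DVC. The sketch assumes the recorded value is the plain MD5 of the file's bytes and that the cache uses the layout shown in this notebook; other DVC versions may hash or lay out files differently.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Map a hash to <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


# Hash value copied from data/Iris.csv.dvc above.
print(cache_path("717820ef0af287ff346c5cabfb4c612c").as_posix())
# prints .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c
```

Storing files under their hash means two identical files share one cache entry, and any change to the content yields a new entry instead of overwriting the old one.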

dvc remote

  • to send the actual files tracked by DVC to a remote location (from which they can be fetched, e.g. by a CI system or by other users), we need to have such a location configured
  • the dvc remote add command is used for this
  • we will use a local "remote". Here it is simply the previously created directory /dvcstore. The same directory also exists on our Jenkins; it must of course be mounted in Docker
  • in real-world use we would give a path to a resource reachable over the internet, such as an SFTP server or an AWS S3 path

Supported remote storage types (remotes):

  • Amazon S3
  • S3-compatible storage
  • Microsoft Azure Blob Storage
  • Google Drive
  • Google Cloud Storage
  • Aliyun OSS
  • SSH
  • HDFS
  • WebHDFS
  • HTTP
  • WebDAV
  • local remote
!dvc remote add -d my_local_remote /dvcstore
Setting 'my_local_remote' as a default remote.

!git status
On branch master
nothing to commit, working tree clean
!git add .dvc/config
!git commit -m "Added DVC remote"
On branch master
nothing to commit, working tree clean

dvc push

Once the "remote" is configured, we can push files to it with dvc push:

!dvc push
  0% Transferring|                                   |0/1 [00:00<?,     ?file/s]
  0%|          |7820ef0af287ff346c5cabfb4c612c     0.00/? [00:00<?,        ?B/s]
  0%|          |7820ef0af287ff346c5cabfb4c612c 0.00/4.99k [00:00<?,        ?B/s]
1 file pushed                                                                   

!tree /dvcstore
└── 71
    └── 7820ef0af287ff346c5cabfb4c612c

1 directory, 1 file
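
The tree output shows that the local remote mirrors the cache layout. As a rough mental model only, pushing to a local remote can be pictured as copying the cache objects that the remote does not yet have. The `push` function below is a hypothetical illustration, not DVC's implementation, and it runs in throwaway temporary directories rather than in /dvcstore.

```python
import shutil
import tempfile
from pathlib import Path


def push(cache_dir, remote_dir):
    """Copy objects missing from remote_dir, keeping the <2 chars>/<rest> layout."""
    cache_dir, remote_dir = Path(cache_dir), Path(remote_dir)
    pushed = []
    for obj in sorted(cache_dir.glob("*/*")):
        if not obj.is_file():
            continue
        target = remote_dir / obj.parent.name / obj.name
        if target.exists():
            continue  # already in the remote, nothing to transfer
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        pushed.append(target.relative_to(remote_dir).as_posix())
    return pushed


# Demo with placeholder content in temporary directories.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "dvcstore"
    obj = cache / "71" / "7820ef0af287ff346c5cabfb4c612c"
    obj.parent.mkdir(parents=True)
    obj.write_text("placeholder content\n")

    first = push(cache, remote)   # copies the single object
    second = push(cache, remote)  # nothing left to copy

print(first, second)
```

Because objects are named by their hash, checking whether a file already exists in the remote is enough to decide whether it needs to be transferred.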

dvc pull

To fetch data from DVC (e.g. in another location or as another user), we need to:

  • clone the git repository (to get the *.dvc files)
  • run dvc pull

Adding new files and modifying existing ones works much like with ordinary git-tracked files, except that we use the dvc command instead of git and additionally remember to manage the *.dvc files with git:

!head -n -1 data/Iris.csv | sponge data/Iris.csv
!git status
On branch master
nothing to commit, working tree clean
!dvc status
data/Iris.csv.dvc:
	changed outs:
		modified:           data/Iris.csv

!dvc add data/Iris.csv
⠹ Checking graph                                                   ⠋ Checking graph
  0% Checking cache in '/home/tomek/repos/aitech-ium/IUM_10/sample-ml-project/.d
  0%|          |Transferring                          0/1 [00:00<?,     ?file/s]
  0%|          |.GbNyfXVqWGYkQKjqaSP8tL.tmp        0.00/? [00:00<?,        ?B/s]
  0%|          |.GbNyfXVqWGYkQKjqaSP8tL.tmp     0.00/4.00 [00:00<?,        ?B/s]
  0%|          |cff2e578d76852294184c1dce9fdbf     0.00/? [00:00<?,        ?B/s]
  0%|          |cff2e578d76852294184c1dce9fdbf 0.00/4.95k [00:00<?,        ?B/s]
100% Adding...|████████████████████████████████████████|1/1 [00:00, 11.00file/s]

To track the changes with git, run:

    git add data/Iris.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

!git add data/Iris.csv.dvc
!git commit -m "Removed last line from Iris dataset"
[master d6ff265] Removed last line from Iris dataset
 1 file changed, 2 insertions(+), 2 deletions(-)
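
The cycle above (edit the file, see it reported as modified, re-add it) can be pictured as a hash comparison between the workspace file and the value recorded in the *.dvc file. The sketch below is a simplified model under that assumption. The functions `add` and `status` and the returned labels are hypothetical and only mimic the idea, not DVC's actual code or output.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Hex MD5 of the file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def add(path):
    """Build a pointer entry with the same fields as the *.dvc file shown earlier."""
    path = Path(path)
    return {"md5": md5_of(path), "size": path.stat().st_size, "path": path.name}


def status(pointer, data_dir):
    """Compare the recorded hash with the file currently in the workspace."""
    target = Path(data_dir) / pointer["path"]
    if not target.exists():
        return "deleted"
    return "unchanged" if md5_of(target) == pointer["md5"] else "modified"


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "Iris.csv"
    data.write_text("a,b\n1,2\n3,4\n")  # stand-in data, not the real dataset
    pointer = add(data)
    before = status(pointer, tmp)

    # Drop the last line, as `head -n -1` does in the cell above.
    lines = data.read_text().splitlines(keepends=True)
    data.write_text("".join(lines[:-1]))
    after_edit = status(pointer, tmp)

    pointer = add(data)  # re-adding records the new hash
    after_add = status(pointer, tmp)

print(before, after_edit, after_add)
```

This also explains why git status reported a clean tree after the edit: git only sees the *.dvc file, which does not change until the data file is added again.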

dvc checkout

  • We use dvc checkout together with git checkout to switch the branch we are working on.
  • DVC replaces the versions of the files it tracks with those from the other branch (provided the files differ and the corresponding *.dvc files differ between the branches)
  • switching branches with git changes the *.dvc files (if they differ), and dvc checkout copies/links files from .dvc/cache whose hashes match those in the *.dvc files
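
The interplay described in the list can be sketched as follows: git decides which pointer is current, and the checkout step restores the matching object from the cache. This is a minimal model that always copies the file, whereas DVC may link instead depending on the file system. The names `store` and `checkout` and the two pointer dictionaries are hypothetical stand-ins for the *.dvc files on two branches.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store(cache_dir, content):
    """Save content in the cache under its MD5 and return the hash."""
    md5 = hashlib.md5(content).hexdigest()
    obj = Path(cache_dir) / md5[:2] / md5[2:]
    obj.parent.mkdir(parents=True, exist_ok=True)
    obj.write_bytes(content)
    return md5


def checkout(pointer, cache_dir, workspace):
    """Restore the workspace file named by the pointer from the cache."""
    md5 = pointer["md5"]
    obj = Path(cache_dir) / md5[:2] / md5[2:]
    target = Path(workspace) / pointer["path"]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(obj, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    cache, work = Path(tmp) / "cache", Path(tmp) / "data"
    # Two versions of the same file, as two branches might record them.
    v1 = {"md5": store(cache, b"full dataset\n"), "path": "Iris.csv"}
    v2 = {"md5": store(cache, b"trimmed dataset\n"), "path": "Iris.csv"}

    on_branch_a = checkout(v1, cache, work).read_bytes()
    on_branch_b = checkout(v2, cache, work).read_bytes()

print(on_branch_a, on_branch_b)
```

Both versions stay in the cache side by side, so switching back and forth needs no download as long as the objects are present locally.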

Sharing data between projects

  • with the dvc import and dvc update commands we can add, and later update, files tracked by DVC in another repository
!dvc import \
             get-started/data.xml -o data/data.xml
Importing 'get-started/data.xml (' -> 'data/data.xml'
100%|█████████▉|get-started/data.xml      36.0M/36.1M [00:18<00:00,    2.87MB/s]
To track the changes with git, run:

	git add data/.gitignore data/data.xml.dvc

!dvc status
Data and pipelines are up to date.                                              

ls -l data
total 37020
-rw-rw-r-- 1 tomek tomek 37891850 maj 31 11:10 data.xml
-rw-rw-r-- 1 tomek tomek      284 maj 31 11:10 data.xml.dvc
-rw-rw-r-- 1 tomek tomek     5072 maj 31 11:01 Iris.csv
-rw-rw-r-- 1 tomek tomek       76 maj 31 11:01 Iris.csv.dvc
# %load data/data.xml.dvc
md5: a7cd139231cc35ed63541ce3829b96db
frozen: true
deps:
- path: get-started/data.xml
  repo:
    rev_lock: ba014f40e29670421a67cb1c47543f402348aa13
outs:
- md5: a304afb96060aad90176268345e10355
  size: 37891850
  path: data.xml

DVC pipelines

  • Introduction:
  • Getting started:
  • DVC pipelines let us build (with the dvc run command; newer DVC releases use dvc stage add instead) or define (by editing the dvc.yaml file) a dependency graph between the steps of our project (such as "data preparation", "training", "evaluation")
  • a pipeline defined this way can then be run with the dvc repro command
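
To make the idea of a dependency graph concrete, here is a minimal Python sketch of how an execution order can be derived from stage definitions. It is not DVC's implementation. The stages dictionary only mirrors the stages/cmd/deps/outs keys of dvc.yaml, and the stage names, scripts and file paths are hypothetical. It assumes Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline, deliberately listed out of order.
stages = {
    "train": {
        "cmd": "python train.py",
        "deps": ["data/prepared.csv", "train.py"],
        "outs": ["model.pkl"],
    },
    "prepare": {
        "cmd": "python prepare.py",
        "deps": ["data/Iris.csv", "prepare.py"],
        "outs": ["data/prepared.csv"],
    },
}


def execution_order(stages: dict) -> list:
    """Return stage names ordered so that producers run before consumers."""
    # Map every output file to the stage that produces it.
    producer = {
        out: name for name, stage in stages.items() for out in stage.get("outs", [])
    }
    # A stage depends on each stage that produces one of its deps.
    graph = {
        name: {producer[dep] for dep in stage.get("deps", []) if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))
# ['prepare', 'train']
```

The train stage is connected to prepare only because one of its deps (data/prepared.csv) appears among the outs of prepare. This is the kind of link between stages that task 4 below asks for.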

Tasks [15 pts]

  1. Initialize a DVC repository inside the repository of your project [1 pt]
  2. Add the data file(s) of your project to DVC [1 pt]
  3. Configure a remote (the configuration details are given below) [3 pts]
  4. Create/define a dvc.yaml file describing the steps performed in your project and add it to the repository. Separate out at least 2 steps (e.g. data preparation/training) linked to each other by dependencies (use the "Getting started" materials, link above) [6 pts]
  5. Create a Jenkins project (s1233456-dvc) in which you clone the repository, download the DVC-tracked files (with dvc pull) and run the pipeline (with dvc repro) [4 pts]

SSH remote

One of the remote types supported by DVC is SFTP/SSH. To make it available, the user ium-sftp was created on the server and an SFTP server was configured. An SSH key was also generated for this user and added as a "Jenkins credential" (see the description of the Jenkins configuration below).


We will need additional dependencies (details):

conda install dvc-ssh


pip install dvc[ssh] paramiko

conda install -c conda-forge dvc-ssh
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/tomek/miniconda3

  added / updated specs:
    - dvc-ssh

The following packages will be downloaded:

    package                    |            build
    bcrypt-3.2.0               |   py39h3811e60_1          44 KB  conda-forge
    ca-certificates-2021.5.30  |       ha878542_0         136 KB  conda-forge
    certifi-2021.5.30          |   py39hf3d152e_0         141 KB  conda-forge
    dvc-2.3.0                  |   py39hf3d152e_0         542 KB  conda-forge
    dvc-ssh-2.3.0              |   py39hf3d152e_0           9 KB  conda-forge
    fsspec-2021.5.0            |     pyhd8ed1ab_0          77 KB  conda-forge
    invoke-1.5.0               |     pyhd3deb0d_0         137 KB  conda-forge
    paramiko-2.7.2             |     pyh9f0ad1d_0         135 KB  conda-forge
    pynacl-1.4.0               |   py39h3811e60_2         1.3 MB  conda-forge
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

  bcrypt             conda-forge/linux-64::bcrypt-3.2.0-py39h3811e60_1
  dvc-ssh            conda-forge/linux-64::dvc-ssh-2.3.0-py39hf3d152e_0
  invoke             conda-forge/noarch::invoke-1.5.0-pyhd3deb0d_0
  paramiko           conda-forge/noarch::paramiko-2.7.2-pyh9f0ad1d_0
  pynacl             conda-forge/linux-64::pynacl-1.4.0-py39h3811e60_2

The following packages will be UPDATED:

  ca-certificates                      2020.12.5-ha878542_0 --> 2021.5.30-ha878542_0
  certifi                          2020.12.5-py39hf3d152e_1 --> 2021.5.30-py39hf3d152e_0
  dvc                                  2.1.0-py39hf3d152e_0 --> 2.3.0-py39hf3d152e_0
  fsspec                                 0.9.0-pyhd8ed1ab_2 --> 2021.5.0-pyhd8ed1ab_0

Downloading and Extracting Packages
certifi-2021.5.30    | 141 KB    | ##################################### | 100% 
fsspec-2021.5.0      | 77 KB     | ##################################### | 100% 
dvc-2.3.0            | 542 KB    | ##################################### | 100% 
invoke-1.5.0         | 137 KB    | ##################################### | 100% 
paramiko-2.7.2       | 135 KB    | ##################################### | 100% 
bcrypt-3.2.0         | 44 KB     | ##################################### | 100% 
pynacl-1.4.0         | 1.3 MB    | ##################################### | 100% 
dvc-ssh-2.3.0        | 9 KB      | ##################################### | 100% 
ca-certificates-2021 | 136 KB    | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
!dvc remote add -f -d ium_ssh_remote ssh://
Setting 'ium_ssh_remote' as a default remote.

!dvc remote list
my_local_remote	/dvcstore
ium_ssh_remote	ssh://

Save the password (the --local option writes it to .dvc/config.local, which is not committed to git):

!dvc remote modify --local ium_ssh_remote password IUM@2021

!dvc push
Everything is up to date.                                                       
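
Typing the password directly into a command leaves it in the notebook and in the shell history. As a hedged alternative, the sketch below builds the same dvc remote modify command but reads the password from an environment variable. The function remote_password_command and the variable name IUM_SFTP_PASS are assumptions made for this illustration, not part of DVC.

```python
import os


def remote_password_command(remote: str, env_var: str = "IUM_SFTP_PASS") -> list:
    """Build the argv for 'dvc remote modify --local <remote> password <value>'.

    The password is read from the environment variable env_var (a hypothetical
    name), so it never has to be written into the notebook itself.
    """
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return ["dvc", "remote", "modify", "--local", remote, "password", password]


# Usage (requires dvc to be installed and the variable to be exported):
#   import subprocess
#   subprocess.run(remote_password_command("ium_ssh_remote"), check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with characters such as @ in the password.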


In Jenkins, the "Credentials" mechanism can be used to pass a password or a private key securely.

Such credentials have been created on Jenkins for the ium-sftp user:

How to use "Credentials" in a Jenkinsfile:

The SSH key can be used like this:

    withCredentials([sshUserPrivateKey(credentialsId: '48ac7004-216e-4260-abba-1fe5db753e18', keyFileVariable: 'IUM_SFTP_KEY', passphraseVariable: '', usernameVariable: '')]) {
                sh 'dvc remote add -d ium_ssh_remote ssh://'
                sh 'dvc remote modify --local ium_ssh_remote keyfile $IUM_SFTP_KEY'
                sh 'dvc pull'
    }

Secret text can be used like this:

    withCredentials([string(credentialsId: 'ium-sftp-password', variable: 'IUM_SFTP_PASS')]) {
                sh 'dvc remote add -d ium_ssh_remote ssh://'
                sh 'dvc remote modify --local ium_ssh_remote password $IUM_SFTP_PASS'
                sh 'dvc pull'
    }

Example configuration: