{
"cells": [
{
"cell_type": "markdown",
"id": "7fe475ae",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Machine Learning Engineering\n",
"### 22 May 2024\n",
"# 10. DVC"
]
},
{
"cell_type": "markdown",
"id": "0c6f27a5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"<img src=\"img/expcontrol/dvc-logo.png\">"
]
},
{
"cell_type": "markdown",
"id": "560eec71",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC - Data Version Control\n",
"- [dvc.org](https://dvc.org/)\n",
"- \"Version Control System for Machine Learning Projects\"\n",
"- Open source\n",
"- It lets you:\n",
"    - version data and models (\"Git for data and models\")\n",
"    - build pipelines that define how models are built, trained and evaluated (\"Makefile for machine learning\")\n",
"    - track and compare metrics and parameters\n",
"- tightly integrated with Git\n",
"- works independently of the programming language, libraries and operating system used\n",
"- 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs"
]
},
{
"cell_type": "markdown",
"id": "3d4ce1cb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tracking files with DVC\n",
" - Large files, such as input data or model files, are hard to manage with Git because of:\n",
"     - performance\n",
"     - repository size\n",
"     - limits imposed by hosting services (e.g. the [100 MB per-file limit on GitHub](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github))\n",
" - Git has the [LFS (Large File Storage)](https://git-lfs.github.com/) extension, which partly solves this problem.\n",
"     - The files themselves are stored on a dedicated remote server; the repository holds only pointers to them plus some metadata\n",
"     - GitHub has integrated LFS with a [1 GB limit for free accounts](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage)"
]
},
{
"cell_type": "markdown",
"id": "dd8e529b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" - DVC takes a similar approach to LFS, but:\n",
"     - files can be stored on almost any server, or locally\n",
"     - there is no file size limit (Git LFS on GitHub has a [2 GB limit](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage))\n",
"     - DVC provides additional tooling for tracking files and how they relate to experiment results\n",
" - For more, see [here](https://dvc.org/doc/user-guide/related-technologies)"
]
},
{
"cell_type": "markdown",
"id": "9bfb356e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Installation and initialization\n",
" - https://dvc.org/doc/install\n",
" - ```pip install dvc```\n",
" - ```pipx install dvc```\n",
" - ```conda install -c conda-forge dvc```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "054c7a11",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting dvc\n",
" Downloading dvc-3.50.2-py3-none-any.whl (451 kB)\n",
"[...]\n",
"Successfully installed aiohttp-3.9.5 aiohttp-retry-2.8.3 aiosignal-1.3.1 amqp-5.2.0 annotated-types-0.7.0 antlr4-python3-runtime-4.9.3 appdirs-1.4.4 async-timeout-4.0.3 asyncssh-2.14.2 atpublic-4.1.0 attrs-23.2.0 billiard-4.2.0 celery-5.4.0 cffi-1.16.0 click-didyoumean-0.3.1 click-plugins-1.1.1 click-repl-0.3.0 configobj-5.0.8 cryptography-42.0.7 dictdiffer-0.9.0 diskcache-5.6.3 distro-1.9.0 dpath-2.1.6 dulwich-0.22.1 dvc-3.50.2 dvc-data-3.15.1 dvc-http-2.32.0 dvc-objects-5.1.0 dvc-render-1.0.2 dvc-studio-client-0.20.0 dvc-task-0.4.0 filelock-3.14.0 flatten-dict-0.4.2 flufl.lock-7.1.1 frozenlist-1.4.1 fsspec-2024.5.0 funcy-2.0 grandalf-0.8 gto-1.7.1 hydra-core-1.3.2 iterative-telemetry-0.0.8 kombu-5.3.7 markdown-it-py-3.0.0 mdurl-0.1.2 multidict-6.0.5 networkx-3.3 omegaconf-2.3.0 orjson-3.10.3 pathspec-0.12.1 platformdirs-3.11.0 pycparser-2.22 pydantic-2.7.1 pydantic-core-2.18.2 pydot-2.0.0 pygit2-1.15.0 pygtrie-2.5.0 rich-13.7.1 ruamel.yaml-0.18.6 ruamel.yaml.clib-0.2.8 scmrepo-3.3.5 semver-3.0.2 shellingham-1.5.4 shortuuid-1.0.13 shtab-1.7.1 sqltrie-0.11.0 tabulate-0.9.0 tomlkit-0.12.5 typer-0.12.3 vine-5.1.0 voluptuous-0.14.2 yarl-1.9.4 zc.lockfile-3.0.post1\n"
]
}
],
"source": [
"!pip3 install dvc"
]
},
{
"cell_type": "markdown",
"id": "20975d62",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's create a directory to hold our project:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d94e912",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!rm -r -f IUM_10/sample-ml-project-2023\n",
"!mkdir -p IUM_10/sample-ml-project-2023"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aae59ec2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Jupyter notebook magic: https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd\n",
"%cd \"IUM_10/sample-ml-project-2023\""
]
},
{
"cell_type": "markdown",
"id": "199c0d92",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We initialize an empty Git repository (this step can be skipped if you are working in an existing repository):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c13c525b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git init"
]
},
{
"cell_type": "markdown",
"id": "c7155369",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we initialize the DVC repository:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44f28226",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc init"
]
},
{
"cell_type": "markdown",
"id": "00bc72ed",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's see which files DVC added (they are also staged in the Git repository).\n",
"They are described here: https://dvc.org/doc/user-guide/project-structure/internal-files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1aefe16",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "markdown",
"id": "b16a62e6",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- `.dvc/config` - the main DVC configuration file\n",
"- `.dvc/config.local` - overrides values from `config`; meant for local changes that are not committed to the repository\n",
"- `.dvc/.gitignore` - lists DVC files that should not end up in the repository\n",
"- `.dvcignore` - DVC skips the files listed here (e.g. to improve performance)"
]
},
{
"cell_type": "markdown",
"id": "72e0a272",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can now commit the changes to Git:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59780e99",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git commit -m \"Initial commit\""
]
},
{
"cell_type": "markdown",
"id": "a8861abe",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's prepare some sample data by downloading it from Kaggle:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f05ece1b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!kaggle datasets download -d uciml/iris\n",
"!unzip -o iris.zip\n",
"!rm database.sqlite iris.zip\n",
"!mkdir -p data\n",
"!mv Iris.csv data/"
]
},
{
"cell_type": "markdown",
"id": "adb9a522",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we add the data file(s) to DVC:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74d182c7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "markdown",
"id": "72c6b5d0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" - DVC created the file `data/Iris.csv.dvc` and added the original file to `data/.gitignore`\n",
" - Only the `*.dvc` file, which holds a pointer to the actual file, will be present in the repository"
]
},
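{
"cell_type": "markdown",
"id": "6e1f0b3a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For illustration, a sketch of what `data/.gitignore` should contain after the `dvc add` call above. It was not captured from a run of this notebook; it assumes DVC's usual behaviour of listing the tracked file with a leading slash, so that Git ignores it only in this directory:\n",
"\n",
"```\n",
"/Iris.csv\n",
"```"
]
},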
{
"cell_type": "code",
"execution_count": null,
"id": "74d54652",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status -u"
]
},
{
"cell_type": "markdown",
"id": "8589fecf",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's add the files `data/Iris.csv.dvc` and `data/.gitignore` to the Git repository, as DVC suggests:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "460c4a17",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git add data/Iris.csv.dvc data/.gitignore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80644077",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git commit -m \"Add IRIS data (DVC)\""
]
},
{
"cell_type": "markdown",
"id": "03899863",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A `*.dvc` file contains, among other things, the hash of the tracked file. More about `*.dvc` files: [link](https://dvc.org/doc/user-guide/project-structure/dvc-files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cb2ba7c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# %load data/Iris.csv.dvc\n"
]
},
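{
"cell_type": "markdown",
"id": "b7d40a63",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a rough sketch (not output captured from this notebook), the pointer file `data/Iris.csv.dvc` is a small YAML file along these lines. The `md5` value is inferred from the cache path `.dvc/cache/71/7820ef0af287ff346c5cabfb4c612c` used further on, assuming the first two characters of the hash form the directory name. DVC also records the file size and, in 3.x, a `hash: md5` entry; both are left out here because the notebook does not show them:\n",
"\n",
"```yaml\n",
"outs:\n",
"- md5: 717820ef0af287ff346c5cabfb4c612c\n",
"  path: Iris.csv\n",
"```"
]
},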
{
"cell_type": "markdown",
"id": "0b421d45",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The original `Iris.csv` file has been moved to the cache directory `.dvc/cache/{file hash}` and linked back to its original location (DVC 3.x places the cache under `.dvc/cache/files/md5/`). How the link is created can [vary depending on the file system](https://dvc.org/doc/user-guide/large-dataset-optimization)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d471f3a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!ls -l .dvc/cache/71"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32531aa8",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!head -n 3 .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2396c762",
"metadata": {},
"outputs": [],
"source": [
"!git remote add origin git@git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git\n",
"!git push --set-upstream origin main"
]
},
{
"cell_type": "markdown",
"id": "901e8e90",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc remote\n",
" - To send the actual files tracked by DVC to a remote location (from which they can be fetched by, e.g., a CI system or other users), such a location must be configured first\n",
" - This is done with the [`dvc remote add`](https://dvc.org/doc/command-reference/remote/add) command\n",
" - We will use a local \"remote\". Here it is simply a previously created directory, `/dvcstore`. The same directory exists on our Jenkins server; it must of course be mounted into the Docker container\n",
" - In real-world use we would instead give a path to a resource reachable over the internet, such as an SFTP server or an AWS S3 location"
]
},
{
"cell_type": "markdown",
"id": "53429521",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Supported remote storage types: https://dvc.org/doc/command-reference/remote/add#supported-storage-types\n",
" - Amazon S3\n",
" - S3-compatible storage\n",
" - Microsoft Azure Blob Storage\n",
" - Google Drive\n",
" - Google Cloud Storage\n",
" - Aliyun OSS\n",
" - SSH\n",
" - HDFS\n",
" - WebHDFS\n",
" - HTTP\n",
" - WebDAV\n",
" - local remote"
]
},
{
"cell_type": "markdown",
"id": "507e3a09",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Adding a local remote"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a16f2bfa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote add -d my_local_remote /dvcstore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c3deeaf",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "899eac7d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git add .dvc/config\n",
"!git commit -m \"Added DVC remote\""
]
},
{
"cell_type": "markdown",
"id": "8c556c96",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc push\n",
"Once a \"remote\" is configured, we can push files to it with the `dvc push` command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7f24f75",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc push"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a355575",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!tree /dvcstore"
]
},
{
"cell_type": "markdown",
"id": "af59ecb3",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc pull\n",
"To fetch data tracked by DVC (e.g. on another machine or as another user), we need to:\n",
" - clone the git repository (which, among other things, brings in the `*.dvc` files)\n",
" - run `dvc pull`"
]
},
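{
"cell_type": "markdown",
"id": "dvc-pull-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The two steps above can be sketched as follows. This is only an illustration: the repository URL is the one added as `origin` earlier in this notebook, `sample-ml-project` is the directory name git derives from that URL, and the machine is assumed to have access to the remote recorded in `.dvc/config` (here the local `/dvcstore` directory).\n",
"\n",
"```shell\n",
"# Clone the git repository; this brings in the *.dvc files and .dvc/config\n",
"git clone git@git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git\n",
"cd sample-ml-project\n",
"# Download the files referenced by the *.dvc files from the default remote\n",
"dvc pull\n",
"```"
]
},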
{
"cell_type": "markdown",
"id": "9fa914a7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Adding new files and modifying existing ones works much as it does for ordinary git-tracked files, except that we use the `dvc` command instead of `git` and additionally remember to manage the `*.dvc` files with git:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dde39796",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!head -n -1 data/Iris.csv | sponge data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f14ec60",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a841039",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bf6c1067",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a4865c9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!git add data/Iris.csv.dvc\n",
"!git commit -m \"Removed last line from Iris dataset\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e2d320",
"metadata": {},
"outputs": [],
"source": [
"!wc -l .dvc/cache/*/*"
]
},
{
"cell_type": "markdown",
"id": "d710977c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### dvc checkout\n",
" - We use `dvc checkout` together with `git checkout` to switch the branch we are working on.\n",
" - DVC replaces the files it tracks with the versions belonging to the other branch (provided the files differ, i.e. the `*.dvc` files differ between the two branches)\n",
" - switching branches with git changes the `*.dvc` files (if they differ), and `dvc checkout` then copies or links from `.dvc/cache` the files whose hashes match those recorded in the `*.dvc` files"
]
},
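{
"cell_type": "markdown",
"id": "dvc-checkout-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of switching branches with both tools. The branch name `experiment` is hypothetical, and the data versions it refers to are assumed to be present in the local cache already (otherwise a `dvc pull` would be needed).\n",
"\n",
"```shell\n",
"# Switch the git branch; this updates the *.dvc files in the working tree\n",
"git checkout experiment\n",
"# Make the DVC-tracked files match the hashes in the *.dvc files now checked out\n",
"dvc checkout\n",
"```\n",
"\n",
"Running `dvc install` once in the repository sets up git hooks so that, among other things, `dvc checkout` runs automatically after `git checkout`."
]
},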
{
"cell_type": "markdown",
"id": "5897e8eb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Sharing data between projects\n",
" - with the `dvc import` and `dvc update` commands we can add files tracked by DVC in another repository to our project and update them later"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b018146",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc import https://github.com/iterative/dataset-registry \\\n",
" get-started/data.xml -o data/data.xml"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be2c1a37",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3306c5b7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"ls -l data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b73c56ea",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# %load data/data.xml.dvc\n",
"md5: 9d5921765bfba6c2c7e4c780c66edaa0\n",
"frozen: true\n",
"deps:\n",
"- path: get-started/data.xml\n",
"  repo:\n",
"    url: https://github.com/iterative/dataset-registry\n",
"    rev_lock: 08c38bbea04e4f9e2130615dd679309ed0e11a72\n",
"outs:\n",
"- md5: 22a1a2931c8370d3aeedd7183606fd7f\n",
"  size: 14445097\n",
"  path: data.xml\n"
]
},
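{
"cell_type": "markdown",
"id": "dvc-update-sketch",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To refresh an imported file later, `dvc update` is given the `.dvc` file created by the import. A sketch, assuming the import above has been run and the source repository has since changed; the commit message is only an example.\n",
"\n",
"```shell\n",
"# Check the source repository and fetch a newer version of the file, if there is one;\n",
"# this also updates rev_lock in data/data.xml.dvc\n",
"dvc update data/data.xml.dvc\n",
"# Record the updated .dvc file in git\n",
"git add data/data.xml.dvc\n",
"git commit -m \"Updated data.xml from dataset-registry\"\n",
"```"
]
},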
{
"cell_type": "markdown",
"id": "db1063ac",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC pipelines\n",
" - introduction: https://youtu.be/71IGzyH95UY\n",
" - Getting started: https://dvc.org/doc/start/data-pipelines\n",
" - DVC pipelines let us build (with the `dvc run` command, replaced by `dvc stage add` in newer DVC releases) or define (by editing the `dvc.yaml` file) a dependency graph between the steps carried out in our project (such as \"data preparation\", \"training\", \"evaluation\")\n",
" - a pipeline defined this way can then be run with the `dvc repro` command"
]
},
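{
"cell_type": "markdown",
"id": "dvc-pipeline-sketch",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A two-stage pipeline could be sketched as below. This is an illustration only: the scripts `prepare.py` and `train.py` and the outputs `data/prepared.csv` and `model.pkl` are hypothetical names, and the commands assume a DVC release that provides `dvc stage add`. Each call appends a stage to `dvc.yaml`; `dvc repro` is the name under which the DVC command reference lists the command that runs the pipeline.\n",
"\n",
"```shell\n",
"# Stage 1: data preparation (hypothetical script that writes data/prepared.csv)\n",
"dvc stage add -n prepare \\\n",
"    -d prepare.py -d data/Iris.csv \\\n",
"    -o data/prepared.csv \\\n",
"    python prepare.py\n",
"# Stage 2: training; it depends on the output of the previous stage\n",
"dvc stage add -n train \\\n",
"    -d train.py -d data/prepared.csv \\\n",
"    -o model.pkl \\\n",
"    python train.py\n",
"# Show the dependency graph, then run the stages that are out of date\n",
"dvc dag\n",
"dvc repro\n",
"```\n",
"\n",
"Because `train` lists `data/prepared.csv` as a dependency and `prepare` lists it as an output, DVC orders the two stages without being told to."
]
},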
{
"cell_type": "markdown",
"id": "e2939867",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tasks [10+10 pts]\n",
"1. Initialize a DVC repository inside your project repository [1 pt]\n",
"2. Add the data file(s) in your project to DVC [1 pt]\n",
"3. Configure a remote (configuration details are given below) [3 pts]\n",
"4. Create/define and add to the repository a `dvc.yaml` file describing the steps carried out in your project. Split it into at least 2 steps (e.g. data preparation/training) linked by dependencies (use the \"Getting started\" materials linked above) [10 pts (optional)]\n",
"5. Create a Jenkins project (`s1233456-dvc`) in which you clone the repository, download the DVC-tracked files (with `dvc pull`) and run the pipeline (with `dvc repro`) [5 pts]"
]
},
{
"cell_type": "markdown",
"id": "2f5a8590",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## SSH remote\n",
"One of the remote types supported by DVC is SFTP/SSH.\n",
"To make use of it, a user `ium-sftp` was created and an SFTP server was configured on the server tzietkiewicz.vm.wmi.amu.edu.pl.\n",
"An SSH key was also generated for this user and added as a \"Jenkins credential\" (see the Jenkins configuration notes below)"
]
},
{
"cell_type": "markdown",
"id": "82a61107",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Locally\n",
"We will need extra dependencies ([details](https://dvc.org/doc/command-reference/remote/add))\n",
" \n",
" `conda install -c conda-forge dvc-ssh` \n",
"\n",
"or\n",
"\n",
"`pip install dvc[ssh] paramiko`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c48c5b8e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: \\ "
]
}
],
"source": [
"conda install -c conda-forge dvc-ssh"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9662b7aa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"## The packages below are needed for the dvc remote commands to work:\n",
"!sudo apt install libssl3 libffi7"
]
},
{
"cell_type": "markdown",
"id": "04c41da0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Add the remote:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9a04876",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3f27bbb",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote list"
]
},
{
"cell_type": "markdown",
"id": "c92edd7b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Store the password (the `--local` option writes it to `.dvc/config.local`, which is not tracked by git):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2fa175",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote modify --local ium_ssh_remote password IUM@2021"
]
},
{
"cell_type": "markdown",
"id": "8b83049b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Push to the configured remote:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea6e16fa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc push"
]
},
{
"cell_type": "markdown",
"id": "1468c44c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Jenkins\n",
"\n",
"In Jenkins, the \"Credentials\" mechanism can be used to pass a password or a private key securely.\n",
"\n",
"Such credentials have been created on Jenkins for the user ium-sftp:\n",
"\n",
" - of type SSH key: https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/credentials/store/system/domain/_/credential/48ac7004-216e-4260-abba-1fe5db753e18/\n",
" - of type \"secret text\", containing the password of the user ium-sftp: https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/credentials/store/system/domain/_/credential/ium-sftp-password/\n",
"\n",
"How to use \"Credentials\" in a Jenkinsfile: https://www.jenkins.io/doc/book/pipeline/jenkinsfile/#for-other-credential-types\n",
"\n",
"The SSH key can be used like this: \n",
"\n",
"```Jenkinsfile\n",
"withCredentials(\n",
"    [sshUserPrivateKey(credentialsId: '48ac7004-216e-4260-abba-1fe5db753e18', keyFileVariable: 'IUM_SFTP_KEY', passphraseVariable: '', usernameVariable: '')]) {\n",
"    // single quotes: $IUM_SFTP_KEY is expanded by the shell, not interpolated by Groovy\n",
"    sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
"    sh 'dvc remote modify --local ium_ssh_remote keyfile $IUM_SFTP_KEY'\n",
"    sh 'dvc pull'\n",
"}\n",
"```\n",
"\n",
"The secret text like this:\n",
"\n",
"```Jenkinsfile\n",
"withCredentials([string(credentialsId: 'ium-sftp-password', variable: 'IUM_SFTP_PASS')]) {\n",
"    sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
"    sh 'dvc remote modify --local ium_ssh_remote password $IUM_SFTP_PASS'\n",
"    sh 'dvc pull'\n",
"}\n",
"```\n",
"\n",
"Example configuration: \n",
" - https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/job/docker-test-mount/ \n",
" - https://git.wmi.amu.edu.pl/tzietkiewicz/ium-helloworld"
]
}
],
"metadata": {
"author": "Tomasz Ziętkiewicz",
"celltoolbar": "Slideshow",
"email": "tomasz.zietkiewicz@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"slideshow": {
"slide_type": "slide"
},
"subtitle": "10.DVC[laboratoria]",
"title": "Inżynieria uczenia maszynowego",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 5
}