{
"cells": [
{
"cell_type": "markdown",
"id": "7fe475ae",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Machine Learning Engineering\n",
"### 22 May 2024\n",
"# 10. DVC"
]
},
{
"cell_type": "markdown",
"id": "0c6f27a5",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"<img src=\"img/expcontrol/dvc-logo.png\">"
]
},
{
"cell_type": "markdown",
"id": "560eec71",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC - Data Version Control\n",
"- [dvc.org](https://dvc.org/)\n",
"- \"Version Control System for Machine Learning Projects\"\n",
"- Open Source\n",
"- It lets you:\n",
"    - version data and models: \"Git for data and models\"\n",
"    - build pipelines that define how models are built, trained, and evaluated: \"Makefile for machine learning\"\n",
"    - track and compare metrics and parameters\n",
"- tightly integrated with Git\n",
"- works independently of the language, libraries, and operating system in use\n",
"- 5-minute introduction: https://www.youtube.com/watch?v=UbL7VUpv1Bs&t=197s"
]
},
{
"cell_type": "markdown",
"id": "3d4ce1cb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tracking files with DVC\n",
" - large files, such as input data or model files, are hard to manage with Git because of problems with:\n",
"     - performance\n",
"     - repository storage space\n",
"     - limits imposed by the hosting service (e.g. the [100 MB per-file limit on GitHub](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github))\n",
" - Git has the [LFS (Large File Storage)](https://git-lfs.github.com/) extension, which partly solves this problem.\n",
"     - The files themselves are stored on a dedicated remote server; the repository holds only pointers to those files plus some metadata\n",
"     - GitHub has built-in LFS with a [1 GB limit for free accounts](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage)"
]
},
{
"cell_type": "markdown",
"id": "dd8e529b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" - DVC takes a similar approach to LFS, but:\n",
"     - files can be stored on almost any server, or locally\n",
"     - there is no file size limit (Git LFS on GitHub has a [2 GB limit](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage))\n",
"     - DVC provides additional tooling for tracking files and how they relate to experiment results\n",
"     - for more, see [here](https://dvc.org/doc/user-guide/related-technologies)"
]
},
{
"cell_type": "markdown",
"id": "9bfb356e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Installation and initialization\n",
" - https://dvc.org/doc/install\n",
" - ```pip install dvc``` (or ```pipx install dvc```), or:\n",
" - ```conda install -c conda-forge dvc```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "054c7a11",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!pip3 install dvc"
]
},
{
"cell_type": "markdown",
"id": "20975d62",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's create a directory to hold our project:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d94e912",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!rm -r -f IUM_10/sample-ml-project-2023\n",
"!mkdir -p IUM_10/sample-ml-project-2023"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aae59ec2",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Jupyter notebook magic: https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd\n",
"%cd \"IUM_10/sample-ml-project-2023\""
]
},
{
"cell_type": "markdown",
"id": "199c0d92",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We initialize an empty Git repository (this step can be skipped when working in an existing repository):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c13c525b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git init"
]
},
{
"cell_type": "markdown",
"id": "c7155369",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we initialize the DVC repository:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44f28226",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc init"
]
},
{
"cell_type": "markdown",
"id": "00bc72ed",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's see which files DVC added (it also staged them in the Git repository).\n",
"They are described here: https://dvc.org/doc/user-guide/project-structure/internal-files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1aefe16",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "markdown",
"id": "b16a62e6",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"- `.dvc/config` - the main DVC configuration file\n",
"- `.dvc/config.local` - overrides values from `config`; meant for local changes that are not committed to the repo\n",
"- `.dvc/.gitignore` - lists DVC files that should not end up in the repo\n",
"- `.dvcignore` - DVC skips the files listed in this file (e.g. to improve performance)"
]
},
{
"cell_type": "markdown",
"id": "72e0a272",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We can now commit the changes to Git:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59780e99",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git commit -m \"Initial commit\""
]
},
{
"cell_type": "markdown",
"id": "a8861abe",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's prepare sample data by downloading it from Kaggle:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f05ece1b",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!kaggle datasets download -d uciml/iris\n",
"!unzip -o iris.zip\n",
"!rm database.sqlite iris.zip\n",
"!mkdir -p data\n",
"!mv Iris.csv data/"
]
},
{
"cell_type": "markdown",
"id": "adb9a522",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Now we add the data file(s) to DVC:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74d182c7",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "markdown",
"id": "72c6b5d0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
" - DVC created the file `data/Iris.csv.dvc` and added the original file to `data/.gitignore`\n",
" - Only the `*.dvc` file will be present in the repository; it holds a pointer to the actual file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "74d54652",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status -u"
]
},
{
"cell_type": "markdown",
"id": "8589fecf",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let's add the files `data/Iris.csv.dvc` and `data/.gitignore` to the Git repository, as DVC suggests:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "460c4a17",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git add data/Iris.csv.dvc data/.gitignore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80644077",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git commit -m \"Added IRIS data (DVC)\""
]
},
{
"cell_type": "markdown",
"id": "03899863",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A `*.dvc` file contains, among other things, the hash of the tracked file. More on `*.dvc` files: [link](https://dvc.org/doc/user-guide/project-structure/dvc-files)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cb2ba7c",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# %load data/Iris.csv.dvc\n"
]
},
{
"cell_type": "markdown",
"id": "0b421d45",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The original `Iris.csv` file was moved into the `.dvc/cache` directory, under a path derived from its hash (the first two hex characters name a subdirectory, the rest name the file), and linked back to its original location. How the link is created can [differ depending on the file system](https://dvc.org/doc/user-guide/large-dataset-optimization)."
]
},
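The cache layout described above can be sketched in a few lines of Python. This is only an illustration, assuming the MD5-based layout visible in this notebook (`.dvc/cache/<first two hex characters>/<remaining characters>`); the helper name `cache_path` is hypothetical, and other DVC versions may organize the cache differently.

```python
import hashlib
from pathlib import Path


def cache_path(content: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return the content-addressed location for `content` (hypothetical helper)."""
    digest = hashlib.md5(content).hexdigest()  # 32 hex characters
    # The first two characters name a subdirectory, the remaining 30 name the file.
    return Path(cache_dir) / digest[:2] / digest[2:]


# Identical content always maps to the same path, so it is stored only once.
print(cache_path(b"hello").as_posix())
```

Because the path depends only on the file's content, two copies of the same data share one cache entry regardless of their names in the workspace.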
{
"cell_type": "code",
"execution_count": null,
"id": "1d471f3a",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!ls -l .dvc/cache/71"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32531aa8",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!head -n 3 .dvc/cache/71/7820ef0af287ff346c5cabfb4c612c"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2396c762",
"metadata": {},
"outputs": [],
"source": [
"!git remote add origin git@git.wmi.amu.edu.pl:tzietkiewicz/sample-ml-project.git\n",
"!git push --set-upstream origin main"
]
},
{
"cell_type": "markdown",
"id": "901e8e90",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc remote\n",
" - to send the actual files tracked by DVC to a remote location (from which they can be downloaded, e.g. by a CI system or by other users), such a location must first be configured\n",
" - this is what the [`dvc remote add`](https://dvc.org/doc/command-reference/remote/add) command is for\n",
" - we will use a local \"remote\". Here it is simply the previously created directory `/dvcstore`. The same directory also exists on our Jenkins server; it has to be mounted into the Docker container, of course\n",
" - in real-world use we would instead give a path to a resource reachable over the network, such as an SFTP server or an AWS S3 path"
]
},
{
"cell_type": "markdown",
"id": "53429521",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Supported remote storage types: https://dvc.org/doc/command-reference/remote/add#supported-storage-types\n",
" - Amazon S3\n",
" - S3-compatible storage\n",
" - Microsoft Azure Blob Storage\n",
" - Google Drive\n",
" - Google Cloud Storage\n",
" - Aliyun OSS\n",
" - SSH\n",
" - HDFS\n",
" - WebHDFS\n",
" - HTTP\n",
" - WebDAV\n",
" - local remote"
]
},
{
"cell_type": "markdown",
"id": "507e3a09",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"#### Adding a local remote"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a16f2bfa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote add -d my_local_remote /dvcstore"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c3deeaf",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "899eac7d",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git add .dvc/config\n",
"!git commit -m \"Added DVC remote\""
]
},
{
"cell_type": "markdown",
"id": "8c556c96",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc push\n",
"Once a remote is configured, we can push files to it with `dvc push`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7f24f75",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc push"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a355575",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!tree /dvcstore"
]
},
{
"cell_type": "markdown",
"id": "af59ecb3",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## dvc pull\n",
"To download data from DVC (e.g. in another location, or as another user), we need to:\n",
" - clone the Git repository (to get, among other things, the `*.dvc` files)\n",
" - run `dvc pull`"
]
},
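Conceptually, pushing to and pulling from a local remote amounts to copying content-addressed objects between directories. The sketch below illustrates that idea with temporary directories; the names `object_path` and `transfer` are hypothetical, the `<2 chars>/<30 chars>` layout is assumed from the cache listing earlier in this notebook, and the real commands do considerably more (remote configuration, many objects, other storage backends).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def object_path(root: Path, md5: str) -> Path:
    """Location of an object inside a cache or remote directory (assumed layout)."""
    return root / md5[:2] / md5[2:]


def transfer(md5: str, src_root: Path, dst_root: Path) -> bool:
    """Copy one object from src_root to dst_root; return False if already present."""
    dst = object_path(dst_root, md5)
    if dst.exists():
        return False  # identical content is already stored, nothing to send
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(object_path(src_root, md5), dst)
    return True


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        cache, remote, clone_cache = (Path(tmp) / n for n in ("cache", "remote", "clone"))
        data = b"Id,Species\n1,Iris-setosa\n"
        md5 = hashlib.md5(data).hexdigest()
        obj = object_path(cache, md5)
        obj.parent.mkdir(parents=True)
        obj.write_bytes(data)

        transfer(md5, cache, remote)        # "push": local cache -> remote
        transfer(md5, remote, clone_cache)  # "pull": remote -> another user's cache
        return object_path(clone_cache, md5).read_bytes()
```

The hash recorded in the `*.dvc` file (obtained through Git) is all the second user needs in order to locate the object on the remote.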
{
"cell_type": "markdown",
"id": "9fa914a7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Adding new files and modifying existing ones works much as it does for ordinary files tracked by Git, except that we use the `dvc` command instead of `git`, and we must also remember to manage the `*.dvc` files with Git:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dde39796",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Drop the last line of the file in place (`sponge` comes from the moreutils package)\n",
"!head -n -1 data/Iris.csv | sponge data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f14ec60",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!git status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a841039",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc status"
]
},
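One way to picture how a change like the one above can be detected is to compare the file's current hash with the hash recorded when it was added. The sketch below is an assumption about the general idea, not DVC's actual implementation (which may use additional bookkeeping); `file_md5` and `is_modified` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_modified(path: Path, recorded_md5: str) -> bool:
    """True when the workspace file no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "Iris.csv"
        data.write_text("a\nb\nc\n")
        recorded = file_md5(data)           # the value a *.dvc file would hold
        before = is_modified(data, recorded)
        data.write_text("a\nb\n")           # drop the last line, as in the cell above
        after = is_modified(data, recorded)
        return before, after
```

Git sees no change in this situation because the data file is ignored and the `*.dvc` file is only rewritten once `dvc add` is run again.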
{
"cell_type": "code",
"execution_count": null,
"id": "bf6c1067",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc add data/Iris.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a4865c9",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!git add data/Iris.csv.dvc\n",
"!git commit -m \"Removed last line from Iris dataset\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e2d320",
"metadata": {},
"outputs": [],
"source": [
"!wc -l .dvc/cache/*/*"
]
},
{
"cell_type": "markdown",
"id": "d710977c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### dvc checkout\n",
" - We use `dvc checkout` together with `git checkout` to switch the branch we are working on.\n",
" - DVC replaces the files it tracks with the versions from the other branch (provided those files differ, and so do the `*.dvc` files in the respective branches)\n",
" - switching branches in Git changes the `*.dvc` files (if they differ), and `dvc checkout` then copies/links from `.dvc/cache` the files whose hashes match those recorded in the `*.dvc` files"
]
},
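The checkout step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the hypothetical `checkout` helper receives a hash that would normally be read from a `*.dvc` file, the cache uses the `<2 chars>/<30 chars>` layout seen earlier, and a hard link with a copy fallback stands in for whichever link type is actually configured.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def checkout(md5: str, cache_dir: Path, target: Path) -> None:
    """Materialize the cached object `md5` at `target` (hypothetical helper)."""
    source = cache_dir / md5[:2] / md5[2:]
    if target.exists():
        target.unlink()  # remove the version that belonged to the previous branch
    try:
        os.link(source, target)          # hard link: no extra disk space used
    except OSError:
        shutil.copyfile(source, target)  # fall back to a plain copy


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        cache, target = Path(tmp) / "cache", Path(tmp) / "Iris.csv"
        hashes = []
        for content in (b"version on main\n", b"version on feature\n"):
            md5 = hashlib.md5(content).hexdigest()
            obj = cache / md5[:2] / md5[2:]
            obj.parent.mkdir(parents=True, exist_ok=True)
            obj.write_bytes(content)
            hashes.append(md5)
        seen = []
        for md5 in hashes:  # pretend `git checkout` changed the hash in Iris.csv.dvc
            checkout(md5, cache, target)
            seen.append(target.read_bytes())
        return seen
```

Since both versions already sit in the cache, switching between them needs no download, only a relink.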
{
"cell_type": "markdown",
"id": "5897e8eb",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Sharing data between projects\n",
" - with `dvc import` and `dvc update` we can add, and later update, files tracked by DVC in another repository"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b018146",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc import https://github.com/iterative/dataset-registry \\\n",
"    get-started/data.xml -o data/data.xml"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be2c1a37",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!dvc status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3306c5b7",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"!ls -l data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b73c56ea",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"# %load data/data.xml.dvc\n",
"md5: 9d5921765bfba6c2c7e4c780c66edaa0\n",
"frozen: true\n",
"deps:\n",
"- path: get-started/data.xml\n",
"  repo:\n",
"    url: https://github.com/iterative/dataset-registry\n",
"    rev_lock: 08c38bbea04e4f9e2130615dd679309ed0e11a72\n",
"outs:\n",
"- md5: 22a1a2931c8370d3aeedd7183606fd7f\n",
"  size: 14445097\n",
"  path: data.xml\n"
]
},
{
"cell_type": "markdown",
"id": "db1063ac",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## DVC pipelines\n",
" - introduction: https://youtu.be/71IGzyH95UY\n",
" - Getting started: https://dvc.org/doc/start/data-pipelines\n",
" - DVC pipelines let us build (with the `dvc stage add` command, called `dvc run` in older DVC versions) or define (by editing the `dvc.yaml` file) a dependency graph between the steps of our project (such as \"data preparation\", \"training\", \"evaluation\")\n",
" - a pipeline defined this way can then be run with the `dvc repro` command"
]
},
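The dependency graph mentioned above can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a sketch of the idea only: the stage names and the helpers `execution_order` and `stages_to_rerun` are hypothetical, and nothing here reads an actual `dvc.yaml`.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each maps to the stages whose outputs it depends on.
stages = {
    "prepare": set(),
    "train": {"prepare"},
    "evaluate": {"train"},
}


def execution_order(graph: dict) -> list:
    """Order stages so that every stage runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def stages_to_rerun(graph: dict, changed: str) -> set:
    """Return `changed` plus every stage that depends on it, directly or not."""
    affected = {changed}
    for stage in execution_order(graph):
        if graph[stage] & affected:
            affected.add(stage)
    return affected


print(execution_order(stages))
print(sorted(stages_to_rerun(stages, "train")))
```

This mirrors the "Makefile for machine learning" analogy: when only the training step changes, data preparation does not need to run again.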
{
"cell_type": "markdown",
"id": "e2939867",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Tasks [10+10 pts]\n",
"1. Initialize a DVC repository inside your project repository [1 pt]\n",
"2. Add the data file(s) in your project to DVC [1 pt]\n",
"3. Configure a remote (configuration details are given below) [3 pts]\n",
"4. Create/define a `dvc.yaml` file describing the steps of your project and add it to the repository. Define at least 2 steps (e.g. data preparation and training) linked to each other by dependencies (use the \"Getting started\" materials linked above) [10 pts (optional)]\n",
"5. Create a Jenkins project (`s1233456-dvc`) in which you clone the repository, download the DVC files (with `dvc pull`), and run the pipeline (with `dvc repro`) [5 pts]"
]
},
{
"cell_type": "markdown",
"id": "2f5a8590",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## SSH remote\n",
"One of the remote types supported by DVC is SFTP/SSH.\n",
"To make use of it, a user `ium-sftp` was created and an SFTP server configured on the server tzietkiewicz.vm.wmi.amu.edu.pl.\n",
"An SSH key was also generated for that user and added as a \"Jenkins credential\" (see the Jenkins configuration notes below)"
]
},
{
"cell_type": "markdown",
"id": "82a61107",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Locally\n",
"We will need extra dependencies ([details](https://dvc.org/doc/command-reference/remote/add))\n",
" \n",
" `conda install -c conda-forge dvc-ssh` \n",
"\n",
"or\n",
"\n",
"`pip install dvc[ssh] paramiko`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c48c5b8e",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting package metadata (current_repodata.json): done\n",
"Solving environment: \\ "
]
}
],
"source": [
"conda install -c conda-forge dvc-ssh"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9662b7aa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# The packages below are needed for the dvc remote commands to work:\n",
"!sudo apt install libssl3 libffi7"
]
},
{
"cell_type": "markdown",
"id": "04c41da0",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We add the remote:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9a04876",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote add -f -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3f27bbb",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote list"
]
},
{
"cell_type": "markdown",
"id": "c92edd7b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We store the password (with `--local`, so that it goes to `.dvc/config.local` and is not committed to the repo):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b2fa175",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc remote modify --local ium_ssh_remote password IUM@2021"
]
},
{
"cell_type": "markdown",
"id": "8b83049b",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We push to the configured remote:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea6e16fa",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"!dvc push"
]
},
{
"cell_type": "markdown",
"id": "1468c44c",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Jenkins\n",
"\n",
"In Jenkins, the \"Credentials\" mechanism can be used to pass a password or a private key securely.\n",
"\n",
"Credentials of this kind for the user ium-sftp have been created on Jenkins:\n",
"\n",
" - of type SSH key: https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/credentials/store/system/domain/_/credential/48ac7004-216e-4260-abba-1fe5db753e18/\n",
" - of type \"secret text\", holding the password of the user ium-sftp: https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/credentials/store/system/domain/_/credential/ium-sftp-password/\n",
"\n",
"How to use \"Credentials\" in a Jenkinsfile: https://www.jenkins.io/doc/book/pipeline/jenkinsfile/#for-other-credential-types\n",
"\n",
"The SSH key can be used like this: \n",
"\n",
"```Jenkinsfile\n",
"withCredentials(\n",
"    [sshUserPrivateKey(credentialsId: '48ac7004-216e-4260-abba-1fe5db753e18', keyFileVariable: 'IUM_SFTP_KEY', passphraseVariable: '', usernameVariable: '')]) {\n",
"        // Single-quoted sh strings let the shell, not Groovy, expand $IUM_SFTP_KEY\n",
"        sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
"        sh 'dvc remote modify --local ium_ssh_remote keyfile $IUM_SFTP_KEY'\n",
"        sh 'dvc pull'\n",
"}\n",
"```\n",
"\n",
"The secret text like this:\n",
"\n",
"```Jenkinsfile\n",
"withCredentials([string(credentialsId: 'ium-sftp-password', variable: 'IUM_SFTP_PASS')]) {\n",
"    sh 'dvc remote add -d ium_ssh_remote ssh://ium-sftp@tzietkiewicz.vm.wmi.amu.edu.pl/ium-sftp'\n",
"    sh 'dvc remote modify --local ium_ssh_remote password $IUM_SFTP_PASS'\n",
"    sh 'dvc pull'\n",
"}\n",
"```\n",
"\n",
"Example configuration: \n",
" - https://tzietkiewicz.vm.wmi.amu.edu.pl:8081/job/docker-test-mount/ \n",
" - https://git.wmi.amu.edu.pl/tzietkiewicz/ium-helloworld"
]
}
],
"metadata": {
"author": "Tomasz Ziętkiewicz",
"celltoolbar": "Slideshow",
"email": "tomasz.zietkiewicz@amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"slideshow": {
"slide_type": "slide"
},
"subtitle": "10.DVC[laboratoria]",
"title": "Inżynieria uczenia maszynowego",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 5
}