Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

guide: best-practices section #1748

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions content/docs/sidebar.json
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@
"label": "DVC Files and Directories",
"slug": "dvc-files-and-directories"
},
"best-practices",
"merge-conflicts",
{
"slug": "dvcignore",
Expand Down
135 changes: 135 additions & 0 deletions content/docs/user-guide/best-practices.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
# Best Practices for DVC Projects

DVC provides a systematic approach towards managing and collaborating on data
science projects. You can manage your projects with DVC more efficiently using
the practices listed here:

## Source code and data versioning

You can use DVC to avoid discrepancies between
[revisions](https://git-scm.com/docs/revisions) of source code and
[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
replaces all large data files, models, etc. with small
[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
files point to the original data, which you can access by first checking out the
required `revision` using Git followed by `dvc checkout` to update DVC tracked
data files/dir:

```dvc
$ git checkout 95485f # Git commit of required data version
$ dvc checkout
```

If your dataset consist of multiple files like images, etc. then the best way to
track whole directory is with single `.dvc` file. You can use `dvc add` with
relative path to directory:

```dvc
$ dvc add data/images
```

## Experiments and tracking parameters

You can use DVC for tuning [parameters](doc/command-reference/params), improving
target [metrics](doc/command-reference/metrics) and visualizing the changes with
[plots](doc/command-reference/plots). In the first step tune parameters in
default `params.yaml` file and reproduce the pipeline:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not bad! But per #1705 (review) a previous best practice about how to organize DVC project experiments with Git is needed. The 2 basic options are: as commits in a single branch (plus tags), and as multiple branches (one per experiment).


```dvc
$ dvc repro # Reproducing pipeline
$ git add -am "Epoch Experiment"
```

Commit the new changes in files using Git. Next step is to compare the
experiments. Use `dvc metrics` to find difference in target metric between two
commits:

```dvc
$ dvc metrics diff rev1 rev2
```

And finally you can plot target metrics using `dvc plots`:

```dvc
$ dvc plots diff -x recall -y precision rev1 rev2
```

If you want to recover a model from last week without wasting time required for
the model to retrain you can use DVC to navigate through your experiments. First
you can checkout the required `revision` using Git:

```dvc
$ git checkout baseline-experiment # Git commit, tag or branch
$ dvc checkout
```

Followed by `dvc checkout` to update DVC-tracked files and directories in your
workspace.

If you are training different models on your data files in the same project,
using Git commits, tags, or branches makes it easy to manage the project. Have a
look at this [example]() to see how this works.

## Reproducibility

You can run a model's evaluation process again without actually retraining the
model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
partially. You can use `dvc repro` to execute evaluation stage without
reproducing complete pipeline:

```dvc
$ dvc repro evaluate
```

## Managing and sharing large data files

Cloud or local storage can be used to store the project's data. You can share
the entire 147 GB of your ML project, with all of its data sources, intermediate
data files, and models with others if they are stored on
[remote storage](doc/command-reference/remote/add#supported-storage-types).
Using this you can share models trained in a GPU environment with colleagues who
don't have access to a GPU. Have a look at this
[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
to see how this works.

## Manually editing dvc.yaml or .dvc files

It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:

```yaml
stages:
prepare:
cmd: python src/prepare.py data/data.xml
deps:
- data/data.xml
params:
- prepare.split
outs:
- data/prepared
```

You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
files please remember not to change the `md5` or `checksum` fields as they
contain hash values which DVC uses to track the file or directory.

## Never store credentials in project config

Do not store any user credentials in project config file. This file can be found
by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
options with `dvc config` for storing sensitive, or user-specific settings:

```dvc
$ dvc config --system remote.username [password]
```

## Tracking <abbr>outputs</abbr> by Git

If your `output` files are small in size and you want to track them with Git
then you can use `--outs-no-cache` option to define outputs while creating or
modifying a stage. DVC will not track will not track outputs in this case:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
---outs-no-cache model.p \
python src/train.py data/features model.pkl
```
17 changes: 17 additions & 0 deletions content/docs/user-guide/tips-and-tricks.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Tips and tricks for DVC Projects

This guide provides general tips and tricks related to DVC, which can be
utilized while working on a project. Using the practices listed here, you can
manage your projects with DVC more efficiently.

## Using meta in dvc.yaml or .dvc files

DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
used to add any user specific information. It also supports YAML content.

## Switching between datasets

You can quickly switch between a large dataset and a small subset without
modifying source code. To achieve this you need to change dependencies of
relevant stage either by using `dvc run` with the `-f` option or by manually
editing the stage in `dvc.yaml` file.