guide: best-practices section #1748

Closed
wants to merge 4 commits into from
Changes from 3 commits
1 change: 1 addition & 0 deletions content/docs/sidebar.json
@@ -91,6 +91,7 @@
"label": "DVC Files and Directories",
"slug": "dvc-files-and-directories"
},
"best-practices",
"merge-conflicts",
{
"slug": "dvcignore",
133 changes: 133 additions & 0 deletions content/docs/user-guide/best-practices.md
@@ -0,0 +1,133 @@
# Best Practices for DVC Projects

DVC provides a systematic approach towards managing and collaborating on data
science projects. Here are a few recommended practices to organize your workflow
and project structure effectively:

> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).

## Matching source code to data

One of DVC's basic uses is to avoid a disconnection between
[revisions](https://git-scm.com/docs/revisions) of source code and
[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
large data files and directories, models, etc. with small
Comment on lines +9 to +14
Contributor

This isn't really a best practice, just an intro to data tracking with DVC. I guess it could stay here... Not sure 🤔

[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
Git, along with the corresponding code.

These metafiles point to the original data, which is <abbr>cached</abbr>
automatically. You can access it later by restoring that Git working tree (e.g.
with `git checkout`) and using `dvc checkout` to update the DVC-tracked data
files and directories:

```dvc
$ git checkout 95485f # Git commit of a desired project version
$ dvc checkout
```

> See
> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
> for more details.

## Using directories as single data units

If your dataset consists of multiple files (images, for example), the best way
to track it is
[as a directory](/doc/command-reference/add#adding-entire-directories), with a
single `.dvc` file:

```dvc
$ dvc add data/images/
```

## Manually editing dvc.yaml or .dvc files

It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    params:
      - prepare.split
    outs:
      - data/prepared
```

You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
files please remember not to change the `md5` or `checksum` fields as they
contain hash values which DVC uses to track the file or directory.

## Managing and sharing large data

Traditional or cloud storage can be used to store the project's data. By setting
up an optional DVC [remote storage](/doc/command-reference/remote), you can
share the entire 147 GB of your ML project with others, including all of its
data sources, intermediate data files, and models.

This way you can share models trained in a GPU environment with colleagues who
don't have access to GPUs.
Comment on lines +63 to +71
Contributor

But what's the best practice?


## Never store secrets in the shared config file

Do not put user credentials in the default config file (`.dvc/config`), which is
tracked by Git. Use the `--local`, `--global`, or `--system` options of
`dvc config` to provide sensitive or user-specific settings:

```dvc
$ dvc config --local remote.myremote.password mypassword # just here; myremote is a placeholder remote name
$ dvc config --global core.checksum_jobs 16 # all my projects
$ dvc config --system core.check_update false # all users
```

## Tracking experiments with Git

If you are training different models on your data files in the same project,
using Git commits, tags, or branches makes it easy to manage the project.

<!-- TODO: needs much elaboration! -->
Contributor

Ditto (in the TODO)


## Basic experimentation flow

Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning
their [parameters](/doc/command-reference/params), tracking resulting
[metrics](/doc/command-reference/metrics), and visualizing their evolution with
[plots](/doc/command-reference/plots).

For example, let's first set up some parameters in `params.yaml` and reproduce
the pipeline:

<!-- TODO: sample params file -->
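
As a minimal sketch, `params.yaml` might look like the following. It assumes the
`prepare.split` parameter from the `dvc.yaml` example above; the value shown and
the `seed` key are hypothetical and only illustrate the layout:

```yaml
# params.yaml (illustrative values, not taken from a real project)
prepare:
  split: 0.20 # referenced in dvc.yaml as prepare.split
  seed: 20170428 # hypothetical extra parameter
```

Because `prepare.split` is listed under `params` in `dvc.yaml`, DVC treats it as
a dependency of the `prepare` stage when reproducing the pipeline: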

```dvc
$ dvc repro
```

<!-- TODO: what about the command output above? -->

Commit the changes using Git. Having some commits allows us to compare the
experiments using `dvc metrics diff`:

```dvc
$ dvc metrics diff rev1 rev2
```

<!-- TODO: command output above? -->

Finally, you can see how certain metrics evolved using `dvc plots diff`:

```dvc
$ dvc plots diff -x recall -y precision rev1 rev2
```

<!-- TODO: insert plot img -->

If you want to recover a model from last week without spending the time required
to retrain it, you can use Git and DVC to navigate through your experiments:

```dvc
$ git checkout baseline-experiment # Git commit, tag or branch
$ dvc checkout
```
40 changes: 40 additions & 0 deletions content/docs/user-guide/tips-and-tricks.md
@@ -0,0 +1,40 @@
# Tips and tricks for DVC Projects

Using the methods listed here, you can manage your DVC projects more
efficiently.

## Switching between datasets

You can quickly switch between a large dataset and a small subset without
modifying source code: change the dependencies of a stage, either by manually
editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`.

<!-- TODO: needs actual example -->
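
One way this could look in `dvc.yaml` is sketched below. The stage name, script,
and the `data/full` and `data/sample` paths are all hypothetical; the sketch
assumes the script takes the dataset path as a command-line argument, so only
the stage definition changes:

```yaml
stages:
  train:
    # Previously: python src/train.py data/full model.pkl
    cmd: python src/train.py data/sample model.pkl
    deps:
      - src/train.py
      - data/sample # was data/full
    outs:
      - model.pkl
```

Switching back is a matter of restoring `data/full` in both `cmd` and `deps`.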

## Tracking small data with Git

If your output files are small and you want to track them with Git, you can use
the `--outs-no-cache` option to define outputs while creating or modifying a
stage. DVC will not cache the outputs in this case:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
          --outs-no-cache model.pkl \
          python src/train.py data/features model.pkl
```
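
For reference, a stage created this way would be expected to appear in
`dvc.yaml` roughly as sketched below, with the output marked `cache: false`. The
exact layout may differ between DVC versions, so treat this as an illustration
rather than guaranteed output:

```yaml
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - data/features
      - src/train.py
    outs:
      - model.pkl:
          cache: false # not cached by DVC, so it can be committed with Git
```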

## Partial reproducibility

You can run a model's evaluation process again without preprocessing a raw
dataset again, or retraining the model. Pass a target stage to `dvc repro` to
execute only the necessary parts of the pipeline:

```dvc
$ dvc repro evaluate
```

## User metadata in DVC metafiles

DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
(that's very meta!). It can hold any user information as YAML content, e.g.
`"a string"`.
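
A minimal sketch of the field in `dvc.yaml`, assuming `meta` is placed at the
stage level. The stage reuses the `prepare` example from the best-practices
guide, and the `owner` and `note` keys are hypothetical:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    outs:
      - data/prepared
    meta:
      owner: data-team # hypothetical key
      note: "a string"
```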