guide: review Best Practices and tips&tricks so far...
jorgeorpinel committed Sep 20, 2020
1 parent 887f2c1 commit c30116b
Showing 2 changed files with 118 additions and 97 deletions.
170 changes: 84 additions & 86 deletions content/docs/user-guide/best-practices.md
Original file line number Diff line number Diff line change
@@ -1,135 +1,133 @@
# Best Practices for DVC Projects

DVC provides a systematic approach towards managing and collaborating on data
science projects. You can manage your projects with DVC more efficiently using
the practices listed here:
science projects. Here are a few recommended practices to organize your workflow
and project structure effectively:

## Source code and data versioning
> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).
You can use DVC to avoid discrepancies between
## Matching source code to data

One of DVC's basic uses is to avoid a disconnection between
[revisions](https://git-scm.com/docs/revisions) of source code and
[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
replaces all large data files, models, etc. with small
[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
files point to the original data, which you can access by first checking out the
required `revision` using Git followed by `dvc checkout` to update DVC tracked
data files/dir:
[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
large data files and directories, models, etc. with small
[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
Git, along with the corresponding code.

These metafiles point to the original data, which is <abbr>cached</abbr>
automatically. You can access it later by restoring the corresponding Git
revision (e.g. with `git checkout`) and then using `dvc checkout` to update the
DVC-tracked files and directories:

```dvc
$ git checkout 95485f # Git commit of required data version
$ git checkout 95485f # Git commit of a desired project version
$ dvc checkout
```

> See
> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
> for more details.
## Using directories as single data units

If your dataset consists of multiple files like images, etc. then the best way to
track whole directory is with single `.dvc` file. You can use `dvc add` with
relative path to directory:
track it is
[as a directory](/doc/command-reference/add#adding-entire-directories), with a
single `.dvc` file:

```dvc
$ dvc add data/images
$ dvc add data/images/
```

## Experiments and tracking parameters
## Manually editing dvc.yaml or .dvc files

You can use DVC for tuning [parameters](doc/command-reference/params), improving
target [metrics](doc/command-reference/metrics) and visualizing the changes with
[plots](doc/command-reference/plots). In the first step tune parameters in
default `params.yaml` file and reproduce the pipeline:
It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:

```dvc
$ dvc repro # Reproducing pipeline
$ git commit -am "Epoch Experiment"
```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    params:
      - prepare.split
    outs:
      - data/prepared
```
Commit the new changes in files using Git. Next step is to compare the
experiments. Use `dvc metrics` to find difference in target metric between two
commits:
You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
files please remember not to change the `md5` or `checksum` fields as they
contain hash values which DVC uses to track the file or directory.

```dvc
$ dvc metrics diff rev1 rev2
```
## Managing and sharing large data

And finally you can plot target metrics using `dvc plots`:
Traditional or cloud storage can be used to store the project's data. You can
share the entire 147 GB of your ML project, with all of its data sources,
intermediate data files, and models with others by setting up DVC
[remote storage](/doc/command-reference/remote) (optional).

```dvc
$ dvc plots diff -x recall -y precision rev1 rev2
```
This way you can share models trained in a GPU environment with colleagues who
don't have access to GPUs.

## Never store secrets in the shared config file

If you want to recover a model from last week without wasting time required for
the model to retrain you can use DVC to navigate through your experiments. First
you can checkout the required `revision` using Git:
Do not put user credentials in the default config file (`.dvc/config`), which is
tracked by Git. Use the `--local`, `--global`, or `--system` options of
`dvc config` to provide sensitive or user-specific settings:

```dvc
$ git checkout baseline-experiment # Git commit, tag or branch
$ dvc checkout
$ dvc config --local remote.password mypassword # just here
$ dvc config --global core.checksum_jobs 16 # all my projects
$ dvc config --system core.check_update false # all users
```

Followed by `dvc checkout` to update DVC-tracked files and directories in your
workspace.
## Tracking experiments with Git

If you are training different models on your data files in the same project,
using Git commits, tags, or branches makes it easy to manage the project. Have a
look at this [example]() to see how this works.
using Git commits, tags, or branches makes it easy to manage the project.

## Reproducibility
<!-- TODO: needs much elaboration! -->

You can run a model's evaluation process again without actually retraining the
model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
partially. You can use `dvc repro` to execute evaluation stage without
reproducing complete pipeline:
## Basic experimentation flow

```dvc
$ dvc repro evaluate
```
Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning
their [parameters](/doc/command-reference/params), tracking resulting
[metrics](/doc/command-reference/metrics), and visualizing their evolution with
[plots](/doc/command-reference/plots).

## Managing and sharing large data files
For example, let's first set up some parameters in `params.yaml` and reproduce
the pipeline:

Cloud or local storage can be used to store the project's data. You can share
the entire 147 GB of your ML project, with all of its data sources, intermediate
data files, and models with others if they are stored on
[remote storage](doc/command-reference/remote/add#supported-storage-types).
Using this you can share models trained in a GPU environment with colleagues who
don't have access to a GPU. Have a look at this
[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
to see how this works.
<!-- TODO: sample params file -->
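
A minimal sketch of what such a `params.yaml` might contain, assuming the
`prepare.split` parameter referenced in this guide's `dvc.yaml` example. The
value shown is purely illustrative, not taken from a real project:

```yaml
# params.yaml (illustrative values only)
prepare:
  split: 0.2 # assumed: fraction of the data held out by the prepare stage
```

A stage that lists `prepare.split` under `params` should be considered changed
by `dvc repro` whenever this value is edited.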

## Manually editing dvc.yaml or .dvc files
```dvc
$ dvc repro
```

It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
<!-- TODO: what about the command output above? -->

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    params:
      - prepare.split
    outs:
      - data/prepared
```
Commit the changes using Git. Having some commits allows us to compare the
experiments using `dvc metrics diff`:

You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
files please remember not to change the `md5` or `checksum` fields as they
contain hash values which DVC uses to track the file or directory.
```dvc
$ dvc metrics diff rev1 rev2
```

## Never store credentials in project config
<!-- TODO: command output above? -->

Do not store any user credentials in project config file. This file can be found
by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
options with `dvc config` for storing sensitive, or user-specific settings:
Finally, you can see how certain metrics evolved using `dvc plots diff`:

```dvc
$ dvc config --system remote.username [password]
$ dvc plots diff -x recall -y precision rev1 rev2
```

## Tracking <abbr>outputs</abbr> by Git
<!-- TODO: insert plot img -->

If your `output` files are small in size and you want to track them with Git,
you can use the `--outs-no-cache` option to define outputs while creating or
modifying a stage. DVC will not track the outputs in this case:
If you want to recover a model from last week without wasting time required to
retrain the model, you can use Git and DVC to navigate through your experiments:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
          --outs-no-cache model.pkl \
          python src/train.py data/features model.pkl
$ git checkout baseline-experiment # Git commit, tag or branch
$ dvc checkout
```
45 changes: 34 additions & 11 deletions content/docs/user-guide/tips-and-tricks.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,40 @@
# Tips and tricks for DVC Projects

This guide provides general tips and tricks related to DVC, which can be
utilized while working on a project. Using the practices listed here, you can
manage your projects with DVC more efficiently.

## Using meta in dvc.yaml or .dvc files

DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
used to add any user specific information. It also supports YAML content.
Using the methods listed here, you can manage your DVC projects more
efficiently.

## Switching between datasets

You can quickly switch between a large dataset and a small subset without
modifying source code. To achieve this you need to change dependencies of
relevant stage either by using `dvc run` with the `-f` option or by manually
editing the stage in `dvc.yaml` file.
modifying source code: change the dependencies of the relevant stage, either by
manually editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`.

<!-- TODO: needs actual example -->
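
As a rough sketch of the manual-editing route, assume a hypothetical `train`
stage whose script receives the dataset path as a command-line argument. The
stage name and the `data/full` and `data/small` paths are placeholders, not
part of any real project:

```yaml
# dvc.yaml (hypothetical stage; paths are placeholders)
stages:
  train:
    cmd: python src/train.py data/small model.pkl # was: data/full
    deps:
      - src/train.py
      - data/small # was: data/full
    outs:
      - model.pkl
```

Because the dependency list and command changed, running `dvc repro` afterwards
should re-execute the stage against the smaller dataset, while `src/train.py`
itself stays untouched.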

## Tracking small data with Git

If your `output` files are small in size and you want to track them with Git,
you can use the `--outs-no-cache` option to define outputs while creating or
modifying a stage. DVC will not track the outputs in this case:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
          --outs-no-cache model.pkl \
          python src/train.py data/features model.pkl
```

## Partial reproducibility

You can run a model's evaluation process again without preprocessing a raw
dataset again, or retraining the model. Pass a target stage to `dvc repro` to
execute only the necessary parts of the pipeline:

```dvc
$ dvc repro evaluate
```

## User metadata in DVC metafiles

DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
(that's very meta!). It can be used to add any user information (as YAML content
e.g. `"a string"`).
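
For illustration, a stage in `dvc.yaml` could carry a `meta` entry like the one
below. The keys under `meta` (`owner`, `notes`) are made-up examples, and the
sketch assumes that `meta` sits at the stage level and that DVC keeps its
content without interpreting it:

```yaml
# dvc.yaml (the meta keys below are hypothetical, user-defined content)
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    outs:
      - data/prepared
    meta:
      owner: data-team # assumed example key
      notes: "a string" # any YAML content
```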
