guide: begin How To section #1705

Merged
merged 28 commits into from
Sep 28, 2020
1fd3f9f
Revert "guide: undo starting How To subsection"
jorgeorpinel Aug 10, 2020
c66b888
resolving conflict
imhardikj Aug 21, 2020
f3c8776
Merge branch 'master' into guide/how-to
imhardikj Aug 21, 2020
8880af5
update best practices
imhardikj Aug 22, 2020
11ab85b
Merge branch 'master' into guide/how-to
imhardikj Aug 24, 2020
93cb036
Best practices update
imhardikj Aug 24, 2020
3a39654
adding best pratices
imhardikj Aug 25, 2020
b2af801
modifying best pratices
imhardikj Aug 27, 2020
2994cf8
Update content/docs/user-guide/how-to/best-practices.md
jorgeorpinel Aug 27, 2020
ba02f17
updates
imhardikj Aug 29, 2020
ac6e997
updates
imhardikj Aug 29, 2020
eb67860
Update best-practices.md
imhardikj Aug 29, 2020
fb62cb1
Update best-practices.md
imhardikj Aug 29, 2020
8121897
removing best practice doc
imhardikj Sep 12, 2020
a3a5837
Undo dvc add doc
imhardikj Sep 18, 2020
c030f4f
Update content/docs/user-guide/how-to/undo-dvc-add.md
jorgeorpinel Sep 19, 2020
f0e4c79
updates
imhardikj Sep 20, 2020
da456cd
updates
imhardikj Sep 20, 2020
288b627
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
7e3ac80
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
3139491
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
dfb9824
updates
imhardikj Sep 24, 2020
33504c9
updates
imhardikj Sep 24, 2020
8e15350
Update content/docs/command-reference/add.md
jorgeorpinel Sep 24, 2020
d5f422d
Update content/docs/command-reference/add.md
jorgeorpinel Sep 24, 2020
e9edbdd
updates
imhardikj Sep 25, 2020
5bfd2c8
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 28, 2020
c7f30b7
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 28, 2020
7 changes: 6 additions & 1 deletion content/docs/sidebar.json
@@ -98,7 +98,12 @@
"katacoda": "https://katacoda.com/dvc/courses/examples/dvcignore"
}
},
"updating-tracked-files",
{
"label": "How To",
"slug": "how-to",
"source": false,
"children": ["best-practices", "update-tracked-files"]
},
"setup-google-drive-remote",
"large-dataset-optimization",
"external-dependencies",
136 changes: 136 additions & 0 deletions content/docs/user-guide/how-to/best-practices.md
@@ -0,0 +1,136 @@
# Best Practices for DVC Projects

Ask data scientists, engineers, or managers how they collaborate on data science
projects and you'll get a variety of answers. DVC provides a systematic approach
to managing and collaborating on data science projects. You can manage your
projects with DVC more efficiently using the practices listed here:

- Source code and data versioning

You can use DVC to avoid discrepancies between
[revisions](https://git-scm.com/docs/revisions) of source code and versions of
data files when the data doesn't fit into a traditional repository. DVC
replaces all large data files, models, etc. with small
[metafiles](/doc/user-guide/dvc-files-and-directories) (tracked by Git). These
files point to the original data, which you can access by checking out the
required `revision`.

- Experiments

You can use Git branches to document the progress of training different
types of models on your data files in the same project. Create a branch for
each model, then use DVC features while working on that branch.

- Experiment time log

[Hyperparameters](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
are defined using the `--params` option of `dvc run`, and the default
parameters file is `params.yaml`. You can commit different versions of
`params.yaml` and then use `dvc metrics` or `dvc plots` to track which of your
changes contributed the most to improving the target
[metric](/doc/command-reference/metrics). You can also monitor the degree of
each change.
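
As a rough sketch of how this might look, the fragments below show a hypothetical `train` stage in `dvc.yaml` that reads two hyperparameters from `params.yaml` and writes a metrics file. The parameter names (`train.epochs`, `train.lr`), their values, and `scores.json` are assumptions for illustration only:

```yaml
# params.yaml (hypothetical hyperparameters)
train:
  epochs: 10
  lr: 0.001
```

```yaml
# dvc.yaml (hypothetical stage that depends on the parameters above)
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - src/train.py
      - data/features
    params:
      - train.epochs
      - train.lr
    outs:
      - model.pkl
    metrics:
      - scores.json:
          cache: false
```

With a layout like this, committing each change to `params.yaml` together with the resulting metrics file lets `dvc metrics diff` compare metric values between revisions.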

- Navigating through experiments

To recover a model from last week without waiting for the model to retrain,
first check out the required `revision` with `git checkout`, then run
`dvc checkout` to update the DVC-tracked files and directories in your workspace.

- Switching between datasets

You can quickly switch between a large dataset and a small subset without
modifying source code. To achieve this, you need to change the dependencies of
the relevant stage, either by using `dvc run` with the `-f` option or by
manually editing the stage in the `dvc.yaml` file.
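
For instance, assuming the small subset lives in `data/sample` and the full dataset in `data/full` (both hypothetical paths, as are the stage and script names), switching datasets by hand could amount to changing one dependency of the stage in `dvc.yaml`. The command line changes too in this sketch, but only because the hypothetical script takes the data path as an argument:

```yaml
stages:
  featurize:
    # To switch datasets, replace data/sample with data/full here and in deps.
    cmd: python src/featurize.py data/sample data/features
    deps:
      - src/featurize.py
      - data/sample   # hypothetical small subset
    outs:
      - data/features
```

Running `dvc repro` afterwards would re-execute the stage, since its dependencies have changed.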

- Reproducibility

You can run a model's evaluation process again without retraining the
model or preprocessing the raw dataset. DVC provides a way to reproduce
pipelines partially. You can use `dvc repro` to execute the evaluation stage
without reproducing the complete pipeline:

```dvc
$ dvc repro evaluate
```
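
To make this concrete, consider a hypothetical three-stage pipeline in `dvc.yaml` (the `prepare` and `train` stage names, the scripts, and the paths are assumptions). DVC re-executes only stages whose dependencies have changed, so if only `src/evaluate.py` was edited, `dvc repro evaluate` would skip `prepare` and `train`:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw data/features
    deps:
      - src/prepare.py
      - data/raw
    outs:
      - data/features
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - src/train.py
      - data/features
    outs:
      - model.pkl
  evaluate:
    cmd: python src/evaluate.py model.pkl data/features scores.json
    deps:
      - src/evaluate.py
      - model.pkl
      - data/features
    metrics:
      - scores.json:
          cache: false
```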

- Managing and sharing large data files

Cloud or local storage can be used to store the project's data. You can share
the entire 147 GB of your ML project, with all of its data sources,
intermediate data files, and models, with others if they are stored on
[remote storage](/doc/command-reference/remote/add#supported-storage-types).
This way you can share models trained in a GPU environment with colleagues
who don't have access to a GPU. Have a look at this
[example](/doc/command-reference/pull#example-download-from-specific-remote-storage)
to see how this works.

- Manually editing dvc.yaml or .dvc files

It's safe to edit `dvc.yaml` and `.dvc` files. You can manually change most of
the fields in these files. However, do not change the `md5` or `checksum`
fields in `.dvc` files, as they contain hash values that DVC uses to track the
file or directory.
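
For reference, a `.dvc` file for a single tracked file may look roughly like the sketch below. The file name and the hash value are placeholders; the `md5` line is the one to leave untouched:

```yaml
# data.xml.dvc (placeholder file name and hash)
outs:
  - md5: a304afb96060aad90176268345e10355   # managed by DVC; do not edit
    path: data.xml
```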

- Never store credentials in project config

Do not store any user credentials in the project config file, which is found by
default in `.dvc/config`. Use the `--local`, `--global`, or `--system` options
of `dvc config` to store sensitive or user-specific settings:

```dvc
$ dvc config --system remote.username [password]
```

- Tracking <abbr>outputs</abbr> by Git

If the `outs` are small files and you want to track them with Git, you can use
the `--outs-no-cache` option to define outputs while creating or modifying a
stage. DVC will not cache the outputs in this case:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
          --outs-no-cache model.pkl \
          python src/train.py data/features model.pkl
```
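
A stage created this way would typically be recorded in `dvc.yaml` with `cache: false` on the output, roughly as sketched below (assuming the `train` stage from the command above):

```yaml
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - src/train.py
      - data/features
    outs:
      - model.pkl:
          cache: false   # not stored in the DVC cache, so Git can track it
```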

---

## Questions on...

### Source code and data versioning

- How do you avoid discrepancies between
[revisions](https://git-scm.com/docs/revisions) of source code and versions of
data files, when the data cannot fit into a traditional repository?

### Experiment time log

- How do you track which of your
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
changes contributed the most to producing or improving your target
[metric](/doc/command-reference/metrics)? How do you monitor the degree of
each change?

### Navigating through experiments

- How do you recover a model from last week without wasting time waiting for the
model to retrain?

- How do you quickly switch between a large dataset and a small subset without
modifying source code?

### Reproducibility

- How do you run a model's evaluation process again without retraining the model
and preprocessing a raw dataset?

### Managing and sharing large data files

- How do you share models trained in a GPU environment with colleagues who don't
have access to a GPU?

- How do you share the entire 147 GB of your ML project, with all of its data
sources, intermediate data files, and models?
10 changes: 10 additions & 0 deletions content/docs/user-guide/how-to/tips-and-tricks.md
@@ -0,0 +1,10 @@
# Tips and tricks for DVC Projects

This guide provides general tips and tricks related to DVC, which can be
utilized while working on a project. Using the practices listed here, you can
manage your projects with DVC more efficiently.

### Using meta in dvc.yaml or .dvc files

DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` files. It can be
used to add any user-specific information, and it accepts any YAML content.
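
A minimal sketch of what this could look like in a `.dvc` file follows. The file name and hash are placeholders, and the keys under `meta` (`owner`, `description`, `tags`) are made-up examples, since the names here are entirely up to the user:

```yaml
# data.xml.dvc (placeholder file name and hash)
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
meta:
  owner: data-team            # hypothetical user-defined key
  description: Raw XML dump   # hypothetical user-defined key
  tags: [raw, xml]            # nested YAML values work too
```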