guide: begin How To section #1705

Merged
merged 28 commits into from
Sep 28, 2020
Changes from 7 commits
Commits
1fd3f9f
Revert "guide: undo starting How To subsection"
jorgeorpinel Aug 10, 2020
c66b888
resolving conflict
imhardikj Aug 21, 2020
f3c8776
Merge branch 'master' into guide/how-to
imhardikj Aug 21, 2020
8880af5
update best practices
imhardikj Aug 22, 2020
11ab85b
Merge branch 'master' into guide/how-to
imhardikj Aug 24, 2020
93cb036
Best practices update
imhardikj Aug 24, 2020
3a39654
adding best pratices
imhardikj Aug 25, 2020
b2af801
modifying best pratices
imhardikj Aug 27, 2020
2994cf8
Update content/docs/user-guide/how-to/best-practices.md
jorgeorpinel Aug 27, 2020
ba02f17
updates
imhardikj Aug 29, 2020
ac6e997
updates
imhardikj Aug 29, 2020
eb67860
Update best-practices.md
imhardikj Aug 29, 2020
fb62cb1
Update best-practices.md
imhardikj Aug 29, 2020
8121897
removing best practice doc
imhardikj Sep 12, 2020
a3a5837
Undo dvc add doc
imhardikj Sep 18, 2020
c030f4f
Update content/docs/user-guide/how-to/undo-dvc-add.md
jorgeorpinel Sep 19, 2020
f0e4c79
updates
imhardikj Sep 20, 2020
da456cd
updates
imhardikj Sep 20, 2020
288b627
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
7e3ac80
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
3139491
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 23, 2020
dfb9824
updates
imhardikj Sep 24, 2020
33504c9
updates
imhardikj Sep 24, 2020
8e15350
Update content/docs/command-reference/add.md
jorgeorpinel Sep 24, 2020
d5f422d
Update content/docs/command-reference/add.md
jorgeorpinel Sep 24, 2020
e9edbdd
updates
imhardikj Sep 25, 2020
5bfd2c8
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 28, 2020
c7f30b7
Update content/docs/user-guide/how-to/undo-adding-data.md
jorgeorpinel Sep 28, 2020
7 changes: 6 additions & 1 deletion content/docs/sidebar.json
@@ -98,7 +98,12 @@
"katacoda": "https://katacoda.com/dvc/courses/examples/dvcignore"
}
},
"updating-tracked-files",
{
"label": "How To",
"slug": "how-to",
"source": false,
"children": ["best-practices", "update-tracked-files"]
},
"setup-google-drive-remote",
"large-dataset-optimization",
"external-dependencies",
74 changes: 74 additions & 0 deletions content/docs/user-guide/how-to/best-practices.md
@@ -0,0 +1,74 @@
# Best Practices for DVC Projects

Data scientists, engineers, or managers may already know or can easily find
answers to some of these questions. However, the variety of answers and
approaches makes data science collaboration a nightmare. **A systematic approach
is required.**

## Questions on...

### Source code and data versioning

- How do you avoid discrepancies between
[revisions](https://git-scm.com/docs/revisions) of source code and versions of
data files, when the data cannot fit into a traditional repository?

  DVC replaces large data files, models, etc. with small
  [metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
  files point to the original data, which you can restore by checking out the
  required `revision` with Git and then running `dvc checkout`.
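
As a rough sketch of how a data file ends up represented by a metafile in Git, the helper below runs `dvc add` and commits the resulting `.dvc` file. The function name `track_with_dvc` and the example path are hypothetical, and the sketch assumes the path has no trailing slash and that DVC writes the `.gitignore` entry next to the data.

```shell
# Sketch: put a data file under DVC control and commit its metafile to Git.
# Assumes the current directory is inside an initialized Git + DVC project.
track_with_dvc() {
  data_path="$1"
  # dvc add creates "<data_path>.dvc" and ignores the data itself in Git.
  dvc add "$data_path" &&
    git add "$data_path.dvc" "$(dirname "$data_path")/.gitignore" &&
    git commit -m "Track $data_path with DVC"
}
```

For example, `track_with_dvc data/data.xml` would leave only `data/data.xml.dvc` and `data/.gitignore` in the Git history, not the data file itself.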

### Experiments

- How do you document progress of training different types of models on your
data files in the same project?

  You can use a separate Git branch for each model and apply DVC features while
  working on that branch.

### Experiment time log

- How do you track which of your
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
changes contributed the most to producing or improving your target
[metric](doc/command-reference/metrics)? How do you monitor the degree of each
change?

  Hyperparameters are defined using the `--params` option of `dvc run`; the
  default parameters file is `params.yaml`. You can commit different versions
  of `params.yaml` and then use `dvc metrics` or `dvc plots` to see which
  parameter changes contribute most to the change in your metric.
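
The parameter workflow above might look roughly like the following sketch. The stage name `train`, the files `train.py`, `data`, `model.pkl`, and `metrics.json`, and the parameter names `lr` and `epochs` are all placeholders, not part of any real project.

```shell
# Sketch: define a stage that depends on two hypothetical parameters
# read from the default params.yaml file.
define_train_stage() {
  dvc run -n train \
    -p lr,epochs \
    -d train.py -d data \
    -o model.pkl \
    -M metrics.json \
    python train.py
}

# Sketch: compare parameters and metrics in the workspace against a
# given Git revision (for example HEAD~1 or a branch name).
compare_with() {
  rev="$1"
  dvc params diff "$rev" && dvc metrics diff "$rev"
}
```

After committing a changed `params.yaml` and reproducing the stage, `compare_with HEAD~1` would show the parameter changes next to the resulting metric changes.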

### Navigating through experiments

- How do you recover a model from last week without wasting time waiting for the
model to retrain?

  First, check out the required `revision` with `git checkout`, then run
  `dvc checkout` to update DVC-tracked files and directories in your workspace.

- How do you quickly switch between a large dataset and a small subset without
modifying source code?

  You can change the dependencies of the relevant stage, either by using
  `dvc run` with the `-f` option or by manually editing the stage in the
  `dvc.yaml` file.
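
The recovery steps from the first answer in this section can be sketched as a small helper. The function name `restore_revision` is hypothetical, and the sketch assumes the model for that revision is still available in the DVC cache (or has been fetched from remote storage).

```shell
# Sketch: bring code and DVC-tracked data back to a given Git revision.
restore_revision() {
  rev="$1"
  # Git restores the source code and the small metafiles...
  git checkout "$rev" &&
    # ...and DVC then restores the data and models they point to.
    dvc checkout
}
```

Calling `restore_revision` with a tag, branch, or commit hash from last week restores the matching model without retraining.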

### Reproducibility

- How do you run a model's evaluation process again without retraining the model
and preprocessing a raw dataset?

  DVC can reproduce pipelines partially. You can use `dvc repro` to execute the
  evaluation stage without reproducing the complete pipeline.
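
One possible way to do this is sketched below. The helper name `rerun_only` and the stage name `evaluate` are hypothetical; `--single-item` tells `dvc repro` to run just the given stage, whereas without it DVC also checks the upstream stages and re-runs those that changed.

```shell
# Sketch: re-run a single stage without touching the stages before it.
rerun_only() {
  stage="$1"
  dvc repro --single-item "$stage"
}
```

For instance, `rerun_only evaluate` would repeat the evaluation while leaving preprocessing and training as they are.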

### Managing and sharing large data files

- How do you share models trained in a GPU environment with colleagues who don't
have access to a GPU?

- How do you share the entire 147 GB of your ML project, with all of its data
sources, intermediate data files, and models?

  Cloud or local storage can be used to store the project's data. Once large
  data files and models are stored on
  [remote storage](doc/command-reference/remote/add#supported-storage-types),
  others can download them without having to regenerate them.
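
A minimal sketch of the sharing side, assuming a remote location you have write access to. The helper name `publish_data` and the remote name and URL passed to it are placeholders.

```shell
# Sketch: configure a default DVC remote, record it in Git, and upload
# all DVC-tracked data and models to it.
publish_data() {
  remote_name="$1"
  remote_url="$2"
  dvc remote add -d "$remote_name" "$remote_url" &&
    git add .dvc/config &&
    git commit -m "Configure DVC remote $remote_name" &&
    dvc push
}
```

Colleagues who clone or pull the Git repository afterwards can run `dvc pull` to download the data and models, including ones trained on hardware they do not have.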