guide: review Best Practices and tips&tricks so far...

iterative · Sep 20, 2020 · c30116b
1 parent 887f2c1
commit c30116b

Showing 2 changed files with 118 additions and 97 deletions.
diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
@@ -1,135 +1,133 @@
 # Best Practices for DVC Projects
 
 DVC provides a systematic approach towards managing and collaborating on data
-science projects. You can manage your projects with DVC more efficiently using
-the practices listed here:
+science projects. Here are a few recommended practices to organize your workflow
+and project structure effectively:
 
-## Source code and data versioning
+> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).
 
-You can use DVC to avoid discrepancies between
+## Matching source code to data
+
+One of DVC's basic uses is to avoid a disconnection between
 [revisions](https://git-scm.com/docs/revisions) of source code and
-[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
-replaces all large data files, models, etc. with small
-[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
-files point to the original data, which you can access by first checking out the
-required `revision` using Git followed by `dvc checkout` to update DVC tracked
-data files/dir:
+[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
+large data files and directories, models, etc. with small
+[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
+Git, along with the corresponding code.
+
+These metafiles point to the original data, which is <abbr>cached</abbr>
+automatically. You can access it later by restoring that Git working tree (e.g.
+with `git checkout`) and using `dvc checkout` to update DVC-tracked files and
+directories:
 
 ```dvc
-$ git checkout 95485f   # Git commit of required data version
+$ git checkout 95485f  # Git commit of a desired project version
 $ dvc checkout
 ```
 
+> See
+> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
+> for more details.
+
+## Using directories as single data units
+
 If your dataset consists of multiple files like images, etc. then the best way to
-track whole directory is with single `.dvc` file. You can use `dvc add` with
-relative path to directory:
+track it is
+[as a directory](/doc/command-reference/add#adding-entire-directories), with a
+single `.dvc` file:
 
 ```dvc
-$ dvc add data/images
+$ dvc add data/images/
 ```
 
-## Experiments and tracking parameters
+## Manually editing dvc.yaml or .dvc files
 
-You can use DVC for tuning [parameters](doc/command-reference/params), improving
-target [metrics](doc/command-reference/metrics) and visualizing the changes with
-[plots](doc/command-reference/plots). In the first step tune parameters in
-default `params.yaml` file and reproduce the pipeline:
+It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
 
-```dvc
-$ dvc repro        # Reproducing pipeline
-$ git add -am "Epoch Experiment"
+```yaml
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - data/data.xml
+    params:
+      - prepare.split
+    outs:
+      - data/prepared
 ```
 
-Commit the new changes in files using Git. Next step is to compare the
-experiments. Use `dvc metrics` to find difference in target metric between two
-commits:
+You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
+files please remember not to change the `md5` or `checksum` fields as they
+contain hash values which DVC uses to track the file or directory.
 
-```dvc
-$ dvc metrics diff rev1 rev2
-```
+## Managing and sharing large data
 
-And finally you can plot target metrics using `dvc plots`:
+Traditional or cloud storage can be used to store the project's data. You can
+share the entire 147 GB of your ML project, with all of its data sources,
+intermediate data files, and models with others by setting up DVC
+[remote storage](/doc/command-reference/remote) (optional).
 
-```dvc
-$ dvc plots diff -x recall -y precision rev1 rev2
-```
+This way you can share models trained in a GPU environment with colleagues who
+don't have access to GPUs.
+
+## Never store secrets in the shared config file
 
-If you want to recover a model from last week without wasting time required for
-the model to retrain you can use DVC to navigate through your experiments. First
-you can checkout the required `revision` using Git:
+Do not put user credentials in the default config file (`.dvc/config`), which is
+tracked by Git. Use the `--local`, `--global`, or `--system` options of
+`dvc config` to provide sensitive or user-specific settings:
 
 ```dvc
-$ git checkout baseline-experiment   # Git commit, tag or branch
-$ dvc checkout
+$ dvc config --local remote.myremote.password mypassword  # just here
+$ dvc config --global core.checksum_jobs 16      # all my projects
+$ dvc config --system core.check_update false    # all users
 ```
 
-Followed by `dvc checkout` to update DVC-tracked files and directories in your
-workspace.
+## Tracking experiments with Git
 
 If you are training different models on your data files in the same project,
-using Git commits, tags, or branches makes it easy to manage the project. Have a
-look at this [example]() to see how this works.
+using Git commits, tags, or branches makes it easy to manage the project.
 
-## Reproducibility
+<!-- TODO: needs much elaboration! -->
 
-You can run a model's evaluation process again without actually retraining the
-model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
-partially. You can use `dvc repro` to execute evaluation stage without
-reproducing complete pipeline:
+## Basic experimentation flow
 
-```dvc
-$ dvc repro evaluate
-```
+Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning
+their [parameters](/doc/command-reference/params), tracking resulting
+[metrics](/doc/command-reference/metrics), and visualizing their evolution with
+[plots](/doc/command-reference/plots).
 
-## Managing and sharing large data files
+For example, let's first set up some parameters in `params.yaml` and reproduce
+the pipeline:
 
-Cloud or local storage can be used to store the project's data. You can share
-the entire 147 GB of your ML project, with all of its data sources, intermediate
-data files, and models with others if they are stored on
-[remote storage](doc/command-reference/remote/add#supported-storage-types).
-Using this you can share models trained in a GPU environment with colleagues who
-don't have access to a GPU. Have a look at this
-[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
-to see how this works.
+<!-- TODO: sample params file -->
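+
+As a minimal sketch, assuming the `prepare` stage from the `dvc.yaml` example
+above (which declares the `prepare.split` parameter), `params.yaml` could look
+like this. The value shown is hypothetical:
+
+```yaml
+# params.yaml (hypothetical value, for illustration only)
+prepare:
+  split: 0.20
+```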
 
-## Manually editing dvc.yaml or .dvc files
+```dvc
+$ dvc repro
+```
 
-It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
+<!-- TODO: what about the command output above? -->
 
-```yaml
-stages:
-  prepare:
-    cmd: python src/prepare.py data/data.xml
-    deps:
-      - data/data.xml
-    params:
-      - prepare.split
-    outs:
-      - data/prepared
-```
+Commit the changes using Git. Having some commits allows us to compare the
+experiments using `dvc metrics diff`:
 
-You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
-files please remember not to change the `md5` or `checksum` fields as they
-contain hash values which DVC uses to track the file or directory.
+```dvc
+$ dvc metrics diff rev1 rev2
+```
 
-## Never store credentials in project config
+<!-- TODO: command output above? -->
 
-Do not store any user credentials in project config file. This file can be found
-by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
-options with `dvc config` for storing sensitive, or user-specific settings:
+Finally, you can see how certain metrics evolved using `dvc plots diff`:
 
 ```dvc
-$ dvc config --system remote.username [password]
+$ dvc plots diff -x recall -y precision rev1 rev2
 ```
 
-## Tracking <abbr>outputs</abbr> by Git
+<!-- TODO: insert plot img -->
 
-If your `output` files are small in size and you want to track them with Git
-then you can use `--outs-no-cache` option to define outputs while creating or
-modifying a stage. DVC will not track will not track outputs in this case:
+If you want to recover a model from last week without wasting time required to
+retrain the model, you can use Git and DVC to navigate through your experiments:
 
 ```dvc
-$ dvc run -n train -d src/train.py -d data/features \
-          ---outs-no-cache model.p \
-          python src/train.py data/features model.pkl
+$ git checkout baseline-experiment   # Git commit, tag or branch
+$ dvc checkout
 ```
diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md
@@ -1,17 +1,40 @@
 # Tips and tricks for DVC Projects
 
-This guide provides general tips and tricks related to DVC, which can be
-utilized while working on a project. Using the practices listed here, you can
-manage your projects with DVC more efficiently.
-
-## Using meta in dvc.yaml or .dvc files
-
-DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
-used to add any user specific information. It also supports YAML content.
+Using the methods listed here, you can manage your DVC projects more
+efficiently.
 
 ## Switching between datasets
 
 You can quickly switch between a large dataset and a small subset without
-modifying source code. To achieve this you need to change dependencies of
-relevant stage either by using `dvc run` with the `-f` option or by manually
-editing the stage in `dvc.yaml` file.
+modifying source code: change the dependencies of the relevant stage, either by
+manually editing it in `dvc.yaml` or by using `dvc run` again with `-f`.
+
+<!-- TODO: needs actual example -->
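+
+A rough sketch of the manual edit, assuming a hypothetical `train` stage whose
+script takes the data path as an argument, and hypothetical `data/small` and
+`data/full` directories (none of these names come from an actual project):
+
+```yaml
+stages:
+  train:
+    # Hypothetical paths: replace data/small with data/full (in both cmd
+    # and deps) to switch back to the large dataset.
+    cmd: python src/train.py data/small model.pkl
+    deps:
+      - src/train.py
+      - data/small
+    outs:
+      - model.pkl
+```
+
+Since the data path is passed on the command line, `src/train.py` itself stays
+unchanged when switching datasets.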
+
+## Tracking small data with Git
+
+If your `output` files are small in size and you want to track them with Git
+then you can use `--outs-no-cache` option to define outputs while creating or
+modifying a stage. DVC will not track the outputs in this case:
+
+```dvc
+$ dvc run -n train -d src/train.py -d data/features \
+          --outs-no-cache model.pkl \
+          python src/train.py data/features model.pkl
+```
+
+## Partial reproducibility
+
+You can run a model's evaluation process again without preprocessing a raw
+dataset again, or retraining the model. Pass a target stage to `dvc repro` to
+execute only the necessary parts of the pipeline:
+
+```dvc
+$ dvc repro evaluate
+```
+
+## User metadata in DVC metafiles
+
+DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
+(that's very meta!). It can hold any user information as YAML content (e.g.
+`"a string"`).
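+
+For illustration, a `dvc.yaml` stage carrying a `meta` field might look like
+the sketch below. The stage name, script path, and `meta` keys are all
+hypothetical:
+
+```yaml
+stages:
+  evaluate:
+    cmd: python src/evaluate.py model.pkl
+    deps:
+      - model.pkl
+    meta:
+      # Hypothetical, free-form keys chosen by the user
+      owner: Jane Doe
+      note: "a string"
+```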