ref: exp init improvements #3071

Merged
merged 18 commits into from
Dec 23, 2021
Changes from 17 commits
14 changes: 7 additions & 7 deletions content/docs/command-reference/exp/index.md
@@ -26,19 +26,19 @@ usage: dvc exp [-h] [-q | -v]

positional arguments:
COMMAND
init Quickly setup any project to use DVC Experiments.
run Reproduce complete or partial experiment pipelines.
show Print experiments.
apply Apply the changes from an experiment to your
workspace.
diff Show changes between experiments in the DVC
repository.
run Reproduce complete or partial experiment pipelines.
gc Garbage collect unneeded experiments.
branch Promote an experiment to a Git branch.
list List local and remote experiments.
apply Apply the changes from an experiment to your
workspace.
branch Promote an experiment to a Git branch.
remove Remove local experiments.
gc Garbage collect unneeded experiments.
push Push a local experiment to a Git remote.
pull Pull an experiment from a Git remote.
remove Remove local experiments.
init Codify project using DVC metafiles to run experiments.
```

## Description
230 changes: 154 additions & 76 deletions content/docs/command-reference/exp/init.md
@@ -1,7 +1,6 @@
# exp init

Codify project using [DVC metafiles](/doc/user-guide/project-structure) to run
[experiments](/doc/user-guide/experiment-management).
Quickly set up any project to use [DVC Experiments].
Comment on lines 1 to +3
Contributor Author

I meant this (per the last check box in this PR desc.) @dberenbaum @skshetry. Currently `dvc exp init -h` prints "Initialize DVC in the current directory. Expects directory to be a Git repository..." Do we want to update that text?

Collaborator

@dberenbaum dberenbaum Jan 14, 2022

I think that's the `dvc init -h` text 😄. `dvc exp init -h` says "Initialize experiments". However, I think we agreed this new language is better, so I'm happy to put in a PR to update it in the core repo.

Contributor Author

> I think that's the `dvc init -h` text 😄

yep 🤦 sry. OK, thanks for the PR


> Requires a <abbr>DVC repository</abbr>, created with `git init` and
> `dvc init`.
@@ -19,43 +18,60 @@ usage: dvc exp init [-h] [-q | -v] [--run] [--interactive] [-f]

## Description
Collaborator

Somewhere in the description, it might be useful to explain that `dvc exp init` by default expects that input data, parameters, and source code paths exist before running an experiment, and that the command is expected to generate models, metrics, and plots.

Contributor Author

@jorgeorpinel jorgeorpinel Dec 21, 2021

That's more of a problem for `exp run` though, @dberenbaum (already linked from the `--run` option). But should we state that `exp run` (and `repro` for that matter) expect that the stage definition and code are good? Hopefully it's evident.

Collaborator

> That's more of a problem for `exp run` though

Well that's the point of `exp init`, right? Better to have users understand what's needed up front than to have them run `exp init` only to fail on `exp run`.

Doesn't need to be part of this PR. It could also be handled by some of the suggested changes to the core command rather than the docs.

Contributor Author

@jorgeorpinel jorgeorpinel Dec 22, 2021

> Better to have users understand what's needed up front than to have them run `exp init` only to fail on `exp run`.

I'm not sure. If users need to read that on the cmd ref., is `exp init` serving as a self-explanatory command? I think those notes should be in the command's output first (and added to this doc as a result of that product update).


`dvc exp init` helps you quickly get started with experiments. It reduces
boilerplate for initializing [pipeline](/doc/command-reference/dag) stages in a
`dvc.yaml` file by assuming defaults about the location of your data,
[parameters](/doc/command-reference/params), source code, models,
[metrics](/doc/command-reference/metrics) and
[plots](/doc/command-reference/plots), which can be customized through config.
`dvc exp init` helps you get started with DVC Experiments quickly. It reduces
boilerplate DVC procedures by creating a `dvc.yaml` file that assumes standard
locations of your input data, <abbr>parameters</abbr>, source code, models,
<abbr>metrics</abbr> and [plots](/doc/command-reference/plots). These locations
can be customized through the [options](#options) below or via
[configuration](/doc/command-reference/config#exp).

It also offers guided `--interactive` mode for creating a stage to be
[`exp run`](/doc/command-reference/exp/run) later. `dvc exp init` supports
creating different types of stages, eg: `dl` if you are doing deep learning,
which uses [dvclive](/doc/dvclive) to monitor and checkpoint progress during
training of machine learning models.
Repository structure assumed by default:

This command is intended to be a quick way to start running experiments. To
create more complex stages and pipelines, use `dvc stage add`.
```
├── data/
├── metrics.json
├── models/
├── params.yaml # required
├── plots/
└── src/
```

> Note that `dvc exp init` expects at least a `params.yaml` file present. DVC
> reads it to find parameters to include in the [stage definition]. It can
> however be omitted when using the `--explicit` and/or `-i` flags.
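
As a rough illustration of the default layout above, the following sketch scaffolds the input side of it in a scratch directory. The directory created by `mktemp` and the `epochs: 10` entry are hypothetical placeholders; only the file and directory names come from the tree above. Models, metrics, and plots are left out on the assumption that the experiment command produces them.

```shell
# Scaffold the inputs that `dvc exp init` assumes by default (illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Input data and source code directories.
mkdir -p data src
# params.yaml is the one file marked as required; the key below is a placeholder.
printf 'epochs: 10\n' > params.yaml

# Show what was created.
ls -1
```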

> 📖 More context in [Experiments Overview].
You must always provide a command that runs your experiment(s). It can be given
either directly [as an argument](#the-command-argument), or by using the
`--interactive` (`-i`) mode which will prompt you for it. This command will be
wrapped as a <abbr>stage</abbr> that `dvc exp run` can execute.

[experiments overview]:
/doc/user-guide/experiment-management/experiments-overview
Different types of stages are supported, such as `dl` (deep learning) which uses
[DVCLive](/doc/dvclive) to monitor [checkpoints] during training of ML models.

> `dvc exp init` is intended as a quick way to start running [DVC Experiments].
> See the `dvc.yaml` specification for complex data pipelines.

[stage definition]:
/doc/user-guide/project-structure/pipelines-files#stage-entries
[checkpoints]: /doc/user-guide/experiment-management/checkpoints
[dvc experiments]: /doc/user-guide/experiment-management/experiments-overview

### The `command` argument

The `command` argument is optional, if you are using `--interactive` mode. The
Comment on lines 59 to -45
Contributor Author

@jorgeorpinel jorgeorpinel Dec 10, 2021

BTW a question I have is whether to keep this section or move `command` under Options like we did for `targets` in https://dvc.org/doc/command-reference/repro#options. I personally like the section more, but I remember we discussed this (cc @shcheklein) and using Options was picked, so for consistency I'd move this under Options as well.

From #3015 (review)

`command` sent to `dvc exp init` can be anything your terminal would accept and
run directly, for example a shell built-in, expression, or binary found in
`PATH`. Please remember that any flags sent after the `command` are interpreted
by the command itself, not by `dvc exp init`.
The command given to `dvc exp init` can be anything your system terminal would
accept and run directly, for example a shell built-in, an expression, or a
binary found in `PATH`. Please note that any flags sent after the `command`
argument will normally become part of that command itself and be ignored by
`dvc exp init` (so provide it last).

⚠️ While DVC is platform-agnostic, the commands defined in your
[pipeline](/doc/command-reference/dag) stages may only work on some operating
systems and require certain software packages to be installed.
⚠️ While DVC is platform-agnostic, commands defined in `dvc.yaml` (`cmd` field)
may only work on some operating systems and require certain software packages or
libraries in the environment.

Wrap the command with double quotes `"` if there are special characters in it
like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
`dvc exp init` itself. Use single quotes `'` instead if there are environment
variables in it that should be evaluated dynamically. Examples:
Surround the command with double quotes `"` if it includes special characters
like `|` or `<`, `>` -- otherwise they would apply to `dvc exp init` itself. Use
single quotes `'` instead if there are environment variables in it that should
be evaluated dynamically.
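
The difference between the two quoting styles can be checked in a plain shell, without DVC. In this sketch `MYENVVAR` and its value are placeholders; it only shows which string the shell would hand over as the `command` argument in each case.

```shell
MYENVVAR=hello

# Double quotes: the shell expands $MYENVVAR before the string is passed on.
double_quoted="./another_script.sh $MYENVVAR"
# Single quotes: the text is passed literally, so $MYENVVAR stays unexpanded
# and can be evaluated later, when the command itself runs.
single_quoted='./another_script.sh $MYENVVAR'

printf '%s\n' "$double_quoted"   # prints: ./another_script.sh hello
printf '%s\n' "$single_quoted"   # prints: ./another_script.sh $MYENVVAR
```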

```dvc
$ dvc exp init "./a_script.sh > /dev/null 2>&1"
@@ -64,71 +80,133 @@ $ dvc exp init './another_script.sh $MYENVVAR'

## Options

- `-i`, `--interactive` - prompts user for the command to execute and different
paths for tracking outputs and dependencies, unless they are provided through
arguments explicitly. Interactive mode allows users to set those locations
from default values or omit them.
- `-i`, `--interactive` - prompts user for a command that runs your
experiment(s) (see [details](#the-command-argument)) and to confirm or define
the paths that make up your repo's structure.

- `--explicit` - `dvc exp init` assumes default locations for your outputs and
dependencies (which can be overridden from the config). By using `--explicit`,
it will not use those default values while initializing experiments. In
`--interactive` mode, prompts won't set default values, and all values
need to be explicitly provided or omitted.
- `-n <stage>`, `--name <stage>` - specify a custom name for the stage generated
by this command. The default is `train`. It can only contain letters, numbers,
dash `-` and underscore `_` (same as `dvc stage add --name`).

- `--code` - override the path to your source file or directory which your
experiment depends on. The default is `src` directory for your code.
- `--run` - automatically run the experiment after creating the stage (same as
`dvc exp run`).

- `--data` - override the path to your data file or directory to track, which
your experiment depends on. The default is `data` directory.
- `--type` - selects the type of the stage to create. Currently it provides two
alternatives: `dl` and `default` (no need to specify this one).

- `--params` - override the path to
[parameter dependencies](/doc/command-reference/params) which your experiment
depends on. The default parameters file name is `params.yaml`. Note that
`dvc exp init` may fail if the parameters file does not exist at the time of
the invocation, as DVC reads the file to find parameters to track for the
stage.
`dl` stages are intended for use in deep-learning scenarios, where metrics and
plots are tracked with [DVCLive](/doc/dvclive). This also supports logging
[checkpoints](/doc/command-reference/exp/run#checkpoints) during the training
of DL models.

- `--model` - override the path to your models file or directory to track, which
your experiment produces. `dvc exp init` assumes `models` directory by
default.
- `--code` - set the path to the file or directory where the source code that
your experiment depends on can be found (if any). Overrides other
configuration and default value (`src/`).

- `--metrics` - override the path to metrics file to track, which your
experiment produces. Default is `metrics.json` file.
- `--params` - set the path to the file or directory where the
<abbr>parameters</abbr> that your experiment depends on can be found.
Overrides other configuration and default value (`params.yaml`).

- `--plots` - override the path to plots file or directory, which your
experiment produces. The default is `plots`.
> Note that `dvc exp init` will fail if the params file does not exist. This
> is because DVC reads it to find params to include in the [stage definition].

- `--live` - override the directory `path` for [DVCLive](/doc/dvclive), which
your experiment will write logs to. The default is `dvclive` directory, which
only comes to effect when used with `--type=dl`.
- `--data` - set the path to the data file or directory that your experiment
depends on (if any). Overrides other configuration and default value
(`data/`).

- `--type` - selects the type of the stage to create. Currently it provides two
different kinds of stages: `default` and `dl`. If unspecified, `default` stage
is created.
- `--model` - set the path to the file or directory where the model(s) produced
by your experiment can be found (if any). Overrides other configuration and
default value (`models/`).

`default` stage creates a stage with `metrics` and `plots` tracked by DVC
itself, and does not track live-created artifacts (unless explicitly
specified).
> 💡 This could be used for any artifacts produced by your experiment.

`dl` stage is intended for use in deep-learning scenarios, where metrics and
plots are tracked by [dvclive](/doc/dvclive) and supports tracking progress
while training a deep-learning model with
[checkpoints](/doc/command-reference/exp/run#checkpoints).
- `--metrics` - set the path to the file or directory where the metrics produced
by your experiment can be found (if any). Overrides other configuration and
default value (`metrics.json`).

- `-n <stage>`, `--name <stage>` - specify a custom name for the stage generated
by this command (e.g. `-n train`). The default is `train`.
- `--plots` - set the path to the file or directory where the plots produced by
your experiment can be found (if any). Overrides other configuration and
default value (`plots/`).

Note that the stage name can only contain letters, numbers, dash `-` and
underscore `_`.
- `--live` - configure the `path` directory for [DVCLive](/doc/dvclive). This is
where experiment logs will be written. Overrides other configuration and
default value (`dvclive/`).

- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation.
> This only has an effect when used with `--type=dl`.

- `--run` - runs the experiment after initializing it.
- `--explicit` - do not assume default locations of project dependencies and
outputs. You'll have to provide specific locations via other options or
`dvc config exp`. In `--interactive` mode, this removes default values from prompts.

- `-f`, `--force` - overwrite an existing stage in `dvc.yaml` file without
asking for confirmation (same as `dvc stage add --force`).

- `-h`, `--help` - prints the usage/help message and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.
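
The options above can be combined in a non-interactive call. The sketch below only assembles and prints such a command line instead of running it, so it works without DVC installed. The paths describe a hypothetical training project and are assumptions; the flag names are the ones listed above, and the command to run is given last.

```shell
# Build the argument list; nothing is executed here.
set -- dvc exp init \
  --name train \
  --code src/train.py \
  --data data/features \
  --model models/predict.h5 \
  "python src/train.py"

# Print one argument per line to show that the quoted command stays a single argument.
printf '%s\n' "$@"
```

Keeping the experiment command as the final, quoted argument follows the advice that anything after it is treated as part of the command rather than as an option to `dvc exp init`.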

## Example: interactive mode

Let's prepare an ML model training script to start running experiments on it.
The easiest route is using interactive mode and answering a few questions:

```dvc
Collaborator

Why not a model training example? Data ingestion is a nice application to show how this can be useful even for non-training stages, but model training is the primary use case. It's unlikely for someone to run experiments on a data ingestion script.

Contributor Author

@jorgeorpinel jorgeorpinel Dec 22, 2021

The data ingestion example produced a very small `dvc.yaml` file and then I had that note about using `stage add` to produce the same. Idk, I thought it was interesting. Anyway, I agree the training scenario is more applicable so I replaced all that now.

$ dvc exp init --interactive
This command will guide you to set up a train stage in dvc.yaml.
See https://s.dvc.org/g/pipeline-files.

Command to execute: python src/train.py

Enter the paths for dependencies and outputs of the command.
DVC assumes the following workspace structure:
├── data
├── metrics.json
├── models
├── params.yaml
├── plots
└── src

Path to a code file/directory [src, n to omit]: src/train.py
Path to a data file/directory [data, n to omit]: data/features
Path to a model file/directory [models, n to omit]: models/predict.h5
Path to a parameters file [params.yaml, n to omit]:
Path to a metrics file [metrics.json, n to omit]:
Path to a plots file/directory [plots, n to omit]: n
...
```

In this example the code, data, and model locations were specified above to
avoid using the defaults (which are too broad). `params.yaml` and `metrics.json`
are accepted (pressed Enter) for <abbr>parameters</abbr> and
<abbr>metrics</abbr>. Plots are omitted (entered `n`) as none will be written.

The resulting `dvc.yaml` file codifies the meta-information you provided in
DVC's format:

```yaml
train:
  cmd: python src/train.py
  deps:
    - data/features
    - src/train.py
  params:
    - epochs
  outs:
    - models/predict.h5
  metrics:
    - metrics.json:
        cache: false
```

> Notes:
>
> - `train` is the default stage name unless you provide one with the `--name`
> option.
> - The `epochs` param was obtained from the `params.yaml` file. Any other param
> keys found there would all be listed under `params:` automatically.
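
To make the second note concrete, here is a minimal sketch of a `params.yaml` that would yield a single `epochs` entry. The value `10` is a placeholder, and the `sed` call merely approximates how top-level keys could be listed; it is not how DVC parses the file.

```shell
workdir=$(mktemp -d)
printf 'epochs: 10\n' > "$workdir/params.yaml"

# List top-level keys (unindented lines of the form "key: value").
keys=$(sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p' "$workdir/params.yaml")
printf '%s\n' "$keys"   # prints: epochs
```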

The next step would be to tune `params.yaml` or improve `src/train.py` directly,
and start [running experiments](/doc/command-reference/exp/run).
16 changes: 8 additions & 8 deletions content/docs/sidebar.json
@@ -254,6 +254,10 @@
"slug": "exp",
"source": "exp/index.md",
"children": [
{
"label": "exp init",
"slug": "init"
},
{
"label": "exp run",
"slug": "run"
@@ -262,14 +266,14 @@
"label": "exp show",
"slug": "show"
},
{
"label": "exp init",
"slug": "init"
},
{
"label": "exp diff",
"slug": "diff"
},
{
"label": "exp list",
"slug": "list"
},
{
"label": "exp apply",
"slug": "apply"
@@ -293,10 +297,6 @@
{
"label": "exp pull",
"slug": "pull"
},
{
"label": "exp list",
"slug": "list"
}
]
},
6 changes: 4 additions & 2 deletions content/docs/user-guide/project-structure/pipelines-files.md
@@ -20,8 +20,8 @@ so you may modify, write, or generate stages and pipelines on your own.

## Stages

The `stages` list contains a list of user-defined stages. Here's a simple one
named `transpose`:
The list of `stages` contains one or more user-defined stages. Here's a simple
one named `transpose`:

```yaml
stages:
@@ -33,6 +33,8 @@ stages:
- columns.txt
```

> See also `dvc stage add`, a helper command to write stages in `dvc.yaml`.

The most important part of a stage is the terminal command(s) it executes
(`cmd` field). This is what DVC runs when the stage is reproduced (see
`dvc repro`).