Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: exp init #6446

Closed
13 of 16 tasks
dberenbaum opened this issue Aug 16, 2021 · 32 comments · Fixed by #7497
Closed
13 of 16 tasks

Epic: exp init #6446

dberenbaum opened this issue Aug 16, 2021 · 32 comments · Fixed by #7497
Assignees
Labels
A: experiments Related to dvc exp

Comments

@dberenbaum
Copy link
Collaborator

dberenbaum commented Aug 16, 2021

Parent issue for lightweight command for initializing experiments.

Linked issues:

Small tasks that are left to do:

@dberenbaum dberenbaum added the A: experiments Related to dvc exp label Aug 16, 2021
@karajan1001
Copy link
Contributor

karajan1001 commented Aug 17, 2021

Maybe we can have a label for these summaries?

@skshetry
Copy link
Member

skshetry commented Sep 28, 2021

From the @dberenbaum's comment on #6681 (comment).

How do you anticipate this (dvclive scenario) working for non-interactive mode?

I'm still wondering whether the template option is needed (solely as a cli option, not as an actual template; it could be renamed) to separate the dvclive scenario and provide flexibility to add other scenarios in the future. What do you think?

One way may be to make --live mandatory for dvclive scenario unless live is set on the config (though there's a question about what to do if metrics and plots are also specified in the config). Maybe keeping live from config is also a way to go.

Another would be to create a --template or something similar flag (--with={live, etc.})?

@dberenbaum
Copy link
Collaborator Author

Another would be to create a --template or something similar flag (--with={live, etc.})?

Although it adds another option, I like this mainly because it's more explicit. Now that we don't have templates, it's less transparent what the command and various options are doing. Having a separate option for this suggests that it's not just another path to add but actually changes the stage configuration. We can also further explain it in the help text for that option.

On a similar note, at least interactive mode should probably have a confirmation so that users can see what will be added to dvc.yaml beforehand. There could be an option to do the same outside of interactive mode (--dry?). Not sure if this has been discussed before for dvc stage add.

@skshetry

This comment has been minimized.

@daavoo

This comment has been minimized.

@skshetry

This comment has been minimized.

@dberenbaum
Copy link
Collaborator Author

We should probably include docs work also.

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Oct 6, 2021

I noticed a couple of workflow issues when doing some QA of dvc exp init:

  1. It's unclear when or what to git add. If I don't git add the dvc.yaml or the dependencies, dvc exp init works in the workspace but not in temp.
  2. If I am adding a data dependency, I most likely want to dvc add it rather than git add.

cc @pmrowla

@pmrowla
Copy link
Contributor

pmrowla commented Oct 8, 2021

New/untracked files aren't pulled into the --temp/--queue temporary workspace by default.

from https://dvc.org/doc/user-guide/experiment-management/running-experiments#the-experiments-queue:

Note that Git-ignored files/dirs are explicitly excluded from queued/temp runs to avoid committing unwanted files into Git (e.g. once successful experiments are persisted).

💡 To include untracked files, stage them with git add first (before dvc exp run) and git reset them afterwards.

It sounds like we need:

  • deps from exp init should get .dvc files (dvc added)
  • exp run --temp should just automatically stage any untracked .dvc or dvc.yaml files (it seems like this is probably the desired default behavior in all cases, not just exp init specific ones?)
    • this doesn't address everything and there may be other untracked files that users actually need to manually stage for their experiment, but we can't really automate behavior for non-dvc-related files

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Oct 8, 2021

deps from exp init should get .dvc files (dvc added)

This is only true for the data dependency, since the code and params dependencies probably should be git tracked. We might just start with a hint here. Doing a dvc add automatically might be unexpected and make dvc exp init much slower. I don't love the hint idea, but it's an easy starting point. 🤔

exp run --temp should just automatically stage any untracked .dvc or dvc.yaml files (it seems like this is probably the desired default behavior in all cases, not just exp init specific ones?)

  • this doesn't address everything and there may be other untracked files that users actually need to manually stage for their experiment, but we can't really automate behavior for non-dvc-related files

👍 Can we also try to add (not force but at least do a regular git add) any other dependencies of the experiment?

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Oct 14, 2021

Discussed with @pmrowla and he noted that iterating over stages and git-adding everything during exp run could be time-consuming when doing things like queuing an experiment. It seems easier to git add during exp init since we already have all the stage info. Thoughts on adding a git add operation at the end of the command @skshetry? This might make sense as a dvc stage add option along the lines of #5932.


Would it be possible to mark the data dependency as dvc-tracked in a lightweight way? For example, add data to .gitignore and create a data.dvc file with contents:

outs:
- path: data

This also seems like it might be useful generally as an option in dvc add (similar to https://git-scm.com/docs/git-add#Documentation/git-add.txt---intent-to-add but likely more useful given the size of data dvc handles). Thoughts/suggestions?

@skshetry
Copy link
Member

skshetry commented Oct 19, 2021

Till now, we have considered dependency to be a git-tracked or dvc-tracked dependency, so we do not track them by default. But makes sense to do for data in exp init.

Would it be confusing to create data.dvc and dvc.yaml files?

Regarding the git add by default, we have core.autostage config that does this automatically. I don't see any issues with doing this by default (it's only dvc.yaml/.gitignore right?).

@skshetry
Copy link
Member

@dberenbaum, should we also dvc init by default on dvc exp init?

@dberenbaum
Copy link
Collaborator Author

Would it be confusing to create data.dvc and dvc.yaml files?

Ideally, we could specify in dvc.yaml that data is a DVC-tracked dependency, but I think it's probably easier to add data.dvc than to figure out new dvc.yaml syntax for now, unless you have an idea to do that easily. Adding data.dvc might be awkward, but it also introduces a basic concept to users. What do you think?

I don't see any issues with doing this by default (it's only dvc.yaml/.gitignore right?).

👍

@dberenbaum, should we also dvc init by default on dvc exp init?

👍

@skshetry
Copy link
Member

skshetry commented Oct 26, 2021

@pmrowla, do we need to also stage params file? Or, does experiments does some magic for those?

EDIT: looks like we do need to stage that file ourselves.

skshetry added a commit that referenced this issue Nov 11, 2021
It'll give a error message to the user and reprompt them.
This mechanism can be extended to other prompts, and we
can show a warning message and/or reprompt them based on
validations.

Closes #6865. Part of #6446.
@daavoo
Copy link
Contributor

daavoo commented Nov 12, 2021

Just wanted to upvote:

Use train as a default name? (exp init: iterate on ui, tree and show to-be-generated yaml content #6750 (comment))

At least for the --type live option, having the name of the added stage (live) to be the same as one of the output fields feels a little strange.

And regardless of whether is a deep learning (live) experiment or not, the action to be performed in the stage would be the same (so train sounds good)

@dberenbaum
Copy link
Collaborator Author

What about naming the types to match the use cases, like classical ml and deep learning? ml and dl? This might be more user-friendly since it focuses less on the products/features involved, plus dvclive might expand to non-deep learning use cases, which makes things even more confusing.

@karajan1001
Copy link
Contributor

I'm sorry, I do not quite understand why live is a dl only thing?

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Nov 30, 2021

Two reasons:

  1. The stage includes checkpoints.
  2. Dvclive was initially created for "live" monitoring of steps during training.

I would consider these to mostly be useful when doing deep learning, although maybe that's my personal bias. @karajan1001 do you see those being useful in other scenarios?

However, dvclive may become more of a general-purpose logger, in which case it may be useful in any ml scenario. That's one reason to consider renaming dvc exp init --type live.

@iesahin
Copy link

iesahin commented Dec 1, 2021

My 2c:

--type may be renamed to --feature for a more flexible approach.

dvc exp init --feature live --feature some-future-feature

may place nicely together then.

@skshetry
Copy link
Member

skshetry commented Dec 1, 2021

@iesahin, dvc exp init is targeted more in terms of scenarios rather than features, which we have plenty of in dvc stage add.

@dberenbaum
Copy link
Collaborator Author

@skshetry Can we go with dl or deep for now since that's the focus of the scenario? We don't even need to rename the default.

@dberenbaum
Copy link
Collaborator Author

init dvc repo if uninitialized
Add data as a dvc-tracked directory

After discussion, these are being de-prioritized for now since they are not simple changes and are not strictly necessary.

@skshetry
Copy link
Member

skshetry commented Dec 2, 2021

Can we go with dl or deep for now since that's the focus of the scenario? We don't even need to rename the default.

deep does not convey enough information, I think. Is DL a common abbreviation?

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Dec 2, 2021

Is DL a common abbreviation?

Yeah, it's not like ML, but it's fairly common. Google "deep learning abbreviation."

@skshetry
Copy link
Member

skshetry commented Dec 6, 2021

@dberenbaum, with --dl, I don't think dl as a default stage name makes sense. Maybe we should go with train then?

@dberenbaum
Copy link
Collaborator Author

What's the status @skshetry?

@jorgeorpinel
Copy link
Contributor

Raise issue if dvc is initialized with --no-scm or if the project is not git init-ed.
should we also dvc init by default on dvc exp init?

I think exp init could even do both git/dvc init if needed (#7109). But definitely at least the latter 👍

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Dec 7, 2021

Also, I noticed params.yaml is the only required part of the default project structure (even if it's empty). Should exp init create an empty one (with some comment on top perhaps) if it's not there? Otherwise it fails.

@dberenbaum
Copy link
Collaborator Author

Also, I noticed params.yaml is the only required part of the default project structure (even if it's empty). Should exp init create an empty one (with some comment on top perhaps) if it's not there? Otherwise it fails.

@jorgeorpinel See my response in iterative/dvc.org#3071 (comment).

@jorgeorpinel
Copy link
Contributor

The params section of dvc.yaml currently expects keys from a file

What about generating a dvc.yaml without params if there's no params.yaml file?

@pmrowla
Copy link
Contributor

pmrowla commented Dec 13, 2021

related: #7130

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants