Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.0 (not only) testing - comments, finding, issues, etc #3777

Closed
13 of 15 tasks
shcheklein opened this issue May 10, 2020 · 8 comments
Closed
13 of 15 tasks

1.0 (not only) testing - comments, finding, issues, etc #3777

shcheklein opened this issue May 10, 2020 · 8 comments
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. testing Related to the tests and the testing infrastructure
Milestone

Comments

@shcheklein
Copy link
Member

shcheklein commented May 10, 2020

Some notes on the playground, how to reproduce, and what I'm testing for.

Repo to reproduce: here.

  • I'm running testing on a new version of the example-get-started that I built with this updated gen script.

  • The repo itself (that script from above generates) can be downloaded here. Should be enough to reproduce issues below described below. And can be useful for everyone as an easy entry point to try DVC 1.0 on a bit more sophisticated project.

  • This list below might include things not related 100% to the latest changes we introduced in 1.0, but all of them are potentially relevant. E.g., we change pipelines interface and should try scenarios around.

dvc repro behavior and UI/UX

  • I think we should assume dvc.yaml prefix by default. Instead of running a stage like this dvc repro dvc.yaml:prepare, I should be able to run it like this (if dvc.yaml is found in the current dir? or if it is a single one in the project?):
$ dvc repro prepare

#3842

$ dvc repro model.pkl
  • Repro excessive warning on the default target file name. Either make it info (in general I would say WARNING is a rare thing that we need to show), or don't show again. Instead make a message about the stage that we execute more meaningful (see below). default target: reduce loglevel to debug #3822
$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
$ dvc repro
WARNING: stage: 'data/data.xml.dvc' is locked. Its dependencies are not going to be reproduced.

dvc repro options

$ vim scr/train.py (some dummy change)
$ dvc repro --dry`
  • -p,. --pipeline - deprecate? It does not make sense moving forward to accept .dvc files.

    -p still makes sense for running a pipeline of a given stage, even in 1.0.
    Though, repro -p will not work as either dvc file or stage name needs to be specified.

  • Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

    dvc repro is still backward compatible. And, we cannot detect if the repo is 1.0-one,
    as the repo can have a mixture of old-style dvcfiles and new pipeline files.

  • --help review and fix it. e.g. we don't expect Dvcfile by default anymore. repro: fix help message for default target #3831

$ dvc repro --help
DVC-file to reproduce. 'Dvcfile' by default.

Other commands

  • dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help.

  • dvc pipeline show output is broken from a lot of different perspectives - see this comment for suggestions and details.

  • We broke the progress bar for the get/import. It is now Downloading 0/1 files vs Downloading 0/36.2M before. Mean that we effectively don't show a progress bar for the case like:

dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

This is not fixed by reverting b312895?w=1

Extracted to #3874

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label May 10, 2020
@shcheklein shcheklein added the testing Related to the tests and the testing infrastructure label May 10, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label May 10, 2020
@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. bug Did we break something? labels May 11, 2020
@skshetry
Copy link
Member

skshetry commented May 11, 2020

$ dvc repro prepare

The default is dvc.yaml. The addressing is via :stage_name (i.e. without explicit dvc.yaml). This is ugly, I know. But, this was done so as to have one way of addressing things (on data related commands, the stage name might collide with the file name, eg: dvc checkout foo). I'm fine with just having it addressed without colon on pipeline-related commands though.

$ dvc repro model.pkl

I don't like this, because that specific stage might be generating lots of other stuffs than just model.pkl. And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

When I run dvc repro I don't really see the stage name being reproduced

Good point. That's a bad UX.

--dry does not work at all.

Broken by 18e8f07

Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

You mean, not supporting old .dvc files at all bearing a cmd/pipeline stage? I'm okay with deprecation.

--help review and fix it. e.g. we don't expect Dvcfile by default anymore.

Oops.

dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help

The problem is what to show for bare pipeline show. One idea is to show all possible DAGs
(could be too much, could also limit to just dvc.yaml by default) and other is to show all of the stages inside a dvc.yaml file one-by-one.

@shcheklein
Copy link
Member Author

because that specific stage might be generating lots of other stuffs than just model.pkl.

not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.

And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.

@dmpetrov
Copy link
Member

@shcheklein thank you for the great summary!

Also, I'd add that repro is too verbose. I do repro for two unchanged stages without lock:

$ dvc repro
WARNING: assuming default target 'dvc.yaml'.
Stage is cached, skipping.
Output 'users.csv' didn't change. Skipping saving.
Stage is cached, skipping.
Output 'summary.json' doesn't use cache. Skipping saving.
Output 'logs.csv' didn't change. Skipping saving.
Output 'logs' didn't change. Skipping saving.

To track the changes with git, run:

	git add dvc.lock

All this information is overwhelming - it takes time to read and understand that basically nothing happened. What is actually important - two stages hit the run-cache and lock file was generates (this info is not presented directly).

I'd expect something like:

$ dvc repro
Restoring stage 'proc' from run-cache.
Restoring stage 'trainme' from run-cache.
Lock file 'dvc.lock' was generated.

To track the changes with git, run:

	git add dvc.lock

@dmpetrov dmpetrov added this to the 1.0 milestone May 12, 2020
@elgehelge
Copy link
Contributor

because that specific stage might be generating lots of other stuffs than just model.pkl.

not a problem I believe, I think most people will understand that stage will be running (means generating outputs). Also in majority of cases there will be one - two outputs, one of the metrics - it's totally fine. On the other hand - it's very convenient - I don't care about some stage names, I care about my model.

Maybe a message like "Additional output generated: someplot.png, metrics.json" would be nice in that case?

And, if we go this way, the same argument I mentioned of colliding with stage name applies here.

yes, but it's a matter of defining rules (first check stage name, second output or something like this). I would also be surprised (and it's bad design) if stage and output that are not related to each other have the same name.

Yes. As long as there is a clear hierarchy of rules. It should be fine. I would say look for a stage name first, then look for a filename. At least I would love dvc repro <stagename> to work without the :.

@elgehelge
Copy link
Contributor

elgehelge commented May 13, 2020

Another tiny thing. I am not sure how the plot files works, but I got a little surprised by the number of files I saw the first time I did a git status. Here is what I saw:
image
I know there is only 3 plot files, but still I got a little surprised. Maybe the plots could be combined in one json file?

efiop added a commit to efiop/dvc that referenced this issue May 13, 2020
efiop added a commit to efiop/dvc that referenced this issue May 14, 2020
skshetry pushed a commit that referenced this issue May 14, 2020
* tests: unit: move stage tests into a separate dir

* stage: fix --dry-run

Part of #3777
@skshetry skshetry pinned this issue May 18, 2020
@skshetry skshetry self-assigned this May 19, 2020
@skshetry
Copy link
Member

-p,. --pipeline - deprecate? It does not make sense moving forward to accept .dvc files.

The dvc.yaml is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml is run, as it will try to run every pipeline of every stage inside the dvc.yaml.

Related to the previous one - we still accept .dvc files into dvc repro - we should not be doing this anymore - semantics is not clear anymore. Deprecate or remove.

Can you please elaborate on this? We are keeping backward compatibility, so what do you mean by deprecate?

dvc pipeline show - expects Dvcfile, should now do the same as dvc repro + fix help.

We need to decide what should we do for dvc pipeline show from options that we have:

  1. Show all of the pipeline of each stages one at a time. (in --ascii)
  2. Show all the DAG of the repo
  3. Show all the DAG from stages inside dvc.yaml.

/cc @shcheklein

@shcheklein
Copy link
Member Author

The dvc.yaml is not a single pipeline, so it still makes sense. It is confusing when dvc pipeline -p dvc.yaml is run, as it will try to run every pipeline of every stage inside the dvc.yaml.

could you please elaborate a bit? not sure I completely understood your point.

so what do you mean by deprecate?

If we can detect new/old project - don't allow to run this if it's a new one? I guess what I want to simplify the logic eventually in terms of what do we accept into dvc repro. I hope that we'll be able to accept an output path dvc repro model.pkl, stage name (resolves to the dvc.yaml in the current dir) and something advanced like - path to yaml + stage name, or muliple stages, multiple outputs (not sure about this).

But would love to start removing stuff that does not make much sense from the new version perpespective.

We need to decide what should we do for dvc pipeline show from options that we have

we have a separate ticket(s) for changing semantics of the dvc pipeline show - e.g. #3661 , especially the problem that it's broken in a way right now - #3661 (comment)

I would at say as a minimum for the DVC 1.0 release we can just keep the same semantics and accept exactly the same stuff the dvc repro accepts. In case when no arguments provided, it means dvc.yaml:* I guess? (it'll potentially solve that issue I mentioned btw).

@skshetry
Copy link
Member

@shcheklein, I have extracted two of the issues from here. And, pipeline show has already an open issue. The rest are already done.

Closing this one...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. testing Related to the tests and the testing infrastructure
Projects
None yet
Development

No branches or pull requests

5 participants