Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified show/diff command for all output types #4

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dberenbaum
Copy link
Contributor

No description provided.

@pmrowla
Copy link

pmrowla commented Mar 30, 2021

Why not just top-level commands like

$ dvc diff [--data] [--params] [--metrics] [--binary]

where each flag adds the output for a respective diff driver, and --data would be the current dvc diff filesystem/data diff. --binary would use the appropriate binary diff driver set via a gitattributes style path regex + diff program configuration. Image diffing would fall into this --binary category.

@pmrowla
Copy link

pmrowla commented Mar 30, 2021

On image diffing, I mentioned this in the other thread, but I don't think we should get into providing/supporting binary diff drivers in DVC. It would be better for us to provide a way for users to define/write their diff drivers to do whatever they need. Providing a basic example for image diffing would be fine here (but I think it should be separate from the core DVC repo).

Basically, this is similar to the remotes issue - we are moving towards supporting user-defined remotes which comply with fsspec , rather than implementing/maintaining/supporting too many remote types ourselves in core DVC. We should take the same approach with binary diffing - provide the API/configuration to work with any user-provided diff driver rather than writing/maintaining them ourselves in core DVC.

optional arguments:
--type Restrict report to single output type (metrics, plots, images).
```

Copy link

@pmrowla pmrowla Mar 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the expected output be for

show --type images

Are we just listing them on the command line, or are we actually opening them in a web browser like with plots?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for mimetype. If an output has a mimetype defined, we would then show it in a browser as needed.

The user would need to explicitly set mimetype themselves (on outputs they know will be images, or directories containing images for example). DVC would not attempt to do any auto detection (whether filename/extension based or by actually reading headers).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for mimetype. If an output has a mimetype defined, we would then show it in a browser as needed.

I like this idea except that we sort of already have a different convention with plots (and metrics although browser support isn't needed). Do you see those as special cases? Should dvc.yaml syntax support mimetype: plots instead of the current plots: section?

Copy link

@pmrowla pmrowla Mar 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think plots and metrics would stay a special case, partially due to backwards-compatibility/existing behavior, and partially because they don't really fit into the mimetype idea anyways.

Side note - if we go this route, we should avoid creating our own mimetypes - mimetype is a formal IANA standard. There would be some leeway to use something like application/vnd.dvc.plots but it's probably not worth it for us?

Metrics are a special case for DVC since it's not just "yaml/json" files, it's "yaml/json" files containing only numeric values structured in a specific DVC-metrics compatible way.

plots (outputs) are a special case in the same way as metrics. And even though the actual generated plot is just a file with mimetype: text/html, we are really talking about the DVC plots (yaml/json) output and not the dvc plots ... generated HTML file. The plots field also contains configuration information about things like the default template to use for a specific (ploattable) metric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to just leave binary/image support out of this proposal. There's already a separate discussion in iterative/dvc#5681, and it's a bit separate from the main point of the proposal.

@pared
Copy link

pared commented Mar 30, 2021

Related: iterative/dvc#5693

* Support file/directory `targets` as positional arguments.
* By default, show all outputs in the workspace.

`diff` implementations will:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding support for "revision:target" style parsing as per iterative/dvc#5693

Also, we might want to consider implementing options similar to git's --no-index.

Git does not have "driver" concept. All files are text ones so making diff --no-index is easy. In our case, we will probably need some sensible default behavior to detect "type" and what driver we should use (basing, on, for example, outputs type), but in some cases (eg. diff for metric file in no-dvc repo) providing the type will be a necessity while now it works out of the box thanks to the fact that the commands are split.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm taking out the parts of the proposal related to iterative/dvc#5693 for now since they are a bit unrelated to the concept of unifying the commands.

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Mar 30, 2021

$ dvc diff [--data] [--params] [--metrics] [--binary]

I like the idea. I have a few questions:

  • What if I want to diff everything together? Part of the goal is collecting all these diffs together so users don't have to run multiple commands to get metrics, plots, etc.
  • Any reason you prefer individual flags for each type instead of a --type flag?
  • Binary type might not be quite right since not all binary files necessarily need the same driver.

EDIT: Separating responses into multiple comments.

@dberenbaum
Copy link
Contributor Author

On image diffing, I mentioned this in the other thread, but I don't think we should get into providing/supporting binary diff drivers in DVC. It would be better for us to provide a way for users to define/write their diff drivers to do whatever they need. Providing a basic example for image diffing would be fine here (but I think it should be separate from the core DVC repo).

Basically, this is similar to the remotes issue - we are moving towards supporting user-defined remotes which comply with fsspec , rather than implementing/maintaining/supporting too many remote types ourselves in core DVC. We should take the same approach with binary diffing - provide the API/configuration to work with any user-provided diff driver rather than writing/maintaining them ourselves in core DVC.

Yeah, I agree. I need to clarify what I mean by a diff driver for dvc. I thought of it as an API or a parent class with a method for returning certain fields like old data, new data, etc. I did not think of it as a UI output or HTML. That would instead be handled by the difftool. The current dvc outputs from the diff commands today could be more or less the built-in difftool(s) (maybe they could be consolidated into one html), and users could provide any other difftools they want.

For example, the driver for plots would return all the underlying data needed to produce the Vega/HTML, but the Vega/HTML would be in the difftool. For images, the driver might collect the images, verify that they are recognizable as images, and convert them to some common format. The difftool would handle how these get visualized (maybe dvc provides some very basic default here so that there is some way to visualize them without external tools).

By the way, I will add a summary of this conversation into the doc once we get to some agreement.

@pmrowla
Copy link

pmrowla commented Mar 31, 2021

  • What if I want to diff everything together? Part of the goal is collecting all these diffs together so users don't have to run multiple commands to get metrics, plots, etc.
  • Any reason you prefer individual flags for each type instead of a --type flag?

To get more than one type of diff, you just add multiple flags and we would just concatenate the respective outputs (similar to what exp diff does now for metrics + plots)

So if I wanted all 3 of data/metrics/params could just run something along the lines of dvc diff -dmp. Being able to combine short flags is why I went with this over --type <...>

@pmrowla
Copy link

pmrowla commented Mar 31, 2021

  • Binary type might not be quite right since not all binary files necessarily need the same driver.

Right, but the diff driver that gets run on binary would be controlled in the git driver + gitattributes way

So in .dvc/config (more likely in my global or system DVC config) I would have something like

[diff "raster-img"]
    command = /usr/bin/my_image_diff

[diff "vector-img"]
    command = /usr/bin/my_vector_image_diff

[diff "archive"]
    command = /usr/bin/my_archive_diff

and in my .dvcattributes (or whatever we decide to call it) I would have

*.jpg diff=raster-img
*.png diff=raster-img
*.svg diff=vector-img
*.zip diff=archive
*.tar diff=archive

When I do dvc diff --binary, when DVC encounters any files matching *.jpg,*.png it will automatically call the vector image diff tool, for files matching *.svg it would call the vector image diff tool, and so on. For files which do not have a binary diff tool set, we will do nothing in --binary mode.

@dberenbaum
Copy link
Contributor Author

When I do dvc diff --binary, when DVC encounters any files matching *.jpg,*.png it will automatically call the vector image diff tool, for files matching *.svg it would call the vector image diff tool, and so on.

  • Which *.jpg,*.png,*.svg files would be included in dvc diff --binary? Does it include all of those files in the git repo, or only those that are tracked by DVC?
  • Let's say I have lots of image files but only want to see a diff for some subset (raw data might be images but so might plots and related files explaining my model). I wonder if it makes more sense to keep the .dvcattributes data in dvc.yaml (or allow it to be overriden there) like - mypic.jpg:\n diff: raster-imgto avoid inconsistencies from spreading the info across multiple dvc metafiles?
  • What if I only want to see diffs for images and not archives?
  • Should dvc provide basic drivers for images (or other binary types), and should it be the default for image mimetypes? Do we expect every user to have to define/implement this themselves?

@pmrowla
Copy link

pmrowla commented Apr 1, 2021

Which .jpg,.png,*.svg files would be included in dvc diff --binary? Does it include all of those files in the git repo, or only those that are tracked by DVC?

I think it makes sense for us to be able to diff any file in the entire repo and not only DVC tracked files. We should just accept the dvc diff ... -- [path1] [path2]... syntax to allow the user to diff specific files (if that's what they want). It would also make sense to have flags like --dvc-only for only dvc tracked files and potentially even --targets [stage_or_dvcfile] to restrict the diff to outputs of a specific dvc stage.

Let's say I have lots of image files but only want to see a diff for some subset (raw data might be images but so might plots and related files explaining my model).

If we are keeping gitattributes syntax, patterns are matched against paths, not just filetype extensions. So if you only wanted images inside a foo/images to be diffed, you can do

/foo/images/*.jpg diff=...

I wonder if it makes more sense to keep the .dvcattributes data in dvc.yaml (or allow it to be overriden there) like - mypic.jpg:\n diff: raster-imgto avoid inconsistencies from spreading the info across multiple dvc metafiles?

I'm not sure if keeping this data in dvc.yaml is right, I'm not even sure if this hypothetical .dvcattributes would be git-tracked at all. This seems more like information that belongs in local/system/global configs, since you have to have both the appropriate diff tools and the appropriate attributes configuration on your machine.

In cases where the user has the diff tools included in their repo itself (maybe in something like /scripts/my_diff.py), then they could git track everything, but in typical cases where this is dependent on a user's specific machine, I don't think this configuration would get git-tracked.

The way I see it, configuring my preferred diff tool for is comparable to "I want my default git text editor to be vim" - Not everyone on my team will want that same configuration.

What if I only want to see diffs for images and not archives?

This is covered by allowing specific paths/outputs/targets to be specified (i.e. dvc diff -- **/*.jpg)

Should dvc provide basic drivers for images (or other binary types), and should it be the default for image mimetypes? Do we expect every user to have to define/implement this themselves?

I think providing basic example drivers is fine (and at least one good example would probably be required so that users actually understand how diff drivers work, images seem like a good place to start here). I don't think any of our (example) drivers should be enabled by default. We could provide a basic template .dvcattributes file with "uncomment the following line to use default image driver" example content. But I'd want to be clear about a few caveats:

  1. This should not be part of the core DVC repo (we should have a separate iterative/example-diff-driver(s) repo)
  2. It should be clearly documented that our drivers are examples, the user can assume they will work out of the box for typical scenarios but will not be actively maintained/developed by the core DVC team
  3. We should not accept requests to add more "default" drivers ourselves

This follows along with our new/current approach to remote support - the core DVC team should focus on developing core DVC. Users with specific use-case diffing needs should write their diff own tools.

Comment on lines +6 to +9
# Summary

Establish a unified command to combine and `show` or `diff` all outputs of
specified types, including:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the different show subcommands are similar enough to get merged, and even more so for diff subcmds. but are shows similar enough to diffs, or do they have intrinsically different uses? In the latter case we would need to keep 2 top commands. BTW exp show is very different (experiments are not outputs).

* Plots
* Images (depends on https://github.com/iterative/dvc/discussions/5681)

These would all be put under a single `dvc outputs show/diff` command.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like dvc out(put)s. What I suggested originally was dvc compare (since diff is taken) but that only consolidated metrics/plots/experiments diffcommands. If we decide to keep 2 top commands maybe we have have bothouts(mergedshows) and compare(mergeddiff`s).

Comment on lines +119 to +120
A common abstraction for all output types minimizes the flexibility of each
output type's `show` and `diff` commands.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the key question is are we forcing the generalization too much? Originally the idea was to merge commands that are already basically the same thing but somehow ended up duplicated. Maybe no need to force it, I think (e.g. throw in show, throw in mime/types, etc.) at least as a first iteration. Then again ERs are useful for big changes like that so up to you!

Comment on lines +3 to +4
- Enhancement Proposal PR: (leave this empty)
- Contributors: dberenbaum

This comment was marked as resolved.

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Apr 8, 2021

I'm migrating this to https://www.notion.so/iterative/Unified-show-diff-command-for-all-output-types-df3c941296e84c6ba6c2fab1c3a55f7e. I'm also going to rewrite a lot of it to try to capture and address the discussions already in this PR. If I don't capture your feedback from here, please add a comment or add it to the list of unresolved questions at the bottom of the proposal.

EDIT: Also, one of the improvements in Notion should be that it's easy to make changes directly on the document, so it hopefully feels less like one person owns the document 😄 .

@skshetry skshetry removed their request for review December 22, 2022 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants