Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plots diff: don't make one plot break all the others #9025

Closed
dberenbaum opened this issue Feb 14, 2023 · 19 comments · Fixed by #9146
Closed

plots diff: don't make one plot break all the others #9025

dberenbaum opened this issue Feb 14, 2023 · 19 comments · Fixed by #9146
Labels
A: plots Related to the plots p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension

Comments

@dberenbaum
Copy link
Collaborator

dberenbaum commented Feb 14, 2023

In iterative/vscode-dvc#3222, one mis-configured plot for one revision breaks the entire dvc plots diff command. in #5984, we made it so that an invalid dvc.yaml doesn't break other revisions in dvc plots diff. However, even if the dvc.yaml is valid, a single revision or plot within one revision can break dvc plots diff if there are other errors encountered while rendering one of the plots.

To do:

  1. Skip revisions if any errors are encountered rendering plots for that revision.
  2. Skip only the specific plots within a revision that fail to render.

Related: #7787

Edit: The problem occurs because the workspace dvc.yaml config is applied to all other revisions, and if the data structure has changed, it may be impossible to render an old revision with the current dvc.yaml config.

@dberenbaum dberenbaum added A: plots Related to the plots product: VSCode Integration with VSCode extension p1-important Important, aka current backlog of things to do labels Feb 14, 2023
@daavoo
Copy link
Contributor

daavoo commented Feb 14, 2023

the workspace dvc.yaml config is applied to all other revisions,

That doesn't sound right

@dberenbaum
Copy link
Collaborator Author

the workspace dvc.yaml config is applied to all other revisions,

That doesn't sound right

I think it's intentional -- in the happy path where the underlying data structure doesn't change, it's common to want to propagate a new axis label or template to all prior revisions.

@skshetry
Copy link
Member

skshetry commented Feb 23, 2023

We merge all of the plot props, so it's not just limited to workspace dvc.yaml file it seems.

def _squash_plots_properties(data: List) -> Dict:

@dberenbaum
Copy link
Collaborator Author

So it merges all revisions into a single set of properties, preferring the more recent revisions when they conflict, right?

@skshetry
Copy link
Member

skshetry commented Feb 24, 2023

It does not merge in terms of recency, but in the order of revisions passed, so the revision on the left side is preferred (except workspace where it always gets higher priority).

I guess the arguments are usually passed in the order of latest to oldest, so in that case, it will prefer recent revisions.

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Mar 16, 2023

@pmrowla mentioned that it was discussed whether we should be merging all the plots properties. I looked through some of the old issues and PRs and can't find a good rationale for why we do this.

It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it's implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.

Edit: Meant to reference #7913, not #7193.

@daavoo
Copy link
Contributor

daavoo commented Mar 16, 2023

It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it's implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.

I would say, let's agree on expected behavior and implement it.


To clarify, when I said:

the workspace dvc.yaml config is applied to all other revisions,
That doesn't sound right

I meant that it didn't sound like what I thought the code was doing (because of #9025 (comment)), not that I was strongly against the idea of having a single source of truth for properties.

@mattseddon
Copy link
Member

This comment is also related to #8786.

The extension caches revision data (images + datapoints) which means that we should only call plots diff once for each revision (except workspace). It is looking less likely that this approach (caching on the extension side) actually works well. For example, after reading this from #9025 (comment)

It does seem like it would be much simpler to use a single set of properties (either the leftmost revision or workspace in cases where it's implied). Seems like that would resolve some related issues like #9188 (since it would not try to render the old custom template) and #7193.

It seems like revision data is actually mutable depending on the combination of revisions passed to plots diff. This becomes even more error-prone when we throw FileNotFound errors into the mix and whether or not these errors are important/should be displayed. In order to cache data on the VS Code side we need revisions to be static.

I would say, let's agree on expected behavior and implement it.

So after a fair bit of deliberation triggered by the ongoing work with errors I think that the extension should only attempt to display templates/images that are available in the workspace's dvc.yamls. Everything else should be disregarded. Old(er) revision should be applied to the current workspace templates and that's it. The user won't lose much but this reduces/limits complexity greatly.

Even doing this we will still get edge cases where the workspace's dvc.yamls get updated and we have to "refresh" all of the cached revisions on the extension side.

If that means every time we call plots diff we call it with the workspace as the leftmost argument then no worries.

Ideally, we would get plots diff performance to a place where we can call it with up to 7 revisions and it returns in a timely enough manner to drop all of the caching (mess) that is currently in the extension (will mention this in #8786 too).

Please LMK what you think.

@pmrowla
Copy link
Contributor

pmrowla commented Mar 17, 2023

either the leftmost revision or workspace in cases where it's implied

I would say that even when workspace is not implied it still makes sense to be using the workspace config (and not the leftmost revision). If we are using the leftmost revision's config, the user can never modify plots unless they also include the workspace data. Even if the user adds a new template, or modifies/fixes something in an existing template, they would not be able to populate it with purely historical data (they would have to also include workspace data).

@dberenbaum
Copy link
Collaborator Author

@pmrowla Makes sense and is simpler, although I have a couple hesitations:

  1. We had discussions in Studio where there was a need to compare branches that had different config/plots than the main branch. That's not such a big issue in CLI/VS Code since the user could check out the branch they want to use, but it might be inconvenient/unexpected to have to do that when diffing other branches.
  2. If your workspace is a mess and maybe your dvc.yaml is even invalid, it might be helpful to do plots diff without having to clean up your workspace first.

I'm still not sure those outweigh the simplicity of always using the workspace but wanted to raise them for discussion.

@mattseddon
Copy link
Member

  1. We had https://github.com/iterative/studio/issues/2574 in Studio where there was a need to compare branches that had different config/plots than the main branch. That's not such a big issue in CLI/VS Code since the user could check out the branch they want to use, but it might be inconvenient/unexpected to have to do that when diffing other branches.

Thanks for the link to https://github.com/iterative/studio/issues/2574. Took me a while but I think I got up to speed. In a way, Studio is lucky in that it doesn't have to deal with the workspace which makes the situation less dynamic.

If we going to continue to attempt to cache revisions in the extension then we need some kind of spine to hold everything together. As the extension is meant for the local/dev experience I would expect that to be the workspace as opposed to HEAD. One alternative that I see (vs using workspace + caching) is to decide on the implementation of mutable revisions for DVC and have the extension call plots diff with all of the required revisions every time we need an update.

Note: depending on the implementation for iterative/vscode-dvc#1966 we could end up having to try and compare plots between branches in the extension as well.

  1. If your workspace is a mess and maybe your dvc.yaml is even invalid, it might be helpful to do plots diff without having to clean up your workspace first.

We could encourage git stash under these circumstances. At least under these circumstances we can easily show a lot of errors in the extension and make it obvious what needs to be fixed.

@dberenbaum
Copy link
Collaborator Author

Let's move forward with always using the workspace config as @pmrowla suggested.

@mattseddon WDYT about trying to always call plots diff with all the revisions now that performance is better in DVC, and we can follow up if caching is needed?

@shcheklein
Copy link
Member

WDYT about trying to always call plots diff with all the revisions

what would be the reason for this, folks? I would still expect a substantial difference for this, no?

@dberenbaum
Copy link
Collaborator Author

@skshetry Do you have capacity to take on always using the workspace config this sprint?

what would be the reason for this, folks? I would still expect a substantial difference for this, no?

My reason would be to avoid premature optimization. I have nothing against prioritizing it if it's needed, but I figured we could wait to see the performance first.

If it's just as easy to use the existing caching we have, that's great, but I was under the impression from #9025 (comment) that it was broken and would require work to incorporate the new changes.

@mattseddon
Copy link
Member

If it's just as easy to use the existing caching we have, that's great, but I was under the impression from #9025 (comment) that it was broken and would require work to incorporate the new changes.

Discussed at length in the VS Code planning call this morning. The problem boils down to this:

Changes made to a template can lead to changes in the pre-processed datapoints that we receive from plots diff. This makes the datapoints mutable as opposed to immutable. We have the same problem with errors being mutable.

We can start by always calling plots diff with the workspace revision and invalidating the cache whenever any dvc.yaml is changed but I do not think this will be enough to keep everything up to date.

@skshetry
Copy link
Member

using the workspace config

A few questions:

  1. What does this mean for images? Do we only use the config from the workspace? Or, do we still need to collect dvc.yaml from other revisions to find image plots?
  2. I guess this won't allow us to cache the rendered plots. As the immediate issue is fixed, do we need to prioritize this at all?

If we collect config from the workspace, that means the config and hence the plots data might change, no? Unless we calculate hash based on that workspace config.

@dberenbaum
Copy link
Collaborator Author

1. What does this mean for images? Do we only use the config from the workspace? Or, do we still need to collect `dvc.yaml` from other revisions to find image plots?

Not sure I follow. AFAIK the only config for images is the path. plots diff is intended to compare the same paths side by side across revisions. If you want to include an image path, then it should be included in the workspace dvc.yaml. If you have moved your image paths between revisions, I'm not sure how we can compare them regardless of whether we use the workspace dvc.yaml.

2. I guess this won't allow us to cache the rendered plots. As the immediate issue is fixed, do we need to prioritize this at all?

I thought that using the workspace config would be a small change that would improve usability and reduce errors, but I agree that it's not a blocker/critical on the DVC CLI side.

@mattseddon How much does would it help you to know that the config is always defined by the workspace?

If we collect config from the workspace, that means the config and hence the plots data might change, no? Unless we calculate hash based on that workspace config.

Yep, I think that's why @mattseddon proposed that he will invalidate the VS Code cache when dvc.yaml changes.

@shcheklein
Copy link
Member

In an attempt to put a bit more structure to this and clarify the thread for myself a bit and make sure that we are all on the same page :) We have three components:

  • plots data (actual image files, actual data points in TSV, etc, etc)
  • plots definitions (plot id, for image it can be a path, for top level plot can be a name / or a path, etc)
  • plot spec (field names, template, etc. applies to non-image plots only)

When we talk about applying something from workspace I think we need to clarify this in regards to all three components to have a full picture from the product perspective.

I think @skshetry was specifically asking about definitions (in case of images we don't have a spec yet). And from the product perspective the question is - do we collect definitions across all commits/revisions/whatever or not? Even of the answer for specs is to take to left-most, or the workspace, the question is valid for definitions I think.

A user scenario - I have a branch where I added a new directory with images. I added a branch into the table of experiments. I ask the extension or Studio to plot it. And I don't see it if we literally apply definitions and specs from the workspace. It might be confusing.

@mattseddon
Copy link
Member

mattseddon commented Mar 23, 2023

@mattseddon How much does would it help you to know that the config is always defined by the workspace?

I had another discussion about this today with @shcheklein.

We decided that we will try dropping the cache data altogether from the extension to see whether or not the performance is acceptable.

If performance is not acceptable then we can always try caching revision combinations.

My plan is to always call plots diff with the revisions ordered by recency. If the workspace is present, I'll always provide that as the first revision. This should give stable semantics for users and will avoid having to explain why different combinations of revisions can lead to different results.

I think this means we can shelf the idea of always applying the workspace config.

Some questions that I have on the DVC implementation:

  • Are there any situations where the remote will be hit multiple times on subsequent calls to plots diff? E.g if there is an error will it attempt to hit the remote a second time to see if the error has been resolved?
  • Is any of the pipeline/dvc.yaml/data collection cached for revisions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: plots Related to the plots p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants