Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merge plot ids across dvc.yaml files #9898

Merged
merged 14 commits into from
Sep 14, 2023
Merged

merge plot ids across dvc.yaml files #9898

merged 14 commits into from
Sep 14, 2023

Conversation

dberenbaum
Copy link
Collaborator

@dberenbaum dberenbaum commented Aug 31, 2023

Addresses the issue in iterative/dvclive#687 (comment).

It merges plot IDs across all dvc.yaml files in the repo. Technically this is a breaking change, but most likely it's an improvement for almost all use cases. We already merge plot IDs across revisions, so this PR does the same across dvc.yaml files. Not sure if there are valid cases to have separate plots with the same plot ID in different dvc.yaml files. Is it worth warning when there is an overlap?

Note: artifacts should follow the same behavior, which would help with issues like https://github.com/iterative/studio/issues/6939.

Before this PR:

visualization(12)
visualization(13)

After this PR:

visualization(15)

@dberenbaum dberenbaum marked this pull request as ready for review August 31, 2023 22:52
@dberenbaum dberenbaum requested a review from daavoo August 31, 2023 22:52
dvc/render/match.py Outdated Show resolved Hide resolved
dberenbaum and others added 2 commits September 1, 2023 07:49
Co-authored-by: David de la Iglesia Castro <[email protected]>
@dberenbaum

This comment was marked as resolved.

@codecov
Copy link

codecov bot commented Sep 1, 2023

Codecov Report

Patch coverage is 89.09% of modified lines.

Files Changed Coverage
tests/integration/plots/conftest.py 72.72%
dvc/render/match.py 100.00%
dvc/utils/plots.py 100.00%
tests/func/plots/test_show.py 100.00%
tests/integration/plots/test_plots.py 100.00%
tests/unit/render/test_match.py 100.00%
tests/unit/utils/test_plots.py 100.00%

📢 Thoughts on this report? Let us know!.

@daavoo daavoo added the A: plots Related to the plots label Sep 1, 2023
@skshetry
Copy link
Member

skshetry commented Sep 4, 2023

It merges plot IDs across all dvc.yaml files in the repo. Technically this is a breaking change, but most likely it's an improvement for almost all use cases. We already merge plot IDs across revisions, so this PR does the same across dvc.yaml files. Not sure if there are valid cases to have separate plots with the same plot ID in different dvc.yaml files. Is it worth warning when there is an overlap?

This goes against how everything works in dvc, and is problematic for several reasons.
We won't be able to create plots by just looking at the definition in one dvc.yaml file.
And there may be errors in some dvc.yaml files that we'll never be sure if we have all the required definitions for the plot.

Also, a breaking change as you said, and makes it more confusing to the users if they try to share some configs across dvc.yaml files and expect them to be unique.

I don't have a strong opinion on disallowing use of same id/names across multiple dvc.yaml files, whether by enforcing it or by convention. This would make it easier to migrate plots of same ids across different dvc.yaml files.

But here, it feels like the correct fix should be to always use the definitions from workspace/HEAD instead of merging them (could not find the issue to link to :( ).

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Sep 5, 2023

But here, it feels like the correct fix should be to always use the definitions from workspace/HEAD instead of merging them (could not find the issue to link to :( ).

This PR doesn't touch how we merge plot definitions or how we handle changes across revisions. It is only about cases where the same plot ID is found in different dvc.yaml files.


@skshetry I agree with your overall point that merging two plots with the same ID in different dvc.yaml files doesn't fit how we address other things in dvc like stages. I updated so that it doesn't merge anything. Instead, it drops the dvc.yaml file name from the plot ID if there are no conflicts (a conflict means the same plot ID exists in another dvc.yaml for that revision).

Examples (need to make sure we test all of these):

  • If rev1 has plot_id1 in dvc.yaml, this PR changes its ID from dvc.yaml::plot_id1 to plot_id1
  • If rev1 has plot_id1 in dvc.yaml, and rev2 has plot_id1 in dvclive/dvc.yaml, both will have ID plot_id1 since there is no conflict within each revision.
  • If rev1 has plot_id1 in both dvc.yaml and dvclive/dvc.yaml, they will have IDs dvc.yaml::plot_id1 and dvclive/dvc.yaml::plot_id1.
  • If rev1 has plot_id1 in both dvc.yaml and dvclive/dvc.yaml, and rev2 has plot_id1 in only dvc.yaml, rev1 will have IDs dvc.yaml::plot_id1 and dvclive/dvc.yaml::plot_id1, and rev2 will have ID plot_id1.

In the last example, should rev2 have ID dvc.yaml::plot_id1 to match rev1?

@dberenbaum dberenbaum marked this pull request as draft September 5, 2023 21:17
@dberenbaum dberenbaum marked this pull request as ready for review September 6, 2023 21:06
@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Sep 7, 2023

Need to test how it looks in VS Code and Studio.

Edit: VS Code looks good.

Copy link
Contributor

@daavoo daavoo Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: I prefer higher-level tests, like setting up dvc.yaml definitions and asserting on plots.show output. At least for me, these ones feel harder to follow/read and are kind of brittle because they are tied to the internal structures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an integration test for the dvclive 2.x->3.x transition. I don't want to get into testing every combination of functionality in the high-level tests.

dvc/render/match.py Outdated Show resolved Hide resolved
Copy link
Contributor

@daavoo daavoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually tested the scenario of a repo with commits defining plots in dvclive/dvc.yaml and then using root dvc.yaml (DVLive 2.0 -> DVCLive 3.0).

Works as expected when doing dvc plots show from both revisions and also dvc plots diff dvclive2 dvclive3. VSCode also works as expected.

I have not tested other scenarios.

@dberenbaum
Copy link
Collaborator Author

After some debugging with @mvshmakov, I'm so far unable to get the local studio dev ui working, so I might need someone who already has this setup to test these changes to unblock this.

@daavoo
Copy link
Contributor

daavoo commented Sep 8, 2023

After some debugging with @mvshmakov, I'm so far unable to get the local studio dev ui working, so I might need someone who already has this setup to test these changes to unblock this.

I will test it

@daavoo
Copy link
Contributor

daavoo commented Sep 8, 2023

After some debugging with @mvshmakov, I'm so far unable to get the local studio dev ui working, so I might need someone who already has this setup to test these changes to unblock this.

I will test it

Unfortunately, the change doesn't have an effect on Studio. It appears that Studio doesn't use the code modified here. I am checking the Studio backend parsing code to see how this change could be applied

@daavoo
Copy link
Contributor

daavoo commented Sep 8, 2023

Unfortunately, the change doesn't have an effect on Studio. It appears that Studio doesn't use the code modified here. I am checking the Studio backend parsing code to see how this change could be applied

Here is the code:

https://github.com/iterative/studio/blame/5b60ed17ecea77aec9c83cc58b5edf6616dfc4e1/backend/repos/parsing/dvcmeat.py#L194-L196

@dberenbaum Should I try to patch it there?

In the longer term, it would be good to understand why we kind of have this logic duplicated there

@daavoo
Copy link
Contributor

daavoo commented Sep 8, 2023

@dberenbaum Should I try to patch it there?

In the longer term, it would be good to understand why we kind of have this logic duplicated there

Ok, this is not a 5-minute change 😅

Studio parses and stores plots separately for each commit.
I am now not sure if the ID of the plots should be overwritten in the backend after all the commits are processed or if this should be handled in the FE based on the revisions selected for the plot.

cc @shcheklein

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Sep 8, 2023

Studio parses and stores plots separately for each commit.
I am now not sure if the ID of the plots should be overwritten in the backend after all the commits are processed or if this should be handled in the FE based on the revisions selected for the plot.

AFAIR Studio does this in part so that it doesn't change depending on the revisions selected. I think we should update the plot IDs on the backend. The logic in this PR is independent for each revision, so it shouldn't be hard to incorporate.

In the longer term, it would be good to understand why we kind of have this logic duplicated there

💯 We are duplicating a lot of plots logic.

As a small starting point, I could extract the logic in this PR to a utils function so it can be reused in the Studio code linked above.

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Sep 8, 2023

As a small starting point, I could extract the logic in this PR to a utils function so it can be reused in the Studio code linked above.

Done. Now it should be possible to replace https://github.com/iterative/studio/blob/5b60ed17ecea77aec9c83cc58b5edf6616dfc4e1/backend/repos/parsing/dvcmeat.py#L192-L195 with:

for name, (plot_inner_id, plot_properties) in group_definitions_by_id(definitions).items():

We would still need to iterate over definitions one more time in https://github.com/iterative/studio/blob/5b60ed17ecea77aec9c83cc58b5edf6616dfc4e1/backend/repos/parsing/dvcmeat.py#L218-L219, but it at least saves from having to copy this PR's logic into Studio. Open to other ideas for how to structure this.

@daavoo
Copy link
Contributor

daavoo commented Sep 11, 2023

We would still need to iterate over definitions one more time in https://github.com/iterative/studio/blob/5b60ed17ecea77aec9c83cc58b5edf6616dfc4e1/backend/repos/parsing/dvcmeat.py#L218-L219, but it at least saves from having to copy this PR's logic into Studio. Open to other ideas for how to structure this.

Could we also collect and return errors in the utils function here?:

        for plot_id, plot_definition in config_file_content.get("data", {}).items():

I am also curious as I can't find a test in Studio where the code passes through this error in config_file_content.
I wonder what these different levels of errors are and what we do with them in DVC (are we just ignoring them?)

@dberenbaum
Copy link
Collaborator Author

dberenbaum commented Sep 11, 2023

I am also curious as I can't find a test in Studio where the code passes through this error in config_file_content.
I wonder what these different levels of errors are and what we do with them in DVC (are we just ignoring them?)

From looking at the test, it seems something like this is expected:

"definitions": {
"data": {"config_file_1": {"error": FileNotFoundError()}},
},
"sources": {},

I can't reproduce errors at this level, but I don't know how to prove or test that it's no longer needed.

If the file is not found, it will be raised under sources instead of definitions, like this:

{
  'definitions': {
    'data': {
      'dvclive/dvc.yaml': {
        'data': {
          'dvclive/training/plots/metrics/train/acc.tsv': {}
        }
      }
    }
  },
  'sources': {
    'data': {
      'dvclive/training/plots/metrics/train/acc.tsv': {
        'props': {}, 
        'error': FileNotFoundError(2, 'No such file or directory')
      }
    }
  }
}

If there's an error with the plots definition, it will not be caught during collection and will be logged here:

dvc/dvc/render/match.py

Lines 114 to 120 in 33b6c1d

try:
dps, rev_props = converter.flat_datapoints(rev)
except Exception as e: # noqa: BLE001, pylint: disable=broad-except
logger.warning("In %r, %s", rev, str(e).lower())
def_errors[rev] = e
continue

tldr I think we should assume these errors no longer exist and revisit if needed

@dberenbaum dberenbaum merged commit bf2af86 into main Sep 14, 2023
21 checks passed
@dberenbaum dberenbaum deleted the merge-plots branch September 14, 2023 14:36
@mattseddon
Copy link
Member

mattseddon commented Sep 18, 2023

Related to #9898 (comment).

From going up and down the stack between dvc, dvc-render, studio and vscode-dvc I think that potentially we are missing something by merging definitions after collecting each revision's data.

If a user removed a file from a plot definition and ran an experiment then there will be no way to show that data even if the file was captured as an out and the file is brought back as part of the definition.

I can recreate this in all three products. I think the only way to get around this would be to collect/merge definitions first and then use those merged definitions to attempt to collect all of the relevant data. I think that approach would bring back the errors shown in the aforementioned comment.

The issue can be recreated using a dvc.yaml with a plot definition of this form:

plots:
  - Accuracy:
      x: step
      y:
        training/plots/metrics/train/acc.tsv: acc
        training/plots/metrics/test/acc.tsv: acc
      y_label: accuracy

and outs

    outs:
      - model.pt
      - training/plots: # everything captured as outs
          persist: true

To restate the problem with a concrete example: If training/plots/metrics/test/acc.tsv: acc or training/plots/metrics/train/acc.tsv: acc are removed from an experiment's plot definition then that data will not show up in any of the products no matter the combination of revs/definitions provided. IIUC this is because the file is not collected by repo.plots.collect as per the revision's definition.

Edit: bringing this up because I keep running into issues like this with #9940. I'm leaning towards fixing issues like this before moving on.

@dberenbaum
Copy link
Collaborator Author

@mattseddon AFAIU this is the same as the discussion from iterative/vscode-dvc#3676, or am I missing something?

I made an example in https://github.com/iterative/vscode-dvc-demo/tree/drop-test-acc (it still doesn't raise the types of errors mentioned in #9898 (comment)). No matter when we do the merging, we will need to choose between two conflicting y definitions (including or excluding test/acc).

In DVC, we merge based on the order the revisions are passed. For example, dvc plots diff main drop-test-acc looks like:

visualization(17)

dvc plots diff drop-test-acc main looks like:

visualization(16)

For VS Code, AFAIK the revisions are ordered by latest revision (drop-test-acc if it's included) but it seems like some of the post-processing in VS Code is throwing things off:

Screen.Recording.2023-09-18.at.11.43.17.AM.mov

In Studio, it seems like it depends on the order in which you click:

Screen.Recording.2023-09-18.at.10.58.58.AM.mov

I think Studio is supposed to split the plots if any of the field definitions change between revisions, so this looks like a bug to me.

@daavoo
Copy link
Contributor

daavoo commented Sep 18, 2023

I think Studio is supposed to split the plots if any of the field definitions change between revisions, so this looks like a bug to me.

I updated Studio to merge definitions preferring the order of selection to mimic DVC behavior when it receives revs

@dberenbaum
Copy link
Collaborator Author

@daavoo That was in https://github.com/iterative/studio/pull/6773? I guess I missed that we were changing more than just live plots there. I guess we might need what behavior should be in each product and how to make it more consistent, although not sure it needs to be a high priority.

@daavoo
Copy link
Contributor

daavoo commented Sep 18, 2023

That was in iterative/studio#6773? I guess I missed that we were changing more than just live plots there.

Yes, the 3rd point in the description.

I guess we might need what behavior should be in each product and how to make it more consistent, although not sure it needs to be a high priority.

The behavior of DVC made more sense to me for simple scenarios (i.e. updating title or axis name) so I went with uptating Studio to match DVC.
However, I did not consider the case where updating a property alters the data sources, as in the examples you are discussing.

@dberenbaum
Copy link
Collaborator Author

We have done several iterations on how to merge in these scenarios and I'm not sure I've seen it become an issue for users, so I don't think we need to go too deep on this.

@mattseddon
Copy link
Member

@dberenbaum I know that we already had a lengthy discussion, sorry for dragging this up again.

I made an example in https://github.com/iterative/vscode-dvc-demo/tree/drop-test-acc (it still doesn't raise the types of errors mentioned in #9898 (comment)). No matter when we do the merging, we will need to choose between two conflicting y definitions (including or excluding test/acc).

In DVC, we merge based on the order the revisions are passed. For example, dvc plots diff main drop-test-acc looks like:

My point is that in this example when main is selected first the drop-test-acc::test/acc.tsv rev will never be shown even though the data may be available. This happens because the definitions are merged after data collection and test/acc.tsv is not collected for drop-test-acc even though it was saved as per this line in the dvc.yaml. So yes, right now those errors are not created but if we collected all of the definitions prior to collecting the data I assume we would then need them. Does that make more sense now?

For VS Code, AFAIK the revisions are ordered by latest revision (drop-test-acc if it's included) but it seems like some of the post-processing in VS Code is throwing things off:

This was a caching bug that I have fixed in iterative/vscode-dvc#4678

In Studio, it seems like it depends on the order in which you click:

Is this the behaviour that we want to standardise on?

@dberenbaum
Copy link
Collaborator Author

My point is that in this example when main is selected first the drop-test-acc::test/acc.tsv rev will never be shown even though the data may be available. This happens because the definitions are merged after data collection and test/acc.tsv is not collected for drop-test-acc even though it was saved as per this line in the dvc.yaml. So yes, right now those errors are not created but if we collected all of the definitions prior to collecting the data I assume we would then need them. Does that make more sense now?

Thanks, I get it now. I think it's related to #7913 (both are about loading plots data in revs where the data isn't defined in any plot). In that issue, it's called expected behavior rather than a bug. I think it would be useful in some scenarios, but I'm not sure it's worth spending time on now. We already have too many marginal things to do on plots, haven't seen anyone ask for this, and it will likely add its own complexity and break other behavior. WDYT?

This was a caching bug that I have fixed in iterative/vscode-dvc#4678

👍

Is this the behaviour that we want to standardise on?

I don't have a strong opinion on what we do or that we need to standardize at the moment. My understanding from iterative/vscode-dvc#3676 was that we intentionally chose not to follow either the existing DVC or Studio convention, and I don't know that we have a strong reason to prioritize changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: plots Related to the plots
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants