Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add option to require explicit fetch/checkout #10451

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

pbailey-hf
Copy link

@pbailey-hf pbailey-hf commented Jun 6, 2024

This PR adds the explicit field to outs in .dvc files. outs marked as explicit must be explicitly mentioned on the command line to be affected by fetch, checkout, or pull.


Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@@ -55,6 +55,8 @@ def stage_filter(stage: "Stage") -> bool:
def outs_filter(out: "Output") -> bool:
if push and not out.can_push:
return False
if out.explicit and not any(targets):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should out be compared to each target to see if it matches any of the targets? It looks like this would fetch out anytime any target is specified, even if the target does not match the explicit out.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally went down that path, then it occurred to me that the usual target matching logic would further filter the list that this function returns, so there should be no need to reimplement that logic. I updated the unit test to demonstrate and build confidence.

@dberenbaum dberenbaum requested a review from skshetry June 24, 2024 20:10
Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! @skshetry Could you take a look please?

@shcheklein
Copy link
Member

one ask (if we decide to go with this) - can we find a better name please? explicit can mean anything (explicit output), explicit path? explicit remote?, etc, etc

if this options is about pull / push / fetch it should have a name ideally related to that. From the top of my head:

pull: auto | explicit | None

if it affects both push and pull (does it?):

sync: ...

etc. Let's brainstorm here a bit please.

thanks @pbailey-hf !

@pbailey-hf
Copy link
Author

if this options is about pull / push / fetch it should have a name ideally related to that. From the top of my head:

It only affects fetch/checkout and by extension pull. I like your suggestion of calling it

pull: auto|explicit|None (=default auto)

@dberenbaum
Copy link
Collaborator

dberenbaum commented Jun 25, 2024

Agreed, but I would suggest slightly different naming to better match the existing push options in https://dvc.org/doc/user-guide/project-structure/dvc-files#output-entries:

pull: true|false|explicit (=default true)

Edit: note that this also implies that we add pull: false, which is not implemented now, but should be straightforward to do similarly to explicit.

@pbailey-hf
Copy link
Author

That makes sense to me. I will implement it shortly.

dvc/output.py Outdated
@@ -1501,6 +1507,7 @@ def _merge_dir_version_meta(self, other: "Output"):
**ANNOTATION_SCHEMA,
Output.PARAM_CACHE: bool,
Output.PARAM_REMOTE: str,
Output.PARAM_PULL: vol.All(vol.Lower, Choices("always", "never", "explicit")),
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted for always and never instead of true and false to avoid confusion with YAML types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to use YAML types and true / false. Also, I don't think we need the third option. I would make false behavior that same as explicit to keep things simpler.

Copy link

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.

Project coverage is 90.41%. Comparing base (8f8d49d) to head (e8b3535).
Report is 9 commits behind head on main.

Files Patch % Lines
dvc/repo/fetch.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10451      +/-   ##
==========================================
+ Coverage   90.34%   90.41%   +0.07%     
==========================================
  Files         481      481              
  Lines       35981    36036      +55     
  Branches     5038     5053      +15     
==========================================
+ Hits        32506    32583      +77     
+ Misses       2877     2873       -4     
+ Partials      598      580      -18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dvc/repo/fetch.py Outdated Show resolved Hide resolved
Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a dvc.org PR / update. Thanks @pbailey-hf !

@pbailey-hf
Copy link
Author

We also need a dvc.org PR / update. Thanks @pbailey-hf !

Done! iterative/dvc.org#5263

@shcheklein
Copy link
Member

@skshetry if everything looks good, can we merge this? can you do some final checks please?

assert not (tmp_dir / "explicit_2").exists()
assert (tmp_dir / "always").read_text() == "z"

dvc.checkout(targets="explicit_1.dvc", force=True)
Copy link
Member

@skshetry skshetry Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum @shcheklein, this looks odd to me, that a pull option is affecting checkout.

I don't think we should support this at all.

If we don't support it, checkout will fail if we haven't fetched it already as before with the following message:

dvc.exceptions.CheckoutError: Checkout failed for following targets:
   explicit_1
   explicit_2
Is your cache up to date?
   <https://error.dvc.org/missing-files>

pull() should work as expected though, and not fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry I understand your concern, what is the practical thing here for users though? any idea? If you have a few targets that have pull: false you will keep hitting this failure? do we need an explicit option for checkout then - something like "--allow-missing" in dvc pull?

what is the expected behavior for dvc pull itself considering that it is dvc fetch + dvc checkout?

probably they should be aligned ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my intuition as well, that pull = fetch + checkout, so the option should apply to both.

Copy link
Member

@skshetry skshetry Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it is a combination of fetch and checkout, it does not necessarily mean pull options should apply directly to fetch and checkout. pull is a higher-level command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry do, what is your practical suggestion here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative, re the name: sync: push | pull | both | all | checkout - it can be a single option that pretty much encompasses everything

I think this is reasonable and aligns with what I understood explicit to be trying to do - skip syncing those outputs unless you explicitly pass them as targets (so dvc push/pull/checkout skips it but dvc push/pull/checkout skipped_target doesn't).

I'm not sure we need all of push | pull | both | all | checkout now because we already have push: false and can't deprecate it right away. We could try a minimal version of sync: true | false to start and see with more time which of push | pull | both | checkout are needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum

We could try a minimal version of sync: true | false to start

And we would avoid all the operation on it? push, pull, checkout? I think that's fine. I think we can keep granular push: false and even probably fetch: false (see below). People will be asking about this. sync is too broad and won't be convenient in a lot of cases. But I'm fine to start there. @skshetry what is your take?

Would you be open to implicit "allow missing" behavior for pull:false outputs?

It should be fetch: false (not pull:false to emphasize that if affects only fetch. And checkout --allow-missing behaves the same way as it does now (it should not be trying to analyze lower level fetch flags, etc). This way pull and checkout are consistent in their behavior - that is important here.

I think we can add a config option for checkout to allow missing by default.

would that combination work for you?

  • fetch: false
  • config option to imply allow-missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pbailey-hf What do you think about sync: false? If you are okay starting there, please go ahead with it, and if you want, you could also add fetch: false or other granular options.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum I'd prefer to control pull/push independently, which makes me lean away from the sync: false option. If I'm picking from all of the options presented so far, I'd prefer fetch:false combined with a configured allow-missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks for the feedback and your patience @pbailey-hf! I worry about the complexity of so many flags to configure, but let's go with your preference for now to move forward. Please ping when you would like another review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants