Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT]: Throw error for invalid ** usage outside folder segments (e.g. /tmp/**.csv) #3100

Merged
merged 7 commits into from
Oct 31, 2024

Conversation

conradsoon
Copy link
Contributor

Closes #1820.

Main issue seems to be that the globset crate is permissive for what kind of pattern it builds (no error is thrown when we try to build a pattern for /tmp/**.csv, for instance, so we have to check ourselves for any such patterns.

@github-actions github-actions bot added the enhancement New feature or request label Oct 22, 2024
@conradsoon
Copy link
Contributor Author

Some things I'm not too sure about:

  • Currently, this throws a InvalidArgumentError (which is then caught by PyO3 bindings and default-cased into a DaftCoreException), but should I use a different error for this? I'm wary of modifying the bindings for now, so I just left it as it is.
  • Is there a better place to put these tests? Currently I'm duplicating these tests for every possible ObjectSource but seems like they end up using the same glob function in object_store_glob.rs, is there a better place to write the testing logic?

Copy link

codspeed-hq bot commented Oct 22, 2024

CodSpeed Performance Report

Merging #3100 will not alter performance

Comparing conradsoon:globbing-wildcard (b1202eb) with main (c69ee3f)

Summary

✅ 17 untouched benchmarks

@conradsoon conradsoon marked this pull request as ready for review October 22, 2024 10:19
@jaychia jaychia self-requested a review October 22, 2024 19:30
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Regarding tests, we don't actually need to weave this into our integration tests for every single object source. Totally fine to have verify_glob just unit-tested rigorously in Rust in the object_store_glob.rs module , and then have one (non-integration) test that hits the local disk to verify the user experience from Python.

if re.is_match(segment) && segment != "**" {
return Err(super::Error::InvalidArgument {
msg: format!(
"Invalid usage of '**' in glob pattern. The '**' wildcard must occupy an entire path segment and be surrounded by '{}' characters. Found invalid usage in '{}'.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we be more helpful with the error message as well? Would love to add a suggestion here for the user to do this glob path instead: {re_group_1}/**/*{re_group_2}/{re_group_3}.

Copy link
Contributor Author

@conradsoon conradsoon Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've slightly rewritten the regex to process the path fully instead of segmenting it delimited portions.
This should allow us to suggest a corrected glob path for the user as well.

Have rewritten it to give suggestions in this manner: is this the behaviour you expect?

  • Original: invalid/blahblah**.txtCorrected: invalid/blahblah/**/*.txt
  • Original: invalid/\***.txtCorrected: invalid/\*/**/*.txt
  • Original: invalid/\**blahblah**.txtCorrected: invalid/\**blahblah/**/*.txt

@@ -404,6 +404,29 @@ pub async fn glob(
};
let glob = glob.as_str();

// We need to do some validation on the glob pattern before compiling it, since the globset crate is very permissive
// and will happily compile patterns that don't make sense without throwing an error.
fn verify_glob(glob: &str) -> super::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can move this out of the function, and just run Rust unit-tests!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes a lot of sense.

Wasn't familiar with how Rust unit tests worked, have moved the testing logic for the verify_glob function to object_store_glob.rs.

@conradsoon
Copy link
Contributor Author

conradsoon commented Oct 23, 2024

Hey @jaychia, thanks for the comments. I've made some changes accordingly:

  • Tweaked the regex logic to match on the whole path, instead of path segments. This also allows me to suggest a corrected path to the user.
  • Moved the integration test logic into the object_store_glob.rs file as unit tests instead.
  • Left the local Python integration test, and added an additional check to assert that we also suggest a "corrected" path to the user when we detect this.

@conradsoon
Copy link
Contributor Author

Hey @jaychia, sorry to ping! Do these changes look good to you?

@jaychia jaychia self-requested a review October 31, 2024 04:08
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@jaychia jaychia merged commit 073ec37 into Eventual-Inc:main Oct 31, 2024
38 checks passed
@jaychia
Copy link
Contributor

jaychia commented Oct 31, 2024

Awesome, thanks for the contribution! 🚀 🚀 🚀

sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Nov 4, 2024
…. /tmp/**.csv) (Eventual-Inc#3100)

Closes Eventual-Inc#1820.

Main issue seems to be that the `globset` crate is permissive for what
kind of pattern it builds (no error is thrown when we try to build a
pattern for `/tmp/**.csv`, for instance, so we have to check ourselves
for any such patterns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disallow glob paths with ** that is not a folder segment
2 participants