Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Native glob functionality #1450

Merged
merged 27 commits into from
Oct 3, 2023
Merged

[FEAT] Native glob functionality #1450

merged 27 commits into from
Oct 3, 2023

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Sep 28, 2023

Adds a new io_glob Python function that performs globbing using the rules provided by the globset crate (https://docs.rs/globset/latest/globset/)

@github-actions github-actions bot added the enhancement New feature or request label Sep 28, 2023
@codecov
Copy link

codecov bot commented Sep 28, 2023

Codecov Report

Merging #1450 (428f1b3) into main (ca41122) will not change coverage.
The diff coverage is n/a.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1450   +/-   ##
=======================================
  Coverage   74.64%   74.64%           
=======================================
  Files          60       60           
  Lines        6042     6042           
=======================================
  Hits         4510     4510           
  Misses       1532     1532           

@jaychia jaychia marked this pull request as draft September 28, 2023 20:50
@jaychia jaychia marked this pull request as ready for review September 29, 2023 18:30
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/glob.rs Show resolved Hide resolved
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/object_io.rs Outdated Show resolved Hide resolved
src/daft-io/src/object_io.rs Show resolved Hide resolved
src/daft-io/src/object_io.rs Outdated Show resolved Hide resolved
}
Ok(PyList::new(py, to_rtn))
}

#[pyfunction]
fn io_list(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets drop the io_list functionality

src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
src/daft-io/src/glob.rs Outdated Show resolved Hide resolved
let full_fragment = GlobFragment::new(glob);
if !full_fragment.has_special_character() {
let glob = full_fragment.escaped_str().to_string();
return Ok(stream! {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for a stream! macro here. you should be able to use StreamExt::filter

source.iter_dir(glob.as_str(), Some("/"), None).await?
  .filter(|fm| Ok(matches!(fm.filetype, FileType::Directory)))
  .boxed()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been having some trouble with streams and lifetimes. The stream! macro seems to correctly move values into the Future so that it owns it? For instance:

return Ok(source.iter_dir(glob.as_str(), Some("/"), None).await?.boxed());

The above throws an error:

cannot return value referencing local variable `source`
returns a value referencing data owned by the current function

I'm guessing this is because source is on the local stack, and does not correctly move into the returned Future since the function signature here is: async iter_dir(&self, ...) and the returned Future holds a reference to the local stack's source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One workaround I found is to change the function signature of iter_dir to: iter_dir(self: Arc<Self>, ...) -> BoxStream<'a , ...> where Self: 'a

This forces the BoxStream to own an Arc<dyn ObjectSource> instead of &dyn ObjectSource, so it doesn't hold a reference to some local Stack variable.

This in turn lets us return BoxStreams that contain local Arc<dyn ObjectSource> objects. Otherwise, we will have to immediately consume them in the local block before the Arc goes out of scope.

Seems a little heavy-handed though? Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's cause stream! is moving all the values into a closure whereas source in source.iter_dir is being coerced into a reference. You can do your own explict move closure instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go through this tomorrow in person, I wasn't able to get lifetimes working correctly...

return stream!{ let s = source.iter_dir(...); while let Some(_) = s.next().await{...} }; moves source into a wrapper stream, which then owns both the source and the resultant BoxStream from source.iter_dir, making this new stream legal to return.

Returning the stream directly however (return source.iter_dir(...).filter(...).boxed();) returns a stream that holds a reference to source which is on the local stack. Even if I wrap this in a move closure, source will still remain on the local stack.

src/daft-io/src/object_io.rs Show resolved Hide resolved
src/daft-io/src/object_io.rs Outdated Show resolved Hide resolved
src/daft-io/src/object_io.rs Outdated Show resolved Hide resolved
src/daft-io/src/object_io.rs Show resolved Hide resolved
src/daft-io/src/python.rs Outdated Show resolved Hide resolved
src/daft-io/src/python.rs Outdated Show resolved Hide resolved
@samster25
Copy link
Member

@jaychia I also don't see any test cases for delimited special characters or local filepaths. I'm a little worried about the windows local filepath globing so we should add a few test cases there. They should also not be under the integration test path since they don't require fixures.

@jaychia
Copy link
Contributor Author

jaychia commented Oct 1, 2023

@jaychia I also don't see any test cases for delimited special characters or local filepaths. I'm a little worried about the windows local filepath globing so we should add a few test cases there. They should also not be under the integration test path since they don't require fixures.

This PR only implements tests for S3, I'm going to tackle each other backend separately in follow-up PRs. io_glob should be treated as not working yet for the other backends until we have tests.

@jaychia jaychia merged commit 9c32d73 into main Oct 3, 2023
24 checks passed
@jaychia jaychia deleted the jay/native-globbing branch October 3, 2023 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants