-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] [New Query Planner] Refactor file globbing logic by exposing FileInfos
to Python
#1307
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1307 +/- ##
==========================================
- Coverage 87.34% 87.31% -0.04%
==========================================
Files 61 61
Lines 6039 6016 -23
==========================================
- Hits 5275 5253 -22
+ Misses 764 763 -1
|
7b38cd7
to
864fea4
Compare
864fea4
to
7d6985d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look, lgtm overall! Thanks for doing this!
self.filepaths_column_name in data | ||
), f"TabularFilesScan should be ran on vPartitions with '{self.filepaths_column_name}' column" | ||
filepaths = data[self.filepaths_column_name] | ||
filepaths = data["file_paths"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this a constant somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh we weren't using a constant for "file_sizes"
or "num_rows"
, and I personally think that we can keep the contract informal for now since we're going to continue to rework this as (1) we move to the new query planner, (2) we move to the Rust-native file globber, and (3) we move file globbing to execution time.
If you feel strongly about this, I can try to expose the column names on FileInfo
and FileInfos
somehow, and propagate them as sidecar data whenever we pass around the Table
representation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naw I don't feel strongly about this 😛 lgtm!
@@ -493,7 +480,7 @@ def __repr__(self) -> str: | |||
) | |||
|
|||
def required_columns(self) -> list[set[str]]: | |||
return [{self._filepaths_column_name} | self._predicate.required_columns()] | |||
return [{"file_paths"} | self._predicate.required_columns()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov comment is interesting, is this not covered by tests running on the old planner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it appears that it is not. The only place that it could happen is in the (Projection, LogicalPlan)
optimization rule, and that rule doesn't apply to TabularFileScan
nodes:
Daft/daft/logical/optimizer.py
Lines 219 to 220 in 102de0f
if isinstance(child, Projection) or isinstance(child, LocalAggregate) or isinstance(child, TabularFilesScan): | |
return None |
This PR refactors the file globbing logic by exposing
FileInfos
from the new query planner to Python and using it for both the old query planner and the new query planner. Hopefully the upcoming Rust-native file globber will be able to leverage theFileInfos
struct or something akin to it, allowing it to be a ~drop-in replacement for the existing Python file globber.