[FEAT] [New Query Planner] Refactor file globbing logic by exposing FileInfos to Python #1307

Merged
clarkzinzow merged 1 commit into main from clark/globbing-file-info
Aug 30, 2023


@clarkzinzow clarkzinzow commented Aug 26, 2023

This PR refactors the file globbing logic by exposing FileInfos from the new query planner to Python and using it for both the old query planner and the new query planner. Hopefully the upcoming Rust-native file globber will be able to leverage the FileInfos struct or something akin to it, allowing it to be a ~drop-in replacement for the existing Python file globber.
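
As a rough illustration of the idea (not Daft's actual API), a FileInfos-style container can be pictured as a set of parallel columns describing globbed files, convertible to a table-like mapping that either planner could consume. The `FileInfosSketch` class, its `append` and `to_table` methods, and the example paths are hypothetical; only the column names `file_paths`, `file_sizes`, and `num_rows` come from the discussion in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileInfosSketch:
    """Hypothetical stand-in for FileInfos: parallel columns, one entry per file."""

    file_paths: list[str] = field(default_factory=list)
    file_sizes: list[int | None] = field(default_factory=list)
    num_rows: list[int | None] = field(default_factory=list)

    def append(self, path: str, size: int | None = None, rows: int | None = None) -> None:
        # Keep the three columns the same length; unknown metadata is stored as None.
        self.file_paths.append(path)
        self.file_sizes.append(size)
        self.num_rows.append(rows)

    def to_table(self) -> dict[str, list]:
        # Column-oriented mapping standing in for the Table representation.
        return {
            "file_paths": self.file_paths,
            "file_sizes": self.file_sizes,
            "num_rows": self.num_rows,
        }

    def __len__(self) -> int:
        return len(self.file_paths)


# Example usage with made-up paths.
infos = FileInfosSketch()
infos.append("data/a.parquet", size=1024)
infos.append("data/b.parquet")
table = infos.to_table()
```

Because the globber's output is just this one structure, a different globber implementation would only need to produce the same columns to be swapped in.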

@github-actions github-actions bot added the enhancement New feature or request label Aug 26, 2023

codecov bot commented Aug 26, 2023

Codecov Report

Merging #1307 (7d6985d) into main (e15aa27) will decrease coverage by 0.04%.
The diff coverage is 98.14%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1307      +/-   ##
==========================================
- Coverage   87.34%   87.31%   -0.04%     
==========================================
  Files          61       61              
  Lines        6039     6016      -23     
==========================================
- Hits         5275     5253      -22     
+ Misses        764      763       -1     
Files Changed Coverage Δ
daft/execution/physical_plan_factory.py 91.52% <ø> (ø)
daft/execution/rust_physical_plan_shim.py 98.21% <ø> (-0.07%) ⬇️
daft/logical/builder.py 78.57% <ø> (ø)
daft/logical/optimizer.py 97.92% <ø> (ø)
daft/logical/logical_plan.py 79.42% <80.00%> (-0.07%) ⬇️
daft/execution/execution_step.py 93.62% <100.00%> (-0.04%) ⬇️
daft/execution/physical_plan.py 94.58% <100.00%> (+0.01%) ⬆️
daft/filesystem.py 86.58% <100.00%> (+0.33%) ⬆️
daft/io/common.py 95.00% <100.00%> (+0.88%) ⬆️
daft/io/file_path.py 100.00% <100.00%> (ø)
... and 4 more

... and 1 file with indirect coverage changes

@xcharleslin xcharleslin left a comment

Took a quick look, lgtm overall! Thanks for doing this!

self.filepaths_column_name in data
), f"TabularFilesScan should be ran on vPartitions with '{self.filepaths_column_name}' column"
- filepaths = data[self.filepaths_column_name]
+ filepaths = data["file_paths"]

@xcharleslin xcharleslin

Should we make this a constant somewhere?

@clarkzinzow clarkzinzow Aug 30, 2023

Eh we weren't using a constant for "file_sizes" or "num_rows", and I personally think that we can keep the contract informal for now since we're going to continue to rework this as (1) we move to the new query planner, (2) we move to the Rust-native file globber, and (3) we move file globbing to execution time.

If you feel strongly about this, I can try to expose the column names on FileInfo and FileInfos somehow, and propagate them as sidecar data whenever we pass around the Table representation.
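
For reference, one way the column names could be exposed is as class-level constants that both the producer and the consumers of the table import. This is only a sketch of the option floated above, not something this PR implements; `FileInfoColumns` and `read_file_paths` are hypothetical names, and a plain dict stands in for the Table representation.

```python
from __future__ import annotations


class FileInfoColumns:
    """Hypothetical single home for the column names of the table representation."""

    FILE_PATHS = "file_paths"
    FILE_SIZES = "file_sizes"
    NUM_ROWS = "num_rows"

    @classmethod
    def all(cls) -> tuple[str, ...]:
        return (cls.FILE_PATHS, cls.FILE_SIZES, cls.NUM_ROWS)


def read_file_paths(data: dict[str, list]) -> list[str]:
    """Fetch the file paths column, failing loudly if the contract is broken."""
    if FileInfoColumns.FILE_PATHS not in data:
        raise KeyError(f"expected a '{FileInfoColumns.FILE_PATHS}' column, got {sorted(data)}")
    return data[FileInfoColumns.FILE_PATHS]
```

The trade-off is the one described in the comment: constants make the contract explicit, at the cost of extra plumbing in code that is expected to be reworked soon.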

@xcharleslin xcharleslin

Naw I don't feel strongly about this 😛 lgtm!

@@ -493,7 +480,7 @@ def __repr__(self) -> str:
         )
 
     def required_columns(self) -> list[set[str]]:
-        return [{self._filepaths_column_name} | self._predicate.required_columns()]
+        return [{"file_paths"} | self._predicate.required_columns()]

@xcharleslin xcharleslin

The Codecov comment is interesting. Is this not covered by tests running on the old planner?

@clarkzinzow clarkzinzow Aug 30, 2023

Hmm, it appears that it is not. The only place where it could be called is the (Projection, LogicalPlan) optimization rule, and that rule doesn't apply to TabularFilesScan nodes:

    if isinstance(child, Projection) or isinstance(child, LocalAggregate) or isinstance(child, TabularFilesScan):
        return None
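
To make the coverage gap concrete, here is a toy model of that guard. The classes and the `projection_rule` function are simplified stand-ins (assumptions for illustration), not the optimizer's real code; they only show why `required_columns()` on a scan node is never reached through this rule, and that the method itself is a plain set union.

```python
from __future__ import annotations


class Node:
    """Toy logical plan node."""

    def required_columns(self) -> list[set[str]]:
        raise NotImplementedError


class Projection(Node):
    pass


class LocalAggregate(Node):
    pass


class Filter(Node):
    def required_columns(self) -> list[set[str]]:
        return [{"x"}]


class TabularFilesScan(Node):
    def __init__(self, predicate_columns: set[str]) -> None:
        self.predicate_columns = predicate_columns
        self.calls = 0  # counts how often required_columns() runs

    def required_columns(self) -> list[set[str]]:
        self.calls += 1
        # Same shape as the changed line: the paths column plus predicate columns.
        return [{"file_paths"} | self.predicate_columns]


def projection_rule(child: Node) -> list[set[str]] | None:
    # Mirrors the quoted guard: these child types are skipped entirely.
    if isinstance(child, (Projection, LocalAggregate, TabularFilesScan)):
        return None
    return child.required_columns()


scan = TabularFilesScan({"year"})
skipped = projection_rule(scan)      # guard returns before required_columns()
reached = projection_rule(Filter())  # other nodes do reach required_columns()
```

In this model, covering the changed line would take a test that calls `required_columns()` on the scan node directly, since the rule never does.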

@clarkzinzow clarkzinzow merged commit 9565316 into main Aug 30, 2023
27 checks passed
@clarkzinzow clarkzinzow deleted the clark/globbing-file-info branch August 30, 2023 20:42