[FEAT] [New Query Planner] Refactor file globbing logic by exposing `FileInfos` to Python #1307

clarkzinzow · 2023-08-26T01:40:35Z

This PR refactors the file globbing logic by exposing FileInfos from the new query planner to Python and using it for both the old query planner and the new query planner. Hopefully the upcoming Rust-native file globber will be able to leverage the FileInfos struct or something akin to it, allowing it to be a ~drop-in replacement for the existing Python file globber.
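For readers unfamiliar with the shape being described, here is a minimal sketch of a column-oriented file-metadata container. It is not Daft's actual `FileInfos` implementation: the class bodies, `from_infos`, and `to_columns` are hypothetical stand-ins, and only the column names `file_paths`, `file_sizes`, and `num_rows` come from this PR's review thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FileInfo:
    """Metadata for one globbed file (hypothetical stand-in, not Daft's class)."""

    file_path: str
    file_size: Optional[int] = None
    num_rows: Optional[int] = None


@dataclass
class FileInfos:
    """Column-oriented collection of file metadata (hypothetical stand-in)."""

    file_paths: List[str] = field(default_factory=list)
    file_sizes: List[Optional[int]] = field(default_factory=list)
    num_rows: List[Optional[int]] = field(default_factory=list)

    @classmethod
    def from_infos(cls, infos: List[FileInfo]) -> "FileInfos":
        # Transpose a list of per-file records into parallel columns.
        return cls(
            file_paths=[info.file_path for info in infos],
            file_sizes=[info.file_size for info in infos],
            num_rows=[info.num_rows for info in infos],
        )

    def to_columns(self) -> Dict[str, list]:
        # Column names mirror the ones mentioned in the review thread.
        return {
            "file_paths": self.file_paths,
            "file_sizes": self.file_sizes,
            "num_rows": self.num_rows,
        }

    def __len__(self) -> int:
        return len(self.file_paths)
```

Under these assumptions, any globber (Python today, Rust-native later) only has to produce this one structure, which is what would make a drop-in replacement plausible.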

codecov · 2023-08-26T01:47:45Z

Codecov Report

Merging #1307 (7d6985d) into main (e15aa27) will decrease coverage by 0.04%.
The diff coverage is 98.14%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1307      +/-   ##
==========================================
- Coverage   87.34%   87.31%   -0.04%     
==========================================
  Files          61       61              
  Lines        6039     6016      -23     
==========================================
- Hits         5275     5253      -22     
+ Misses        764      763       -1

| Files Changed | Coverage Δ | |
|---|---|---|
| daft/execution/physical_plan_factory.py | `91.52% <ø> (ø)` | |
| daft/execution/rust_physical_plan_shim.py | `98.21% <ø> (-0.07%)` | ⬇️ |
| daft/logical/builder.py | `78.57% <ø> (ø)` | |
| daft/logical/optimizer.py | `97.92% <ø> (ø)` | |
| daft/logical/logical_plan.py | `79.42% <80.00%> (-0.07%)` | ⬇️ |
| daft/execution/execution_step.py | `93.62% <100.00%> (-0.04%)` | ⬇️ |
| daft/execution/physical_plan.py | `94.58% <100.00%> (+0.01%)` | ⬆️ |
| daft/filesystem.py | `86.58% <100.00%> (+0.33%)` | ⬆️ |
| daft/io/common.py | `95.00% <100.00%> (+0.88%)` | ⬆️ |
| daft/io/file_path.py | `100.00% <100.00%> (ø)` | |
... and 4 more

... and 1 file with indirect coverage changes

xcharleslin

Took a quick look, lgtm overall! Thanks for doing this!

xcharleslin · 2023-08-30T17:47:55Z

daft/execution/execution_step.py

-            self.filepaths_column_name in data
-        ), f"TabularFilesScan should be ran on vPartitions with '{self.filepaths_column_name}' column"
-        filepaths = data[self.filepaths_column_name]
+        filepaths = data["file_paths"]


Should we make this a constant somewhere?

Eh we weren't using a constant for "file_sizes" or "num_rows", and I personally think that we can keep the contract informal for now since we're going to continue to rework this as (1) we move to the new query planner, (2) we move to the Rust-native file globber, and (3) we move file globbing to execution time.

If you feel strongly about this, I can try to expose the column names on FileInfo and FileInfos somehow, and propagate them as sidecar data whenever we pass around the Table representation.

Naw I don't feel strongly about this 😛 lgtm!
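For reference, the constant-based alternative discussed in this thread could look roughly like the sketch below. The constant names and the `get_filepaths` helper are hypothetical, and a plain `dict` stands in for a vPartition; the merged PR keeps the string literals instead.

```python
# Hypothetical shared constants; the PR itself uses the literals directly.
FILE_PATHS_COLUMN = "file_paths"
FILE_SIZES_COLUMN = "file_sizes"
NUM_ROWS_COLUMN = "num_rows"


def get_filepaths(data: dict) -> list:
    """Fetch the file paths column, failing loudly if the contract is broken."""
    if FILE_PATHS_COLUMN not in data:
        raise KeyError(
            f"TabularFilesScan expects a '{FILE_PATHS_COLUMN}' column, "
            f"got columns: {sorted(data)}"
        )
    return data[FILE_PATHS_COLUMN]
```

The trade-off is the one raised above: constants make the contract explicit, but add indirection to code that is expected to be reworked soon anyway.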

xcharleslin · 2023-08-30T17:55:31Z

daft/logical/logical_plan.py

@@ -493,7 +480,7 @@ def __repr__(self) -> str:
        )

    def required_columns(self) -> list[set[str]]:
-        return [{self._filepaths_column_name} | self._predicate.required_columns()]
+        return [{"file_paths"} | self._predicate.required_columns()]


Codecov comment is interesting, is this not covered by tests running on the old planner?

Hmm, it appears that it is not. The only place where it could be called is the (Projection, LogicalPlan) optimization rule, and that rule doesn't apply to TabularFilesScan nodes:

Daft/daft/logical/optimizer.py

Lines 219 to 220 in 102de0f

    if isinstance(child, Projection) or isinstance(child, LocalAggregate) or isinstance(child, TabularFilesScan):
        return None
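To make the coverage gap concrete, here is a small self-contained sketch of how an early-return guard like the one quoted above keeps `required_columns()` from ever being called on a scan node. All classes and the `projection_rule` function are simplified stand-ins written for this illustration, not Daft's optimizer code; `Filter` is a hypothetical example of a node the guard does not exclude.

```python
class LogicalPlan:
    """Stand-in base class that counts required_columns() calls."""

    calls = 0

    def required_columns(self):
        type(self).calls += 1
        return [set()]


class Projection(LogicalPlan):
    pass


class LocalAggregate(LogicalPlan):
    pass


class TabularFilesScan(LogicalPlan):
    def required_columns(self):
        type(self).calls += 1
        return [{"file_paths"}]


class Filter(LogicalPlan):
    # Hypothetical node type that the guard lets through.
    def required_columns(self):
        type(self).calls += 1
        return [{"x"}]


def projection_rule(parent: Projection, child: LogicalPlan):
    # Same guard as the quoted snippet: bail out before touching the child.
    if isinstance(child, (Projection, LocalAggregate, TabularFilesScan)):
        return None
    # Only reached for other node types (the real rule does more than this).
    return child.required_columns()


scan = TabularFilesScan()
scan_result = projection_rule(Projection(), scan)
filter_result = projection_rule(Projection(), Filter())
```

In this sketch `TabularFilesScan.calls` stays at zero after the rule runs, which matches the observation that the changed line is not exercised by the old planner's tests.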

clarkzinzow requested review from samster25 and xcharleslin August 26, 2023 01:40

github-actions bot added the enhancement label Aug 26, 2023

clarkzinzow force-pushed the clark/globbing-file-info branch from 7b38cd7 to 864fea4 Compare August 29, 2023 18:41

Unify file globbing logic by exposing FileInfos to Python.

7d6985d

clarkzinzow force-pushed the clark/globbing-file-info branch from 864fea4 to 7d6985d Compare August 30, 2023 00:46

xcharleslin approved these changes Aug 30, 2023

clarkzinzow merged commit 9565316 into main Aug 30, 2023
27 checks passed

clarkzinzow deleted the clark/globbing-file-info branch August 30, 2023 20:42
