[BUG] Fix ScanTask memory estimations when limits are provided #2735
Conversation
CodSpeed Performance Report: merging #2735 will degrade performance by 58.47%.
daft/execution/execution_step.py (outdated)

@@ -307,7 +307,7 @@ def run_partial_metadata(self, input_metadatas: list[PartialPartitionMetadata])
         return [
             PartialPartitionMetadata(
                 num_rows=self.scan_task.num_rows(),
-                size_bytes=self.scan_task.size_bytes(),
+                size_bytes=self.scan_task.size_bytes_on_disk(),
Interestingly, this is pretty much never used in our codebase, and it is also the only place we really use size_bytes_on_disk.
Shouldn't this be using estimate_in_memory_size_bytes, since this is what we use in the Scheduling?
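The distinction the reviewer raises matters because an on-disk size and an in-memory size can differ by a large factor for compressed columnar formats. The sketch below is hypothetical (the function name and inflation factor are illustrative assumptions, not Daft's actual implementation); it just shows why a scheduler budgeting memory should prefer an in-memory estimate over the raw file size.

```python
# Hypothetical sketch, NOT Daft's actual implementation: compressed,
# encoded bytes on disk (e.g. Parquet) typically inflate when decoded
# into in-memory arrays, so a scheduler should budget with an
# in-memory estimate rather than the on-disk footprint.

def estimate_in_memory_size_bytes(size_bytes_on_disk: int, inflation_factor: float = 4.0) -> int:
    """Scale the on-disk footprint by an assumed decompression/decoding factor."""
    return int(size_bytes_on_disk * inflation_factor)

# 25 MB of compressed Parquet assumed to decode to ~100 MB in memory.
print(estimate_in_memory_size_bytes(25_000_000))  # → 100000000
```

The `inflation_factor` here is a placeholder; a real estimator would derive it from the file's schema, encodings, and column statistics.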
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #2735   +/-   ##
=======================================
  Coverage        ?   63.33%
=======================================
  Files           ?      981
  Lines           ?   113277
  Branches        ?        0
=======================================
  Hits            ?    71748
  Misses          ?    41529
  Partials        ?        0
=======================================
Fixes memory estimations for ScanTask when limits are pushed down:
- Adds an approx_num_rows API to ScanTask, which is distinct from the num_rows API; num_rows is expected to provide an exact number of rows, if available
- Updates num_rows to account for limit pushdowns as well