
[BUG] Fix ScanTask memory estimations when limits are provided #2735

Merged · 4 commits merged into main on Aug 30, 2024

Conversation

@jaychia jaychia commented Aug 27, 2024

Fixes memory estimations for ScanTask when limits are pushed down

  • Adds a new approx_num_rows API to ScanTask, distinct from the num_rows API, which is expected to return an exact row count when one is available
  • Adds heuristics for approximating the number of rows from the available information, including accounting for limits
  • Fixes num_rows to account for limit pushdowns as well
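The estimation logic described above can be sketched roughly as follows. This is an illustrative sketch only, not Daft's actual implementation; the function name matches the PR's `approx_num_rows` API, but the parameters (`exact_num_rows`, `file_size_bytes`, `estimated_row_size_bytes`, `limit`) are assumed names for this example:

```python
# Illustrative sketch (not Daft's actual code) of a row-count estimate
# that falls back to heuristics and clamps by a pushed-down limit.

def approx_num_rows(exact_num_rows, file_size_bytes, estimated_row_size_bytes, limit):
    """Best-effort row estimate: prefer exact metadata, else derive a
    count from on-disk size, then clamp by any pushed-down limit."""
    if exact_num_rows is not None:
        estimate = exact_num_rows
    elif estimated_row_size_bytes:
        # Heuristic: infer a row count from the file size and an
        # estimated per-row size.
        estimate = file_size_bytes / estimated_row_size_bytes
    else:
        return None  # not enough information to estimate
    if limit is not None:
        # A pushed-down limit caps how many rows will actually be read.
        estimate = min(estimate, limit)
    return int(estimate)
```

The key fix here is the final clamp: without it, a scan with `LIMIT 5` over a large file would be costed as if the whole file were materialized, inflating the memory estimate.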

@github-actions github-actions bot added the bug Something isn't working label Aug 27, 2024

codspeed-hq bot commented Aug 27, 2024

CodSpeed Performance Report

Merging #2735 will degrade performance by 58.47%

Comparing jay/fix-memory-estimations-limits-scantask (06c4122) with main (5eb34fc)

Summary

⚡ 1 improvement
❌ 2 regressions
✅ 13 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | jay/fix-memory-estimations-limits-scantask | Change |
| --- | --- | --- | --- |
| test_count[1 Small File] | 10.1 ms | 24.2 ms | -58.47% |
| test_explain[100 Small Files] | 41.5 ms | 37.1 ms | +11.95% |
| test_show[100 Small Files] | 396.6 ms | 659 ms | -39.81% |

@jaychia jaychia requested a review from samster25 August 27, 2024 19:14
@@ -307,7 +307,7 @@ def run_partial_metadata(self, input_metadatas: list[PartialPartitionMetadata])
         return [
             PartialPartitionMetadata(
                 num_rows=self.scan_task.num_rows(),
-                size_bytes=self.scan_task.size_bytes(),
+                size_bytes=self.scan_task.size_bytes_on_disk(),
jaychia (Contributor, Author) commented:

Interestingly, this is pretty much never used in our codebase... and it's also the only place we actually use size_bytes_on_disk

Member commented:

Shouldn't this be using estimate_in_memory_size_bytes, since that is what we use in the scheduling?
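The distinction the reviewer raises can be illustrated with a toy sketch (this is not Daft's implementation; the function name echoes the `estimate_in_memory_size_bytes` API mentioned above, but the inflation factor and helper shown here are assumptions for illustration). Columnar formats like Parquet are compressed and encoded on disk, so the memory needed to materialize a scan is typically larger than `size_bytes_on_disk`:

```python
# Toy illustration (not Daft's actual code): scheduling on raw on-disk
# size can under-provision memory for compressed columnar files.

ASSUMED_INFLATION_RATIO = 4.0  # assumed Parquet-to-in-memory expansion

def estimate_in_memory_size_bytes(size_bytes_on_disk, ratio=ASSUMED_INFLATION_RATIO):
    """Scale the on-disk footprint by an assumed decompression/decoding ratio."""
    return int(size_bytes_on_disk * ratio)

on_disk = 25 * 1024 * 1024                          # a 25 MiB Parquet file
in_memory = estimate_in_memory_size_bytes(on_disk)  # ~100 MiB once decoded
```

This is why reporting `size_bytes_on_disk` directly as the partition's `size_bytes` could mislead a memory-aware scheduler.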


codecov bot commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 84.00000% with 8 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@20ffed4). Learn more about missing BASE report.
Report is 9 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| src/daft-scan/src/lib.rs | 85.10% | 7 Missing ⚠️ |
| src/daft-plan/src/physical_ops/scan.rs | 0.00% | 1 Missing ⚠️ |

@@           Coverage Diff           @@
##             main    #2735   +/-   ##
=======================================
  Coverage        ?   63.33%           
=======================================
  Files           ?      981           
  Lines           ?   113277           
  Branches        ?        0           
=======================================
  Hits            ?    71748           
  Misses          ?    41529           
  Partials        ?        0           
| Files | Coverage Δ |
| --- | --- |
| daft/execution/execution_step.py | 91.96% <ø> (ø) |
| src/daft-scan/src/python.rs | 69.68% <100.00%> (ø) |
| src/daft-plan/src/physical_ops/scan.rs | 10.11% <0.00%> (ø) |
| src/daft-scan/src/lib.rs | 60.33% <85.10%> (ø) |

@jaychia jaychia enabled auto-merge (squash) August 30, 2024 21:39
@jaychia jaychia merged commit 7e134f2 into main Aug 30, 2024
34 checks passed
@jaychia jaychia deleted the jay/fix-memory-estimations-limits-scantask branch August 30, 2024 21:54