[data] Enable per-op resource reservation #43171

raulchen · 2024-02-14T18:45:15Z

Why are these changes needed?

This PR enables per-op resource reservation by default. See docstring of ReservationOpResourceAllocator for the concrete protocol.
In addition, this PR:

- Increases the default object store memory limit from 25% to 50%, because we now have more precise control of memory.
- Renames OpResourceLimiter to OpResourceAllocator, and adds 2 new APIs.
- Implements several perf optimizations for different scenarios in ReservationOpResourceAllocator.
- Removes StreamingOutputBackpressurePolicy, as this feature is now implemented as ReservationOpResourceAllocator.max_task_output_bytes_to_read.
- Removes use_runtime_metrics_scheduling.
- Fixes a bug and adds util functions in ExecutionResources.
- Adds detailed resource usage in progress bars.
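
The concrete protocol is documented in the docstring of ReservationOpResourceAllocator and is not reproduced in this description. As a rough illustration of what per-op reservation can look like, here is a minimal sketch. It assumes a scheme in which a fixed fraction of a global budget is split equally among operators and the rest forms a shared pool. `OpBudget`, `split_budget`, `can_submit`, and `reservation_ratio` are hypothetical names for this sketch, not Ray Data APIs.

```python
from dataclasses import dataclass


@dataclass
class OpBudget:
    """Hypothetical per-operator budget, in bytes."""

    reserved: float
    used: float = 0.0


def split_budget(total_bytes, op_names, reservation_ratio=0.5):
    """Reserve an equal slice per op; return (per-op budgets, shared pool).

    Assumption: equal split of a fixed fraction. The real allocator may differ.
    """
    if not op_names:
        return {}, float(total_bytes)
    reserved_total = total_bytes * reservation_ratio
    per_op = reserved_total / len(op_names)
    budgets = {name: OpBudget(reserved=per_op) for name in op_names}
    return budgets, total_bytes - reserved_total


def can_submit(budget, shared_remaining, task_bytes):
    """A task fits if the op's unused reservation plus the shared pool covers it."""
    reserved_remaining = max(budget.reserved - budget.used, 0.0)
    return task_bytes <= reserved_remaining + shared_remaining


budgets, shared = split_budget(1000, ["read", "map"])
```

With a reservation in place, an operator can always make progress within its own slice even if another operator has exhausted the shared pool, which is the kind of control the description refers to.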

Related issue number

Closes #42217
#40754

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>


bveeramani

Discussed feedback offline. Otherwise LGTM.

bveeramani · 2024-02-26T20:58:25Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

        ExecutionResources(cpu=1) as its incremental usage.
+        Args:


Suggested change

ExecutionResources(cpu=1) as its incremental usage.

Args:


can-anyscale · 2024-02-28T18:09:53Z

This broke #43490 and #43493, I'm putting up a revert to confirm

This reverts commit d6380d4.

can-anyscale · 2024-02-28T18:12:31Z

as well as this test

Fix bugs in ReservationOpResourceAllocator (introduced by #43171)

- We treat a map op and its following non-map ops as the same group. `update_resources` already handles this properly, but `_should_unblock_streaming_output_backpressure` and `_op_outputs_reserved_remaining` didn't consider this.
- Since we don't reserve any resources for `limit` and `streaming_split`, we should set `num_cpus=0` for their tasks.
- `_reserved_for_op_outputs` currently also includes the op's internal output buffers. This is incorrect, because when `preserve_order=True`, task outputs will accumulate in the op's internal output buffer and use all the memory budget from `_reserved_for_op_outputs`. Then we still don't have memory budget to pull the blocks from the internal output buffer. Excluding the internal output buffer from `_reserved_for_op_outputs` fixes this issue.

Also deflake `test_backpressure_from_output` and `test_e2e_autoscaling_up`, as they depend on the physical memory size of the node.

Signed-off-by: Hao Chen <[email protected]>
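
The third fix above can be illustrated with a small accounting sketch. It is not the allocator's actual code: `outputs_reserved_remaining` and its parameters are hypothetical, and the byte counts are made up. It only shows why counting the internal output buffer against the output reservation can leave no budget to drain that same buffer.

```python
def outputs_reserved_remaining(
    reserved_for_outputs,
    internal_buffer_bytes,
    other_output_bytes,
    include_internal_buffer,
):
    """Remaining output budget for an op, in bytes (hypothetical accounting)."""
    used = other_output_bytes
    if include_internal_buffer:
        # Behavior described as incorrect: buffered blocks eat the reservation.
        used += internal_buffer_bytes
    return max(reserved_for_outputs - used, 0)


# Assumed scenario: with preserve_order=True, finished blocks pile up in the
# op's internal output buffer until earlier tasks complete.
reserved = 100
internal = 100
other = 0

before_fix = outputs_reserved_remaining(reserved, internal, other, True)
after_fix = outputs_reserved_remaining(reserved, internal, other, False)
```

In this scenario `before_fix` is 0, so nothing can be pulled out of the buffer, while `after_fix` is 100, so the buffered blocks can still be moved downstream.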

There is a perf regression for the 2 small-sized test cases in ray-data-resnet50-ingest-file-size-benchmark, due to the new backpressure change (#43171). Update some configs to fix the perf issue.

Signed-off-by: Hao Chen <[email protected]>

…on not enabled (#43686)

#43171 increased the default memory limit fraction to 50%, because with memory reservation we have more precise control over memory. This PR sets the default back to 25% when memory reservation is not enabled, to prevent regressions.

Signed-off-by: Hao Chen <[email protected]>
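
The rule described in this follow-up can be sketched as below. The function names are hypothetical and are not Ray Data's configuration names. Only the 50% and 25% values come from the text above.

```python
def default_memory_limit_fraction(reservation_enabled: bool) -> float:
    """Pick the default object store memory limit fraction (hypothetical helper)."""
    # 50% when per-op memory reservation is on, 25% otherwise.
    return 0.5 if reservation_enabled else 0.25


def object_store_memory_limit(total_object_store_bytes: int, reservation_enabled: bool) -> int:
    """Convert the fraction into an absolute byte limit."""
    fraction = default_memory_limit_fraction(reservation_enabled)
    return int(total_object_store_bytes * fraction)
```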

raulchen added 2 commits February 14, 2024 10:40

integrate streaming output backpressure

a454087

Signed-off-by: Hao Chen <[email protected]>

integrate scheduling

a90c82d

Signed-off-by: Hao Chen <[email protected]>

raulchen requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, stephanie-wang and omatthew98 as code owners February 14, 2024 18:45

raulchen marked this pull request as draft February 14, 2024 18:45

raulchen added 13 commits February 14, 2024 10:57

default streaming gen buffer

45aace2

Signed-off-by: Hao Chen <[email protected]>

fix remote args

ac4ee7a

Signed-off-by: Hao Chen <[email protected]>

enable

c8e8235

Signed-off-by: Hao Chen <[email protected]>

fix

7eef1c3

Signed-off-by: Hao Chen <[email protected]>

streaming backpressure based on size

0c57088

Signed-off-by: Hao Chen <[email protected]>

fix

d2b16cd

Signed-off-by: Hao Chen <[email protected]>

comment out

276c7c6

Signed-off-by: Hao Chen <[email protected]>

reduce streaming gen buffer to 2 blocks

21e80d0

Signed-off-by: Hao Chen <[email protected]>

fix obj_store_mem_max_pending_output_per_task

b00f347

Signed-off-by: Hao Chen <[email protected]>

increase default obj memory to 50%

35d0e24

Signed-off-by: Hao Chen <[email protected]>

print usage in progress bar

3a05243

Signed-off-by: Hao Chen <[email protected]>

fix usage_str

Signed-off-by: Hao Chen <[email protected]>

separate budgets

6e69a89

Signed-off-by: Hao Chen <[email protected]>

simplify

df2e4be

Signed-off-by: Hao Chen <[email protected]>

raulchen force-pushed the enable-memory-reservation branch from 3c3fd30 to df2e4be Compare February 17, 2024 03:45

raulchen added 5 commits February 20, 2024 16:48

refine code

a380f72

Signed-off-by: Hao Chen <[email protected]>

Merge branch 'master' into enable-memory-reservation

8fab9b7

only assign running tasks

cbf29fd

Signed-off-by: Hao Chen <[email protected]>

fix

d555014

Signed-off-by: Hao Chen <[email protected]>

handle fractional remaining

85eec48

Signed-off-by: Hao Chen <[email protected]>

raulchen added 3 commits February 26, 2024 13:03

fix

73efa2b

Signed-off-by: Hao Chen <[email protected]>

comment

1ffab5d

Signed-off-by: Hao Chen <[email protected]>

resnet test

92a515d

Signed-off-by: Hao Chen <[email protected]>

bveeramani reviewed Feb 26, 2024


raulchen added 8 commits February 26, 2024 14:48

update comments

bae892e

Signed-off-by: Hao Chen <[email protected]>

reserve min memory

5358ff5

Signed-off-by: Hao Chen <[email protected]>

lint

699e711

Signed-off-by: Hao Chen <[email protected]>

move e2e tests

6c72db1

Signed-off-by: Hao Chen <[email protected]>

loosen condition

bef8277

Signed-off-by: Hao Chen <[email protected]>

lint

a45edd9

Signed-off-by: Hao Chen <[email protected]>

lint

275ab66

Signed-off-by: Hao Chen <[email protected]>

fix

e2c2bdd

Signed-off-by: Hao Chen <[email protected]>

bveeramani approved these changes Feb 27, 2024


raulchen merged commit d6380d4 into ray-project:master Feb 27, 2024
8 of 9 checks passed

raulchen deleted the enable-memory-reservation branch February 27, 2024 21:05

can-anyscale added a commit that referenced this pull request Feb 28, 2024

Revert "[data] Enable per-op resource reservation (#43171)"

582b3b9

This reverts commit d6380d4.

can-anyscale mentioned this pull request Feb 28, 2024

Revert "[data] Enable per-op resource reservation" #43504

Closed

raulchen mentioned this pull request Feb 28, 2024

[data] Fix bug in memory reservation #43511

Merged

8 tasks

raulchen mentioned this pull request Feb 29, 2024

[data] optimize ray-data-resnet50-ingest-file-size-benchmark #43571

Merged

8 tasks

raulchen mentioned this pull request Mar 4, 2024

[data] set default memory limit fraction to 25% when memory reservation not enabled #43686

Merged

8 tasks

bveeramani mentioned this pull request Mar 4, 2024

[Data] Streaming executor backpressure #40754

Closed
