[data] Fix bug in memory reservation #43511
Conversation
Signed-off-by: Hao Chen <[email protected]>
LG
E.g.,
- "cur_map->downstream_map" will return [downstream_map].
- "cur_map->limit1->limi2->downstream_map" will return [downstream_map].
limi2: typo.
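For context, here is a hedged sketch of the traversal the quoted docstring describes: walk downstream from an operator and return the nearest map operators, looking through pass-through operators such as limit. The helper name and the downstream_ops attribute are assumptions for illustration, not the actual Ray Data code.

def nearest_downstream_maps(op, is_map, is_pass_through):
    """Return the nearest downstream ops for which is_map(op) is True,
    looking through ops for which is_pass_through(op) is True (e.g. limit)."""
    result = []
    for next_op in op.downstream_ops:  # 'downstream_ops' is an assumed attribute name
        if is_map(next_op):
            result.append(next_op)
        elif is_pass_through(next_op):
            # Skip over pass-through ops and keep walking downstream.
            result.extend(nearest_downstream_maps(next_op, is_map, is_pass_through))
    return result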
Signed-off-by: Hao Chen <[email protected]>
 @pytest.mark.parametrize(
-    "cluster_cpus, cluster_obj_store_mem_mb",
+    "cluster_cpus, cluster_obj_store_mem_mb, insert_limit_op",
     [
-        (3, 500),  # CPU not enough
-        (4, 100),  # Object store memory not enough
-        (3, 100),  # Both not enough
+        (3, 500, False),  # CPU not enough
+        (3, 500, True),  # CPU not enough
+        (4, 100, False),  # Object store memory not enough
+        (4, 100, True),  # Object store memory not enough
+        (3, 100, False),  # Both not enough
+        (3, 100, True),  # Both not enough
     ],
 )
Nit: to test all existing combinations with insert_limit_op, it might be cleaner to do another parametrize:
@pytest.mark.parametrize(
    "cluster_cpus, cluster_obj_store_mem_mb",
    [
        (3, 500),
        (4, 100),
        (3, 100),
    ],
)
@pytest.mark.parametrize("insert_limit_op", [False, True])
Signed-off-by: Hao Chen <[email protected]>
Can you help check that the failing tests on premerge are the same as the ones failing in go/flaky? Thanks.
@can-anyscale Checked. One is an existing flaky test; the other is failing because Hugging Face is down right now.
Why are these changes needed?
Fix bugs in ReservationOpResourceAllocator (introduced by #43171):
- update_resources already handles this properly, but _should_unblock_streaming_output_backpressure and _op_outputs_reserved_remaining didn't consider this.
- limit and streaming_split should set num_cpus=0 for their tasks.
- _reserved_for_op_outputs currently also includes the op's internal output buffers. This is incorrect: when preserve_order=True, task outputs accumulate in the op's internal output buffer and consume the entire memory budget from _reserved_for_op_outputs, leaving no budget to pull blocks out of that buffer. Excluding the internal output buffer from _reserved_for_op_outputs fixes this issue (see the sketch below).
Also deflake test_backpressure_from_output and test_e2e_autoscaling_up, as they depend on the physical memory size of the node.
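To make the last fix concrete, here is a minimal, hypothetical sketch of the reservation accounting it describes. The function and parameter names (op_outputs_reserved_remaining, reserved_for_op_outputs, external_output_bytes, internal_output_buffer_bytes) are illustrative assumptions, not Ray's actual API.

# Hypothetical sketch; NOT the actual ReservationOpResourceAllocator code.
def op_outputs_reserved_remaining(
    reserved_for_op_outputs: int,       # bytes reserved for this op's outputs
    external_output_bytes: int,         # output blocks already pulled out of the op
    internal_output_buffer_bytes: int,  # blocks still in the op's internal output buffer
) -> int:
    """Remaining output-memory budget for the op.

    Buggy accounting also charged the internal output buffer, i.e.
        reserved_for_op_outputs - (external_output_bytes + internal_output_buffer_bytes)
    With preserve_order=True the internal buffer alone can consume the whole
    budget, leaving nothing to pull its own blocks out with.
    """
    # Fixed accounting: only blocks that have already left the internal buffer
    # count against the reservation, so the buffer can always be drained.
    return max(0, reserved_for_op_outputs - external_output_bytes)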
Related issue number
closes #43493
closes #43490
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.