[Placement Group] Fix placement group removal leak #19138

Merged: 7 commits, Oct 8, 2021

rkooo567 (Contributor) commented on Oct 6, 2021

Why are these changes needed?

The leak is reproducible when:

  1. A placement group ready task is queued.
  2. A new node comes up and the pg is created.
  3. The ready task is scheduled as soon as the pg is created.

When a new bundle is created, the AddLocalResourceInstances function doesn't properly update the total resources. Removal then fails, because freeing the bundle's resources requires more total resources than were recorded, and the resources leak.
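
To make the failure mode concrete, here is a minimal sketch of why an under-counted total blocks removal. It is a simplified model, not Ray's actual removal code: `WildcardEntry` and `TryRemoveAll` are hypothetical names, and amounts are stored as integer thousandths (an assumption made only to keep the arithmetic exact).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one wildcard resource entry such as bundle_[pg_id].
// Amounts are integer thousandths, so 3999.999 is stored as 3999999.
struct WildcardEntry {
  std::int64_t total = 0;
  std::int64_t available = 0;
};

// Hypothetical removal step: give back every bundle's contribution at once.
// Returns false and leaves the entry untouched when the recorded total or
// available amount cannot cover the sum, which models the leak.
inline bool TryRemoveAll(WildcardEntry &entry,
                         const std::vector<std::int64_t> &contributions) {
  std::int64_t sum = 0;
  for (std::int64_t c : contributions) {
    assert(c >= 0);
    sum += c;
  }
  if (entry.total < sum || entry.available < sum) {
    return false;  // bookkeeping is inconsistent, so nothing is freed
  }
  entry.total -= sum;
  entry.available -= sum;
  return true;
}
```

With four bundles of 1000 each, an entry recorded as `{3999999, 3999999}` is rejected by this model, while `{4000000, 4000000}` is released cleanly.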

Detailed scenario

1 node A: {CPU: 16}
1 pg created: {CPU: 1, GPU: 1} + {CPU: 1} * 30

- The ready task is scheduled. It is queued since the pg is not created yet.
- A new node B is added: {CPU: 16, GPU: 1}
- Now the bundles are created. Imagine 15 bundles are created on node A.
- Each bundle adds its wildcard resources -> bundle_[pg_id] becomes 1000, 2000, 3000, ..., 15000
- Now the ready task is scheduled. It occupies part of the bundle resource.
- Imagine this happens before all bundles are created, say right after bundle_[pg_id] reaches 3000.
- Now the available resource is bundle_[pg_id]: 2999.999
- Then a new bundle is created. The calculation logic here is wrong, so the resource becomes `bundle_[pg_id]: {total: 3999.999, avail: 3999.999}`
- Now the bundle is "unremovable": removing the placement group needs a total of 4000 for bundle_[pg_id], but only 3999.999 is recorded.
- - ^ The calculation should be fixed so that the resource becomes `bundle_[pg_id]: {total: 4000, avail: 3999.999}` -> this is what the PR does
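
The scenario above can be replayed with a small sketch that contrasts the two update rules from the diff. The names `ResourceEntry`, `AddInstancesOld`, `AddInstancesNew`, and `ReplayScenario` are hypothetical, and amounts are integer thousandths (an assumption, so that 2999.999 is exact); only the two update formulas come from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one resource entry such as bundle_[pg_id].
// Amounts are integer thousandths, so 2999.999 is stored as 2999999.
struct ResourceEntry {
  std::int64_t total = 0;
  std::int64_t available = 0;
};

// Pre-fix rule: total is derived from available, so it under-counts
// whenever part of the resource is already in use.
inline void AddInstancesOld(ResourceEntry &r, std::int64_t amount) {
  assert(amount >= 0);
  r.available += amount;
  r.total = std::max(r.total, r.available);
}

// Post-fix rule: total grows by exactly the amount added.
inline void AddInstancesNew(ResourceEntry &r, std::int64_t amount) {
  assert(amount >= 0);
  r.available += amount;
  r.total += amount;
}

// Replays the scenario: three bundles are created, the ready task takes
// 0.001 of the wildcard resource, then a fourth bundle is created.
template <typename AddFn>
ResourceEntry ReplayScenario(AddFn add) {
  ResourceEntry r;
  for (int i = 0; i < 3; ++i) add(r, 1000000);
  r.available -= 1;  // the ready task acquires 0.001
  add(r, 1000000);
  return r;
}
```

`ReplayScenario(AddInstancesOld)` ends with total and available both at 3999.999, while `ReplayScenario(AddInstancesNew)` ends with total 4000 and available 3999.999, matching the numbers in the description.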

Related issue number

Closes #19131

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rkooo567 changed the title from "Fix gpu only node issue" to "[Placement Group] Fix placement group removal leak" on Oct 6, 2021
rkooo567 changed the title from "[Placement Group] Fix placement group removal leak" to "[WIP][Placement Group] Fix placement group removal leak" on Oct 6, 2021
rkooo567 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on Oct 6, 2021

@@ -456,8 +456,7 @@ void ClusterResourceScheduler::AddLocalResourceInstances(

   for (size_t i = 0; i < instances.size(); i++) {
     node_instances->available[i] += instances[i];
-    node_instances->total[i] =
-        std::max(node_instances->total[i], node_instances->available[i]);
+    node_instances->total[i] += instances[i];

Contributor Author

This was a bug

Contributor

Seems reasonable, though I don't get what the previous code was trying to do. @wuisawesome any comment? The blame seems to point to f52c855.

Contributor Author

it is already merged, but check #19138 (comment). (seems like he doesn't remember)

@@ -268,7 +268,6 @@ void ClusterTaskManager::DispatchScheduledTasksToWorkers(
     WorkerPoolInterface &worker_pool,
     std::unordered_map<WorkerID, std::shared_ptr<WorkerInterface>> &leased_workers) {
   if (num_tasks_waiting_for_dispatch_ == 0) {
-    RAY_LOG(DEBUG) << "No new tasks since last call to dispatch, skipping";

Contributor Author

The log statements changed in this file were too noisy.

rkooo567 removed the @author-action-required label on Oct 7, 2021
rkooo567 changed the title from "[WIP][Placement Group] Fix placement group removal leak" to "[Placement Group] Fix placement group removal leak" on Oct 7, 2021

@@ -138,7 +138,7 @@ class TaskResourceInstances {
   /// Check whether there are no resource instances.
   bool IsEmpty() const;
   /// Returns human-readable string for these resources.
-  std::string DebugString() const;
+  [[nodiscard]] std::string DebugString(const StringIdMap &string_id_map) const;

Contributor

What is nodiscard? Can we remove this?

Contributor Author

cc @mwtian is this necessary or best practices? Should we keep including this in the linter?

Member

This can be useful in cases where a status code is returned, but it seems too pedantic here. Added you to reviewers of #19175 which disables this check. I can also disable this in a separate PR.

Contributor

Why would we want to call DebugString() and ignore the result?
It sounds like a bug.

(I like [[nodiscard]] here)

Contributor Author

Hmm, I see. If that's what it means, it makes sense to have it. I think it comes down to which convention we'd like to use.

Member (mwtian, Oct 8, 2021)

@sasha-s I agree [[nodiscard]] can be useful in many cases, but for a function like DebugString() the benefit seems small, since ignoring its result is probably not a serious error. The cost here is making the code look too different from "normal" C++. In an alternative world, I think [[nodiscard]], noexcept and const should be the default, requiring no annotations, and only the opposite behavior would need annotations. Not requiring these annotations and only tagging them strategically seems more realistic right now.

Contributor

I would argue that if you have DebugString() in your code and do not check the output, then, most likely, you have some post-debugging artifacts.
Would you want those to be checked in?

Also, to me, "normal" c++ code has [[nodiscard]] almost everywhere, both on functions and on types.

Contributor

This is a stylistic decision. Given that we currently don't have [[nodiscard]], we should probably defer to the time-tested rule of being consistent with current style decisions in the codebase, which is to not have [[nodiscard]].

Otherwise, we should add it everywhere (definitely should be a separate PR, that would be pretty huge).
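
For readers unfamiliar with the attribute debated in this thread: `[[nodiscard]]` is a standard C++17 attribute that encourages the compiler to warn when a call's return value is ignored. The sketch below uses a hypothetical `Counter` class and `Validate` function (neither is Ray code) to show both uses mentioned above: a debug-string accessor and a status-returning function.

```cpp
#include <cassert>
#include <string>

// Hypothetical type for illustration; not Ray's TaskResourceInstances.
class Counter {
 public:
  explicit Counter(int value) : value_(value) { assert(value >= 0); }

  // If a caller writes `c.DebugString();` and drops the result, compilers
  // are encouraged to emit a warning (GCC and Clang do so by default).
  [[nodiscard]] std::string DebugString() const {
    return "Counter{" + std::to_string(value_) + "}";
  }

 private:
  int value_;
};

// A status-returning function, the case where ignoring the result is most
// likely to hide a real error.
enum class Status { kOk, kInvalid };

[[nodiscard]] inline Status Validate(const Counter &c) {
  return c.DebugString().empty() ? Status::kInvalid : Status::kOk;
}
```

A caller that intentionally ignores the result can still write `(void)Validate(c);`, so the attribute flags accidental discards without forbidding deliberate ones.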

ericl added the @author-action-required label on Oct 7, 2021
rkooo567 removed the @author-action-required label on Oct 7, 2021
ericl added the @author-action-required label on Oct 8, 2021
ericl merged commit afaee05 into ray-project:master on Oct 8, 2021
7 participants