[Placement group] Refine Remove API #10232

Open
rkooo567 opened this issue Aug 20, 2020 · 2 comments
Labels
core-placement-group, enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@rkooo567
Contributor

rkooo567 commented Aug 20, 2020

Describe your feature request

When workers are killed, information about their death is not properly propagated to the cluster. This causes issues such as:

  • When tasks or actors are killed by placement group (pg) fate sharing, the exception raised is "WorkerCrashedError" or "ActorDiedError" instead of PlacementGroupError.
  • When a new task is scheduled on a removed placement group, it becomes permanently infeasible. This is a problem when task retry is involved (e.g., a task fails because its worker is killed, the core worker tries to reschedule it, and the rescheduled task becomes infeasible).
  • The resource demand of newly submitted infeasible tasks is never deleted.

The ideal final state:

  • All workers killed by "remove_placement_group" should raise PlacementGroupRemovedException.
  • All subsequent actor creations & task submissions that use the removed placement group should raise PlacementGroupRemovedException.
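
To make the target behavior concrete, here is a minimal toy model of the two bullets above. It is only a sketch: `ToyCluster`, its methods, and the exception class defined here are hypothetical stand-ins written for illustration, not existing Ray classes or internals.

```python
# Toy model only: none of these classes are Ray APIs.
class PlacementGroupRemovedException(Exception):
    """Hypothetical exception named in this issue (assumed shape)."""

    def __init__(self, pg_id):
        super().__init__(f"placement group {pg_id} was removed")
        self.pg_id = pg_id


class ToyCluster:
    """Tracks removed placement groups and the tasks running in them."""

    def __init__(self):
        self.removed_pgs = set()
        self.running = {}   # task_id -> pg_id
        self.failures = {}  # task_id -> exception kept for the caller

    def submit(self, task_id, pg_id):
        # Second bullet: submitting against a removed pg fails immediately.
        if pg_id in self.removed_pgs:
            raise PlacementGroupRemovedException(pg_id)
        self.running[task_id] = pg_id

    def remove_placement_group(self, pg_id):
        self.removed_pgs.add(pg_id)
        # First bullet: fate-shared work records the real cause of death
        # instead of a generic worker-crash error.
        for task_id, owner in list(self.running.items()):
            if owner == pg_id:
                del self.running[task_id]
                self.failures[task_id] = PlacementGroupRemovedException(pg_id)

    def get(self, task_id):
        # Stand-in for fetching a result: re-raise the recorded failure.
        if task_id in self.failures:
            raise self.failures[task_id]
        return "running"
```

In this model, a caller that fetches the result of a killed task and a caller that submits a new task to the removed pg both see the same exception type.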

Implementation:
There are two possible implementations:

  • When the placement group is removed, publish the removal information to all interested parties (core worker & raylet). This means that whenever we start using a pg, we should always subscribe to its state (to make sure we don't use removed placement groups).
  • Mark tasks as "infeasible" if the corresponding placement group is already removed. This can be implemented as a part of "Raise an Exception when a task, actor, or placement group is permanently infeasible" #18835
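
A rough sketch of the first option (publish removal to subscribers), assuming a simple in-process publisher. `PgStatePublisher` and `Subscriber` are hypothetical names used only for illustration; they do not describe Ray's actual pubsub layer.

```python
# Hypothetical sketch of the pubsub option; not Ray's pubsub implementation.
from collections import defaultdict


class PgStatePublisher:
    """Stand-in for the component that would publish pg removal."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # pg_id -> callbacks
        self._removed = set()

    def subscribe(self, pg_id, on_removed):
        # A late subscriber must still learn that the pg is already gone.
        if pg_id in self._removed:
            on_removed(pg_id)
            return
        self._subscribers[pg_id].append(on_removed)

    def publish_removed(self, pg_id):
        self._removed.add(pg_id)
        for callback in self._subscribers.pop(pg_id, []):
            callback(pg_id)


class Subscriber:
    """Stand-in for a core worker or raylet that uses placement groups."""

    def __init__(self, publisher):
        self._publisher = publisher
        self.known_removed = set()

    def start_using(self, pg_id):
        # Subscribe before first use so a removal is never missed.
        self._publisher.subscribe(pg_id, self.known_removed.add)

    def can_schedule(self, pg_id):
        return pg_id not in self.known_removed
```

The cost of this option is visible even in the toy version: every user of a pg has to subscribe before first use, and the publisher has to remember removed pgs for late subscribers.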

The second solution is preferred.
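
A minimal sketch of the "mark as infeasible" option, assuming the scheduler keeps a set of removed placement groups. `ToyScheduler`, `Decision`, and the `has_capacity` flag are hypothetical and simplified for illustration; they are not Ray scheduler internals.

```python
# Hypothetical sketch of the "mark as infeasible" option; illustrative names.
from enum import Enum


class Decision(Enum):
    SCHEDULED = "scheduled"
    QUEUED = "queued"  # waiting for resources; demand is reported
    INFEASIBLE_PG_REMOVED = "infeasible_pg_removed"  # fail now, never queue


class ToyScheduler:
    def __init__(self):
        self.removed_pgs = set()
        self.pending_demand = []  # (task_id, pg_id) pairs awaiting resources

    def remove_placement_group(self, pg_id):
        self.removed_pgs.add(pg_id)
        # Drop demand that can never be satisfied so it does not linger.
        self.pending_demand = [
            (task, pg) for task, pg in self.pending_demand if pg != pg_id
        ]

    def schedule(self, task_id, pg_id, has_capacity):
        # Check removal first: a retried task lands here too, so it gets a
        # definite failure instead of waiting forever.
        if pg_id in self.removed_pgs:
            return Decision.INFEASIBLE_PG_REMOVED
        if has_capacity:
            return Decision.SCHEDULED
        self.pending_demand.append((task_id, pg_id))
        return Decision.QUEUED
```

Under these assumptions, a caller that receives `INFEASIBLE_PG_REMOVED` can raise PlacementGroupRemovedException right away, which covers both the retry case and the stale-demand case described above.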

@rkooo567 rkooo567 added enhancement Request for new feature and/or capability P2 Important issue, but not time-critical labels Aug 20, 2020
@rkooo567 rkooo567 added this to the Placement Group API milestone Aug 20, 2020
@rkooo567 rkooo567 changed the title Refine Remove API [Placement group] Refine Remove API Aug 20, 2020
@rkooo567 rkooo567 self-assigned this Aug 21, 2020
@rkooo567
Contributor Author

Not worth doing before the new scheduler is enabled.

@rkooo567
Contributor Author

@oliverhu If we implement pubsub stuff, this can also be handled.

@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical labels Nov 1, 2021
@rkooo567 rkooo567 removed their assignment Nov 4, 2021