Add Fake TPU e2e Autoscaling Test Cases #2279
base: master
This behavior seems a bit unexpected to me. What's the reason we expect a scale down and a scale up again in this scenario?
I might just be misunderstanding this comment. Should there be an assertion for this part?
Detached actors keep running when the Ray node they're scheduled on is scaled down, so the autoscaler still sees the request for TPUs and scales a multi-host worker group back up to meet the unmet demand. In a regular scenario (i.e., with non-detached actors), the actors would be terminated along with their respective nodes when the replica scales down.
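For illustration, a minimal sketch of the detached-actor pattern being described; the actor class, resource label, and actor name below are assumptions for the example, not the test's actual code:

```python
import ray

ray.init()

# A detached actor outlives its driver, and its TPU resource request stays
# visible to the autoscaler even after the node it ran on is scaled down.
@ray.remote(resources={"TPU": 4})
class TPUWorker:
    def ping(self):
        return "ok"

# Because lifetime="detached", the autoscaler still sees unmet TPU demand
# after the multi-host worker group scales down, so it scales the group back up.
worker = TPUWorker.options(name="tpu-worker-0", lifetime="detached").remote()
ray.get(worker.ping.remote())
```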
I can add an assertion that checks that the pod list length becomes 0 before becoming 4 again
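Roughly, that assertion could look like the sketch below, assuming the test lists worker pods through the Kubernetes Python client and that the worker group pods carry a `ray.io/group` label; the label value, namespace, and timeouts here are assumptions:

```python
import time
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

def worker_pod_count(namespace="default", label="ray.io/group=tpu-group"):
    """Count pods in the multi-host TPU worker group (label is an assumption)."""
    pods = core_v1.list_namespaced_pod(namespace, label_selector=label)
    return len(pods.items)

def wait_for_pod_count(expected, timeout_s=300, interval_s=5):
    """Poll until the worker pod count equals `expected`, or fail the test."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if worker_pod_count() == expected:
            return
        time.sleep(interval_s)
    raise AssertionError(f"worker pod count never reached {expected}")

# Assert the replica first scales down to 0 pods, then back up to 4.
wait_for_pod_count(0)
wait_for_pod_count(4)
```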
Ah I see, I missed the behavior specific to detached actors.
sgtm!
I ended up removing this section in 0c6bb58, because getting the node to become idle requires setting the timeout to 5+ minutes, which I'd imagine would slow down the presubmit too much. The behavior of scaling down a multi-host replica is still tested by deleting the detached actors.
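As a sketch of that scale-down path (the actor names are placeholders), deleting the detached actors releases their TPU requests, so the autoscaler can remove the multi-host replica without waiting on the idle timeout:

```python
import ray

ray.init()

# Killing each detached actor drops its TPU resource request, so the
# autoscaler no longer sees unmet demand and scales the multi-host
# worker group down without relying on the node idle timeout.
for name in ["tpu-worker-0", "tpu-worker-1", "tpu-worker-2", "tpu-worker-3"]:
    actor = ray.get_actor(name)
    ray.kill(actor)
```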