[spark] Automatically shut down ray on spark cluster if user does not execute commands on databricks notebook for a long time #31962
Signed-off-by: Weichen Xu <[email protected]>
What about this case? The notebook cell is:
ray.init()
result = long_running_task.remote()
This cell finishes running immediately, and the next cell is executed more than 30 minutes later:
ray.get(result)
Will the Ray cluster get terminated between these two cells?
In this case, the notebook status will be running (blocking on |
There are related test failures:
Addressed.
… execute commands on databricks notebook for a long time (ray-project#31962) Databricks Runtime provides an API: dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution() that returns elapsed milliseconds since last databricks notebook code execution. This PR code calls this interface to monitor notebook activity and shut down Ray cluster on timeout. Signed-off-by: Weichen Xu <[email protected]>
@WeichenXu123 Can you add test cases for the following cases?
Filed a follow-up PR #32162
…ng messages (#32162) See follow-up comments in #31962 Signed-off-by: Weichen Xu <[email protected]>
… execute commands on databricks notebook for a long time (ray-project#31962) Databricks Runtime provides an API: dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution() that returns elapsed milliseconds since last databricks notebook code execution. This PR code calls this interface to monitor notebook activity and shut down Ray cluster on timeout. Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: Edward Oakes <[email protected]>
…ng messages (ray-project#32162) See follow-up comments in ray-project#31962 Signed-off-by: Weichen Xu <[email protected]> Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: Weichen Xu [email protected]
Why are these changes needed?
Automatically shut down the Ray on Spark cluster if the user does not execute commands in the Databricks notebook for a long time.
Databricks Runtime provides an API:
dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution()
that returns the number of milliseconds elapsed since the last Databricks notebook code execution. This PR calls this interface to monitor notebook activity and shuts down the Ray cluster on timeout.
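As a rough illustration of the monitoring idea described above (not the code in this PR), the sketch below polls an idle-time source from a background thread and invokes a shutdown callback once a timeout is exceeded. The names `start_idle_monitor`, `get_idle_millis`, `shutdown`, `timeout_minutes`, and `poll_interval_seconds` are hypothetical and chosen only for this example; the idle-time source is injected so the sketch also runs outside Databricks.

```python
import threading
from typing import Callable


def start_idle_monitor(
    get_idle_millis: Callable[[], float],
    shutdown: Callable[[], None],
    timeout_minutes: float,
    poll_interval_seconds: float = 60.0,
) -> threading.Event:
    """Start a daemon thread that calls `shutdown` once the idle time
    reported by `get_idle_millis` exceeds `timeout_minutes`.

    Returns an Event; setting it stops the monitor without shutting down.
    All names here are hypothetical, not taken from the PR.
    """
    stop_event = threading.Event()
    timeout_millis = timeout_minutes * 60 * 1000

    def _watch() -> None:
        # Event.wait returns False when the poll interval elapses and
        # True once stop_event has been set, which ends the loop.
        while not stop_event.wait(poll_interval_seconds):
            if get_idle_millis() > timeout_millis:
                shutdown()
                return

    threading.Thread(target=_watch, daemon=True).start()
    return stop_event
```

In a Databricks notebook, `get_idle_millis` could presumably be `lambda: dbutils.entry_point.getIdleTimeMillisSinceLastNotebookExecution()`, and `shutdown` would be whatever tears down the Ray on Spark cluster; both wirings are assumptions rather than details confirmed by this PR.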
Related issue number
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.