Description
When calling `WorkspaceClient.clusters.ensure_cluster_is_running`, the timeout is not configurable, and it is not cascaded to the downstream function calls. This can intermittently cause failures when clusters are slow to spin up. I am seeing long cluster startup times in my org's Databricks instance, so being able to configure this would be very helpful.
Reproduction
```python
client = WorkspaceClient()
cluster_id = "1234"
client.clusters.ensure_cluster_is_running(cluster_id)
# ... and 20+ minutes elapse
```
Expected behavior
I expect the ability to adjust the timeout period through a function parameter.
Instead of:
```python
def ensure_cluster_is_running(self, cluster_id: str) -> None:
    """Ensures that given cluster is running, regardless of the current state"""
    timeout = datetime.timedelta(minutes=20)
    deadline = time.time() + timeout.total_seconds()
    while time.time() < deadline:
        try:
            state = compute.State
            info = self.get(cluster_id)
            if info.state == state.RUNNING:
                return
            elif info.state == state.TERMINATED:
                self.start(cluster_id).result()
                return
            elif info.state == state.TERMINATING:
                self.wait_get_cluster_terminated(cluster_id)
                self.start(cluster_id).result()
                return
            elif info.state in (state.PENDING, state.RESIZING, state.RESTARTING):
                self.wait_get_cluster_running(cluster_id)
                return
            elif info.state in (state.ERROR, state.UNKNOWN):
                raise RuntimeError(f'Cluster {info.cluster_name} is {info.state}: {info.state_message}')
        except DatabricksError as e:
            if e.error_code == 'INVALID_STATE':
                _LOG.debug(f'Cluster was started by other process: {e} Retrying.')
                continue
            raise e
        except OperationFailed as e:
            _LOG.debug('Operation failed, retrying', exc_info=e)
    raise TimeoutError(f'timed out after {timeout}')
```
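For reference, the retry structure above is a generic deadline loop. A minimal standalone sketch (the `poll_until` name and signature are my own, not part of the SDK) of how such a loop can take a configurable timeout:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(check: Callable[[], Optional[T]],
               timeout_seconds: float = 20 * 60,
               sleep_seconds: float = 1.0) -> T:
    """Call `check` repeatedly until it returns a non-None value,
    raising TimeoutError once the configurable deadline has passed."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        result = check()
        if result is not None:
            return result
        time.sleep(sleep_seconds)
    raise TimeoutError(f'timed out after {timeout_seconds}s')
```

Making the deadline a parameter (rather than a hard-coded `timedelta(minutes=20)`) is all that is being asked for here.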
Would be good to have:
```python
def ensure_cluster_is_running(self, cluster_id: str, timeout_minutes: int = 20) -> None:
    """Ensures that given cluster is running, regardless of the current state"""
    timeout = datetime.timedelta(minutes=timeout_minutes)
    deadline = time.time() + timeout.total_seconds()
    while time.time() < deadline:
        try:
            state = compute.State
            info = self.get(cluster_id)
            if info.state == state.RUNNING:
                return
            elif info.state == state.TERMINATED:
                self.start(cluster_id).result(timeout=timeout)  # ADDED TIMEOUT
                return
            elif info.state == state.TERMINATING:
                self.wait_get_cluster_terminated(cluster_id, timeout=timeout)  # ADDED TIMEOUT
                self.start(cluster_id).result(timeout=timeout)  # ADDED TIMEOUT
                return
            elif info.state in (state.PENDING, state.RESIZING, state.RESTARTING):
                self.wait_get_cluster_running(cluster_id, timeout=timeout)  # ADDED TIMEOUT
                return
            elif info.state in (state.ERROR, state.UNKNOWN):
                raise RuntimeError(f'Cluster {info.cluster_name} is {info.state}: {info.state_message}')
        except DatabricksError as e:
            if e.error_code == 'INVALID_STATE':
                _LOG.debug(f'Cluster was started by other process: {e} Retrying.')
                continue
            raise e
        except OperationFailed as e:
            _LOG.debug('Operation failed, retrying', exc_info=e)
    raise TimeoutError(f'timed out after {timeout}')
```
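In the meantime, a caller-side workaround is to drive the underlying waiters directly with an explicit timeout. This is only a sketch: `ensure_running_with_timeout` is a hypothetical helper of mine, `clusters` stands for any object exposing the `get`/`start`/`wait_get_cluster_terminated`/`wait_get_cluster_running` methods used above (e.g. `WorkspaceClient().clusters`), and it deliberately skips the INVALID_STATE retry loop of the real method:

```python
import datetime


def ensure_running_with_timeout(clusters, cluster_id: str,
                                timeout: datetime.timedelta = datetime.timedelta(minutes=40)) -> None:
    """Drive a cluster to RUNNING, forwarding `timeout` to each waiter.

    `clusters` is anything exposing the methods ensure_cluster_is_running
    uses internally, e.g. WorkspaceClient().clusters.
    """
    info = clusters.get(cluster_id)
    # compute.State is an enum; compare by member name so this sketch
    # does not itself need to import databricks.sdk.service.compute.
    state = getattr(info.state, 'name', str(info.state))
    if state == 'RUNNING':
        return
    if state == 'TERMINATING':
        clusters.wait_get_cluster_terminated(cluster_id, timeout=timeout)
        state = 'TERMINATED'
    if state == 'TERMINATED':
        clusters.start(cluster_id).result(timeout=timeout)
        return
    # PENDING / RESIZING / RESTARTING: just wait for RUNNING.
    clusters.wait_get_cluster_running(cluster_id, timeout=timeout)
```

Both `wait_get_cluster_running(..., timeout=...)` and `Wait.result(timeout=...)` already accept a `timedelta`, so only `ensure_cluster_is_running` itself lacks the knob.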
Is it a regression?
Not that I'm aware of.
Debug Logs
n/a
Other Information
databricks-sdk == 0.28.0