We recently added a query-monitor in ksql that restarts failed queries so that we can automatically recover from transient errors. However, the downside is that for errors that are not transient (e.g. a bug in a udf), or for external errors that persist for a long time (e.g. some serious outage in a system we depend on), it may be better to stop the query, stop retrying, and then resume when we know the error condition is resolved (a human deploys a patched udf, or fixes the external system). This way, if there are any problems that may be compounded by repeated retries, we don't make them worse. This enhancement request proposes the following:
Add a way to detect that an error is non-transient. To start with, this could just be a time threshold beyond which we decide an error is not going away.
Leverage the solution from #6403 to set the desired state of the query to STOPPED, with some additional indication that this is for an internal error.
Add tooling that an operator can use to issue a START for the query once the problem is resolved. Longer term we can look at doing this automatically via some loop that does pre-flight checks (e.g. can I talk to kafka, consume/produce a message, can I talk to sr, etc.) before enqueuing a restart message.
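To make the three steps above concrete, here is a minimal sketch of how they could fit together. It is written in Python for brevity and is not ksql code: `QueryRecord`, `QueryMonitor`, the `INTERNAL_ERROR` reason string, and the shape of the pre-flight checks are all hypothetical names chosen for illustration, and the threshold value is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class DesiredState(Enum):
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


@dataclass
class QueryRecord:
    """Hypothetical per-query bookkeeping; not an actual ksql class."""
    query_id: str
    desired_state: DesiredState = DesiredState.RUNNING
    stop_reason: Optional[str] = None      # distinguishes an internal stop from a user stop
    failing_since: Optional[float] = None  # time of the first failure in the current streak


class QueryMonitor:
    def __init__(self, non_transient_after_s: float):
        # Assumed heuristic: an error that persists this long is treated as non-transient.
        self.non_transient_after_s = non_transient_after_s

    def on_failure(self, query: QueryRecord, now: float) -> str:
        """Return "retry" while the error may be transient, "stop" once it is not."""
        if query.failing_since is None:
            query.failing_since = now
        if now - query.failing_since >= self.non_transient_after_s:
            query.desired_state = DesiredState.STOPPED
            query.stop_reason = "INTERNAL_ERROR"
            return "stop"
        return "retry"

    def on_healthy(self, query: QueryRecord) -> None:
        """Reset the failure streak once the query runs cleanly again."""
        query.failing_since = None

    def start(self, query: QueryRecord,
              preflight: Dict[str, Callable[[], bool]]) -> List[str]:
        """Operator-issued START: resume only if every pre-flight check passes.

        Returns the names of the failed checks (empty list means the query was resumed).
        """
        failed = [name for name, check in preflight.items() if not check()]
        if not failed:
            query.desired_state = DesiredState.RUNNING
            query.stop_reason = None
            query.failing_since = None
        return failed
```

An operator tool would call `start` with checks such as "can reach kafka" or "can reach sr"; the same method could later be driven by an automatic loop. Passing `now` in explicitly keeps the threshold logic easy to test.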
This builds on top of #6403