ksql should automatically pause queries that repeatedly hit fatal errors #6404

rodesai · 2020-10-10T20:57:17Z

This builds on top of #6403

We recently added a query monitor to ksql that restarts failed queries so that we can automatically recover from transient errors. The downside is that for errors that are not transient (e.g. a bug in a UDF), or for external errors that persist for a long time (e.g. a serious outage in a system we depend on), it may be better to stop the query, stop retrying, and resume only once we know the error condition is resolved (a human deploys a patched UDF or fixes the external system). That way, any problems that are compounded by repeated retries are not made worse. This enhancement request proposes the following:

- Add a way to detect that an error is non-transient. To start with, this could just be a time threshold beyond which we decide an error is not going away.
- Leverage the solution from #6403 to set the desired state of the query to STOPPED, with some additional indication that the stop was caused by an internal error.
- Add tooling that an operator can use to issue a START for the query once the problem is resolved. Longer term, we can look at doing this automatically via a loop that runs pre-flight checks (e.g. can I talk to Kafka, consume/produce a message, talk to Schema Registry) before enqueuing a restart message.
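
The three steps above could be sketched roughly as follows. This is only an illustration of the proposed flow, not ksql code: `QueryMonitor`, `QueryRecord`, `DesiredState`, the `INTERNAL_ERROR` reason string, and the 300-second threshold are all hypothetical names and values chosen for the example. The clock is passed in explicitly so the behavior is easy to reason about.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DesiredState(Enum):
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"


@dataclass
class QueryRecord:
    """Hypothetical per-query bookkeeping kept by the monitor."""
    query_id: str
    desired_state: DesiredState = DesiredState.RUNNING
    stop_reason: Optional[str] = None      # set only when the monitor pauses the query
    failing_since: Optional[float] = None  # time of the first failure in the current streak


class QueryMonitor:
    """Restarts failed queries until the failure streak outlasts a time threshold."""

    def __init__(self, non_transient_after_s: float):
        # Assumption: a simple time threshold decides "this error is not going away".
        self.non_transient_after_s = non_transient_after_s

    def on_failure(self, q: QueryRecord, now: float) -> bool:
        """Return True if the query should be restarted, False if it is paused."""
        if q.desired_state is DesiredState.STOPPED:
            return False
        if q.failing_since is None:
            q.failing_since = now
        if now - q.failing_since >= self.non_transient_after_s:
            # Step 2: flip the desired state and record why it was stopped.
            q.desired_state = DesiredState.STOPPED
            q.stop_reason = "INTERNAL_ERROR"
            return False
        return True

    def on_healthy(self, q: QueryRecord) -> None:
        """A query that ran successfully ends its failure streak."""
        q.failing_since = None

    def operator_start(self, q: QueryRecord) -> None:
        """Step 3: an operator resumes the query once the problem is fixed."""
        q.desired_state = DesiredState.RUNNING
        q.stop_reason = None
        q.failing_since = None


monitor = QueryMonitor(non_transient_after_s=300)
query = QueryRecord("example_query")
monitor.on_failure(query, now=0.0)    # first failure: keep retrying
monitor.on_failure(query, now=300.0)  # still failing after 300 s: pause the query
monitor.operator_start(query)         # operator resumes after the fix
```

Keeping the stop reason separate from the desired state lets tooling distinguish a query paused by the monitor from one an operator stopped deliberately, which matters if an automated pre-flight loop is later allowed to resume only the former.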

rodesai added enhancement needs-triage and removed needs-triage labels Oct 10, 2020
