ksql should automatically pause queries that repeatedly hit fatal errors #6404

Open
rodesai opened this issue Oct 10, 2020 · 0 comments
Contributor

rodesai commented Oct 10, 2020

This builds on top of #6403

We recently added a query monitor to ksql that restarts failed queries so that we can recover automatically from transient errors. The downside is that some errors are not transient (e.g. a bug in a UDF), and some external errors persist for a long time (e.g. a serious outage in a system we depend on). In those cases it may be better to stop the query, stop retrying, and resume only once we know the error condition is resolved (a human deploys a patched UDF or fixes the external system). That way, any problems that repeated retries could compound are not made worse. This enhancement request proposes the following:

  • Add a way to detect that an error is non-transient. To start with, this could just be a time threshold: if a query keeps failing for longer than the threshold, we decide the error is not going away.
  • Leverage the solution from #6403 to set the desired state of the query to STOPPED, with some additional indication that this was done because of an internal error.
  • Add tooling that an operator can use to issue a START for the query once the problem is resolved. Longer term, we can look at doing this automatically via a loop that runs pre-flight checks (e.g. can I talk to Kafka, can I consume/produce a message, can I talk to Schema Registry) before enqueuing a restart message.
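
To make the first bullet concrete, here is a minimal sketch of the time-threshold idea. It is an illustration only: `PersistentErrorDetector`, `recordFailure`, and `recordRecovery` are hypothetical names, not existing ksql classes or methods, and the sketch assumes the query monitor can report each failure and each healthy recovery for a query id.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of ksql): decides when a query's
 * failures have persisted long enough to be treated as non-transient.
 */
final class PersistentErrorDetector {
  private final Duration threshold;
  // Start of the current unbroken run of failures, keyed by query id.
  private final Map<String, Instant> failingSince = new HashMap<>();

  PersistentErrorDetector(Duration threshold) {
    this.threshold = threshold;
  }

  /**
   * Records a failure observed at {@code now}. Returns true once the query
   * has been failing for at least the threshold without recovering.
   */
  boolean recordFailure(String queryId, Instant now) {
    // The first failure in a run starts the clock; later ones reuse it.
    Instant start = failingSince.computeIfAbsent(queryId, id -> now);
    return Duration.between(start, now).compareTo(threshold) >= 0;
  }

  /** Call when the query is healthy again, i.e. the error was transient. */
  void recordRecovery(String queryId) {
    failingSince.remove(queryId);
  }
}
```

When `recordFailure` returns true, the monitor would stop retrying and set the query's desired state to STOPPED using whatever mechanism #6403 ends up providing. How "healthy again" is defined (for example, running for some minimum time after a restart) is left open here and would need to be decided as part of the design.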