-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: surface error to user when command topic deleted while server running #6240
feat: surface error to user when command topic deleted while server running #6240
Conversation
39235d4
to
059b68f
Compare
059b68f
to
04bf50f
Compare
812a4eb
to
2d66b31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Some minor comments inline
INCOMPATIBLE_COMMAND(Errors:: commandRunnerDegradedIncompatibleCommandsErrorMessage); | ||
CORRUPTED(Errors::commandRunnerDegradedBackupCorruptedErrorMessage), | ||
INCOMPATIBLE_COMMAND(Errors::commandRunnerDegradedIncompatibleCommandsErrorMessage), | ||
COMMAND_TOPIC_DELETED(Errors::commandRunnerDegradedCommandTopicDeletedErrorMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember a discussion on the design doc indicated that maybe we don't want to differentiate between corrupted and deleted. What was the decision on that? (also good to document it here for posterity)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main reason is right now, backups are optional, so if backups aren't enabled, then it wouldn't really make sense to have the degraded reason be CORRUPTED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking it over with Rohan a bit more, I've combined CORRUPTED and COMMAND_TOPIC_DELETED into CORRUPTED. The COMMAND_TOPIC_DELETED could show up as CORRUPTED after a restart and having different states for the same external action (deleting the command topic) is strange.
I've made the error message returned from the API generic enough so it address both the backup corruption and the command topic deleted/modified case
The server has detected corruption in the command topic due to modifications performed on it. DDL statements will not be processed any further.
If a backup of the command topic is available, restore the command topic using the backup file.
A server restart is required to restore full functionality
@@ -289,6 +302,9 @@ void fetchAndRunCommands() { | |||
lastPollTime.set(clock.instant()); | |||
final List<QueuedCommand> commands = commandStore.getNewCommands(NEW_CMDS_TIMEOUT); | |||
if (commands.isEmpty()) { | |||
if (!commandTopicExists.get()) { | |||
commandTopicDeleted = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can save this for another PR, but I think it makes sense to have this method return some status then to set a ton of flags and check them in the main loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll file a follow up issue for that refactoring improvement.
LOG.warn("The command topic offset was reset. CommandRunner thread exiting."); | ||
state = new Status( | ||
CommandRunnerStatus.DEGRADED, | ||
CommandRunnerDegradedReason.COMMAND_TOPIC_DELETED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does offset reset always mean command topic deleted? it could mean the retention was configured incorrectly perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every time we start up the server, we'd check/override the command topic properties such that the retention.ms=-1
although I guess it is possible for the user themselves to override the retention config on the command topic which could trigger this. I can enhance the error message so that it mentions the degraded state is a result of the command topic being deleted or the configurations being modified for it
The REASON can be updated to COMMAND_TOPIC_DELETED_OR_MODIFIED
ksqldb-rest-app/src/test/java/io/confluent/ksql/rest/server/computation/CommandRunnerTest.java
Show resolved
Hide resolved
INCOMPATIBLE_COMMAND(Errors:: commandRunnerDegradedIncompatibleCommandsErrorMessage); | ||
CORRUPTED(Errors::commandRunnerDegradedBackupCorruptedErrorMessage), | ||
INCOMPATIBLE_COMMAND(Errors::commandRunnerDegradedIncompatibleCommandsErrorMessage), | ||
COMMAND_TOPIC_DELETED_OR_MODIFIED( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different than corrupted? We're leaking an implementation detail of how we detect corruption here. The underlying condition is the same - the command topic isn't in the state that the system expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've combined the two now.
1445238
to
0368e77
Compare
Description
Stacked on top of #6164
If the command topic is deleted currently, the command topic consumer in the CommandRunner thread continues to poll the non-existent topic, spamming logs with
PR changes it so that if the command topic doesn't exist, the server enters DEGRADED state and adds warnings to all api requests sent to the
/ksql
that the command topic was deleted.Also added a new CommandRunnerDegradedReason category to expose this reason in the server metrics.
Testing done
Started up the server, deleted the command topic
Used jconsole to verify the command runner status is DEGRADED and the reason is COMMAND_TOPIC_DELETED
Reviewer checklist