Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schema-reader: Shutdown service if corrupt entries in _schemas topic #936

Merged
merged 2 commits into from
Sep 5, 2024

Conversation

nosahama
Copy link
Contributor

@nosahama nosahama commented Aug 21, 2024

About this change - What it does

These breaking changes are guarded by environmental variables

These can be toggled by the various users, the default values are shown below:

KARAPACE_KAFKA_SCHEMA_READER_STRICT_MODE: false
KARAPACE_KAFKA_RETRIABLE_ERRORS_SILENCED: true

The logic below only applies when KARAPACE_KAFKA_SCHEMA_READER_STRICT_MODE is set to true, everything else remains the same, thus no test changes.

Previously, when we encounter errors within the _schemas topic, we would continue the message loading and skip the problematic schema. This is not ideal as it might leave the application with corrupt schema data and the side-effects could be grave.

What we do now is to kill the service, log the errors and allow a graceful shutdown. We will follow this work by adding metrics for such cases.

This will also stop the service post backup-v1 restore if there are any corrupt schemas present in the backup log file

We can see below that the shutdown is graceful:
Screenshot 2024-08-21 at 15 15 24

Adding restart: always to docker compose shows the behaviour, the service never proceeds past that stage, i think based on the restart behaviour for systemd, we might need to rely on the alerts and then intervene otherwise it'd leave the service in a crash loop, we need to verify if there are SLOs or metrics setup somewhere to track at least service uptime, which will be affected by this.

@nosahama nosahama requested review from a team as code owners August 21, 2024 13:14
@nosahama nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 4 times, most recently from d4cfcbb to 5040bf1 Compare August 22, 2024 14:02
Copy link

github-actions bot commented Aug 22, 2024

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  karapace
  config.py
  errors.py
  in_memory_database.py
  karapace_all.py
  messaging.py
  rapu.py
  schema_reader.py 193, 203-207, 370, 519-521, 535-537, 551-556, 566-567, 603
  schema_registry.py
  schema_registry_apis.py
  statsd.py
  utils.py
  karapace/compatibility
  __init__.py
Project Total  

This report was generated by python-coverage-comment-action

@nosahama nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 3 times, most recently from 92d127e to 9086eeb Compare August 27, 2024 08:20
karapace/schema_reader.py Outdated Show resolved Hide resolved
karapace/schema_reader.py Outdated Show resolved Hide resolved
Previously, when we encounter errors within the `_schemas` topic, we would continue the
message loading and skip the problematic schema.
This is not ideal as it might leave the application with corrupt schema data and the side-effects could be grave.
What we do now is to kill the service, log the errors and allow a graceful shutdown.
We will follow this work by adding metrics for such cases.
@nosahama nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch from 9086eeb to 1cdd0f1 Compare August 27, 2024 13:20
@nosahama nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch 5 times, most recently from c6588b7 to 751c662 Compare September 3, 2024 16:22
We'd like to allow the shutdown logic on corrutp schema be guarded by a feature
flag so we do not surprise customers, this is disabled by default.
@nosahama nosahama force-pushed the nosahama/abort-service-on-corrupt-schema branch from 751c662 to 2d9f1e8 Compare September 4, 2024 12:45
Copy link
Contributor

@jjaakola-aiven jjaakola-aiven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jjaakola-aiven jjaakola-aiven merged commit 800f3cb into main Sep 5, 2024
10 checks passed
@jjaakola-aiven jjaakola-aiven deleted the nosahama/abort-service-on-corrupt-schema branch September 5, 2024 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants