Support KIP-320: Handle log truncation #818

Nevon · 2020-07-20T14:40:13Z

Is your feature request related to a problem? Please describe.

#778 introduces support for Fetch V9, which adds support for new errors to communicate to a consumer that they are trying to fetch from a broker which is not in sync with the leader. #778 only added support for the new API version, but not the functionality that it enables.

When unclean leader election is enabled, we may lose committed data. A consumer which is reading from the end of the log will typically see an out of range error, which will cause it to use its auto.offset.reset policy. To avoid losing data, users should use the "earliest" option, but that means consuming the log from the beginning.

It is also possible that prior to sending the next fetch, new data is written to the log so that the consumer's fetch offset becomes valid again. In this case, the consumer will just miss whatever data had been written between the truncation point and its fetch offset.

Neither behavior is ideal, but we tend to overlook it because the user has opted into weaker semantics by enabling unclean leader election. Unfortunately in some situations we have to enable unclean leader election in order to recover from serious faults on the brokers. Some users have also opted to keep unclean leader election enabled because they cannot sacrifice availability ever. We would like to offer better client semantics for these situations.

Describe the solution you'd like

The proposal in this KIP is to have the consumer behave more like a follower. The consumer will obtain the current leader epoch using the Metadata API. When fetching from a new leader, the consumer will first check for truncation using the OffsetForLeaderEpoch API. In order to enable this, we need to keep track of the last epoch that was consumed. If we do not have one (e.g. because the user has seeked to a particular offset or because the message format is older), then the consumer will skip this step. To support this tracking, we will extend the OffsetCommit API to include the leader epoch if one is available.

Leader changes are detected either through a metadata refresh or in response to a FENCED_LEADER_EPOCH error. It is also possible that the consumer sees an UNKNOWN_LEADER_EPOCH in a fetch response if its metadata has gotten ahead of the leader.

This change in behavior has implications for the consumer's offset reset policy, which defines what the consumer should do if its fetch offset becomes out of range. With this KIP, the only case in which this is possible other than an out of range seek is if the consumer fetches from an offset earlier than the log start offset. By opting into an offset reset policy, the user allows for automatic adjustments to the fetch position, so we take advantage of this to to reset the offset as precisely as possible when log truncation is detected. In some pathological cases (e.g. multiple consecutive unclean leader elections), we may not be able to find the exact offset, but we should be able to get close by finding the starting offset of the next largest epoch that the leader is aware of. We propose in this KIP to change the behavior for both the "earliest" and "latest" reset modes to do this automatically as long as the message format supports lookup by leader epoch. The consumer will log a message to indicate that the truncation was detected, but will reset the position automatically.

If a user is not using an auto reset option, we will raise a LogTruncationException from poll() when log truncation is detected. This gives users the ability to reset state if needed or revert changes to downstream systems. The exception will include the partitions that were truncated and the offset of divergence as found above. This gives applications the ability to execute any logic to revert changes if needed and rewind the fetch offset. Users must handle this exception and reset the position of the consumer if no auto reset policy is enabled.

For consumers, we propose some additional extensions:

When the consumer needs to reset offsets, it uses the ListOffsets API to query the leader. To avoid querying stale leaders, we will add epoch fencing. Additionally, we will modify this API to return the corresponding epoch for any offsets looked up.
We will also provide the leader epoch in the offset commit and fetch APIs. This allows consumer groups to detect truncation across rebalances or restarts. Note that in cases like that found in KIP-232, it is possible for the leader epoch included in the committed offset to be ahead of the metadata that is known to the consumer. Consumers are expected to wait until the metadata has at least reached the epoch of the committed offset before checking for truncation.
For users that store offsets in an external system, we will provide APIs which expose the leader epoch of each record and we will provide an alternative seek API so that users can initialize the offset and leader epoch.

Additional context

See:

Nevon added the help wanted label Jul 20, 2020

Nevon mentioned this issue Jul 20, 2020

Support Fetch v9 protocol (minimal) #778

Merged

Nevon mentioned this issue Jan 28, 2021

Group Instance ID support #884

Open

Nevon mentioned this issue Oct 17, 2022

kafkajs should implement KIP-320 #1464

Closed

distracteddev mentioned this issue Jul 5, 2023

Implement static membership feature on client side (KIP-345) #1594

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support KIP-320: Handle log truncation #818

Support KIP-320: Handle log truncation #818

Nevon commented Jul 20, 2020

Support KIP-320: Handle log truncation #818

Support KIP-320: Handle log truncation #818

Comments

Nevon commented Jul 20, 2020