
Supporting exactly once #7522

Closed · yingfeng opened this issue Oct 29, 2019 · 7 comments

@yingfeng

Currently, data updating is done through the ReplacingMergeTree engine. However, the update happens asynchronously in a background merge thread. There are many cases where the business wants read-after-write semantics (or nearly so), and today that can only be achieved after OPTIMIZE ... FINAL is run, which may block the database for a long time. On the other hand, when inserting data into ClickHouse, if the worker crashes and is restarted, duplicates can appear unless OPTIMIZE ... FINAL is called.
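A minimal SQL sketch of this situation (the table and column names are hypothetical):

```sql
-- ReplacingMergeTree collapses duplicate rows by sorting key, but only
-- during background merges, so reads may see stale duplicates for a while.
CREATE TABLE user_profiles
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO user_profiles VALUES (1, 'old@example.com', now() - 60);
INSERT INTO user_profiles VALUES (1, 'new@example.com', now());

-- Right after the second insert, a plain SELECT may still return both
-- versions of the row, because the merge has not run yet.
SELECT * FROM user_profiles WHERE user_id = 1;

-- Forcing the merge gives read-after-write behaviour, but on a large
-- table this can block for a long time.
OPTIMIZE TABLE user_profiles FINAL;
```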

OLAP solutions such as Apache Doris have an important exactly-once feature that works together with Kafka, the so-called Doris Stream Load:

Each Doris Stream Load HTTP request can carry a Label header, and Apache Doris guarantees that data under the same Label is loaded only once within 7 days (a configurable duration); duplicate insertions are reported as errors. As a result, if the Label of a load request (dorisDb_dorisTable_sequence_id) is strictly aligned with the Kafka offsets, duplicate insertion from Kafka can be avoided. A sketch of such a request is shown below.
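A hedged sketch following the public Doris Stream Load HTTP API; the host, port, credentials, file, and the exact label scheme below are illustrative:

```bash
# Illustrative Doris Stream Load request. The Label encodes a Kafka
# offset range; a retry with the same Label within the dedup window is
# rejected as a duplicate, so the batch is loaded at most once.
curl --location-trusted -u user:password \
     -H "label:dorisDb_dorisTable_topic0_offsets_100_200" \
     -T batch.csv \
     -XPUT http://fe_host:8030/api/dorisDb/dorisTable/_stream_load
```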

If this feature were implemented in ClickHouse too, the data inconsistency issue could be greatly relieved.

@akuzm (Contributor) commented Oct 29, 2019

A related issue -- using the user-supplied query_id as a deduplication key for inserts: #7461

@yingfeng (Author)

It has some differences: Apache Doris has a component named GlobalTransactionMgr that guarantees atomicity for each Stream Load.

[Image: doris-stream-load architecture diagram]

Storing insert_id in ZooKeeper might not be a good solution, because the load on ZK is already high.

@den-crane (Contributor) commented Oct 29, 2019

https://clickhouse.yandex/docs/en/operations/table_engines/replication/

Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.

Blocks' hash sums are stored in ZooKeeper.
replicated_deduplication_window = 100
replicated_deduplication_window_seconds = 604800  <-- a week
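A minimal sketch of how this plays out on a Replicated table; the ZooKeeper path, replica macro, and table are placeholders:

```sql
CREATE TABLE events
(
    id      UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id
SETTINGS replicated_deduplication_window = 100,
         replicated_deduplication_window_seconds = 604800;  -- one week

-- The same block (same rows, same order) inserted twice: the second
-- INSERT is silently dropped because its hash sum is already in ZooKeeper.
INSERT INTO events VALUES (1, 'a'), (2, 'b');
INSERT INTO events VALUES (1, 'a'), (2, 'b');

SELECT count() FROM events;  -- returns 2, not 4
```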

@nvartolomei (Contributor)

One important thing to note: deduplication happens at the shard level.
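A sketch of this caveat, assuming a hypothetical two-shard cluster named my_cluster and the events table from above living in the default database:

```sql
-- Deduplication hashes live in each shard's own ZooKeeper path, so they
-- only protect against re-inserting the same block into the same shard.
CREATE TABLE events_dist AS events
ENGINE = Distributed(my_cluster, default, events, rand());

-- With a rand() sharding key, a retried INSERT of the same rows may be
-- routed to a different shard, where the block hash is unknown, and the
-- rows can end up stored twice.
INSERT INTO events_dist VALUES (3, 'c');
```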

@yingfeng (Author) commented Oct 30, 2019

Although a replicated engine can guarantee that a block is written exactly once, that is still different from true exactly-once semantics, under which, after suitable configuration, the client could write data into tables without worrying about duplication at all. Doris guarantees this within a configurable time window (such as 7 days). As a result, a more advanced encapsulation over ClickHouse is required.
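For reference, later ClickHouse versions added a client-supplied deduplication key, the insert_deduplication_token setting, which plays roughly the role of a Doris Label. A minimal sketch, where the token scheme (encoding a Kafka offset range) is an assumption:

```sql
-- The client names the deduplication key itself instead of relying on
-- the block's content hash; tokens here encode a Kafka offset range.
INSERT INTO events
SETTINGS insert_deduplication_token = 'topic0_p0_offsets_100_200'
VALUES (4, 'd');

-- A retry with the same token is dropped, even if the batch content
-- differs slightly, giving label-style, offset-aligned deduplication.
INSERT INTO events
SETTINGS insert_deduplication_token = 'topic0_p0_offsets_100_200'
VALUES (4, 'd');
```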

@yingfeng (Author)

A new proposal to redesign the Kafka engine could help resolve this issue too.

@alexey-milovidov (Member)

Done
