
Supporting exactly once #7522

Closed · yingfeng opened this issue Oct 29, 2019 · 7 comments

@yingfeng

Currently, data updating is done through the ReplacingMergeTree engine. However, the update happens asynchronously in a background merge thread. There are many cases where the business wants read-after-write semantics (or nearly so), and today that can only be achieved after OPTIMIZE ... FINAL is run, which may block the database for a long time. On the other hand, when inserting data into ClickHouse, if the worker crashes and is restarted, duplicates can appear unless OPTIMIZE ... FINAL is called.
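A minimal SQL sketch of this situation (the table and column names are hypothetical):

```sql
-- ReplacingMergeTree collapses duplicate rows by sorting key, but only
-- during background merges, so reads may see stale duplicates for a while.
CREATE TABLE user_profiles
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

INSERT INTO user_profiles VALUES (1, 'old@example.com', now() - 60);
INSERT INTO user_profiles VALUES (1, 'new@example.com', now());

-- Right after the second insert, a plain SELECT may still return both
-- versions of the row, because the merge has not run yet.
SELECT * FROM user_profiles WHERE user_id = 1;

-- Forcing the merge gives read-after-write behaviour, but on a large
-- table this can block for a long time.
OPTIMIZE TABLE user_profiles FINAL;
```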

OLAP solutions such as Apache Doris have an important exactly-once feature that works together with Kafka, the so-called Doris Stream Load:

Each Doris Stream Load HTTP request can carry a Label header, and Apache Doris guarantees that data under the same Label is loaded only once within 7 days (a configurable duration); duplicate insertions are reported as errors. As a result, if the Label of a load request (dorisDb_dorisTable_sequence_id) is strictly aligned with the Kafka offsets, duplicate insertion from Kafka can be avoided. A sketch of such a request is shown below.
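A hedged sketch following the public Doris Stream Load HTTP API; the host, port, credentials, file, and the exact label scheme below are illustrative:

```bash
# Illustrative Doris Stream Load request. The Label encodes a Kafka
# offset range; a retry with the same Label within the dedup window is
# rejected as a duplicate, so the batch is loaded at most once.
curl --location-trusted -u user:password \
     -H "label:dorisDb_dorisTable_topic0_offsets_100_200" \
     -T batch.csv \
     -XPUT http://fe_host:8030/api/dorisDb/dorisTable/_stream_load
```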

If this feature were implemented in ClickHouse too, the data inconsistency issue could be greatly relieved.

@akuzm (Contributor) commented Oct 29, 2019

A related issue -- using the user-supplied query_id as a deduplication key for inserts: #7461

@yingfeng (Author)

It has some differences: Apache Doris has a component named GlobalTransactionMgr that guarantees atomicity for each Stream Load.

[Image: doris-stream-load architecture diagram]

Storing insert_id in ZooKeeper might not be a good solution, because the load on ZK is already high.

@den-crane (Contributor) commented Oct 29, 2019

https://clickhouse.yandex/docs/en/operations/table_engines/replication/

Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application doesn't know if the data was written to the DB, so the INSERT query can simply be repeated. It doesn't matter which replica INSERTs were sent to with identical data. INSERTs are idempotent. Deduplication parameters are controlled by merge_tree server settings.

Blocks' hash sums are stored in ZooKeeper.
replicated_deduplication_window = 100
replicated_deduplication_window_seconds = 604800  <-- a week
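A minimal sketch of how this plays out on a Replicated table; the ZooKeeper path, replica macro, and table are placeholders:

```sql
CREATE TABLE events
(
    id      UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY id
SETTINGS replicated_deduplication_window = 100,
         replicated_deduplication_window_seconds = 604800;  -- one week

-- The same block (same rows, same order) inserted twice: the second
-- INSERT is silently dropped because its hash sum is already in ZooKeeper.
INSERT INTO events VALUES (1, 'a'), (2, 'b');
INSERT INTO events VALUES (1, 'a'), (2, 'b');

SELECT count() FROM events;  -- returns 2, not 4
```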

@nvartolomei (Contributor)

One important thing to note: deduplication happens at the shard level.
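A sketch of this caveat, assuming a hypothetical two-shard cluster named my_cluster and the events table from above living in the default database:

```sql
-- Deduplication hashes live in each shard's own ZooKeeper path, so they
-- only protect against re-inserting the same block into the same shard.
CREATE TABLE events_dist AS events
ENGINE = Distributed(my_cluster, default, events, rand());

-- With a rand() sharding key, a retried INSERT of the same rows may be
-- routed to a different shard, where the block hash is unknown, and the
-- rows can end up stored twice.
INSERT INTO events_dist VALUES (3, 'c');
```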

@yingfeng (Author) commented Oct 30, 2019

Although a replicated engine can guarantee that a block is written exactly once, that is still different from true exactly-once semantics, under which, after suitable configuration, the client could write data into tables without worrying about duplication at all. Doris guarantees this within a configurable time window (such as 7 days). As a result, a more advanced encapsulation over ClickHouse is required.
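For reference, later ClickHouse versions added a client-supplied deduplication key, the insert_deduplication_token setting, which plays roughly the role of a Doris Label. A minimal sketch, where the token scheme (encoding a Kafka offset range) is an assumption:

```sql
-- The client names the deduplication key itself instead of relying on
-- the block's content hash; tokens here encode a Kafka offset range.
INSERT INTO events
SETTINGS insert_deduplication_token = 'topic0_p0_offsets_100_200'
VALUES (4, 'd');

-- A retry with the same token is dropped, even if the batch content
-- differs slightly, giving label-style, offset-aligned deduplication.
INSERT INTO events
SETTINGS insert_deduplication_token = 'topic0_p0_offsets_100_200'
VALUES (4, 'd');
```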

@yingfeng (Author)

A new proposal to redesign the Kafka engine could help resolve this issue too.

@alexey-milovidov (Member)

Done
