
Add sharding and replication guide #34

Merged · 9 commits · Jul 27, 2021

Conversation

pavolloffay (Member)

Signed-off-by: Pavol Loffay [email protected]

Resolves #28

The "local" tables have to be created on the nodes before the distributed table.

```sql
CREATE TABLE jaeger_spans_global AS jaeger_spans ENGINE = Distributed(sharded, default, jaeger_spans, rand());
```
pavolloffay (Member Author)

It would be good to use traceID as the sharding key, but it is a string and a string cannot be used as a sharding key directly.

@chhetripradeep (Contributor), Jul 26, 2021

In that case, you can use murmurHash3_64(traceID) or cityHash64(traceID) as the sharding key.
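
For illustration, a distributed table sharded by a hash of the trace ID could look like the sketch below (table, database, and cluster names are taken from the snippet above; cityHash64 is just one of the options mentioned):

```sql
-- Sketch only: route spans of the same trace to the same shard by hashing traceID.
-- murmurHash3_64(traceID) would be used the same way.
CREATE TABLE jaeger_spans_global AS jaeger_spans
    ENGINE = Distributed(sharded, default, jaeger_spans, cityHash64(traceID));
```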

pavolloffay (Member Author)

Based on your experience with ClickHouse, is it a good idea to shard by traceID?

@chhetripradeep (Contributor), Jul 27, 2021

We use rand() as the sharding key in production, but cityHash64() is pretty fast too. We can do some quick testing to see how each performs for our use case.
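
As a rough way to compare the candidates, the sharding expressions can be timed over synthetic keys; this is only a quick sketch for clickhouse-client, not a substitute for testing real insert throughput:

```sql
-- Rough micro-benchmark sketch: compare the per-query elapsed time of each
-- sharding expression over synthetic string keys.
SELECT count() FROM numbers(10000000) WHERE rand() % 16 = 0;
SELECT count() FROM numbers(10000000) WHERE cityHash64(toString(number)) % 16 = 0;
SELECT count() FROM numbers(10000000) WHERE murmurHash3_64(toString(number)) % 16 = 0;
```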

@pavolloffay (Member Author)

Some questions: what is the ideal number of shards and replicas? For replicas, I assume it is at least 3?

@chhetripradeep (Contributor)

We run with 3 replicas, and as we need to expand the cluster we add more shards. One thing to note is that ClickHouse doesn't have any built-in data balancing feature, i.e. once data is written to a node it stays there for the lifetime of that data unless the operator moves it manually, so it's good to do capacity planning at the beginning of cluster provisioning.
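
To help with that capacity planning, one can check how much data each node currently holds. A sketch, assuming the cluster name sharded and the default database used elsewhere in this PR:

```sql
-- Sketch: per-node row count and on-disk size of the local spans table across the cluster.
SELECT
    hostName() AS host,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM clusterAllReplicas('sharded', system.parts)
WHERE database = 'default' AND table = 'jaeger_spans' AND active
GROUP BY host;
```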

```sql
CREATE TABLE jaeger_spans_global ON CLUSTER sharded AS jaeger_spans ENGINE = Distributed(sharded, default, jaeger_spans, rand());
CREATE TABLE jaeger_index_global ON CLUSTER sharded AS jaeger_index ENGINE = Distributed(sharded, default, jaeger_index, rand());
CREATE TABLE jaeger_operations_global ON CLUSTER sharded AS jaeger_operations ENGINE = Distributed(sharded, default, jaeger_operations, rand());
```
Member

  1. It's more universal to use '{cluster}' instead of a particular cluster name;
  2. Add "IF NOT EXISTS" to the creation of the global tables;
  3. Add ';' after the creation of jaeger_operations.

```sql
CREATE TABLE IF NOT EXISTS jaeger_spans ON CLUSTER '{cluster}' (
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    traceID String CODEC(ZSTD(1)),
    model String CODEC(ZSTD(3))
) ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_spans', '{replica}')
PARTITION BY toDate(timestamp)
ORDER BY traceID
SETTINGS index_granularity=1024;

CREATE TABLE IF NOT EXISTS jaeger_index ON CLUSTER '{cluster}' (
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    traceID String CODEC(ZSTD(1)),
    service LowCardinality(String) CODEC(ZSTD(1)),
    operation LowCardinality(String) CODEC(ZSTD(1)),
    durationUs UInt64 CODEC(ZSTD(1)),
    tags Array(String) CODEC(ZSTD(1)),
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 64,
    INDEX idx_duration durationUs TYPE minmax GRANULARITY 1
) ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_index', '{replica}')
PARTITION BY toDate(timestamp)
ORDER BY (service, -toUnixTimestamp(timestamp))
SETTINGS index_granularity=1024;

CREATE MATERIALIZED VIEW IF NOT EXISTS jaeger_operations ON CLUSTER '{cluster}'
ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_operations', '{replica}')
PARTITION BY toYYYYMM(date) ORDER BY (date, service, operation)
SETTINGS index_granularity=32
POPULATE
AS SELECT
    toDate(timestamp) AS date,
    service,
    operation,
    count() AS count
FROM jaeger_index
GROUP BY date, service, operation;

CREATE TABLE IF NOT EXISTS jaeger_spans_global ON CLUSTER '{cluster}' AS jaeger_spans
    ENGINE = Distributed('{cluster}', default, jaeger_spans, rand());
CREATE TABLE IF NOT EXISTS jaeger_index_global ON CLUSTER '{cluster}' AS jaeger_index
    ENGINE = Distributed('{cluster}', default, jaeger_index, rand());
CREATE TABLE IF NOT EXISTS jaeger_operations_global ON CLUSTER '{cluster}' AS jaeger_operations
    ENGINE = Distributed('{cluster}', default, jaeger_operations, rand());
```

pavolloffay (Member Author)

> It's more universal to use '{cluster}' instead of a particular cluster name;

I like it. I will test whether the substitution works.
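
One quick way to test it is to inspect the macros and cluster definitions that each node actually exposes; a sketch:

```sql
-- Sketch: confirm that the macros backing '{cluster}', '{shard}' and '{replica}'
-- are defined on this node, and list the clusters the server knows about.
SELECT * FROM system.macros;
SELECT cluster, shard_num, replica_num, host_name FROM system.clusters;
```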

@pavolloffay (Member Author)

@EinKrebs regarding the naming of tables: most of the tutorials that I have seen use table_name_local for local tables and table_name for the distributed tables. If there are no objections, I will rename to follow this scheme.

@EinKrebs (Member)

> @EinKrebs regarding the naming of tables: most of the tutorials that I have seen use table_name_local for local tables and table_name for the distributed tables. If there are no objections, I will rename to follow this scheme.

No objections, I think it's good.
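
For reference, the rename itself could be done in place across the cluster; a sketch under the proposed naming scheme (the _local suffix and the '{cluster}' macro are carried over from this thread, and the distributed tables would afterwards need to be recreated to point at the new names):

```sql
-- Sketch: move the existing local tables to the *_local naming scheme on every node.
RENAME TABLE jaeger_spans TO jaeger_spans_local ON CLUSTER '{cluster}';
RENAME TABLE jaeger_index TO jaeger_index_local ON CLUSTER '{cluster}';
```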

pavolloffay merged commit d67bbb6 into main on Jul 27, 2021
pavolloffay deleted the sharding-replication branch on July 27, 2021 10:29
```sql
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    traceID String CODEC(ZSTD(1)),
    model String CODEC(ZSTD(3))
) ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_spans', '{replica}')
```


Perhaps it might make sense to omit the ReplicatedMergeTree parameters altogether, as per the email thread discussion: the Atomic DB engine will choose a path automatically, so there won't be any conflicts when cluster nodes are re-created.
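
A sketch of what the spans table could look like with the parameters omitted (this assumes an Atomic database and the server defaults for default_replica_path and default_replica_name, so the ZooKeeper path is derived from the table UUID):

```sql
-- Sketch: rely on the default replication path instead of passing explicit
-- '/clickhouse/tables/...' arguments to ReplicatedMergeTree.
CREATE TABLE IF NOT EXISTS jaeger_spans ON CLUSTER '{cluster}' (
    timestamp DateTime CODEC(Delta, ZSTD(1)),
    traceID String CODEC(ZSTD(1)),
    model String CODEC(ZSTD(3))
) ENGINE = ReplicatedMergeTree
PARTITION BY toDate(timestamp)
ORDER BY traceID
SETTINGS index_granularity=1024;
```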


> This is a guide on how to set up sharding and replication for Jaeger data. This guide uses [clickhouse-operator](https://github.com/Altinity/clickhouse-operator) to deploy the storage.


I would suggest putting in a recommendation to use the latest LTS ClickHouse version (currently 21.3).

pavolloffay (Member Author)

New is always better :) Are there any particular features that we might use?

```sql
) ENGINE ReplicatedMergeTree('/clickhouse/tables/{shard}/jaeger_index', '{replica}')
PARTITION BY toDate(timestamp)
ORDER BY (service, -toUnixTimestamp(timestamp))
SETTINGS index_granularity=1024;
```


It might be useful for some to put a cap on the table size with TTL, like this:

```sql
...
ORDER BY (service, -toUnixTimestamp(timestamp))
TTL toDate(timestamp) + INTERVAL 2 MONTH DELETE
SETTINGS ttl_only_drop_parts=1 ...;
```

ttl_only_drop_parts would prevent scheduling any merges to perform a TTL cleanup and would cause the whole part to simply be dropped once the last record in it expires.
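
If the tables already exist, the same cap could be added afterwards; a sketch, reusing the 2-month retention from the example above:

```sql
-- Sketch: add the retention cap to an existing replicated table across the cluster.
ALTER TABLE jaeger_index ON CLUSTER '{cluster}'
    MODIFY TTL toDate(timestamp) + INTERVAL 2 MONTH DELETE;
ALTER TABLE jaeger_index ON CLUSTER '{cluster}'
    MODIFY SETTING ttl_only_drop_parts = 1;
```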

