feat(hyperloglog): add support of the Hyperloglog data structure #2142
I'll take a careful review this week
@tutububug Thank you for your contribution. Running |
Thank you for your contribution! You may need to add a command parser so the data structure can be used through actual commands, as in Redis.
Thank you for your contribution! Could you include your design in the PR description? For example, explain how to encode the metadata and HLL data (subkeys), similar to what is shown on https://kvrocks.apache.org/community/data-structure-on-rocksdb. |
Yes, I will push the commit later.
OK. |
@PragmaTwice I created a PR (apache/kvrocks-website#207) to describe the hyperloglog storage format.
Thank you! Regarding your design, I have some questions:
Concerning the code, although I haven't reviewed it thoroughly yet, there are some points worth mentioning:
|
@tutububug As @PragmaTwice mentioned in #2142 (comment), it's unnecessary to use a fixed number of 16384 subkeys, since that may heavily hurt read performance for PFMERGE. I guess a smaller number like 16 is enough, with each subkey holding about 1000 integers.
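To make the segmented layout suggested above concrete, here is a minimal sketch of how a register index could map to a subkey and an offset within it. The constant names, the `RegisterLocation` struct, and `LocateRegister` are hypothetical and not taken from the PR; the 16-segment split follows the comment above, and 6 bits per register is assumed from the Redis dense encoding.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout constants (hypothetical names, not from the PR).
constexpr uint32_t kRegisterCount = 16384;  // 2^14 registers per HLL key
constexpr uint32_t kSegmentCount = 16;      // number of RocksDB subkeys per HLL key
constexpr uint32_t kRegistersPerSegment = kRegisterCount / kSegmentCount;     // 1024
constexpr uint32_t kRegisterBits = 6;       // bits per register (Redis dense encoding)
constexpr uint32_t kSegmentBytes = kRegistersPerSegment * kRegisterBits / 8;  // 768

struct RegisterLocation {
  uint32_t segment;  // which subkey holds the register
  uint32_t offset;   // position of the register inside that subkey
};

// Map a global register index to its (segment, offset) pair.
inline RegisterLocation LocateRegister(uint32_t register_index) {
  assert(register_index < kRegisterCount);
  return {register_index / kRegistersPerSegment, register_index % kRegistersPerSegment};
}
```

Under these assumptions a full read for PFCOUNT or PFMERGE touches at most 16 subkeys of 768 bytes each, instead of up to 16384 tiny values.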
I suggest that we can store the number of registers in one rocksdb key in the metadata (for example, referred to as Currently, a fixed value for cc @tutububug |
There're two parts of things.
|
So I think the question is: should we also introduce two modes of HLL encoding (sparse and dense layouts) and an automatic switching policy between them?
I prefer to do this :-) But we can regard it as a further optimization |
src/types/redis_hyperloglog.cc
/* Store the value of the register at position 'regnum' into variable 'target'.
 * 'p' is an array of unsigned bytes. */
#define HLL_DENSE_GET_REGISTER(target, p, regnum) \
can you change it to a function?
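As a rough illustration of the request above, the macro could become an inline function along these lines. This is only a sketch: the function name, the explicit `len` parameter, and the constant names are assumptions, and the bit layout is assumed to match the Redis dense encoding (6-bit registers packed little-endian into bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kHllBits = 6;                           // bits per register
constexpr uint8_t kHllRegisterMax = (1U << kHllBits) - 1;  // 63

// Return the register at position `regnum` from the packed array `p` of `len` bytes.
// Unlike the macro, this never reads past the end of the buffer.
inline uint8_t HllDenseGetRegister(const uint8_t *p, size_t len, size_t regnum) {
  size_t byte = regnum * kHllBits / 8;  // first byte containing the register
  size_t fb = (regnum * kHllBits) & 7;  // bit offset inside that byte
  assert(byte < len);
  uint32_t b0 = p[byte];
  uint32_t b1 = (byte + 1 < len) ? p[byte + 1] : 0;  // a register may span two bytes
  return static_cast<uint8_t>(((b0 >> fb) | (b1 << (8 - fb))) & kHllRegisterMax);
}
```

Returning the value instead of assigning to a `target` argument keeps call sites type-checked and avoids the multiple-evaluation pitfalls of macro parameters.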
The rest looks OK to me.
Not really. A register (subkey) is stored only when its count is non-zero. This differs from the in-memory implementation, which uses a static array for the dense encoding. On disk, I think it is naturally a sparse encoding.
The number of consecutive zeros is computed from the last 50 bits of the hash value, so the maximum is roughly 50, and storing it as a string takes at most 2 bytes. That should not waste much space, and it also saves the overhead of integer encoding and decoding. For keys, an integer encoding may be more efficient, but the largest index (16383) is only 5 bytes as a string.
Currently, the ‘size’ variable has no practical purpose; the only requirement is that it be non-zero. Because of how kvrocks implements GetMetadata, non-string data structures with a size of 0 are treated as expired, while Redis allows PFADD with no elements and still stores the key. For compatibility, size is set to a constant to prevent expiration.
OK.
For an HLL user key, the maximum number of registers is 16384, and it cannot be larger. In fact, I think this should be considered controllable compared to data structures such as hash, set, list, etc., whose size is determined by user input.
Each RocksDB value would carry only a 1-2 byte payload, and the value size is stored as well, so this introduces an extremely large overhead. I therefore prefer an implementation like bitmap/bitfield. Issuing 2^14 Gets against RocksDB would also be heavy, and might skew some RocksDB statistics, making it tend to compact more often or cache unnecessary blocks.
src/storage/redis_metadata.h
@@ -49,6 +49,8 @@ enum RedisType : uint8_t {
  kRedisStream = 8,
  kRedisBloomFilter = 9,
  kRedisJson = 10,
  kRedisSearch = 11,
This needs to be deleted.
};

REDIS_REGISTER_COMMANDS(MakeCmdAttr<CommandPfAdd>("pfadd", -2, "write", 1, 1, 1),
                        MakeCmdAttr<CommandPfCount>("pfcount", 2, "read-only", 1, 1, 1), );
How about PFMERGE?
It will be added in the next patch.
  DCHECK_LT(index, kHyperLogLogRegisterCount);
  hash >>= kHyperLogLogRegisterCountPow; /* Remove bits used to address the register. */
  hash |= (static_cast<uint64_t>(1U) << kHyperLogLogHashBitCount);
  uint8_t ctz = __builtin_ctzll(hash) + 1;
This differs slightly from the valkey code; I've sent a PR to them.
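For readers following the snippet under review, here is a self-contained sketch of the same index/rank computation. The struct, the function name, and the shortened constant names are hypothetical; taking the register index from the low 14 bits of the hash is an assumption based on how Redis does it, since the PR excerpt does not show that line. `__builtin_ctzll` is a GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kRegisterCountPow = 14;                   // 2^14 registers
constexpr uint32_t kRegisterCount = 1U << kRegisterCountPow;
constexpr uint32_t kHashBitCount = 64 - kRegisterCountPow;   // 50 remaining hash bits

struct RegisterUpdate {
  uint32_t index;  // register addressed by the hash
  uint8_t rank;    // trailing-zero count of the remaining bits, plus one
};

inline RegisterUpdate HashToRegister(uint64_t hash) {
  // Assumption: the low 14 bits address the register, as in Redis.
  uint32_t index = static_cast<uint32_t>(hash & (kRegisterCount - 1));
  assert(index < kRegisterCount);
  hash >>= kRegisterCountPow;               // remove bits used to address the register
  hash |= (uint64_t{1} << kHashBitCount);   // sentinel bit so ctz is never called on 0
  uint8_t rank = static_cast<uint8_t>(__builtin_ctzll(hash) + 1);
  return {index, rank};
}
```

The sentinel bit matters because `__builtin_ctzll(0)` is undefined; setting bit 50 bounds the result without affecting hashes that already have a lower bit set.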
I think it's good to merge after fixing the lint issues.
Merged, thanks all!
Storage format description: https://github.com/apache/kvrocks-website/pull/207/files.
Only the dense encoding is supported for now. Merge was removed in d3b2978 / 22923f9 and will be back in upcoming patches.