HyperLogLog support #2438

mapleFU · 2024-07-20T05:50:49Z

mapleFU
Jul 20, 2024
Collaborator

Basics

Redis HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. The idea comes from the original HyperLogLog paper [1] and a later paper from Google [2]. It is particularly useful for applications that need to estimate the number of unique elements in massive datasets, such as network traffic analysis, data warehousing, and large-scale databases.

The principle behind HyperLogLog is that the number of leading zeros in the binary representation of a hash value can be used to infer the size of the set. The main parts are:

Hashing: Each element in the dataset is passed through a hash function, which produces a fixed-size binary string.
Register Array: The algorithm maintains an array of registers, each corresponding to a subset of the hash space. The size of the array is 2^p, where p is a precision parameter that determines the trade-off between accuracy and memory usage. The HyperLogLog algorithm uses a subset of the hash value to determine the index in the register array. Specifically, it takes the first p bits of the hash value.

For the remaining bits of the hash value (after the first p bits), the algorithm counts the number of leading zeros. Each register keeps the maximum count observed across all elements that hash to its index: if a newly observed count is greater than the register's current value, the register is updated to the new count.
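The register update described above can be sketched roughly as follows. This is only an illustration under assumptions: `hash64` is a stand-in built on BLAKE2b (not the hash Redis uses), the names `index_and_rank` and `add` are made up for this example, and the stored value is the leading-zero count plus one, as in the paper's rho(w).

```python
import hashlib

P = 14                # precision: 2^14 registers (assumed here for illustration)
M = 1 << P            # number of registers
HASH_BITS = 64


def hash64(data: bytes) -> int:
    # Stand-in 64-bit hash: BLAKE2b truncated to 8 bytes. Not the Redis hash.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def index_and_rank(h: int, p: int = P) -> tuple[int, int]:
    rest_bits = HASH_BITS - p
    index = h >> rest_bits                      # first p bits select the register
    rest = h & ((1 << rest_bits) - 1)           # remaining bits
    # Leading zeros of `rest` within a rest_bits-wide field, plus one.
    rank = rest_bits - rest.bit_length() + 1
    return index, rank


def add(registers: list[int], element: bytes) -> bool:
    # Returns True when the register was raised, False when nothing changed.
    index, rank = index_and_rank(hash64(element))
    if rank > registers[index]:
        registers[index] = rank
        return True
    return False
```

With p = 14 and a 64-bit hash the largest possible rank is 51, so it fits in a 6-bit register.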

Sparse Representation

The sparse representation is an optimization that improves the memory efficiency of HyperLogLog, especially for small cardinalities. Instead of maintaining a dense array of registers for the entire hash space, it stores only the non-zero registers, which significantly reduces memory usage when the actual number of unique elements is small compared to the number of registers.

It uses a list of (index, rho(w)) pairs, where index identifies the register (substream) selected by the hash value, and rho(w) is the count of leading zeros plus one for that substream's hash value.
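A minimal sketch of that idea, assuming a dict keyed by register index for updates and a sorted list of pairs as the stored form. The helper names `sparse_add`, `sparse_as_list` and `to_dense` are hypothetical, not taken from Redis or kvrocks.

```python
def sparse_add(pairs: dict[int, int], index: int, rank: int) -> None:
    # Keep only the maximum rank seen for each register index.
    if rank > pairs.get(index, 0):
        pairs[index] = rank


def sparse_as_list(pairs: dict[int, int]) -> list[tuple[int, int]]:
    # The (index, rho(w)) pairs, sorted by register index.
    return sorted(pairs.items())


def to_dense(pairs: dict[int, int], m: int) -> list[int]:
    # Expand to a full register array; absent registers are zero.
    registers = [0] * m
    for index, rank in pairs.items():
        registers[index] = rank
    return registers
```

Only registers that were actually touched take space, and `to_dense` shows that the conversion to the dense form loses nothing.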

Required Redis commands

Redis supports PFADD, PFCOUNT, and PFMERGE for HyperLogLog, so in general we need to support:

Inserting an element into a HyperLogLog
Estimating the cardinality of the set
Merging two HyperLogLogs
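To make the three operations concrete, here is a rough dense-only sketch. Assumptions: the class name `HLL` and its methods are invented for this example, the hash is a BLAKE2b stand-in, and `count` uses the raw estimator with the small-range (linear counting) correction from the original paper [1], which is not necessarily what Redis does today.

```python
import hashlib
import math


class HLL:
    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, element: bytes) -> None:
        # Stand-in 64-bit hash; not the hash Redis uses.
        h = int.from_bytes(hashlib.blake2b(element, digest_size=8).digest(), "big")
        rest_bits = 64 - self.p
        index = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1   # leading zeros plus one
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: linear counting on the empty registers.
            return round(m * math.log(m / zeros))
        return round(raw)

    def merge(self, other: "HLL") -> None:
        if other.p != self.p:
            raise ValueError("precision mismatch")
        # Register-wise maximum is equivalent to adding all elements of `other`.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
```

Merging is just a register-wise maximum, which is why two HyperLogLogs with the same precision and hash function can be combined without access to the original elements.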

Implementation details

Hash Function

In the HyperLogLog algorithm, the hash function is a critical component that influences the accuracy and efficiency of the cardinality estimation. Its output should be uniformly distributed.

Redis chooses a modified MurmurHash2 function, because the original MurmurHash2 notes [3] that "It will not produce the same results on little-endian and big-endian machines." The Redis modification makes the result endian-neutral.

We can:

Choose xxHash or MurmurHash3
Vendor a Redis-like modified MurmurHash2
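The endianness caveat quoted from [3] comes from reading the input as native machine words. A small illustration of why a fixed byte order matters, whichever hash we pick (the helper name `first_word_little` is made up for this example):

```python
import struct

data = b"kvrocks!"  # exactly 8 bytes

# The same 8 bytes interpreted as a 64-bit word under each byte order.
as_little = struct.unpack("<Q", data)[0]
as_big = struct.unpack(">Q", data)[0]


def first_word_little(buf: bytes) -> int:
    # Decoding with an explicit byte order gives the same word on any machine,
    # which is what an endian-neutral hash needs for its input blocks.
    return int.from_bytes(buf[:8], "little")
```

Here `as_little` and `as_big` differ, so a hash that reads words in native order would produce different values on different machines, and the persisted registers would not be portable.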

Storage format

In Redis, bitmap and HyperLogLog are both regarded as "string"; this includes the "dense" HyperLogLog in Redis.
But in kvrocks, we should treat them as different things:

String: uses MSB-first bit order for bitmap operations, and stores the value in the payload
Bitmap: uses LSB-first bit order for storage, and stores the value in subkeys

Generally, in HyperLogLog, we need:

A flag for the HyperLogLog storage type, such as dense or sparse.
For dense HyperLogLog, if we choose a string-like representation, PFCOUNT fetches all the data in one read (and it might be cached), but writes might be heavier. If a bitmap-like format is used, the write amplification will be a bit smaller, but for 12 KB of data maybe 13 slots would be used, which is a bit large.

For sparse, I think string-like is better. For the sparse representation, a "size" field for the number of slots should be stored.
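To make the dense trade-off more concrete, here is a sketch of packing 6-bit registers into a byte buffer and finding which subkey segment a register falls into. The LSB-first packing and the 1 KiB `SEGMENT_BYTES` are assumptions for illustration only, not a decided kvrocks layout.

```python
REGISTER_BITS = 6
REGISTER_COUNT = 1 << 14
DENSE_BYTES = (REGISTER_COUNT * REGISTER_BITS + 7) // 8   # 12288 bytes
SEGMENT_BYTES = 1024                                      # hypothetical subkey size


def _load_word(buf: bytearray, byte: int) -> int:
    # Two adjacent bytes as a 16-bit little-endian word (zero-padded at the end).
    high = buf[byte + 1] if byte + 1 < len(buf) else 0
    return buf[byte] | (high << 8)


def get_register(buf: bytearray, i: int) -> int:
    byte, shift = divmod(i * REGISTER_BITS, 8)
    return (_load_word(buf, byte) >> shift) & 0x3F


def set_register(buf: bytearray, i: int, value: int) -> None:
    if not 0 <= value < 64:
        raise ValueError("register value must fit in 6 bits")
    byte, shift = divmod(i * REGISTER_BITS, 8)
    word = _load_word(buf, byte)
    word = (word & ~(0x3F << shift)) | (value << shift)
    buf[byte] = word & 0xFF
    if byte + 1 < len(buf):
        buf[byte + 1] = (word >> 8) & 0xFF


def segments_touched(i: int) -> list[int]:
    # Segments containing the first and last bit of register i.
    first_bit = i * REGISTER_BITS
    last_bit = first_bit + REGISTER_BITS - 1
    return sorted({first_bit // 8 // SEGMENT_BYTES, last_bit // 8 // SEGMENT_BYTES})
```

One thing this makes visible: with tightly packed 6-bit registers and a segment size that is not a multiple of 3 bytes, some registers straddle two segments, so a bitmap-like layout would either touch two subkeys on some writes or need a segment size aligned to whole registers.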

Tuning

Redis uses fixed parameters: 6 bits per register and a 14-bit index (2^14 slots). Presto allows tuning with a user-defined number of slots (but uses only 4 bits per slot). Should we allow users to configure HLL in more detail, or just follow the Redis way?
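For a sense of what tuning would buy, the dense size and the expected relative error can be computed from the index width p and the register width. The 1.04/sqrt(m) standard error is the figure from the original paper [1]; the function names are just for this sketch.

```python
import math


def dense_bytes(p: int, bits_per_register: int) -> int:
    # Size of the packed register array, rounded up to whole bytes.
    return ((1 << p) * bits_per_register + 7) // 8


def standard_error(p: int) -> float:
    # Relative standard error of the HyperLogLog estimate with m = 2^p registers.
    return 1.04 / math.sqrt(1 << p)


for p in (10, 12, 14, 16):
    print(p, dense_bytes(p, 6), f"{standard_error(p):.4%}")
```

For the Redis setting (p = 14, 6 bits) this gives 12288 bytes and roughly 0.81% standard error; each extra index bit doubles the memory and improves the error by a factor of sqrt(2).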

[1] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
[2] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
[3] https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp#L13

PragmaTwice · 2024-07-20T06:39:08Z

PragmaTwice
Jul 20, 2024
Collaborator

I think we can first focus on one of the storage formats (e.g. bitmap-like) and make it extensible.

mapleFU · 2024-07-20T10:17:32Z

mapleFU
Jul 20, 2024
Collaborator Author

Some other differences:

Redis uses a "cached" cardinality when computing the count. Maybe we can leave this as an option for when we run into scaling problems with the counting path.

The Redis sparse representation differs from the one used by Presto and the Google paper.

The Google paper and Presto use an array of <Register Index, Register Value> entries. Assume each entry is 4B; the sparse form is kept until its memory reaches the same size as the dense implementation (since a register in the dense representation is much smaller).
In Redis, the sparse mode uses a run-length encoding (RLE) with three opcodes: XZERO (a long run of registers set to 0), ZERO (the next k registers are 0, for short runs), and VAL (the next k registers all hold the same value). It also defines a value, server.hll_sparse_max_bytes, to limit the memory used for a sparse HLL. Compared with the 4B representation, this way might save more space, but the insertion path is a bit tricky.
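A rough sketch of such an RLE encoding, using symbolic opcodes instead of packed bytes. The limits below (ZERO up to 64 registers in 1 byte, XZERO up to 16384 registers in 2 bytes, VAL with a value of 1..32 repeated up to 4 times in 1 byte) are my reading of the Redis sparse format and should be checked against the Redis source before relying on them; the function names are hypothetical.

```python
def rle_encode(registers: list[int]) -> list[tuple]:
    ops = []
    i, n = 0, len(registers)
    while i < n:
        value = registers[i]
        run = 1
        while i + run < n and registers[i + run] == value:
            run += 1
        i += run
        if value == 0:
            while run > 0:
                if run > 64:
                    chunk = min(run, 16384)
                    ops.append(("XZERO", chunk))      # long zero run, 2 bytes
                else:
                    chunk = run
                    ops.append(("ZERO", chunk))       # short zero run, 1 byte
                run -= chunk
        else:
            if value > 32:
                raise ValueError("value too large for sparse form; use dense")
            while run > 0:
                chunk = min(run, 4)
                ops.append(("VAL", value, chunk))     # repeated value, 1 byte
                run -= chunk
    return ops


def rle_decode(ops: list[tuple]) -> list[int]:
    registers = []
    for op in ops:
        if op[0] == "VAL":
            registers.extend([op[1]] * op[2])
        else:
            registers.extend([0] * op[1])
    return registers


def encoded_size(ops: list[tuple]) -> int:
    return sum(2 if op[0] == "XZERO" else 1 for op in ops)
```

The tricky insertion path is visible even here: raising one register in the middle of a zero run means splitting that run into up to three opcodes, so an update can grow the encoding and may have to shift everything after it.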

mapleFU · 2024-07-20T11:02:00Z

mapleFU
Jul 20, 2024
Collaborator Author

I've updated the HLL with only Redis-style dense support, cc @git-hulk @PragmaTwice @tutububug

apache/kvrocks-website#207
