The Xor8 filter is an immutable data structure that generates a bitmap index
for a given set of unique keys. The generated bitmap index has a footprint
of ~9 bits per key. The sizing requirement for building the
index can be understood with the following example.

Keys are u64 values and we cannot incrementally add new keys to an existing
bitmap index. So let us say we need to create a bitmap index for
1 million keys:
keys as Vec<u64>                        = 8 MB
stack of keys.len() as Vec<KeyIndex>    = 12 MB
q0 of block_length as Vec<KeyIndex>     = 4.8 MB
q1 of block_length as Vec<KeyIndex>     = 4.8 MB
q2 of block_length as Vec<KeyIndex>     = 4.8 MB
sets0 of block_length as Vec<XorSet>    = 4.8 MB
sets1 of block_length as Vec<XorSet>    = 4.8 MB
sets2 of block_length as Vec<XorSet>    = 4.8 MB
finger_print                            = 1.23 MB

where block_length ~= (keys.len() * 1.23) / 3
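As a rough cross-check, here is a small Rust sketch that reproduces this arithmetic. The 12-byte entry size used for KeyIndex and XorSet is an assumption inferred from the figures above (12 MB for 1 million stack entries), not taken from the implementation.

```rust
// Back-of-the-envelope sizing for 1 million keys. The 12-byte sizes for
// KeyIndex and XorSet entries are assumptions inferred from the figures
// above, not read from the crate's source.
fn main() {
    let n_keys: f64 = 1_000_000.0;
    let key_size = 8.0;     // u64
    let entry_size = 12.0;  // assumed size of a KeyIndex / XorSet entry
    let block_length = (n_keys * 1.23) / 3.0; // ~410_000 entries per block

    let keys = n_keys * key_size;                           // ~8 MB
    let stack = n_keys * entry_size;                        // ~12 MB
    let queues_and_sets = 6.0 * block_length * entry_size;  // q0..q2 + sets0..sets2, ~29.5 MB
    let finger_print = n_keys * 1.23;                       // ~1.23 MB (~9.8 bits/key)

    let total = keys + stack + queues_and_sets + finger_print;
    println!("total ~= {:.1} MB", total / 1e6); // ~50 MB
}
```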
A total memory footprint of ~50 MB is required for indexing 1 million
keys.
To index 1 billion keys we may need a footprint of ~50 GB of memory.
> To index 1 billion keys we may need a footprint of 50GB of memory.
Having a billion keys (64-bit integers) in memory would require 8GB. A billion keys is quite a lot.
An obvious solution is sharding... divide your data set into subsets, and process each set separately. This would give you several small filters... which may be more manageable.
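For illustration, here is a minimal Rust sketch of the sharding idea, assuming we simply partition keys by their top bits and build one independent filter per shard; the shard count and the partitioning scheme are arbitrary choices, not part of the library.

```rust
// Minimal sharding sketch: split the key set into SHARDS subsets so that each
// subset can be indexed by its own, much smaller Xor8 filter. The shard count
// and the partition-by-top-bits scheme are arbitrary illustrative choices.
const SHARDS: usize = 64;

fn shard_of(key: u64) -> usize {
    (key >> 58) as usize // top 6 bits of the key -> 0..63
}

fn partition(keys: &[u64]) -> Vec<Vec<u64>> {
    let mut shards = vec![Vec::new(); SHARDS];
    for &k in keys {
        shards[shard_of(k)].push(k);
    }
    // Each shards[i] can now be indexed independently; a lookup for `k`
    // only has to consult the filter built from shards[shard_of(k)].
    shards
}
```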
But what if you still want to construct a gigantic, out-of-memory filter?
Then I think we want to use a sorting-based approach (because sorting can be done out-of-memory with relative ease).
How might this work? I think it is best expressed in code, but I think I can describe the idea. For each key, you have h1(k), h2(k), h3(k). Sort your keys by h1(k). Then go through the sorted list and find all keys whose h1(k) value occurs only once. Remove these keys from the list and add them to a stack. Then sort what remains with respect to h2(k). And so forth.
If you implement it with some care, it could use no more memory than the size of the input... I guess...
You can use radix sort, so the sort routine can run in linear time (though this might be at the expense of memory).
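A rough, in-memory Rust sketch of this sort-and-peel idea follows. The hash functions are placeholders, and the in-memory `sort_by_key` calls (which could also be radix sorts) stand in for the external sorts that an out-of-memory build would use.

```rust
// Sketch of sort-based "peeling": repeatedly sort the remaining keys by one
// of the three hash values and peel off keys whose hash value is unique in
// that round. The hash below is a placeholder, not the filter's real hash.
fn h(key: u64, which: usize) -> u64 {
    let seeds = [0x9E37_79B9_7F4A_7C15u64, 0xC2B2_AE3D_27D4_EB4F, 0x1656_67B1_9E37_79F9];
    (key ^ seeds[which]).wrapping_mul(0xFF51_AFD7_ED55_8CCD)
}

fn peel(mut keys: Vec<u64>) -> Vec<(u64, usize)> {
    let mut stack = Vec::new(); // (key, index of the hash that peeled it)
    let mut which = 0;
    let mut idle_rounds = 0;
    while !keys.is_empty() && idle_rounds < 3 {
        // In an out-of-memory build this would be an external (disk-based) sort.
        keys.sort_by_key(|&k| h(k, which));

        let peeled_before = stack.len();
        let mut remaining = Vec::new();
        let mut i = 0;
        while i < keys.len() {
            // find the run of keys that share the same h(k, which) value
            let mut j = i + 1;
            while j < keys.len() && h(keys[j], which) == h(keys[i], which) {
                j += 1;
            }
            if j - i == 1 {
                stack.push((keys[i], which)); // value occurs once: peel this key
            } else {
                remaining.extend_from_slice(&keys[i..j]); // still colliding, keep it
            }
            i = j;
        }
        idle_rounds = if stack.len() == peeled_before { idle_rounds + 1 } else { 0 };
        keys = remaining;
        which = (which + 1) % 3;
    }
    // If keys is non-empty here, peeling got stuck and a real construction
    // would retry with different hash seeds.
    stack
}
```

In a real build the peeled stack would then be replayed in reverse order to assign fingerprint slots; that step is omitted here.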