Add bloom_filter_agg and might_contain SparkSql function #3342
jinchengchenghh
started this conversation in
Ideas
@jinchengchenghh FYI, there is an effort to document Spark functions: #3890. It would be nice to add these new functions to the documentation.
As apache/spark#35789 describes, these functions are expected to improve performance, so I will try to implement them in Velox.
Spark uses xxhash64(col), whose returned hash code is an int64_t, while the Velox BloomFilter only supports a uint64_t hash code as its input (see hashInput): https://github.com/facebookincubator/velox/blob/main/velox/common/base/BloomFilter.h#L65
hashInput false: hashcode = folly::hasher(xxhash64(col))
hashInput true: hashcode = xxhash64(col)
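To make the int64_t vs. uint64_t mismatch concrete, here is a minimal sketch of how a signed 64-bit hash could be passed to a filter that takes uint64_t without losing any bits. The helper names toUnsignedHash and toSignedHash are hypothetical and are not part of Velox or Spark; the sketch only assumes that the hash arrives as an int64_t.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a Velox API): reinterpret a signed 64-bit hash,
// such as the result of a Spark-style xxhash64, as an unsigned value.
// Signed-to-unsigned conversion is defined modulo 2^64, so every bit of the
// original hash is preserved.
inline uint64_t toUnsignedHash(int64_t signedHash) {
  return static_cast<uint64_t>(signedHash);
}

// Hypothetical inverse: copy the bit pattern back into an int64_t. memcpy
// avoids relying on implementation-defined narrowing in older C++ standards.
inline int64_t toSignedHash(uint64_t unsignedHash) {
  int64_t out;
  std::memcpy(&out, &unsignedHash, sizeof(out));
  return out;
}
```

Because the conversion is a pure reinterpretation, two columns that hash to the same int64_t value also map to the same uint64_t value, so filter membership is unaffected.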
We need to implement a stronger BloomFilter, like Spark's: https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java
The new BloomFilter should adapt the number of hash functions to estimatedNumItems (rowCount) and accept more column types, at least uint64_t and int64_t.
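As a rough illustration of what "adapt the number of hash functions to estimatedNumItems" could mean, the sketch below uses the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = max(1, round(m / n * ln 2)) hash functions. The class name SketchBloomFilter, its methods, and the double-hashing probe scheme are assumptions made for this example; this is not the existing Velox BloomFilter and not a copy of Spark's BloomFilterImpl.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Number of bits for n expected items at false-positive probability p.
inline int64_t optimalNumOfBits(int64_t n, double p) {
  assert(n > 0 && p > 0.0 && p < 1.0);
  const double ln2 = std::log(2.0);
  return static_cast<int64_t>(-static_cast<double>(n) * std::log(p) / (ln2 * ln2));
}

// Number of hash functions for n expected items and m bits, at least 1.
inline int optimalNumOfHashFunctions(int64_t n, int64_t m) {
  assert(n > 0 && m > 0);
  const double k = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
  return std::max(1, static_cast<int>(std::llround(k)));
}

// Hypothetical filter: takes an already-computed 64-bit hash per value.
class SketchBloomFilter {
 public:
  SketchBloomFilter(int64_t expectedNumItems, double fpp)
      : numBits_(std::max<int64_t>(64, optimalNumOfBits(expectedNumItems, fpp))),
        numHashes_(optimalNumOfHashFunctions(expectedNumItems, numBits_)),
        words_(static_cast<size_t>((numBits_ + 63) / 64), 0) {}

  void insert(uint64_t hash) {
    for (int i = 1; i <= numHashes_; ++i) {
      const uint64_t bit = probe(hash, i);
      words_[bit / 64] |= (1ULL << (bit % 64));
    }
  }

  bool mightContain(uint64_t hash) const {
    for (int i = 1; i <= numHashes_; ++i) {
      const uint64_t bit = probe(hash, i);
      if ((words_[bit / 64] & (1ULL << (bit % 64))) == 0) {
        return false;  // Definitely not inserted.
      }
    }
    return true;  // Possibly inserted; false positives are allowed.
  }

  // Signed entry points for int64_t hashes; the cast keeps all 64 bits.
  void insertSigned(int64_t hash) { insert(static_cast<uint64_t>(hash)); }
  bool mightContainSigned(int64_t hash) const {
    return mightContain(static_cast<uint64_t>(hash));
  }

  int numHashes() const { return numHashes_; }
  int64_t numBits() const { return numBits_; }

 private:
  // Double hashing: derive the i-th probe from the two 32-bit halves of the
  // input hash. Unsigned arithmetic wraps, so there is no overflow UB.
  uint64_t probe(uint64_t hash, int i) const {
    const uint64_t h1 = hash & 0xffffffffULL;
    const uint64_t h2 = hash >> 32;
    return (h1 + static_cast<uint64_t>(i) * h2) % static_cast<uint64_t>(numBits_);
  }

  int64_t numBits_;
  int numHashes_;
  std::vector<uint64_t> words_;
};
```

With these formulas, 1000 expected items at a 3% false-positive rate yield 7298 bits and 5 hash functions, so the hash count follows the row-count estimate instead of being fixed. A caller would hash each column value first (for example with xxhash64) and pass the result to insert or insertSigned depending on its signedness.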