This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Data Skipping Index Part 4: BloomFilterSketch #483

Merged (2 commits) on Sep 14, 2021


@clee704 clee704 commented Jul 27, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

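This PR adds a new data skipping sketch type, BloomFilterSketch. The example below creates an index with a MinMaxSketch on column A and a BloomFilterSketch on column B, then prints the index data followed by the query plans for an equality filter on B.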

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig
import com.microsoft.hyperspace.index.dataskipping.sketches.{BloomFilterSketch, MinMaxSketch}

spark.range(100).selectExpr("id as A", "id * 2 as B").write.parquet("X")
val df = spark.read.parquet("X")
val hs = Hyperspace()
hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A"), BloomFilterSketch("B", 0.01, 10)))
val indexDataPath = hs.index("myind").select("indexLocation").collect().head.getString(0)
spark.read.parquet(indexDataPath).show
hs.explain(df.filter("B = 1"))
+-------------+-----------+-----------+--------------------------+
|_data_file_id|MinMax_A__0|MinMax_A__1|BloomFilter_B__0.01__10__0|
+-------------+-----------+-----------+--------------------------+
|           10|         75|         82|      {7, 47, [-8781728...|
|            6|         41|         49|      {7, 50, [96460795...|
|            4|         58|         65|      {7, 44, [57871781...|
|            3|         91|         99|      {7, 53, [-7349426...|
|            9|          0|          7|      {7, 44, [47615823...|
|           11|         83|         90|      {7, 49, [-6606656...|
|            8|         50|         57|      {7, 49, [11259074...|
|            5|         66|         74|      {7, 54, [40415990...|
|            0|         25|         32|      {7, 45, [49736543...|
|            2|         16|         24|      {7, 49, [-4128164...|
|            1|         33|         40|      {7, 45, [-8340072...|
|            7|          8|         15|      {7, 47, [-8916036...|
+-------------+-----------+-----------+--------------------------+
=============================================================
Plan with indexes:
=============================================================
Filter (isnotnull(B#9L) AND (B#9L = 1))
+- ColumnarToRow
   +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint>

=============================================================
Plan without indexes:
=============================================================
Filter (isnotnull(B#9L) AND (B#9L = 1))
+- ColumnarToRow
   +- FileScan parquet [A#8L,B#9L] Batched: true, DataFilters: [isnotnull(B#9L), (B#9L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/hyperspace4/X], PartitionFilters: [], PushedFilters: [IsNotNull(B), EqualTo(B,1)], ReadSchema: struct<A:bigint,B:bigint>

=============================================================
Indexes used:
=============================================================
myind:file:/home/chungmin/Repos/hyperspace4/spark-warehouse/indexes/myind/v__=0
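
For readers new to Bloom filter based skipping, the per-file decision can be sketched in plain Scala as follows. This is a toy illustration, not Hyperspace's implementation: ToyBloomFilter, optimalNumBits, optimalNumHashes, filesToScan and the sample file contents are all hypothetical names and data, and the sizing uses the textbook formulas. It also assumes that the arguments 0.01 and 10 in the example above are the target false positive probability and the expected number of distinct values per file.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Textbook sizing: m = ceil(-n * ln(p) / (ln 2)^2) bits, k = round(m / n * ln 2) hashes.
def optimalNumBits(n: Long, fpp: Double): Int =
  math.ceil(-n * math.log(fpp) / (math.log(2) * math.log(2))).toInt

def optimalNumHashes(n: Long, numBits: Int): Int =
  math.max(1, math.round(numBits.toDouble / n * math.log(2)).toInt)

// Toy Bloom filter over Long values (hypothetical; not the encoding Hyperspace stores).
final class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new BitSet(numBits)

  // The i-th bit position is derived by seeding MurmurHash3 with i.
  private def position(value: Long, i: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(value.toString, i), numBits)

  def put(value: Long): Unit =
    (0 until numHashes).foreach(i => bits.set(position(value, i)))

  // false => the value is definitely absent; true => it may be present.
  def mightContain(value: Long): Boolean =
    (0 until numHashes).forall(i => bits.get(position(value, i)))
}

// Assumed parameters, mirroring BloomFilterSketch("B", 0.01, 10) above.
val numBits = optimalNumBits(10, 0.01)
val numHashes = optimalNumHashes(10, numBits)

// Hypothetical values of column B per data file id.
val valuesByFile: Map[Int, Seq[Long]] = Map(
  0 -> Seq(50L, 52L, 54L),
  1 -> Seq(66L, 68L, 70L))

// Build one filter per file, as a data skipping index keeps one sketch row per file.
val filterByFile: Map[Int, ToyBloomFilter] = valuesByFile.map { case (fileId, values) =>
  val bf = new ToyBloomFilter(numBits, numHashes)
  values.foreach(bf.put)
  fileId -> bf
}

// For a predicate B = value, keep only files whose filter may contain the value.
def filesToScan(value: Long): Set[Int] =
  filterByFile.collect { case (fileId, bf) if bf.mightContain(value) => fileId }.toSet
```

With these assumed parameters the textbook formulas give 7 hash functions, which is consistent with the leading 7 in each BloomFilter_B struct above if that field is the hash function count (an assumption). A file is skipped only when its filter reports the value as definitely absent, so a false positive costs an unnecessary read but never drops a matching row.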

Does this PR introduce any user-facing change?

Yes, users can now create data skipping indexes with bloom filters.

How was this patch tested?

Unit test

@clee704 clee704 changed the title from "Data Skipping Index Part 3: BloomFilterSketch" to "Data Skipping Index Part 4: BloomFilterSketch" on Jul 28, 2021
@clee704 clee704 force-pushed the ds_part3 branch 16 times, most recently from 2c390e5 to f8202d9, on August 2, 2021 13:24
@clee704 clee704 force-pushed the ds_part3 branch 4 times, most recently from b745f05 to 0eec377, on September 1, 2021 08:45
Implement BloomFilterSketch.
@clee704 clee704 marked this pull request as ready for review September 1, 2021 09:10
@clee704 clee704 requested a review from sezruby September 1, 2021 09:11
@clee704 clee704 self-assigned this Sep 1, 2021
Collaborator

@sezruby sezruby left a comment

LGTM, thanks @clee704! Could you update the PR description with an example of the createIndex API and index data, similar to #461?
It would be good if the example includes both min/max and Bloom filter sketches.

@sezruby sezruby merged commit 1ab046d into microsoft:master Sep 14, 2021
paryoja pushed a commit to paryoja/hyperspace that referenced this pull request Nov 4, 2021
@sezruby sezruby added the enhancement New feature or request label Nov 4, 2021