This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Data Skipping Index Part 3-2: Rule #482

Merged: 10 commits, Aug 31, 2021

@clee704 clee704 commented Jul 27, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

Implement the data skipping index application rule.

Does this PR introduce any user-facing change?

Yes, users can create data skipping indexes that can be applied to filter queries.

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.dataskipping.DataSkippingIndexConfig
import com.microsoft.hyperspace.index.dataskipping.sketches.MinMaxSketch

spark.range(100).toDF("A").write.parquet("X")
val df = spark.read.parquet("X")
val hs = Hyperspace()
hs.createIndex(df, DataSkippingIndexConfig("myind", MinMaxSketch("A")))
hs.explain(df.filter("A = 1"))
=============================================================
Plan with indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan Hyperspace(Type: DS, Name: myind, LogVersion: 1) [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: DataSkippingFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Plan without indexes:
=============================================================
Filter (isnotnull(A#271L) AND (A#271L = 1))
+- ColumnarToRow
   +- FileScan parquet [A#271L] Batched: true, DataFilters: [isnotnull(A#271L), (A#271L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/home/chungmin/Repos/spark3.1/X], PartitionFilters: [], PushedFilters: [IsNotNull(A), EqualTo(A,1)], ReadSchema: struct<A:bigint>

=============================================================
Indexes used:
=============================================================
myind:file:/home/chungmin/Repos/spark3.1/spark-warehouse/indexes/myind/v__=0
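
To make the effect of the rule concrete, here is a minimal, Spark-free sketch of the idea behind min/max-based data skipping. It assumes that a min/max sketch keeps the smallest and largest value of a column for each source file, and that an equality filter only needs to read files whose range can contain the value. The names `MinMaxSkippingSketch`, `FileStats`, `buildStats`, and `candidateFiles` are hypothetical and are not part of the Hyperspace API; the actual rule in this PR may work differently.

```scala
// Hypothetical illustration of min/max data skipping; not Hyperspace code.
object MinMaxSkippingSketch {
  // Per-file summary of one column: the smallest and largest value in the file.
  final case class FileStats(path: String, min: Long, max: Long)

  // Build one summary per non-empty file (empty files can never match a filter).
  def buildStats(files: Map[String, Seq[Long]]): Seq[FileStats] =
    files.toSeq.collect {
      case (path, values) if values.nonEmpty =>
        FileStats(path, values.min, values.max)
    }.sortBy(_.path)

  // For a predicate `column = value`, keep only files whose [min, max]
  // range can contain the value; every other file is skipped.
  def candidateFiles(stats: Seq[FileStats], value: Long): Seq[String] =
    stats.filter(s => s.min <= value && value <= s.max).map(_.path)

  def main(args: Array[String]): Unit = {
    // Assumed layout: 100 values split across two files.
    val files: Map[String, Seq[Long]] = Map(
      "part-0" -> (0L until 50L),
      "part-1" -> (50L until 100L))
    val stats = buildStats(files)
    // Only part-0 can contain the value 1, so part-1 is never scanned.
    println(candidateFiles(stats, 1L).mkString(", ")) // prints "part-0"
  }
}
```

In the plans above, the corresponding change is visible only in the scan's `Location`: `InMemoryFileIndex` is replaced by `DataSkippingFileIndex`, while the filter itself is still evaluated on the rows that are read.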

How was this patch tested?

Unit test

@clee704 clee704 changed the title Ds part2 Data Skipping Index Part 2: MinMaxSketch Jul 27, 2021
@clee704 clee704 changed the title Data Skipping Index Part 2: MinMaxSketch Data Skipping Index Part 3: Rule Jul 28, 2021
@clee704 clee704 force-pushed the ds_part2 branch 7 times, most recently from 0664707 to d39d05e Compare July 29, 2021 08:00
@clee704 clee704 force-pushed the ds_part2 branch 5 times, most recently from b3f1f73 to 9318ab8 Compare July 29, 2021 17:53
@clee704 clee704 force-pushed the ds_part2 branch 6 times, most recently from 7f7f5df to 6ca6cc7 Compare August 2, 2021 13:07
@sezruby sezruby (Collaborator) commented Aug 10, 2021

Could you split the PR? e.g. part 3-1: utils, part 3-2: apply?

@clee704 clee704 changed the title Data Skipping Index Part 3: Rule Data Skipping Index Part 3-2: Rule Aug 22, 2021
@clee704 clee704 force-pushed the ds_part2 branch 4 times, most recently from 2c961b4 to 9e1dce4 Compare August 23, 2021 17:54
@clee704 clee704 marked this pull request as ready for review August 24, 2021 07:54
@clee704 clee704 requested a review from sezruby August 24, 2021 07:54

@sezruby sezruby left a comment

@clee704 Could you add plan change examples (executedPlan) to the PR description?

@sezruby sezruby left a comment

LGTM Thanks!

@sezruby sezruby merged commit 9735b57 into microsoft:master Aug 31, 2021

clee704 commented Sep 1, 2021

Thanks for the detailed review!

@clee704 clee704 deleted the ds_part2 branch September 1, 2021 07:30
@clee704 clee704 added the enhancement New feature or request label Sep 1, 2021