Add ZOrderUDF and ZOrderField for Z-address calculation #517
Conversation
```scala
/**
 * ZOrderField for numeric types based on percentile values of the original column values.
```
While reading the approach design blueprint in this comment, I am missing a critical component of the thinking and design approach. Covered indexes make sense in most situations, except in what is called a lookup, when the index is used to calculate a row address to read additional data.
It would seem that in such a query case, particularly with a large subquery passed in for matching, having unaligned partitions between the index and the outer rowstore would result in a shuffle, which would either reduce the value of the index or, depending on the optimizer, prompt a scan of the rowstore, ignoring the index altogether.
The rule inference for effective colocation of partitions was long thought out, and this seems like a very good time to either tag it onto this feature, or add it as a V2 of this feature going forward. This would particularly impact data raised into memory.
Are you suggesting a partitioned covering index, e.g. #509? Otherwise I don't really get the point of your comment.
Partitioning is not a "secret sauce" in and of itself. Partitioning by itself does not buy us much unless the partition is aligned (physically colocated) with the outer row, the data row.
In fact, this
`org.apache.spark.sql.connector.catalog.InMemoryPartitionTable`
pushes the time to deliver this feature too close for comfort. I really want us to make it happen ASAP, rather than arrive late at the finish line with someone already past us.
As an aside, somewhere I promised to test the index on Spark 3.2.1 by EOD today. I have to apologize; I didn't realize the code was version-aligned with Spark. My quick attempt to create a 3.2 folder with Scala 2.12.15 and Java 11 in the fold will take a bit more time before I can complete a test pass. So, it will probably be next weekend.
@MironAtHome I'm not sure whether we are on the same page. Hyperspace covering indexes are, simply put, a derived dataset of the original dataset, only including a subset of columns and bucketed by some indexed columns chosen by the user during the index creation. When an index is applied to a plan, the original file scan node is replaced by a file scan node that reads from the index files (derived dataset), instead of the original data files. Therefore, at least for the current covering index implementation, there is no such concept as "row address" or "outer row".
```scala
      quantiles: Seq[Any] = Nil,
      quantileEnabled: Boolean = false): ZOrderField = {
    dataType match {
      case LongType =>
```
Does it really have to go through the list, or can the type be extracted from the `dataType` parameter for brevity?
I don't think the implementation is inefficient, and it should be okay since this is a one-time cost for each z-order column.
Could you elaborate if it still needs to be improved?
I think @MironAtHome is talking about code repetition. Having a long list of `if` or `case` conditions with similar code modulo some parameter is a typical antipattern. If it were C++, this kind of repetition could be easily factored out using templates, but in Scala, I think it needs more boilerplate and glue code than necessary.
We could have done something like this:

```scala
val builder: {def apply(name: String, min: Any, max: Any): ZOrderField} = dataType match {
  case IntegerType => IntMinMaxZOrderField
  case LongType => LongMinMaxZOrderField
  ...
}
builder(name, min, max)
```

But the actual code is slightly more complex than the simple `apply` function above, and we would need to define a factory trait and factories for each class anyway, which seems to be a little too much considering there will be very few, if any, new types in Spark in the future, let alone whether we will support them or not.
I do prefer a modular approach where I would group classes by their differentiating properties (so I would put IntMinMaxZOrderField, IntPercentileZOrderField, and factories for them in IntZOrderField.scala, and so on) and make the combining code like `ZOrderField.build` agnostic about different implementations, meaning that you don't have to change the function when you add or remove implementations. But such a modular approach makes more sense for a larger abstraction. If we need to, we can change to that later.
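To make the trade-off concrete, here is a minimal, self-contained sketch of the structural-type builder idea above. It uses hypothetical stand-in classes (`IntMinMaxField`, `LongMinMaxField`, a toy `DataType`) rather than the real ZOrderField hierarchy, purely to show how case-class companion objects can serve as factories so the `match` only selects a builder instead of duplicating construction code:

```scala
import scala.language.reflectiveCalls

// Hypothetical stand-ins for the real ZOrderField classes and Spark's DataType,
// for illustration only.
sealed trait Field { def name: String }
case class IntMinMaxField(name: String, min: Any, max: Any) extends Field
case class LongMinMaxField(name: String, min: Any, max: Any) extends Field

sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType

object FieldBuilder {
  // The structural type accepts any object with a matching apply(), so each
  // case class's auto-generated companion apply() acts as the factory.
  def build(dataType: DataType, name: String, min: Any, max: Any): Field = {
    val builder: { def apply(name: String, min: Any, max: Any): Field } =
      dataType match {
        case IntegerType => IntMinMaxField
        case LongType    => LongMinMaxField
      }
    builder(name, min, max)
  }
}
```

Note the structural type relies on reflective calls at runtime (hence the `scala.language.reflectiveCalls` import), which is one of the pieces of glue the comment above alludes to.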
I left some comments but none are significant. Please feel free to submit as most code has been already reviewed.
```scala
            Seq(minVal.asInstanceOf[Double], maxVal.asInstanceOf[Double]))
        }
      case FloatType =>
        // minVal, maxVal can be Float.
```
I think the comment should say the opposite to make sense? When can minVal/maxVal be `Double`?
If it's from `approxQuantile`, then the value type is Double. If it's from aggs(min, max), then it's Float type.
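The mixed value types are easy to demonstrate outside Spark. The sketch below (plain Scala, with the Spark calls replaced by hard-coded stand-in values) shows why code that receives such values boxed as `Any` must match on both `Float` and `Double`:

```scala
// In Spark, approxQuantile returns Doubles, while a min/max aggregate on a
// FloatType column yields Floats. Boxed as Any, these are distinct runtime types.
val fromApproxQuantile: Any = 1.5d // stand-in for a df.stat.approxQuantile result
val fromMinMaxAgg: Any = 1.5f      // stand-in for min/max on a FloatType column

def describe(v: Any): String = v match {
  case d: Double => s"Double($d)"
  case f: Float  => s"Float($f)"
  case other     => s"unexpected: $other"
}
```

A bare `asInstanceOf[Double]` on the Float-typed value would fail at runtime, which is why the field construction code handles both cases.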
What is the context for this pull request?
What changes were proposed in this pull request?
Add ZOrderUDF and ZOrderField classes for Z-address calculation.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.