
Add ZOrderUDF and ZOrderField for Z-address calculation #517

Merged
merged 2 commits from sezruby:zorderfield into microsoft:master on Dec 13, 2021

Conversation

@sezruby (Collaborator) commented Dec 10, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

Add ZOrderUDF and ZOrderField classes for Z-address calculation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

unit test
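
For context on what a Z-address is: it is typically formed by interleaving the bits of the per-column values, so that rows that are close on several columns end up close in the resulting sort order. Below is a minimal, generic Scala sketch of that bit-interleaving step only, with made-up example values; it is not the PR's actual ZOrderUDF/ZOrderField code, which also appears to map column values (e.g. via min/max or percentile ranges) before this step.

object ZAddressSketch {
  // Interleave the lower 32 bits of two non-negative ints into one Long Z-address:
  // bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
  def interleave(x: Int, y: Int): Long = {
    var z = 0L
    var i = 0
    while (i < 32) {
      z |= ((x.toLong >> i) & 1L) << (2 * i)
      z |= ((y.toLong >> i) & 1L) << (2 * i + 1)
      i += 1
    }
    z
  }

  def main(args: Array[String]): Unit = {
    // 3 = 0b011 and 5 = 0b101 interleave to 0b100111 = 39.
    println(interleave(3, 5))
  }
}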

@sezruby requested a review from clee704 December 10, 2021 00:12
@sezruby self-assigned this Dec 10, 2021
@sezruby mentioned this pull request Dec 10, 2021
}

/**
* ZOrderField for numeric types based on percentile values of the original column values.
@MironAtHome commented Dec 11, 2021

While reading the design approach described in this comment, I am missing a critical piece of the thinking: covering indexes make sense in most situations, except for what is called a lookup, where the index is used to compute a row address in order to read additional data.
It would seem that in such a query, particularly with a large subquery passed in for matching, having unaligned partitions between the index and the outer rowstore would result in a shuffle, which would either reduce the value of the index or, depending on the optimizer, prompt a scan of the rowstore, ignoring the index altogether.
The rule inference for effective colocation of partitions has long been thought through, and this seems like a good time to either fold it into this feature or add it as a V2 of this feature going forward. This would particularly impact data loaded into memory.

@sezruby (Collaborator, Author) replied:

Are you suggesting a partitioned covering index, e.g. #509? Otherwise I don't really get the point of your comment.

@MironAtHome commented Dec 13, 2021

Partitioning is not a "secret sauce" in and of itself. We don't gain much from partitioning unless the partition is aligned (physically colocated) with the outer row, the data row.
In fact, org.apache.spark.sql.connector.catalog.InMemoryPartitionTable pushes the time to deliver this feature uncomfortably close. I really want us to make it happen ASAP, rather than arrive at the finish line after someone has already passed us.
As an aside, somewhere I promised to test the index on Spark 3.2.1 by EOD today. I have to apologize: I didn't realize the code was version-aligned with Spark. My quick attempt to create a 3.2 folder with Scala 2.12.15 and Java 11 will take a bit more time before I can complete a test pass, so it will probably be next weekend.

@clee704 replied:

@MironAtHome I'm not sure whether we are on the same page. Hyperspace covering indexes are, simply put, a derived dataset of the original dataset, only including a subset of columns and bucketed by some indexed columns chosen by the user during the index creation. When an index is applied to a plan, the original file scan node is replaced by a file scan node that reads from the index files (derived dataset), instead of the original data files. Therefore, at least for the current covering index implementation, there is no such concept as "row address" or "outer row".
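
To make the "derived dataset" framing concrete, here is a rough, hypothetical Spark sketch of what building such an index amounts to conceptually; the path, column names, and bucket count are made up, and this is not the Hyperspace API.

import org.apache.spark.sql.SparkSession

object CoveringIndexSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("covering-index-sketch").getOrCreate()

    // Hypothetical source data; "userId" plays the role of the indexed column,
    // "eventTime" and "payload" the included columns.
    val df = spark.read.parquet("/data/events")

    // The covering index is just this projection, bucketed by the indexed column
    // and written out as its own dataset.
    df.select("userId", "eventTime", "payload")
      .write
      .bucketBy(200, "userId")
      .sortBy("userId")
      .saveAsTable("events_covering_index")

    // A query that needs only these columns can scan the index dataset instead of
    // the original files; no row address or lookup back into the source is involved.

    spark.stop()
  }
}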

quantiles: Seq[Any] = Nil,
quantileEnabled: Boolean = false): ZOrderField = {
dataType match {
case LongType =>

@MironAtHome:

Does it really have to go through the list, or can the type be extracted from the dataType parameter for brevity?

@sezruby (Collaborator, Author):

I don't think the implementation is inefficient; it's a one-time cost for each Z-order column.
Could you elaborate if you still think it needs to be improved?

@clee704:

I think @MironAtHome is talking about code repetition. Having a long list of if or case conditions with similar code modulo some parameter is a typical antipattern. If it were C++, this kind of repetition could easily be factored out with templates, but in Scala I think it needs more boilerplate and glue code than necessary.

We could have done something like this:

val builder: {def apply(name: String, min: Any, max: Any): ZOrderField} = dataType match {
  case IntegerType => IntMinMaxZOrderField
  case LongType => LongMinMaxZOrderField
  ...
}
builder(name, min, max)

But the actual code is slightly more complex than the simple apply function above, and we would need to define a factory trait and a factory for each class anyway, which seems like too much, considering there will be very few, if any, new types added to Spark in the future, let alone whether we will support them.

I do prefer a modular approach where I would group classes by their differentiating properties (so I would put IntMinMaxZOrderField, IntPercentileZOrderField, and factories for them in IntZOrderField.scala, and so on) and make the combining code like ZOrderField.build agnostic about different implementations, meaning that you don't have to change the function when you add/remove implementations. But such a modular approach makes more sense for a larger abstraction. If we need to, we can change to that later.
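
For reference, the factory-trait shape alluded to above might look roughly like the following; the field classes are reduced to stubs and all names and signatures are illustrative, not the PR's actual ones.

import org.apache.spark.sql.types.{DataType, IntegerType, LongType}

// Stub field classes standing in for the real ZOrderField implementations.
sealed trait ZOrderField { def name: String }
final case class IntMinMaxZOrderField(name: String, min: Int, max: Int) extends ZOrderField
final case class LongMinMaxZOrderField(name: String, min: Long, max: Long) extends ZOrderField

// One small factory object per field class, all sharing a common trait.
trait ZOrderFieldFactory {
  def build(name: String, min: Any, max: Any): ZOrderField
}
object IntMinMaxFactory extends ZOrderFieldFactory {
  def build(name: String, min: Any, max: Any): ZOrderField =
    IntMinMaxZOrderField(name, min.asInstanceOf[Int], max.asInstanceOf[Int])
}
object LongMinMaxFactory extends ZOrderFieldFactory {
  def build(name: String, min: Any, max: Any): ZOrderField =
    LongMinMaxZOrderField(name, min.asInstanceOf[Long], max.asInstanceOf[Long])
}

object ZOrderFieldFactories {
  // The type dispatch collapses to a lookup, at the cost of the boilerplate above.
  def factoryFor(dataType: DataType): ZOrderFieldFactory = dataType match {
    case IntegerType => IntMinMaxFactory
    case LongType    => LongMinMaxFactory
  }
}

Whether this extra indirection pays off is exactly the trade-off described above.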

@clee704 left a comment

I left some comments, but none are significant. Please feel free to submit, as most of the code has already been reviewed.


Seq(minVal.asInstanceOf[Double], maxVal.asInstanceOf[Double]))
}
case FloatType =>
// minVal, maxVal can be Float.
@clee704:

I think the comment should say the opposite to make sense. When can minVal/maxVal be Double?

@sezruby (Collaborator, Author):

If the value comes from approxQuantile, its type is Double. If it comes from the min/max aggregations, it's Float.
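
For context, that difference comes straight from the Spark APIs involved: DataFrameStatFunctions.approxQuantile always returns Array[Double], while min/max aggregations preserve the column's type. A small illustrative snippet with a made-up column name and values:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{max, min}

object QuantileVsAggTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("quantile-vs-agg").getOrCreate()
    import spark.implicits._

    val df = Seq(1.5f, 2.5f, 3.5f).toDF("value") // FloatType column

    // approxQuantile returns Array[Double] regardless of the input column type.
    val quantiles: Array[Double] = df.stat.approxQuantile("value", Array(0.25, 0.75), 0.01)

    // min/max aggregations keep the column type, so the values come back as Float.
    val Row(minVal, maxVal) = df.agg(min("value"), max("value")).head()

    println(s"quantiles: ${quantiles.mkString(", ")}")
    println(s"min=$minVal (${minVal.getClass.getSimpleName}), max=$maxVal")

    spark.stop()
  }
}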

@sezruby merged commit 8cd1b13 into microsoft:master Dec 13, 2021
@sezruby deleted the zorderfield branch December 13, 2021 21:05
@sezruby added the enhancement (New feature or request) label Dec 14, 2021