Add ZOrderUDF and ZOrderField for Z-address calculation #517
Conversation
```scala
/**
 * ZOrderField for numeric types based on percentile values of the original column values.
```
While reading the approach design blueprint in this comment, I am missing a critical component of the thinking and design approach. Covered indexes make sense in most situations, except in what is called a lookup, when the index is used to calculate a row address to read additional data.
It would seem that in such a query case, particularly with a large subquery passed in for matching, having unaligned partitions between the index and the outer rowstore would result in a shuffle, which would either reduce the value of the index or, depending on the optimizer, prompt a scan of the rowstore, ignoring the index altogether.
The rule inference for effective colocation of partitions was long thought out, and this seems like a very good time to either tag it onto this feature, or add it as a V2 of this feature going forward. This would particularly impact data raised into memory.
Are you suggesting a partitioned covering index, e.g. #509? Otherwise I don't really get the point of your comment.
Partitioning is not a "secret sauce" in and of itself. Partitioning by itself does not buy us much unless the partition is aligned (physically colocated) with the outer row, the data row.
In fact, this
`org.apache.spark.sql.connector.catalog.InMemoryPartitionTable`
pushes the time to deliver this feature too close for comfort. I really want us to make it happen ASAP, rather than arrive late at the finish line with someone already past us.
As an aside, somewhere I promised to test the index on Spark 3.2.1 by EOD today. I have to apologize; I didn't realize the code was version-aligned with Spark. My quick attempt to create a 3.2 folder with Scala 2.12.15 and Java 11 in the fold will take a bit more time before I can complete a test pass. So, it will probably be next weekend.
@MironAtHome I'm not sure whether we are on the same page. Hyperspace covering indexes are, simply put, a derived dataset of the original dataset, only including a subset of columns and bucketed by some indexed columns chosen by the user during the index creation. When an index is applied to a plan, the original file scan node is replaced by a file scan node that reads from the index files (derived dataset), instead of the original data files. Therefore, at least for the current covering index implementation, there is no such concept as "row address" or "outer row".
```scala
      quantiles: Seq[Any] = Nil,
      quantileEnabled: Boolean = false): ZOrderField = {
    dataType match {
      case LongType =>
```
Does it really have to go through the list, or can the type be extracted from the `dataType` parameter for brevity?
I don't think the implementation is inefficient, and it should be okay since this is a one-time cost for each z-order column.
Could you elaborate if it still needs to be improved?
I think @MironAtHome is talking about code repetition. Having a long list of `if` or `case` conditions with similar code modulo some parameter is a typical antipattern. If it were C++, this kind of repetition could be easily factored out using templates, but in Scala, I think it needs more boilerplate and glue code than necessary.
We could have done something like this:

```scala
val builder: {def apply(name: String, min: Any, max: Any): ZOrderField} = dataType match {
  case IntegerType => IntMinMaxZOrderField
  case LongType => LongMinMaxZOrderField
  ...
}
builder(name, min, max)
```

But the actual code is slightly more complex than the simple `apply` function above, and we would need to define a factory trait and factories for each class anyway, which seems to be a little too much considering there will be very few, if any, new types in Spark in the future, let alone whether we will support them or not.
I do prefer a modular approach where I would group classes by their differentiating properties (so I would put IntMinMaxZOrderField, IntPercentileZOrderField, and factories for them in IntZOrderField.scala, and so on) and make the combining code like `ZOrderField.build` agnostic about different implementations, meaning that you don't have to change the function when you add or remove implementations. But such a modular approach makes more sense for a larger abstraction. If we need to, we can change to that later.
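To make the trade-off concrete, here is a minimal, self-contained sketch of the structural-type builder idea above. It uses hypothetical stand-in classes (`IntMinMaxField`, `LongMinMaxField`, a toy `DataType`) rather than the real ZOrderField hierarchy, purely to show how case-class companion objects can serve as factories so the `match` only selects a builder instead of duplicating construction code:

```scala
import scala.language.reflectiveCalls

// Hypothetical stand-ins for the real ZOrderField classes and Spark's DataType,
// for illustration only.
sealed trait Field { def name: String }
case class IntMinMaxField(name: String, min: Any, max: Any) extends Field
case class LongMinMaxField(name: String, min: Any, max: Any) extends Field

sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType

object FieldBuilder {
  // The structural type accepts any object with a matching apply(), so each
  // case class's auto-generated companion apply() acts as the factory.
  def build(dataType: DataType, name: String, min: Any, max: Any): Field = {
    val builder: { def apply(name: String, min: Any, max: Any): Field } =
      dataType match {
        case IntegerType => IntMinMaxField
        case LongType    => LongMinMaxField
      }
    builder(name, min, max)
  }
}
```

Note the structural type relies on reflective calls at runtime (hence the `scala.language.reflectiveCalls` import), which is one of the pieces of glue the comment above alludes to.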
I left some comments but none are significant. Please feel free to submit as most code has been already reviewed.
```scala
            Seq(minVal.asInstanceOf[Double], maxVal.asInstanceOf[Double]))
        }
      case FloatType =>
        // minVal, maxVal can be Float.
```
I think the comment should say the opposite to make sense? When can minVal/maxVal be `Double`?
If it's from `approxQuantile`, then the value type is Double. If it's from aggs(min, max), then it's Float type.
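The mixed value types are easy to demonstrate outside Spark. The sketch below (plain Scala, with the Spark calls replaced by hard-coded stand-in values) shows why code that receives such values boxed as `Any` must match on both `Float` and `Double`:

```scala
// In Spark, approxQuantile returns Doubles, while a min/max aggregate on a
// FloatType column yields Floats. Boxed as Any, these are distinct runtime types.
val fromApproxQuantile: Any = 1.5d // stand-in for a df.stat.approxQuantile result
val fromMinMaxAgg: Any = 1.5f      // stand-in for min/max on a FloatType column

def describe(v: Any): String = v match {
  case d: Double => s"Double($d)"
  case f: Float  => s"Float($f)"
  case other     => s"unexpected: $other"
}
```

A bare `asInstanceOf[Double]` on the Float-typed value would fail at runtime, which is why the field construction code handles both cases.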
What is the context for this pull request?
What changes were proposed in this pull request?
Add ZOrderUDF and ZOrderField classes for Z-address calculation.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests.