[SPARK-15214][SQL] Code-generation for Generate #13065
Status: Closed

Changes shown from 33 of 39 commits.
All commits by hvanhovell:

- 5254559 CG for Generate/Explode
- 9df56a9 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 5d068b5 Fix compilation and binding errors.
- b2d663b Fix infinite loop on continue.
- e04d66f Fix outer.
- 43a04bf Really really fix lateral outer view explode(...).
- 7b4772d Add benchmark & fix subexpressionsuite.
- b721b60 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 09513e7 Generate WIP
- f7c2307 Update GenerateExec
- f5bd9cf Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- dba4240 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 49f9e7f Further tweaking...
- b3531cb Proper support for json_tuple.
- 5cfba19 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- f86da0f Use TraversableOnce for regular Generators.
- 60da24e Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 87688b1 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 1d2d595 Add benchmarks for explode map & json_tuple
- c9b3eda fix generated json
- 2732b06 disable benchmark
- 36cd826 Add tests for generate with outer = true
- 5b3d9bd Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 3a40952 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- c41e308 Add new generators & update.
- 116339a Fix Stack
- 2c6c7f2 Make Stack use the iteration path.
- d20114b Update benchmarks
- ad36de5 Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- 757b470 Revert json_tuple changes
- 8c14194 Touch-ups
- 459714c Touch-ups
- 7b7fa6e Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- ebd9d8c Merge remote-tracking branch 'apache-github/master' into SPARK-15214
- f81eed7 Code review
- 29c606a code review
- af9a516 code review 2
- 3146cc5 Add proper fallback for 'Stack' generator. Make traversable once oute…
- ffd5ef8 Minor thing
```diff
@@ -17,10 +17,12 @@
 package org.apache.spark.sql.catalyst.expressions
 
+import scala.collection.mutable
+
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
-import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodegenFallback, ExprCode}
 import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
 import org.apache.spark.sql.types._
```
```diff
@@ -62,6 +64,21 @@ trait Generator extends Expression {
   def terminate(): TraversableOnce[InternalRow] = Nil
 }
 
+/**
+ * A collection producing [[Generator]]. This trait provides a different path for code generation,
+ * by allowing code generation to return either an [[ArrayData]] or a [[MapData]] object.
+ */
+trait CollectionGenerator extends Generator {
+  /** The position of an element within the collection should also be returned. */
+  def position: Boolean
+
+  /** Rows will be inlined during generation. */
+  def inline: Boolean
+
+  /** The schema of the returned collection object. */
+  def collectionSchema: DataType = dataType
+}
+
 /**
  * A generator that produces its output using the provided lambda function.
  */
```
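The point of the new `CollectionGenerator` marker trait is that downstream code can recognize generators supporting the collection-based codegen path by type, rather than by a hard-coded list. A minimal, self-contained sketch of that idea, using hypothetical stand-in classes (not Spark's real ones):

```scala
// Hypothetical stand-ins (not Spark's actual classes) showing how a marker
// trait lets a planner pick a codegen strategy via pattern matching, instead
// of hard-coding every generator that supports the fast path.
trait Gen { def name: String }
trait CollectionGen extends Gen {
  def position: Boolean // also emit each element's position?
  def inline: Boolean   // spread struct fields across output columns?
}

case class ExplodeLike() extends CollectionGen {
  val name = "explode"; val position = false; val inline = false
}
case class UserDefinedLike() extends Gen { val name = "udg" }

// The collection-based iteration path is chosen only for CollectionGen.
def codegenPath(g: Gen): String = g match {
  case _: CollectionGen => "collection-iteration"
  case _                => "traversable-once"
}
```

This mirrors the trade-off discussed in the review comments: one extra interface, but no central registry of generator names.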
```diff
@@ -77,7 +94,9 @@ case class UserDefinedGenerator(
   private def initializeConverters(): Unit = {
     inputRow = new InterpretedProjection(children)
     convertToScala = {
-      val inputSchema = StructType(children.map(e => StructField(e.simpleString, e.dataType, true)))
+      val inputSchema = StructType(children.map { e =>
+        StructField(e.simpleString, e.dataType, nullable = true)
+      })
       CatalystTypeConverters.createToScalaConverter(inputSchema)
     }.asInstanceOf[InternalRow => Row]
   }
```
```diff
@@ -109,8 +128,7 @@ case class UserDefinedGenerator(
       1  2
       3  NULL
   """)
-case class Stack(children: Seq[Expression])
-  extends Expression with Generator with CodegenFallback {
+case class Stack(children: Seq[Expression]) extends Generator {
 
   private lazy val numRows = children.head.eval().asInstanceOf[Int]
   private lazy val numFields = Math.ceil((children.length - 1.0) / numRows).toInt
```
```diff
@@ -149,29 +167,52 @@ case class Stack(children: Seq[Expression])
       InternalRow(fields: _*)
     }
   }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    // Rows - we write these into an array.
+    val rowData = ctx.freshName("rows")
+    ctx.addMutableState("InternalRow[]", rowData, s"this.$rowData = new InternalRow[$numRows];")
+    val values = children.tail
+    val dataTypes = values.take(numFields).map(_.dataType)
+    val rows = for (row <- 0 until numRows) yield {
+      val fields = for (col <- 0 until numFields) yield {
+        val index = row * numFields + col
+        if (index < values.length) values(index) else Literal(null, dataTypes(col))
+      }
+      val eval = CreateStruct(fields).genCode(ctx)
+      s"${eval.code}\nthis.$rowData[$row] = ${eval.value};"
+    }
+
+    // Create the iterator.
+    val wrapperClass = classOf[mutable.WrappedArray[_]].getName
+    ctx.addMutableState(
+      s"$wrapperClass<InternalRow>",
+      ev.value,
+      s"this.${ev.value} = $wrapperClass$$.MODULE$$.make(this.$rowData);")
+    ev.copy(code = rows.mkString("\n"), isNull = "false")
+  }
 }
 
 /**
- * A base class for Explode and PosExplode
+ * A base class for [[Explode]] and [[PosExplode]].
  */
-abstract class ExplodeBase(child: Expression, position: Boolean)
-  extends UnaryExpression with Generator with CodegenFallback with Serializable {
+abstract class ExplodeBase extends UnaryExpression with CollectionGenerator with Serializable {
+  override val inline: Boolean = false
 
-  override def checkInputDataTypes(): TypeCheckResult = {
-    if (child.dataType.isInstanceOf[ArrayType] || child.dataType.isInstanceOf[MapType]) {
+  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
+    case _: ArrayType | _: MapType =>
       TypeCheckResult.TypeCheckSuccess
-    } else {
+    case _ =>
       TypeCheckResult.TypeCheckFailure(
         s"input to function explode should be array or map type, not ${child.dataType}")
-    }
   }
 
   // hive-compatible default alias for explode function ("col" for array, "key", "value" for map)
   override def elementSchema: StructType = child.dataType match {
     case ArrayType(et, containsNull) =>
       if (position) {
         new StructType()
-          .add("pos", IntegerType, false)
+          .add("pos", IntegerType, nullable = false)
           .add("col", et, containsNull)
       } else {
         new StructType()
```

Review comment on the `val rows = for (row <- 0 until numRows) yield {` line: "Should we split these into multiple functions just in case numRows is large?" Reply: "Done"
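The unrolled codegen in `Stack.doGenCode` above fills a `numRows` x `numFields` grid row-major from the tail of `children`, padding missing trailing values with nulls. A plain-Scala sketch of just that index arithmetic (this is not the generated Java, only a model of the layout):

```scala
// Sketch of Stack's row layout: values are placed row-major into numRows rows
// of numFields columns; positions past the end of the input become null,
// mirroring `Literal(null, dataTypes(col))` in the codegen above.
def stackRows(numRows: Int, values: Seq[Any]): Seq[Seq[Any]] = {
  val numFields = math.ceil(values.length.toDouble / numRows).toInt
  for (row <- 0 until numRows) yield {
    for (col <- 0 until numFields) yield {
      val index = row * numFields + col
      if (index < values.length) values(index) else null
    }
  }
}
```

For the docstring example `stack(2, 1, 2, 3)`, this yields the rows `(1, 2)` and `(3, NULL)`.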
```diff
@@ -180,12 +221,12 @@ abstract class ExplodeBase(child: Expression, position: Boolean)
     case MapType(kt, vt, valueContainsNull) =>
       if (position) {
         new StructType()
-          .add("pos", IntegerType, false)
-          .add("key", kt, false)
+          .add("pos", IntegerType, nullable = false)
+          .add("key", kt, nullable = false)
           .add("value", vt, valueContainsNull)
       } else {
         new StructType()
-          .add("key", kt, false)
+          .add("key", kt, nullable = false)
           .add("value", vt, valueContainsNull)
       }
   }
```
```diff
@@ -218,6 +259,12 @@ abstract class ExplodeBase(child: Expression, position: Boolean)
       }
     }
   }
+
+  override def collectionSchema: DataType = child.dataType
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    child.genCode(ctx)
+  }
 }
 
 /**
```
```diff
@@ -239,7 +286,9 @@ abstract class ExplodeBase(child: Expression, position: Boolean)
       20
   """)
 // scalastyle:on line.size.limit
-case class Explode(child: Expression) extends ExplodeBase(child, position = false)
+case class Explode(child: Expression) extends ExplodeBase {
+  override val position: Boolean = false
+}
 
 /**
  * Given an input array produces a sequence of rows for each position and value in the array.
```
```diff
@@ -260,7 +309,9 @@ case class Explode(child: Expression) extends ExplodeBase(child, position = fals
       1  20
   """)
 // scalastyle:on line.size.limit
-case class PosExplode(child: Expression) extends ExplodeBase(child, position = true)
+case class PosExplode(child: Expression) extends ExplodeBase {
+  override val position = true
+}
 
 /**
  * Explodes an array of structs into a table.
```
```diff
@@ -273,20 +324,24 @@ case class PosExplode(child: Expression) extends ExplodeBase(child, position = t
       1  a
       2  b
   """)
-case class Inline(child: Expression) extends UnaryExpression with Generator with CodegenFallback {
+case class Inline(child: Expression) extends UnaryExpression with CollectionGenerator {
+  override val inline: Boolean = true
+  override val position: Boolean = false
 
   override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
-    case ArrayType(et, _) if et.isInstanceOf[StructType] =>
+    case ArrayType(st: StructType, _) =>
       TypeCheckResult.TypeCheckSuccess
     case _ =>
       TypeCheckResult.TypeCheckFailure(
         s"input to function $prettyName should be array of struct type, not ${child.dataType}")
   }
 
   override def elementSchema: StructType = child.dataType match {
-    case ArrayType(et : StructType, _) => et
+    case ArrayType(st: StructType, _) => st
   }
 
+  override def collectionSchema: DataType = child.dataType
+
   private lazy val numFields = elementSchema.fields.length
 
   override def eval(input: InternalRow): TraversableOnce[InternalRow] = {
```
```diff
@@ -298,4 +353,8 @@ case class Inline(child: Expression) extends UnaryExpression with Generator with
         yield inputArray.getStruct(i, numFields)
     }
   }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    child.genCode(ctx)
+  }
 }
```
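The docstring examples in the diff show what each generator emits per input row. A plain-Scala model of those semantics (hypothetical helper names, structs modeled as two-field tuples): `explode` yields one output row per element, `posexplode` prefixes the element's position, and `inline` spreads each struct's fields across columns.

```scala
// explode(array(...)): one single-column row per element.
def explodeRows[T](arr: Seq[T]): Seq[Seq[Any]] = arr.map(Seq(_))

// posexplode(array(...)): (position, element) per row, positions from 0.
def posExplodeRows[T](arr: Seq[T]): Seq[Seq[Any]] =
  arr.zipWithIndex.map { case (v, pos) => Seq(pos, v) }

// inline(array(struct(...))): each struct's fields become output columns.
def inlineRows(arr: Seq[(Any, Any)]): Seq[Seq[Any]] =
  arr.map { case (a, b) => Seq(a, b) }
```

These reproduce the examples from the expression docstrings: `explode(array(10, 20))` gives rows 10 and 20, `posexplode(array(10, 20))` gives (0, 10) and (1, 20), and `inline(array(struct(1, 'a'), struct(2, 'b')))` gives (1, a) and (2, b).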
Review comments:

- "collectionType is better?"
- "By the way, do we need this interface? Adding a new interface makes the code more complicated, I think."
- Reply: "I needed a way to make sure that the collection based code path (iteration over ArrayData/MapData) can be easily identified. The other option would be to hard-code all Generators that support this code path, but that seemed just wrong to me."
- "+1 for collectionType"