[SPARK-35384][SQL] Improve performance for InvokeLike.invoke #32527

sunchao · 2021-05-12T23:20:18Z

What changes were proposed in this pull request?

Replace the map call in InvokeLike.invoke with a while loop to improve performance, following the Spark style guide.

Why are the changes needed?

InvokeLike.invoke, which is used in the non-codegen path for Invoke and StaticInvoke, currently uses map to evaluate arguments:

val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
  // return null if one of arguments is null
  null
} else { 
  ...

which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.
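For illustration only, here is a minimal sketch of the two evaluation styles side by side. The `Expr` trait and the `EvalArgsSketch` object are hypothetical stand-ins, not Catalyst's real classes; the loop body mirrors the shape of the change in this PR.

```scala
// Hypothetical stand-in for a Catalyst expression; not Spark's real API.
trait Expr {
  def eval(input: Any): Any
}

object EvalArgsSketch {
  // map-based version: builds a new collection on every call.
  def evalWithMap(arguments: Seq[Expr], input: Any): Seq[Object] =
    arguments.map(e => e.eval(input).asInstanceOf[Object])

  // while-loop version: writes each result into a caller-supplied array,
  // so no intermediate collection is created per call.
  def evalWithWhile(arguments: IndexedSeq[Expr], input: Any, out: Array[Object]): Unit = {
    var i = 0
    val len = arguments.length
    while (i < len) {
      out(i) = arguments(i).eval(input).asInstanceOf[Object]
      i += 1
    }
  }
}
```

Both versions produce the same values; the difference is only in how much work surrounds each `eval` call.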

Results from V2FunctionBenchmark show an improvement of as much as 3x in per-row time for the magic-method cases:

Before

 OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
 native_long_add                                                                         36506          36656         251         13.7          73.0       1.0X
 java_long_add_default                                                                   47151          47540         370         10.6          94.3       0.8X
 java_long_add_magic                                                                    178691         182457        1327          2.8         357.4       0.2X
 java_long_add_static_magic                                                             177151         178258        1151          2.8         354.3       0.2X

After

 OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 scalar function (long + long) -> long, result_nullable = false codegen = false:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 --------------------------------------------------------------------------------------------------------------------------------------------------------------
 native_long_add                                                                         29897          30342         568         16.7          59.8       1.0X
 java_long_add_default                                                                   40628          41075         664         12.3          81.3       0.7X
 java_long_add_magic                                                                     54553          54755         182          9.2         109.1       0.5X
 java_long_add_static_magic                                                              55410          55532         127          9.0         110.8       0.5X
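
As a quick check of the "as much as 3x" figure, the per-row speedup can be computed as time before divided by time after. The sketch below copies the Per Row(ns) column from the two tables above; the object and method names are made up for this example.

```scala
object SpeedupSketch {
  // (benchmark case, per-row ns before, per-row ns after), copied from the tables above.
  val perRowNs: Seq[(String, Double, Double)] = Seq(
    ("native_long_add", 73.0, 59.8),
    ("java_long_add_default", 94.3, 81.3),
    ("java_long_add_magic", 357.4, 109.1),
    ("java_long_add_static_magic", 354.3, 110.8)
  )

  // Speedup = per-row time before / per-row time after.
  def speedups: Seq[(String, Double)] =
    perRowNs.map { case (name, before, after) => (name, before / after) }

  def main(args: Array[String]): Unit =
    speedups.foreach { case (name, s) => println(s"$name: $s") }
}
```

The two magic-method cases come out at roughly 3.3x and 3.2x, while the other two rows improve by about 1.2x.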

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

SparkQA · 2021-05-13T00:11:16Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42996/

SparkQA · 2021-05-13T00:11:18Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42996/

sunchao · 2021-05-13T00:56:12Z

cc @HyukjinKwon @cloud-fan @dongjoon-hyun @viirya @maropu

maropu

Looks fine otherwise.

maropu · 2021-05-13T01:13:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+    var i = 0
+    val len = arguments.length
+    while (i < len) {
+      val e = arguments(i)


It looks like we don't need this intermediate val?

evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]

yea let me remove it

maropu · 2021-05-13T01:13:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+      evaluatedArgs(i) = e.eval(input).asInstanceOf[Object]
+      i += 1
+    }
+    if (needNullCheck && evaluatedArgs.exists(_ == null)) {


nit: my IDE suggests .exists(_ == null) -> .contains(null)

the exists part is not related to this PR but I'm happy to change it :)
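
A small sketch of the nit, assuming an `Array[Object]` like the one in the diff: both spellings answer the same question, namely whether any element is null. The helper names are hypothetical.

```scala
object NullCheckSketch {
  // Predicate form, as in the original code.
  def hasNullExists(args: Array[Object]): Boolean = args.exists(_ == null)

  // Equivalent form suggested by the IDE.
  def hasNullContains(args: Array[Object]): Boolean = args.contains(null)
}
```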

viirya · 2021-05-13T02:02:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

+    var i = 0
+    val len = arguments.length
+    while (i < len) {
+      evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]


Why do we need to keep evaluatedArgs as a (lazy) val?

You mean just use val?

Don't we evaluate the arguments each time invoke is called? Why not just have val evaluatedArgs: Array[Object] = new Array[Object](arguments.length) in invoke?

I guess it aims to reuse the Array[Object] itself and only change the values in the array.

Yea even though we evaluate arguments for each invoke call we can reuse the same array to store the results of evaluation. I guess it's better than allocating a new Array[Object] for each input row.
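
To make the reuse argument concrete, here is a minimal sketch, assuming a simplified evaluator whose arguments are plain functions. `ReusingInvoker` and `evalArgs` are hypothetical names; only `evaluatedArgs` and the loop shape come from the diff.

```scala
// Hypothetical simplified evaluator; not Spark's InvokeLike.
class ReusingInvoker(arguments: IndexedSeq[Any => Any]) {
  // Allocated once on first use, then overwritten on every call.
  private lazy val evaluatedArgs: Array[Object] = new Array[Object](arguments.length)

  def evalArgs(input: Any): Array[Object] = {
    var i = 0
    val len = arguments.length
    while (i < len) {
      evaluatedArgs(i) = arguments(i)(input).asInstanceOf[Object]
      i += 1
    }
    // The same array instance is returned on every call.
    evaluatedArgs
  }
}
```

In this sketch the returned array is overwritten by the next call, so a caller has to consume the values before evaluating the next row.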

SparkQA · 2021-05-13T02:12:56Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43000/

SparkQA · 2021-05-13T02:19:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43000/

dongjoon-hyun

+1, LGTM. The improvement looks nice!

SparkQA · 2021-05-13T03:47:47Z

Test build #138475 has finished for PR 32527 at commit 9ce2542.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-05-13T03:58:01Z

Thank you, @sunchao and all! Merged to master for Apache Spark 3.2.0.

cloud-fan · 2021-05-13T04:49:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

      // return null if one of arguments is null
      null
    } else {
      val ret = try {
-        method.invoke(obj, args: _*)
+        method.invoke(obj, evaluatedArgs: _*)
      } catch {


Can we also improve the last piece?

val boxedClass = ScalaReflection.typeBoxedJavaMapping.get(dataType)
if (boxedClass.isDefined) {
  boxedClass.get.cast(ret)
} else {
  ret
}

We can create a function for it

private lazy val boxing: Any => Any =
  ScalaReflection.typeBoxedJavaMapping.get(dataType)
    .map(cls => (v: Any) => cls.cast(v))
    .getOrElse(identity)

We can do a similar thing in Invoke.eval.

Yea let me try it. In the profiling after this PR, HashMap.get takes 7.82% of the entire invoke call, so it seems worthwhile to do this.
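
A rough sketch of the idea being discussed: do the map lookup once and keep the resulting function, instead of looking it up on every call. `boxedMapping` is a hypothetical stand-in for ScalaReflection.typeBoxedJavaMapping, keyed by a type name rather than a DataType, and the method names are made up.

```scala
object BoxingSketch {
  // Hypothetical stand-in for ScalaReflection.typeBoxedJavaMapping.
  val boxedMapping: Map[String, Class[_]] = Map(
    "long" -> classOf[java.lang.Long],
    "int" -> classOf[java.lang.Integer]
  )

  // Per-call version: one map lookup for every returned value.
  def boxPerCall(typeName: String, ret: Any): Any = {
    val boxedClass = boxedMapping.get(typeName)
    if (boxedClass.isDefined) boxedClass.get.cast(ret.asInstanceOf[AnyRef]) else ret
  }

  // Precomputed version: the lookup happens once, and the returned
  // function can be stored (for example in a lazy val) and reused.
  def boxingFor(typeName: String): Any => Any =
    boxedMapping.get(typeName) match {
      case Some(cls) => (v: Any) => cls.cast(v.asInstanceOf[AnyRef])
      case None => (v: Any) => v
    }
}
```

Calling `boxingFor` once per expression and the returned function once per row keeps the map lookup out of the per-row path.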

I'm not sure if we can do a similar thing in Invoke.eval though, since obj in obj.getClass.getMethod(functionName, argClasses: _*) is different for each call.

You are right. Another idea: obj values from InternalRow are always of the same class, so we can avoid this:

@transient lazy val method = {
  val cls = targetObject.dataType match {
    case ObjectType(cls) => cls
    case StringType => classOf[UTF8String]
    case _: DecimalType => classOf[Decimal]
    ...
  }
  findMethod(cls, encodedFunctionName, argClasses)
}

Hmm I'm not sure. Looking at usages of Invoke, it seems targetObject.dataType is usually ObjectType (for instance, in ScalarFunction we wrap the UDF into a Literal with ObjectType), so curious how useful this would be and when we'd use StringType/DecimalType for the targetObject.

Looking at the profiling result for Invoke.eval, it is now dominated by InvokeLike.invoke:

This is somewhat unrelated to the above, though: V2FunctionBenchmark (and ScalarFunction) use ObjectType for Invoke, so that case is already handled by the current code:

@transient lazy val method = targetObject.dataType match {
  case ObjectType(cls) => Some(findMethod(cls, encodedFunctionName, argClasses))
  case _ => None
}

we may need new benchmarks if we decide to do this.

Makes sense. For UDFs, it's just an extra method.isDefined check, and probably not a big issue.
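
The trade-off in this thread can be sketched as follows, assuming a simplified invoker that only handles zero-argument methods. `CachedMethodInvoker` is a hypothetical class using plain Java reflection; it is not Spark's Invoke expression.

```scala
import java.lang.reflect.Method

// Hypothetical simplified invoker; not Spark's Invoke.
class CachedMethodInvoker(knownClass: Option[Class[_]], functionName: String) {
  // Resolved once when the target class is known up front.
  private lazy val method: Option[Method] = knownClass.map(_.getMethod(functionName))

  def invoke(obj: AnyRef): AnyRef = {
    // The isDefined check is the only extra per-call cost on the cached path;
    // otherwise the method is looked up from the runtime class on every call.
    val m = if (method.isDefined) method.get else obj.getClass.getMethod(functionName)
    m.invoke(obj)
  }
}
```

For example, `new CachedMethodInvoker(Some(classOf[String]), "length").invoke("spark")` resolves `length` once, while passing `None` repeats the lookup on each call.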

SparkQA · 2021-05-13T05:53:05Z

Test build #138480 has finished for PR 32527 at commit 2831f9c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

first commit

9ce2542

github-actions bot added the SQL label May 12, 2021

maropu reviewed May 13, 2021

comments

2831f9c

HyukjinKwon approved these changes May 13, 2021

maropu approved these changes May 13, 2021

viirya reviewed May 13, 2021

dongjoon-hyun approved these changes May 13, 2021

viirya approved these changes May 13, 2021

dongjoon-hyun closed this in 0ab9bd7 May 13, 2021

cloud-fan reviewed May 13, 2021
