
[SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style #27509

Closed
wants to merge 3 commits into from

Conversation

@Eric5553 (Contributor) commented Feb 9, 2020

What changes were proposed in this pull request?

The style of EXPLAIN FORMATTED output needs to be improved. We’ve already got some observations/ideas in
#27368 (comment)
#27368 (comment)

Observations/Ideas:

  1. Using a comma as the separator is not clear, especially since commas are also used inside the expressions.
  2. Show the column counts first? For example, Results [4]: …
  3. Currently the attribute names are automatically generated; these need to be refined.
  4. Add an arguments field in common implementations, as EXPLAIN EXTENDED does by calling argString in TreeNode.simpleString. This will eliminate most existing minor differences between EXPLAIN EXTENDED and EXPLAIN FORMATTED.
  5. Another improvement we can do: the generated alias shouldn't include the attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR currently addresses comments 2 & 4, and is open for more discussion on improving readability.
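Observation 1 is easy to reproduce: any tool (or reader) scanning the plan output has to track parenthesis depth once commas double as both field separators and argument delimiters. A toy Python sketch, illustrative only and not Spark code:

```python
def split_top_level(s, sep=","):
    """Split only on separators outside parentheses -- the kind of care a
    reader (or tool) needs once commas serve as both field separators and
    argument delimiters inside expressions."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts

# A naive split(", ") would break collect_set's arguments apart;
# depth tracking keeps the expression intact.
print(split_top_level("key#x, collect_set(val#x, 0, 0)#y"))
```

This is why a plain comma-separated line is ambiguous, and why the counts-first bracketed style (`Results [4]: [...]`) is easier to read.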

Why are the changes needed?

The readability of EXPLAIN FORMATTED needs to be improved, which will help users better understand the query plan.

Does this PR introduce any user-facing change?

Yes, EXPLAIN FORMATTED output style changed.

How was this patch tested?

Updated the expected results of the test cases in explain.sql.

@Eric5553 (Contributor Author) commented Feb 9, 2020

@maropu (Member) commented Feb 9, 2020

ok to test

@maropu (Member) commented Feb 9, 2020

Thanks for your work, @Eric5553 !

@SparkQA commented Feb 10, 2020

Test build #118102 has finished for PR 27509 at commit 9d30fad.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Also cc @maryannxue @hvanhovell

@SparkQA commented Feb 10, 2020

Test build #118110 has finished for PR 27509 at commit 4d0fcff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 10, 2020

Test build #118138 has finished for PR 27509 at commit 4d0fcff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|Left keys : ${leftKeys}
|Right keys: ${rightKeys}
|Join condition : ${joinCondStr}
|${ExplainUtils.generateFieldString("Left keys", leftKeys)}
Contributor

It might not be related to this PR, but can we do the same thing as https://github.com/apache/spark/pull/27368/files#diff-ddb517fe44ae649ddda3c733c2adcb76R70 for joins? Just for symmetry and future handiness.

Contributor Author

Make HashJoin extend BinaryExecNode, and have ShuffledHashJoinExec/BroadcastHashJoinExec extend HashJoin, right? Yeah, I can make it here together :-)

Contributor

No, I meant creating a trait for all physical joins. It'll make pattern matching easier, although we don't have this requirement right now. We could do it in a follow-up.

Contributor Author

Oh, yeah. I just recalled the conversation, thanks for your explanation :-)
I'll submit a follow-up PR for joins accordingly.

@Eric5553 (Contributor Author) Feb 15, 2020

PR #27595 opened for this follow-up.
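The trait idea can be sketched roughly in Python. The real change is the Scala trait `BaseJoinExec` introduced in the follow-up (#27595); the Python class names below merely mirror the Spark operators for illustration:

```python
from abc import ABC

class BaseJoinExec(ABC):
    """Hypothetical common base for physical join operators, so that
    'is this node a join?' is a single type check rather than a case
    per concrete operator."""
    def __init__(self, join_type, condition=None):
        self.join_type = join_type
        self.condition = condition

class BroadcastHashJoinExec(BaseJoinExec): pass
class SortMergeJoinExec(BaseJoinExec): pass
class CartesianProductExec(BaseJoinExec): pass

def is_join(node):
    # One shared base type makes pattern matching over all joins trivial.
    return isinstance(node, BaseJoinExec)

print(is_join(SortMergeJoinExec("Inner")))  # True
```

This is the "symmetry and future handiness" argument: common fields (join type, condition) live in one place, and new code paths match on the base type instead of enumerating every join operator.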

s"""
|($operatorId) $nodeName $codegenIdStr
|Arguments: ${if (argStr != null && !argStr.isEmpty) argStr else "None"}
Contributor

Do we need to mute "Arguments" if no arguments instead of printing "None"?

Contributor Author

+1, thanks for the suggestion!

@Eric5553 (Contributor Author) commented Feb 14, 2020

Also trying to address the improvement #27368 (comment) here in this PR. I tried adding a new format string function for Expression, but that caused too many code changes and made it hard to resolve AttributeReference.toString, which may be built recursively by name.
Now I'm working on just refactoring the existing toString functions of Alias and AttributeReference.

@SparkQA commented Feb 14, 2020

Test build #118441 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 15, 2020

Test build #118458 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Eric5553 (Contributor Author)

retest this please

@SparkQA commented Feb 15, 2020

Test build #118480 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 16, 2020

Test build #118507 has finished for PR 27509 at commit f0bcf12.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|($operatorId) $nodeName $codegenIdStr
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
@cloud-fan (Contributor) Feb 20, 2020

This is too hard to read. How about

if (argumentString.nonEmpty) {
  result + s"Arguments: $argumentString\n"
} else {
  result
}

Contributor Author

I see, thanks!

Contributor Author

Oh, I just remembered why I tried such a complicated approach here.
result + s"Arguments: $argumentString\n" will lead to unexpected padding before Arguments.
Thus, to avoid the padding without using an improper multiline string, maybe we can use
s"${result} |Arguments: $argumentString\n".stripMargin ?

Any other suggestions? Thanks @cloud-fan

Contributor

how about result.trim + s"Arguments: $argumentString\n"?

@Eric5553 (Contributor Author) Feb 20, 2020

It seems trim will break the first-line padding of result.

result.trim + s"\nArguments: $argumentString\n"
Shows

(3) ObjectHashAggregate 
Input [2]: [key#x, val#x]
Keys [1]: [key#x]
Functions [1]: [partial_collect_set(val#x, 0, 0)]
Aggregate Attributes [1]: [buf#x]
Results [2]: [key#x, buf#x]
     (4) Exchange 
Input [2]: [key#x, buf#x]
Arguments: hashpartitioning(key#x, 4), true, [id=#x]

Contributor

how about

val baseStr = s"($operatorId) $nodeName $codegenIdStr"
if (argumentString.nonEmpty) {
  s"""
    |$baseStr
    |Arguments: $argumentString
  """.stripMargin
} else {
  s"""
    |$baseStr
  """.stripMargin
}

Contributor Author

Yea, it works with better code style. Thanks a lot!
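The padding issue and the fix discussed above can be sketched with a rough Python analogue of Scala's stripMargin (illustrative only, not Spark code):

```python
def strip_margin(s, margin="|"):
    """Rough analogue of Scala's stripMargin: drop leading whitespace up to
    and including the margin character on each line."""
    out = []
    for line in s.splitlines():
        stripped = line.lstrip()
        out.append(stripped[1:] if stripped.startswith(margin) else line)
    return "\n".join(out)

base = "(4) Exchange"
argument_string = "hashpartitioning(key#x, 4), true, [id=#x]"

# Rebuilding the whole block per branch (the form settled on above) keeps
# every line anchored to a margin, so no stray indentation leaks in --
# unlike appending to an already-rendered multiline string.
if argument_string:
    block = strip_margin(f"""
|{base}
|Arguments: {argument_string}
""")
else:
    block = strip_margin(f"""
|{base}
""")
print(block)
```

The point of the final Scala version is the same: every line of the output passes through stripMargin once, rather than concatenating a pre-stripped block with an unstripped suffix.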

case iter: Iterable[_] => s"${fieldName} [${iter.size}]: ${iter.mkString("[", ", ", "]")}"
case str: String if (str == null || str.isEmpty) => s"${fieldName}: None"
case str: String => s"${fieldName}: ${str}"
case _ => s"${fieldName}: Unknown"
Contributor

This is not expected. Shall we just throw an exception here?

Contributor Author

Sure
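The behavior agreed on above — counts-first rendering for collections, plain strings passed through, and a hard failure for anything unexpected — can be sketched in Python (illustrative only; the real code is ExplainUtils.generateFieldString in Scala):

```python
def generate_field_string(field_name, value):
    """Render a plan field: collections get a counts-first header,
    strings pass through (empty becomes 'None'), anything else is a bug."""
    if isinstance(value, (list, tuple, set)):
        items = ", ".join(str(v) for v in value)
        return f"{field_name} [{len(value)}]: [{items}]"
    if isinstance(value, str):
        return f"{field_name}: {value if value else 'None'}"
    # Per the review: an unexpected type should fail loudly,
    # not silently print "Unknown".
    raise TypeError(f"unexpected value for {field_name}: {value!r}")

print(generate_field_string("Input", ["key#x", "val#x"]))  # Input [2]: [key#x, val#x]
```

Note the counts-first bracket (`Input [2]: ...`) is exactly observation 2 from the PR description, and raising instead of returning "Unknown" reflects the reviewer's suggestion.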

|${ExplainUtils.generateFieldString("Output", producedAttributes)}
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
Contributor

|${ExplainUtils.generateFieldString("Input", child.output)}
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
Contributor

ditto

Contributor

let's not use multiline string if it's not multiline

@cloud-fan (Contributor)

File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 282, in pyspark.sql.dataframe.DataFrame.explain
Failed example:
    df.explain(mode="formatted")
Differences (ndiff with -expected +actual):
      == Physical Plan ==
      * Scan ExistingRDD (1)
    + <BLANKLINE>
    + <BLANKLINE>
      (1) Scan ExistingRDD [codegen id : 1]
    - Output: [age#0, name#1]
    + Output [2]: [age#0, name#1]
    ?       ++++
    + Arguments: [age#0, name#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
    + <BLANKLINE>
    + <BLANKLINE>
**********************************************************************
   1 of   3 in pyspark.sql.dataframe.DataFrame.explain
***Test Failed*** 1 failures.

@Eric5553 can you fix the PySpark test? You should update the explain method in dataframe.py.

>>> df.explain(mode="formatted")
        == Physical Plan ==
        * Scan ExistingRDD (1)
        (1) Scan ExistingRDD [codegen id : 1]
        Output: [age#0, name#1]

        .. versionchanged:: 3.0.0
           Added optional argument `mode` to specify the expected output format of plans.
        """

@Eric5553 (Contributor Author) commented Feb 20, 2020

@cloud-fan Sorry for missing the failed unit test (I just found where to get the Jenkins error log...). I'll fix the PySpark test and address the comments today.

Also, I have an implementation that removes the useless #{exprId.id}, but it needs a lot of changes to existing tests. Should I commit it here in this PR or open a separate one? Thanks

@cloud-fan (Contributor)

Let's open a new PR for that.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Feb 21, 2020

Test build #118745 has finished for PR 27509 at commit 1290cd5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Eric5553 (Contributor Author)

retest this please

1 similar comment
@cloud-fan (Contributor)

retest this please

@SparkQA commented Feb 21, 2020

Test build #118765 has finished for PR 27509 at commit 1290cd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 1f0300f Feb 21, 2020
cloud-fan pushed a commit that referenced this pull request Feb 21, 2020
### What changes were proposed in this pull request?
The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in
#27368 (comment)
#27368 (comment)

Observations/Ideas:
1. Using comma as the separator is not clear, especially commas are used inside the expressions too.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated, this need to refined.
4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between
`EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`.
5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability.

### Why are the changes needed?
The readability of `EXPLAIN FORMATTED` need to be improved, which will help user better understand the query plan.

### Does this PR introduce any user-facing change?
Yes, `EXPLAIN FORMATTED` output style changed.

### How was this patch tested?
Update expect results of test cases in explain.sql

Closes #27509 from Eric5553/ExplainFormattedRefine.

Authored-by: Eric Wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1f0300f)
Signed-off-by: Wenchen Fan <[email protected]>
@Eric5553 (Contributor Author)

Thanks so much! @cloud-fan @maryannxue @maropu @gatorsmile

cloud-fan pushed a commit that referenced this pull request Feb 28, 2020
### What changes were proposed in this pull request?
Currently the join operators are not well abstracted, since there is a lot of common logic. A trait can be created for easier pattern matching and other future handiness. This is a follow-up PR based on comment
#27509 (comment) .

This PR refined from the following aspects:
1. Refined structure of all physical join operators
2. Add missing joinType field for CartesianProductExec operator
3. Refined codes related to Explain Formatted

The EXPLAIN FORMATTED changes are
1. Converge all join operator `verboseStringWithOperatorId` implementations to `BaseJoinExec`. Join condition displayed, and join keys displayed if it’s not empty.
2. `#1` will add Join condition to `BroadcastNestedLoopJoinExec`.
3. `#1` will **NOT** affect `CartesianProductExec`, `SortMergeJoin` and `HashJoin`s, since they already have their own override implementations.
4. Converge all join operator `simpleStringWithNodeId` to `BaseJoinExec`, which will enhance the one line description for `CartesianProductExec` with `JoinType` added.
5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show `BuildSide`, which was only done for `HashJoin`s before.

### Why are the changes needed?
Make the code consistent with other operators and for future handiness of join operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27595 from Eric5553/RefineJoin.

Authored-by: Eric Wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@Eric5553 Eric5553 deleted the ExplainFormattedRefine branch March 13, 2020 06:51
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
(Same commit message as the merge above.)
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
(Same commit message as the follow-up merge above.)
@@ -243,7 +243,7 @@ case class FilterExec(condition: Expression, child: SparkPlan)
override def verboseStringWithOperatorId(): String = {
s"""
|(${ExplainUtils.getOpId(this)}) $nodeName ${ExplainUtils.getCodegenId(this)}
|Input : ${child.output.mkString("[", ", ", "]")}
|${ExplainUtils.generateFieldString("Input", child.output)}
|Condition : ${condition}
Member

Can we remove the space before ":"?

(3) Filter [codegen id : 1]
Input [1]: [col.dots#22]
Condition : (isnotnull(col.dots#22) AND (col.dots#22 = 500))

@@ -76,7 +76,7 @@ trait DataSourceScanExec extends LeafExecNode {

s"""
|(${ExplainUtils.getOpId(this)}) $nodeName ${ExplainUtils.getCodegenId(this)}
|Output: ${producedAttributes.mkString("[", ", ", "]")}
|${ExplainUtils.generateFieldString("Output", producedAttributes)}
|${metadataStr.mkString("\n")}
Member

These changes are only for DSV1. Could we make the corresponding changes when using DSV2? Open the ticket https://issues.apache.org/jira/browse/SPARK-31480

Also, please check the output when the schema is very long. For example, containing 250+ columns.

7 participants