
[SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style #27509

Closed
wants to merge 3 commits into from

Conversation

@Eric5553 (Contributor) commented Feb 9, 2020

What changes were proposed in this pull request?

The style of EXPLAIN FORMATTED output needs to be improved. We’ve already got some observations/ideas in
#27368 (comment)
#27368 (comment)

Observations/Ideas:

  1. Using a comma as the separator is not clear, especially since commas are also used inside the expressions.
  2. Show the column counts first? For example, Results [4]: …
  3. Currently the attribute names are automatically generated; these need to be refined.
  4. Add an arguments field in common implementations, as EXPLAIN EXTENDED does by calling argString in TreeNode.simpleString. This will eliminate most existing minor differences between EXPLAIN EXTENDED and EXPLAIN FORMATTED.
  5. Another improvement we can do: the generated alias shouldn't include the attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR currently addresses comments 2 & 4, and is open for more discussion on improving readability.
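Observation 1 is easy to reproduce: any tool (or reader) scanning the plan output has to track parenthesis depth once commas double as both field separators and argument delimiters. A toy Python sketch, illustrative only and not Spark code:

```python
def split_top_level(s, sep=","):
    """Split only on separators outside parentheses -- the kind of care a
    reader (or tool) needs once commas serve as both field separators and
    argument delimiters inside expressions."""
    parts, depth, cur = [], 0, []
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts

# A naive split(", ") would break collect_set's arguments apart;
# depth tracking keeps the expression intact.
print(split_top_level("key#x, collect_set(val#x, 0, 0)#y"))
```

This is why a plain comma-separated line is ambiguous, and why the counts-first bracketed style (`Results [4]: [...]`) is easier to read.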

Why are the changes needed?

The readability of EXPLAIN FORMATTED needs to be improved, which will help users better understand the query plan.

Does this PR introduce any user-facing change?

Yes, EXPLAIN FORMATTED output style changed.

How was this patch tested?

Updated the expected results of the test cases in explain.sql.

@Eric5553 (Contributor Author) commented Feb 9, 2020

@maropu (Member) commented Feb 9, 2020

ok to test

@maropu (Member) commented Feb 9, 2020

Thanks for your work, @Eric5553 !

@SparkQA commented Feb 10, 2020

Test build #118102 has finished for PR 27509 at commit 9d30fad.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Also cc @maryannxue @hvanhovell

@SparkQA commented Feb 10, 2020

Test build #118110 has finished for PR 27509 at commit 4d0fcff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 10, 2020

Test build #118138 has finished for PR 27509 at commit 4d0fcff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|Left keys : ${leftKeys}
|Right keys: ${rightKeys}
|Join condition : ${joinCondStr}
|${ExplainUtils.generateFieldString("Left keys", leftKeys)}
Contributor

It might not be related to this PR, but can we do the same thing as https://github.com/apache/spark/pull/27368/files#diff-ddb517fe44ae649ddda3c733c2adcb76R70 for joins? Just for symmetry and future handiness.

Contributor Author

Make HashJoin extend BinaryExecNode, and have ShuffledHashJoinExec/BroadcastHashJoinExec extend HashJoin, right? Yeah, I can make it here together :-)

Contributor

No, I meant creating a trait for all physical joins. It'll make pattern matching easier, although we don't have this requirement right now. We could do it in a follow-up.

Contributor Author

Oh, yeah. I just recalled the conversation, thanks for your explanation :-)
I'll submit a follow-up PR for joins accordingly.

@Eric5553 (Contributor Author) Feb 15, 2020

PR #27595 opened for this follow-up.
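The trait idea can be sketched roughly in Python. The real change is the Scala trait `BaseJoinExec` introduced in the follow-up (#27595); the Python class names below merely mirror the Spark operators for illustration:

```python
from abc import ABC

class BaseJoinExec(ABC):
    """Hypothetical common base for physical join operators, so that
    'is this node a join?' is a single type check rather than a case
    per concrete operator."""
    def __init__(self, join_type, condition=None):
        self.join_type = join_type
        self.condition = condition

class BroadcastHashJoinExec(BaseJoinExec): pass
class SortMergeJoinExec(BaseJoinExec): pass
class CartesianProductExec(BaseJoinExec): pass

def is_join(node):
    # One shared base type makes pattern matching over all joins trivial.
    return isinstance(node, BaseJoinExec)

print(is_join(SortMergeJoinExec("Inner")))  # True
```

This is the "symmetry and future handiness" argument: common fields (join type, condition) live in one place, and new code paths match on the base type instead of enumerating every join operator.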

s"""
|($operatorId) $nodeName $codegenIdStr
|Arguments: ${if (argStr != null && !argStr.isEmpty) argStr else "None"}
Contributor

Do we need to mute "Arguments" if no arguments instead of printing "None"?

Contributor Author

+1, thanks for the suggestion!

@Eric5553 (Contributor Author) commented Feb 14, 2020

Also trying to address the improvement #27368 (comment) here in this PR. I tried adding a new format string function for Expression, but that caused too many code changes and made it hard to resolve AttributeReference.toString, which may be built recursively by name.
Now I'm working on just refactoring the existing toString functions of Alias and AttributeReference.

@SparkQA commented Feb 14, 2020

Test build #118441 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 15, 2020

Test build #118458 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Eric5553 (Contributor Author)

retest this please

@SparkQA commented Feb 15, 2020

Test build #118480 has finished for PR 27509 at commit 9d52f92.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 16, 2020

Test build #118507 has finished for PR 27509 at commit f0bcf12.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

|($operatorId) $nodeName $codegenIdStr
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
@cloud-fan (Contributor) Feb 20, 2020

This is too hard to read. How about

if (argumentString.nonEmpty) {
  result + s"Arguments: $argumentString\n"
} else {
  result
}

Contributor Author

I see, thanks!

Contributor Author

Oh, I just remembered why I tried such a complicated approach here.
result + s"Arguments: $argumentString\n" will lead to unexpected padding before Arguments.
Thus, to avoid the padding without using an improper multiline string, maybe we can use
s"${result} |Arguments: $argumentString\n".stripMargin ?

Any other suggestions? Thanks @cloud-fan

Contributor

how about result.trim + s"Arguments: $argumentString\n"?

@Eric5553 (Contributor Author) Feb 20, 2020

It seems trim will break the first-line padding of result.

result.trim + s"\nArguments: $argumentString\n"
Shows

(3) ObjectHashAggregate 
Input [2]: [key#x, val#x]
Keys [1]: [key#x]
Functions [1]: [partial_collect_set(val#x, 0, 0)]
Aggregate Attributes [1]: [buf#x]
Results [2]: [key#x, buf#x]
     (4) Exchange 
Input [2]: [key#x, buf#x]
Arguments: hashpartitioning(key#x, 4), true, [id=#x]

Contributor

how about

val baseStr = s"($operatorId) $nodeName $codegenIdStr"
if (argumentString.nonEmpty) {
  s"""
    |$baseStr
    |Arguments: $argumentString
  """.stripMargin
} else {
  s"""
    |$baseStr
  """.stripMargin
}

Contributor Author

Yea, it works with better code style. Thanks a lot!
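The padding issue and the fix discussed above can be sketched with a rough Python analogue of Scala's stripMargin (illustrative only, not Spark code):

```python
def strip_margin(s, margin="|"):
    """Rough analogue of Scala's stripMargin: drop leading whitespace up to
    and including the margin character on each line."""
    out = []
    for line in s.splitlines():
        stripped = line.lstrip()
        out.append(stripped[1:] if stripped.startswith(margin) else line)
    return "\n".join(out)

base = "(4) Exchange"
argument_string = "hashpartitioning(key#x, 4), true, [id=#x]"

# Rebuilding the whole block per branch (the form settled on above) keeps
# every line anchored to a margin, so no stray indentation leaks in --
# unlike appending to an already-rendered multiline string.
if argument_string:
    block = strip_margin(f"""
|{base}
|Arguments: {argument_string}
""")
else:
    block = strip_margin(f"""
|{base}
""")
print(block)
```

The point of the final Scala version is the same: every line of the output passes through stripMargin once, rather than concatenating a pre-stripped block with an unstripped suffix.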

case iter: Iterable[_] => s"${fieldName} [${iter.size}]: ${iter.mkString("[", ", ", "]")}"
case str: String if (str == null || str.isEmpty) => s"${fieldName}: None"
case str: String => s"${fieldName}: ${str}"
case _ => s"${fieldName}: Unknown"
Contributor

This is not expected. Shall we just throw an exception here?

Contributor Author

Sure
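The behavior agreed on above — counts-first rendering for collections, plain strings passed through, and a hard failure for anything unexpected — can be sketched in Python (illustrative only; the real code is ExplainUtils.generateFieldString in Scala):

```python
def generate_field_string(field_name, value):
    """Render a plan field: collections get a counts-first header,
    strings pass through (empty becomes 'None'), anything else is a bug."""
    if isinstance(value, (list, tuple, set)):
        items = ", ".join(str(v) for v in value)
        return f"{field_name} [{len(value)}]: [{items}]"
    if isinstance(value, str):
        return f"{field_name}: {value if value else 'None'}"
    # Per the review: an unexpected type should fail loudly,
    # not silently print "Unknown".
    raise TypeError(f"unexpected value for {field_name}: {value!r}")

print(generate_field_string("Input", ["key#x", "val#x"]))  # Input [2]: [key#x, val#x]
```

Note the counts-first bracket (`Input [2]: ...`) is exactly observation 2 from the PR description, and raising instead of returning "Unknown" reflects the reviewer's suggestion.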

|${ExplainUtils.generateFieldString("Output", producedAttributes)}
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
Contributor

|${ExplainUtils.generateFieldString("Input", child.output)}
""".stripMargin
if (argumentString != null && !argumentString.isEmpty) {
s"""${result} |Arguments: $argumentString\n""".stripMargin
Contributor

ditto

Contributor

let's not use multiline string if it's not multiline

@cloud-fan (Contributor)

File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 282, in pyspark.sql.dataframe.DataFrame.explain
Failed example:
    df.explain(mode="formatted")
Differences (ndiff with -expected +actual):
      == Physical Plan ==
      * Scan ExistingRDD (1)
    + <BLANKLINE>
    + <BLANKLINE>
      (1) Scan ExistingRDD [codegen id : 1]
    - Output: [age#0, name#1]
    + Output [2]: [age#0, name#1]
    ?       ++++
    + Arguments: [age#0, name#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
    + <BLANKLINE>
    + <BLANKLINE>
**********************************************************************
   1 of   3 in pyspark.sql.dataframe.DataFrame.explain
***Test Failed*** 1 failures.

@Eric5553 can you fix the PySpark test? You should update the explain method in dataframe.py.

>>> df.explain(mode="formatted")
        == Physical Plan ==
        * Scan ExistingRDD (1)
        (1) Scan ExistingRDD [codegen id : 1]
        Output: [age#0, name#1]

        .. versionchanged:: 3.0.0
           Added optional argument `mode` to specify the expected output format of plans.
        """

@Eric5553 (Contributor Author) commented Feb 20, 2020

@cloud-fan Sorry for missing the failed unit test (I just found where to get the Jenkins error log...). I'll fix the PySpark test and address the comments today.

Also, I have an implementation that removes the useless #{exprId.id}, but it needs a lot of changes to existing tests. Should I commit it here in this PR or open a separate one? Thanks

@cloud-fan (Contributor)

Let's open a new PR for that.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Feb 21, 2020

Test build #118745 has finished for PR 27509 at commit 1290cd5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Eric5553 (Contributor Author)

retest this please

1 similar comment
@cloud-fan (Contributor)

retest this please

@SparkQA commented Feb 21, 2020

Test build #118765 has finished for PR 27509 at commit 1290cd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 1f0300f Feb 21, 2020
cloud-fan pushed a commit that referenced this pull request Feb 21, 2020
### What changes were proposed in this pull request?
The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in
#27368 (comment)
#27368 (comment)

Observations/Ideas:
1. Using comma as the separator is not clear, especially commas are used inside the expressions too.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated, this need to refined.
4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between
`EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`.
5. Another improvement we can do is: the generated alias shouldn't include attribute id. collect_set(val, 0, 0)#123 looks clearer than collect_set(val#456, 0, 0)#123

This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability.

### Why are the changes needed?
The readability of `EXPLAIN FORMATTED` need to be improved, which will help user better understand the query plan.

### Does this PR introduce any user-facing change?
Yes, `EXPLAIN FORMATTED` output style changed.

### How was this patch tested?
Update expect results of test cases in explain.sql

Closes #27509 from Eric5553/ExplainFormattedRefine.

Authored-by: Eric Wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 1f0300f)
Signed-off-by: Wenchen Fan <[email protected]>
@Eric5553 (Contributor Author)

Thanks so much! @cloud-fan @maryannxue @maropu @gatorsmile

cloud-fan pushed a commit that referenced this pull request Feb 28, 2020
### What changes were proposed in this pull request?
Currently the join operators are not well abstracted, since there is a lot of common logic. A trait can be created for easier pattern matching and other future handiness. This is a follow-up PR based on comment
#27509 (comment) .

This PR refined from the following aspects:
1. Refined structure of all physical join operators
2. Add missing joinType field for CartesianProductExec operator
3. Refined codes related to Explain Formatted

The EXPLAIN FORMATTED changes are
1. Converge all join operator `verboseStringWithOperatorId` implementations to `BaseJoinExec`. Join condition displayed, and join keys displayed if it’s not empty.
2. `#1` will add Join condition to `BroadcastNestedLoopJoinExec`.
3. `#1` will **NOT** affect `CartesianProductExec`, `SortMergeJoin` and `HashJoin`s, since they already have their own override implementations.
4. Converge all join operator `simpleStringWithNodeId` to `BaseJoinExec`, which will enhance the one line description for `CartesianProductExec` with `JoinType` added.
5. Override `simpleStringWithNodeId` in `BroadcastNestedLoopJoinExec` to show `BuildSide`, which was only done for `HashJoin`s before.

### Why are the changes needed?
Make the code consistent with other operators and for future handiness of join operators.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing tests

Closes #27595 from Eric5553/RefineJoin.

Authored-by: Eric Wu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@Eric5553 Eric5553 deleted the ExplainFormattedRefine branch March 13, 2020 06:51
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
(Same commit message as the merge above.)
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
(Same commit message as the follow-up merge above.)
@@ -243,7 +243,7 @@ case class FilterExec(condition: Expression, child: SparkPlan)
override def verboseStringWithOperatorId(): String = {
s"""
|(${ExplainUtils.getOpId(this)}) $nodeName ${ExplainUtils.getCodegenId(this)}
|Input : ${child.output.mkString("[", ", ", "]")}
|${ExplainUtils.generateFieldString("Input", child.output)}
|Condition : ${condition}
Member

Can we remove the space before ":"?

(3) Filter [codegen id : 1]
Input [1]: [col.dots#22]
Condition : (isnotnull(col.dots#22) AND (col.dots#22 = 500))

@@ -76,7 +76,7 @@ trait DataSourceScanExec extends LeafExecNode {

s"""
|(${ExplainUtils.getOpId(this)}) $nodeName ${ExplainUtils.getCodegenId(this)}
|Output: ${producedAttributes.mkString("[", ", ", "]")}
|${ExplainUtils.generateFieldString("Output", producedAttributes)}
|${metadataStr.mkString("\n")}
Member

These changes are only for DSV1. Could we make the corresponding changes when using DSV2? Open the ticket https://issues.apache.org/jira/browse/SPARK-31480

Also, please check the output when the schema is very long. For example, containing 250+ columns.

7 participants