[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema #28392

maropu · 2020-04-28T10:21:20Z

What changes were proposed in this pull request?

This PR updates sql in Rand/Randn so that the column name is deterministic when no argument is given.

Before this PR (the column name changes from run to run):

scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+

After this PR (the column name is fixed):

scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed is given, it is still shown in the column name
// (the same as the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]

Note: This fix was prompted by #28194; that ongoing PR tests the output schema of expressions, so the schemas must be deterministic for its tests.

Why are the changes needed?

To make the output schema deterministic.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

maropu · 2020-04-28T10:21:44Z

This is a minor fix, but how about this? @dongjoon-hyun @HyukjinKwon

HyukjinKwon · 2020-04-28T12:42:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

@@ -102,6 +102,8 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed {
  }

  override def freshCopy(): Rand = Rand(child)
+
+  override def sql: String = "rand()"


Hmmm .. shouldn't we print out the seed when it's explicitly given?

maropu: Ur, I see... I just want to remove the non-deterministic number from the column name when no argument is given. I need more time to think about it...

srowen: I think maybe the current output is useful. Yes, the seed was randomly chosen, but you might want to know what it was.

maropu: Yea, I also think it's important that we can check seeds, but isn't df.explain enough for checking them? Actually, the other two expressions with random seeds (shuffle and uuid) don't display them in column names:

scala> sql("select shuffle(array(1, 2))").show()
+--------------------+
|shuffle(array(1, 2))|
+--------------------+
|              [2, 1]|
+--------------------+

scala> sql("select shuffle(array(1, 2))").explain()
== Physical Plan ==
*(1) Project [shuffle([1,2], Some(894779230406706679)) AS shuffle(array(1, 2))#14]
+- *(1) Scan OneRowRelation[]

scala> sql("select uuid()").show()
+--------------------+
|              uuid()|
+--------------------+
|dde93891-8a95-4e9...|
+--------------------+

scala> sql("select uuid()").explain()
== Physical Plan ==
*(1) Project [uuid(Some(4613707233104825008)) AS uuid()#23]
+- *(1) Scan OneRowRelation[]

srowen: Fair point, consistency is good. I don't feel strongly either way.

SparkQA · 2020-04-28T14:46:38Z

Test build #121990 has finished for PR 28392 at commit 7216511.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-04-28T17:39:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

@@ -102,6 +105,11 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed {
  }

  override def freshCopy(): Rand = Rand(child)


What happens when we do freshCopy? Do we need to propagate the useRandSeed field?

maropu: Oh, yes. Nice catch!
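The concern raised here can be illustrated with a small, Spark-free sketch. `RandNode`, its `seed` field, and `freshCopyDroppingFlag` are hypothetical names introduced only for illustration; they are not the real Catalyst classes, and the real expression holds a child expression rather than a plain `Long`. The sketch only shows what happens to the column name when a copy does or does not pass the flag along.

```scala
// Simplified stand-in for the Rand expression; not the real Catalyst class.
// `seed` replaces the child expression to keep the sketch self-contained.
final case class RandNode(seed: Long, useRandSeed: Boolean = false) {
  // Column name: the seed is omitted when the flag is set.
  def sql: String = s"rand(${if (useRandSeed) "" else seed.toString})"

  // Mirrors `Rand(child)`: the flag silently falls back to its default.
  def freshCopyDroppingFlag(): RandNode = RandNode(seed)

  // Passes the flag along, so the copy keeps the same column name.
  def freshCopy(): RandNode = RandNode(seed, useRandSeed)
}

object FreshCopySketch {
  def main(args: Array[String]): Unit = {
    val original = RandNode(42L, useRandSeed = true)
    println(original.sql)                         // rand()
    println(original.freshCopyDroppingFlag().sql) // rand(42)
    println(original.freshCopy().sql)             // rand()
  }
}
```

In this model, a copy that drops the flag changes the displayed name from rand() to rand(42), which is the kind of inconsistency the question points at.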

dongjoon-hyun · 2020-04-28T17:41:44Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

@@ -3425,6 +3425,28 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
      assert(SQLConf.get.getConf(SQLConf.CODEGEN_FALLBACK) === true)
    }
  }
+
+  test("Do not display the seed of rand/randn with no argument in output schema") {


If you don't mind, SPARK-31594: prefix please?

maropu: Sure, I don't mind at all!

dongjoon-hyun · 2020-04-28T17:52:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

+
+  override def flatArguments: Iterator[Any] = Iterator(child)
+  override def sql: String = {
+    s"randn(${if (useRandSeed) "" else child.sql})"


The name useRandSeed might be a slight mismatch. It is used only here, to hide the random seed. Maybe something like hideSeed is more direct?

Currently, Randn(child = expr, useRandSeed = true) can be constructed programmatically. That might look weird, because it would not actually use a random seed.

maropu: Yea, the name looks reasonable.
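A minimal sketch of how a flag named hideSeed reads at the call site, assuming a simplified model rather than the actual Catalyst code. `SeedLiteral`, `RandModel`, and `RandModel.noArg` are hypothetical names used only here; the sql body follows the shape of the diff line quoted in this thread.

```scala
import scala.util.Random

// Hypothetical stand-in for a literal child expression; only `sql` is modelled.
final case class SeedLiteral(value: Long) {
  def sql: String = value.toString
}

// Simplified model of Rand(child, hideSeed); not the real Catalyst class.
final case class RandModel(child: SeedLiteral, hideSeed: Boolean = false) {
  // Display name used for the output column.
  def sql: String = s"rand(${if (hideSeed) "" else child.sql})"
}

object RandModel {
  // No-argument form: pick a random seed and hide it from the display name.
  def noArg(): RandModel =
    RandModel(SeedLiteral(Random.nextLong()), hideSeed = true)
}

object HideSeedSketch {
  def main(args: Array[String]): Unit = {
    println(RandModel.noArg().sql)             // rand()
    println(RandModel(SeedLiteral(1L)).sql)    // rand(1)
    // The seed is still held by the expression even when it is hidden.
    println(RandModel.noArg().child.value)
  }
}
```

With this naming, `RandModel(SeedLiteral(1L), hideSeed = true)` says what it does (the seed is used but not displayed), which is arguably easier to read than a flag that suggests a random seed is in use.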

SparkQA · 2020-04-28T21:42:19Z

Test build #122007 has finished for PR 28392 at commit 663fdc7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Rand(child: Expression, useRandSeed: Boolean = false)
case class Randn(child: Expression, useRandSeed: Boolean = false)

SparkQA · 2020-04-29T04:55:57Z

Test build #122021 has finished for PR 28392 at commit 4b1f3f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class Rand(child: Expression, hideSeed: Boolean = false)
case class Randn(child: Expression, hideSeed: Boolean = false)

dongjoon-hyun · 2020-04-29T05:05:09Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+      Console.withOut(output) {
+        df.explain()
+      }
+      output.toString.matches("""randn?\(-?[0-9]+\)""")


Did you want assert(...)?

dongjoon-hyun · 2020-04-29T05:13:22Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+    }
+    val df1 = sql("SELECT rand()")
+    assert(df1.schema.head.name === "rand()")
+    checkIfSeedExistsInExplain(df1)


If we add assert at line 3435, this test will fail.
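One plausible reason the assertion would fail, assuming the helper keeps calling String.matches on the whole explain output: matches succeeds only when the entire string matches the pattern, while the seed appears in the middle of the plan text. The sketch below uses the explain output and pattern quoted earlier in this thread; the object and value names are hypothetical.

```scala
object ExplainRegexSketch {
  // Explain output copied from the PR description.
  val explainOutput: String =
    """== Physical Plan ==
      |*(1) Project [rand(-2282124938778456838) AS rand()#0]
      |+- *(1) Scan OneRowRelation[]
      |""".stripMargin

  // The pattern from the test under review.
  val seedPattern: String = """randn?\(-?[0-9]+\)"""

  // String.matches succeeds only if the entire string matches the pattern.
  val wholeStringMatch: Boolean = explainOutput.matches(seedPattern)

  // An unanchored search looks for the pattern anywhere in the text.
  val foundAnywhere: Boolean =
    seedPattern.r.findFirstIn(explainOutput).isDefined

  def main(args: Array[String]): Unit = {
    println(s"matches: $wholeStringMatch, findFirstIn: $foundAnywhere")
  }
}
```

Here the whole-string check is false and the unanchored search is true, so an assertion built on a search such as findFirstIn would be one way to check that the seed is present.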

SparkQA · 2020-04-29T07:05:02Z

Test build #122042 has finished for PR 28392 at commit d90b2e7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-29T07:05:46Z

retest this please

dongjoon-hyun

+1, LGTM. Thank you, @maropu, @HyukjinKwon, @srowen.
The last commit is only a test case update, and I verified it locally.
Merged to master.

beliefer · 2020-04-29T07:25:06Z

@maropu I will update #28194

maropu · 2020-04-29T07:25:59Z

Thanks, @dongjoon-hyun !

SparkQA · 2020-04-29T12:17:00Z

Test build #122045 has finished for PR 28392 at commit d90b2e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes apache#28392 from maropu/SPARK-31594.
Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Fix

7216511

probot-autolabeler bot added the SQL label Apr 28, 2020

maropu mentioned this pull request Apr 28, 2020

[SPARK-31372][SQL][TEST] Display expression schema for double check. #28194

Closed

HyukjinKwon reviewed Apr 28, 2020

Fix

663fdc7

maropu changed the title ~~[SPARK-31594][SQL] Do not display rand/randn seed numbers in schema~~ [SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema Apr 28, 2020

dongjoon-hyun reviewed Apr 28, 2020

Fix

4b1f3f2

maropu force-pushed the SPARK-31594 branch from c44c31d to 4b1f3f2 Compare April 29, 2020 00:04

dongjoon-hyun reviewed Apr 29, 2020

Fix

d90b2e7

dongjoon-hyun approved these changes Apr 29, 2020

dongjoon-hyun closed this in 97f2c03 Apr 29, 2020

HyukjinKwon mentioned this pull request May 1, 2020

[SPARK-31372][SQL][TEST][FOLLOWUP][3.0] Update the golden file of ExpressionsSchemaSuite #28427

Closed
