Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema #28392

Closed
wants to merge 4 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Apr 28, 2020

What changes were proposed in this pull request?

This PR intends to update sql in Rand/Randn with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):

scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+

After this PR (a column name fixed):

scala> sql("select rand()").show()
+------------------+                                                            
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+                                                            
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]

Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

Why are the changes needed?

To make output schema deterministic.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests.

@maropu
Copy link
Member Author

maropu commented Apr 28, 2020

Minor fix though, how about this? @dongjoon-hyun @HyukjinKwon

@@ -102,6 +102,8 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed {
}

override def freshCopy(): Rand = Rand(child)

override def sql: String = "rand()"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm .. shouldn't we print out the seed when it's explicitly given?

Copy link
Member Author

@maropu maropu Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, I see... I just want to reomve a non-deterministic number in a column name when no argument given. I need more time to think of it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe the current output is useful. Yes the seed was randomly chosen, but you might want to know what it was.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I also think it's important that we can check seeds, but df.explain is not enough for checking it? Actually, the other two expression with random seeds (shuffle and uuid) don't display it in column names;

scala> sql("select shuffle(array(1, 2))").show()
+--------------------+
|shuffle(array(1, 2))|
+--------------------+
|              [2, 1]|
+--------------------+

scala> sql("select shuffle(array(1, 2))").explain()
== Physical Plan ==
*(1) Project [shuffle([1,2], Some(894779230406706679)) AS shuffle(array(1, 2))#14]
+- *(1) Scan OneRowRelation[]

scala> sql("select uuid()").show()
+--------------------+
|              uuid()|
+--------------------+
|dde93891-8a95-4e9...|
+--------------------+

scala> sql("select uuid()").explain()
== Physical Plan ==
*(1) Project [uuid(Some(4613707233104825008)) AS uuid()#23]
+- *(1) Scan OneRowRelation[]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, consistency is good. I don't feel strongly either way.

@SparkQA
Copy link

SparkQA commented Apr 28, 2020

Test build #121990 has finished for PR 28392 at commit 7216511.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-31594][SQL] Do not display rand/randn seed numbers in schema [SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema Apr 28, 2020
@@ -102,6 +105,11 @@ case class Rand(child: Expression) extends RDG with ExpressionWithRandomSeed {
}

override def freshCopy(): Rand = Rand(child)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when we do freshCopy? Do we need to propagate useRandSeed field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yes. Nice catch!

@@ -3425,6 +3425,28 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
assert(SQLConf.get.getConf(SQLConf.CODEGEN_FALLBACK) === true)
}
}

test("Do not display the seed of rand/randn with no argument in output schema") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't mind, SPARK-31594: prefix please?

Copy link
Member Author

@maropu maropu Apr 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, of course not!


override def flatArguments: Iterator[Any] = Iterator(child)
override def sql: String = {
s"randn(${if (useRandSeed) "" else child.sql})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming useRandSeed might be a little mismatched. This is used only here to hide random seed. Maybe, something like hideSeed is more direct?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, Randn(child = expr, useRandSeed = true) seems to be possible in program. It might look weird because it will not use rand seed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, the name looks reasonable.

@SparkQA
Copy link

SparkQA commented Apr 28, 2020

Test build #122007 has finished for PR 28392 at commit 663fdc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rand(child: Expression, useRandSeed: Boolean = false)
  • case class Randn(child: Expression, useRandSeed: Boolean = false)

@SparkQA
Copy link

SparkQA commented Apr 29, 2020

Test build #122021 has finished for PR 28392 at commit 4b1f3f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Rand(child: Expression, hideSeed: Boolean = false)
  • case class Randn(child: Expression, hideSeed: Boolean = false)

Console.withOut(output) {
df.explain()
}
output.toString.matches("""randn?\(-?[0-9]+\)""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want assert(...)?

}
val df1 = sql("SELECT rand()")
assert(df1.schema.head.name === "rand()")
checkIfSeedExistsInExplain(df1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add assert at line 3435, this test will fail.

@SparkQA
Copy link

SparkQA commented Apr 29, 2020

Test build #122042 has finished for PR 28392 at commit d90b2e7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 29, 2020

retest this please

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @maropu, @HyukjinKwon , @srowen .
The last commit is only test case update and I verified that locally.
Merged to master.

@beliefer
Copy link
Contributor

@maropu I will update #28194

@maropu
Copy link
Member Author

maropu commented Apr 29, 2020

Thanks, @dongjoon-hyun !

@SparkQA
Copy link

SparkQA commented Apr 29, 2020

Test build #122045 has finished for PR 28392 at commit d90b2e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
…ment in output schema

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from apache#28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

To make output schema deterministic.

No.

Added unit tests.

Closes apache#28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants