
[SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf #32243

Closed
wants to merge 8 commits

Conversation

@maropu (Member) commented Apr 20, 2021

What changes were proposed in this pull request?

This PR proposes to port the minimal code needed to generate TPC-DS data from databricks/spark-sql-perf. The classes in the new file tpcdsDatagen.scala are essentially copied from the databricks/spark-sql-perf codebase.
Note that I've modified them slightly to follow the Spark code style and removed the parts that are not needed here.

The code authors of these classes are:
@juliuszsompolski
@npoggi
@wangyum

Why are the changes needed?

We now use TPC-DS data frequently for benchmarks/tests, but the classes that define the TPC-DS schema for data generation and for benchmarks/tests are managed separately.

I think this causes some inconvenience; for example, we need to update files in two separate repositories whenever the TPC-DS schema is updated (#32037). So it would be useful if the Spark codebase could generate the data by referring to the same schema definition.
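
To make this concrete, below is a minimal sketch, assuming a structure in which one schema trait feeds both sides. TPCDSSchema, TPCDSBase, and TPCDSTables are names used in this PR, while the trait body, the TPCDSTablesSketch class, and its createTable method are purely illustrative assumptions.

import org.apache.spark.sql.SparkSession

// Hedged sketch (assumed structure, not the exact code in this PR): a single schema
// definition shared by the tests and the data generator. The real TPCDSSchema trait
// holds the full TPC-DS table definitions; one abbreviated entry is shown here.
trait TPCDSSchema {
  def tableColumns: Map[String, String] = Map(
    "call_center" -> "cc_call_center_sk INT, cc_call_center_id STRING" // ... and so on
  )
}

// Test side (per the class list later in this thread):
//   trait TPCDSBase extends SharedSparkSession with TPCDSSchema

// Datagen side: tables are created from the very same definitions, so a schema change
// only has to be made once. TPCDSTablesSketch is an illustrative name, not the actual
// TPCDSTables class added by this PR.
class TPCDSTablesSketch(spark: SparkSession, dataLocation: String) extends TPCDSSchema {
  def createTable(name: String): Unit = {
    spark.sql(
      s"CREATE TABLE $name (${tableColumns(name)}) USING parquet LOCATION '$dataLocation/$name'")
  }
}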

Does this PR introduce any user-facing change?

No, dev only.

How was this patch tested?

Manually checked and GA passed.

@maropu (Member Author) commented Apr 20, 2021

WDYT? If no one disagrees with this, I'll file a JIRA for it. @juliuszsompolski @npoggi @wangyum @HyukjinKwon @yaooqinn

@HyukjinKwon (Member) left a comment

Yup, I like this idea. cc @gatorsmile too FYI

Comment on lines 524 to 532
- name: Checkout tpcds-kit repository
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  uses: actions/checkout@v2
  with:
    repository: databricks/tpcds-kit
    path: ./tpcds-kit
- name: Build tpcds-kit
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  run: cd tpcds-kit/tools && make OS=LINUX
Member Author

AAAAAAAAAADGAAAA
AAAAAAAAAADGBAAA
AAAAAAAAAADGBAAA
AAAAAAAAAAABAAAA
Member Author

This PR updated the golden files because the datagen used to generate maropu/spark-tpcds-sf-1 was older than the databricks/tpcds-kit one.

Member

In this case, it seems that we had better split this PR into two. WDYT, @maropu ?

Member Author

yea, sgtm. We can split it by checking out my repo instead: https://github.com/maropu/spark-tpcds-datagen/tree/master/thirdparty/tpcds-kit/tools

Member Author

I'll do it later, too.

Member Author

okay, I've fixed the code so that it generates the same TPC-DS data as https://github.com/maropu/spark-tpcds-sf-1.

object GenTPCDSData {

  def main(args: Array[String]): Unit = {
    val config = new GenTPCDSDataConfig(args)
Member Author

Brown Monika 6031.52
Collins Gordon 727.57
Green Jesse 9672.96

Member

Does the empty result set look right?

Member Author

Yea, I need to check the answers query-by-query later...

@@ -3,4 +3,4 @@
-- !query schema
struct<sum(sales):decimal(28,2)>
-- !query output
17030.91
NULL
Member

Does the NULL look right?

@@ -3,4 +3,4 @@
-- !query schema
struct<c_last_name:string,c_first_name:string,s_store_name:string,paid:decimal(27,2)>
-- !query output
Griffith Ray able 161564.48
Member

ditto

@@ -3,4 +3,4 @@
-- !query schema
struct<i_item_id:string,i_item_desc:string,s_store_id:string,s_store_name:string,store_sales_profit:decimal(17,2),store_returns_loss:decimal(17,2),catalog_sales_profit:decimal(17,2)>
-- !query output
AAAAAAAADPMBAAAA Things know alone letters. Flights should tend even jewish fees. Civil plans could not cry also social days; other losses might not pay walls; still able signs should not remove too human AAAAAAAAHAAAAAAA ation 12.84 91.41 -1329.46
Member

ditto

Pettit Richard able 3930.52
Townsend Franklin able 68983.20
Winchester Margaret bar 14269.20

Member

ditto

@yaooqinn (Member)

Great work, @maropu and thanks for making this happen

@SparkQA commented Apr 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42187/

@SparkQA commented Apr 20, 2021

Test build #137658 has finished for PR 32243 at commit 7b6949d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TPCDSBase extends SharedSparkSession with TPCDSSchema
  • trait TPCDSSchema
  • class Dsdgen(dsdgenDir: String) extends Serializable
  • class TPCDSTables(sqlContext: SQLContext, dsdgenDir: String, scaleFactor: Int)
  • class GenTPCDSDataConfig(args: Array[String])
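
Reading the class list above together with the usage snippets quoted later in this thread, the entry point presumably wires these pieces together roughly as follows. This is a hedged sketch, not the code in this PR: GenTPCDSDataConfig, TPCDSTables, and the last four genData arguments come from this thread, while the remaining field and parameter names (master, location, format, overwrite, etc.) are assumptions.

import org.apache.spark.sql.SparkSession

object GenTPCDSDataSketch {
  def main(args: Array[String]): Unit = {
    // Parse --master, --dsdgenDir, --location, --scaleFactor, etc. (option names per the
    // usage text quoted later in this thread).
    val config = new GenTPCDSDataConfig(args)

    val spark = SparkSession.builder()
      .master(config.master)
      .appName("GenTPCDSData")
      .getOrCreate()

    // TPCDSTables(sqlContext, dsdgenDir, scaleFactor) per the class list above.
    val tables = new TPCDSTables(spark.sqlContext, config.dsdgenDir, config.scaleFactor)

    // Run dsdgen per table and write the results out; the last four arguments mirror
    // the snippet quoted later in this thread, the earlier ones are assumed.
    tables.genData(
      location = config.location,
      format = config.format,
      overwrite = config.overwrite,
      clusterByPartitionColumns = config.clusterByPartitionColumns,
      filterOutNullPartitionValues = config.filterOutNullPartitionValues,
      tableFilter = config.tableFilter,
      numPartitions = config.numPartitions)
  }
}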

uses: actions/cache@v2
with:
  path: ./tpcds-sf-1
  key: tpcds-${{ hashFiles('sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
Member

Is this sufficient?

Member Author

yes, I think it is for now, since this PR targets schema definition changes.

@npoggi (Contributor) commented Apr 21, 2021

Glad to see this update happening. Ideally, if we can store the [compressed] sf1 data on a public server, there shouldn't be much need to compile dsdgen or run the generation at all. I see in the PR that there is already some caching.
With @HyukjinKwon, we need to update spark-sql-perf to the latest TPC-DS spec and query templates. @gatorsmile, let's coordinate that effort.

@HyukjinKwon (Member) commented Apr 22, 2021

I think it's okay. As you pointed out, GitHub Actions already caches the generated SF1 data, and it doesn't seem to add much code for that.

we need to update spark-sql-perf to the latest TPC-DS spec and query templates

@maropu would you mind porting the TPC-DS spec and query update to spark-sql-perf when you find some time?

@maropu (Member Author) commented Apr 22, 2021

Ur..., I noticed that the generated data differ between the GA env (Ubuntu) and my env (macOS) even when following the same workflow. The generator's behaviour probably depends on the implementation of the random functions. I'm currently not sure that we can generate the same data across different Linux distros, so I need more work (e.g., adding a script to generate data in a Docker env) to make it easy for developers to generate data/golden files....

My bad. I misunderstood it; I had simply used different RNGSEED values when generating the data. I've checked that the current code generates the same TPC-DS data across different envs (macOS, Ubuntu, Amazon Linux 2, ...).
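
As a small illustration of the point above (a hedged sketch; the helper name and parameters are assumptions, and the actual command built by this PR is quoted further down in the thread): fixing dsdgen's -RNGSEED makes repeated runs produce identical rows, which is what makes the golden files reproducible across macOS, Ubuntu, Amazon Linux 2, and so on.

// Illustrative helper: build a dsdgen invocation with an explicit, fixed seed so that
// every run generates the same data regardless of the environment it runs on.
def dsdgenCommand(toolsDir: String, table: String, scaleFactor: Int, seed: Int): Seq[String] =
  Seq("bash", "-c",
    s"cd $toolsDir && ./dsdgen -table $table -filter Y -scale $scaleFactor -RNGSEED $seed")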

@maropu would you mind porting the TPC-DS spec and query update to spark-sql-perf when you find some time?

Of course not. I'll probably have time to work on it tomorrow.

@maropu maropu marked this pull request as draft April 22, 2021 13:10
@maropu maropu changed the title [WIP][SPARK-XXXXX][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf [WIP][SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf Apr 22, 2021
@SparkQA commented Apr 30, 2021

Test build #138095 has finished for PR 32243 at commit ef68b61.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42615/

@SparkQA commented Apr 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42615/

val commands = Seq(
  "bash", "-c",
  s"cd $localToolsDir && ./dsdgen -table $tableName -filter Y -scale $scaleFactor " +
  s"-RNGSEED 19620718 $parallel")
Member Author

NOTE: In a follow-up PR, I'll revert this value back to 100 to use https://github.com/databricks/tpcds-kit instead. See: #32243 (comment)

@maropu maropu changed the title [WIP][SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf Apr 30, 2021
@maropu maropu marked this pull request as ready for review April 30, 2021 13:25
@SparkQA commented Apr 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42631/

* build/sbt "sql/test:runMain <this class> --dsdgenDir <path> --location <path> --scaleFactor 1"
* }}}
*/
object GenTPCDSData {
Member

Shall we rename the file tpcdsDatagen.scala -> GenTPCDSData.scala?

Member Author

sure

|build/sbt "test:runMain <this class> [Options]"
|Options:
| --master the Spark master to use, default to local[*]
| --dsdgenDir location of dsdgen
@dongjoon-hyun (Member) commented Apr 30, 2021

Is this the only difference from your repo?

Member Author

yea, yes.

clusterByPartitionColumns = config.clusterByPartitionColumns,
filterOutNullPartitionValues = config.filterOutNullPartitionValues,
tableFilter = config.tableFilter,
numPartitions = config.numPartitions)
Member

We need spark.stop() after this line.

Member Author

Oh, I forgot it. Thank you.
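
For reference, a minimal sketch of the fix discussed here, assuming these lines sit at the end of GenTPCDSData.main and that the argument names match the snippet quoted above:

tables.genData(
  // ... other options as in the snippet above ...
  clusterByPartitionColumns = config.clusterByPartitionColumns,
  filterOutNullPartitionValues = config.filterOutNullPartitionValues,
  tableFilter = config.tableFilter,
  numPartitions = config.numPartitions)
// Stop the session once data generation completes so the runMain task exits cleanly.
spark.stop()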

@dongjoon-hyun (Member)

Thank you for the update, @maropu. The updated version looks neat!

@SparkQA commented Apr 30, 2021

Test build #138110 has finished for PR 32243 at commit 5ee9c1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42638/

@SparkQA commented Apr 30, 2021

Test build #138117 has finished for PR 32243 at commit a88fe32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42640/

@SparkQA commented May 1, 2021

Test build #138119 has finished for PR 32243 at commit 5f4d022.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented May 1, 2021

okay, ready to review. cc: @juliuszsompolski @npoggi @wangyum

@maropu maropu closed this in cd689c9 May 3, 2021
@maropu (Member Author) commented May 3, 2021

Thank you, @HyukjinKwon @dongjoon-hyun ~ Merged to master.

@npoggi (Contributor) commented May 3, 2021

Arriving a bit late. Looks good. We should move the DDL for the tables into resource files at some point. Thanks for the update.

dongjoon-hyun pushed a commit that referenced this pull request May 8, 2021
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?

This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries.

NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460.

This is the backport PR of #31886.

### Why are the changes needed?

There are some cases where we noticed runtime-related bugs after merging commits (e.g., SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test was added.

Closes #32462 from maropu/TPCDSQUeryTestSuite-Branch3.1.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request May 9, 2021
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?

This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries.

NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460.

This is the backport PR of #31886.

### Why are the changes needed?

There are some cases where we noticed runtime-related bugs after merging commits (e.g., SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test was added.

Closes #32479 from maropu/TPCDSQueryTestSuite-Branch3.0.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021