
[SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf #32243

Closed
wants to merge 8 commits

Conversation

@maropu (Member) commented Apr 20, 2021

What changes were proposed in this pull request?

This PR proposes to port the minimal code needed to generate TPC-DS data from databricks/spark-sql-perf. The classes in the new file tpcdsDatagen.scala are essentially copied from the databricks/spark-sql-perf codebase.
Note that I've modified them slightly to follow the Spark code style and removed the parts that are not needed here.

The code authors of these classes are:
@juliuszsompolski
@npoggi
@wangyum

Why are the changes needed?

We now use TPC-DS data frequently for benchmarks/tests, but the classes that define the TPC-DS schema for data generation and for benchmarks/tests are managed separately.

I think this causes some inconvenience; for example, we need to update files in two separate repositories whenever the TPC-DS schema is updated (#32037). So it would be useful if the Spark codebase could generate the data by referring to the same schema definition.
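
To make this concrete, below is a minimal sketch, assuming a structure in which one schema trait feeds both sides. TPCDSSchema, TPCDSBase, and TPCDSTables are names used in this PR, while the trait body, the TPCDSTablesSketch class, and its createTable method are purely illustrative assumptions.

import org.apache.spark.sql.SparkSession

// Hedged sketch (assumed structure, not the exact code in this PR): a single schema
// definition shared by the tests and the data generator. The real TPCDSSchema trait
// holds the full TPC-DS table definitions; one abbreviated entry is shown here.
trait TPCDSSchema {
  def tableColumns: Map[String, String] = Map(
    "call_center" -> "cc_call_center_sk INT, cc_call_center_id STRING" // ... and so on
  )
}

// Test side (per the class list later in this thread):
//   trait TPCDSBase extends SharedSparkSession with TPCDSSchema

// Datagen side: tables are created from the very same definitions, so a schema change
// only has to be made once. TPCDSTablesSketch is an illustrative name, not the actual
// TPCDSTables class added by this PR.
class TPCDSTablesSketch(spark: SparkSession, dataLocation: String) extends TPCDSSchema {
  def createTable(name: String): Unit = {
    spark.sql(
      s"CREATE TABLE $name (${tableColumns(name)}) USING parquet LOCATION '$dataLocation/$name'")
  }
}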

Does this PR introduce any user-facing change?

No, dev only.

How was this patch tested?

Manually checked and GA passed.

@maropu (Member Author) commented Apr 20, 2021

WDYT? If no one disagrees with this, I'll file a JIRA for it. @juliuszsompolski @npoggi @wangyum @HyukjinKwon @yaooqinn

@HyukjinKwon (Member) left a comment

Yup, I like this idea. cc @gatorsmile too FYI

Comment on lines 524 to 532
- name: Checkout tpcds-kit repository
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  uses: actions/checkout@v2
  with:
    repository: databricks/tpcds-kit
    path: ./tpcds-kit
- name: Build tpcds-kit
  if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
  run: cd tpcds-kit/tools && make OS=LINUX
Member Author

AAAAAAAAAADGAAAA
AAAAAAAAAADGBAAA
AAAAAAAAAADGBAAA
AAAAAAAAAAABAAAA
Member Author

This PR updated the golden files because the datagen used to generate maropu/spark-tpcds-sf-1 was older than the databricks/tpcds-kit one.

Member

In this case, it seems that we had better split this PR into two. WDYT, @maropu ?

Member Author

yea, sgtm. We can split it by checking out my repo instead: https://github.com/maropu/spark-tpcds-datagen/tree/master/thirdparty/tpcds-kit/tools

Member Author

I'll do it later, too.

Member Author

okay, I've fixed the code so that it generates the same TPC-DS data as https://github.com/maropu/spark-tpcds-sf-1.

object GenTPCDSData {

  def main(args: Array[String]): Unit = {
    val config = new GenTPCDSDataConfig(args)
Member Author

Brown Monika 6031.52
Collins Gordon 727.57
Green Jesse 9672.96

Member

Does the empty result set look right?

Member Author

Yea, I need to check the answers query-by-query later...

@@ -3,4 +3,4 @@
-- !query schema
struct<sum(sales):decimal(28,2)>
-- !query output
17030.91
NULL
Member

Does the NULL look right?

@@ -3,4 +3,4 @@
-- !query schema
struct<c_last_name:string,c_first_name:string,s_store_name:string,paid:decimal(27,2)>
-- !query output
Griffith Ray able 161564.48
Member

ditto

@@ -3,4 +3,4 @@
-- !query schema
struct<i_item_id:string,i_item_desc:string,s_store_id:string,s_store_name:string,store_sales_profit:decimal(17,2),store_returns_loss:decimal(17,2),catalog_sales_profit:decimal(17,2)>
-- !query output
AAAAAAAADPMBAAAA Things know alone letters. Flights should tend even jewish fees. Civil plans could not cry also social days; other losses might not pay walls; still able signs should not remove too human AAAAAAAAHAAAAAAA ation 12.84 91.41 -1329.46
Member

ditto

Pettit Richard able 3930.52
Townsend Franklin able 68983.20
Winchester Margaret bar 14269.20

Member

ditto

@yaooqinn (Member)

Great work, @maropu and thanks for making this happen

@SparkQA commented Apr 20, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42187/

@SparkQA commented Apr 20, 2021

Test build #137658 has finished for PR 32243 at commit 7b6949d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TPCDSBase extends SharedSparkSession with TPCDSSchema
  • trait TPCDSSchema
  • class Dsdgen(dsdgenDir: String) extends Serializable
  • class TPCDSTables(sqlContext: SQLContext, dsdgenDir: String, scaleFactor: Int)
  • class GenTPCDSDataConfig(args: Array[String])
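
Reading the class list above together with the usage snippets quoted later in this thread, the entry point presumably wires these pieces together roughly as follows. This is a hedged sketch, not the code in this PR: GenTPCDSDataConfig, TPCDSTables, and the last four genData arguments come from this thread, while the remaining field and parameter names (master, location, format, overwrite, etc.) are assumptions.

import org.apache.spark.sql.SparkSession

object GenTPCDSDataSketch {
  def main(args: Array[String]): Unit = {
    // Parse --master, --dsdgenDir, --location, --scaleFactor, etc. (option names per the
    // usage text quoted later in this thread).
    val config = new GenTPCDSDataConfig(args)

    val spark = SparkSession.builder()
      .master(config.master)
      .appName("GenTPCDSData")
      .getOrCreate()

    // TPCDSTables(sqlContext, dsdgenDir, scaleFactor) per the class list above.
    val tables = new TPCDSTables(spark.sqlContext, config.dsdgenDir, config.scaleFactor)

    // Run dsdgen per table and write the results out; the last four arguments mirror
    // the snippet quoted later in this thread, the earlier ones are assumed.
    tables.genData(
      location = config.location,
      format = config.format,
      overwrite = config.overwrite,
      clusterByPartitionColumns = config.clusterByPartitionColumns,
      filterOutNullPartitionValues = config.filterOutNullPartitionValues,
      tableFilter = config.tableFilter,
      numPartitions = config.numPartitions)
  }
}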

uses: actions/cache@v2
with:
  path: ./tpcds-sf-1
  key: tpcds-${{ hashFiles('sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
Member

Is this sufficient?

Member Author

yes, I think it is for now, since this PR targets schema definition changes.

@npoggi (Contributor) commented Apr 21, 2021

Glad to see this update happening. Ideally, if we can store the [compressed] sf1 data on a public server, there shouldn't be much need to compile dsdgen or run the generation at all. I see in the PR that there is already some caching.
With @HyukjinKwon, we need to update spark-sql-perf to the latest TPC-DS spec and query templates. @gatorsmile, let's coordinate that effort.

@HyukjinKwon (Member) commented Apr 22, 2021

I think it's okay. As you pointed out, GitHub Actions already caches the generated SF1 data, and it doesn't seem to add much code for that.

we need to update spark-sql-perf to the latest TPC-DS spec and query templates

@maropu would you mind porting the TPC-DS spec and query update to spark-sql-perf when you find some time?

@maropu (Member Author) commented Apr 22, 2021

Ur..., I noticed that the generated data differ between the GA env (Ubuntu) and my env (macOS) even when following the same workflow. The generator's behaviour probably depends on the implementation of the random functions. I'm currently not sure that we can generate the same data across different Linux distros, so I need more work (e.g., adding a script to generate data in a Docker env) to make it easy for developers to generate data/golden files....

My bad. I misunderstood it; I had simply used different RNGSEED values when generating the data. I've checked that the current code generates the same TPC-DS data across different envs (macOS, Ubuntu, Amazon Linux 2, ...).
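
As a small illustration of the point above (a hedged sketch; the helper name and parameters are assumptions, and the actual command built by this PR is quoted further down in the thread): fixing dsdgen's -RNGSEED makes repeated runs produce identical rows, which is what makes the golden files reproducible across macOS, Ubuntu, Amazon Linux 2, and so on.

// Illustrative helper: build a dsdgen invocation with an explicit, fixed seed so that
// every run generates the same data regardless of the environment it runs on.
def dsdgenCommand(toolsDir: String, table: String, scaleFactor: Int, seed: Int): Seq[String] =
  Seq("bash", "-c",
    s"cd $toolsDir && ./dsdgen -table $table -filter Y -scale $scaleFactor -RNGSEED $seed")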

@maropu would you mind porting the TPC-DS spec and query update to spark-sql-perf when you find some time?

Of course not. I'll probably have time to work on it tomorrow.

@maropu maropu marked this pull request as draft April 22, 2021 13:10
@maropu maropu changed the title [WIP][SPARK-XXXXX][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf [WIP][SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf Apr 22, 2021
@SparkQA commented Apr 30, 2021

Test build #138095 has finished for PR 32243 at commit ef68b61.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42615/

@SparkQA commented Apr 30, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42615/

val commands = Seq(
  "bash", "-c",
  s"cd $localToolsDir && ./dsdgen -table $tableName -filter Y -scale $scaleFactor " +
  s"-RNGSEED 19620718 $parallel")
Member Author

NOTE: In a follow-up PR, I'll revert this value back to 100 to use https://github.com/databricks/tpcds-kit instead. See: #32243 (comment)

@maropu maropu changed the title [WIP][SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf Apr 30, 2021
@maropu maropu marked this pull request as ready for review April 30, 2021 13:25
@SparkQA commented Apr 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42631/

* build/sbt "sql/test:runMain <this class> --dsdgenDir <path> --location <path> --scaleFactor 1"
* }}}
*/
object GenTPCDSData {
Member

Shall we rename the file tpcdsDatagen.scala -> GenTPCDSData.scala?

Member Author

sure

|build/sbt "test:runMain <this class> [Options]"
|Options:
| --master the Spark master to use, default to local[*]
| --dsdgenDir location of dsdgen
@dongjoon-hyun (Member) commented Apr 30, 2021

Is this the only difference from your repo?

Member Author

yea, yes.

clusterByPartitionColumns = config.clusterByPartitionColumns,
filterOutNullPartitionValues = config.filterOutNullPartitionValues,
tableFilter = config.tableFilter,
numPartitions = config.numPartitions)
Member

We need spark.stop() after this line.

Member Author

Oh, I forgot it. Thank you.
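
For reference, a minimal sketch of the fix discussed here, assuming these lines sit at the end of GenTPCDSData.main and that the argument names match the snippet quoted above:

tables.genData(
  // ... other options as in the snippet above ...
  clusterByPartitionColumns = config.clusterByPartitionColumns,
  filterOutNullPartitionValues = config.filterOutNullPartitionValues,
  tableFilter = config.tableFilter,
  numPartitions = config.numPartitions)
// Stop the session once data generation completes so the runMain task exits cleanly.
spark.stop()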

@dongjoon-hyun (Member)

Thank you for the update, @maropu. The updated version looks neat!

@SparkQA commented Apr 30, 2021

Test build #138110 has finished for PR 32243 at commit 5ee9c1a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 30, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42638/

@SparkQA commented Apr 30, 2021

Test build #138117 has finished for PR 32243 at commit a88fe32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42640/

@SparkQA commented May 1, 2021

Test build #138119 has finished for PR 32243 at commit 5f4d022.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member Author) commented May 1, 2021

okay, ready to review. cc: @juliuszsompolski @npoggi @wangyum

@maropu maropu closed this in cd689c9 May 3, 2021
@maropu (Member Author) commented May 3, 2021

Thank you, @HyukjinKwon @dongjoon-hyun ~ Merged to master.

@npoggi (Contributor) commented May 3, 2021

Arriving a bit late. Looks good. We should move the DDL for the tables into resource files at some point. Thanks for the update.

dongjoon-hyun pushed a commit that referenced this pull request May 8, 2021
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?

This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries.

NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460.

This is the backport PR of #31886.

### Why are the changes needed?

There are some cases where we noticed runtime-related bugs after merging commits (e.g., SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test was added.

Closes #32462 from maropu/TPCDSQUeryTestSuite-Branch3.1.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request May 9, 2021
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?

This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries.

NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460.

This is the backport PR of #31886.

### Why are the changes needed?

There are some cases where we noticed runtime-related bugs after merging commits (e.g., SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test was added.

Closes #32479 from maropu/TPCDSQueryTestSuite-Branch3.0.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021