Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34211][SQL][TESTS] Benchmark TPC-DS with 1GB scale factor #31303

Closed
wants to merge 3 commits into from
Closed

[SPARK-34211][SQL][TESTS] Benchmark TPC-DS with 1GB scale factor #31303

wants to merge 3 commits into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Jan 23, 2021

What changes were proposed in this pull request?

This pr add a new Github action to benchmark TPC-DS with 1GB scale factor.

Why are the changes needed?

  1. To track performance changes:
    http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Spark-3-1-1-RC1-tt30579.html#a30603
  2. To track the compatibility of Parquet upgrades:
    https://issues.apache.org/jira/browse/SPARK-26346

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A.

@github-actions github-actions bot added the INFRA label Jan 23, 2021
@SparkQA
Copy link

SparkQA commented Jan 23, 2021

Test build #134395 has finished for PR 31303 at commit e8faad3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38983/

@SparkQA
Copy link

SparkQA commented Jan 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38983/

@SparkQA
Copy link

SparkQA commented Jan 23, 2021

Test build #134397 has finished for PR 31303 at commit 9da5102.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

run: cd spark-sql-perf && build/sbt "test:runMain com.databricks.spark.sql.GenTPCDSData `pwd`/../tpcds-kit/tools 1 `pwd`/../tpcds1g parquet"
- name: Run TPCDSQueryBenchmark
run: |
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location `pwd`/tpcds1g --cbo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we enable CBO by default? I think we should run it with the default conf.

@HyukjinKwon
Copy link
Member

cc @Ngone51, @dongjoon-hyun, @cloud-fan, @peter-toth FYI

- name: Checkout spark-sql-perf repository
uses: actions/checkout@v2
with:
repository: wangyum/spark-sql-perf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what these repos do in the PR description? Also what does this fork do? Looks like only diff is wangyum/spark-sql-perf@abf08eb. Can you just use the original repo, and keep the data generate Scala file somewhere in Spark?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Original repo need a class: databricks/spark-sql-perf#196

run: cd spark-sql-perf && build/sbt "test:runMain com.databricks.spark.sql.GenTPCDSData `pwd`/../tpcds-kit/tools 1 `pwd`/../tpcds1g parquet"
- name: Run TPCDSQueryBenchmark
run: |
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location `pwd`/tpcds1g --cbo"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we keep the cache of Coursier in SBT?

@@ -430,3 +430,38 @@ jobs:
- name: Build with SBT
run: |
./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Phadoop-2.7 compile test:compile

tpcds1g:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no big deal but I would name it tpcds-1g

@HyukjinKwon HyukjinKwon changed the title [SPARK-34211][SQL][TEST] Benchmark TPC-DS with 1GB scale factor [SPARK-34211][SQL][TESTS] Benchmark TPC-DS with 1GB scale factor Jan 24, 2021
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few issues, @wangyum and @HyukjinKwon .

  1. I'm not sure if you are aware of it, but GitHub Action job performance is inconsisent. I'm -1 for depending on the unknown runner's benchmark.
  2. Apache Infra team already asked us to manage the shared resource pool efficiently. Given that, we had better avoid performance benchmark in Apache Spark GitHub Action job.

Like @maropu did already, why don't we have this in the downstream?

@HyukjinKwon
Copy link
Member

Is the perf inconsistency big? It will give you a rough numbers to check. I was also thinking this as a test. I remember we faced a case that TPSDS was broken whereas the Spark tests pass.

The resource is indeed a problem. It has been a problem so far. I think we have added some jobs if they are worthwhile though.

I'm okay to wait until we have some more resources and don't add new jobs for the time being.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 24, 2021

Actually, I understand the requirement and the intention because I also reported SPARK-33822 (TPCDS Q5 fails if spark.sql.adaptive.enabled=true).

I remember we faced a case that TPSDS was broken whereas the Spark tests pass.

@wangyum Could you provide a result in your repository which is requested by @HyukjinKwon ? It depends on the characteristic of the job.

Is the perf inconsistency big? It will give you a rough numbers to check.

BTW, on top of the above issues, this is not safe in terms of security. Recently, Apache Infra team introduced strict security enforcements by banning the runnable source of GitHub Action. We have no available option when the 3rd party repositories are compromised . Even the docker image repo, we should migrate ApacheSparkGitHubActionImage to Spark's official branch. For the two repositories in this PR, the migration is possible?

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 25, 2021

Yeah, it would be great if we can tell if the perf diff is considerably big, @wangyum. If it's too big, it will be less making sense to add.

And, yes it would be great if there are ways around to use them. If I remember correctly, the purpose of banning GitHub Actions and 3rd part repository was to avoid running unreviewed codes #31104 (comment).

@wangyum
Copy link
Member Author

wangyum commented Jan 25, 2021

Latest benchmark result:

Query Github Action 505391695 Github Action 505251367
q1 1847 2092
q2 1771 2012
q3 612 834
q4 26213 27773
q5 3903 5303
q6 1183 1459
q7 1250 1560
q8 1146 1390
q9 4657 6995
q10 3318 3819
q11 4058 5395
q12 625 873
q13 1697 2188
q14a 14602 15065
q14b 13321 13641
q15 838 770
q16 3276 3846
q17 2351 2923
q18 1937 1975
q19 2405 2671
q20 663 672
q21 931 963
q22 4572 4682
q23a 12826 13756
q23b 13425 14709
q24a 3001 4710
q24b 2511 4863
q25 2157 2693
q26 1015 1128
q27 1449 1877
q28 4545 6108
q29 2524 2980
q30 1366 1867
q31 3303 4737
q32 864 928
q33 1871 2231
q34 1016 1235
q35 2740 3110
q36 1222 1508
q37 1137 1244
q38 1917 2373
q39a 2249 2131
q39b 2241 2235
q40 1350 1379
q41 345 323
q42 542 732
q43 810 1037
q44 1381 1836
q45 958 1075
q46 1248 1575
q47 3842 4116
q48 1784 2103
q49 3028 3794
q50 1330 1748
q51 5166 5783
q52 531 743
q53 836 1118
q54 1491 2321
q55 508 722
q56 1953 2342
q57 2805 2701
q58 2360 2547
q59 1598 1772
q60 1949 2402
q61 1927 2364
q62 757 966
q63 806 1085
q64 11252 12010
q65 1971 2511
q66 2591 3041
q67 12088 12007
q68 1274 1570
q69 1199 1627
q70 1572 2096
q71 1543 2024
q72 11001 10465
q73 942 1157
q74 3064 4269
q75 4723 6536
q76 1305 1886
q77 2697 3898
q78 5755 6678
q79 1027 1441
q80 4421 5733
q81 1402 1689
q82 1189 1521
q83 1482 1992
q84 927 1191
q85 2338 2828
q86 727 905
q87 2061 2557
q88 3893 5405
q89 927 1256
q90 810 1227
q91 1504 1591
q92 800 1149
q93 1096 1429
q94 1132 1615
q95 10331 12241
q96 512 728
q97 2454 3069
q98 746 1026
q99 848 960
q5a-v2.7 4331 6205
q6-v2.7 1035 1316
q10a-v2.7 1285 1681
q11-v2.7 3726 5333
q12-v2.7 517 809
q14-v2.7 10499 12304
q14a-v2.7 23577 26429
q18a-v2.7 6817 7550
q20-v2.7 598 677
q22-v2.7 18451 20012
q22a-v2.7 3341 3576
q24-v2.7 2538 4958
q27a-v2.7 2884 3855
q34-v2.7 866 1165
q35-v2.7 2527 2950
q35a-v2.7 1534 2023
q36a-v2.7 1748 2022
q47-v2.7 3466 4143
q49-v2.7 2773 3681
q51a-v2.7 27485 30249
q57-v2.7 2530 2772
q64-v2.7 9467 11019
q67a-v2.7 15092 16564
q70a-v2.7 2070 2574
q72-v2.7 9824 10055
q74-v2.7 3085 3971
q75-v2.7 4976 6163
q77a-v2.7 4268 5392
q78-v2.7 5478 6608
q80a-v2.7 6664 7881
q86a-v2.7 1196 1372
q98-v2.7 756 1010

@cloud-fan
Copy link
Contributor

I agree that we can still track some big perf differences (like 10x slower) with Github actions, but running benchmarks in Github action takes a lot of resources and may not pay off if it can only catch big perf differences that are rare.

We can either fork spark and run benchmarks with Github actions under your own account, or set up AWS machines to run benchmarks like @maropu did.

@wangyum
Copy link
Member Author

wangyum commented Jan 25, 2021

Close it. Thank you all.

@wangyum wangyum closed this Jan 25, 2021
@HyukjinKwon
Copy link
Member

running benchmarks in Github action takes a lot of resources

Just be correct, doing 1GB is not a lot. It takes less than an hour which is same as other parts of our tests. As I said earlier, it's not only performance but also test, see SPARK-33822.

But it's fine to drop it given that the resource issue going on globally in ASF repos.

@HyukjinKwon
Copy link
Member

Once we resolve the resource limitation issue, it would be great if we can bring this back.

@wangyum wangyum deleted the SPARK-34211 branch January 25, 2021 09:17
@dongjoon-hyun
Copy link
Member

Thank you for the decision all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants