Move Stack classes to wrapper classes to fix non-deterministic build issue #9576

Merged
merged 4 commits into from
Oct 31, 2023

Conversation

NVnavkumar
Collaborator

Fixes #9571.

This wraps the Scala 2.12/2.13 specific Stack classes in a class called RapidsStack, which is then used like the other classes. Previously, we extended the Spark specific classes with ScalaStack, but that created issues in the build due to the bytecode dependencies that linked back to SQLExecPlugin.class.
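
As a rough illustration of the wrapping approach described above, here is a minimal sketch for the Scala 2.13 side only. It is an assumption about the general shape, not the code from this PR: the method set shown and the choice of scala.collection.mutable.Stack as the backing collection are hypothetical. The point is that the wrapper holds the Scala-version-specific stack as a private field (composition) instead of extending it, so callers only ever see the wrapper's own type.

```scala
import scala.collection.mutable

// Hypothetical sketch: the real RapidsStack may expose different methods.
// The version-specific stack is a private field, so its type does not
// appear in the signatures of classes that use RapidsStack.
class RapidsStack[T] {
  private val stack = new mutable.Stack[T]()

  def push(elem: T): Unit = {
    stack.push(elem)
    () // discard the `this.type` returned by mutable.Stack.push
  }

  def pop(): T = stack.pop()

  def isEmpty: Boolean = stack.isEmpty

  def nonEmpty: Boolean = stack.nonEmpty

  def size: Int = stack.size
}

object RapidsStackExample extends App {
  val s = new RapidsStack[String]()
  s.push("a")
  s.push("b")
  println(s.pop())
  println(s.size)
}
```

A Scala 2.12 variant would presumably keep the same public surface and swap only the private field for whichever stack class that build uses.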

…2.13 Stack classes to handle build issues

Signed-off-by: Navin Kumar <[email protected]>
@NVnavkumar NVnavkumar requested review from pxLi and jlowe October 30, 2023 18:47
@NVnavkumar
Collaborator Author

From Jason's simple reproduce with this branch:

$ mvn clean install -pl sql-plugin -am -Dcuda.version=cuda11 -DskipTests -Dskip -Dbuildver=331
$ javap -cp sql-plugin/target/spark331/rapids-4-spark-sql_2.13-23.12.0-SNAPSHOT-spark331.jar com.nvidia.spark.rapids.SQLExecPlugin | grep compose
  public <A> scala.Function1<A, scala.runtime.BoxedUnit> compose(scala.Function1<A, org.apache.spark.sql.SparkSessionExtensions>);

With this branch, the A$ no longer appears in the compose signature after mvn clean install.

Collaborator

@gerashegalov gerashegalov left a comment


We should add a repro to .github/workflows/mvn-verify-check.yml or premerge

jlowe
jlowe previously approved these changes Oct 30, 2023
@NVnavkumar
Collaborator Author

build

@NVnavkumar
Collaborator Author

We should add a repro to .github/workflows/mvn-verify-check.yml or premerge

Should we do this in a follow up issue for all the "unshimmed" classes?

@gerashegalov
Collaborator

We should add a repro to .github/workflows/mvn-verify-check.yml or premerge

Should we do this in a follow up issue for all the "unshimmed" classes?

That is a good idea.

gerashegalov
gerashegalov previously approved these changes Oct 30, 2023
@NVnavkumar
Collaborator Author

We should add a repro to .github/workflows/mvn-verify-check.yml or premerge

Should we do this in a follow up issue for all the "unshimmed" classes?

That is a good idea.

Filed #9578

@NVnavkumar
Collaborator Author

build

@sameerz sameerz added the task Work required that improves the product but is not user facing label Oct 30, 2023
@pxLi
Collaborator

pxLi commented Oct 31, 2023

CI failed with a core dump. This could be related to cudf/JNI changes; I will file a separate ticket, #9582. Let me retrigger the CI here.


[2023-10-30T21:16:58.548Z] GpuSortRetrySuite:
[2023-10-30T21:16:58.549Z] - GPU out-of-core sort without OOM failures *** FAILED ***
[2023-10-30T21:16:58.549Z]   java.lang.AssertionError: assertion failed
[2023-10-30T21:16:58.549Z]   at scala.Predef$.assert(Predef.scala:264)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.GpuSorter.$anonfun$mergeSortAndCloseWithRetry$1(SortUtils.scala:248)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.GpuSorter.$anonfun$mergeSortAndCloseWithRetry$1$adapted(SortUtils.scala:247)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.ArmScalaSpecificImpl.closeOnExcept(ArmScalaSpecificImpl.scala:40)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.ArmScalaSpecificImpl.closeOnExcept$(ArmScalaSpecificImpl.scala:37)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:24)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.GpuSorter.mergeSortAndCloseWithRetry(SortUtils.scala:247)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput(GpuSortExec.scala:463)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$3(GpuSortExec.scala:573)
[2023-10-30T21:16:58.549Z]   at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-10-30T21:16:58.549Z]   ...
[2023-10-30T21:16:59.477Z] #
[2023-10-30T21:16:59.478Z] # A fatal error has been detected by the Java Runtime Environment:
[2023-10-30T21:16:59.478Z] #
[2023-10-30T21:16:59.478Z] #  SIGSEGV (0xb) at pc=0x00007fc25d7e8ab0, pid=2625, tid=0x00007fc3be149700
[2023-10-30T21:16:59.478Z] #
[2023-10-30T21:16:59.478Z] # JRE version: OpenJDK Runtime Environment (8.0_382-b05) (build 1.8.0_382-8u382-ga-1~20.04.1-b05)
[2023-10-30T21:16:59.478Z] # Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
[2023-10-30T21:16:59.479Z] # Problematic frame:
[2023-10-30T21:16:59.479Z] # C  0x00007fc25d7e8ab0
[2023-10-30T21:16:59.479Z] #
[2023-10-30T21:16:59.479Z] # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8240-scala-213/scala2.13/tests/core or core.2625
[2023-10-30T21:16:59.479Z] #
[2023-10-30T21:16:59.479Z] # An error report file with more information is saved as:
[2023-10-30T21:16:59.479Z] # /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-8240-scala-213/scala2.13/tests/hs_err_pid2625.log
[2023-10-30T21:16:59.479Z] #
[2023-10-30T21:16:59.479Z] # If you would like to submit a bug report, please visit:
[2023-10-30T21:16:59.479Z] #   http://bugreport.java.com/bugreport/crash.jsp
[2023-10-30T21:16:59.479Z] # The crash happened outside the Java Virtual Machine in native code.
[2023-10-30T21:16:59.479Z] # See problematic frame for where to report the bug.

@pxLi
Collaborator

pxLi commented Oct 31, 2023

build

@pxLi
Collaborator

pxLi commented Oct 31, 2023

Hmm, CI is still failing with many assertion errors on the CI machine (A30 + 32-core CPU). I am not seeing a repro locally; this may be related to the number of cores.

[2023-10-31T00:55:26.037Z] - IGNORE ORDER: test sort agg with first and last string deterministic case *** FAILED ***
[2023-10-31T00:55:26.037Z]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ci-scala213-jenkins-rapids-premerge-github-8242-wb2k3-368mv executor driver): java.lang.AssertionError: assertion failed
[2023-10-31T00:55:26.037Z] 	at scala.Predef$.assert(Predef.scala:264)
[2023-10-31T00:55:26.037Z] 	at com.nvidia.spark.rapids.GpuSorter.$anonfun$mergeSortAndCloseWithRetry$1(SortUtils.scala:248)
[2023-10-31T00:55:26.037Z] 	at com.nvidia.spark.rapids.GpuSorter.$anonfun$mergeSortAndCloseWithRetry$1$adapted(SortUtils.scala:247)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ArmScalaSpecificImpl.closeOnExcept(ArmScalaSpecificImpl.scala:40)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ArmScalaSpecificImpl.closeOnExcept$(ArmScalaSpecificImpl.scala:37)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:24)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuSorter.mergeSortAndCloseWithRetry(SortUtils.scala:247)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput(GpuSortExec.scala:463)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$3(GpuSortExec.scala:573)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:572)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:242)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:247)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:227)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:608)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
[2023-10-31T00:55:26.038Z] 	at scala.Option.map(Option.scala:242)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:247)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:227)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:751)
[2023-10-31T00:55:26.038Z] 	at scala.Option.getOrElse(Option.scala:201)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.next(Iterator.scala:577)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
[2023-10-31T00:55:26.038Z] 	at scala.Option.map(Option.scala:242)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:287)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-10-31T00:55:26.038Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-10-31T00:55:26.038Z] 	at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2023-10-31T00:55:26.038Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2023-10-31T00:55:26.038Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-10-31T00:55:26.038Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-10-31T00:55:26.038Z] 	at java.lang.Thread.run(Thread.java:750)

@pxLi
Collaborator

pxLi commented Oct 31, 2023

build

Signed-off-by: Navin Kumar <[email protected]>
@NVnavkumar NVnavkumar dismissed stale reviews from gerashegalov and jlowe via 9f12c27 October 31, 2023 01:31
@NVnavkumar
Collaborator Author

build

@pxLi pxLi merged commit d157306 into NVIDIA:branch-23.12 Oct 31, 2023
37 checks passed
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] non-deterministic compiled SQLExecPlugin.class with scala 2.13 deployment
5 participants