Add support for self-contained profiling #10870
Conversation
Signed-off-by: Jason Lowe <[email protected]>
My main concern is really about testing. I know that this feature is not done, and it is not ready for customers yet. But I also am a bit concerned about code rot. Are there any plans to have some kind of regular testing to be sure that this still works?
stageRanges = new RangeConfMatcher(conf, RapidsConf.PROFILE_STAGES)
driverPollMillis = conf.profileDriverPollMillis
if (timeRanges.isDefined && (stageRanges.nonEmpty || jobRanges.nonEmpty)) {
  throw new UnsupportedOperationException(
Does the job crash if this happens? Would it be better to just have a priority on which one to use, and then WARN the user if it happens, possibly also adding a warning to the file being written? Especially when other places eat exceptions from the timer and just log them. But I understand why that might be happening, so I can see it either way.
Yes, it crashes the job on startup. I wanted it to fail fast rather than silently ignore an explicit request by the user that may take a long time and ultimately not capture the type of profile they want.
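For context, a minimal sketch of the fail-fast check being discussed, based on the diff fragment above; the error message here is illustrative, not the PR's exact text:

// Illustrative only: reject mutually exclusive profiling range configs at
// startup rather than silently picking one and capturing the wrong profile.
if (timeRanges.isDefined && (stageRanges.nonEmpty || jobRanges.nonEmpty)) {
  throw new UnsupportedOperationException(
    "Time ranges cannot be combined with job or stage ranges for profiling")
}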
private var isProfileActive = false

def init(pluginCtx: PluginContext, conf: RapidsConf): Unit = {
  require(writer.isEmpty, "Already initialized")
Do we want lots of warnings if profiling is enabled? I am a bit concerned about someone leaving this on in production by accident.
There is a warning logged on the driver when profiling is started for each executor that is profiling so the user can easily locate the profiles. I'll also add a warning log on the executor, both to indicate profiling is enabled if all we have is the executor log and also as another pointer to the profile output path.
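A rough sketch of the kind of executor-side warning described here, using plain SLF4J; the object and method names are assumptions, not the PR's actual code:

import org.slf4j.LoggerFactory

object ProfilerStartupLog {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical helper: make enabled profiling hard to miss in the executor
  // log and point the reader at the profile output location.
  def warnProfilingEnabled(executorId: String, outPath: String): Unit = {
    log.warn(s"Profiling is enabled on executor $executorId; " +
      s"writing profile data to $outPath")
  }
}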
private def openOutput(codec: Option[CompressionCodec]): WritableByteChannel = {
  val hadoopConf = pluginCtx.ask(ProfileInitMsg(executorId, outPath.toString))
    .asInstanceOf[SerializableConfiguration].value
  val fs = outPath.getFileSystem(hadoopConf)
Can we have some kind of follow-on issue to make this a more generic service/API? We have a number of other debug tools that want to write data out. It would be nice if we were consistent in how we got the configs to do so, and it would be even nicer if we didn't have to worry about it for each debug operation being done.
Filed #10892.
Sorry, forgot to hit the approve button on the review. My comments are all nits and would not block this from going in.
Filed #10893 to track adding automated testing.
import com.nvidia.spark.rapids.jni.Profiler
import org.apache.hadoop.fs.Path
import org.apache.hadoop.ipc.CallerContext
Could we use an alternative to CallerContext? It was introduced in Hadoop 2.8 and Hadoop 3.0, so it may fail for users on older Hadoop versions.
I don't know of an alternative to retrieve the job ID associated with a task because it's not in the TaskContext. Note that we're not using CallerContext unless job ranges are specified, so it shouldn't be an issue unless job-level profiling is requested. Spark bundles the Hadoop client jars, so this will only be an issue if Spark was built with the hadoop-2 profile which compiles against Hadoop 2.7.4 by default.
I'll change this to something that is more tolerant of CallerContext not being there and fail fast if job ranges are requested when it's not available. If you know of a better way to get the job ID of a task on the executor, I'm happy to explore it.
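A hedged sketch of the reflection-based approach mentioned here (the PR's actual helper may be structured differently); CallerContext.getCurrent and getContext are the Hadoop 2.8+ methods being looked up:

import scala.util.Try

// Look up Hadoop's CallerContext reflectively so this code still loads on
// Hadoop versions older than 2.8 that lack the class entirely.
def currentCallerContext(): Option[String] = Try {
  val clazz = Class.forName("org.apache.hadoop.ipc.CallerContext")
  val current = clazz.getMethod("getCurrent").invoke(null)
  Option(current).map { ctx =>
    clazz.getMethod("getContext").invoke(ctx).asInstanceOf[String]
  }
}.toOption.flatten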
build
* Add support for self-contained profiling
* Use Scala regex, add executor-side logging on profile startup/shutdown
* Use reflection to handle potentially missing Hadoop CallerContext
* Scala 2.13 fix

Signed-off-by: Jason Lowe <[email protected]>
Contributes to #10632. Depends on NVIDIA/spark-rapids-jni#2066.
Adds the ability for the RAPIDS Accelerator to collect CUDA and NVTX profiling data with no extra requirements on the execution environment (i.e., Nsight Systems and the CUDA toolkit do not need to be installed). Profiling is enabled by setting spark.rapids.profile.pathPrefix to a URI prefix where profile data will be written (e.g., an HDFS or other distributed filesystem or object store path). Profiling data is written ZSTD-compressed by default and, once downloaded and decompressed, can be converted into other formats via the spark_rapids_profile_converter tool provided by NVIDIA/spark-rapids-jni#2066.
By default only executor 0 is profiled from startup until shutdown, but there are a number of configs to control which executors are profiled and what portions of the executor lifespan are profiled. Time ranges (in seconds since startup) or job- and stage-specific ranges are supported, where the profiler will be started and stopped accordingly.
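As a usage sketch, enabling the feature from Scala might look like the following; spark.rapids.profile.pathPrefix is the config named in this PR, while the app name and output path are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical session setup: the pathPrefix config comes from this PR; the
// spark.plugins setting is the standard way to enable the RAPIDS Accelerator.
val spark = SparkSession.builder()
  .appName("profiled-app")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.profile.pathPrefix", "hdfs:///tmp/rapids-profiles")
  .getOrCreate()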