Add support for self-contained profiling #10870
Conversation
Signed-off-by: Jason Lowe <[email protected]>
My main concern is really about testing. I know that this feature is not done, and it is not ready for customers yet. But I also am a bit concerned about code rot. Are there any plans to have some kind of regular testing to be sure that this still works?
stageRanges = new RangeConfMatcher(conf, RapidsConf.PROFILE_STAGES)
driverPollMillis = conf.profileDriverPollMillis
if (timeRanges.isDefined && (stageRanges.nonEmpty || jobRanges.nonEmpty)) {
  throw new UnsupportedOperationException(
Does the job crash if this happens? Would it be better to just have a priority on which one to use, and then WARN the user if it happens, possibly also adding a warning to the file being written? Especially when other places eat exceptions from the timer and just log them. But I understand why that might be happening, so I can see it either way.
Yes, it crashes the job on startup. I wanted it to fail fast rather than silently ignore an explicit request by the user that may take a long time and ultimately not capture the type of profile they want.
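For context, a minimal sketch of the fail-fast check being discussed, based on the diff fragment above; the error message here is illustrative, not the PR's exact text:

// Illustrative only: reject mutually exclusive profiling range configs at
// startup rather than silently picking one and capturing the wrong profile.
if (timeRanges.isDefined && (stageRanges.nonEmpty || jobRanges.nonEmpty)) {
  throw new UnsupportedOperationException(
    "Time ranges cannot be combined with job or stage ranges for profiling")
}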
private var isProfileActive = false

def init(pluginCtx: PluginContext, conf: RapidsConf): Unit = {
  require(writer.isEmpty, "Already initialized")
Do we want lots of warnings if profiling is enabled? I am a bit concerned about someone leaving this on in production by accident.
There is a warning logged on the driver when profiling is started for each executor that is profiling so the user can easily locate the profiles. I'll also add a warning log on the executor, both to indicate profiling is enabled if all we have is the executor log and also as another pointer to the profile output path.
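A rough sketch of the kind of executor-side warning described here, using plain SLF4J; the object and method names are assumptions, not the PR's actual code:

import org.slf4j.LoggerFactory

object ProfilerStartupLog {
  private val log = LoggerFactory.getLogger(getClass)

  // Hypothetical helper: make enabled profiling hard to miss in the executor
  // log and point the reader at the profile output location.
  def warnProfilingEnabled(executorId: String, outPath: String): Unit = {
    log.warn(s"Profiling is enabled on executor $executorId; " +
      s"writing profile data to $outPath")
  }
}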
private def openOutput(codec: Option[CompressionCodec]): WritableByteChannel = {
  val hadoopConf = pluginCtx.ask(ProfileInitMsg(executorId, outPath.toString))
    .asInstanceOf[SerializableConfiguration].value
  val fs = outPath.getFileSystem(hadoopConf)
Can we have some kind of follow-on issue to make this a more generic service/API? We have a number of other debug tools that want to write data out. It would be nice if we were consistent in how we got the configs to do so, and it would be even nicer if we didn't have to worry about it for each debug operation being done.
Filed #10892.
Sorry, forgot to hit the approve button on the review. My comments are all nits and would not block this from going in.
Filed #10893 to track adding automated testing.
import com.nvidia.spark.rapids.jni.Profiler
import org.apache.hadoop.fs.Path
import org.apache.hadoop.ipc.CallerContext
Could we use an alternative to CallerContext? It was introduced in Hadoop 2.8 and Hadoop 3.0, so it may fail for users on older Hadoop versions.
I don't know of an alternative to retrieve the job ID associated with a task because it's not in the TaskContext. Note that we're not using CallerContext unless job ranges are specified, so it shouldn't be an issue unless job-level profiling is requested. Spark bundles the Hadoop client jars, so this will only be an issue if Spark was built with the hadoop-2 profile which compiles against Hadoop 2.7.4 by default.
I'll change this to something that is more tolerant of CallerContext not being there and fail fast if job ranges are requested when it's not available. If you know of a better way to get the job ID of a task on the executor, I'm happy to explore it.
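A hedged sketch of the reflection-based approach mentioned here (the PR's actual helper may be structured differently); CallerContext.getCurrent and getContext are the Hadoop 2.8+ methods being looked up:

import scala.util.Try

// Look up Hadoop's CallerContext reflectively so this code still loads on
// Hadoop versions older than 2.8 that lack the class entirely.
def currentCallerContext(): Option[String] = Try {
  val clazz = Class.forName("org.apache.hadoop.ipc.CallerContext")
  val current = clazz.getMethod("getCurrent").invoke(null)
  Option(current).map { ctx =>
    clazz.getMethod("getContext").invoke(ctx).asInstanceOf[String]
  }
}.toOption.flatten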
build
* Add support for self-contained profiling
* Use Scala regex, add executor-side logging on profile startup/shutdown
* Use reflection to handle potentially missing Hadoop CallerContext
* Scala 2.13 fix

Signed-off-by: Jason Lowe <[email protected]>
Contributes to #10632. Depends on NVIDIA/spark-rapids-jni#2066.
Adds the ability for the RAPIDS Accelerator to collect CUDA and NVTX profiling data with no extra requirements on the execution environment (i.e., Nsight Systems and the CUDA toolkit do not need to be installed). Profiling is enabled by setting spark.rapids.profile.pathPrefix to a URI prefix where profile data will be written (e.g., an HDFS or other distributed filesystem or object store path). Profiling data is written ZSTD-compressed by default and, once downloaded and decompressed, can be converted into other formats via the spark_rapids_profile_converter tool provided by NVIDIA/spark-rapids-jni#2066.
By default only executor 0 is profiled from startup until shutdown, but there are a number of configs to control which executors are profiled and what portions of the executor lifespan are profiled. Time ranges (in seconds since startup) or job- and stage-specific ranges are supported, where the profiler will be started and stopped accordingly.
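As a usage sketch, enabling the feature from Scala might look like the following; spark.rapids.profile.pathPrefix is the config named in this PR, while the app name and output path are hypothetical:

import org.apache.spark.sql.SparkSession

// Hypothetical session setup: the pathPrefix config comes from this PR; the
// spark.plugins setting is the standard way to enable the RAPIDS Accelerator.
val spark = SparkSession.builder()
  .appName("profiled-app")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.profile.pathPrefix", "hdfs:///tmp/rapids-profiles")
  .getOrCreate()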