[FEA] Simplified Nsight tracing #10632

jlowe · 2024-03-25T19:23:36Z

Is your feature request related to a problem? Please describe.
It's currently complicated to set up and collect an Nsight Systems trace of one or more executors, especially in non-standalone environments. There needs to be a simpler solution so users can collect these traces easily.

Describe the solution you'd like
A new config, e.g. spark.rapids.nsight.tracePrefix, that specifies a URI prefix where Nsight traces will be stored. Setting this config indicates that the user wants tracing enabled on all executors. The Nsight tracing libraries would be included in the jar and loaded by each executor before the CUDA context is established, so that tracing is active from the start. On executor shutdown, tracing would be stopped and the trace data collected and uploaded to the URI prefix with a unique ID appended (e.g. application ID and executor ID). Ideally the trace data is already a qdrep file ready to be loaded into the Nsight Systems viewer. Once the data is written, the executor should send a message back to the driver so the driver can log where each executor placed its trace file.
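To make the naming idea concrete, here is a minimal sketch of how an executor might derive its upload location from the prefix. The function name `trace_destination`, the `<appId>-<executorId>` layout, and the default `qdrep` extension are all assumptions for illustration, not an existing API or a settled format.

```python
def trace_destination(prefix, app_id, executor_id, extension="qdrep"):
    """Build a per-executor trace URI by appending a unique ID to the prefix.

    Hypothetical helper: the "<appId>-<executorId>.<ext>" layout is an
    assumption. The prefix is used as-is, so it may end in a directory
    separator or in a partial file name.
    """
    if not prefix:
        raise ValueError("trace prefix must be non-empty")
    return f"{prefix}{app_id}-{executor_id}.{extension}"


# Example: executor 3 of one application writing under an HDFS prefix.
print(trace_destination("hdfs://nn/traces/run1-", "app-20240325-0001", "3"))
```

Appending directly to the prefix, rather than treating it as a directory, keeps the behavior the same across HDFS, S3, and local paths, and lets the user decide whether traces land in a folder or share a file name stem.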

A separate config, e.g. spark.rapids.nsight.executor, could be used to limit which executor(s) are traced. This could be a comma-separated and/or dash-ranged list of executor IDs, where only the listed executors capture a trace. For example, 0,2-5,10 would capture traces only on executors 0, 2, 3, 4, 5, and 10. Setting it to "all" or leaving the config unset would trace all executors. Alternatively, we could trace only executor 0 by default and let the user set this to "all" if they really want all executors traced.
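A rough sketch of how such a list could be parsed, to show the intended semantics. The names `parse_executor_spec` and `should_trace` are hypothetical, and the sketch assumes executor IDs are non-negative integers and that an unset or "all" value means every executor is traced (the first of the two defaults discussed above).

```python
def parse_executor_spec(spec):
    """Parse a spec such as "0,2-5,10" into a set of executor IDs.

    Returns None to mean "trace every executor" when the spec is unset,
    empty, or "all". Hypothetical helper; assumes integer executor IDs.
    """
    if spec is None:
        return None
    text = spec.strip()
    if text == "" or text.lower() == "all":
        return None
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # Range entry like "2-5", inclusive on both ends.
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"invalid range: {part!r}")
            ids.update(range(low, high + 1))
        else:
            # Single ID; int() raises ValueError on empty or non-numeric text.
            ids.add(int(part))
    return ids


def should_trace(executor_id, spec):
    """Return True if the given executor ID is selected by the spec."""
    selected = parse_executor_spec(spec)
    return selected is None or executor_id in selected


print(sorted(parse_executor_spec("0,2-5,10")))  # [0, 2, 3, 4, 5, 10]
```

Switching the default to "executor 0 only" would just mean returning `{0}` instead of `None` for an unset spec, while still mapping "all" to `None`.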

Describe alternatives you've considered
If the libraries for tracing are too large to be included in the RAPIDS Accelerator jar by default, we could have a separate jar that is used for tracing.


jlowe · 2024-03-25T19:24:53Z

After chatting with the Nsight Systems team, we can probably cover most of our tracing needs by leveraging the CUPTI toolkit. This won't generate a qdrep file directly, but it might be easy to post-process the CUPTI trace data to generate a qdrep file from it.

jlowe added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels Mar 25, 2024

mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Mar 26, 2024

sameerz assigned jlowe May 7, 2024

This was referenced May 22, 2024

Profiler class and native code to support self-profiling NVIDIA/spark-rapids-jni#2066

Merged

Add support for self-contained profiling #10870

Merged
