[FEA] Simplified Nsight tracing #10632

Open
jlowe opened this issue Mar 25, 2024 · 1 comment
Labels: feature request

jlowe commented Mar 25, 2024

Is your feature request related to a problem? Please describe.
It's currently complicated to set up and collect an Nsight Systems trace of one or more executors, especially in non-standalone environments. There needs to be a simpler solution so users can collect these traces easily.

Describe the solution you'd like
A new config flag, e.g. spark.rapids.nsight.tracePrefix, that specifies a URI prefix where Nsight traces will be stored. Setting this config indicates that the user wants tracing enabled on all executors. The Nsight tracing libraries would be included in the jar and loaded by the executors before the CUDA context is established, to enable tracing. On executor shutdown, tracing would be stopped and the trace collected and uploaded to the URI prefix with a unique ID appended (e.g. application ID and executor ID). Ideally the trace data is already a qdrep file ready to be loaded into the Nsight Systems viewer. Once the data is written, a message should be sent back to the driver so the driver can log where each executor placed its trace file.
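To make the naming idea concrete, here is a rough sketch of how an executor might derive its upload location from the prefix. The function name, the `-` separator, and the `.qdrep` suffix are all assumptions for illustration; none of this is an existing RAPIDS Accelerator API.

```python
def build_trace_uri(prefix: str, app_id: str, executor_id: str) -> str:
    """Append a unique ID (application ID plus executor ID) to the URI prefix.

    Hypothetical helper: the separator and file extension are illustrative
    choices, not something the plugin defines today.
    """
    if not prefix:
        raise ValueError("trace prefix must be non-empty")
    # The prefix is used verbatim, so it may end in a directory separator
    # or in a partial file name, whichever the user prefers.
    return f"{prefix}{app_id}-{executor_id}.qdrep"


# Example: executor 3 of one application writing under an HDFS prefix.
uri = build_trace_uri("hdfs:///traces/run-", "app-20240325-0001", "3")
```

Using the prefix verbatim keeps the config flexible, at the cost of the user needing to include a trailing `/` when they want a directory per run.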

A separate config, e.g. spark.rapids.nsight.executor, could be used to limit which executor(s) are traced. This could be a comma-separated and/or range-dashed list of executor IDs, where only the listed executors capture a trace. For example, 0,2-5,10 would capture traces only on executors 0, 2, 3, 4, 5, and 10. Setting it to "all" or leaving it unset would trace all executors. Alternatively, we could trace only executor 0 by default and let the user set this to "all" if they really want every executor traced.
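The list syntax described above could be parsed roughly as follows. This is only a sketch of the proposed format: the function names are hypothetical, and it assumes the "unset or all means every executor" behavior rather than the "executor 0 by default" alternative.

```python
from typing import Optional, Set


def parse_executor_spec(spec: Optional[str]) -> Optional[Set[int]]:
    """Return the executor IDs to trace, or None to mean all executors."""
    if spec is None or spec.strip() == "" or spec.strip().lower() == "all":
        return None
    ids: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "0,,2"
        if "-" in part:
            # A dashed entry such as "2-5" is an inclusive range.
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"invalid range: {part!r}")
            ids.update(range(lo, hi + 1))
        else:
            ids.add(int(part))
    return ids


def should_trace(executor_id: str, spec: Optional[str]) -> bool:
    """Decide whether the executor with this ID should capture a trace."""
    selected = parse_executor_spec(spec)
    if selected is None:
        return True
    # Non-numeric IDs (e.g. "driver") never match an explicit list.
    return executor_id.isdigit() and int(executor_id) in selected
```

With the example from the description, `parse_executor_spec("0,2-5,10")` selects executors 0, 2, 3, 4, 5, and 10, so `should_trace("4", "0,2-5,10")` is true and `should_trace("1", "0,2-5,10")` is false.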

Describe alternatives you've considered
If the libraries for tracing are too large to be included in the RAPIDS Accelerator jar by default, we could have a separate jar that is used for tracing.

jlowe added the "feature request" and "? - Needs Triage" labels Mar 25, 2024

jlowe commented Mar 25, 2024

After chatting with the Nsight Systems team, we can probably meet most of our tracing needs by leveraging CUPTI (the CUDA Profiling Tools Interface). This won't generate a qdrep file, but it might be easy to post-process the CUPTI trace data so that we can generate a qdrep file from it.
