N/A
- Brainstorm OCCA tools design for profiling OCCA apps
-
Resources
- https://icl.utk.edu/papi/
- hip support https://github.com/intel/pti-gpu/blob/master/chapters/device_activity_tracing/OpenCL.md
- https://github.com/intel/pti-gpu/blob/master/chapters/runtime_api_tracing/OpenCL.md
- intel: https://github.com/intel/pti-gpu/tree/master
- https://github.com/argonne-lcf/THAPI
-
Participant comments/thoughts
- tracing start/stop are global triggers
- need more granularity? tracing specific device or kernel or memory object?
- tagging a specific kernel, "everything that calls this will be traced"
- occaprof, as well as code-level profiling
- "i want a tool like nvprof"
- brice and thomas have iprof (see THAPI below).
- have cuda/hip/ocl/omp support. reuse this
- and add hooks for occa?
- output:
- traces stashed in default location
- perfetto support probably easy to add if not already there
- mpirun -n 8 occa_prof ./a.out -m HIP ...
- nvprof time spent in kernels by default
- to the first order users want:
- time per kernel and memory traffic
- what about host-side stuff like mpi?
- rename cu and roc to occa? not clear when user sees reported cu kernels where they are in their code
- we could have a very invasive "track all the things" and keep our own
- records inside occa
- does this preclude iprof?
- occa api calls we can do
- opencl? profiling mode? can do event timings in sycl
- openmp? probably can do it there
- occa openmp kernels are blocking so we can do gettimeofday
- metal is a challenge
- cuda and hip are fine
- is occa responsible for generating consistent output format across
vendors?
- or fall back to vendor-specific output format?
- add events to every occa kernel call?
- some vendors implement event pools
- multi-stream?
- timer aggregration confuses the user in the presence of overlap
- how do we get device timings?
- wrap around different vendor profiling tools? consume their output and
- unify it. or dump out what the vendor outputs.
- do you want timers at the occa api level?
- awkwardness: vendors deprecated tools
- wrappers around command line tools vs calling vendor-provided APis
- fallback? what if thapi goes away?
- ok with having an unsupported mode in the short-term?
- tracing start/stop are global triggers
-
THAPI
- thapi context:
- something lower overhead than hpctoolkit/vtune/whatever
- trace stuff
- created because intel didn't have an 'nvprof'
- started with an l0 backend
- then did hip and cudart/cuda driver
- now omp
- uses lttng is the infrastructure for tracing
- can occa give thapi some occa callbacks?
- this flips around the semantics of how to use thapi
- are we adding occa as a backend to thapi?
- or are we wrapping thapi from occa?
- thapi interface with omp using the ompt interface
- thapi inserts events for device timings
- thapi gets us first usecase with traces
- hw counters? papi. vendors need to provide hooks into papi for device metrics?
- can iprof do timelines in csv format?
- no json
- generate protobuf (max 2GiB) and view in perfetto
- run with nekrs? yes
- still proof of concept? development? or production ready?
- kris: people are using it now.
- tom: still work to do but it's very usable now
- how easy is it to build and deploy?
- challenging. use spack.
- thapi context:
- Set up next week's meeting
- Discuss THAPI interfacing to OCCA