2023-12-04

Meeting Link

https://teams.microsoft.com/l/meetup-join/19%3ameeting_ODdkYTI4MDUtYTZkZS00MjQxLTg0YjEtYzJlYWQ2NTUxZmY3%40thread.v2/0?context=%7b%22Tid%22%3a%223dd8961f-e488-4e60-8e11-a82d994e183d%22%2c%22Oid%22%3a%22283778cd-dac5-4948-96b5-f82885ad5d24%22%7d

Follow-up

N/A

Agenda

Brainstorm OCCA tools design for profiling OCCA apps

Notes

Resources
Participant comments/thoughts
- tracing start/stop are global triggers
  - need more granularity? tracing specific device or kernel or memory object?
  - tagging a specific kernel, "everything that calls this will be traced"
- occaprof, as well as code-level profiling
- "i want a tool like nvprof"
- brice and thomas have iprof (see THAPI below).
  - have cuda/hip/ocl/omp support. reuse this
  - and add hooks for occa?
- output:
- traces stashed in default location
- perfetto support probably easy to add if not already there
- mpirun -n 8 occa_prof ./a.out -m HIP ...
- nvprof time spent in kernels by default
- to the first order users want:
  - time per kernel and memory traffic
  - what about host-side stuff like mpi?
- rename cu and roc to occa? not clear when user sees reported cu kernels where they are in their code
- we could have a very invasive "track all the things" and keep our own
- records inside occa
  - does this preclude iprof?
  - occa api calls we can do
  - opencl? profiling mode? can do event timings in sycl
  - openmp? probably can do it there
    - occa openmp kernels are blocking so we can do gettimeofday
  - metal is a challenge
  - cuda and hip are fine
  - is occa responsible for generating consistent output format across vendors?
    - or fall back to vendor-specific output format?
  - add events to every occa kernel call?
    - some vendors implement event pools
  - multi-stream?
  - timer aggregration confuses the user in the presence of overlap
- how do we get device timings?
- wrap around different vendor profiling tools? consume their output and
- unify it. or dump out what the vendor outputs.
  - do you want timers at the occa api level?
  - awkwardness: vendors deprecated tools
    - wrappers around command line tools vs calling vendor-provided APis
    - fallback? what if thapi goes away?
- ok with having an unsupported mode in the short-term?
THAPI
- thapi context:
  - something lower overhead than hpctoolkit/vtune/whatever
  - trace stuff
  - created because intel didn't have an 'nvprof'
  - started with an l0 backend
  - then did hip and cudart/cuda driver
  - now omp
  - uses lttng is the infrastructure for tracing
- can occa give thapi some occa callbacks?
  - this flips around the semantics of how to use thapi
  - are we adding occa as a backend to thapi?
  - or are we wrapping thapi from occa?
  - thapi interface with omp using the ompt interface
    - https://www.openmp.org/spec-html/5.2/openmpch19.html#x364-37700019
  - thapi inserts events for device timings
- thapi gets us first usecase with traces
- hw counters? papi. vendors need to provide hooks into papi for device metrics?
- can iprof do timelines in csv format?
  - no json
  - generate protobuf (max 2GiB) and view in perfetto
- run with nekrs? yes
- still proof of concept? development? or production ready?
  - kris: people are using it now.
  - tom: still work to do but it's very usable now
- how easy is it to build and deploy?
  - challenging. use spack.

Action Items

Set up next week's meeting

Next Meeting

Discuss THAPI interfacing to OCCA

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2023-12-04.md

2023-12-04.md

2023-12-04

Meeting Link

Follow-up

Agenda

Notes

Action Items

Next Meeting

Files

2023-12-04.md

Latest commit

History

2023-12-04.md

File metadata and controls

2023-12-04

Meeting Link

Follow-up

Agenda

Notes

Action Items

Next Meeting