2023-12-04

Meeting Link

https://teams.microsoft.com/l/meetup-join/19%3ameeting_ODdkYTI4MDUtYTZkZS00MjQxLTg0YjEtYzJlYWQ2NTUxZmY3%40thread.v2/0?context=%7b%22Tid%22%3a%223dd8961f-e488-4e60-8e11-a82d994e183d%22%2c%22Oid%22%3a%22283778cd-dac5-4948-96b5-f82885ad5d24%22%7d

Follow-up

N/A

Agenda

  1. Brainstorm OCCA tools design for profiling OCCA apps

Notes

  • Resources

  • Participant comments/thoughts

    • tracing start/stop are global triggers
      • need more granularity? tracing specific device or kernel or memory object?
      • tagging a specific kernel, "everything that calls this will be traced"
    • occaprof, as well as code-level profiling
    • "i want a tool like nvprof"
    • brice and thomas have iprof (see THAPI below).
      • have cuda/hip/ocl/omp support. reuse this
      • and add hooks for occa?
    • output:
      • traces stashed in default location
      • perfetto support probably easy to add if not already there
    • mpirun -n 8 occa_prof ./a.out -m HIP ...
    • nvprof reports time spent in kernels by default
    • to the first order users want:
      • time per kernel and memory traffic
      • what about host-side stuff like mpi?
    • rename cu and roc to occa? when the user sees reported cu kernels, it is not clear where they are in their code
    • we could have a very invasive "track all the things" and keep our own records inside occa
      • does this preclude iprof?
      • occa api calls we can do
      • opencl? profiling mode? can do event timings in sycl
      • openmp? probably can do it there
        • occa openmp kernels are blocking so we can do gettimeofday
      • metal is a challenge
      • cuda and hip are fine
      • is occa responsible for generating consistent output format across vendors?
        • or fall back to vendor-specific output format?
      • add events to every occa kernel call?
        • some vendors implement event pools
      • multi-stream?
      • timer aggregation confuses the user in the presence of overlap
    • how do we get device timings?
    • wrap around different vendor profiling tools? consume their output and unify it. or dump out what the vendor outputs.
      • do you want timers at the occa api level?
      • awkwardness: vendors deprecated tools
        • wrappers around command line tools vs calling vendor-provided APIs
        • fallback? what if thapi goes away?
    • ok with having an unsupported mode in the short-term?
  • THAPI

    • thapi context:
      • something lower overhead than hpctoolkit/vtune/whatever
      • trace stuff
      • created because intel didn't have an 'nvprof'
      • started with an l0 backend
      • then did hip and cudart/cuda driver
      • now omp
      • uses lttng as the infrastructure for tracing
    • can occa give thapi some occa callbacks?
    • thapi gets us first usecase with traces
    • hw counters? papi. vendors need to provide hooks into papi for device metrics?
    • can iprof do timelines in csv format?
      • no json
      • generate protobuf (max 2GiB) and view in perfetto
    • run with nekrs? yes
    • still proof of concept? development? or production ready?
      • kris: people are using it now.
      • tom: still work to do but it's very usable now
    • how easy is it to build and deploy?
      • challenging. use spack.
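
Two points from the notes above can be illustrated with a small sketch: host-side timing around a blocking kernel call, and why aggregated timers mislead when kernels overlap across streams. This is not OCCA or THAPI code. The function names and the trace are hypothetical, and `time.perf_counter` stands in for `gettimeofday`. The summed time per kernel exceeds the time the device was actually busy as soon as two kernels overlap.

```python
import time
from collections import defaultdict


def time_blocking_call(records, name, fn, *args):
    """Time a blocking call on the host and append (name, start, end) to records.

    Only meaningful when fn returns after the work has finished,
    as the notes say is the case for OCCA OpenMP kernels.
    """
    start = time.perf_counter()
    result = fn(*args)
    end = time.perf_counter()
    records.append((name, start, end))
    return result


def time_per_kernel(records):
    """Sum the duration of every record, grouped by kernel name."""
    totals = defaultdict(float)
    for name, start, end in records:
        totals[name] += end - start
    return dict(totals)


def busy_time(records):
    """Length of the union of all intervals, i.e. time with at least one kernel running."""
    busy = 0.0
    current_start = current_end = None
    for _, start, end in sorted(records, key=lambda r: r[1]):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged interval.
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged interval if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return busy


# Hypothetical two-stream trace: (kernel name, start, end) in seconds.
trace = [
    ("axpy", 0.0, 4.0),  # stream 0
    ("dot", 1.0, 3.0),   # stream 1, fully overlapped with the first axpy
    ("axpy", 5.0, 6.0),  # stream 0
]

per_kernel = time_per_kernel(trace)
print(per_kernel)                # {'axpy': 5.0, 'dot': 2.0}
print(sum(per_kernel.values()))  # 7.0
print(busy_time(trace))          # 5.0
```

Here the per-kernel totals add up to 7.0 s while the device was busy for only 5.0 s, so a report that shows only aggregated timers should probably also state the busy or wall-clock time.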

Action Items

  • Set up next week's meeting

Next Meeting

  • Discuss THAPI interfacing to OCCA