Omnitrace docs refactoring (ROCm#353)
* Add Sphinx and Read the Docs configs

* Add documentation workflow configurations

* Changed macros verbprintf and verbprintf_bare so they write to stdout… (ROCm#346)

Flush stdout when listing keys + bump verbose level for GPU count

* Removing static version asserts. (ROCm#347)

It is causing failures on our internal builds

Signed-off-by: David Galiffi <[email protected]>

* Check for an empty vector before popping (ROCm#350)

Protect from possible seg. fault

Signed-off-by: David Galiffi <[email protected]>

* Add release links to installation.md (ROCm#351)

* Initial infrastructure rework for Omnitrace refactoring and a rewrite of the What is file

* Add files in conceptual section, along with images and infrastructure changes.

* Formatting and style fixes for files in conceptual directory

* Add quick start install guide and fix spelling errors in other files

* Add install document and fix code tags. Infrastructure changes

* Add two how-to guides along with infra changes and spelling fixes

* Add two new how to files and fix errors in the last commit

* Fix spelling mistakes

* Add new how to file on causal profiling and infra changes.

* Add how to file on interpreting Omnitrace output, fixes, and images

* Add remaining how-to guides and reference materials along with fixes and infrastructure

* Add YouTube file and fix spelling and formatting

* Fix a few loose ends and add link to license page

* Add Sphinx and Doxygen infrastructure and some additional corrections

* Update rocm-docs-core

* Fix Doxyfile

* Fix path to API header files

* Run doxysphinx in conf.py

* Add back custom css for doxygen

* Remove doxygenlayout

* Add api to toc

* Update Doxyfile

Generate from source .in

* Proofreading edits and other changes

* Add .gitignore for Doxygen and remove deprecated words and typos

* Fix one additional typo

* Turn off dot

* Update doxyfile strip from path

* Workflow, submodules, and thread info Updates (ROCm#352)

* Update CI workflows

- use node20 workflow packages

* Update tests/source/CMakeLists.txt

- Use OMNITRACE_TRACE and OMNITRACE_PROFILE instead of perfetto/timemory

* Update timemory submodule

- argparse: requires -> required
- parse callbacks

* Update thread_info.cpp

- fix causal::delay::get_local usage

* Update timemory submodule

* Update kokkos submodule

- release 3.7.02

* Revert opensuse.yml and ubuntu-bionic.yml to use node16 workflows

* Update docs.yml

* ROCm 6.1 Installers (ROCm#349)

* Add ROCm 6.1 to packages
* Bump version to 1.11.3
* Add 6.1 support to the docker build support.
   Simplified this by adding 6.* to case statements, now that repo links have been standardized.

* Update timemory submodule (ROCm#354)

- fix argparse::argument::required template deduction

* Build omnitrace-rt library (ROCm#355)

* Build omnitrace-rt library

- Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT
- Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so
- Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree
- Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt

* Update source/lib/omnitrace-rt/cmake/platform.cmake

* Use ftpmirror.gnu.org instead of ftp.gnu.org

- in timemory and dyninst submodules
- minor .clang-tidy tweak

* Executables append omnitrace library directory to LD_LIBRARY_PATH (ROCm#356)

- omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the LD_LIBRARY_PATH with the directory containing the omnitrace libraries
  - this helps ensure that binary rewritten exes can resolve omnitrace-rt library location

* Fix a few typos and formatting issues

* Additional fixes and minor formatting changes.

* More fixes and minor formatting changes.

* Complete second proofreading with fixes and minor formatting changes.

* Make changes to table of contents and disable linting

* Update links in the README doc to reflect the new structure.

* Align intro on the Omnitrace index page with the first paragraph of the what-is page

* Changes and edits based on review comments

* Additional changes and edits based on external review

* Additional updates and changes from the external review of Omnitrace

* Additional changes based on the external review

* New round of edits based on the external review

* Additional edits based on the external review

* Changes to address comments from the internal review

* Correct to the RHEL SELinux note in the troubleshooting guide

* One additional change to the development guide code example

* Move troubleshooting to post-install of install.rst and other minor edits.

* Remove troubleshooting page and modify new post-install troubleshooting section on install.rst

* Refactor the how Omnitrace works page into separate topics and redo infrastructure

* API ToC changes

* Additional API and ToC changes

* Back out API and ToC changes and update requirements.txt

* Additional API and ToC changes

* Add commit for signing purposes

* Add ElfUtils and BinUtils Download URL Overrides (ROCm#358)

* Add CMake CACHE Variable ElfUtils_DOWNLOAD_URL

Used to override the default URL to download ElfUtils from.
Useful for internal builds

Also, include a mirror to fallback to if the override URL fails.

* Update timemory submodule

Updating to include the BINUTIL_DOWNLOAD_URL override cmake
variable.

---------

Signed-off-by: David Galiffi <[email protected]>

* Remove Ubuntu 18.04 and SUSE 15.2

* Update checkout action to v4

* Add `docs/**` to `paths-ignore`

Document location is being refactored.

* Modified submodules dyninst and timemory. (ROCm#361)

---------

Signed-off-by: David Galiffi <[email protected]>
Co-authored-by: Peter Jun Park <[email protected]>
Co-authored-by: ajanicijamd <[email protected]>
Co-authored-by: David Galiffi <[email protected]>
Co-authored-by: Jonathan R. Madsen <[email protected]>
Co-authored-by: Sam Wu <[email protected]>
6 people committed Jul 29, 2024
1 parent f0bd912 commit cb6e6a6
Showing 38 changed files with 7,122 additions and 9 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -4,3 +4,4 @@
docs/* @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
11 changes: 11 additions & 0 deletions .github/dependabot.yml
@@ -9,3 +9,14 @@ updates:
    directory: "/" # Location of package manifests
    schedule:
      interval: "weekly"

  - package-ecosystem: "pip" # See documentation for possible values
    directory: "/docs/sphinx" # Location of package manifests
    open-pull-requests-limit: 10
    schedule:
      interval: "daily"
    labels:
      - "documentation"
      - "dependencies"
    reviewers:
      - "samjwu"
4 changes: 4 additions & 0 deletions .gitignore
@@ -37,6 +37,10 @@
# Python cache files
*.pyc

# Documentation artifacts
/_build
_toc.yml

/build*
/.vscode
/.cache
18 changes: 18 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,18 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.10"

python:
  install:
    - requirements: docs/sphinx/requirements.txt

sphinx:
  configuration: docs/conf.py

formats: []
16 changes: 7 additions & 9 deletions README.md
@@ -8,8 +8,6 @@
[![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
[![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)

> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
## Overview

AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -87,8 +85,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me

## Documentation

The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.

## Quick Start

@@ -109,7 +107,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
```

See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.

### Setup

@@ -298,13 +296,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
- Select "Open trace file" from panel on the left
- Locate the omnitrace perfetto output (extension: `.proto`)

![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)

![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
![omnitrace-rocm](docs/data/omnitrace-rocm.png)

![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)

![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
![omnitrace-user-api](docs/data/omnitrace-user-api.png)

## Using Perfetto tracing with System Backend

2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,2 @@
_build/
_doxygen/
146 changes: 146 additions & 0 deletions docs/conceptual/data-collection-modes.rst
@@ -0,0 +1,146 @@
.. meta::
   :description: Omnitrace documentation and reference
   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

**********************
Data collection modes
**********************

Omnitrace supports several modes of recording trace and profiling data for your application.

.. note::

   For an explanation of the terms used in this topic, see
   the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.

+-----------------------------+---------------------------------------------------------+
| Mode                        | Description                                             |
+=============================+=========================================================+
| Binary Instrumentation      | Locates functions (and loops, if desired) in the binary |
|                             | and inserts snippets at the entry and exit              |
+-----------------------------+---------------------------------------------------------+
| Statistical Sampling        | Periodically pauses application at specified intervals  |
|                             | and records various metrics for the given call stack    |
+-----------------------------+---------------------------------------------------------+
| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
|                             | make callbacks into Omnitrace to provide information    |
|                             | about the work the API is performing                    |
+-----------------------------+---------------------------------------------------------+
| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
|                             | dynamic library/executable, like ``pthread_mutex_lock`` |
|                             | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
+-----------------------------+---------------------------------------------------------+
| User API                    | User-defined regions and controls for Omnitrace         |
+-----------------------------+---------------------------------------------------------+

The two most general and important modes are binary instrumentation and statistical sampling,
so it is important to understand their advantages and disadvantages.
Both can be performed with the ``omnitrace-instrument`` executable.
For statistical sampling alone, however, it's highly recommended to use the
``omnitrace-sample`` executable instead when binary instrumentation isn't needed.
Callback APIs and dynamic symbol interception can be used with either tool.

Binary instrumentation
-----------------------------------

Binary instrumentation lets you record deterministic measurements for
every single invocation of a given function.
Binary instrumentation effectively adds instructions to the target application to
collect the required information. It therefore has the potential to cause performance
changes which might, in some cases, lead to inaccurate results. The effect depends on
the information being collected and which features are activated in Omnitrace.
For example, collecting only the wall-clock timing data
has less of an effect than collecting the wall-clock timing, CPU-clock timing,
memory usage, cache-misses, and number of instructions that were run. Similarly,
collecting a flat profile has less overhead than a hierarchical profile
and collecting a trace OR a profile has less overhead than collecting a
trace AND a profile.

In Omnitrace, the primary heuristic for controlling the overhead with binary
instrumentation is the minimum number of instructions for selecting functions
for instrumentation.

Statistical sampling
-----------------------------------

Statistical call-stack sampling interrupts the application at
regular intervals using operating system interrupts.
Sampling is typically less numerically accurate and specific, but the
target program runs at nearly full speed.
In contrast to the data derived from binary instrumentation, the resulting
data is not exact but is instead a statistical approximation.
However, sampling often provides a more accurate picture of the application
execution because it is less intrusive to the target application and has fewer
side effects on memory caches or instruction decoding pipelines. Furthermore,
because sampling does not affect the execution speed as much, it is
relatively immune to over-evaluating the cost of small, frequently called
functions or "tight" loops.

In Omnitrace, the overhead for statistical sampling depends on the
sampling rate and whether the samples are taken with respect to the CPU time
and/or real time.

Binary instrumentation vs. statistical sampling example
-------------------------------------------------------

Consider the following code:

.. code-block:: c++

   #include <cstdio>
   #include <cstdlib>

   long fib(long n)
   {
       if(n < 2) return n;
       return fib(n - 1) + fib(n - 2);
   }

   // i is the iteration index, passed in so that it can be printed
   void run(long i, long n)
   {
       long result = fib(n);
       printf("[%li] fibonacci(%li) = %li\n", i, n, result);
   }

   int main(int argc, char** argv)
   {
       long nfib = 30;
       long nitr = 10;
       if(argc > 1) nfib = atol(argv[1]);
       if(argc > 2) nitr = atol(argv[2]);

       for(long i = 0; i < nitr; ++i)
           run(i, nfib);

       return 0;
   }

Binary instrumentation of the ``fib`` function records **every single invocation**
of the function. For a very small function
such as ``fib``, this results in **significant** overhead because this simple function
takes about 20 instructions, whereas the entry and
exit snippets are ~1024 instructions. Therefore, you generally want to avoid
instrumenting functions that have significantly fewer
instructions than the entry and exit instrumentation. (Note that many of the
instructions in the entry and exit functions are either logging functions or
depend on the runtime settings and thus might never run.) Because of
the number of potential instructions in the entry and exit snippets,
the default behavior of ``omnitrace-instrument`` is to only instrument functions
that contain at least 1024 instructions.

However, recording every single invocation of the function can be extremely
useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
than the average or a high standard deviation. In this case, the traces help you
identify exactly when and where those instances deviated from the norm.
Compare the level of detail in the following traces. In the top image,
every instance of the ``fib`` function is instrumented, while in the bottom image,
the ``fib`` call-stack is derived via sampling.

Binary instrumentation of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../data/fibonacci-instrumented.png
   :alt: Visualization of the output of a binary instrumentation of the Fibonacci function

Statistical sampling of the Fibonacci function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../data/fibonacci-sampling.png
   :alt: Visualization of the output of a statistical sample of the Fibonacci function
137 changes: 137 additions & 0 deletions docs/conceptual/omnitrace-feature-set.rst
@@ -0,0 +1,137 @@
.. meta::
   :description: Omnitrace documentation and reference
   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD

***************************************
The Omnitrace feature set and use cases
***************************************

`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible.
Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_
to manage extensions, resources, data, and other items. It supports the following features,
modes, metrics, and APIs.

Data collection modes
========================================

* Dynamic instrumentation

  * Runtime instrumentation: Instrument executables and shared libraries at runtime
  * Binary rewriting: Generate a new executable and/or library with instrumentation built-in

* Statistical sampling: Periodic software interrupts per-thread
* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
* Causal profiling: Quantifies the potential impact of optimizations in parallel code

.. note::

   Critical trace support was removed in Omnitrace v1.11.0.
   It was replaced by the causal profiling feature.

Data analysis
========================================

* High-level summary profiles with mean, min, max, and standard deviation statistics

  * Low overhead and memory efficient
  * Ideal for running at scale

* Comprehensive traces for every individual event and measurement
* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling

Parallelism API support
========================================

* HIP
* HSA
* Pthreads
* MPI
* Kokkos-Tools (KokkosP)
* OpenMP-Tools (OMPT)

GPU metrics
========================================

* GPU hardware counters
* HIP API tracing
* HIP kernel tracing
* HSA API tracing
* HSA operation tracing
* System-level sampling (via rocm-smi)

  * Memory usage
  * Power usage
  * Temperature
  * Utilization

CPU metrics
========================================

* CPU hardware counters sampling and profiles
* CPU frequency sampling
* Various timing metrics

  * Wall time
  * CPU time (process and thread)
  * CPU utilization (process and thread)
  * User CPU time
  * Kernel CPU time

* Various memory metrics

  * High-water mark (sampling and profiles)
  * Memory page allocation
  * Virtual memory usage

* Network statistics
* I/O metrics
* Many others

Third-party API support
========================================

* TAU
* LIKWID
* Caliper
* CrayPAT
* VTune
* NVTX
* ROCTX

Omnitrace use cases
========================================

When analyzing the performance of an application, do NOT
assume you know where the performance bottlenecks are
and why they are happening. Omnitrace is a tool for analyzing the entire
application and its performance. It is
ideal for characterizing where optimization would have the greatest impact
on an end-to-end run of the application and for
viewing what else is happening on the system during a performance bottleneck.

When GPUs are involved, there is a tendency to assume that
the quickest path to performance improvement is minimizing
the runtime of the GPU kernels. This is a highly flawed assumption.
If you optimize the runtime of a kernel from 1 millisecond
to 1 microsecond (a 1000x speed-up) but the original application never
spent time waiting for kernels to complete,
there would be no statistically significant reduction in the end-to-end
runtime of your application. In other words, it does not matter
how fast or slow the code on the GPU is if the application is not
bottlenecked waiting on the GPU.

Use Omnitrace to obtain a high-level view of the entire application. Use it
to determine where the performance bottlenecks are and
obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
performance, start your investigation with Omnitrace, which characterizes the
broad picture.

.. note::

   For insight into the execution of individual kernels on the GPU,
   use `Omniperf <https://github.com/rocm/omniperf>`_.

In terms of CPU analysis, Omnitrace does not target any specific vendor.
It works just as well on AMD and non-AMD CPUs.
With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs
and kernels running on AMD GPUs.