Documentation is split by domain. This file contains a general overview of these domains and how they interact.

* Overview -- this file
* Rendezvous -- creating a `gloo::Context`
* Algorithms -- index of collective algorithms and their semantics and complexity
* Transport details -- the transport API and its implementations
* CUDA integration -- integration of CUDA-aware Gloo algorithms with existing CUDA code
* Latency optimization -- a number of tips and tricks to improve performance

Gloo algorithms are collective algorithms, meaning they can run in
parallel across two or more processes/machines. To be able to execute
across multiple machines, they first need to find each other. We call
this rendezvous and it is the first thing to address when
integrating Gloo into your code base.
See [rendezvous.md](rendezvous.md) for more information.
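
A minimal sketch of what rendezvous can look like, assuming the TCP transport and a file-based store; the interface name, the store path, and the way rank and size are obtained are placeholders you would adapt to your own environment:

```cpp
#include <memory>

#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/transport/tcp/device.h"

// Placeholder rendezvous routine: every participant calls this with its
// own rank and the total number of participants.
std::shared_ptr<gloo::Context> makeContext(int rank, int size) {
  // Pick a network device for the TCP transport ("eth0" is a placeholder).
  gloo::transport::tcp::attr attr;
  attr.iface = "eth0";
  auto device = gloo::transport::tcp::CreateDevice(attr);

  // All participants rendezvous through a shared filesystem path
  // ("/tmp/gloo" is a placeholder; any store implementation works).
  gloo::rendezvous::FileStore store("/tmp/gloo");

  // Create the context and connect every pair of participants.
  auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
  context->connectFullMesh(store, device);
  return context;
}
```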
Once rendezvous completes, participating machines have set up connections to one another, either in a full mesh (every machine has a bidirectional communication channel to every other machine) or some subset. The required connectivity between machines depends on the type of algorithm that is used. For example, a ring algorithm only needs communication channels to a machine's neighbors.
Every participating process knows about the number of participating
processes, and its rank (or 0-based index) within the list of
participating processes. This state, as well as the state needed to
store the persistent communication channels, is stored in the
`gloo::Context` class. Gloo does not maintain global or thread-local
state. This means that you can set up as many contexts as needed and
introduce as much parallelism as your application requires.
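
To illustrate how per-context state is consumed, here is a sketch that assumes the ring allreduce in `gloo/allreduce_ring.h` and the hypothetical `makeContext` helper above: an algorithm instance is bound to exactly one context and reads its rank and size from it, so independent contexts can drive independent algorithm instances in parallel.

```cpp
#include <iostream>
#include <memory>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/context.h"

// Runs a small allreduce against the given context. Nothing here touches
// global state; all rank, size, and connection information comes from
// the context itself.
void runAllreduce(const std::shared_ptr<gloo::Context>& context) {
  std::vector<float> data(4, static_cast<float>(context->rank));
  std::vector<float*> ptrs{data.data()};

  // The algorithm instance is bound to this context only; other
  // contexts (and algorithms constructed against them) are independent.
  gloo::AllreduceRing<float> allreduce(
      context, ptrs, static_cast<int>(data.size()));
  allreduce.run();

  std::cout << "rank " << context->rank << " of " << context->size
            << ": result[0] = " << data[0] << std::endl;
}
```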
If you find that particular documentation is missing, please consider contributing.