Skip to content

Latest commit

 

History

History
41 lines (38 loc) · 3.18 KB

torch_nccl_environment_variables.rst

File metadata and controls

41 lines (38 loc) · 3.18 KB

PYTORCH ProcessGroupNCCL Environment Variables

For more information on the environment variables, see ProcessGroupNCCL Environment Variables.

Variable Description
TORCH_NCCL_ASYNC_ERROR_HANDLING Control how we perform Async Error Handling with NCCL when an exception is observed in watchdog. If set to 0, no handling of asynchronous NCCL errors. If set to 1, aborting NCCL communicator and tearing down process upon error. If set to 2, only abort NCCL communicator and if set to 3, tearing down process without aborting NCCL communicator. By default, it is set to 3.
TORCH_NCCL_HIGH_PRIORITY Control whether to use high priority stream for the NCCL communicator.
TORCH_NCCL_BLOCKING_WAIT Control whether or not wait() is blocking or non-blocking.
TORCH_NCCL_DUMP_ON_TIMEOUT Control whether dumping debug info on watchdog timeout or exception is detected. This variable must be set together with TORCH_NCCL_TRACE_BUFFER_SIZE larger than 0.
TORCH_NCCL_DESYNC_DEBUG Control whether Desync Debug is enabled. This is helpful in figuring out the culprit rank of collective desync.
TORCH_NCCL_ENABLE_TIMING If set to 1, enable recording start-events for all ProcessGroupNCCL collectives, and compute accurate collective timing per-collective.
TORCH_NCCL_ENABLE_MONITORING If set to 1,enable monitoring thread which aborts the process when the ProcessGroupNCCL Watchdog thread gets stuck and no heartbeat is detected after TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC. This can happen due to calling CUDA/NCCL APIs that may hang. It is Useful to prevent jobs being stuck for a prolonged time than necessary tying up cluster resources.
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC Control the watchdog heartbeat timeout period after which the monitoring thread will abort the process.
TORCH_NCCL_TRACE_BUFFER_SIZE The maximum number of events we store in the flight recorder's ring buffer. One event could be the start or end of a collective, for example. Set to 0 to disable the tracebuffer and debugging info dump.
TORCH_NCCL_TRACE_CPP_STACK Whether to collect cpp stack traces for flight recorder. Default value is False.
TORCH_NCCL_COORD_CHECK_MILSEC Control the interval inside the monitoring thread to check the coordinated signal from other ranks, e.g. to dump the debugging information. Default value is 1000 ms.
TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC Control how much extra time we will wait for dumping the debugging info before we exit and throws timeout exception.
TORCH_NCCL_DEBUG_INFO_TEMP_FILE The file into which the debugging info would be dumped.
TORCH_NCCL_DEBUG_INFO_PIPE_FILE The pipe file to trigger debugging dump manually, write anything into the pipe would trigger the dump.
TORCH_NCCL_NAN_CHECK Control whether to enable NAN check for the input, Error would be thrown if NAN is detected.