Fixes a race in DeviceRunLengthEncode::NonTrivialRuns #399

elstehle · 2023-09-04T07:33:19Z

Description

This PR addresses #351.

WarpScanAllocations() and Scatter() alias the same shared memory allocation. Previously, some threads could reach the Scatter() stage and overwrite shared memory that other threads, still in the WarpScanAllocations() stage, had not yet read. A CTA_SYNC was added to WarpScanAllocations() so that no thread can race ahead to the Scatter() stage.

Lines that were reading potentially corrupted data:

        warp_aggregate = temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[warp_id];
        ...
        tile_aggregate = temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[0];
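To make the interleaving concrete, here is a minimal host-side C++ sketch that replays two schedules against a single aliased slot. It is a model only, not the kernel code: Replay, RacySchedule, BarrierSchedule, Event, Step and kSentinel are hypothetical names made up for this illustration, and a single int stands in for the aliased shared memory.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model; none of these names exist in CUB.
constexpr int kSentinel = -1; // value the modeled scatter stage writes into the aliased slot

enum class Step { ReadAggregate, Scatter };

struct Event
{
    int thread; // index of the modeled thread
    Step step;  // what that thread does at this point in the schedule
};

// Replays a schedule against one slot that both stages alias.
// Returns the aggregate value each thread observed.
std::vector<int> Replay(const std::vector<Event>& schedule, int num_threads, int aggregate)
{
    assert(num_threads > 0);
    int shared_slot = aggregate; // warp aggregate published before the schedule starts
    std::vector<int> observed(num_threads, 0);
    for (const Event& e : schedule)
    {
        if (e.step == Step::ReadAggregate)
        {
            observed[e.thread] = shared_slot; // read during the scan stage
        }
        else
        {
            shared_slot = kSentinel; // scatter stage repurposes the same memory
        }
    }
    return observed;
}

// Thread 0 races ahead to the scatter stage before thread 1 has read.
std::vector<Event> RacySchedule()
{
    return {{0, Step::ReadAggregate}, {0, Step::Scatter},
            {1, Step::ReadAggregate}, {1, Step::Scatter}};
}

// A block-wide barrier between the stages forces all reads before any scatter write.
std::vector<Event> BarrierSchedule()
{
    return {{0, Step::ReadAggregate}, {1, Step::ReadAggregate},
            {0, Step::Scatter}, {1, Step::Scatter}};
}
```

With an aggregate of 42, the racy schedule leaves thread 1 observing kSentinel instead of 42, while the barrier schedule lets both threads observe 42.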

This fix resolves the race condition reported by compute-sanitizer. It has passed more than 90k test runs; previously, the average failure rate was about one in 1300 runs.
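As a rough sanity check on those numbers, the sketch below estimates how likely such a streak of passes would be if the race were still present. It assumes independent runs and plugs in the approximate figures quoted above (90k runs, one failure in 1300); ProbabilityAllPass is a hypothetical helper written for this estimate.

```cpp
#include <cassert>
#include <cmath>

// Probability that `runs` consecutive runs all pass if each run still failed
// independently with probability `failure_rate` (assumption: runs are independent).
double ProbabilityAllPass(double failure_rate, double runs)
{
    assert(failure_rate >= 0.0 && failure_rate < 1.0);
    // (1 - p)^n, computed in log space to stay accurate for small p and large n
    return std::exp(runs * std::log1p(-failure_rate));
}
```

Calling ProbabilityAllPass(1.0 / 1300.0, 90000.0) gives a value on the order of 1e-30, so under these assumptions the streak is very unlikely to be luck.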

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

miscco

Great find 🥇

gevtushenko

Thank you for finding the race!

gevtushenko · 2023-09-05T23:35:26Z

cub/cub/agent/agent_rle.cuh

@@ -468,6 +468,10 @@ struct AgentRle

            tile_aggregate = scan_op(tile_aggregate, temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[WARP]);
        }
+
+        // Ensure all threads have read warp aggregates before temp_storage is repurposed in the
+        // subsequent scatter stage


Remark: the memory seems to be used by the prefix operation before it is used by the scatter stage.

elstehle · 2023-09-06 (edited)

I think the prefix and warp_aggregates use different memory within the ScanStorage compartment, whereas ScatterAliasable and ScanStorage alias one another?

        // Aliasable storage layout
        union Aliasable
        {
            struct ScanStorage
            {
                typename BlockDiscontinuityT::TempStorage discontinuity; // Smem needed for discontinuity detection
                typename WarpScanPairs::TempStorage warp_scan[WARPS];    // Smem needed for warp-synchronous scans
                Uninitialized<LengthOffsetPair[WARPS]> warp_aggregates;  // Smem needed for sharing warp-wide aggregates
                typename TilePrefixCallbackOpT::TempStorage prefix;      // Smem needed for cooperative prefix callback
            } scan_storage;

            // Smem needed for input loading
            typename BlockLoadT::TempStorage load;

            // Aliasable layout needed for two-phase scatter
            union ScatterAliasable
            {
                unsigned long long align;
                WarpExchangePairsStorage exchange_pairs[ACTIVE_EXCHANGE_WARPS];
                typename WarpExchangeOffsets::TempStorage exchange_offsets[ACTIVE_EXCHANGE_WARPS];
                typename WarpExchangeLengths::TempStorage exchange_lengths[ACTIVE_EXCHANGE_WARPS];
            } scatter_aliasable;
        } aliasable;
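The point about the layout above can be checked on a simplified host-side model. In this sketch the member sizes are arbitrary placeholders, not the sizes of the real CUB types, and ScanStorageModel, ScatterAliasableModel, AliasableModel and Overlaps are hypothetical names. It only shows the general rule: members of a struct get distinct offsets, while members of a union all start at offset 0.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; the sizes are arbitrary and do not match CUB's real types.
struct ScanStorageModel
{
    char discontinuity[16];
    char warp_scan[32];
    char warp_aggregates[8];
    char prefix[8];
};

union ScatterAliasableModel
{
    unsigned long long align;
    char exchange_pairs[64];
};

union AliasableModel
{
    ScanStorageModel scan_storage;           // starts at offset 0 of the union
    ScatterAliasableModel scatter_aliasable; // also starts at offset 0 of the union
};

// True if byte ranges [a, a + a_len) and [b, b + b_len) intersect.
constexpr bool Overlaps(std::size_t a, std::size_t a_len, std::size_t b, std::size_t b_len)
{
    return a < b + b_len && b < a + a_len;
}

constexpr std::size_t kAggOffset    = offsetof(ScanStorageModel, warp_aggregates);
constexpr std::size_t kPrefixOffset = offsetof(ScanStorageModel, prefix);

// Struct members sit side by side, so prefix does not touch warp_aggregates.
inline bool PrefixOverlapsAggregates()
{
    return Overlaps(kPrefixOffset, sizeof(ScanStorageModel::prefix),
                    kAggOffset, sizeof(ScanStorageModel::warp_aggregates));
}

// Union members share storage, so in this model the scatter bytes cover warp_aggregates.
inline bool ScatterOverlapsAggregates()
{
    return Overlaps(0, sizeof(ScatterAliasableModel),
                    kAggOffset, sizeof(ScanStorageModel::warp_aggregates));
}

inline void CheckModelLayout()
{
    assert(!PrefixOverlapsAggregates());
    assert(ScatterOverlapsAggregates());
}
```

Whether the scatter storage actually reaches the warp_aggregates bytes in the real kernel depends on the concrete type sizes, which this model does not reproduce.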

Btw., the culprit really is in the code branch of the first tile:

        if (tile_idx == 0)
        {
            // First tile
            ...
            BlockLoadT().Load();
            CTA_SYNC();
            ...
            InitializeSelections(); // using scan_storage
            WarpScanAllocations();  // using scan_storage
            prefix_op();            // using scan_storage
            Scatter();              // using scatter_aliasable

fixes a race in rle

ee14bab

elstehle requested review from a team as code owners September 4, 2023 07:33

elstehle requested review from gevtushenko and griwes and removed request for a team September 4, 2023 07:33

miscco approved these changes Sep 4, 2023

gevtushenko approved these changes Sep 5, 2023

elstehle merged commit df2990d into NVIDIA:main Sep 6, 2023
466 checks passed
