Fixes a race in DeviceRunLengthEncode::NonTrivialRuns #399

Merged
merged 1 commit into from
Sep 6, 2023


@elstehle elstehle commented Sep 4, 2023

Description

This PR addresses #351.

WarpScanAllocations() and Scatter() alias (repurpose) the same shared memory allocation. Previously, some threads could reach the Scatter() stage and overwrite shared memory that other threads, still in the WarpScanAllocations() stage, had yet to read. This PR adds a CTA_SYNC to WarpScanAllocations() so that no thread can race ahead to the Scatter() stage.

Lines that were reading potentially corrupted data:

        warp_aggregate = temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[warp_id];
        ...
        tile_aggregate = temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[0];
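
The hazard described above can be illustrated with a small host-side replay. This is only a sketch under stated assumptions: it is plain C++ rather than device code, it replays one fixed interleaving of two "threads" sequentially instead of running anything concurrently, and `SlowThreadReads`, `storage`, and the values 42, 7, and -1 are hypothetical names and data chosen for illustration, not part of CUB.

```cpp
#include <cassert>

// Host-side replay of one fixed interleaving of two "threads". This is a
// hypothetical model, not CUB code: `storage` stands in for the aliased shared
// memory, viewed once as warp aggregates and once as scatter exchange space.
int SlowThreadReads(bool with_barrier)
{
    int storage[2] = {42, 7};        // aggregates published by the scan stage
    int* warp_aggregates = storage;  // view used in the scan stage
    int* exchange        = storage;  // view used in the scatter stage (same memory)

    int fast_value = warp_aggregates[0]; // fast thread reads the aggregate
    int slow_value = 0;

    if (with_barrier)
    {
        // With a barrier, the slow thread completes its read first ...
        slow_value  = warp_aggregates[0];
        // ... and only afterwards may the fast thread start scattering.
        exchange[0] = -1;
    }
    else
    {
        // Without a barrier, the fast thread may scatter first ...
        exchange[0] = -1;
        // ... so the slow thread reads the overwritten value.
        slow_value  = warp_aggregates[0];
    }

    assert(fast_value == 42); // the fast thread always saw the intended value
    return slow_value;
}
```

In this model, `SlowThreadReads(true)` returns the intended aggregate, while `SlowThreadReads(false)` returns the value written by the scatter stage. On a real GPU the bad interleaving only occurs occasionally, which is consistent with the intermittent failures reported here.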

This fix resolves the race condition reported by compute-sanitizer and has passed more than 90k test runs (previously, the average failure rate was about one in ~1300 runs).
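
As a rough sanity check on the test numbers, one can estimate how likely that many consecutive passes would be if the old failure rate still applied. The sketch below assumes independent runs, takes "more than 90k" as exactly 90000 runs and "~1300" as exactly 1300; `ProbAllPass` is a hypothetical helper, not part of the test suite.

```cpp
#include <cmath>

// Probability of observing zero failures in `runs` independent runs when each
// run fails with probability `failure_rate`: (1 - failure_rate)^runs.
// log1p keeps the computation accurate for small failure rates.
double ProbAllPass(double failure_rate, double runs)
{
    return std::exp(runs * std::log1p(-failure_rate));
}
```

Under these assumptions, `ProbAllPass(1.0 / 1300.0, 90000.0)` is on the order of 1e-30, so the run of passes would be extremely unlikely if the race were still present at its previous rate.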

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested review from a team as code owners September 4, 2023 07:33
@elstehle elstehle requested review from gevtushenko and griwes and removed request for a team September 4, 2023 07:33

@miscco miscco left a comment

Great find 🥇


@gevtushenko gevtushenko left a comment

Thank you for finding the race!

@@ -468,6 +468,10 @@ struct AgentRle

tile_aggregate = scan_op(tile_aggregate, temp_storage.aliasable.scan_storage.warp_aggregates.Alias()[WARP]);
}

// Ensure all threads have read warp aggregates before temp_storage is repurposed in the
// subsequent scatter stage

Remark: the memory seems to be used by the prefix operation before it is used in the scatter stage.


@elstehle elstehle Sep 6, 2023

I think the prefix and warp_aggregates use different memory within the ScanStorage compartment, whereas the ScatterAliasable and the ScanStorage alias one another?

        // Aliasable storage layout
        union Aliasable
        {
            struct ScanStorage
            {
                typename BlockDiscontinuityT::TempStorage       discontinuity;              // Smem needed for discontinuity detection
                typename WarpScanPairs::TempStorage             warp_scan[WARPS];           // Smem needed for warp-synchronous scans
                Uninitialized<LengthOffsetPair[WARPS]>          warp_aggregates;            // Smem needed for sharing warp-wide aggregates
                typename TilePrefixCallbackOpT::TempStorage     prefix;                     // Smem needed for cooperative prefix callback
            } scan_storage;

            // Smem needed for input loading
            typename BlockLoadT::TempStorage                    load;

            // Aliasable layout needed for two-phase scatter
            union ScatterAliasable
            {
                unsigned long long                              align;
                WarpExchangePairsStorage                        exchange_pairs[ACTIVE_EXCHANGE_WARPS];
                typename WarpExchangeOffsets::TempStorage       exchange_offsets[ACTIVE_EXCHANGE_WARPS];
                typename WarpExchangeLengths::TempStorage       exchange_lengths[ACTIVE_EXCHANGE_WARPS];
            } scatter_aliasable;

        } aliasable;

Btw., the culprit really is in the code branch of the first tile:

        if (tile_idx == 0)
        {
            // First tile
            ...
            BlockLoadT().Load();
            CTA_SYNC();
            ...
            InitializeSelections(); // using scan_storage
            WarpScanAllocations(); // using scan_storage
            prefix_op(); // using scan_storage
            Scatter(); // using scatter_aliasable
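
The layout argument in this thread can be checked on the host with a simplified model. The sketch below is an assumption-laden stand-in, not the real AgentRle storage: `LengthOffsetPair` is reduced to two ints, the array sizes 4, 8, and 16 are made up, and `ScanMembersAreDisjoint` and `ScanAndScatterOverlap` are hypothetical helpers. It only shows the general C++ rule that members of a struct get distinct offsets while members of a union share the same starting address.

```cpp
#include <cassert>
#include <cstddef>

struct LengthOffsetPair { int offset; int length; }; // hypothetical stand-in

// Simplified model of the aliasable layout (sizes are made up)
union Aliasable
{
    struct ScanStorage
    {
        LengthOffsetPair warp_aggregates[4]; // stand-in for the warp-wide aggregates
        int              prefix[8];          // stand-in for the prefix callback storage
    } scan_storage;

    union ScatterAliasable
    {
        unsigned long long align;
        LengthOffsetPair   exchange_pairs[16]; // stand-in for the exchange storage
    } scatter_aliasable;
};

// Struct members are laid out one after another, so prefix and
// warp_aggregates never share bytes.
bool ScanMembersAreDisjoint()
{
    return offsetof(Aliasable::ScanStorage, prefix) >=
           offsetof(Aliasable::ScanStorage, warp_aggregates) + sizeof(LengthOffsetPair[4]);
}

// Union members start at the same address, so the scatter storage
// overlaps the scan storage.
bool ScanAndScatterOverlap()
{
    return offsetof(Aliasable, scan_storage) == offsetof(Aliasable, scatter_aliasable);
}
```

In this model both helpers return true: using prefix after warp_aggregates is harmless, whereas any write through scatter_aliasable can clobber warp_aggregates unless all reads have completed first.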

@elstehle elstehle merged commit df2990d into NVIDIA:main Sep 6, 2023
466 checks passed
3 participants