
Add cuda::device::barrier_arrive_tx #358

Merged: 68 commits merged into NVIDIA:main on Sep 12, 2023

Conversation

@ahendriksen (Contributor) commented Aug 18, 2023

Description

Closes #357

This PR adds the `cuda::device::barrier_arrive_tx` function for shared memory barriers.
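For illustration (not part of the PR text), a minimal sketch of the intended usage pattern: one elected thread issues a transaction-based async copy and registers the expected byte count via `cuda::device::barrier_arrive_tx`, while the remaining threads arrive normally. The companion copy function is assumed to be `cuda::device::memcpy_async_tx`, which is discussed later in this thread and ships separately; exact namespaces and signatures should be checked against the libcu++ release in use. Requires a barrier in shared memory and compute capability 9.0.

```cuda
#include <cuda/barrier>       // cuda::barrier, init(), cuda::device::barrier_arrive_tx
#include <cuda/std/utility>   // cuda::std::move

// Compile with -arch=sm_90; launch with at most 1024 threads per block.
__global__ void load_tile(const int* gmem_src)   // gmem_src assumed 16-byte aligned
{
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ alignas(16) int tile[1024];
  __shared__ barrier_t bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);                      // expected count = number of threads
  }
  __syncthreads();

  barrier_t::arrival_token token;
  if (threadIdx.x == 0) {
    // Issue a transaction-based async copy that decrements the transaction
    // count by sizeof(tile) bytes when it completes ...
    cuda::device::memcpy_async_tx(tile, gmem_src,
                                  cuda::aligned_size_t<16>(sizeof(tile)), bar);
    // ... and arrive while incrementing the transaction count by the same amount.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
  } else {
    token = bar.arrive();                        // ordinary arrival on the other threads
  }

  // Unblocks once all threads have arrived *and* the copy has landed.
  bar.wait(cuda::std::move(token));

  tile[threadIdx.x] += 1;                        // tile[] now holds gmem_src[0..1023]
}
```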

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ahendriksen ahendriksen requested review from a team as code owners August 18, 2023 10:47
@ahendriksen ahendriksen requested review from ericniebler and removed request for a team August 18, 2023 10:47
@rapids-bot (bot) commented Aug 18, 2023

Pull requests from external contributors require approval from an NVIDIA organization member with write permissions or greater before CI can begin.

@ahendriksen ahendriksen requested review from alliepiper and removed request for a team August 18, 2023 10:47
@ahendriksen (Contributor, Author) commented

Please advise how to fulfill the items on the checklist.

@miscco (Collaborator) left a comment

I am no expert on `barrier`, but here are some general libcu++ comments.

@griwes (Collaborator) commented Aug 18, 2023

I am not convinced about only supporting this on sm_90+ and on shmem barriers. For a future memcpy_async_tx, we'll just do a fallback and never issue tx-based instructions for those cases, so if arrive_tx is just an arrive there (essentially discarding the tx count), that should be fine, no?

I'd like our users to be able to just write the same code everywhere, have it use hardware features where available, and do a fallback everywhere else where possible (so only trap for cases where we can't tell what the correct thing to do is, like in the barrier in cluster shmem case).

@miscco (Collaborator) commented Aug 18, 2023

I like what @griwes said: use the new hot stuff when available and fall back to the old behavior when not. Maybe we can add a warning so the user acknowledges it?

@ahendriksen (Contributor, Author) commented Aug 21, 2023

> For a future memcpy_async_tx, we'll just do a fallback and never issue tx-based instructions for those cases, so if arrive_tx is just an arrive there (essentially discarding the tx count), that should be fine, no?

Good point. The backwards-compatibility story was not on my mind (yet). I have changed the code to simply arrive on previous architectures.

I have updated the PR to take this into account and also updated the tests to (hopefully) compile with NVRTC.
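For readers following along, a hypothetical sketch (not the PR's actual code) of the kind of user-side compatibility shim discussed above, using the `NV_IF_ELSE_TARGET` macro from `<nv/target>`. It assumes the tx-count API is only usable when compiling for SM_90, as in the earlier revision of this PR; after this change the library itself does the fallback, so the shim is purely illustrative.

```cuda
#include <cstddef>        // std::ptrdiff_t
#include <cuda/barrier>
#include <nv/target>      // NV_IF_ELSE_TARGET, NV_PROVIDES_SM_90

// Hypothetical helper, not part of libcu++: the same source compiles for every
// architecture; the transaction count only has an effect where supported.
__device__ cuda::barrier<cuda::thread_scope_block>::arrival_token
arrive_maybe_tx(cuda::barrier<cuda::thread_scope_block>& bar,
                std::ptrdiff_t arrive_update, std::ptrdiff_t tx_update)
{
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (return cuda::device::barrier_arrive_tx(bar, arrive_update, tx_update);),
    ((void)tx_update;                    // no tx-count hardware support: drop it
     return bar.arrive(arrive_update);)  // plain arrive on older architectures
  );
}
```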

@ahendriksen (Contributor, Author) commented

Can somebody give the CI a push? Or is there something I can do to start it myself?

@miscco (Collaborator) commented Aug 23, 2023

/ok to test

@miscco (Collaborator) commented Aug 23, 2023

@jarmak-nv Could you have a look at why the CI is currently stalled?

@jrhemstad jrhemstad requested a review from miscco August 24, 2023 15:36
@jrhemstad (Collaborator) commented

@miscco @griwes can you give this another review?

@miscco (Collaborator) left a comment

@ahendriksen

I have updated the tests a tiny bit. The main issue is that in our current test setup we usually only launch with a single thread. That's why I have split it up into three different tests.

I have fixed one potential unused variable warning out of pure paranoia.

I have also added a feature test macro and a test for it.

@griwes, please give it a final review.
@jrhemstad @ahendriksen I have only enabled the feature test macro for SM_90. Shout if you have any objections and we should relax that to SM_70.
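For reference, a hedged sketch of how user code could guard on such a feature-test macro. The macro name below is illustrative only (check `<cuda/std/version>` in the release you build against for the exact spelling), and per the comment above it would only be defined when compiling for SM_90.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>   // cuda::std::move
#include <cuda/std/version>   // libcu++ feature-test macros

__device__ void arrive_and_wait_tx(cuda::barrier<cuda::thread_scope_block>& bar,
                                   int expected_bytes)
{
#if defined(__cccl_lib_device_barrier_arrive_tx)   // illustrative name, not the real macro
  // Feature advertised (SM_90 compilation pass): register the expected bytes.
  auto token = cuda::device::barrier_arrive_tx(bar, 1, expected_bytes);
#else
  // Feature not advertised: fall back to a plain arrive and ignore the bytes.
  (void)expected_bytes;
  auto token = bar.arrive();
#endif
  bar.wait(cuda::std::move(token));
}
```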

@ahendriksen (Contributor, Author) left a comment

Thanks for the changes @miscco. I have left some comments on the feature flag.

Review thread on libcudacxx/include/cuda/std/detail/libcxx/include/version (outdated, resolved)
@ahendriksen (Contributor, Author) left a comment

All feedback has been incorporated:

  • Docs changes have been implemented.
  • One NVRTC test has been disabled.
  • Other small changes are also done.

Context lines from the documentation diff under review:

> … thus progress the `cuda::barrier` towards the completion of the current phase. This may complete the current phase.
>
> ### Phase Completion of a `cuda::barrier` with tx-count support
@gonzalobg (Collaborator) commented Sep 12, 2023

Write "Modify [....linked section of standard....] of ISO/IEC IS 14882 (the C++ Standard) as follows:", then use quote (>) to quote the standard text, then use bold to highlight the modifications, like it is done everywhere else in the docs:

EDIT: there is no need to invent some new way of documentation; just be consistent with how libcu++ already documents these things elsewhere.

If you want to propose a new way to document these changes, open an issue first; it would then need to be changed throughout.

A collaborator left a comment

Btw, lines 19-20 of this file (https://github.com/NVIDIA/cccl/pull/358/files#diff-2f7a92f8ce801caa513c4c823f3982c625d0815166ee448df9788b6f937a46a1R19-R20) currently contradict this.

> It has the same interface and semantics as [cuda::std::barrier], with the following additional operations

I think this section needs to be moved to the top, and it has to be clarified that cuda::barrier has the same interface but not the same semantics as cuda::std::barrier.

It needs to say that:

  • if `!(scope == thread_block_scope && __isShared(this))`, then the semantics are the same as `cuda::std::barrier`,
  • otherwise, the semantics are those of "modify [....] of the C++ standard as follows:".

> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**.
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.**
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts.
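To make the proposed wording above concrete, here is a hedged sketch (not from the PR) that annotates where each of the four rules applies. It assumes `cuda::device::memcpy_async_tx` as the transaction-based copy (exact signature may differ) and reuses the same barrier across loop iterations to illustrate the phase reset in rule 4. Requires SM_90.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>   // cuda::std::move

// One tile of 1024 ints is loaded per barrier phase; gmem assumed 16-byte aligned.
__global__ void consume_tiles(const int* gmem, int num_tiles)
{
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ alignas(16) int tile[1024];
  __shared__ barrier_t bar;
  if (threadIdx.x == 0) { init(&bar, blockDim.x); }   // expected count = blockDim.x
  __syncthreads();

  for (int t = 0; t < num_tiles; ++t) {
    barrier_t::arrival_token token;
    if (threadIdx.x == 0) {
      // Rule 2: completion of this copy decrements the transaction count by sizeof(tile).
      cuda::device::memcpy_async_tx(tile, gmem + t * 1024,
                                    cuda::aligned_size_t<16>(sizeof(tile)), bar);
      // Rules 1 and 2: expected count -1, transaction count +sizeof(tile).
      token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
    } else {
      token = bar.arrive();   // Rule 1: expected count -1.
    }
    // Rule 3: unblocks only after both counts reach zero
    // (every thread has arrived and the copy has finished).
    bar.wait(cuda::std::move(token));

    // ... all threads read tile[] here ...

    // Rule 4: the expected count was reset at phase completion, so the same
    // barrier runs another full phase to fence the readers before the next copy.
    bar.arrive_and_wait();
  }
}
```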
A collaborator left a comment

Suggested change
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, and the next phase starts.

Since the transaction count is zero, it does not need to be reset.

>
> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**.
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.**
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
A collaborator left a comment

Suggested change
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, **`cuda::device::barrier_arrive_tx`**, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.

@miscco (Collaborator) commented Sep 12, 2023

@gonzalobg we decided to merge the PR as is and add a follow-up PR for the documentation fixes.

@miscco miscco merged commit cf6c417 into NVIDIA:main Sep 12, 2023
398 checks passed
@gonzalobg (Collaborator) commented Sep 12, 2023

Please follow up with @ahendriksen on the remaining documentation fixes before the 12.4 cut-off deadline, to avoid having to revert this if fundamental issues are discovered.

@miscco (Collaborator) commented Sep 12, 2023

Thanks a lot for that great contribution 🎉


Successfully merging this pull request may close these issues.

[FEA]: barrier<thread_scope_block> should have arrive_tx member function