
Add cuda::device::barrier_arrive_tx #358

Merged: 68 commits merged into NVIDIA:main on Sep 12, 2023

Conversation

@ahendriksen (Contributor) commented Aug 18, 2023

Description

Closes #357

This PR adds the `cuda::device::barrier_arrive_tx` function for shared memory barriers.
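For illustration (not part of the PR text), a minimal sketch of the intended usage pattern: one elected thread issues a transaction-based async copy and registers the expected byte count via `cuda::device::barrier_arrive_tx`, while the remaining threads arrive normally. The companion copy function is assumed to be `cuda::device::memcpy_async_tx`, which is discussed later in this thread and ships separately; exact namespaces and signatures should be checked against the libcu++ release in use. Requires a barrier in shared memory and compute capability 9.0.

```cuda
#include <cuda/barrier>       // cuda::barrier, init(), cuda::device::barrier_arrive_tx
#include <cuda/std/utility>   // cuda::std::move

// Compile with -arch=sm_90; launch with at most 1024 threads per block.
__global__ void load_tile(const int* gmem_src)   // gmem_src assumed 16-byte aligned
{
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ alignas(16) int tile[1024];
  __shared__ barrier_t bar;

  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);                      // expected count = number of threads
  }
  __syncthreads();

  barrier_t::arrival_token token;
  if (threadIdx.x == 0) {
    // Issue a transaction-based async copy that decrements the transaction
    // count by sizeof(tile) bytes when it completes ...
    cuda::device::memcpy_async_tx(tile, gmem_src,
                                  cuda::aligned_size_t<16>(sizeof(tile)), bar);
    // ... and arrive while incrementing the transaction count by the same amount.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
  } else {
    token = bar.arrive();                        // ordinary arrival on the other threads
  }

  // Unblocks once all threads have arrived *and* the copy has landed.
  bar.wait(cuda::std::move(token));

  tile[threadIdx.x] += 1;                        // tile[] now holds gmem_src[0..1023]
}
```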

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@ahendriksen ahendriksen requested review from a team as code owners August 18, 2023 10:47
@ahendriksen ahendriksen requested review from ericniebler and removed request for a team August 18, 2023 10:47
@rapids-bot (bot) commented Aug 18, 2023

Pull requests from external contributors require approval from an NVIDIA organization member with write permissions or greater before CI can begin.

@ahendriksen ahendriksen requested review from alliepiper and removed request for a team August 18, 2023 10:47
@ahendriksen (Contributor, Author) commented

Please advise how to fulfill the items on the checklist.

@miscco (Collaborator) left a comment

I am no expert on `barrier`, but here are some general libcu++ comments.

@griwes (Collaborator) commented Aug 18, 2023

I am not convinced about only supporting this on sm_90+ and on shmem barriers. For a future memcpy_async_tx, we'll just do a fallback and never issue tx-based instructions for those cases, so if arrive_tx is just an arrive there (essentially discarding the tx count), that should be fine, no?

I'd like our users to be able to just write the same code everywhere, have it use hardware features where available, and do a fallback everywhere else where possible (so only trap for cases where we can't tell what the correct thing to do is, like in the barrier in cluster shmem case).

@miscco (Collaborator) commented Aug 18, 2023

I like what @griwes said: use the new hot stuff when available and fall back to the old behavior when not. Maybe we can add a warning so the user acknowledges it?

@ahendriksen (Contributor, Author) commented Aug 21, 2023

> For a future memcpy_async_tx, we'll just do a fallback and never issue tx-based instructions for those cases, so if arrive_tx is just an arrive there (essentially discarding the tx count), that should be fine, no?

Good point. The backwards-compatibility story was not on my mind (yet). I have changed the code to simply arrive on previous architectures.

I have updated the PR to take this into account and also updated the tests to (hopefully) compile with NVRTC.
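For readers following along, a hypothetical sketch (not the PR's actual code) of the kind of user-side compatibility shim discussed above, using the `NV_IF_ELSE_TARGET` macro from `<nv/target>`. It assumes the tx-count API is only usable when compiling for SM_90, as in the earlier revision of this PR; after this change the library itself does the fallback, so the shim is purely illustrative.

```cuda
#include <cstddef>        // std::ptrdiff_t
#include <cuda/barrier>
#include <nv/target>      // NV_IF_ELSE_TARGET, NV_PROVIDES_SM_90

// Hypothetical helper, not part of libcu++: the same source compiles for every
// architecture; the transaction count only has an effect where supported.
__device__ cuda::barrier<cuda::thread_scope_block>::arrival_token
arrive_maybe_tx(cuda::barrier<cuda::thread_scope_block>& bar,
                std::ptrdiff_t arrive_update, std::ptrdiff_t tx_update)
{
  NV_IF_ELSE_TARGET(
    NV_PROVIDES_SM_90,
    (return cuda::device::barrier_arrive_tx(bar, arrive_update, tx_update);),
    ((void)tx_update;                    // no tx-count hardware support: drop it
     return bar.arrive(arrive_update);)  // plain arrive on older architectures
  );
}
```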

@ahendriksen (Contributor, Author) commented

Can somebody give the CI a push? Or is there something I can do to start it myself?

@miscco (Collaborator) commented Aug 23, 2023

/ok to test

@miscco (Collaborator) commented Aug 23, 2023

@jarmak-nv Could you have a look at why the CI is currently stalled?

@jrhemstad jrhemstad requested a review from miscco August 24, 2023 15:36
@jrhemstad (Collaborator) commented

@miscco @griwes can you give this another review?

@miscco (Collaborator) left a comment

@ahendriksen

I have updated the tests a tiny bit. The main issue is that in our current test setup we usually only launch with a single thread. That's why I have split it up into three different tests.

I have fixed one potential unused variable warning out of pure paranoia.

I have also added a feature test macro and a test for it.

@griwes, please give it a final review.
@jrhemstad @ahendriksen I have only enabled the feature test macro for SM_90. Shout if you have any objections and we should relax that to SM_70.
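For reference, a hedged sketch of how user code could guard on such a feature-test macro. The macro name below is illustrative only (check `<cuda/std/version>` in the release you build against for the exact spelling), and per the comment above it would only be defined when compiling for SM_90.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>   // cuda::std::move
#include <cuda/std/version>   // libcu++ feature-test macros

__device__ void arrive_and_wait_tx(cuda::barrier<cuda::thread_scope_block>& bar,
                                   int expected_bytes)
{
#if defined(__cccl_lib_device_barrier_arrive_tx)   // illustrative name, not the real macro
  // Feature advertised (SM_90 compilation pass): register the expected bytes.
  auto token = cuda::device::barrier_arrive_tx(bar, 1, expected_bytes);
#else
  // Feature not advertised: fall back to a plain arrive and ignore the bytes.
  (void)expected_bytes;
  auto token = bar.arrive();
#endif
  bar.wait(cuda::std::move(token));
}
```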

@ahendriksen (Contributor, Author) left a comment

Thanks for the changes @miscco. I have left some comments on the feature flag.

Review thread on libcudacxx/include/cuda/std/detail/libcxx/include/version (outdated, resolved)
@ahendriksen (Contributor, Author) left a comment

All feedback has been incorporated:

  • Docs changes have been implemented.
  • One NVRTC test has been disabled.
  • Other small changes are also done.

Context lines from the documentation diff under review:

> … thus progress the `cuda::barrier` towards the completion of the current phase. This may complete the current phase.
>
> ### Phase Completion of a `cuda::barrier` with tx-count support
@gonzalobg (Collaborator) commented Sep 12, 2023

Write "Modify [....linked section of standard....] of ISO/IEC IS 14882 (the C++ Standard) as follows:", then use quote (>) to quote the standard text, then use bold to highlight the modifications, like it is done everywhere else in the docs:

EDIT: there is no need to invent some new way of documentation; just be consistent with how libcu++ already documents these things elsewhere.

If you want to propose a new way to document these changes, open an issue first; it would then need to be changed throughout.

A collaborator left a comment

Btw, lines 19-20 of this file (https://github.com/NVIDIA/cccl/pull/358/files#diff-2f7a92f8ce801caa513c4c823f3982c625d0815166ee448df9788b6f937a46a1R19-R20) currently contradict this.

> It has the same interface and semantics as [cuda::std::barrier], with the following additional operations

I think this section needs to be moved to the top, and it has to be clarified that cuda::barrier has the same interface but not the same semantics as cuda::std::barrier.

It needs to say that:

  • if `!(scope == thread_block_scope && __isShared(this))`, then the semantics are the same as `cuda::std::barrier`,
  • otherwise, the semantics are those of "modify [....] of the C++ standard as follows:".

> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**.
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.**
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts.
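To make the proposed wording above concrete, here is a hedged sketch (not from the PR) that annotates where each of the four rules applies. It assumes `cuda::device::memcpy_async_tx` as the transaction-based copy (exact signature may differ) and reuses the same barrier across loop iterations to illustrate the phase reset in rule 4. Requires SM_90.

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>   // cuda::std::move

// One tile of 1024 ints is loaded per barrier phase; gmem assumed 16-byte aligned.
__global__ void consume_tiles(const int* gmem, int num_tiles)
{
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__ alignas(16) int tile[1024];
  __shared__ barrier_t bar;
  if (threadIdx.x == 0) { init(&bar, blockDim.x); }   // expected count = blockDim.x
  __syncthreads();

  for (int t = 0; t < num_tiles; ++t) {
    barrier_t::arrival_token token;
    if (threadIdx.x == 0) {
      // Rule 2: completion of this copy decrements the transaction count by sizeof(tile).
      cuda::device::memcpy_async_tx(tile, gmem + t * 1024,
                                    cuda::aligned_size_t<16>(sizeof(tile)), bar);
      // Rules 1 and 2: expected count -1, transaction count +sizeof(tile).
      token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(tile));
    } else {
      token = bar.arrive();   // Rule 1: expected count -1.
    }
    // Rule 3: unblocks only after both counts reach zero
    // (every thread has arrived and the copy has finished).
    bar.wait(cuda::std::move(token));

    // ... all threads read tile[] here ...

    // Rule 4: the expected count was reset at phase completion, so the same
    // barrier runs another full phase to fence the readers before the next copy.
    bar.arrive_and_wait();
  }
}
```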
A collaborator left a comment

Suggested change
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, **the _transaction count_ is reset to zero,** and the next phase starts.
> 4. When the completion step finishes, the _expected count_ is reset to what was specified by the `expected` argument to the constructor, possibly adjusted by calls to `arrive_and_drop`, and the next phase starts.

Since the transaction count is zero, it does not need to be reset.

>
> 1. The _expected count_ is decremented by each call to `arrive`,`arrive_and_drop`**, or `cuda::device::barrier_arrive_tx`**.
> 2. **The _transaction count_ is incremented by each call to `cuda::device::barrier_arrive_tx` and decremented by the completion of transaction-based asynchronous operations such as `cuda::memcpy_async_tx`.**
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
A collaborator left a comment

Suggested change
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.
> 3. Exactly once after **both** the _expected count_ **and the _transaction count_** reach zero, a thread executes the _completion step_ during its call to `arrive`, `arrive_and_drop`, **`cuda::device::barrier_arrive_tx`**, or `wait`, except that it is implementation-defined whether the step executes if no thread calls `wait`.

@miscco (Collaborator) commented Sep 12, 2023

@gonzalobg we decided to merge the PR as is and add a follow-up PR for the documentation fixes.

@miscco miscco merged commit cf6c417 into NVIDIA:main Sep 12, 2023
398 checks passed
@gonzalobg (Collaborator) commented Sep 12, 2023

Please follow up with @ahendriksen on the remaining documentation fixes before the 12.4 cut-off deadline, to avoid having to revert this if fundamental issues are discovered.

@miscco (Collaborator) commented Sep 12, 2023

Thanks a lot for that great contribution 🎉


Successfully merging this pull request may close these issues.

[FEA]: barrier<thread_scope_block> should have arrive_tx member function