Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential performance improvements for reduce-then-scan algorithm #1792

Open
danhoeflinger opened this issue Aug 22, 2024 · 0 comments
Open
Assignees

Comments

@danhoeflinger
Copy link
Contributor

This issue represents potential performance improvements for the reduce-then-scan algorithm, which have come out of discussions about the PRs:
#1762, #1763, #1764, #1765

  1. Upgrade single workgroup scan (and copy_if) with ideas from reduce-then-scan. We lowered our threshold of when to select single workgroup scan because it does not perform as well as reduce-then-scan. However, there is no reason we should not be able to do better in a single kernel if we fit by use similar strategies to reduce-then-scan. Also, we should be able to relax our requirements for known identity, and open up single workgroup scan to more input types / operations.

  2. For Make Unique family of APIs use reduce_then_scan #1765,
    a) For each input element, we load the element and the previous element from the input sequence to compare for uniqueness. This amounts to two dereferences from the input sequence (global memory). It may benefit us to load first to SLM, then read from SLM during the input processing. It must be studied to see if this is handled implicitly by the caching system or if this would be a benefit. We would need special infrastructure in the kernel to handle kernels which would benefit from the SLM load, as the other families should not benefit from this change.
    b) We shift the scan by 1 and handle the 0th element specially. This may cause issues in storing to unaligned output. We should investigate if we can do better by masking the 0th overall element, but it may be difficult to avoid additional branches when doing this.

  3. For Add reduce then scan algorithm for transform scan family #1762, we currently work around the in-place exclusive scan by copying it to another buffer, then doing an out of place exclusive scan. This is only required for multi-pass scan, not single_wg scan and not reduce-then-scan algorithms. We should push this workaround lower into the code, and only apply it for algorithms which require it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant