Implement cub::DeviceFind::FindIf #2405

Open
wants to merge 13 commits into
base: main

Conversation

gonidelis
Member

@gonidelis gonidelis commented Sep 11, 2024

This is a draft to track the work progress on cub::DeviceFind::FindIf which should ultimately be used to improve thrust::all_of.
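
For context on why a find-if primitive helps all_of: an all-of query is equivalent to searching for the first element that fails the predicate and checking that none exists. The following is a minimal host-side sketch of that relationship using only the C++ standard library; `all_of_via_find_if` is a hypothetical helper name and not part of Thrust or CUB.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: all_of expressed as "no element fails the predicate".
template <class It, class Pred>
bool all_of_via_find_if(It first, It last, Pred pred)
{
  // Search for the first element that does NOT satisfy pred.
  auto it = std::find_if(first, last, [&](const auto& x) { return !pred(x); });
  // If the search reached the end, every element satisfied pred.
  return it == last;
}
```

Because the search can stop at the first failing element, an early-exit find-if does less work than a full reduction whenever a counterexample appears early in the range.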

Algorithm Description

The kernel is based on early cancellation through a global atomic result that doubles as a flag. Its default value is input.size(); once a block finds a match, it overwrites the result with an atomic min, and the other CTAs, which read the result at the start of each iteration, see the update and skip any further iterations:
    if (threadIdx.x == 0)
    {
      sresult = atomicAdd(result, 0);
    }
    __syncthreads();

    // early exit
    if (sresult < tile_offset)
    {
      return;
    }

For that to happen, every block collects a local minimum index, block_result, whenever the predicate is satisfied:

    if (index < num_items)
    {
      if (pred(input_items[i]))
      {
        found = true;
        atomicMin(&block_result, index);
        break; // Every thread goes over multiple elements per tile.
               // If a thread finds a local minimum it doesn't
               // need to proceed further (inner early exit).
      }
    }

At the end of the loop body, the first thread of every block atomically min-compares its block_result against the global minimum index and writes it if it is smaller:

    if (__syncthreads_or(found))
    {
      if (threadIdx.x == 0)
      {
        if (block_result < num_items)
        {
          atomicMin(result, block_result);
        }
      }
    }

It's an algorithm that heavily favors cases where the "to-be-found" elements are ubiquitous.
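
To make the control flow above easier to follow, here is a rough single-threaded host emulation of the scheme. It is only a sketch under stated assumptions: `find_if_emulated`, `atomic_min`, `num_blocks`, `tile_size` and `tiles_scanned` are hypothetical names, the "blocks" are visited one after another instead of running concurrently, and `std::atomic` with a compare-exchange loop stands in for CUDA's `atomicMin`. It is not the kernel from this PR.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's atomicMin: lower `target` to `value` if smaller.
inline void atomic_min(std::atomic<std::size_t>& target, std::size_t value)
{
  std::size_t current = target.load();
  // compare_exchange_weak reloads `current` on failure, so the loop re-checks.
  while (value < current && !target.compare_exchange_weak(current, value))
  {
  }
}

// Hypothetical emulation: `num_blocks` blocks stride over tiles of `tile_size`
// items. Returns the index of the first match, or input.size() if none.
template <class T, class Pred>
std::size_t find_if_emulated(const std::vector<T>& input, Pred pred,
                             std::size_t num_blocks, std::size_t tile_size,
                             std::size_t* tiles_scanned = nullptr)
{
  const std::size_t num_items = input.size();
  std::atomic<std::size_t> result{num_items}; // default value means "not found"
  std::size_t scanned = 0;

  // Outer loop: one grid-wide iteration. Inner loop: each block in turn.
  for (std::size_t base = 0; base < num_items; base += num_blocks * tile_size)
  {
    for (std::size_t block = 0; block < num_blocks; ++block)
    {
      const std::size_t tile_offset = base + block * tile_size;
      if (tile_offset >= num_items) break;

      // Read the global result (the kernel does this with atomicAdd(result, 0)).
      const std::size_t sresult = result.load();
      if (sresult < tile_offset) continue; // early exit: a smaller index is known

      ++scanned;
      std::size_t block_result = num_items;
      const std::size_t tile_end = std::min(tile_offset + tile_size, num_items);
      for (std::size_t index = tile_offset; index < tile_end; ++index)
      {
        if (pred(input[index]))
        {
          block_result = index; // first match in the tile is its minimum
          break;                // inner early exit
        }
      }
      // Publish the block's minimum to the global result.
      if (block_result < num_items) atomic_min(result, block_result);
    }
  }
  if (tiles_scanned) *tiles_scanned = scanned;
  return result.load();
}
```

Because this emulation is sequential, a later block always sees the published result immediately. On a GPU the blocks run concurrently, so a block may scan a tile that turns out to be unnecessary before it observes the update.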

Notes:

  • To achieve proper synchronization, every block reads a local copy of the global result (which may already have been updated by other blocks in previous iterations) into the shared-memory variable sresult and then checks whether it should exit.
    • If the other blocks didn't get to update the global result early enough, then, because the global early-exit check sits at the beginning of the loop, the worst case is that a block does one extra redundant iteration before exiting.
  • block_result should not be confused with sresult. Although both are shared-memory variables that hold a minimum index, the former is used to write to the global atomic, while the latter is used to read from it.
  • The algorithm uses vectorized loads for small input types (a split implementation for large input types is to be added).

@gonidelis
Member Author

gonidelis commented Sep 11, 2024

# Benchmark Results

## cub::FindIf

### [0] NVIDIA H200

|  T  |     Elements     | RelativeMismatchPosition | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 |     2^16 = 65536 |                        1 |  22928x |  26.790 us | 22.92% |  21.814 us | 2.06% |
| I32 |   2^20 = 1048576 |                        1 |  19632x |  31.313 us | 23.28% |  25.470 us | 3.40% |
| I32 |  2^24 = 16777216 |                        1 |  10800x |  51.443 us | 11.07% |  46.351 us | 1.28% |
| I32 | 2^28 = 268435456 |                        1 |   1074x | 471.441 us |  1.32% | 465.567 us | 0.36% |
| I32 |     2^16 = 65536 |                      0.5 |  23024x |  26.862 us | 23.81% |  21.717 us | 2.03% |
| I32 |   2^20 = 1048576 |                      0.5 |  19792x |  31.141 us | 23.58% |  25.273 us | 3.86% |
| I32 |  2^24 = 16777216 |                      0.5 |  17152x |  34.301 us | 17.73% |  29.160 us | 1.59% |
| I32 | 2^28 = 268435456 |                      0.5 |   2064x | 247.413 us |  2.15% | 242.283 us | 0.35% |
| I32 |     2^16 = 65536 |                        0 |  54416x |  14.286 us | 55.68% |   9.189 us | 3.01% |
| I32 |   2^20 = 1048576 |                        0 |  52288x |  14.649 us | 53.34% |   9.565 us | 3.21% |
| I32 |  2^24 = 16777216 |                        0 |  33568x |  19.990 us | 34.29% |  14.901 us | 1.96% |
| I32 | 2^28 = 268435456 |                        0 |  33632x |  20.006 us | 34.67% |  14.874 us | 2.54% |

## thrust::count_if

### [0] NVIDIA H200

|  T  |     Elements     | RelativeMismatchPosition | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |
|-----|------------------|--------------------------|---------|------------|--------|------------|-------|
| I32 |     2^16 = 65536 |                        1 |  24528x |  24.829 us | 22.00% |  20.390 us | 2.94% |
| I32 |   2^20 = 1048576 |                        1 |  22480x |  26.721 us | 20.33% |  22.244 us | 2.62% |
| I32 |  2^24 = 16777216 |                        1 |  11856x |  46.720 us | 10.84% |  42.222 us | 1.89% |
| I32 | 2^28 = 268435456 |                        1 |   1855x | 274.137 us |  1.77% | 269.582 us | 0.49% |
| I32 |     2^16 = 65536 |                      0.5 |  24176x |  25.121 us | 21.73% |  20.685 us | 3.23% |
| I32 |   2^20 = 1048576 |                      0.5 |  22048x |  27.172 us | 19.97% |  22.689 us | 2.65% |
| I32 |  2^24 = 16777216 |                      0.5 |  11744x |  47.062 us | 10.71% |  42.587 us | 1.97% |
| I32 | 2^28 = 268435456 |                      0.5 |   1859x | 273.367 us |  1.69% | 268.966 us | 0.41% |
| I32 |     2^16 = 65536 |                        0 |  24160x |  25.209 us | 22.02% |  20.707 us | 3.19% |
| I32 |   2^20 = 1048576 |                        0 |  22144x |  27.092 us | 20.12% |  22.594 us | 2.75% |
| I32 |  2^24 = 16777216 |                        0 |  11744x |  47.111 us | 10.85% |  42.584 us | 2.02% |
| I32 | 2^28 = 268435456 |                        0 |   1862x | 273.046 us |  1.70% | 268.609 us | 0.38% |

@bernhardmgruber
Contributor

bernhardmgruber commented Sep 11, 2024

> until we make sure that it performs better than reduce.

I would love to see a benchmark comparison of thrust::all_of before and after your improvements at some point!

> Initial bench results are encouraging.

Do I understand correctly that cub::FindIf should outperform thrust::count_if? I am seeing the opposite in some of the numbers you posted.
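
The comparison can be checked row by row from the GPU times in the two H200 tables posted above. The sketch below copies those GPU times and computes the ratio count_if / FindIf, so a value above 1 means cub::FindIf was faster; `Row`, `speedup`, `count_find_if_wins` and `print_speedups` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

struct Row
{
  int log2_elements;  // input size is 2^log2_elements
  double position;    // RelativeMismatchPosition
  double find_if_us;  // cub::FindIf GPU time (us), from the table above
  double count_if_us; // thrust::count_if GPU time (us), from the table above
};

static const Row rows[] = {
  {16, 1.0, 21.814, 20.390},  {20, 1.0, 25.470, 22.244},
  {24, 1.0, 46.351, 42.222},  {28, 1.0, 465.567, 269.582},
  {16, 0.5, 21.717, 20.685},  {20, 0.5, 25.273, 22.689},
  {24, 0.5, 29.160, 42.587},  {28, 0.5, 242.283, 268.966},
  {16, 0.0, 9.189, 20.707},   {20, 0.0, 9.565, 22.594},
  {24, 0.0, 14.901, 42.584},  {28, 0.0, 14.874, 268.609},
};
static const std::size_t num_rows = sizeof(rows) / sizeof(rows[0]);

// Ratio > 1 means FindIf took less GPU time than count_if for that row.
inline double speedup(const Row& r)
{
  return r.count_if_us / r.find_if_us;
}

// Number of rows in which FindIf was the faster of the two.
inline std::size_t count_find_if_wins()
{
  std::size_t wins = 0;
  for (std::size_t i = 0; i < num_rows; ++i)
  {
    if (speedup(rows[i]) > 1.0) ++wins;
  }
  return wins;
}

// Print one line per row: size, position and the count_if / FindIf ratio.
inline void print_speedups()
{
  for (std::size_t i = 0; i < num_rows; ++i)
  {
    std::printf("2^%d  pos=%.1f  ratio=%.2f\n", rows[i].log2_elements,
                rows[i].position, speedup(rows[i]));
  }
}
```

On these numbers, FindIf is slower in all four rows with position 1 and in the 2^16 and 2^20 rows with position 0.5, and faster in the remaining six rows.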

@jrhemstad jrhemstad changed the title Imlement cub::DeviceFind::FindIf Implement cub::DeviceFind::FindIf Sep 11, 2024
@bernhardmgruber
Contributor

The name CommonPrefixRatio comes from the old thrust::equal benchmark, where it denoted the fraction of equal elements at the beginning of both ranges. A CommonPrefixRatio of 1 meant all elements were the same in both ranges. A value of 0.5 meant that the first 50% of the range was equal and the mismatch occurred after that.

For your benchmark you could name this value RelativeMismatchPosition, or MismatchAt, or something like that. It's about where you expect the mismatch to happen.
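
As an illustration of what such a parameter could mean for a find-if benchmark, here is a minimal host-side sketch. How the actual benchmark generates its data is not shown in this thread, so `make_input` and `first_match` are hypothetical helpers: the input is all zeros except for a single 1 placed at `num_items * relative_mismatch_position`, and a position of 1 means no element matches at all.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical generator: zeros everywhere, with a single 1 at the chosen
// relative position. A position of 1.0 produces no match at all.
inline std::vector<int> make_input(std::size_t num_items, double relative_mismatch_position)
{
  std::vector<int> input(num_items, 0);
  const auto mismatch_at =
    static_cast<std::size_t>(static_cast<double>(num_items) * relative_mismatch_position);
  if (mismatch_at < num_items)
  {
    input[mismatch_at] = 1;
  }
  return input;
}

// Index of the first element equal to 1, or input.size() if there is none.
inline std::size_t first_match(const std::vector<int>& input)
{
  auto it = std::find_if(input.begin(), input.end(), [](int x) { return x == 1; });
  return static_cast<std::size_t>(it - input.begin());
}
```

With 1000 items, a position of 0 puts the match at index 0, 0.5 puts it at index 500, and 1 leaves the input without a match, so the whole range has to be scanned.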

@gonidelis
Member Author

@bernhardmgruber Because of the early exit, cub::FindIf is expected to perform better than thrust::count_if, at least in the middle cases where RelativeMismatchPosition = 0.5 (thanks for the name suggestion). Along the input-size axis, this holds for small and medium-sized inputs, but not for large inputs.

Contributor

🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079
  • 🟨 cub: Pass: 95%/136 | Total: 19h 26m | Avg: 8m 34s | Max: 35m 24s

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  95%/128 | Total: 18h 52m | Avg:  8m 50s | Max: 35m 24s
      🟩 arm64              Pass: 100%/8   | Total: 33m 44s | Avg:  4m 13s | Max:  4m 33s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 49s
      🔍 nvcc               Pass:  95%/134 | Total: 19h 19m | Avg:  8m 39s | Max: 35m 24s
    🚨 cxx_family: MSVC 🚨
      🟩 Clang              Pass: 100%/63  | Total:  9h 04m | Avg:  8m 38s | Max: 30m 14s
      🟩 GCC                Pass: 100%/64  | Total:  9h 07m | Avg:  8m 33s | Max: 35m 24s
      🟩 Intel              Pass: 100%/3   | Total: 16m 34s | Avg:  5m 31s | Max:  5m 42s
      🔥 MSVC               Pass:   0%/6   | Total: 57m 26s | Avg:  9m 34s | Max: 10m 42s
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  94%/103 | Total:  8h 05m | Avg:  4m 42s | Max: 10m 42s
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 28m | Avg: 18m 35s | Max: 19m 52s
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 07m | Avg: 15m 54s | Max: 19m 10s
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 28m | Avg: 18m 32s | Max: 22m 38s
      🟩 SmallGMem          Pass: 100%/1   | Total: 35m 24s | Avg: 35m 24s | Max: 35m 24s
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 41m | Avg: 27m 38s | Max: 30m 14s
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 01m | Avg:  4m 04s | Max: 10m 42s
      🟩 11.8               Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟨 12.6               Pass:  95%/118 | Total: 18h 11m | Avg:  9m 14s | Max: 35m 24s
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 49s
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 01m | Avg:  4m 04s | Max: 10m 42s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟨 nvcc12.6           Pass:  95%/116 | Total: 18h 04m | Avg:  9m 20s | Max: 35m 24s
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 26m 22s | Avg:  4m 23s | Max:  5m 25s
      🟩 Clang10            Pass: 100%/3   | Total: 15m 38s | Avg:  5m 12s | Max:  5m 21s
      🟩 Clang11            Pass: 100%/4   | Total: 17m 44s | Avg:  4m 26s | Max:  4m 45s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 19s | Avg:  4m 34s | Max:  4m 43s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 26s | Avg:  4m 36s | Max:  4m 46s
      🟩 Clang14            Pass: 100%/4   | Total: 17m 47s | Avg:  4m 26s | Max:  4m 39s
      🟩 Clang15            Pass: 100%/4   | Total: 18m 36s | Avg:  4m 39s | Max:  4m 53s
      🟩 Clang16            Pass: 100%/4   | Total: 19m 03s | Avg:  4m 45s | Max:  5m 13s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 14s | Avg:  4m 33s | Max:  4m 43s
      🟩 Clang18            Pass: 100%/26  | Total:  6h 14m | Avg: 14m 24s | Max: 30m 14s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 57s | Avg:  3m 28s | Max:  3m 31s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 23s | Avg:  3m 53s | Max:  4m 31s
      🟩 GCC8               Pass: 100%/6   | Total: 23m 30s | Avg:  3m 55s | Max:  4m 27s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 35s | Avg:  4m 05s | Max:  4m 32s
      🟩 GCC10              Pass: 100%/4   | Total: 19m 01s | Avg:  4m 45s | Max:  5m 07s
      🟩 GCC11              Pass: 100%/7   | Total: 32m 33s | Avg:  4m 39s | Max:  4m 57s
      🟩 GCC12              Pass: 100%/4   | Total: 18m 46s | Avg:  4m 41s | Max:  4m 54s
      🟩 GCC13              Pass: 100%/29  | Total:  6h 38m | Avg: 13m 45s | Max: 35m 24s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 16m 34s | Avg:  5m 31s | Max:  5m 42s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 10m 42s | Avg: 10m 42s | Max: 10m 42s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 18m 33s | Avg:  9m 16s | Max:  9m 24s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 28m 11s | Avg:  9m 23s | Max:  9m 39s
    🟨 std
      🟩 11                 Pass: 100%/35  | Total:  4h 34m | Avg:  7m 50s | Max: 30m 14s
      🟨 14                 Pass:  92%/38  | Total:  5h 08m | Avg:  8m 07s | Max: 26m 43s
      🟨 17                 Pass:  94%/38  | Total:  5h 38m | Avg:  8m 54s | Max: 35m 24s
      🟨 20                 Pass:  96%/25  | Total:  4h 05m | Avg:  9m 48s | Max: 28m 21s
    🟨 gpu
      🟨 v100               Pass:  95%/136 | Total: 19h 26m | Avg:  8m 34s | Max: 35m 24s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 50s | Avg:  4m 36s | Max:  4m 57s
      🟩 90a                Pass: 100%/4   | Total: 14m 45s | Avg:  3m 41s | Max:  3m 49s
    
  • 🟩 thrust: Pass: 100%/122 | Total: 12h 35m | Avg: 6m 11s | Max: 21m 04s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 00m | Avg:  6m 19s | Max: 21m 04s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 35m 04s | Avg:  4m 23s | Max:  5m 14s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 08m | Avg:  4m 35s | Max: 16m 04s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 12.6               Pass: 100%/104 | Total: 11h 13m | Avg:  6m 28s | Max: 21m 04s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 38s | Avg:  4m 19s | Max:  4m 23s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 08m | Avg:  4m 35s | Max: 16m 04s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 04m | Avg:  6m 30s | Max: 21m 04s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 38s | Avg:  4m 19s | Max:  4m 23s
      🟩 nvcc               Pass: 100%/120 | Total: 12h 26m | Avg:  6m 13s | Max: 21m 04s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 29m 37s | Avg:  4m 56s | Max:  6m 26s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 33s | Avg:  5m 51s | Max:  6m 18s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 32s | Avg:  4m 38s | Max:  5m 00s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 43s | Avg:  4m 40s | Max:  5m 00s
      🟩 Clang13            Pass: 100%/4   | Total: 19m 25s | Avg:  4m 51s | Max:  5m 11s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 48s | Avg:  4m 42s | Max:  4m 49s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 15s | Avg:  4m 48s | Max:  5m 06s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 36s | Avg:  4m 39s | Max:  4m 49s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 12s | Avg:  4m 33s | Max:  4m 56s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 00m | Avg:  6m 41s | Max: 14m 53s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 57s | Avg:  3m 28s | Max:  3m 31s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 55s | Avg:  3m 59s | Max:  4m 49s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 19s | Avg:  4m 03s | Max:  4m 33s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 58s | Avg:  4m 09s | Max:  4m 54s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 08s | Avg:  4m 32s | Max:  4m 41s
      🟩 GCC11              Pass: 100%/7   | Total: 31m 38s | Avg:  4m 31s | Max:  4m 40s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 06s | Avg:  4m 46s | Max:  5m 03s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 19m | Avg:  6m 59s | Max: 15m 30s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  5m 54s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 16m 04s | Avg: 16m 04s | Max: 16m 04s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 27m 01s | Avg: 13m 30s | Max: 13m 39s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  1h 47m | Avg: 17m 53s | Max: 21m 04s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  4h 59m | Avg:  5m 26s | Max: 14m 53s
      🟩 GCC                Pass: 100%/55  | Total:  4h 48m | Avg:  5m 15s | Max: 15m 30s
      🟩 Intel              Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  5m 54s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 30m | Avg: 16m 42s | Max: 21m 04s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 12h 35m | Avg:  6m 11s | Max: 21m 04s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  8h 50m | Avg:  5m 09s | Max: 16m 11s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  1h 58m | Avg: 10m 46s | Max: 21m 04s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 46m | Avg: 13m 15s | Max: 15m 30s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 90a                Pass: 100%/4   | Total: 15m 53s | Avg:  3m 58s | Max:  4m 20s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 33m | Avg:  4m 57s | Max: 15m 30s
      🟩 14                 Pass: 100%/35  | Total:  3h 50m | Avg:  6m 34s | Max: 19m 59s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 37m | Avg:  6m 23s | Max: 21m 01s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 34m | Avg:  7m 00s | Max: 21m 04s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 28s | Avg: 15m 28s | Max: 15m 28s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 259)

# Runner
186 linux-amd64-cpu16
42 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

@gonidelis gonidelis force-pushed the cub_find_if branch 3 times, most recently from ce3b44c to 6f9fba8 Compare September 17, 2024 03:51
Contributor

🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
  • 🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 41s | Max: 1h 39m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/128 | Total:  4d 05h | Avg: 47m 22s | Max:  1h 39m
      🟩 arm64              Pass: 100%/8   | Total:  7h 01m | Avg: 52m 41s | Max: 54m 14s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 50s | Max: 59m 38s
      🔍 nvcc               Pass:  88%/134 | Total:  4d 10h | Avg: 47m 31s | Max:  1h 39m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 07m | Avg:  4m 30s | Max: 14m 31s
      🟩 11.8               Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟨 12.6               Pass:  88%/118 | Total:  4d 10h | Avg: 54m 16s | Max:  1h 39m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 57m | Avg: 58m 50s | Max: 59m 38s
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 07m | Avg:  4m 30s | Max: 14m 31s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟨 nvcc12.6           Pass:  87%/116 | Total:  4d 08h | Avg: 54m 11s | Max:  1h 39m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 30m | Avg: 25m 07s | Max: 48m 51s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 15m | Avg: 45m 00s | Max: 47m 01s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 55m | Avg: 43m 51s | Max: 45m 13s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 24s | Max: 47m 22s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 14s | Max: 47m 45s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 53m | Avg: 43m 20s | Max: 43m 54s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 43m 46s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 28s | Max: 47m 08s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 55m | Avg: 43m 51s | Max: 45m 31s
      🟨 Clang18            Pass:  84%/26  | Total:  1d 07h | Avg:  1h 11m | Max:  1h 34m
      🟩 GCC6               Pass: 100%/2   | Total:  7m 00s | Avg:  3m 30s | Max:  3m 41s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 23m | Avg: 23m 52s | Max: 45m 11s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 54m | Avg: 29m 00s | Max:  1h 12m
      🟩 GCC9               Pass: 100%/6   | Total:  2h 28m | Avg: 24m 42s | Max: 46m 30s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 58m | Avg: 44m 35s | Max: 47m 43s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 14m | Avg: 27m 49s | Max: 46m 54s
      🟩 GCC12              Pass: 100%/4   | Total:  3h 03m | Avg: 45m 51s | Max: 48m 13s
      🟨 GCC13              Pass:  82%/29  | Total:  1d 06h | Avg:  1h 04m | Max:  1h 39m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 17m | Avg: 45m 54s | Max: 48m 09s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 31s | Avg: 14m 31s | Max: 14m 31s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 24m 02s | Avg: 12m 01s | Max: 12m 25s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 38m 46s | Avg: 12m 55s | Max: 13m 28s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/63  | Total:  2d 08h | Avg: 53m 42s | Max:  1h 34m
      🟨 GCC                Pass:  92%/64  | Total:  2d 00h | Avg: 45m 06s | Max:  1h 39m
      🟩 Intel              Pass: 100%/3   | Total:  2h 17m | Avg: 45m 54s | Max: 48m 09s
      🟥 MSVC               Pass:   0%/6   | Total:  1h 17m | Avg: 12m 53s | Max: 14m 31s
    🟨 jobs
      🟨 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 35m 43s | Max:  1h 12m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 01m | Avg:  1h 22m | Max:  1h 25m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 59m | Avg:  1h 22m | Max:  1h 29m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 01m | Avg:  1h 22m | Max:  1h 27m
      🟥 SmallGMem          Pass:   0%/1   | Total:  1h 39m | Avg:  1h 39m | Max:  1h 39m
      🟥 TestGPU            Pass:   0%/8   | Total: 12h 02m | Avg:  1h 30m | Max:  1h 36m
    🟨 gpu
      🟨 v100               Pass:  88%/136 | Total:  4d 12h | Avg: 47m 41s | Max:  1h 39m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 21s | Avg:  4m 27s | Max:  4m 36s
      🟩 90a                Pass: 100%/4   | Total: 15m 11s | Avg:  3m 47s | Max:  3m 55s
    🟨 std
      🟨 11                 Pass:  94%/35  | Total:  1d 02h | Avg: 45m 56s | Max:  1h 30m
      🟨 14                 Pass:  86%/38  | Total:  1d 03h | Avg: 43m 33s | Max:  1h 33m
      🟨 17                 Pass:  86%/38  | Total:  1d 06h | Avg: 47m 27s | Max:  1h 39m
      🟨 20                 Pass:  88%/25  | Total: 23h 38m | Avg: 56m 44s | Max:  1h 36m
    
  • 🟩 thrust: Pass: 100%/122 | Total: 13h 27m | Avg: 6m 37s | Max: 22m 34s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 49m | Avg:  6m 45s | Max: 22m 34s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 37m 43s | Avg:  4m 42s | Max:  5m 21s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 12m | Avg:  4m 50s | Max: 18m 34s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 12.6               Pass: 100%/104 | Total: 12h 00m | Avg:  6m 55s | Max: 22m 34s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 51s | Avg:  4m 25s | Max:  4m 26s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 12m | Avg:  4m 50s | Max: 18m 34s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 51m | Avg:  6m 58s | Max: 22m 34s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 51s | Avg:  4m 25s | Max:  4m 26s
      🟩 nvcc               Pass: 100%/120 | Total: 13h 18m | Avg:  6m 39s | Max: 22m 34s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 30m 48s | Avg:  5m 08s | Max:  5m 55s
      🟩 Clang10            Pass: 100%/3   | Total: 16m 46s | Avg:  5m 35s | Max:  5m 57s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 12s | Avg:  4m 33s | Max:  4m 37s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 15s | Avg:  4m 33s | Max:  4m 46s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 05s | Avg:  4m 31s | Max:  4m 39s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 18s | Avg:  4m 34s | Max:  4m 43s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 15s | Avg:  4m 48s | Max:  5m 14s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 38s | Avg:  4m 39s | Max:  5m 00s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 45s | Avg:  4m 41s | Max:  4m 52s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 15m | Avg:  7m 30s | Max: 18m 36s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 06s | Avg:  3m 33s | Max:  3m 50s
      🟩 GCC7               Pass: 100%/6   | Total: 24m 38s | Avg:  4m 06s | Max:  4m 44s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 11s | Avg:  4m 01s | Max:  4m 44s
      🟩 GCC9               Pass: 100%/6   | Total: 25m 26s | Avg:  4m 14s | Max:  4m 52s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 45s | Avg:  4m 41s | Max:  5m 00s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 43s | Avg:  4m 49s | Max:  5m 02s
      🟩 GCC12              Pass: 100%/4   | Total: 20m 07s | Avg:  5m 01s | Max:  5m 10s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 32m | Avg:  7m 36s | Max: 19m 31s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 17m 50s | Avg:  5m 56s | Max:  6m 06s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 34s | Avg: 18m 34s | Max: 18m 34s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 32m 57s | Avg: 16m 28s | Max: 16m 46s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  1h 59m | Avg: 19m 57s | Max: 22m 34s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 12m | Avg:  5m 40s | Max: 18m 36s
      🟩 GCC                Pass: 100%/55  | Total:  5h 06m | Avg:  5m 34s | Max: 19m 31s
      🟩 Intel              Pass: 100%/3   | Total: 17m 50s | Avg:  5m 56s | Max:  6m 06s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 51m | Avg: 19m 01s | Max: 22m 34s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 13h 27m | Avg:  6m 37s | Max: 22m 34s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 14m | Avg:  5m 22s | Max: 18m 50s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 05m | Avg: 11m 23s | Max: 22m 34s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 07m | Avg: 15m 59s | Max: 19m 31s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  4m 59s
      🟩 90a                Pass: 100%/4   | Total: 15m 40s | Avg:  3m 55s | Max:  4m 02s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 46m | Avg:  5m 22s | Max: 19m 26s
      🟩 14                 Pass: 100%/35  | Total:  4h 09m | Avg:  7m 08s | Max: 22m 03s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 45m | Avg:  6m 38s | Max: 22m 34s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 45m | Avg:  7m 30s | Max: 21m 58s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 17m 16s | Avg: 17m 16s | Max: 17m 16s
    

Contributor

🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
  • 🟨 cub: Pass: 88%/136 | Total: 4d 12h | Avg: 47m 44s | Max: 1h 34m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  88%/128 | Total:  4d 05h | Avg: 47m 21s | Max:  1h 34m
      🟩 arm64              Pass: 100%/8   | Total:  7h 11m | Avg: 53m 53s | Max: 57m 36s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m
      🔍 nvcc               Pass:  88%/134 | Total:  4d 10h | Avg: 47m 29s | Max:  1h 34m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 07m | Avg:  4m 31s | Max: 14m 48s
      🟩 11.8               Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟨 12.6               Pass:  88%/118 | Total:  4d 10h | Avg: 54m 19s | Max:  1h 34m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 05m
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 07m | Avg:  4m 31s | Max: 14m 48s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟨 nvcc12.6           Pass:  87%/116 | Total:  4d 08h | Avg: 54m 09s | Max:  1h 34m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 27m | Avg: 24m 34s | Max: 46m 01s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 16m | Avg: 45m 28s | Max: 47m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 59m | Avg: 44m 49s | Max: 46m 31s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 58m | Avg: 44m 42s | Max: 48m 46s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 15s | Max: 48m 00s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 03m | Avg: 45m 46s | Max: 47m 53s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 17s | Max: 46m 06s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 54m | Avg: 43m 38s | Max: 45m 19s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 01m | Avg: 45m 25s | Max: 47m 22s
      🟨 Clang18            Pass:  84%/26  | Total:  1d 07h | Avg:  1h 12m | Max:  1h 34m
      🟩 GCC6               Pass: 100%/2   | Total:  6m 59s | Avg:  3m 29s | Max:  3m 34s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 22m | Avg: 23m 44s | Max: 44m 38s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 20m | Avg: 23m 27s | Max: 43m 37s
      🟩 GCC9               Pass: 100%/6   | Total:  2h 27m | Avg: 24m 31s | Max: 47m 41s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 59m | Avg: 44m 52s | Max: 45m 38s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 13m | Avg: 27m 36s | Max: 46m 19s
      🟩 GCC12              Pass: 100%/4   | Total:  3h 34m | Avg: 53m 39s | Max:  1h 17m
      🟨 GCC13              Pass:  82%/29  | Total:  1d 06h | Avg:  1h 03m | Max:  1h 34m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 48m 33s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 48s | Avg: 14m 48s | Max: 14m 48s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 25m 11s | Avg: 12m 35s | Max: 12m 36s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 41m 00s | Avg: 13m 40s | Max: 14m 26s
    🟨 cxx_family
      🟨 Clang              Pass:  93%/63  | Total:  2d 08h | Avg: 54m 09s | Max:  1h 34m
      🟨 GCC                Pass:  92%/64  | Total:  1d 23h | Avg: 44m 42s | Max:  1h 34m
      🟩 Intel              Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 48m 33s
      🟥 MSVC               Pass:   0%/6   | Total:  1h 20m | Avg: 13m 29s | Max: 14m 48s
    🟨 jobs
      🟨 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 36m 04s | Max:  1h 17m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 02m | Avg:  1h 22m | Max:  1h 27m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 44m | Avg:  1h 20m | Max:  1h 25m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 08m | Avg:  1h 23m | Max:  1h 29m
      🟥 SmallGMem          Pass:   0%/1   | Total:  1h 34m | Avg:  1h 34m | Max:  1h 34m
      🟥 TestGPU            Pass:   0%/8   | Total: 11h 48m | Avg:  1h 28m | Max:  1h 34m
    🟨 gpu
      🟨 v100               Pass:  88%/136 | Total:  4d 12h | Avg: 47m 44s | Max:  1h 34m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 06s | Avg:  4m 42s | Max:  4m 54s
      🟩 90a                Pass: 100%/4   | Total: 15m 27s | Avg:  3m 51s | Max:  4m 05s
    🟨 std
      🟨 11                 Pass:  94%/35  | Total:  1d 03h | Avg: 46m 34s | Max:  1h 28m
      🟨 14                 Pass:  86%/38  | Total:  1d 03h | Avg: 42m 41s | Max:  1h 26m
      🟨 17                 Pass:  86%/38  | Total:  1d 06h | Avg: 47m 56s | Max:  1h 34m
      🟨 20                 Pass:  88%/25  | Total: 23h 38m | Avg: 56m 44s | Max:  1h 33m
    
  • 🟩 thrust: Pass: 100%/122 | Total: 13h 45m | Avg: 6m 46s | Max: 25m 49s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 13h 10m | Avg:  6m 56s | Max: 25m 49s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 34m 56s | Avg:  4m 22s | Max:  4m 48s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 13m | Avg:  4m 52s | Max: 20m 32s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 12.6               Pass: 100%/104 | Total: 12h 18m | Avg:  7m 06s | Max: 25m 49s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  4m 44s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 13m | Avg:  4m 52s | Max: 20m 32s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 12h 09m | Avg:  7m 09s | Max: 25m 49s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  4m 44s
      🟩 nvcc               Pass: 100%/120 | Total: 13h 36m | Avg:  6m 48s | Max: 25m 49s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 28m 11s | Avg:  4m 41s | Max:  5m 43s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 05s | Avg:  5m 41s | Max:  6m 17s
      🟩 Clang11            Pass: 100%/4   | Total: 19m 09s | Avg:  4m 47s | Max:  4m 58s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 41s | Avg:  4m 40s | Max:  4m 48s
      🟩 Clang13            Pass: 100%/4   | Total: 19m 37s | Avg:  4m 54s | Max:  5m 48s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 53s | Avg:  4m 43s | Max:  5m 01s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 43s | Avg:  4m 55s | Max:  5m 25s
      🟩 Clang16            Pass: 100%/4   | Total: 19m 09s | Avg:  4m 47s | Max:  5m 03s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 47s | Avg:  4m 41s | Max:  4m 56s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 28m | Avg:  8m 15s | Max: 23m 35s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 07s | Avg:  3m 33s | Max:  3m 49s
      🟩 GCC7               Pass: 100%/6   | Total: 24m 04s | Avg:  4m 00s | Max:  4m 41s
      🟩 GCC8               Pass: 100%/6   | Total: 25m 11s | Avg:  4m 11s | Max:  4m 38s
      🟩 GCC9               Pass: 100%/6   | Total: 25m 03s | Avg:  4m 10s | Max:  5m 09s
      🟩 GCC10              Pass: 100%/4   | Total: 18m 43s | Avg:  4m 40s | Max:  5m 02s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 20s | Avg:  4m 45s | Max:  5m 28s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 33s | Avg:  4m 53s | Max:  5m 07s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 23m | Avg:  7m 09s | Max: 16m 05s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 18m 12s | Avg:  6m 04s | Max:  6m 29s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 20m 32s | Avg: 20m 32s | Max: 20m 32s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 34m 31s | Avg: 17m 15s | Max: 17m 34s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  2h 08m | Avg: 21m 23s | Max: 25m 49s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 27m | Avg:  5m 57s | Max: 23m 35s
      🟩 GCC                Pass: 100%/55  | Total:  4h 56m | Avg:  5m 23s | Max: 16m 05s
      🟩 Intel              Pass: 100%/3   | Total: 18m 12s | Avg:  6m 04s | Max:  6m 29s
      🟩 MSVC               Pass: 100%/9   | Total:  3h 03m | Avg: 20m 22s | Max: 25m 49s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 13h 45m | Avg:  6m 46s | Max: 25m 49s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 19m | Avg:  5m 26s | Max: 20m 32s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 11m | Avg: 11m 56s | Max: 25m 49s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 14m | Avg: 16m 49s | Max: 23m 35s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 40s | Avg:  4m 33s | Max:  4m 53s
      🟩 90a                Pass: 100%/4   | Total: 16m 09s | Avg:  4m 02s | Max:  4m 19s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 46m | Avg:  5m 22s | Max: 23m 35s
      🟩 14                 Pass: 100%/35  | Total:  4h 03m | Avg:  6m 58s | Max: 22m 44s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  4h 05m | Avg:  7m 13s | Max: 25m 20s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 49m | Avg:  7m 41s | Max: 25m 49s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 259)

# Runner
186 linux-amd64-cpu16
42 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

Contributor

🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
  • 🟨 cub: Pass: 95%/136 | Total: 4d 12h | Avg: 47m 51s | Max: 1h 41m

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  95%/128 | Total:  4d 05h | Avg: 47m 22s | Max:  1h 41m
      🟩 arm64              Pass: 100%/8   | Total:  7h 24m | Avg: 55m 32s | Max: 59m 13s
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 04m
      🔍 nvcc               Pass:  95%/134 | Total:  4d 10h | Avg: 47m 38s | Max:  1h 41m
    🚨 cxx_family: MSVC 🚨
      🟩 Clang              Pass: 100%/63  | Total:  2d 08h | Avg: 53m 43s | Max:  1h 31m
      🟩 GCC                Pass: 100%/64  | Total:  2d 00h | Avg: 45m 26s | Max:  1h 41m
      🟩 Intel              Pass: 100%/3   | Total:  2h 17m | Avg: 45m 46s | Max: 46m 50s
      🔥 MSVC               Pass:   0%/6   | Total:  1h 18m | Avg: 13m 04s | Max: 14m 25s
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  94%/103 | Total:  2d 13h | Avg: 35m 50s | Max:  1h 04m
      🟩 DeviceLaunch       Pass: 100%/8   | Total: 11h 17m | Avg:  1h 24m | Max:  1h 28m
      🟩 GraphCapture       Pass: 100%/8   | Total: 10h 49m | Avg:  1h 21m | Max:  1h 25m
      🟩 HostLaunch         Pass: 100%/8   | Total: 11h 03m | Avg:  1h 22m | Max:  1h 26m
      🟩 SmallGMem          Pass: 100%/1   | Total:  1h 41m | Avg:  1h 41m | Max:  1h 41m
      🟩 TestGPU            Pass: 100%/8   | Total: 12h 04m | Avg:  1h 30m | Max:  1h 32m
    🟨 ctk
      🟨 11.1               Pass:  93%/15  | Total:  1h 04m | Avg:  4m 17s | Max: 14m 25s
      🟩 11.8               Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟨 12.6               Pass:  95%/118 | Total:  4d 11h | Avg: 54m 29s | Max:  1h 41m
    🟨 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 04m
      🟨 nvcc11.1           Pass:  93%/15  | Total:  1h 04m | Avg:  4m 17s | Max: 14m 25s
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟨 nvcc12.6           Pass:  95%/116 | Total:  4d 09h | Avg: 54m 22s | Max:  1h 41m
    🟨 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 26m | Avg: 24m 27s | Max: 47m 10s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 15m | Avg: 45m 06s | Max: 47m 07s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 14s | Max: 47m 46s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 53m | Avg: 43m 18s | Max: 43m 58s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 57m | Avg: 44m 29s | Max: 48m 16s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 54m | Avg: 43m 32s | Max: 44m 01s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 00m | Avg: 45m 13s | Max: 48m 18s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 58m | Avg: 44m 33s | Max: 49m 00s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 52m | Avg: 43m 14s | Max: 43m 49s
      🟩 Clang18            Pass: 100%/26  | Total:  1d 07h | Avg:  1h 11m | Max:  1h 31m
      🟩 GCC6               Pass: 100%/2   | Total:  7m 05s | Avg:  3m 32s | Max:  3m 36s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 20m | Avg: 23m 28s | Max: 44m 26s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 19m | Avg: 23m 13s | Max: 43m 13s
      🟩 GCC9               Pass: 100%/6   | Total:  2h 21m | Avg: 23m 34s | Max: 44m 16s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 01m | Avg: 45m 27s | Max: 47m 37s
      🟩 GCC11              Pass: 100%/7   | Total:  3h 09m | Avg: 27m 05s | Max: 45m 18s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 54m | Avg: 43m 40s | Max: 44m 17s
      🟩 GCC13              Pass: 100%/29  | Total:  1d 08h | Avg:  1h 06m | Max:  1h 41m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 17m | Avg: 45m 46s | Max: 46m 50s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 14m 25s | Avg: 14m 25s | Max: 14m 25s
      🟥 MSVC14.29          Pass:   0%/2   | Total: 25m 57s | Avg: 12m 58s | Max: 13m 57s
      🟥 MSVC14.39          Pass:   0%/3   | Total: 38m 05s | Avg: 12m 41s | Max: 13m 01s
    🟨 std
      🟩 11                 Pass: 100%/35  | Total:  1d 02h | Avg: 45m 54s | Max:  1h 31m
      🟨 14                 Pass:  92%/38  | Total:  1d 03h | Avg: 43m 46s | Max:  1h 31m
      🟨 17                 Pass:  94%/38  | Total:  1d 06h | Avg: 47m 29s | Max:  1h 41m
      🟨 20                 Pass:  96%/25  | Total: 23h 53m | Avg: 57m 20s | Max:  1h 32m
    🟨 gpu
      🟨 v100               Pass:  95%/136 | Total:  4d 12h | Avg: 47m 51s | Max:  1h 41m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 53s | Avg:  4m 37s | Max:  4m 53s
      🟩 90a                Pass: 100%/4   | Total: 40m 53s | Avg: 10m 13s | Max: 14m 20s
    
  • 🟩 thrust: Pass: 100%/122 | Total: 12h 58m | Avg: 6m 22s | Max: 25m 02s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 23m | Avg:  6m 31s | Max: 25m 02s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 34m 44s | Avg:  4m 20s | Max:  4m 53s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 10m | Avg:  4m 40s | Max: 18m 21s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 12.6               Pass: 100%/104 | Total: 11h 33m | Avg:  6m 40s | Max: 25m 02s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  4m 47s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 10m | Avg:  4m 40s | Max: 18m 21s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 24m | Avg:  6m 42s | Max: 25m 02s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 15s | Avg:  4m 37s | Max:  4m 47s
      🟩 nvcc               Pass: 100%/120 | Total: 12h 49m | Avg:  6m 24s | Max: 25m 02s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 27m 33s | Avg:  4m 35s | Max:  5m 30s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 02s | Avg:  5m 40s | Max:  6m 05s
      🟩 Clang11            Pass: 100%/4   | Total: 18m 49s | Avg:  4m 42s | Max:  5m 01s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 24s | Avg:  4m 36s | Max:  4m 50s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 13s | Avg:  4m 33s | Max:  5m 09s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 14s | Avg:  4m 33s | Max:  5m 03s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 02s | Avg:  4m 45s | Max:  5m 01s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 31s | Avg:  4m 37s | Max:  4m 44s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 15s | Avg:  4m 33s | Max:  4m 51s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 06m | Avg:  7m 01s | Max: 15m 32s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 51s | Avg:  3m 25s | Max:  3m 33s
      🟩 GCC7               Pass: 100%/6   | Total: 23m 06s | Avg:  3m 51s | Max:  4m 23s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 29s | Avg:  4m 04s | Max:  4m 46s
      🟩 GCC9               Pass: 100%/6   | Total: 24m 27s | Avg:  4m 04s | Max:  5m 02s
      🟩 GCC10              Pass: 100%/4   | Total: 17m 24s | Avg:  4m 21s | Max:  4m 39s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 24s | Avg:  4m 46s | Max:  5m 31s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 11s | Avg:  4m 47s | Max:  5m 08s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 18m | Avg:  6m 54s | Max: 15m 11s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 17m 10s | Avg:  5m 43s | Max:  5m 57s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 21s | Avg: 18m 21s | Max: 18m 21s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 33m 52s | Avg: 16m 56s | Max: 18m 05s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  2h 01m | Avg: 20m 17s | Max: 25m 02s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 00m | Avg:  5m 27s | Max: 15m 32s
      🟩 GCC                Pass: 100%/55  | Total:  4h 46m | Avg:  5m 13s | Max: 15m 11s
      🟩 Intel              Pass: 100%/3   | Total: 17m 10s | Avg:  5m 43s | Max:  5m 57s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 53m | Avg: 19m 19s | Max: 25m 02s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 12h 58m | Avg:  6m 22s | Max: 25m 02s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 02m | Avg:  5m 15s | Max: 18m 21s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 04m | Avg: 11m 18s | Max: 25m 02s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 51m | Avg: 13m 58s | Max: 15m 32s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 14m 34s | Avg:  4m 51s | Max:  5m 31s
      🟩 90a                Pass: 100%/4   | Total: 15m 40s | Avg:  3m 55s | Max:  4m 15s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 26m | Avg:  4m 43s | Max: 10m 45s
      🟩 14                 Pass: 100%/35  | Total:  3h 58m | Avg:  6m 49s | Max: 21m 38s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 52m | Avg:  6m 50s | Max: 22m 44s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 40m | Avg:  7m 18s | Max: 25m 02s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 42s | Avg: 15m 42s | Max: 15m 42s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 259)

# Runner
186 linux-amd64-cpu16
42 linux-amd64-gpu-v100-latest-1
16 linux-arm64-cpu16
15 windows-amd64-cpu16

@gonidelis
Member Author

gonidelis commented Sep 25, 2024

Performance Results of thrust::count_if vs cub::DeviceFind::FindIf
(runs with I8 input type to test the performance of the newly introduced vectorized loads)

## [0] NVIDIA H200

| T   | Elements | MismatchAt | Ref Time  | Ref Noise | Cmp Time  | Cmp Noise | Diff       | %Diff    |
| --- | -------- | ---------- | --------- | --------- | --------- | --------- | ---------  | -------  |
| I8  | 2^16     | 1          | 19.937 us | 9.99%     | 10.860 us | 3.43%     | -9.077 us  | -45.53%  |
| I8  | 2^20     | 1          | 20.740 us | 2.16%     | 13.999 us | 3.34%     | -6.741 us  | -32.50%  |
| I8  | 2^24     | 1          | 31.782 us | 2.21%     | 22.034 us | 1.84%     | -9.718 us  | -30.60%  |
| I8  | 2^28     | 1          | 137.999 us| 6.49%     | 177.938 us| 2.55%     | 39.938 us  | 28.94%   |
| I8  | 2^16     | 0.5        | 19.962 us | 2.26%     | 10.860 us | 2.51%     | -9.002 us  | -45.10%  |
| I8  | 2^20     | 0.5        | 21.056 us | 2.79%     | 13.979 us | 5.12%     | -7.078 us  | -33.61%  |
| I8  | 2^24     | 0.5        | 31.876 us | 1.51%     | 16.793 us | 2.42%     | -15.083 us | -47.32%  |
| I8  | 2^28     | 0.5        | 137.932 us| 0.54%     | 96.730 us | 0.64%     | -41.202 us | -29.87%  |
| I8  | 2^16     | 0          | 20.016 us | 2.77%     | 9.661 us  | 2.99%     | -10.375 us | -51.83%  |
| I8  | 2^20     | 0          | 20.981 us | 2.70%     | 10.298 us | 3.69%     | -10.654 us | -50.85%  |
| I8  | 2^24     | 0          | 31.743 us | 1.86%     | 14.190 us | 2.16%     | -17.553 us | -55.30%  |
| I8  | 2^28     | 0          | 137.975 us| 0.49%     | 14.186 us | 2.96%     | -123.789 us| -89.72%  |

When MismatchAt == 1, the predicate is satisfied only by the last element of the range. When MismatchAt == 0, every element in the range satisfies the predicate. The latter case benefits most from the newly introduced algorithm.

We also see that thrust::count_if is faster only when the whole range needs to be iterated (MismatchAt == 1) and the input is large (2^28 elements).

@gonidelis
Member Author

gonidelis commented Oct 11, 2024

Some long-awaited performance results on A6000 and H200 (extending @gevtushenko's work in #1870):

Search Operation
cub::DeviceFind::FindIf, thrust::find_if and thrust::count_if are used as backends to search for an int32 key placed at a given percentage of the input, from 0 to 100%.
image
image

Equal Operation
cub::DeviceFind::FindIf, thrust::find_if and thrust::count_if are used as backends to compare two vectors of int32 for equality, with a mismatch placed at a given percentage of the input, from 0 to 100%.
image
image

Contributor

@bernhardmgruber bernhardmgruber left a comment

General comments:

  • You can make some more variables const
  • Your hand-written vectorized loading looks a lot like the BLOCK_LOAD_VECTORIZE algorithm of cub::BlockLoad. Why can't you use that one?

cub/benchmarks/bench/find_if/base.cu Outdated Show resolved Hide resolved
using VectorT = typename CubVector<InputT, _VECTOR_LOAD_LENGTH>::Type;
///
bool full_tile = (tile_offset + tile_size) <= num_items;
bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
Contributor

I find a modulo operation easier to read, and it should compile to the same bitwise operation. Also, reinterpreting a pointer as an integer should use uintptr_t. However, that may be a different type than size_t, so:

Suggested change
- bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
+ const bool is_aligned = reinterpret_cast<::cuda::std::uintptr_t>(d_in) % uintptr_t{sizeof(VectorT)} == 0;

You may also need to handle the case when Iterator is not a pointer, but I guess that is planned for later, when the runtime numbers look good :) Given you have a second overload of IsAlignedAndFullTile below, you could just change Iterator to T* in the function signature, so any non-pointer iterator would just not match your overload here.
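
As a rough host-side illustration of the two points above (modulo on a `uintptr_t`, and a pointer-only overload so that other iterators never take the vectorized path), here is a minimal C++ sketch. `Vec4` and `is_aligned_and_full_tile` are hypothetical names chosen for this example, not the PR's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte vector type standing in for VectorT.
struct alignas(16) Vec4
{
  int lane[4];
};

// Pointer overload: vectorized loads are only considered when the tile is full
// and the address is a multiple of sizeof(VectorT).
template <typename VectorT, typename T>
bool is_aligned_and_full_tile(T* d_in, std::size_t tile_offset, std::size_t tile_size, std::size_t num_items)
{
  const bool full_tile  = (tile_offset + tile_size) <= num_items;
  const bool is_aligned = reinterpret_cast<std::uintptr_t>(d_in) % std::uintptr_t{sizeof(VectorT)} == 0;
  return full_tile && is_aligned;
}

// Fallback for any non-pointer iterator: it is less specialized than the T*
// overload, so it is only chosen when the argument is not a raw pointer.
template <typename VectorT, typename Iterator>
bool is_aligned_and_full_tile(Iterator, std::size_t, std::size_t, std::size_t)
{
  return false;
}
```

Calling `is_aligned_and_full_tile<Vec4>(ptr, tile_offset, tile_size, num_items)` with a raw pointer selects the first overload through partial ordering; any class-type iterator falls through to the second.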

Comment on lines 108 to 111
InputT* d_in_unqualified = const_cast<InputT*>(begin) + tile_offset + (threadIdx.x * _VECTOR_LOAD_LENGTH);

cub::CacheModifiedInputIterator<cub::CacheLoadModifier::LOAD_LDG, VectorT> d_vec_in(
reinterpret_cast<VectorT*>(d_in_unqualified));
Contributor

Stripping const from the input pointer should not be necessary, since we want to only read from it. Is this a limitation of cub::CacheModifiedInputIterator?

Comment on lines 317 to 323
// use d_temp_storage as the intermediate device result
// to read and write from. Then store the final result in the output iterator.
cuda_mem_set_async_dtemp_storage<<<1, 1>>>(int_temp_storage, num_items);

find_if<<<findif_grid_size, block_threads, 0, stream>>>(d_in, d_in + num_items, op, int_temp_storage, num_items);

write_final_result_in_output_iterator_already<int><<<1, 1>>>(int_temp_storage, d_out);
Contributor

Idea: if OutputIteratorT is a pointer to some integer type (which I imagine is common), you could just use that memory location to fire the atomic minimum operations at. Then you would not need a second kernel and also no temporary storage.
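
To make the idea concrete, the following is a sequential host-side analogue in C++, not CUDA: a `std::atomic<int>` plays the role of the integer output location, and each loop iteration plays the role of one block. `atomic_min` and `find_if_tiles` are hypothetical helpers for this sketch. Note that the output still has to be initialised to `num_items` before the scan starts.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

// Atomic minimum via a compare-exchange loop.
inline void atomic_min(std::atomic<int>& target, int value)
{
  int current = target.load();
  // On failure compare_exchange_weak reloads `current`; retry while `value` is still smaller.
  while (value < current && !target.compare_exchange_weak(current, value))
  {
  }
}

// Each "block" scans one tile and fires its match directly at `out`,
// which doubles as the early-exit flag. No temporary storage is used.
template <typename Pred>
void find_if_tiles(const std::vector<int>& in, int tile_size, Pred pred, std::atomic<int>& out)
{
  const int num_items = static_cast<int>(in.size());
  out.store(num_items); // "not found" sentinel
  for (int tile_offset = 0; tile_offset < num_items; tile_offset += tile_size)
  {
    if (out.load() < tile_offset)
    {
      continue; // an earlier match is already published; a real block would return here
    }
    const int tile_end = std::min(tile_offset + tile_size, num_items);
    for (int index = tile_offset; index < tile_end; ++index)
    {
      if (pred(in[index]))
      {
        atomic_min(out, index);
        break;
      }
    }
  }
}
```

After the call, `out` holds the index of the first match, or `in.size()` when no element satisfies the predicate.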

Comment on lines 222 to 226
if (threadIdx.x == 0)
{
sresult = atomicAdd(result, 0);
}
__syncthreads();
Contributor

I think you can use an ordinary load: sresult = result;. In the first loop iteration, only thread 0 reads, so there is no concurrency. In all subsequent iterations, ConsumeRange contains a synchronize_or() after writes to result, so no thread can be writing when reading here in thread 0 again.

cub/cub/device/device_find_if.cuh Outdated Show resolved Hide resolved
Comment on lines 113 to 114
InputT input_items[elements_per_thread];
VectorT* vec_items = reinterpret_cast<VectorT*>(input_items);
Contributor

Strictly speaking, input_items has a smaller alignment than when you access the data through vec_items, so you could run into misaligned stores.

I suggest declaring an array of vectors instead, and reinterpreting it as an array of scalars.
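
A small host-side C++ sketch of that suggestion follows. `Vec4`, `sum_items` and the constants are hypothetical stand-ins for `VectorT` and the agent's per-thread storage; the `static_assert`s only illustrate the alignment gap on a typical platform where `int` is 4-byte aligned.

```cpp
// Hypothetical 16-byte vector of four ints, standing in for VectorT.
struct alignas(16) Vec4
{
  int lane[4];
};

constexpr int vector_load_length  = 4; // scalars per vector
constexpr int vectors_per_thread  = 2; // vector loads per thread
constexpr int elements_per_thread = vector_load_length * vectors_per_thread;

// A scalar array only guarantees alignof(int), so vector stores into it may be misaligned.
static_assert(alignof(int[elements_per_thread]) < alignof(Vec4), "scalar storage is under-aligned");
// An array of vectors carries the stricter alignment.
static_assert(alignof(Vec4[vectors_per_thread]) == alignof(Vec4), "vector storage is suitably aligned");
static_assert(sizeof(Vec4) == vector_load_length * sizeof(int), "no padding expected");

// Stores whole vectors into aligned storage, then reads them back through a scalar view.
inline int sum_items(const Vec4* src)
{
  Vec4 vec_items[vectors_per_thread];
  for (int i = 0; i < vectors_per_thread; ++i)
  {
    vec_items[i] = src[i]; // vector-sized store into properly aligned storage
  }
  const int* input_items = reinterpret_cast<const int*>(vec_items); // scalar view of the same bytes
  int sum = 0;
  for (int i = 0; i < elements_per_thread; ++i)
  {
    sum += input_items[i];
  }
  return sum;
}
```

The direction of the cast is what matters: going from the more strictly aligned type to the less strictly aligned one can never produce a misaligned access.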

Comment on lines 126 to 130
int index = i % WORDS + (i / WORDS) * block_threads * WORDS + threadIdx.x * WORDS + tile_offset;
// i % WORDS = + 0 1 2 3, 0 1 2 3, 0 1 2 3, ... (static)
// (i / WORDS) * block_threads * WORDS = + 0 , 64 , 128, ... (static)
// threadIdx.x * WORDS = + 0, 4, 8, ... offset of the thread within working tile
// tile_offset = + just start at the beginning of the block
Contributor

The vectorization only concerns the loads, so the processing loop should no longer need to consider WORDS. It should just be auto index = tile_offset + threadIdx.x + i * block_threads;, like in the non-vectorized version. But please verify. Maybe I am missing something.

Contributor

So, my suggestion was wrong above. I was somehow assuming both versions used the same access pattern, loading blocks of elements_per_thread. However, each version loads block-strided. That changes how the index is computed.

cub/cub/device/device_find_if.cuh Outdated Show resolved Hide resolved
Comment on lines 116 to 125
#pragma unroll
for (int i = 0; i < WORDS; ++i)
{
vec_items[i] = d_vec_in[block_threads * i];
}
//// vectorized loads end

bool found = false;
for (int i = 0; i < elements_per_thread; ++i)
{
Contributor

I wonder whether it would make any sense to merge the two loops in order to load 1 vector, then process the predicate on all scalars of the vector, and then continue to load the next vector. This way, you could avoid loading the entire tile in which the predicate turns true, and you can overlap data loading with compute more tightly. But if you want to try this, maybe try it last and stabilize your existing approach first!
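
The merged loop could look roughly like the host-side C++ sketch below, which emulates the work of a single thread. `Vec4`, `first_match_in_thread` and the `vectors_loaded` counter are hypothetical and exist only to show that the remaining loads are skipped once a match is found. The vectors are assumed to be block-strided, as in the PR.

```cpp
// Hypothetical 16-byte vector of four ints, standing in for VectorT.
struct alignas(16) Vec4
{
  int lane[4];
};

constexpr int vector_load_length = 4;

// Emulates one thread: load a vector, test its scalars, and only then load the next one.
// Returns the global index of the thread's first match, or -1 if it finds none.
template <typename Pred>
int first_match_in_thread(
  const Vec4* tile_vectors, // first vector of the tile
  int thread_id,
  int block_threads,
  int vectors_per_thread,
  int tile_offset,
  Pred pred,
  int& vectors_loaded)
{
  vectors_loaded = 0;
  for (int v = 0; v < vectors_per_thread; ++v)
  {
    const int vector_of_tile = v * block_threads + thread_id; // block-strided access
    const Vec4 item          = tile_vectors[vector_of_tile];  // one vector-sized load
    ++vectors_loaded;
    for (int e = 0; e < vector_load_length; ++e)
    {
      if (pred(item.lane[e]))
      {
        return tile_offset + vector_of_tile * vector_load_length + e; // later vectors are never loaded
      }
    }
  }
  return -1;
}
```

Whether this pays off on a GPU depends on how well the compiler can still overlap the loads, so it would need to be benchmarked against the load-everything-first version.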

@gonidelis gonidelis force-pushed the cub_find_if branch 5 times, most recently from 7ff1a1a to 0364cf3 Compare October 31, 2024 07:58
@gonidelis
Member Author

Many thanks to @elstehle for helping figure out this index!!!!

OffsetT nth_vector_of_thread = i / VECTOR_LOAD_LENGTH;
OffsetT element_in_word = i % VECTOR_LOAD_LENGTH;
OffsetT vector_of_tile = nth_vector_of_thread * BLOCK_THREADS + threadIdx.x;
OffsetT index = tile_offset + vector_of_tile * VECTOR_LOAD_LENGTH + element_in_word;
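
One way to sanity-check this index is to enumerate it on the host and confirm that every element of a tile is produced exactly once. The C++ sketch below does that with made-up sizes (4 threads, 2 vectors of 4 scalars each per thread); `element_index` and `covers_tile_exactly_once` are hypothetical helper names.

```cpp
#include <vector>

constexpr int block_threads      = 4;
constexpr int vector_load_length = 4;
constexpr int vectors_per_thread = 2;
constexpr int items_per_thread   = vector_load_length * vectors_per_thread;
constexpr int tile_size          = block_threads * items_per_thread;

// Global index of the i-th scalar held by `thread_id`, following the formula above.
inline int element_index(int tile_offset, int thread_id, int i)
{
  const int nth_vector_of_thread = i / vector_load_length;
  const int element_in_word      = i % vector_load_length;
  const int vector_of_tile       = nth_vector_of_thread * block_threads + thread_id;
  return tile_offset + vector_of_tile * vector_load_length + element_in_word;
}

// Returns true if the threads of one block together produce every index of the tile exactly once.
inline bool covers_tile_exactly_once(int tile_offset)
{
  std::vector<int> hits(tile_size, 0);
  for (int t = 0; t < block_threads; ++t)
  {
    for (int i = 0; i < items_per_thread; ++i)
    {
      const int local = element_index(tile_offset, t, i) - tile_offset;
      if (local < 0 || local >= tile_size)
      {
        return false;
      }
      ++hits[local];
    }
  }
  for (int h : hits)
  {
    if (h != 1)
    {
      return false;
    }
  }
  return true;
}
```

With these sizes, thread 1 starts at element 4 and the second vector of thread 0 starts at element 16, which matches the block-strided pattern of the vectorized loads.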

cub/cub/agent/agent_find.cuh Outdated Show resolved Hide resolved
cub/cub/agent/agent_find.cuh Outdated Show resolved Hide resolved
cub/test/catch2_test_device_find_if.cu Outdated Show resolved Hide resolved
thrust/benchmarks/bench/count_if/basic.cu Outdated Show resolved Hide resolved
thrust/benchmarks/bench/find_if/basic.cu Outdated Show resolved Hide resolved
@gonidelis
Member Author

gonidelis commented Nov 2, 2024

Update: After refactoring the code by introducing Dispatch and Agent layers the benchmark results look the same on my A6000 local machine.

docs to be added over the weekend

@gonidelis gonidelis marked this pull request as ready for review November 2, 2024 00:30
@gonidelis gonidelis requested review from a team as code owners November 2, 2024 00:30
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

Build and benchmark a prototype CUB cub::DeviceFind::FindIf kernel
4 participants