Implement cub::DeviceFind::FindIf #2405
base: main

Conversation
I would love to see a benchmark comparison of `cub::DeviceFind::FindIf` …
Do I understand correctly, that `cub::DeviceFind::FindIf` …
The name … For your benchmark you could name this value …
@bernhardmgruber
🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079

| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |
| CUDA C Core Library | |

Modifications in project or dependencies?

| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| pycuda | +/- |
| CUDA C Core Library | +/- |

🏃 Runner counts (total jobs: 259)

| # | Runner |
|---|---|
| 186 | linux-amd64-cpu16 |
| 42 | linux-amd64-gpu-v100-latest-1 |
| 16 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Force-pushed ce3b44c to 6f9fba8
🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
Force-pushed f2dc70c to 612a9c2
🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
Performance Results of thrust::count_if vs cub::DeviceFind::FindIf

Some long-awaited performance results on A6000 and H200 (extending @gevtushenko's work in #1870). (Benchmark plots: Search Operation, Equal Operation.)
General comments:
- You can make some more variables `const`.
- Your hand-written vectorized loading looks a lot like the `BLOCK_LOAD_VECTORIZE` algorithm of `cub::BlockLoad`. Why can't you use that one?
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
using VectorT = typename CubVector<InputT, _VECTOR_LOAD_LENGTH>::Type;
///
bool full_tile = (tile_offset + tile_size) <= num_items;
bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
```
I find a modulo operation easier to read, and it should compile to the same bitwise operation. Also, reinterpreting a pointer as an integer should use `uintptr_t`. However, that may be a different type than `size_t`, so:

```cpp
// before:
bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
// suggested:
const bool is_aligned = reinterpret_cast<::cuda::std::uintptr_t>(d_in) % uintptr_t{sizeof(VectorT)} == 0;
```
You may also need to handle the case when `Iterator` is not a pointer, but I guess that is planned for later, when the runtime numbers look good :) Given you have a second overload of `IsAlignedAndFullTile` below, you could just change `Iterator` to `T*` in the function signature, so any non-pointer iterator would just not match your overload here.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
InputT* d_in_unqualified = const_cast<InputT*>(begin) + tile_offset + (threadIdx.x * _VECTOR_LOAD_LENGTH);

cub::CacheModifiedInputIterator<cub::CacheLoadModifier::LOAD_LDG, VectorT> d_vec_in(
  reinterpret_cast<VectorT*>(d_in_unqualified));
```
Stripping `const` from the input pointer should not be necessary, since we only want to read from it. Is this a limitation of `cub::CacheModifiedInputIterator`?
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
// use d_temp_storage as the intermediate device result
// to read and write from. Then store the final result in the output iterator.
cuda_mem_set_async_dtemp_storage<<<1, 1>>>(int_temp_storage, num_items);

find_if<<<findif_grid_size, block_threads, 0, stream>>>(d_in, d_in + num_items, op, int_temp_storage, num_items);

write_final_result_in_output_iterator_already<int><<<1, 1>>>(int_temp_storage, d_out);
```
Idea: if `OutputIteratorT` is a pointer to some integer type (which I imagine is common), you could just use that memory location to fire the atomic minimum operations at. Then you would not need a second kernel and also no temporary storage.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
if (threadIdx.x == 0)
{
  sresult = atomicAdd(result, 0);
}
__syncthreads();
```
I think you can use an ordinary load: `sresult = *result;`. In the first loop iteration, only thread 0 reads, so there is no concurrency. In all subsequent iterations, `ConsumeRange` contains a `synchronize_or()` after writes to `result`, so no thread can be writing when thread 0 reads here again.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
InputT input_items[elements_per_thread];
VectorT* vec_items = reinterpret_cast<VectorT*>(input_items);
```
Strictly speaking, `input_items` has a smaller alignment guarantee than the one you assume when you access the data through `vec_items`, so you could run into misaligned stores.
I suggest declaring an array of vectors instead, and reinterpreting it as an array of scalars.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
int index = i % WORDS + (i / WORDS) * block_threads * WORDS + threadIdx.x * WORDS + tile_offset;
// i % WORDS                           = + 0 1 2 3, 0 1 2 3, 0 1 2 3, ... (static)
// (i / WORDS) * block_threads * WORDS = + 0, 64, 128, ...               (static)
// threadIdx.x * WORDS                 = + 0, 4, 8, ... offset of the thread within working tile
// tile_offset                         = + just start at the beginning of the block
```
The vectorization only concerns the loads, so the processing loop should no longer need to consider `WORDS`. It should just be `auto index = tile_offset + threadIdx.x + i * block_threads;`, like in the non-vectorized version. But please verify. Maybe I am missing something.
So, my suggestion above was wrong. I was somehow assuming both versions used the same access pattern, loading blocks of `elements_per_thread` elements. However, each version loads block-strided. That changes how the index is computed.
cub/cub/device/device_find_if.cuh
Outdated
#pragma unroll | ||
for (int i = 0; i < WORDS; ++i) | ||
{ | ||
vec_items[i] = d_vec_in[block_threads * i]; | ||
} | ||
//// vectorized loads end | ||
|
||
bool found = false; | ||
for (int i = 0; i < elements_per_thread; ++i) | ||
{ |
I wonder whether it would make sense to merge the two loops in order to load one vector, then evaluate the predicate on all scalars of that vector, and then continue to load the next vector. This way, you could avoid loading the rest of the tile in which the predicate turns true, and you can overlap data loading with compute more tightly. But if you want to try this, maybe try it last and stabilize your existing approach first!
Force-pushed 7ff1a1a to 0364cf3

Many thanks to @elstehle for helping figure out this index! (cccl/cub/cub/agent/agent_find.cuh, lines 213 to 217 in 0364cf3)
…ording to tile size
Force-pushed 0364cf3 to 6bb6985
Update: After refactoring the code by introducing Dispatch and Agent layers, the benchmark results look the same on my local A6000 machine. Docs to be added over the weekend.
Force-pushed 6bb6985 to eaa5b75
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
This is a draft to track the work progress on `cub::DeviceFind::FindIf`, which should ultimately be used to improve `thrust::all_of`.

Algorithm Description

The kernel is based on the concept of early cancellation through a global atomic `result` that works as a flag: when its default value (`input.size()`) is overwritten, it is atomically min-compared and then broadcast to the other CTAs, which then avoid proceeding to any further iterations. For that to happen, every block collects a local minimum index, `block_result` (when the predicate is satisfied), and at the end of the loop body the first thread of every block atomically min-compares and writes it to the global minimum index variable.

It is an algorithm that heavily favors cases where the "to-be-found" elements are ubiquitous.

Notes:
- At the beginning of every iteration, each block reads the global atomic into `sresult` and then checks whether it should exit.
- `block_result` should not be confused with `sresult`. Although both are shared memory variables that hold a local minimum, the former is used to write to the global atomic, while the latter is used to read from it.