Add prefetching kernel as new fallback for cub::DeviceTransform #2396

Merged 6 commits on Oct 30, 2024

@bernhardmgruber bernhardmgruber commented Sep 10, 2024

The new prefetching kernel tunes its number of elements per thread at runtime to reach an architecture-specific number of bytes in flight. It is based on work by @ahendriksen and adapted in the following ways:

  • Prefetch cachelines instead of elements
  • Avoid maximizing bytes in flight for small problem sizes. Instead, try to spread the workload evenly across all available SMs. This is needed to beat cub::DeviceFor for problem sizes around 2^16 elements.
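
As a rough illustration of the runtime tuning described above, the host-side sketch below picks an elements-per-thread count that reaches a target number of bytes in flight per SM, and lowers it for small inputs so that every SM still receives work. It is not the code from this PR: `DeviceInfo`, `pick_items_per_thread`, the parameter names, and the clamping range are all hypothetical, and the actual kernel may compute these quantities differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical device description; these fields are illustrative, not CUB types.
struct DeviceInfo {
  int sm_count;            // number of streaming multiprocessors
  int max_threads_per_sm;  // resident threads per SM, assuming full occupancy
};

constexpr std::int64_t ceil_div(std::int64_t a, std::int64_t b) {
  return (a + b - 1) / b;
}

// Choose how many elements each thread should process (assumed heuristic).
inline int pick_items_per_thread(std::int64_t num_items,
                                 int loaded_bytes_per_item,
                                 int block_threads,
                                 int min_bytes_in_flight,
                                 DeviceInfo dev,
                                 int min_items = 1,
                                 int max_items = 32) {
  assert(num_items > 0 && loaded_bytes_per_item > 0 && block_threads > 0);

  // Elements per thread needed so that the resident threads of one SM
  // together load at least min_bytes_in_flight bytes.
  const std::int64_t for_bandwidth = ceil_div(
      min_bytes_in_flight,
      static_cast<std::int64_t>(dev.max_threads_per_sm) * loaded_bytes_per_item);

  // Elements per thread at which one block per SM already covers the input.
  // Going above this would leave some SMs without any block.
  const std::int64_t to_fill_sms = ceil_div(
      num_items, static_cast<std::int64_t>(dev.sm_count) * block_threads);

  const std::int64_t chosen = std::min(for_bandwidth, to_fill_sms);
  return static_cast<int>(std::max<std::int64_t>(
      min_items, std::min<std::int64_t>(chosen, max_items)));
}
```

With assumed values of 114 SMs, 2048 resident threads per SM, 256 threads per block, and a 32 KiB bytes-in-flight target, `pick_items_per_thread(1 << 16, 1, 256, 32 * 1024, {114, 2048})` is limited by the SM-filling term, while the same call with 2^28 one-byte elements is limited by the bytes-in-flight term.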

The new prefetching kernel replaces the current fallback to cub::DeviceFor in the tuning policies. The fallback implementation is not yet removed; that will be done in a separate PR.

Babelstream on H100 fallback_for vs prefetch
['/home/bgruber/babelstream_fallbackfor_H100/', '/home/bgruber/babelstream_prefetch_H100/']
# mul

## [0] NVIDIA H100 PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   5.019 us |       1.89% |   4.552 us |       2.29% |   -0.467 us |  -9.30% |   FAIL   |
|   I8    |      I32      |      2^20      |   7.974 us |       2.06% |  11.798 us |       1.68% |    3.824 us |  47.96% |   FAIL   |
|   I8    |      I32      |      2^24      |  52.438 us |       0.23% |  31.339 us |       0.82% |  -21.099 us | -40.24% |   FAIL   |
|   I8    |      I32      |      2^28      | 750.740 us |       0.03% | 332.620 us |       0.27% | -418.120 us | -55.69% |   FAIL   |
|   I8    |      I64      |      2^16      |   5.520 us |       2.64% |   5.062 us |       2.69% |   -0.458 us |  -8.29% |   FAIL   |
|   I8    |      I64      |      2^20      |   8.223 us |       1.73% |  11.915 us |       1.48% |    3.693 us |  44.91% |   FAIL   |
|   I8    |      I64      |      2^24      |  52.640 us |       0.27% |  31.395 us |       0.84% |  -21.246 us | -40.36% |   FAIL   |
|   I8    |      I64      |      2^28      | 752.238 us |       0.03% | 333.346 us |       0.27% | -418.892 us | -55.69% |   FAIL   |
|   I16   |      I32      |      2^16      |   5.514 us |       3.02% |   5.150 us |       2.76% |   -0.363 us |  -6.59% |   FAIL   |
|   I16   |      I32      |      2^20      |   8.966 us |       1.66% |   9.820 us |       2.06% |    0.855 us |   9.53% |   FAIL   |
|   I16   |      I32      |      2^24      |  57.333 us |       0.19% |  43.414 us |       0.71% |  -13.919 us | -24.28% |   FAIL   |
|   I16   |      I32      |      2^28      | 812.365 us |       0.02% | 595.944 us |       0.44% | -216.420 us | -26.64% |   FAIL   |
|   I16   |      I64      |      2^16      |   5.580 us |       3.13% |   5.267 us |       3.08% |   -0.313 us |  -5.62% |   FAIL   |
|   I16   |      I64      |      2^20      |   8.757 us |       1.79% |   9.809 us |       1.96% |    1.052 us |  12.01% |   FAIL   |
|   I16   |      I64      |      2^24      |  57.092 us |       0.22% |  43.752 us |       0.75% |  -13.340 us | -23.37% |   FAIL   |
|   I16   |      I64      |      2^28      | 809.144 us |       0.02% | 595.682 us |       0.43% | -213.462 us | -26.38% |   FAIL   |
|   F32   |      I32      |      2^16      |   5.654 us |       2.76% |   5.217 us |       2.76% |   -0.437 us |  -7.73% |   FAIL   |
|   F32   |      I32      |      2^20      |  10.595 us |       1.50% |  10.767 us |       2.22% |    0.172 us |   1.62% |   FAIL   |
|   F32   |      I32      |      2^24      |  87.013 us |       0.24% |  78.501 us |       0.91% |   -8.511 us |  -9.78% |   FAIL   |
|   F32   |      I32      |      2^28      |   1.273 ms |       0.08% |   1.177 ms |       0.26% |  -96.044 us |  -7.54% |   FAIL   |
|   F32   |      I64      |      2^16      |   5.720 us |       3.05% |   5.340 us |       3.28% |   -0.380 us |  -6.64% |   FAIL   |
|   F32   |      I64      |      2^20      |  10.589 us |       1.45% |  10.783 us |       2.10% |    0.195 us |   1.84% |   FAIL   |
|   F32   |      I64      |      2^24      |  86.935 us |       0.29% |  78.594 us |       1.18% |   -8.341 us |  -9.59% |   FAIL   |
|   F32   |      I64      |      2^28      |   1.273 ms |       0.08% |   1.177 ms |       0.26% |  -95.188 us |  -7.48% |   FAIL   |
|   F64   |      I32      |      2^16      |   5.930 us |       3.84% |   5.610 us |       4.69% |   -0.320 us |  -5.40% |   FAIL   |
|   F64   |      I32      |      2^20      |  15.202 us |       1.43% |  14.480 us |       2.05% |   -0.722 us |  -4.75% |   FAIL   |
|   F64   |      I32      |      2^24      | 152.854 us |       0.39% | 151.988 us |       0.41% |   -0.866 us |  -0.57% |   FAIL   |
|   F64   |      I32      |      2^28      |   2.333 ms |       0.03% |   2.329 ms |       0.14% |   -4.029 us |  -0.17% |   FAIL   |
|   F64   |      I64      |      2^16      |   6.003 us |       4.20% |   5.696 us |       4.58% |   -0.307 us |  -5.12% |   FAIL   |
|   F64   |      I64      |      2^20      |  15.176 us |       1.47% |  14.532 us |       1.93% |   -0.644 us |  -4.24% |   FAIL   |
|   F64   |      I64      |      2^24      | 152.804 us |       0.42% | 152.024 us |       0.39% |   -0.780 us |  -0.51% |   FAIL   |
|   F64   |      I64      |      2^28      |   2.333 ms |       0.04% |   2.329 ms |       0.16% |   -4.142 us |  -0.18% |   FAIL   |
|  I128   |      I32      |      2^16      |   6.703 us |       5.52% |   6.458 us |       5.17% |   -0.245 us |  -3.65% |   PASS   |
|  I128   |      I32      |      2^20      |  24.490 us |       1.04% |  23.172 us |       1.55% |   -1.318 us |  -5.38% |   FAIL   |
|  I128   |      I32      |      2^24      | 298.453 us |       0.25% | 295.902 us |       0.28% |   -2.550 us |  -0.85% |   FAIL   |
|  I128   |      I32      |      2^28      |   4.649 ms |       0.04% |   4.678 ms |       0.13% |   29.054 us |   0.63% |   FAIL   |
|  I128   |      I64      |      2^16      |   6.397 us |       3.07% |   6.271 us |       3.67% |   -0.126 us |  -1.96% |   PASS   |
|  I128   |      I64      |      2^20      |  24.597 us |       1.10% |  23.624 us |       1.57% |   -0.973 us |  -3.96% |   FAIL   |
|  I128   |      I64      |      2^24      | 302.222 us |       1.40% | 299.443 us |       1.44% |   -2.778 us |  -0.92% |   PASS   |
|  I128   |      I64      |      2^28      |   4.649 ms |       0.04% |   4.677 ms |       0.13% |   28.678 us |   0.62% |   FAIL   |

# add

## [0] NVIDIA H100 PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   4.943 us |       2.33% |   4.708 us |       2.38% |   -0.235 us |  -4.75% |   FAIL   |
|   I8    |      I32      |      2^20      |   8.560 us |       1.71% |   9.627 us |       1.80% |    1.067 us |  12.47% |   FAIL   |
|   I8    |      I32      |      2^24      |  58.620 us |       0.20% |  40.265 us |       0.51% |  -18.355 us | -31.31% |   FAIL   |
|   I8    |      I32      |      2^28      | 817.313 us |       0.03% | 483.184 us |       0.23% | -334.128 us | -40.88% |   FAIL   |
|   I8    |      I64      |      2^16      |   5.499 us |       2.82% |   5.271 us |       3.03% |   -0.228 us |  -4.15% |   FAIL   |
|   I8    |      I64      |      2^20      |   8.689 us |       1.65% |   9.749 us |       1.77% |    1.061 us |  12.21% |   FAIL   |
|   I8    |      I64      |      2^24      |  58.960 us |       0.24% |  40.282 us |       0.57% |  -18.677 us | -31.68% |   FAIL   |
|   I8    |      I64      |      2^28      | 822.444 us |       0.03% | 483.517 us |       0.24% | -338.927 us | -41.21% |   FAIL   |
|   I16   |      I32      |      2^16      |   5.587 us |       2.45% |   5.357 us |       2.85% |   -0.231 us |  -4.13% |   FAIL   |
|   I16   |      I32      |      2^20      |  10.294 us |       1.49% |  10.354 us |       2.03% |    0.060 us |   0.58% |   PASS   |
|   I16   |      I32      |      2^24      |  76.462 us |       0.16% |  60.624 us |       0.44% |  -15.838 us | -20.71% |   FAIL   |
|   I16   |      I32      |      2^28      |   1.019 ms |       0.02% | 865.096 us |       0.21% | -154.161 us | -15.12% |   FAIL   |
|   I16   |      I64      |      2^16      |   5.637 us |       2.79% |   5.441 us |       3.01% |   -0.196 us |  -3.48% |   FAIL   |
|   I16   |      I64      |      2^20      |  10.149 us |       1.46% |  10.405 us |       2.02% |    0.255 us |   2.52% |   FAIL   |
|   I16   |      I64      |      2^24      |  75.928 us |       0.16% |  60.759 us |       0.44% |  -15.169 us | -19.98% |   FAIL   |
|   I16   |      I64      |      2^28      |   1.014 ms |       0.02% | 865.537 us |       0.21% | -148.121 us | -14.61% |   FAIL   |
|   F32   |      I32      |      2^16      |   5.955 us |       3.38% |   5.597 us |       3.61% |   -0.358 us |  -6.01% |   FAIL   |
|   F32   |      I32      |      2^20      |  12.977 us |       1.25% |  12.877 us |       1.57% |   -0.100 us |  -0.77% |   PASS   |
|   F32   |      I32      |      2^24      | 121.529 us |       0.71% | 115.634 us |       0.57% |   -5.895 us |  -4.85% |   FAIL   |
|   F32   |      I32      |      2^28      |   1.798 ms |       0.07% |   1.703 ms |       0.17% |  -94.747 us |  -5.27% |   FAIL   |
|   F32   |      I64      |      2^16      |   5.995 us |       3.72% |   5.755 us |       4.56% |   -0.240 us |  -4.00% |   FAIL   |
|   F32   |      I64      |      2^20      |  12.936 us |       1.37% |  12.953 us |       1.61% |    0.017 us |   0.13% |   PASS   |
|   F32   |      I64      |      2^24      | 121.399 us |       0.65% | 115.690 us |       0.55% |   -5.710 us |  -4.70% |   FAIL   |
|   F32   |      I64      |      2^28      |   1.795 ms |       0.07% |   1.703 ms |       0.16% |  -92.254 us |  -5.14% |   FAIL   |
|   F64   |      I32      |      2^16      |   6.527 us |       4.58% |   6.371 us |       4.74% |   -0.157 us |  -2.40% |   PASS   |
|   F64   |      I32      |      2^20      |  20.354 us |       1.08% |  19.271 us |       1.38% |   -1.083 us |  -5.32% |   FAIL   |
|   F64   |      I32      |      2^24      | 220.761 us |       0.25% | 220.170 us |       0.25% |   -0.592 us |  -0.27% |   FAIL   |
|   F64   |      I32      |      2^28      |   3.389 ms |       0.05% |   3.431 ms |       0.11% |   41.806 us |   1.23% |   FAIL   |
|   F64   |      I64      |      2^16      |   6.627 us |       5.00% |   6.496 us |       4.82% |   -0.131 us |  -1.98% |   PASS   |
|   F64   |      I64      |      2^20      |  20.324 us |       1.13% |  19.376 us |       1.40% |   -0.948 us |  -4.66% |   FAIL   |
|   F64   |      I64      |      2^24      | 220.783 us |       0.25% | 220.195 us |       0.26% |   -0.588 us |  -0.27% |   FAIL   |
|   F64   |      I64      |      2^28      |   3.389 ms |       0.05% |   3.431 ms |       0.10% |   42.063 us |   1.24% |   FAIL   |
|  I128   |      I32      |      2^16      |   8.093 us |       5.43% |   7.685 us |       5.15% |   -0.408 us |  -5.04% |   PASS   |
|  I128   |      I32      |      2^20      |  34.111 us |       0.81% |  33.224 us |       0.81% |   -0.888 us |  -2.60% |   FAIL   |
|  I128   |      I32      |      2^24      | 432.087 us |       0.15% | 430.026 us |       0.25% |   -2.061 us |  -0.48% |   FAIL   |
|  I128   |      I32      |      2^28      |   6.790 ms |       0.04% |   6.759 ms |       0.03% |  -30.595 us |  -0.45% |   FAIL   |
|  I128   |      I64      |      2^16      |   7.499 us |       2.82% |   7.181 us |       2.74% |   -0.318 us |  -4.24% |   FAIL   |
|  I128   |      I64      |      2^20      |  34.568 us |       0.76% |  33.791 us |       0.92% |   -0.776 us |  -2.25% |   FAIL   |
|  I128   |      I64      |      2^24      | 438.079 us |       1.41% | 435.787 us |       1.43% |   -2.292 us |  -0.52% |   PASS   |
|  I128   |      I64      |      2^28      |   6.790 ms |       0.04% |   6.759 ms |       0.03% |  -31.077 us |  -0.46% |   FAIL   |

# triad

## [0] NVIDIA H100 PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   5.649 us |       3.66% |   5.340 us |       2.98% |   -0.309 us |  -5.47% |   FAIL   |
|   I8    |      I32      |      2^20      |   8.746 us |       1.90% |   9.625 us |       2.46% |    0.879 us |  10.05% |   FAIL   |
|   I8    |      I32      |      2^24      |  59.716 us |       0.24% |  40.237 us |       0.65% |  -19.478 us | -32.62% |   FAIL   |
|   I8    |      I32      |      2^28      | 820.635 us |       0.43% | 485.822 us |       0.95% | -334.814 us | -40.80% |   FAIL   |
|   I8    |      I64      |      2^16      |   5.612 us |       2.78% |   5.301 us |       2.64% |   -0.311 us |  -5.55% |   FAIL   |
|   I8    |      I64      |      2^20      |   8.686 us |       1.67% |   9.709 us |       1.84% |    1.023 us |  11.78% |   FAIL   |
|   I8    |      I64      |      2^24      |  59.393 us |       0.24% |  39.772 us |       0.55% |  -19.621 us | -33.04% |   FAIL   |
|   I8    |      I64      |      2^28      | 819.940 us |       0.03% | 481.645 us |       0.23% | -338.295 us | -41.26% |   FAIL   |
|   I16   |      I32      |      2^16      |   5.682 us |       2.81% |   5.370 us |       2.79% |   -0.312 us |  -5.49% |   FAIL   |
|   I16   |      I32      |      2^20      |  10.257 us |       1.54% |  10.303 us |       2.05% |    0.046 us |   0.45% |   PASS   |
|   I16   |      I32      |      2^24      |  76.557 us |       0.16% |  60.598 us |       0.48% |  -15.959 us | -20.85% |   FAIL   |
|   I16   |      I32      |      2^28      |   1.019 ms |       0.03% | 860.964 us |       0.17% | -157.600 us | -15.47% |   FAIL   |
|   I16   |      I64      |      2^16      |   5.680 us |       3.52% |   5.457 us |       3.15% |   -0.223 us |  -3.92% |   FAIL   |
|   I16   |      I64      |      2^20      |  10.242 us |       1.53% |  10.225 us |       1.88% |   -0.017 us |  -0.16% |   PASS   |
|   I16   |      I64      |      2^24      |  76.018 us |       0.19% |  60.683 us |       0.48% |  -15.336 us | -20.17% |   FAIL   |
|   I16   |      I64      |      2^28      |   1.013 ms |       0.03% | 861.393 us |       0.19% | -151.457 us | -14.95% |   FAIL   |
|   F32   |      I32      |      2^16      |   5.927 us |       3.08% |   5.608 us |       3.59% |   -0.318 us |  -5.37% |   FAIL   |
|   F32   |      I32      |      2^20      |  12.888 us |       1.35% |  12.970 us |       1.49% |    0.082 us |   0.64% |   PASS   |
|   F32   |      I32      |      2^24      | 121.555 us |       0.72% | 115.033 us |       0.55% |   -6.522 us |  -5.37% |   FAIL   |
|   F32   |      I32      |      2^28      |   1.796 ms |       0.06% |   1.703 ms |       0.17% |  -93.157 us |  -5.19% |   FAIL   |
|   F32   |      I64      |      2^16      |   5.948 us |       3.91% |   5.655 us |       4.10% |   -0.294 us |  -4.94% |   FAIL   |
|   F32   |      I64      |      2^20      |  12.892 us |       1.36% |  13.011 us |       1.54% |    0.119 us |   0.92% |   PASS   |
|   F32   |      I64      |      2^24      | 121.347 us |       0.69% | 115.109 us |       0.62% |   -6.238 us |  -5.14% |   FAIL   |
|   F32   |      I64      |      2^28      |   1.794 ms |       0.07% |   1.703 ms |       0.17% |  -91.036 us |  -5.08% |   FAIL   |
|   F64   |      I32      |      2^16      |   6.473 us |       4.75% |   6.267 us |       3.90% |   -0.206 us |  -3.18% |   PASS   |
|   F64   |      I32      |      2^20      |  20.763 us |       1.04% |  19.802 us |       1.25% |   -0.962 us |  -4.63% |   FAIL   |
|   F64   |      I32      |      2^24      | 221.277 us |       0.26% | 220.973 us |       0.31% |   -0.304 us |  -0.14% |   PASS   |
|   F64   |      I32      |      2^28      |   3.390 ms |       0.05% |   3.431 ms |       0.12% |   41.010 us |   1.21% |   FAIL   |
|   F64   |      I64      |      2^16      |   6.695 us |       5.73% |   6.438 us |       4.61% |   -0.257 us |  -3.84% |   PASS   |
|   F64   |      I64      |      2^20      |  20.777 us |       1.17% |  19.908 us |       1.29% |   -0.869 us |  -4.18% |   FAIL   |
|   F64   |      I64      |      2^24      | 221.180 us |       0.24% | 221.003 us |       0.30% |   -0.177 us |  -0.08% |   PASS   |
|   F64   |      I64      |      2^28      |   3.390 ms |       0.05% |   3.431 ms |       0.11% |   40.861 us |   1.21% |   FAIL   |
|  I128   |      I32      |      2^16      |   7.754 us |       5.27% |   7.236 us |       4.72% |   -0.518 us |  -6.68% |   FAIL   |
|  I128   |      I32      |      2^20      |  34.314 us |       0.69% |  33.583 us |       0.76% |   -0.731 us |  -2.13% |   FAIL   |
|  I128   |      I32      |      2^24      | 432.004 us |       0.15% | 429.729 us |       0.26% |   -2.275 us |  -0.53% |   FAIL   |
|  I128   |      I32      |      2^28      |   6.790 ms |       0.04% |   6.759 ms |       0.03% |  -30.655 us |  -0.45% |   FAIL   |
|  I128   |      I64      |      2^16      |   7.362 us |       3.47% |   6.915 us |       3.12% |   -0.447 us |  -6.07% |   FAIL   |
|  I128   |      I64      |      2^20      |  34.626 us |       0.81% |  33.817 us |       0.94% |   -0.809 us |  -2.34% |   FAIL   |
|  I128   |      I64      |      2^24      | 437.931 us |       1.39% | 435.455 us |       1.43% |   -2.476 us |  -0.57% |   PASS   |
|  I128   |      I64      |      2^28      |   6.790 ms |       0.04% |   6.759 ms |       0.03% |  -30.958 us |  -0.46% |   FAIL   |

# nstream

## [0] NVIDIA H100 PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  OverwriteInput  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |        1         |   5.204 us |       2.40% |   4.827 us |       2.23% |   -0.377 us |  -7.24% |   FAIL   |
|   I8    |      I32      |      2^20      |        1         |   9.560 us |       1.78% |   9.901 us |       1.81% |    0.340 us |   3.56% |   FAIL   |
|   I8    |      I32      |      2^24      |        1         |  65.895 us |       0.22% |  51.806 us |       0.56% |  -14.089 us | -21.38% |   FAIL   |
|   I8    |      I32      |      2^28      |        1         | 891.910 us |       0.03% | 654.394 us |       0.29% | -237.516 us | -26.63% |   FAIL   |
|   I8    |      I64      |      2^16      |        1         |   5.740 us |       2.42% |   5.315 us |       2.58% |   -0.424 us |  -7.39% |   FAIL   |
|   I8    |      I64      |      2^20      |        1         |   9.679 us |       1.79% |   9.979 us |       1.77% |    0.300 us |   3.10% |   FAIL   |
|   I8    |      I64      |      2^24      |        1         |  66.383 us |       0.25% |  51.817 us |       0.50% |  -14.567 us | -21.94% |   FAIL   |
|   I8    |      I64      |      2^28      |        1         | 899.463 us |       0.03% | 654.562 us |       0.30% | -244.901 us | -27.23% |   FAIL   |
|   I16   |      I32      |      2^16      |        1         |   5.806 us |       2.54% |   5.468 us |       2.53% |   -0.339 us |  -5.83% |   FAIL   |
|   I16   |      I32      |      2^20      |        1         |  11.720 us |       1.42% |  11.701 us |       2.15% |   -0.019 us |  -0.16% |   PASS   |
|   I16   |      I32      |      2^24      |        1         |  94.163 us |       0.14% |  79.849 us |       1.14% |  -14.314 us | -15.20% |   FAIL   |
|   I16   |      I32      |      2^28      |        1         |   1.313 ms |       0.11% |   1.216 ms |       0.18% |  -97.411 us |  -7.42% |   FAIL   |
|   I16   |      I64      |      2^16      |        1         |   5.875 us |       3.04% |   5.566 us |       3.22% |   -0.309 us |  -5.25% |   FAIL   |
|   I16   |      I64      |      2^20      |        1         |  11.729 us |       1.50% |  11.755 us |       2.02% |    0.026 us |   0.22% |   PASS   |
|   I16   |      I64      |      2^24      |        1         |  93.993 us |       0.15% |  80.006 us |       1.19% |  -13.987 us | -14.88% |   FAIL   |
|   I16   |      I64      |      2^28      |        1         |   1.311 ms |       0.11% |   1.216 ms |       0.18% |  -95.007 us |  -7.24% |   FAIL   |
|   F32   |      I32      |      2^16      |        1         |   6.131 us |       3.50% |   5.923 us |       3.30% |   -0.208 us |  -3.40% |   FAIL   |
|   F32   |      I32      |      2^20      |        1         |  15.139 us |       1.11% |  14.953 us |       1.30% |   -0.187 us |  -1.23% |   FAIL   |
|   F32   |      I32      |      2^24      |        1         | 153.994 us |       0.56% | 151.961 us |       0.39% |   -2.033 us |  -1.32% |   FAIL   |
|   F32   |      I32      |      2^28      |        1         |   2.342 ms |       0.06% |   2.288 ms |       0.04% |  -53.510 us |  -2.28% |   FAIL   |
|   F32   |      I64      |      2^16      |        1         |   6.254 us |       4.31% |   6.037 us |       3.75% |   -0.217 us |  -3.46% |   PASS   |
|   F32   |      I64      |      2^20      |        1         |  15.222 us |       1.13% |  15.006 us |       1.14% |   -0.216 us |  -1.42% |   FAIL   |
|   F32   |      I64      |      2^24      |        1         | 154.041 us |       0.53% | 152.052 us |       0.41% |   -1.988 us |  -1.29% |   FAIL   |
|   F32   |      I64      |      2^28      |        1         |   2.341 ms |       0.06% |   2.289 ms |       0.04% |  -51.752 us |  -2.21% |   FAIL   |
|   F64   |      I32      |      2^16      |        1         |   6.973 us |       5.01% |   6.500 us |       4.36% |   -0.473 us |  -6.79% |   FAIL   |
|   F64   |      I32      |      2^20      |        1         |  24.873 us |       0.86% |  24.380 us |       0.83% |   -0.493 us |  -1.98% |   FAIL   |
|   F64   |      I32      |      2^24      |        1         | 292.095 us |       0.17% | 292.004 us |       0.14% |   -0.091 us |  -0.03% |   PASS   |
|   F64   |      I32      |      2^28      |        1         |   4.491 ms |       0.02% |   4.498 ms |       0.02% |    6.589 us |   0.15% |   FAIL   |
|   F64   |      I64      |      2^16      |        1         |   7.093 us |       5.55% |   6.669 us |       4.86% |   -0.423 us |  -5.97% |   FAIL   |
|   F64   |      I64      |      2^20      |        1         |  24.897 us |       0.85% |  24.479 us |       0.81% |   -0.418 us |  -1.68% |   FAIL   |
|   F64   |      I64      |      2^24      |        1         | 292.136 us |       0.18% | 292.146 us |       0.15% |    0.010 us |   0.00% |   PASS   |
|   F64   |      I64      |      2^28      |        1         |   4.491 ms |       0.02% |   4.498 ms |       0.02% |    7.276 us |   0.16% |   FAIL   |
|  I128   |      I32      |      2^16      |        1         |   8.751 us |       5.69% |   8.296 us |       5.34% |   -0.455 us |  -5.20% |   PASS   |
|  I128   |      I32      |      2^20      |        1         |  42.849 us |       0.59% |  42.029 us |       0.56% |   -0.820 us |  -1.91% |   FAIL   |
|  I128   |      I32      |      2^24      |        1         | 570.422 us |       0.10% | 569.592 us |       0.07% |   -0.830 us |  -0.15% |   FAIL   |
|  I128   |      I32      |      2^28      |        1         |   8.945 ms |       0.01% |   8.937 ms |       0.01% |   -7.566 us |  -0.08% |   FAIL   |
|  I128   |      I64      |      2^16      |        1         |   8.189 us |       2.26% |   7.682 us |       2.36% |   -0.507 us |  -6.19% |   FAIL   |
|  I128   |      I64      |      2^20      |        1         |  42.851 us |       0.57% |  41.972 us |       0.56% |   -0.879 us |  -2.05% |   FAIL   |
|  I128   |      I64      |      2^24      |        1         | 570.468 us |       0.09% | 569.285 us |       0.07% |   -1.183 us |  -0.21% |   FAIL   |
|  I128   |      I64      |      2^28      |        1         |   8.945 ms |       0.01% |   8.930 ms |       0.01% |  -15.065 us |  -0.17% |   FAIL   |

Performance requirements:

  • no regressions of more than 15% compared to cub::DeviceFor
  • no regressions of more than 2% compared to cub::DeviceFor on 2^24+ problem sizes
  • peeling version outperforms cub::DeviceFor on 2^24+ problem sizes with average improvement of more than 5%
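
The three requirements above can be written as a small check over (problem size, %Diff) pairs. This is only a sketch under stated assumptions: `Row`, `Verdict`, and `check_requirements` are hypothetical names, a negative %Diff is taken to mean the new kernel is faster (as in the tables above), and the third requirement is read as a plain arithmetic mean.

```cpp
#include <vector>

// One benchmark row: log2 of the element count and the %Diff column
// (negative means the compared kernel is faster than the reference).
struct Row {
  int log2_elements;
  double percent_diff;
};

// Outcome of the three requirements (hypothetical helper type).
struct Verdict {
  bool no_regression_above_15_percent;      // requirement 1, all sizes
  bool no_regression_above_2_percent_large; // requirement 2, 2^24+ only
  bool mean_gain_above_5_percent_large;     // requirement 3, 2^24+ only
};

inline Verdict check_requirements(const std::vector<Row>& rows) {
  Verdict v{true, true, false};
  double large_sum = 0.0;
  int large_count = 0;
  for (const Row& r : rows) {
    if (r.percent_diff > 15.0) {
      v.no_regression_above_15_percent = false;
    }
    if (r.log2_elements >= 24) {
      if (r.percent_diff > 2.0) {
        v.no_regression_above_2_percent_large = false;
      }
      large_sum += r.percent_diff;
      ++large_count;
    }
  }
  // An improvement of more than 5% corresponds to a mean %Diff below -5.
  if (large_count > 0) {
    v.mean_gain_above_5_percent_large = (large_sum / large_count) < -5.0;
  }
  return v;
}
```

Fed with the four mul rows for I8/I32 from the table above (-9.30%, 47.96%, -40.24%, -55.69%), the check would flag the first requirement because of the 2^20 row and accept the other two.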

Average perf diff across all Babelstream kernels on 2^24+ problem sizes on H100

| kernel  | diff    |
|---------|---------|
| mul     | -16.45% |
| add     | -11.77% |
| triad   | -11.99% |
| nstream | -7.47%  |
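
The mul figure can be reproduced from the table above, assuming the average is a plain arithmetic mean over the %Diff column of all 2^24 and 2^28 rows. The helper names below are hypothetical; the values are copied from the mul table in row order.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// %Diff values of the 2^24 and 2^28 rows of the "mul" table, in table order.
inline std::vector<double> mul_large_size_diffs() {
  return {-40.24, -55.69, -40.36, -55.69, -24.28, -26.64, -23.37,
          -26.38, -9.78,  -7.54,  -9.59,  -7.48,  -0.57,  -0.17,
          -0.51,  -0.18,  -0.85,  0.63,   -0.92,  0.62};
}

// Arithmetic mean of a non-empty list of percentage differences.
inline double mean_percent_diff(const std::vector<double>& diffs) {
  assert(!diffs.empty());
  return std::accumulate(diffs.begin(), diffs.end(), 0.0) /
         static_cast<double>(diffs.size());
}
```

`mean_percent_diff(mul_large_size_diffs())` evaluates to about -16.45, which agrees with the mul entry.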

Fixes: #2363

@bernhardmgruber bernhardmgruber added the cub For all items related to CUB label Sep 10, 2024
@bernhardmgruber bernhardmgruber force-pushed the transform_prefetch branch 2 times, most recently from 0cfe4ba to 074ad49 Compare September 10, 2024 23:23
@bernhardmgruber bernhardmgruber marked this pull request as ready for review September 10, 2024 23:26

🟩 CI finished in 9h 50m: Pass: 100%/259 | Total: 1d 10h | Avg: 8m 02s | Max: 51m 35s | Hits: 99%/24441
  • 🟩 cub: Pass: 100%/136 | Total: 21h 20m | Avg: 9m 24s | Max: 51m 35s | Hits: 97%/4362

    🟩 cpu
      🟩 amd64              Pass: 100%/128 | Total: 20h 40m | Avg:  9m 41s | Max: 51m 35s | Hits:  97%/4362  
      🟩 arm64              Pass: 100%/8   | Total: 40m 02s | Avg:  5m 00s | Max:  6m 07s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 11m | Avg:  4m 45s | Max: 13m 59s | Hits:  97%/727   
      🟩 11.8               Pass: 100%/3   | Total: 15m 32s | Avg:  5m 10s | Max:  5m 36s
      🟩 12.6               Pass: 100%/118 | Total: 19h 53m | Avg: 10m 06s | Max: 51m 35s | Hits:  97%/3635  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  8m 21s | Avg:  4m 10s | Max:  4m 11s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 11m | Avg:  4m 45s | Max: 13m 59s | Hits:  97%/727   
      🟩 nvcc11.8           Pass: 100%/3   | Total: 15m 32s | Avg:  5m 10s | Max:  5m 36s
      🟩 nvcc12.6           Pass: 100%/116 | Total: 19h 45m | Avg: 10m 13s | Max: 51m 35s | Hits:  97%/3635  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  8m 21s | Avg:  4m 10s | Max:  4m 11s
      🟩 nvcc               Pass: 100%/134 | Total: 21h 12m | Avg:  9m 29s | Max: 51m 35s | Hits:  97%/4362  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 29m 45s | Avg:  4m 57s | Max:  5m 44s
      🟩 Clang10            Pass: 100%/3   | Total: 17m 34s | Avg:  5m 51s | Max:  6m 10s
      🟩 Clang11            Pass: 100%/4   | Total: 21m 31s | Avg:  5m 22s | Max:  5m 31s
      🟩 Clang12            Pass: 100%/4   | Total: 21m 05s | Avg:  5m 16s | Max:  5m 44s
      🟩 Clang13            Pass: 100%/4   | Total: 20m 49s | Avg:  5m 12s | Max:  5m 19s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 19s | Avg:  5m 19s | Max:  5m 48s
      🟩 Clang15            Pass: 100%/4   | Total: 22m 21s | Avg:  5m 35s | Max:  6m 05s
      🟩 Clang16            Pass: 100%/4   | Total: 22m 01s | Avg:  5m 30s | Max:  6m 05s
      🟩 Clang17            Pass: 100%/4   | Total: 21m 26s | Avg:  5m 21s | Max:  5m 42s
      🟩 Clang18            Pass: 100%/26  | Total:  5h 58m | Avg: 13m 46s | Max: 24m 13s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 41s | Avg:  3m 50s | Max:  3m 52s
      🟩 GCC7               Pass: 100%/6   | Total: 27m 18s | Avg:  4m 33s | Max:  5m 43s
      🟩 GCC8               Pass: 100%/6   | Total: 28m 31s | Avg:  4m 45s | Max:  5m 30s
      🟩 GCC9               Pass: 100%/6   | Total: 28m 03s | Avg:  4m 40s | Max:  5m 22s
      🟩 GCC10              Pass: 100%/4   | Total: 21m 44s | Avg:  5m 26s | Max:  5m 38s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 23m | Avg: 11m 59s | Max: 51m 35s
      🟩 GCC12              Pass: 100%/4   | Total: 22m 04s | Avg:  5m 31s | Max:  5m 47s
      🟩 GCC13              Pass: 100%/29  | Total:  6h 55m | Avg: 14m 19s | Max: 40m 29s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 18m 03s | Avg:  6m 01s | Max:  6m 14s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 13m 59s | Avg: 13m 59s | Max: 13m 59s | Hits:  97%/727   
      🟩 MSVC14.29          Pass: 100%/2   | Total: 21m 23s | Avg: 10m 41s | Max: 10m 43s | Hits:  97%/1454  
      🟩 MSVC14.39          Pass: 100%/3   | Total: 36m 10s | Avg: 12m 03s | Max: 12m 27s | Hits:  97%/2181  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/63  | Total:  9h 15m | Avg:  8m 49s | Max: 24m 13s
      🟩 GCC                Pass: 100%/64  | Total: 10h 34m | Avg:  9m 55s | Max: 51m 35s
      🟩 Intel              Pass: 100%/3   | Total: 18m 03s | Avg:  6m 01s | Max:  6m 14s
      🟩 MSVC               Pass: 100%/6   | Total:  1h 11m | Avg: 11m 55s | Max: 13m 59s | Hits:  97%/4362  
    🟩 gpu
      🟩 v100               Pass: 100%/136 | Total: 21h 20m | Avg:  9m 24s | Max: 51m 35s | Hits:  97%/4362  
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total: 10h 14m | Avg:  5m 57s | Max: 51m 35s | Hits:  97%/4362  
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 38m | Avg: 19m 52s | Max: 33m 48s
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 12m | Avg: 16m 33s | Max: 18m 29s
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 18m | Avg: 17m 21s | Max: 20m 18s
      🟩 SmallGMem          Pass: 100%/1   | Total: 40m 29s | Avg: 40m 29s | Max: 40m 29s
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 15m | Avg: 24m 26s | Max: 30m 22s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 15m 32s | Avg:  5m 10s | Max:  5m 36s
      🟩 90a                Pass: 100%/4   | Total: 15m 47s | Avg:  3m 56s | Max:  4m 05s
    🟩 std
      🟩 11                 Pass: 100%/35  | Total:  5h 25m | Avg:  9m 17s | Max: 51m 35s
      🟩 14                 Pass: 100%/38  | Total:  5h 21m | Avg:  8m 27s | Max: 24m 13s | Hits:  97%/2181  
      🟩 17                 Pass: 100%/38  | Total:  6h 08m | Avg:  9m 41s | Max: 40m 29s | Hits:  97%/1454  
      🟩 20                 Pass: 100%/25  | Total:  4h 25m | Avg: 10m 36s | Max: 33m 48s | Hits:  97%/727   
    
  • 🟩 thrust: Pass: 100%/122 | Total: 13h 07m | Avg: 6m 27s | Max: 26m 18s | Hits: 99%/20079

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total: 12h 28m | Avg:  6m 33s | Max: 26m 18s | Hits:  99%/20079 
      🟩 arm64              Pass: 100%/8   | Total: 38m 45s | Avg:  4m 50s | Max:  8m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 35m | Avg:  6m 20s | Max: 26m 18s | Hits:  99%/2231  
      🟩 11.8               Pass: 100%/3   | Total: 13m 22s | Avg:  4m 27s | Max:  4m 42s
      🟩 12.6               Pass: 100%/104 | Total: 11h 18m | Avg:  6m 31s | Max: 20m 48s | Hits:  99%/17848 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  4m 58s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 35m | Avg:  6m 20s | Max: 26m 18s | Hits:  99%/2231  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 13m 22s | Avg:  4m 27s | Max:  4m 42s
      🟩 nvcc12.6           Pass: 100%/102 | Total: 11h 09m | Avg:  6m 33s | Max: 20m 48s | Hits:  99%/17848 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  4m 58s
      🟩 nvcc               Pass: 100%/120 | Total: 12h 57m | Avg:  6m 28s | Max: 26m 18s | Hits:  99%/20079 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 29m 04s | Avg:  4m 50s | Max:  6m 01s
      🟩 Clang10            Pass: 100%/3   | Total: 18m 02s | Avg:  6m 00s | Max:  6m 22s
      🟩 Clang11            Pass: 100%/4   | Total: 17m 50s | Avg:  4m 27s | Max:  4m 55s
      🟩 Clang12            Pass: 100%/4   | Total: 18m 24s | Avg:  4m 36s | Max:  5m 15s
      🟩 Clang13            Pass: 100%/4   | Total: 18m 49s | Avg:  4m 42s | Max:  4m 55s
      🟩 Clang14            Pass: 100%/4   | Total: 19m 29s | Avg:  4m 52s | Max:  5m 31s
      🟩 Clang15            Pass: 100%/4   | Total: 19m 32s | Avg:  4m 53s | Max:  5m 15s
      🟩 Clang16            Pass: 100%/4   | Total: 18m 40s | Avg:  4m 40s | Max:  4m 51s
      🟩 Clang17            Pass: 100%/4   | Total: 18m 28s | Avg:  4m 37s | Max:  5m 04s
      🟩 Clang18            Pass: 100%/18  | Total:  2h 02m | Avg:  6m 49s | Max: 13m 41s
      🟩 GCC6               Pass: 100%/2   | Total:  7m 49s | Avg:  3m 54s | Max:  4m 29s
      🟩 GCC7               Pass: 100%/6   | Total: 47m 27s | Avg:  7m 54s | Max: 26m 18s
      🟩 GCC8               Pass: 100%/6   | Total: 24m 52s | Avg:  4m 08s | Max:  4m 49s
      🟩 GCC9               Pass: 100%/6   | Total: 26m 19s | Avg:  4m 23s | Max:  5m 34s
      🟩 GCC10              Pass: 100%/4   | Total: 19m 12s | Avg:  4m 48s | Max:  5m 08s
      🟩 GCC11              Pass: 100%/7   | Total: 33m 30s | Avg:  4m 47s | Max:  5m 20s
      🟩 GCC12              Pass: 100%/4   | Total: 19m 28s | Avg:  4m 52s | Max:  5m 33s
      🟩 GCC13              Pass: 100%/20  | Total:  2h 18m | Avg:  6m 56s | Max: 14m 15s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 17m 41s | Avg:  5m 53s | Max:  6m 15s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 16m 16s | Avg: 16m 16s | Max: 16m 16s | Hits:  99%/2231  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 27m 29s | Avg: 13m 44s | Max: 14m 24s | Hits:  99%/4462  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  1h 47m | Avg: 17m 51s | Max: 20m 48s | Hits:  99%/13386 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  5h 01m | Avg:  5m 28s | Max: 13m 41s
      🟩 GCC                Pass: 100%/55  | Total:  5h 17m | Avg:  5m 46s | Max: 26m 18s
      🟩 Intel              Pass: 100%/3   | Total: 17m 41s | Avg:  5m 53s | Max:  6m 15s
      🟩 MSVC               Pass: 100%/9   | Total:  2h 30m | Avg: 16m 45s | Max: 20m 48s | Hits:  99%/20079 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total: 13h 07m | Avg:  6m 27s | Max: 26m 18s | Hits:  99%/20079 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  9h 27m | Avg:  5m 30s | Max: 26m 18s | Hits:  99%/13386 
      🟩 TestCPU            Pass: 100%/11  | Total:  1h 56m | Avg: 10m 37s | Max: 20m 48s | Hits:  99%/6693  
      🟩 TestGPU            Pass: 100%/8   | Total:  1h 42m | Avg: 12m 52s | Max: 14m 15s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 13m 22s | Avg:  4m 27s | Max:  4m 42s
      🟩 90a                Pass: 100%/4   | Total: 15m 55s | Avg:  3m 58s | Max:  4m 17s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total:  2h 58m | Avg:  5m 44s | Max: 26m 18s
      🟩 14                 Pass: 100%/35  | Total:  3h 49m | Avg:  6m 33s | Max: 19m 51s | Hits:  99%/8924  
      🟩 17                 Pass: 100%/34  | Total:  3h 44m | Avg:  6m 36s | Max: 20m 48s | Hits:  99%/6693  
      🟩 20                 Pass: 100%/22  | Total:  2h 34m | Avg:  7m 01s | Max: 20m 18s | Hits:  99%/4462  
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 58s | Avg: 14m 58s | Max: 14m 58s
    

👃 Inspect Changes

Modifications in project?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
|     | Thrust              |
|     | CUDA Experimental   |
|     | pycuda              |
|     | CUDA C Core Library |

Modifications in project or dependencies?

|     | Project             |
|-----|---------------------|
|     | CCCL Infrastructure |
|     | libcu++             |
| +/- | CUB                 |
| +/- | Thrust              |
|     | CUDA Experimental   |
| +/- | pycuda              |
| +/- | CUDA C Core Library |

🏃‍ Runner counts (total jobs: 259)

| #   | Runner                        |
|-----|-------------------------------|
| 186 | linux-amd64-cpu16             |
| 42  | linux-amd64-gpu-v100-latest-1 |
| 16  | linux-arm64-cpu16             |
| 15  | windows-amd64-cpu16           |

Contributor

@ahendriksen ahendriksen left a comment

Left minor comments on the device code. I have not looked at the host code.

cub/cub/device/dispatch/dispatch_transform.cuh (outdated)
} \
}

if (tile_stride == tile_size)
Contributor

Question: does the split between full_tile and partial tiles make a big difference in run time?

I expect that the difference shouldn't be too big, as the partial tile version only adds one comparison and predicates the rest of the computation.
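To make the question concrete, here is a minimal host-side C++ sketch of the two variants being compared. It is not the PR's kernel: the function names (`transform_tile_two_paths`, `transform_tile_single_path`, `run_tiles`) are hypothetical, and the plain loops stand in for the threads of a block. Only the `tile_stride == tile_size` test is taken from the reviewed code.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one thread block's work on one tile.
// tile_stride: nominal tile length, tile_size: number of valid elements.
template <typename F>
void transform_tile_two_paths(const int* in, int* out, int tile_stride, int tile_size, F f)
{
  if (tile_stride == tile_size)
  {
    // Full tile: no per-element bounds check.
    for (int i = 0; i < tile_stride; ++i)
      out[i] = f(in[i]);
  }
  else
  {
    // Partial tile: every element is guarded by one comparison.
    for (int i = 0; i < tile_stride; ++i)
      if (i < tile_size)
        out[i] = f(in[i]);
  }
}

// Single code path: always use the guarded loop, even for full tiles.
template <typename F>
void transform_tile_single_path(const int* in, int* out, int tile_stride, int tile_size, F f)
{
  for (int i = 0; i < tile_stride; ++i)
    if (i < tile_size)
      out[i] = f(in[i]);
}

// Walks the input tile by tile, the way a grid of blocks would cover it.
template <typename TileFn>
std::vector<int> run_tiles(const std::vector<int>& in, int tile_stride, TileFn tile_fn)
{
  std::vector<int> out(in.size(), 0);
  const int n = static_cast<int>(in.size());
  for (int offset = 0; offset < n; offset += tile_stride)
  {
    const int tile_size = std::min(n - offset, tile_stride); // last tile may be partial
    tile_fn(in.data() + offset, out.data() + offset, tile_stride, tile_size);
  }
  return out;
}
```

Both variants produce the same output; the open question is only whether skipping the guard on full tiles is measurably faster on the GPU, where the guard would typically become a predicate on the load and store.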

Contributor Author

That is a good question! I followed existing practice here, but I see your point. I will evaluate!

Contributor Author

Here is a benchmark on H200:

Babelstream on H200 two code paths (baseline) vs. single code path
# mul

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      I32      |      2^16      |   3.964 us |       3.53% |   3.948 us |       2.92% | -0.016 us |  -0.41% |   PASS   |
|   I8    |      I32      |      2^20      |  10.199 us |       1.88% |  10.266 us |       1.87% |  0.067 us |   0.65% |   PASS   |
|   I8    |      I32      |      2^24      |  22.588 us |       1.08% |  22.788 us |       1.11% |  0.200 us |   0.89% |   PASS   |
|   I8    |      I32      |      2^28      | 236.103 us |       0.15% | 238.034 us |       0.14% |  1.931 us |   0.82% |   FAIL   |
|   I8    |      I64      |      2^16      |   4.094 us |       3.64% |   4.114 us |       3.47% |  0.021 us |   0.51% |   PASS   |
|   I8    |      I64      |      2^20      |  10.245 us |       1.56% |  10.260 us |       1.47% |  0.015 us |   0.15% |   PASS   |
|   I8    |      I64      |      2^24      |  22.733 us |       1.09% |  22.792 us |       1.14% |  0.059 us |   0.26% |   PASS   |
|   I8    |      I64      |      2^28      | 237.687 us |       0.13% | 237.323 us |       0.14% | -0.363 us |  -0.15% |   FAIL   |
|   I16   |      I32      |      2^16      |   4.133 us |       2.97% |   4.168 us |       2.64% |  0.035 us |   0.84% |   PASS   |
|   I16   |      I32      |      2^20      |   7.748 us |       2.16% |   7.838 us |       2.33% |  0.089 us |   1.15% |   PASS   |
|   I16   |      I32      |      2^24      |  25.688 us |       1.48% |  25.959 us |       1.46% |  0.271 us |   1.05% |   PASS   |
|   I16   |      I32      |      2^28      | 299.739 us |       0.12% | 303.418 us |       0.10% |  3.679 us |   1.23% |   FAIL   |
|   I16   |      I64      |      2^16      |   4.191 us |       2.98% |   4.198 us |       3.03% |  0.007 us |   0.16% |   PASS   |
|   I16   |      I64      |      2^20      |   7.925 us |       2.11% |   7.921 us |       2.21% | -0.004 us |  -0.05% |   PASS   |
|   I16   |      I64      |      2^24      |  25.890 us |       1.39% |  25.933 us |       1.48% |  0.043 us |   0.17% |   PASS   |
|   I16   |      I64      |      2^28      | 300.617 us |       0.12% | 301.334 us |       0.10% |  0.717 us |   0.24% |   FAIL   |
|   F32   |      I32      |      2^16      |   4.108 us |       3.51% |   4.207 us |       3.29% |  0.100 us |   2.43% |   PASS   |
|   F32   |      I32      |      2^20      |   7.212 us |       2.63% |   7.398 us |       2.94% |  0.186 us |   2.58% |   PASS   |
|   F32   |      I32      |      2^24      |  38.193 us |       1.75% |  38.389 us |       1.59% |  0.196 us |   0.51% |   PASS   |
|   F32   |      I32      |      2^28      | 520.746 us |       0.15% | 522.979 us |       0.16% |  2.233 us |   0.43% |   FAIL   |
|   F32   |      I64      |      2^16      |   4.228 us |       4.57% |   4.325 us |       3.97% |  0.097 us |   2.29% |   PASS   |
|   F32   |      I64      |      2^20      |   7.264 us |       2.86% |   7.429 us |       2.98% |  0.165 us |   2.28% |   PASS   |
|   F32   |      I64      |      2^24      |  38.171 us |       1.62% |  38.409 us |       1.61% |  0.239 us |   0.63% |   PASS   |
|   F32   |      I64      |      2^28      | 521.283 us |       0.15% | 522.771 us |       0.16% |  1.488 us |   0.29% |   FAIL   |
|   F64   |      I32      |      2^16      |   4.387 us |       4.15% |   4.480 us |       3.78% |  0.093 us |   2.12% |   PASS   |
|   F64   |      I32      |      2^20      |   9.059 us |       2.89% |   9.198 us |       2.65% |  0.139 us |   1.54% |   PASS   |
|   F64   |      I32      |      2^24      |  67.603 us |       1.08% |  67.765 us |       1.06% |  0.161 us |   0.24% |   PASS   |
|   F64   |      I32      |      2^28      |   1.008 ms |       0.05% |   1.009 ms |       0.06% |  0.501 us |   0.05% |   PASS   |
|   F64   |      I64      |      2^16      |   4.516 us |       4.02% |   4.647 us |       5.02% |  0.130 us |   2.89% |   PASS   |
|   F64   |      I64      |      2^20      |   9.079 us |       2.79% |   9.217 us |       2.35% |  0.138 us |   1.52% |   PASS   |
|   F64   |      I64      |      2^24      |  67.634 us |       1.10% |  67.780 us |       1.07% |  0.147 us |   0.22% |   PASS   |
|   F64   |      I64      |      2^28      |   1.009 ms |       0.06% |   1.009 ms |       0.06% |  0.578 us |   0.06% |   FAIL   |
|  I128   |      I32      |      2^16      |   4.790 us |       4.22% |   4.934 us |       5.32% |  0.144 us |   3.00% |   PASS   |
|  I128   |      I32      |      2^20      |  12.903 us |       2.26% |  13.029 us |       2.36% |  0.126 us |   0.98% |   PASS   |
|  I128   |      I32      |      2^24      | 130.827 us |       0.64% | 130.909 us |       0.66% |  0.082 us |   0.06% |   PASS   |
|  I128   |      I32      |      2^28      |   2.030 ms |       0.14% |   2.031 ms |       0.14% |  0.725 us |   0.04% |   PASS   |
|  I128   |      I64      |      2^16      |   4.789 us |       3.89% |   4.788 us |       3.42% | -0.001 us |  -0.01% |   PASS   |
|  I128   |      I64      |      2^20      |  12.949 us |       2.35% |  13.150 us |       3.15% |  0.200 us |   1.55% |   PASS   |
|  I128   |      I64      |      2^24      | 131.666 us |       0.73% | 131.626 us |       0.74% | -0.040 us |  -0.03% |   PASS   |
|  I128   |      I64      |      2^28      |   2.032 ms |       0.41% |   2.033 ms |       0.41% |  0.144 us |   0.01% |   PASS   |

# add

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      I32      |      2^16      |   3.972 us |       2.69% |   4.055 us |       2.47% |  0.083 us |   2.09% |   PASS   |
|   I8    |      I32      |      2^20      |   7.763 us |       2.59% |   7.943 us |       2.18% |  0.179 us |   2.31% |   FAIL   |
|   I8    |      I32      |      2^24      |  26.937 us |       1.10% |  27.457 us |       1.05% |  0.521 us |   1.93% |   FAIL   |
|   I8    |      I32      |      2^28      | 300.779 us |       0.17% | 308.288 us |       0.16% |  7.509 us |   2.50% |   FAIL   |
|   I8    |      I64      |      2^16      |   4.136 us |       2.88% |   4.202 us |       2.87% |  0.066 us |   1.60% |   PASS   |
|   I8    |      I64      |      2^20      |   7.790 us |       2.26% |   7.934 us |       2.29% |  0.144 us |   1.84% |   PASS   |
|   I8    |      I64      |      2^24      |  27.023 us |       1.10% |  27.256 us |       1.10% |  0.234 us |   0.86% |   PASS   |
|   I8    |      I64      |      2^28      | 301.382 us |       0.11% | 308.000 us |       0.13% |  6.618 us |   2.20% |   FAIL   |
|   I16   |      I32      |      2^16      |   4.219 us |       2.79% |   4.305 us |       3.10% |  0.086 us |   2.03% |   PASS   |
|   I16   |      I32      |      2^20      |   7.193 us |       2.82% |   7.305 us |       2.80% |  0.112 us |   1.55% |   PASS   |
|   I16   |      I32      |      2^24      |  34.167 us |       1.71% |  34.630 us |       1.67% |  0.463 us |   1.36% |   PASS   |
|   I16   |      I32      |      2^28      | 444.715 us |       0.16% | 451.323 us |       0.15% |  6.607 us |   1.49% |   FAIL   |
|   I16   |      I64      |      2^16      |   4.242 us |       3.31% |   4.331 us |       3.11% |  0.090 us |   2.12% |   PASS   |
|   I16   |      I64      |      2^20      |   7.192 us |       2.76% |   7.344 us |       2.73% |  0.152 us |   2.12% |   PASS   |
|   I16   |      I64      |      2^24      |  34.275 us |       1.78% |  34.495 us |       1.73% |  0.220 us |   0.64% |   PASS   |
|   I16   |      I64      |      2^28      | 445.632 us |       0.15% | 449.233 us |       0.16% |  3.601 us |   0.81% |   FAIL   |
|   F32   |      I32      |      2^16      |   4.365 us |       2.98% |   4.601 us |       3.67% |  0.236 us |   5.40% |   FAIL   |
|   F32   |      I32      |      2^20      |   8.399 us |       2.74% |   8.559 us |       2.39% |  0.160 us |   1.90% |   PASS   |
|   F32   |      I32      |      2^24      |  54.969 us |       1.37% |  54.921 us |       1.37% | -0.048 us |  -0.09% |   PASS   |
|   F32   |      I32      |      2^28      | 785.761 us |       0.09% | 789.161 us |       0.09% |  3.400 us |   0.43% |   FAIL   |
|   F32   |      I64      |      2^16      |   4.429 us |       4.19% |   4.704 us |       4.52% |  0.275 us |   6.21% |   FAIL   |
|   F32   |      I64      |      2^20      |   8.373 us |       2.40% |   8.546 us |       2.48% |  0.173 us |   2.07% |   PASS   |
|   F32   |      I64      |      2^24      |  55.000 us |       1.37% |  55.306 us |       1.37% |  0.307 us |   0.56% |   PASS   |
|   F32   |      I64      |      2^28      | 786.775 us |       0.08% | 790.047 us |       0.08% |  3.272 us |   0.42% |   FAIL   |
|   F64   |      I32      |      2^16      |   4.706 us |       3.67% |   4.801 us |       4.52% |  0.095 us |   2.01% |   PASS   |
|   F64   |      I32      |      2^20      |  11.428 us |       3.41% |  11.495 us |       2.49% |  0.066 us |   0.58% |   PASS   |
|   F64   |      I32      |      2^24      | 100.121 us |       0.39% | 100.326 us |       0.39% |  0.205 us |   0.20% |   PASS   |
|   F64   |      I32      |      2^28      |   1.508 ms |       0.05% |   1.509 ms |       0.05% |  0.855 us |   0.06% |   FAIL   |
|   F64   |      I64      |      2^16      |   4.804 us |       4.27% |   4.931 us |       4.97% |  0.126 us |   2.63% |   PASS   |
|   F64   |      I64      |      2^20      |  11.316 us |       2.47% |  11.444 us |       2.55% |  0.128 us |   1.13% |   PASS   |
|   F64   |      I64      |      2^24      | 100.004 us |       0.41% | 100.244 us |       0.39% |  0.240 us |   0.24% |   PASS   |
|   F64   |      I64      |      2^28      |   1.509 ms |       0.05% |   1.509 ms |       0.06% |  0.658 us |   0.04% |   PASS   |
|  I128   |      I32      |      2^16      |   5.227 us |       4.38% |   5.329 us |       4.78% |  0.102 us |   1.95% |   PASS   |
|  I128   |      I32      |      2^20      |  17.633 us |       2.22% |  17.680 us |       2.28% |  0.047 us |   0.27% |   PASS   |
|  I128   |      I32      |      2^24      | 194.202 us |       0.29% | 194.335 us |       0.29% |  0.133 us |   0.07% |   PASS   |
|  I128   |      I32      |      2^28      |   3.008 ms |       0.09% |   3.009 ms |       0.09% |  1.169 us |   0.04% |   PASS   |
|  I128   |      I64      |      2^16      |   5.164 us |       3.50% |   5.179 us |       3.29% |  0.014 us |   0.28% |   PASS   |
|  I128   |      I64      |      2^20      |  17.804 us |       2.52% |  17.788 us |       2.36% | -0.016 us |  -0.09% |   PASS   |
|  I128   |      I64      |      2^24      | 195.061 us |       0.36% | 195.180 us |       0.38% |  0.119 us |   0.06% |   PASS   |
|  I128   |      I64      |      2^28      |   3.010 ms |       0.23% |   3.009 ms |       0.21% | -0.449 us |  -0.01% |   PASS   |

# triad

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      I32      |      2^16      |   4.246 us |       3.71% |   4.226 us |       3.63% | -0.019 us |  -0.45% |   PASS   |
|   I8    |      I32      |      2^20      |   7.839 us |       2.64% |   7.915 us |       2.36% |  0.075 us |   0.96% |   PASS   |
|   I8    |      I32      |      2^24      |  26.018 us |       1.33% |  26.417 us |       1.25% |  0.399 us |   1.53% |   FAIL   |
|   I8    |      I32      |      2^28      | 286.034 us |       0.15% | 294.496 us |       0.13% |  8.461 us |   2.96% |   FAIL   |
|   I8    |      I64      |      2^16      |   4.264 us |       3.84% |   4.301 us |       3.56% |  0.038 us |   0.88% |   PASS   |
|   I8    |      I64      |      2^20      |   7.958 us |       3.06% |   8.046 us |       3.02% |  0.088 us |   1.10% |   PASS   |
|   I8    |      I64      |      2^24      |  25.732 us |       1.25% |  25.935 us |       1.30% |  0.202 us |   0.79% |   PASS   |
|   I8    |      I64      |      2^28      | 314.272 us |       0.12% | 316.872 us |       0.21% |  2.600 us |   0.83% |   FAIL   |
|   I16   |      I32      |      2^16      |   4.327 us |       2.76% |   4.331 us |       3.10% |  0.004 us |   0.09% |   PASS   |
|   I16   |      I32      |      2^20      |   7.305 us |       2.79% |   7.410 us |       2.68% |  0.105 us |   1.44% |   PASS   |
|   I16   |      I32      |      2^24      |  34.765 us |       1.67% |  34.117 us |       1.75% | -0.647 us |  -1.86% |   FAIL   |
|   I16   |      I32      |      2^28      | 425.006 us |       0.17% | 430.051 us |       0.17% |  5.045 us |   1.19% |   FAIL   |
|   I16   |      I64      |      2^16      |   4.373 us |       3.45% |   4.388 us |       3.47% |  0.015 us |   0.34% |   PASS   |
|   I16   |      I64      |      2^20      |   7.401 us |       3.11% |   7.397 us |       2.74% | -0.003 us |  -0.04% |   PASS   |
|   I16   |      I64      |      2^24      |  33.826 us |       1.79% |  33.999 us |       1.75% |  0.174 us |   0.51% |   PASS   |
|   I16   |      I64      |      2^28      | 425.722 us |       0.17% | 428.301 us |       0.16% |  2.579 us |   0.61% |   FAIL   |
|   F32   |      I32      |      2^16      |   4.458 us |       3.71% |   4.407 us |       3.56% | -0.051 us |  -1.14% |   PASS   |
|   F32   |      I32      |      2^20      |   8.391 us |       2.70% |   8.450 us |       2.82% |  0.059 us |   0.71% |   PASS   |
|   F32   |      I32      |      2^24      |  55.046 us |       1.33% |  55.285 us |       1.34% |  0.239 us |   0.43% |   PASS   |
|   F32   |      I32      |      2^28      | 773.132 us |       0.12% | 775.959 us |       0.11% |  2.828 us |   0.37% |   FAIL   |
|   F32   |      I64      |      2^16      |   4.484 us |       4.79% |   4.493 us |       4.85% |  0.009 us |   0.20% |   PASS   |
|   F32   |      I64      |      2^20      |   8.401 us |       2.66% |   8.494 us |       2.66% |  0.093 us |   1.11% |   PASS   |
|   F32   |      I64      |      2^24      |  55.101 us |       1.32% |  55.349 us |       1.32% |  0.248 us |   0.45% |   PASS   |
|   F32   |      I64      |      2^28      | 774.147 us |       0.12% | 776.581 us |       0.11% |  2.434 us |   0.31% |   FAIL   |
|   F64   |      I32      |      2^16      |   4.827 us |       3.95% |   4.810 us |       3.70% | -0.017 us |  -0.35% |   PASS   |
|   F64   |      I32      |      2^20      |  11.340 us |       1.35% |  11.445 us |       2.54% |  0.105 us |   0.92% |   PASS   |
|   F64   |      I32      |      2^24      | 100.355 us |       0.43% | 100.262 us |       0.44% | -0.094 us |  -0.09% |   PASS   |
|   F64   |      I32      |      2^28      |   1.509 ms |       0.05% |   1.509 ms |       0.06% |  0.636 us |   0.04% |   PASS   |
|   F64   |      I64      |      2^16      |   4.903 us |       5.42% |   4.947 us |       5.01% |  0.044 us |   0.90% |   PASS   |
|   F64   |      I64      |      2^20      |  11.371 us |       1.66% |  11.553 us |       3.20% |  0.182 us |   1.60% |   PASS   |
|   F64   |      I64      |      2^24      | 100.220 us |       0.43% | 100.266 us |       0.43% |  0.046 us |   0.05% |   PASS   |
|   F64   |      I64      |      2^28      |   1.509 ms |       0.06% |   1.509 ms |       0.06% |  0.494 us |   0.03% |   PASS   |
|  I128   |      I32      |      2^16      |   5.422 us |       4.90% |   5.422 us |       5.07% |  0.000 us |   0.00% |   PASS   |
|  I128   |      I32      |      2^20      |  17.905 us |       2.25% |  17.918 us |       2.14% |  0.013 us |   0.07% |   PASS   |
|  I128   |      I32      |      2^24      | 194.422 us |       0.30% | 194.603 us |       0.27% |  0.181 us |   0.09% |   PASS   |
|  I128   |      I32      |      2^28      |   3.007 ms |       0.09% |   3.010 ms |       0.09% |  2.715 us |   0.09% |   FAIL   |
|  I128   |      I64      |      2^16      |   5.202 us |       3.37% |   5.235 us |       4.08% |  0.032 us |   0.62% |   PASS   |
|  I128   |      I64      |      2^20      |  17.851 us |       2.39% |  17.837 us |       2.32% | -0.014 us |  -0.08% |   PASS   |
|  I128   |      I64      |      2^24      | 194.996 us |       0.36% | 195.120 us |       0.34% |  0.124 us |   0.06% |   PASS   |
|  I128   |      I64      |      2^28      |   3.007 ms |       0.19% |   3.009 ms |       0.21% |  2.588 us |   0.09% |   PASS   |

# nstream

## [0] NVIDIA H200

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  OverwriteInput  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |      I32      |      2^16      |        1         |   4.153 us |       3.08% |   4.072 us |       3.18% |  -0.081 us |  -1.95% |   PASS   |
|   I8    |      I32      |      2^20      |        1         |   7.199 us |       2.68% |   7.263 us |       2.30% |   0.065 us |   0.90% |   PASS   |
|   I8    |      I32      |      2^24      |        1         |  30.769 us |       1.27% |  31.748 us |       1.16% |   0.979 us |   3.18% |   FAIL   |
|   I8    |      I32      |      2^28      |        1         | 367.426 us |       0.15% | 387.490 us |       0.11% |  20.065 us |   5.46% |   FAIL   |
|   I8    |      I64      |      2^16      |        1         |   4.308 us |       2.83% |   4.194 us |       2.96% |  -0.114 us |  -2.64% |   PASS   |
|   I8    |      I64      |      2^20      |        1         |   7.247 us |       2.46% |   7.133 us |       2.63% |  -0.114 us |  -1.58% |   PASS   |
|   I8    |      I64      |      2^24      |        1         |  30.870 us |       1.22% |  30.982 us |       1.22% |   0.112 us |   0.36% |   PASS   |
|   I8    |      I64      |      2^28      |        1         | 369.782 us |       0.23% | 373.844 us |       0.13% |   4.062 us |   1.10% |   FAIL   |
|   I16   |      I32      |      2^16      |        1         |   4.440 us |       3.49% |   4.359 us |       2.92% |  -0.081 us |  -1.84% |   PASS   |
|   I16   |      I32      |      2^20      |        1         |   7.498 us |       2.97% |   7.485 us |       3.08% |  -0.013 us |  -0.18% |   PASS   |
|   I16   |      I32      |      2^24      |        1         |  42.648 us |       1.66% |  43.014 us |       1.60% |   0.367 us |   0.86% |   PASS   |
|   I16   |      I32      |      2^28      |        1         | 562.846 us |       0.13% | 572.306 us |       0.13% |   9.460 us |   1.68% |   FAIL   |
|   I16   |      I64      |      2^16      |        1         |   4.472 us |       3.73% |   4.376 us |       3.08% |  -0.096 us |  -2.14% |   PASS   |
|   I16   |      I64      |      2^20      |        1         |   7.528 us |       2.69% |   7.452 us |       3.06% |  -0.077 us |  -1.02% |   PASS   |
|   I16   |      I64      |      2^24      |        1         |  42.431 us |       1.67% |  42.550 us |       1.63% |   0.119 us |   0.28% |   PASS   |
|   I16   |      I64      |      2^28      |        1         | 560.706 us |       0.14% | 563.724 us |       0.09% |   3.018 us |   0.54% |   FAIL   |
|   F32   |      I32      |      2^16      |        1         |   4.452 us |       3.22% |   4.576 us |       3.21% |   0.124 us |   2.79% |   PASS   |
|   F32   |      I32      |      2^20      |        1         |   9.191 us |       2.85% |   9.327 us |       2.67% |   0.136 us |   1.48% |   PASS   |
|   F32   |      I32      |      2^24      |        1         |  70.198 us |       1.07% |  71.007 us |       1.02% |   0.809 us |   1.15% |   FAIL   |
|   F32   |      I32      |      2^28      |        1         |   1.021 ms |       0.21% |   1.036 ms |       0.19% |  15.479 us |   1.52% |   FAIL   |
|   F32   |      I64      |      2^16      |        1         |   4.519 us |       3.78% |   4.652 us |       4.58% |   0.133 us |   2.95% |   PASS   |
|   F32   |      I64      |      2^20      |        1         |   9.242 us |       2.80% |   9.355 us |       2.63% |   0.113 us |   1.22% |   PASS   |
|   F32   |      I64      |      2^24      |        1         |  70.674 us |       1.04% |  71.199 us |       1.02% |   0.524 us |   0.74% |   PASS   |
|   F32   |      I64      |      2^28      |        1         |   1.031 ms |       0.21% |   1.039 ms |       0.20% |   8.410 us |   0.82% |   FAIL   |
|   F64   |      I32      |      2^16      |        1         |   4.864 us |       4.07% |   5.022 us |       4.96% |   0.158 us |   3.25% |   PASS   |
|   F64   |      I32      |      2^20      |        1         |  13.639 us |       2.16% |  13.737 us |       1.61% |   0.098 us |   0.72% |   PASS   |
|   F64   |      I32      |      2^24      |        1         | 129.316 us |       0.65% | 133.768 us |       0.74% |   4.452 us |   3.44% |   FAIL   |
|   F64   |      I32      |      2^28      |        1         |   1.952 ms |       0.21% |   2.043 ms |       0.16% |  91.277 us |   4.68% |   FAIL   |
|   F64   |      I64      |      2^16      |        1         |   4.984 us |       4.54% |   5.129 us |       4.96% |   0.144 us |   2.90% |   PASS   |
|   F64   |      I64      |      2^20      |        1         |  13.639 us |       1.81% |  13.771 us |       1.87% |   0.132 us |   0.97% |   PASS   |
|   F64   |      I64      |      2^24      |        1         | 130.536 us |       0.64% | 133.817 us |       0.74% |   3.280 us |   2.51% |   FAIL   |
|   F64   |      I64      |      2^28      |        1         |   1.982 ms |       0.10% |   2.040 ms |       0.22% |  57.276 us |   2.89% |   FAIL   |
|  I128   |      I32      |      2^16      |        1         |   5.508 us |       4.58% |   5.697 us |       5.21% |   0.189 us |   3.43% |   PASS   |
|  I128   |      I32      |      2^20      |        1         |  21.720 us |       2.16% |  21.817 us |       2.20% |   0.097 us |   0.45% |   PASS   |
|  I128   |      I32      |      2^24      |        1         | 251.228 us |       0.40% | 258.446 us |       0.41% |   7.218 us |   2.87% |   FAIL   |
|  I128   |      I32      |      2^28      |        1         |   3.907 ms |       0.09% |   4.042 ms |       0.08% | 135.477 us |   3.47% |   FAIL   |
|  I128   |      I64      |      2^16      |        1         |   5.471 us |       3.86% |   5.409 us |       3.47% |  -0.063 us |  -1.15% |   PASS   |
|  I128   |      I64      |      2^20      |        1         |  21.859 us |       2.20% |  21.719 us |       2.19% |  -0.140 us |  -0.64% |   PASS   |
|  I128   |      I64      |      2^24      |        1         | 260.256 us |       0.38% | 258.323 us |       0.43% |  -1.933 us |  -0.74% |   FAIL   |
|  I128   |      I64      |      2^28      |        1         |   4.074 ms |       0.08% |   4.043 ms |       0.09% | -30.665 us |  -0.75% |   FAIL   |

# Summary

- Total Matches: 160
  - Pass    (diff <= min_noise): 117
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  43

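The change does not matter in most cases (117 of 160 comparisons are within noise), but I see a few regressions with the single code path, especially for large problem sizes and with nstream (e.g. 5.46% for I8/I32 at 2^28 elements).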
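For readers of the tables, the Status column can be reproduced with a short sketch. It assumes, based on the summary legend (`diff <= min_noise`), that a row passes when the absolute relative difference does not exceed the smaller of the two noise values. Taking the absolute value is an inference that matches rows with a negative diff and a FAIL status; the actual comparison script may differ in detail. The type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// One row of the comparison table; times in microseconds, noise as a fraction.
struct Measurement
{
  double ref_time;
  double ref_noise;
  double cmp_time;
  double cmp_noise;
};

// Relative difference of the compared run against the reference run (%Diff / 100).
double relative_diff(const Measurement& m)
{
  return (m.cmp_time - m.ref_time) / m.ref_time;
}

// Assumption: PASS when |relative diff| <= min(ref noise, cmp noise), else FAIL.
std::string status(const Measurement& m)
{
  const double min_noise = std::min(m.ref_noise, m.cmp_noise);
  return std::abs(relative_diff(m)) <= min_noise ? "PASS" : "FAIL";
}
```

For example, the mul row for I8/I32 at 2^28 (236.103 us vs. 238.034 us, noise 0.15% and 0.14%) has a relative difference of about 0.82%, which exceeds 0.14% and is therefore reported as FAIL even though the slowdown is small in absolute terms.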

cub/cub/device/dispatch/dispatch_transform.cuh (outdated)
@gevtushenko self-requested a review on September 16, 2024 15:49
@bernhardmgruber force-pushed the transform_prefetch branch 2 times, most recently from a4a823b to 1707100, on October 29, 2024 14:54
Contributor

🟩 CI finished in 1h 18m: Pass: 100%/222 | Total: 1d 03h | Avg: 7m 26s | Max: 56m 52s | Hits: 99%/16093
  • 🟩 cub: Pass: 100%/110 | Total: 14h 40m | Avg: 8m 00s | Max: 56m 52s | Hits: 97%/2928

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total: 13h 54m | Avg:  8m 10s | Max: 56m 52s | Hits:  97%/2928  
      🟩 arm64              Pass: 100%/8   | Total: 45m 38s | Avg:  5m 42s | Max:  6m 27s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 20m | Avg:  5m 23s | Max: 13m 30s | Hits:  97%/732   
      🟩 11.8               Pass: 100%/3   | Total: 18m 17s | Avg:  6m 05s | Max:  6m 18s
      🟩 12.5               Pass: 100%/4   | Total: 42m 28s | Avg: 10m 37s | Max: 10m 52s
      🟩 12.6               Pass: 100%/88  | Total: 12h 18m | Avg:  8m 23s | Max: 56m 52s | Hits:  97%/2196  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 18m 03s | Avg:  4m 30s | Max:  4m 50s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 20m | Avg:  5m 23s | Max: 13m 30s | Hits:  97%/732   
      🟩 nvcc11.8           Pass: 100%/3   | Total: 18m 17s | Avg:  6m 05s | Max:  6m 18s
      🟩 nvcc12.5           Pass: 100%/4   | Total: 42m 28s | Avg: 10m 37s | Max: 10m 52s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 12h 00m | Avg:  8m 34s | Max: 56m 52s | Hits:  97%/2196  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 18m 03s | Avg:  4m 30s | Max:  4m 50s
      🟩 nvcc               Pass: 100%/106 | Total: 14h 22m | Avg:  8m 07s | Max: 56m 52s | Hits:  97%/2928  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 34m 27s | Avg:  5m 44s | Max:  6m 42s
      🟩 Clang10            Pass: 100%/3   | Total: 19m 41s | Avg:  6m 33s | Max:  6m 53s
      🟩 Clang11            Pass: 100%/4   | Total: 22m 42s | Avg:  5m 40s | Max:  6m 22s
      🟩 Clang12            Pass: 100%/4   | Total: 22m 00s | Avg:  5m 30s | Max:  5m 38s
      🟩 Clang13            Pass: 100%/4   | Total: 22m 42s | Avg:  5m 40s | Max:  6m 08s
      🟩 Clang14            Pass: 100%/4   | Total: 22m 01s | Avg:  5m 30s | Max:  5m 57s
      🟩 Clang15            Pass: 100%/4   | Total: 21m 33s | Avg:  5m 23s | Max:  5m 35s
      🟩 Clang16            Pass: 100%/4   | Total: 21m 55s | Avg:  5m 28s | Max:  5m 42s
      🟩 Clang17            Pass: 100%/4   | Total: 23m 01s | Avg:  5m 45s | Max:  6m 01s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 27m | Avg:  7m 56s | Max: 23m 12s
      🟩 GCC6               Pass: 100%/2   | Total:  9m 18s | Avg:  4m 39s | Max:  4m 44s
      🟩 GCC7               Pass: 100%/6   | Total:  1h 22m | Avg: 13m 41s | Max: 56m 52s
      🟩 GCC8               Pass: 100%/6   | Total: 30m 28s | Avg:  5m 04s | Max:  5m 55s
      🟩 GCC9               Pass: 100%/6   | Total: 31m 23s | Avg:  5m 13s | Max:  5m 58s
      🟩 GCC10              Pass: 100%/4   | Total: 22m 38s | Avg:  5m 39s | Max:  6m 10s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 27m | Avg: 12m 31s | Max: 51m 07s
      🟩 GCC12              Pass: 100%/4   | Total: 24m 22s | Avg:  6m 05s | Max:  6m 24s
      🟩 GCC13              Pass: 100%/16  | Total:  3h 02m | Avg: 11m 24s | Max: 29m 49s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 19m 16s | Avg:  6m 25s | Max:  6m 44s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 13m 30s | Avg: 13m 30s | Max: 13m 30s | Hits:  97%/732   
      🟩 MSVC14.29          Pass: 100%/2   | Total: 24m 07s | Avg: 12m 03s | Max: 12m 34s | Hits:  97%/1464  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 59s | Avg: 12m 59s | Max: 12m 59s | Hits:  97%/732   
      🟩 NVHPC24.7          Pass: 100%/4   | Total: 42m 28s | Avg: 10m 37s | Max: 10m 52s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 57m | Avg:  6m 11s | Max: 23m 12s
      🟩 GCC                Pass: 100%/51  | Total:  7h 50m | Avg:  9m 13s | Max: 56m 52s
      🟩 Intel              Pass: 100%/3   | Total: 19m 16s | Avg:  6m 25s | Max:  6m 44s
      🟩 MSVC               Pass: 100%/4   | Total: 50m 36s | Avg: 12m 39s | Max: 13m 30s | Hits:  97%/2928  
      🟩 NVHPC              Pass: 100%/4   | Total: 42m 28s | Avg: 10m 37s | Max: 10m 52s
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total: 14h 40m | Avg:  8m 00s | Max: 56m 52s | Hits:  97%/2928  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 11h 50m | Avg:  6m 57s | Max: 56m 52s | Hits:  97%/2928  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 34s | Avg: 23m 34s | Max: 23m 34s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 21s | Avg: 15m 21s | Max: 15m 21s
      🟩 HostLaunch         Pass: 100%/3   | Total: 55m 49s | Avg: 18m 36s | Max: 20m 17s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 14m | Avg: 24m 57s | Max: 29m 49s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 18m 17s | Avg:  6m 05s | Max:  6m 18s
      🟩 90a                Pass: 100%/4   | Total: 18m 42s | Avg:  4m 40s | Max:  4m 47s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  4h 07m | Avg:  8m 14s | Max: 56m 52s
      🟩 14                 Pass: 100%/29  | Total:  3h 43m | Avg:  7m 41s | Max: 51m 07s | Hits:  97%/1464  
      🟩 17                 Pass: 100%/27  | Total:  2h 44m | Avg:  6m 05s | Max: 12m 34s | Hits:  97%/732   
      🟩 20                 Pass: 100%/24  | Total:  4h 05m | Avg: 10m 13s | Max: 29m 49s | Hits:  97%/732   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 12h 19m | Avg: 6m 47s | Max: 24m 55s | Hits: 99%/13165

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total: 11h 39m | Avg:  6m 55s | Max: 24m 55s | Hits:  99%/13165 
      🟩 arm64              Pass: 100%/8   | Total: 40m 33s | Avg:  5m 04s | Max:  5m 30s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 22m | Avg:  5m 30s | Max: 18m 26s | Hits:  99%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 17m 12s | Avg:  5m 44s | Max:  6m 21s
      🟩 12.5               Pass: 100%/4   | Total:  1h 09m | Avg: 17m 19s | Max: 18m 47s
      🟩 12.6               Pass: 100%/87  | Total:  9h 30m | Avg:  6m 33s | Max: 24m 55s | Hits:  99%/10532 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 21m 01s | Avg:  5m 15s | Max:  5m 28s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 22m | Avg:  5m 30s | Max: 18m 26s | Hits:  99%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 17m 12s | Avg:  5m 44s | Max:  6m 21s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  1h 09m | Avg: 17m 19s | Max: 18m 47s
      🟩 nvcc12.6           Pass: 100%/83  | Total:  9h 09m | Avg:  6m 37s | Max: 24m 55s | Hits:  99%/10532 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 21m 01s | Avg:  5m 15s | Max:  5m 28s
      🟩 nvcc               Pass: 100%/105 | Total: 11h 58m | Avg:  6m 50s | Max: 24m 55s | Hits:  99%/13165 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 34m 01s | Avg:  5m 40s | Max:  6m 48s
      🟩 Clang10            Pass: 100%/3   | Total: 20m 30s | Avg:  6m 50s | Max:  7m 13s
      🟩 Clang11            Pass: 100%/4   | Total: 20m 27s | Avg:  5m 06s | Max:  5m 13s
      🟩 Clang12            Pass: 100%/4   | Total: 21m 12s | Avg:  5m 18s | Max:  5m 39s
      🟩 Clang13            Pass: 100%/4   | Total: 21m 48s | Avg:  5m 27s | Max:  5m 49s
      🟩 Clang14            Pass: 100%/4   | Total: 22m 52s | Avg:  5m 43s | Max:  5m 52s
      🟩 Clang15            Pass: 100%/4   | Total: 23m 36s | Avg:  5m 54s | Max:  6m 17s
      🟩 Clang16            Pass: 100%/4   | Total: 22m 23s | Avg:  5m 35s | Max:  5m 54s
      🟩 Clang17            Pass: 100%/4   | Total: 22m 07s | Avg:  5m 31s | Max:  5m 56s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 06m | Avg:  6m 03s | Max: 12m 55s
      🟩 GCC6               Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  5m 08s
      🟩 GCC7               Pass: 100%/6   | Total: 29m 47s | Avg:  4m 57s | Max:  6m 06s
      🟩 GCC8               Pass: 100%/6   | Total: 29m 32s | Avg:  4m 55s | Max:  5m 36s
      🟩 GCC9               Pass: 100%/6   | Total: 30m 43s | Avg:  5m 07s | Max:  5m 53s
      🟩 GCC10              Pass: 100%/4   | Total: 22m 19s | Avg:  5m 34s | Max:  5m 51s
      🟩 GCC11              Pass: 100%/7   | Total: 39m 46s | Avg:  5m 40s | Max:  6m 21s
      🟩 GCC12              Pass: 100%/4   | Total: 23m 32s | Avg:  5m 53s | Max:  6m 25s
      🟩 GCC13              Pass: 100%/14  | Total:  1h 35m | Avg:  6m 48s | Max: 14m 52s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 19m 57s | Avg:  6m 39s | Max:  7m 17s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 26s | Avg: 18m 26s | Max: 18m 26s | Hits:  99%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 32m 24s | Avg: 16m 12s | Max: 16m 28s | Hits:  99%/5266  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 43m 42s | Avg: 21m 51s | Max: 24m 55s | Hits:  99%/5266  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  1h 09m | Avg: 17m 19s | Max: 18m 47s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 35m | Avg:  5m 44s | Max: 12m 55s
      🟩 GCC                Pass: 100%/49  | Total:  4h 40m | Avg:  5m 43s | Max: 14m 52s
      🟩 Intel              Pass: 100%/3   | Total: 19m 57s | Avg:  6m 39s | Max:  7m 17s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 34m | Avg: 18m 54s | Max: 24m 55s | Hits:  99%/13165 
      🟩 NVHPC              Pass: 100%/4   | Total:  1h 09m | Avg: 17m 19s | Max: 18m 47s
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total: 12h 19m | Avg:  6m 47s | Max: 24m 55s | Hits:  99%/13165 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 10h 49m | Avg:  6m 22s | Max: 18m 47s | Hits:  99%/10532 
      🟩 TestCPU            Pass: 100%/4   | Total: 48m 49s | Avg: 12m 12s | Max: 24m 55s | Hits:  99%/2633  
      🟩 TestGPU            Pass: 100%/3   | Total: 40m 53s | Avg: 13m 37s | Max: 14m 52s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 17m 12s | Avg:  5m 44s | Max:  6m 21s
      🟩 90a                Pass: 100%/4   | Total: 17m 51s | Avg:  4m 27s | Max:  4m 56s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  2h 53m | Avg:  5m 46s | Max: 14m 52s
      🟩 14                 Pass: 100%/29  | Total:  3h 14m | Avg:  6m 42s | Max: 18m 26s | Hits:  99%/5266  
      🟩 17                 Pass: 100%/27  | Total:  2h 56m | Avg:  6m 32s | Max: 18m 47s | Hits:  99%/2633  
      🟩 20                 Pass: 100%/23  | Total:  3h 15m | Avg:  8m 29s | Max: 24m 55s | Hits:  99%/5266  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 38s | Avg: 5m 19s | Max: 8m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 23s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 15s | Avg:  2m 15s | Max:  2m 15s
      🟩 Test               Pass: 100%/1   | Total:  8m 23s | Avg:  8m 23s | Max:  8m 23s
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 21m 45s | Avg: 21m 45s | Max: 21m 45s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
pycuda
CCCL C Parallel Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- pycuda
+/- CCCL C Parallel Library

🏃‍ Runner counts (total jobs: 222)

# Runner
184 linux-amd64-cpu16
16 linux-arm64-cpu16
13 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16

@bernhardmgruber
Contributor Author

This PR is ready from my point of view :)

Contributor

🟩 CI finished in 4h 15m: Pass: 100%/222 | Total: 4d 22h | Avg: 31m 55s | Max: 1h 30m | Hits: 51%/16089
  • 🟩 cub: Pass: 100%/110 | Total: 4d 02h | Avg: 53m 38s | Max: 1h 30m | Hits: 64%/2924

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 19h | Avg: 53m 41s | Max:  1h 30m | Hits:  64%/2924  
      🟩 arm64              Pass: 100%/8   | Total:  7h 04m | Avg: 53m 02s | Max: 53m 52s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 26m | Avg: 45m 45s | Max: 52m 56s | Hits:  64%/731   
      🟩 11.8               Pass: 100%/3   | Total:  3h 17m | Avg:  1h 05m | Max:  1h 07m
      🟩 12.5               Pass: 100%/4   | Total:  4h 16m | Avg:  1h 04m | Max:  1h 09m
      🟩 12.6               Pass: 100%/88  | Total:  3d 07h | Avg: 54m 05s | Max:  1h 30m | Hits:  64%/2193  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 43m | Avg: 55m 53s | Max: 58m 24s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 26m | Avg: 45m 45s | Max: 52m 56s | Hits:  64%/731   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 17m | Avg:  1h 05m | Max:  1h 07m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 16m | Avg:  1h 04m | Max:  1h 09m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  3d 03h | Avg: 54m 00s | Max:  1h 30m | Hits:  64%/2193  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 43m | Avg: 55m 53s | Max: 58m 24s
      🟩 nvcc               Pass: 100%/106 | Total:  3d 22h | Avg: 53m 33s | Max:  1h 30m | Hits:  64%/2924  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  5h 10m | Avg: 51m 46s | Max: 58m 22s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 47m | Avg: 55m 49s | Max: 56m 55s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 13s | Max: 56m 05s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 14s | Max: 55m 43s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 34m | Avg: 53m 30s | Max: 55m 11s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 43m | Avg: 55m 49s | Max: 56m 36s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 37s | Max: 58m 56s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 30m | Avg: 52m 30s | Max: 55m 26s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 01s | Max: 57m 46s
      🟩 Clang18            Pass: 100%/11  | Total:  8h 45m | Avg: 47m 44s | Max: 58m 24s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 31m | Avg: 45m 42s | Max: 48m 59s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 59m | Avg: 49m 54s | Max: 58m 03s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 44m | Avg: 47m 23s | Max: 51m 30s
      🟩 GCC9               Pass: 100%/6   | Total:  4h 54m | Avg: 49m 06s | Max: 56m 03s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 34m | Avg: 53m 40s | Max: 57m 46s
      🟩 GCC11              Pass: 100%/7   | Total:  6h 47m | Avg: 58m 08s | Max:  1h 07m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 28m | Avg: 52m 12s | Max: 55m 00s
      🟩 GCC13              Pass: 100%/16  | Total: 15h 15m | Avg: 57m 12s | Max:  1h 30m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 56m | Avg: 58m 52s | Max:  1h 00m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 52m 56s | Avg: 52m 56s | Max: 52m 56s | Hits:  64%/731   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits:  64%/1462  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 04m | Avg:  1h 04m | Max:  1h 04m | Hits:  64%/731   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 16m | Avg:  1h 04m | Max:  1h 09m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 17h | Avg: 52m 13s | Max: 58m 56s
      🟩 GCC                Pass: 100%/51  | Total:  1d 21h | Avg: 53m 15s | Max:  1h 30m
      🟩 Intel              Pass: 100%/3   | Total:  2h 56m | Avg: 58m 52s | Max:  1h 00m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 04m | Avg:  1h 01m | Max:  1h 04m | Hits:  64%/2924  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 16m | Avg:  1h 04m | Max:  1h 09m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  4d 02h | Avg: 53m 38s | Max:  1h 30m | Hits:  64%/2924  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 17h | Avg: 52m 34s | Max:  1h 09m | Hits:  64%/2924  
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 28m | Avg:  1h 28m | Max:  1h 28m
      🟩 GraphCapture       Pass: 100%/1   | Total:  1h 17m | Avg:  1h 17m | Max:  1h 17m
      🟩 HostLaunch         Pass: 100%/3   | Total:  3h 03m | Avg:  1h 01m | Max:  1h 30m
      🟩 TestGPU            Pass: 100%/3   | Total:  3h 09m | Avg:  1h 03m | Max:  1h 23m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 17m | Avg:  1h 05m | Max:  1h 07m
      🟩 90a                Pass: 100%/4   | Total:  1h 32m | Avg: 23m 01s | Max: 24m 06s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 02h | Avg: 53m 27s | Max:  1h 30m
      🟩 14                 Pass: 100%/29  | Total:  1d 01h | Avg: 52m 58s | Max:  1h 09m | Hits:  64%/1462  
      🟩 17                 Pass: 100%/27  | Total: 23h 55m | Avg: 53m 09s | Max:  1h 04m | Hits:  64%/731   
      🟩 20                 Pass: 100%/24  | Total: 22h 04m | Avg: 55m 12s | Max:  1h 28m | Hits:  64%/731   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 19h 18m | Avg: 10m 37s | Max: 1h 11m | Hits: 49%/13165

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total: 18h 38m | Avg: 11m 04s | Max:  1h 11m | Hits:  49%/13165 
      🟩 arm64              Pass: 100%/8   | Total: 40m 16s | Avg:  5m 02s | Max:  5m 29s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  2h 05m | Avg:  8m 21s | Max:  1h 00m | Hits:  58%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 16m 08s | Avg:  5m 22s | Max:  5m 57s
      🟩 12.5               Pass: 100%/4   | Total:  3h 27m | Avg: 51m 52s | Max: 57m 39s
      🟩 12.6               Pass: 100%/87  | Total: 13h 30m | Avg:  9m 18s | Max:  1h 11m | Hits:  46%/10532 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 21m 10s | Avg:  5m 17s | Max:  5m 28s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  2h 05m | Avg:  8m 21s | Max:  1h 00m | Hits:  58%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 16m 08s | Avg:  5m 22s | Max:  5m 57s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 27m | Avg: 51m 52s | Max: 57m 39s
      🟩 nvcc12.6           Pass: 100%/83  | Total: 13h 08m | Avg:  9m 30s | Max:  1h 11m | Hits:  46%/10532 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 21m 10s | Avg:  5m 17s | Max:  5m 28s
      🟩 nvcc               Pass: 100%/105 | Total: 18h 57m | Avg: 10m 50s | Max:  1h 11m | Hits:  49%/13165 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 33m 39s | Avg:  5m 36s | Max:  6m 30s
      🟩 Clang10            Pass: 100%/3   | Total: 20m 08s | Avg:  6m 42s | Max:  7m 02s
      🟩 Clang11            Pass: 100%/4   | Total: 21m 43s | Avg:  5m 25s | Max:  5m 47s
      🟩 Clang12            Pass: 100%/4   | Total: 22m 13s | Avg:  5m 33s | Max:  5m 59s
      🟩 Clang13            Pass: 100%/4   | Total: 22m 13s | Avg:  5m 33s | Max:  6m 01s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 13s | Avg:  5m 18s | Max:  5m 39s
      🟩 Clang15            Pass: 100%/4   | Total: 22m 39s | Avg:  5m 39s | Max:  6m 04s
      🟩 Clang16            Pass: 100%/4   | Total: 22m 20s | Avg:  5m 35s | Max:  6m 12s
      🟩 Clang17            Pass: 100%/4   | Total: 22m 30s | Avg:  5m 37s | Max:  6m 02s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 08m | Avg:  6m 14s | Max: 14m 44s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 47s | Avg:  4m 23s | Max:  4m 39s
      🟩 GCC7               Pass: 100%/6   | Total: 29m 12s | Avg:  4m 52s | Max:  5m 52s
      🟩 GCC8               Pass: 100%/6   | Total: 29m 32s | Avg:  4m 55s | Max:  5m 37s
      🟩 GCC9               Pass: 100%/6   | Total: 31m 52s | Avg:  5m 18s | Max:  6m 31s
      🟩 GCC10              Pass: 100%/4   | Total: 21m 37s | Avg:  5m 24s | Max:  5m 56s
      🟩 GCC11              Pass: 100%/7   | Total: 39m 41s | Avg:  5m 40s | Max:  6m 34s
      🟩 GCC12              Pass: 100%/4   | Total: 24m 25s | Avg:  6m 06s | Max:  6m 55s
      🟩 GCC13              Pass: 100%/14  | Total:  1h 36m | Avg:  6m 52s | Max: 18m 09s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 54m | Avg: 38m 18s | Max: 43m 25s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  58%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 03m | Hits:  27%/5266  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 37m | Avg: 48m 31s | Max:  1h 11m | Hits:  65%/5266  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 27m | Avg: 51m 52s | Max: 57m 39s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 37m | Avg:  5m 46s | Max: 14m 44s
      🟩 GCC                Pass: 100%/49  | Total:  4h 41m | Avg:  5m 44s | Max: 18m 09s
      🟩 Intel              Pass: 100%/3   | Total:  1h 54m | Avg: 38m 18s | Max: 43m 25s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 37m | Avg: 55m 33s | Max:  1h 11m | Hits:  49%/13165 
      🟩 NVHPC              Pass: 100%/4   | Total:  3h 27m | Avg: 51m 52s | Max: 57m 39s
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total: 19h 18m | Avg: 10m 37s | Max:  1h 11m | Hits:  49%/13165 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 17h 45m | Avg: 10m 26s | Max:  1h 11m | Hits:  36%/10532 
      🟩 TestCPU            Pass: 100%/4   | Total: 48m 24s | Avg: 12m 06s | Max: 25m 24s | Hits:  99%/2633  
      🟩 TestGPU            Pass: 100%/3   | Total: 45m 14s | Avg: 15m 04s | Max: 18m 09s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 16m 08s | Avg:  5m 22s | Max:  5m 57s
      🟩 90a                Pass: 100%/4   | Total: 18m 23s | Avg:  4m 35s | Max:  5m 09s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  3h 40m | Avg:  7m 20s | Max: 39m 53s
      🟩 14                 Pass: 100%/29  | Total:  5h 46m | Avg: 11m 56s | Max:  1h 00m | Hits:  41%/5266  
      🟩 17                 Pass: 100%/27  | Total:  4h 58m | Avg: 11m 03s | Max:  1h 03m | Hits:  30%/2633  
      🟩 20                 Pass: 100%/23  | Total:  4h 54m | Avg: 12m 47s | Max:  1h 11m | Hits:  65%/5266  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 49s | Avg: 5m 24s | Max: 8m 32s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 49s | Avg:  5m 24s | Max:  8m 32s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 17s | Avg:  2m 17s | Max:  2m 17s
      🟩 Test               Pass: 100%/1   | Total:  8m 32s | Avg:  8m 32s | Max:  8m 32s
    
  • 🟩 python: Pass: 100%/1 | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 17m 27s | Avg: 17m 27s | Max: 17m 27s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 222)

# Runner
184 linux-amd64-cpu16
16 linux-arm64-cpu16
13 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16

}

// overload for any iterator that is not a pointer, do nothing
template <int, typename It, ::cuda::std::__enable_if_t<!::cuda::std::is_pointer<It>::value, int> = 0>
Collaborator

Suggestion: We could use cuda::std::contiguous_iterator in C++17 onwards.

Definitely something for a followup

Contributor Author

Added a note.
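
For readers following this thread, here is a minimal host-only sketch of the dispatch pattern under discussion: one overload enabled for raw pointers and one no-op overload for everything else. It is an illustration, not the PR's code. The name `prefetch_tile_sketch`, the `bool` return value, and the use of `std::` traits in place of `::cuda::std::` are all assumptions made so the example compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <list>
#include <type_traits>

// Hypothetical stand-in for the prefetch helper: selected only for raw pointers.
// It returns true so that callers can observe which overload was picked.
template <int BlockDim, typename It, std::enable_if_t<std::is_pointer<It>::value, int> = 0>
bool prefetch_tile_sketch(It /*begin*/, int /*tile_size*/)
{
  // A real kernel would issue prefetch instructions here; this sketch only reports the dispatch.
  return true;
}

// Overload for any iterator that is not a pointer: do nothing.
template <int, typename It, std::enable_if_t<!std::is_pointer<It>::value, int> = 0>
bool prefetch_tile_sketch(It, int)
{
  return false;
}

// Small runtime check of the dispatch.
inline void check_dispatch_sketch()
{
  int data[4] = {};
  std::list<int> nodes(4);
  assert(prefetch_tile_sketch<128>(&data[0], 4));       // raw pointer: prefetch path
  assert(!prefetch_tile_sketch<128>(nodes.begin(), 4)); // class-type iterator: no-op path
}
```

Because the check is `is_pointer`, any class-type iterator takes the no-op path, including one that happens to refer to contiguous memory. That is the gap the `contiguous_iterator` suggestion above would close in a follow-up.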

Comment on lines +958 to +973
auto determine_config = [&]() -> PoorExpected<prefetch_config> {
  int max_occupancy = 0;
  const auto error = CubDebug(MaxSmOccupancy(max_occupancy, CUB_DETAIL_TRANSFORM_KERNEL_PTR, block_dim, 0));
  if (error != cudaSuccess)
  {
    return error;
  }
  const auto sm_count = get_sm_count();
  if (!sm_count)
  {
    return sm_count.error;
  }
  return prefetch_config{max_occupancy, *sm_count};
};
Collaborator

Could this be a struct rather than a lambda? AFAIK the only thing it needs is block_dim, which is a static property.

Contributor Author

Discussed offline. We could move the lambda outside the surrounding function, but determined it's more a question of style than correctness. We then decided to leave it as is to stay consistent with the way the ublkcp kernel configuration is written.
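
To make the alternative from this thread concrete, the following host-only sketch writes the same error-propagating logic as a free function instead of a lambda. It is only an illustration of the style question, not the merged code. `poor_expected_sketch` is modelled on how `PoorExpected` is used in the snippet above (value-or-error, `operator bool`, `operator*`, public `error` member), and `fake_error`, `query_max_occupancy`, and `query_sm_count` are hypothetical stand-ins for `cudaError_t`, `MaxSmOccupancy`, and `get_sm_count`.

```cpp
#include <cassert>

// Stand-in for cudaError_t so the sketch compiles without the CUDA toolkit (assumption).
enum class fake_error { success = 0, invalid_value = 1 };

// Minimal expected-like wrapper: holds either a value or an error code.
template <typename T>
struct poor_expected_sketch
{
  T value{};
  fake_error error = fake_error::success;

  poor_expected_sketch(T v) : value(v) {}
  poor_expected_sketch(fake_error e) : error(e) {}

  explicit operator bool() const { return error == fake_error::success; }
  const T& operator*() const { return value; }
};

struct prefetch_config_sketch
{
  int max_occupancy;
  int sm_count;
};

// Hypothetical queries; the `fail` flags only exist to exercise the error paths.
inline poor_expected_sketch<int> query_max_occupancy(bool fail)
{
  if (fail)
  {
    return fake_error::invalid_value;
  }
  return 8;
}

inline poor_expected_sketch<int> query_sm_count(bool fail)
{
  if (fail)
  {
    return fake_error::invalid_value;
  }
  return 132;
}

// Free-function form of the configuration step: the first failing query's error is returned.
inline poor_expected_sketch<prefetch_config_sketch> determine_config_sketch(bool fail_occupancy, bool fail_sm_count)
{
  const auto max_occupancy = query_max_occupancy(fail_occupancy);
  if (!max_occupancy)
  {
    return max_occupancy.error;
  }
  const auto sm_count = query_sm_count(fail_sm_count);
  if (!sm_count)
  {
    return sm_count.error;
  }
  return prefetch_config_sketch{*max_occupancy, *sm_count};
}
```

Either form propagates errors the same way, which matches the conclusion above that the choice is a matter of style rather than correctness.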

Contributor

🟩 CI finished in 1h 00m: Pass: 100%/222 | Total: 1d 11h | Avg: 9m 35s | Max: 59m 35s | Hits: 91%/16089
  • 🟩 cub: Pass: 100%/110 | Total: 20h 04m | Avg: 10m 57s | Max: 59m 35s | Hits: 97%/2924

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total: 15h 53m | Avg:  9m 20s | Max: 57m 57s | Hits:  97%/2924  
      🟩 arm64              Pass: 100%/8   | Total:  4h 11m | Avg: 31m 23s | Max: 59m 35s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 14m | Avg:  4m 59s | Max: 12m 38s | Hits:  97%/731   
      🟩 11.8               Pass: 100%/3   | Total: 17m 11s | Avg:  5m 43s | Max:  5m 57s
      🟩 12.5               Pass: 100%/4   | Total: 36m 55s | Avg:  9m 13s | Max: 10m 02s
      🟩 12.6               Pass: 100%/88  | Total: 17h 55m | Avg: 12m 13s | Max: 59m 35s | Hits:  97%/2193  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 17m 41s | Avg:  4m 25s | Max:  4m 28s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 14m | Avg:  4m 59s | Max: 12m 38s | Hits:  97%/731   
      🟩 nvcc11.8           Pass: 100%/3   | Total: 17m 11s | Avg:  5m 43s | Max:  5m 57s
      🟩 nvcc12.5           Pass: 100%/4   | Total: 36m 55s | Avg:  9m 13s | Max: 10m 02s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 17h 37m | Avg: 12m 35s | Max: 59m 35s | Hits:  97%/2193  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 17m 41s | Avg:  4m 25s | Max:  4m 28s
      🟩 nvcc               Pass: 100%/106 | Total: 19h 46m | Avg: 11m 11s | Max: 59m 35s | Hits:  97%/2924  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 32m 47s | Avg:  5m 27s | Max:  6m 19s
      🟩 Clang10            Pass: 100%/3   | Total: 18m 56s | Avg:  6m 18s | Max:  6m 27s
      🟩 Clang11            Pass: 100%/4   | Total: 21m 26s | Avg:  5m 21s | Max:  5m 30s
      🟩 Clang12            Pass: 100%/4   | Total: 22m 14s | Avg:  5m 33s | Max:  5m 39s
      🟩 Clang13            Pass: 100%/4   | Total: 21m 09s | Avg:  5m 17s | Max:  5m 35s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 54s | Avg:  5m 28s | Max:  5m 55s
      🟩 Clang15            Pass: 100%/4   | Total: 22m 29s | Avg:  5m 37s | Max:  5m 56s
      🟩 Clang16            Pass: 100%/4   | Total: 21m 49s | Avg:  5m 27s | Max:  5m 42s
      🟩 Clang17            Pass: 100%/4   | Total: 22m 46s | Avg:  5m 41s | Max:  5m 58s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 31m | Avg:  8m 18s | Max: 26m 52s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 46s | Avg:  4m 23s | Max:  4m 33s
      🟩 GCC7               Pass: 100%/6   | Total: 29m 19s | Avg:  4m 53s | Max:  5m 29s
      🟩 GCC8               Pass: 100%/6   | Total: 29m 39s | Avg:  4m 56s | Max:  5m 47s
      🟩 GCC9               Pass: 100%/6   | Total: 29m 19s | Avg:  4m 53s | Max:  5m 40s
      🟩 GCC10              Pass: 100%/4   | Total: 22m 00s | Avg:  5m 30s | Max:  6m 03s
      🟩 GCC11              Pass: 100%/7   | Total: 38m 44s | Avg:  5m 32s | Max:  5m 57s
      🟩 GCC12              Pass: 100%/4   | Total: 22m 49s | Avg:  5m 42s | Max:  6m 01s
      🟩 GCC13              Pass: 100%/16  | Total:  7h 50m | Avg: 29m 25s | Max: 59m 35s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 52m | Avg: 57m 25s | Max: 57m 57s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 12m 38s | Avg: 12m 38s | Max: 12m 38s | Hits:  97%/731   
      🟩 MSVC14.29          Pass: 100%/2   | Total: 23m 12s | Avg: 11m 36s | Max: 11m 39s | Hits:  97%/1462  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 11s | Avg: 11m 11s | Max: 11m 11s | Hits:  97%/731   
      🟩 NVHPC24.7          Pass: 100%/4   | Total: 36m 55s | Avg:  9m 13s | Max: 10m 02s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 56m | Avg:  6m 11s | Max: 26m 52s
      🟩 GCC                Pass: 100%/51  | Total: 10h 51m | Avg: 12m 46s | Max: 59m 35s
      🟩 Intel              Pass: 100%/3   | Total:  2h 52m | Avg: 57m 25s | Max: 57m 57s
      🟩 MSVC               Pass: 100%/4   | Total: 47m 01s | Avg: 11m 45s | Max: 12m 38s | Hits:  97%/2924  
      🟩 NVHPC              Pass: 100%/4   | Total: 36m 55s | Avg:  9m 13s | Max: 10m 02s
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total: 20h 04m | Avg: 10m 57s | Max: 59m 35s | Hits:  97%/2924  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 16h 58m | Avg:  9m 59s | Max: 59m 35s | Hits:  97%/2924  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 21m 02s | Avg: 21m 02s | Max: 21m 02s
      🟩 GraphCapture       Pass: 100%/1   | Total: 22m 22s | Avg: 22m 22s | Max: 22m 22s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 01m | Avg: 20m 30s | Max: 21m 15s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 20m | Avg: 26m 58s | Max: 30m 25s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 17m 11s | Avg:  5m 43s | Max:  5m 57s
      🟩 90a                Pass: 100%/4   | Total:  1h 30m | Avg: 22m 31s | Max: 23m 50s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  5h 16m | Avg: 10m 33s | Max: 57m 14s
      🟩 14                 Pass: 100%/29  | Total:  4h 52m | Avg: 10m 04s | Max: 57m 57s | Hits:  97%/1462  
      🟩 17                 Pass: 100%/27  | Total:  4h 40m | Avg: 10m 24s | Max: 57m 57s | Hits:  97%/731   
      🟩 20                 Pass: 100%/24  | Total:  5h 14m | Avg: 13m 06s | Max: 59m 35s | Hits:  97%/731   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 14h 54m | Avg: 8m 12s | Max: 42m 02s | Hits: 89%/13165

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total: 12h 03m | Avg:  7m 09s | Max: 35m 58s | Hits:  89%/13165 
      🟩 arm64              Pass: 100%/8   | Total:  2h 51m | Avg: 21m 25s | Max: 42m 02s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 18m | Avg:  5m 12s | Max: 16m 38s | Hits:  99%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  6m 01s
      🟩 12.5               Pass: 100%/4   | Total:  1h 05m | Avg: 16m 23s | Max: 17m 38s
      🟩 12.6               Pass: 100%/87  | Total: 12h 14m | Avg:  8m 26s | Max: 42m 02s | Hits:  87%/10532 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 20m 04s | Avg:  5m 01s | Max:  5m 16s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 18m | Avg:  5m 12s | Max: 16m 38s | Hits:  99%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  6m 01s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  1h 05m | Avg: 16m 23s | Max: 17m 38s
      🟩 nvcc12.6           Pass: 100%/83  | Total: 11h 54m | Avg:  8m 36s | Max: 42m 02s | Hits:  87%/10532 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 20m 04s | Avg:  5m 01s | Max:  5m 16s
      🟩 nvcc               Pass: 100%/105 | Total: 14h 34m | Avg:  8m 19s | Max: 42m 02s | Hits:  89%/13165 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 33m 50s | Avg:  5m 38s | Max:  7m 01s
      🟩 Clang10            Pass: 100%/3   | Total: 19m 20s | Avg:  6m 26s | Max:  6m 32s
      🟩 Clang11            Pass: 100%/4   | Total: 20m 55s | Avg:  5m 13s | Max:  5m 32s
      🟩 Clang12            Pass: 100%/4   | Total: 20m 52s | Avg:  5m 13s | Max:  5m 53s
      🟩 Clang13            Pass: 100%/4   | Total: 21m 33s | Avg:  5m 23s | Max:  5m 46s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 23s | Avg:  5m 20s | Max:  5m 39s
      🟩 Clang15            Pass: 100%/4   | Total: 21m 14s | Avg:  5m 18s | Max:  5m 38s
      🟩 Clang16            Pass: 100%/4   | Total: 22m 25s | Avg:  5m 36s | Max:  5m 59s
      🟩 Clang17            Pass: 100%/4   | Total: 21m 58s | Avg:  5m 29s | Max:  5m 37s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 05m | Avg:  5m 54s | Max: 13m 25s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 14s | Avg:  4m 07s | Max:  4m 23s
      🟩 GCC7               Pass: 100%/6   | Total: 27m 40s | Avg:  4m 36s | Max:  5m 09s
      🟩 GCC8               Pass: 100%/6   | Total: 28m 26s | Avg:  4m 44s | Max:  5m 12s
      🟩 GCC9               Pass: 100%/6   | Total: 29m 11s | Avg:  4m 51s | Max:  5m 28s
      🟩 GCC10              Pass: 100%/4   | Total: 21m 48s | Avg:  5m 27s | Max:  5m 43s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 08m | Avg:  9m 50s | Max: 35m 58s
      🟩 GCC12              Pass: 100%/4   | Total: 22m 59s | Avg:  5m 44s | Max:  6m 18s
      🟩 GCC13              Pass: 100%/14  | Total:  3h 48m | Avg: 16m 21s | Max: 42m 02s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 20m 54s | Avg:  6m 58s | Max:  7m 35s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s | Hits:  99%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 49m 08s | Avg: 24m 34s | Max: 24m 40s | Hits:  75%/5266  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 37m 33s | Avg: 18m 46s | Max: 20m 07s | Hits:  99%/5266  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  1h 05m | Avg: 16m 23s | Max: 17m 38s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 28m | Avg:  5m 35s | Max: 13m 25s
      🟩 GCC                Pass: 100%/49  | Total:  7h 16m | Avg:  8m 54s | Max: 42m 02s
      🟩 Intel              Pass: 100%/3   | Total: 20m 54s | Avg:  6m 58s | Max:  7m 35s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 43m | Avg: 20m 39s | Max: 24m 40s | Hits:  89%/13165 
      🟩 NVHPC              Pass: 100%/4   | Total:  1h 05m | Avg: 16m 23s | Max: 17m 38s
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total: 14h 54m | Avg:  8m 12s | Max: 42m 02s | Hits:  89%/13165 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 13h 27m | Avg:  7m 55s | Max: 42m 02s | Hits:  87%/10532 
      🟩 TestCPU            Pass: 100%/4   | Total: 42m 52s | Avg: 10m 43s | Max: 20m 07s | Hits:  99%/2633  
      🟩 TestGPU            Pass: 100%/3   | Total: 44m 03s | Avg: 14m 41s | Max: 15m 34s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  6m 01s
      🟩 90a                Pass: 100%/4   | Total: 18m 40s | Avg:  4m 40s | Max:  4m 47s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  3h 17m | Avg:  6m 35s | Max: 31m 27s
      🟩 14                 Pass: 100%/29  | Total:  4h 17m | Avg:  8m 53s | Max: 38m 26s | Hits:  87%/5266  
      🟩 17                 Pass: 100%/27  | Total:  3h 35m | Avg:  7m 59s | Max: 40m 54s | Hits:  75%/2633  
      🟩 20                 Pass: 100%/23  | Total:  3h 42m | Avg:  9m 41s | Max: 42m 02s | Hits:  99%/5266  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 58s | Avg: 5m 59s | Max: 9m 47s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  9m 47s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 11s | Avg:  2m 11s | Max:  2m 11s
      🟩 Test               Pass: 100%/1   | Total:  9m 47s | Avg:  9m 47s | Max:  9m 47s
    
  • 🟩 python: Pass: 100%/1 | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 44s | Avg: 16m 44s | Max: 16m 44s
    

👃 Inspect Changes

Modifications in project?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
|     | Thrust                  |
|     | CUDA Experimental       |
|     | python                  |
|     | CCCL C Parallel Library |
|     | Catch2Helper            |

Modifications in project or dependencies?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
|     | CUDA Experimental       |
| +/- | python                  |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper            |

🏃‍ Runner counts (total jobs: 222)

| #   | Runner                        |
|-----|-------------------------------|
| 184 | linux-amd64-cpu16             |
| 16  | linux-arm64-cpu16             |
| 13  | linux-amd64-gpu-v100-latest-1 |
| 9   | windows-amd64-cpu16           |

@bernhardmgruber enabled auto-merge (squash) on October 30, 2024 12:15

🟩 CI finished in 1h 37m: Pass: 100%/222 | Total: 1d 20h | Avg: 11m 57s | Max: 1h 18m | Hits: 22%/16089
  • 🟩 cub: Pass: 100%/110 | Total: 22h 06m | Avg: 12m 03s | Max: 1h 09m | Hits: 0%/2924

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total: 21h 24m | Avg: 12m 35s | Max:  1h 09m | Hits:   0%/2924  
      🟩 arm64              Pass: 100%/8   | Total: 41m 20s | Avg:  5m 10s | Max:  5m 31s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  2h 03m | Avg:  8m 13s | Max:  1h 00m | Hits:   1%/731   
      🟩 11.8               Pass: 100%/3   | Total: 16m 42s | Avg:  5m 34s | Max:  6m 11s
      🟩 12.5               Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 09m
      🟩 12.6               Pass: 100%/88  | Total: 15h 28m | Avg: 10m 33s | Max:  1h 06m | Hits:   0%/2193  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 17m 59s | Avg:  4m 29s | Max:  4m 37s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  2h 03m | Avg:  8m 13s | Max:  1h 00m | Hits:   1%/731   
      🟩 nvcc11.8           Pass: 100%/3   | Total: 16m 42s | Avg:  5m 34s | Max:  6m 11s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 09m
      🟩 nvcc12.6           Pass: 100%/84  | Total: 15h 10m | Avg: 10m 50s | Max:  1h 06m | Hits:   0%/2193  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 17m 59s | Avg:  4m 29s | Max:  4m 37s
      🟩 nvcc               Pass: 100%/106 | Total: 21h 48m | Avg: 12m 20s | Max:  1h 09m | Hits:   0%/2924  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 31m 04s | Avg:  5m 10s | Max:  6m 06s
      🟩 Clang10            Pass: 100%/3   | Total: 19m 19s | Avg:  6m 26s | Max:  6m 48s
      🟩 Clang11            Pass: 100%/4   | Total: 20m 55s | Avg:  5m 13s | Max:  5m 36s
      🟩 Clang12            Pass: 100%/4   | Total: 21m 32s | Avg:  5m 23s | Max:  5m 36s
      🟩 Clang13            Pass: 100%/4   | Total: 21m 17s | Avg:  5m 19s | Max:  5m 33s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 31s | Avg:  5m 22s | Max:  5m 28s
      🟩 Clang15            Pass: 100%/4   | Total: 21m 09s | Avg:  5m 17s | Max:  5m 40s
      🟩 Clang16            Pass: 100%/4   | Total: 20m 58s | Avg:  5m 14s | Max:  5m 39s
      🟩 Clang17            Pass: 100%/4   | Total: 21m 11s | Avg:  5m 17s | Max:  5m 37s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 31m | Avg:  8m 21s | Max: 24m 55s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 35s | Avg:  4m 17s | Max:  4m 20s
      🟩 GCC7               Pass: 100%/6   | Total: 28m 08s | Avg:  4m 41s | Max:  4m 59s
      🟩 GCC8               Pass: 100%/6   | Total: 29m 05s | Avg:  4m 50s | Max:  5m 09s
      🟩 GCC9               Pass: 100%/6   | Total: 29m 08s | Avg:  4m 51s | Max:  5m 15s
      🟩 GCC10              Pass: 100%/4   | Total: 21m 25s | Avg:  5m 21s | Max:  5m 31s
      🟩 GCC11              Pass: 100%/7   | Total: 38m 19s | Avg:  5m 28s | Max:  6m 11s
      🟩 GCC12              Pass: 100%/4   | Total: 21m 42s | Avg:  5m 25s | Max:  5m 50s
      🟩 GCC13              Pass: 100%/16  | Total:  2h 54m | Avg: 10m 55s | Max: 31m 36s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 54m | Avg: 58m 15s | Max:  1h 00m
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:   1%/731   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 06m | Hits:   0%/1462  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:   0%/731   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 09m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 50m | Avg:  6m 03s | Max: 24m 55s
      🟩 GCC                Pass: 100%/51  | Total:  5h 51m | Avg:  6m 53s | Max: 31m 36s
      🟩 Intel              Pass: 100%/3   | Total:  2h 54m | Avg: 58m 15s | Max:  1h 00m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 11m | Avg:  1h 02m | Max:  1h 06m | Hits:   0%/2924  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 09m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total: 22h 06m | Avg: 12m 03s | Max:  1h 09m | Hits:   0%/2924  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 19h 12m | Avg: 11m 17s | Max:  1h 09m | Hits:   0%/2924  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 58s | Avg: 18m 58s | Max: 18m 58s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 20s | Avg: 19m 20s | Max: 19m 20s
      🟩 HostLaunch         Pass: 100%/3   | Total: 57m 38s | Avg: 19m 12s | Max: 23m 12s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 17m | Avg: 25m 57s | Max: 31m 36s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 16m 42s | Avg:  5m 34s | Max:  6m 11s
      🟩 90a                Pass: 100%/4   | Total: 17m 03s | Avg:  4m 15s | Max:  4m 18s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  4h 44m | Avg:  9m 29s | Max:  1h 00m
      🟩 14                 Pass: 100%/29  | Total:  6h 24m | Avg: 13m 16s | Max:  1h 09m | Hits:   0%/1462  
      🟩 17                 Pass: 100%/27  | Total:  5h 08m | Avg: 11m 25s | Max:  1h 03m | Hits:   0%/731   
      🟩 20                 Pass: 100%/24  | Total:  5h 47m | Avg: 14m 28s | Max:  1h 04m | Hits:   0%/731   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 21h 41m | Avg: 11m 56s | Max: 1h 18m | Hits: 26%/13165

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total: 21h 02m | Avg: 12m 30s | Max:  1h 18m | Hits:  26%/13165 
      🟩 arm64              Pass: 100%/8   | Total: 39m 17s | Avg:  4m 54s | Max:  5m 16s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  2h 19m | Avg:  9m 18s | Max:  1h 17m | Hits:   8%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 15m 27s | Avg:  5m 09s | Max:  5m 33s
      🟩 12.5               Pass: 100%/4   | Total:  4h 47m | Avg:  1h 11m | Max:  1h 17m
      🟩 12.6               Pass: 100%/87  | Total: 14h 19m | Avg:  9m 52s | Max:  1h 18m | Hits:  31%/10532 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 19m 50s | Avg:  4m 57s | Max:  5m 27s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  2h 19m | Avg:  9m 18s | Max:  1h 17m | Hits:   8%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 15m 27s | Avg:  5m 09s | Max:  5m 33s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 47m | Avg:  1h 11m | Max:  1h 17m
      🟩 nvcc12.6           Pass: 100%/83  | Total: 13h 59m | Avg: 10m 06s | Max:  1h 18m | Hits:  31%/10532 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 19m 50s | Avg:  4m 57s | Max:  5m 27s
      🟩 nvcc               Pass: 100%/105 | Total: 21h 21m | Avg: 12m 12s | Max:  1h 18m | Hits:  26%/13165 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 33m 53s | Avg:  5m 38s | Max:  7m 16s
      🟩 Clang10            Pass: 100%/3   | Total: 20m 11s | Avg:  6m 43s | Max:  7m 01s
      🟩 Clang11            Pass: 100%/4   | Total: 21m 06s | Avg:  5m 16s | Max:  5m 34s
      🟩 Clang12            Pass: 100%/4   | Total: 20m 55s | Avg:  5m 13s | Max:  5m 35s
      🟩 Clang13            Pass: 100%/4   | Total: 21m 21s | Avg:  5m 20s | Max:  5m 42s
      🟩 Clang14            Pass: 100%/4   | Total: 21m 20s | Avg:  5m 20s | Max:  5m 51s
      🟩 Clang15            Pass: 100%/4   | Total: 21m 45s | Avg:  5m 26s | Max:  5m 39s
      🟩 Clang16            Pass: 100%/4   | Total: 21m 38s | Avg:  5m 24s | Max:  5m 39s
      🟩 Clang17            Pass: 100%/4   | Total: 22m 40s | Avg:  5m 40s | Max:  6m 10s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 06m | Avg:  6m 00s | Max: 14m 26s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 28s | Avg:  4m 14s | Max:  4m 42s
      🟩 GCC7               Pass: 100%/6   | Total: 27m 29s | Avg:  4m 34s | Max:  4m 57s
      🟩 GCC8               Pass: 100%/6   | Total: 28m 55s | Avg:  4m 49s | Max:  5m 31s
      🟩 GCC9               Pass: 100%/6   | Total: 29m 28s | Avg:  4m 54s | Max:  5m 42s
      🟩 GCC10              Pass: 100%/4   | Total: 20m 57s | Avg:  5m 14s | Max:  6m 01s
      🟩 GCC11              Pass: 100%/7   | Total: 37m 37s | Avg:  5m 22s | Max:  6m 32s
      🟩 GCC12              Pass: 100%/4   | Total: 27m 36s | Avg:  6m 54s | Max: 10m 36s
      🟩 GCC13              Pass: 100%/14  | Total:  1h 30m | Avg:  6m 29s | Max: 15m 27s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 37m | Avg: 52m 23s | Max: 59m 49s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 17m | Avg:  1h 17m | Max:  1h 17m | Hits:   8%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 23m | Avg:  1h 11m | Max:  1h 18m | Hits:  12%/5266  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 33m | Avg: 46m 57s | Max:  1h 11m | Hits:  50%/5266  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 47m | Avg:  1h 11m | Max:  1h 17m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 30m | Avg:  5m 38s | Max: 14m 26s
      🟩 GCC                Pass: 100%/49  | Total:  4h 31m | Avg:  5m 32s | Max: 15m 27s
      🟩 Intel              Pass: 100%/3   | Total:  2h 37m | Avg: 52m 23s | Max: 59m 49s
      🟩 MSVC               Pass: 100%/5   | Total:  5h 14m | Avg:  1h 02m | Max:  1h 18m | Hits:  26%/13165 
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 47m | Avg:  1h 11m | Max:  1h 17m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total: 21h 41m | Avg: 11m 56s | Max:  1h 18m | Hits:  26%/13165 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 20h 16m | Avg: 11m 55s | Max:  1h 18m | Hits:   8%/10532 
      🟩 TestCPU            Pass: 100%/4   | Total: 44m 08s | Avg: 11m 02s | Max: 22m 16s | Hits:  99%/2633  
      🟩 TestGPU            Pass: 100%/3   | Total: 41m 16s | Avg: 13m 45s | Max: 15m 27s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 15m 27s | Avg:  5m 09s | Max:  5m 33s
      🟩 90a                Pass: 100%/4   | Total: 17m 58s | Avg:  4m 29s | Max:  4m 51s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  4h 11m | Avg:  8m 23s | Max:  1h 03m
      🟩 14                 Pass: 100%/29  | Total:  6h 35m | Avg: 13m 37s | Max:  1h 17m | Hits:   4%/5266  
      🟩 17                 Pass: 100%/27  | Total:  5h 46m | Avg: 12m 49s | Max:  1h 18m | Hits:  24%/2633  
      🟩 20                 Pass: 100%/23  | Total:  5h 08m | Avg: 13m 24s | Max:  1h 17m | Hits:  50%/5266  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 50s | Avg: 4m 25s | Max: 6m 45s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 50s | Avg:  4m 25s | Max:  6m 45s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 05s | Avg:  2m 05s | Max:  2m 05s
      🟩 Test               Pass: 100%/1   | Total:  6m 45s | Avg:  6m 45s | Max:  6m 45s
    
  • 🟩 python: Pass: 100%/1 | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 19m 44s | Avg: 19m 44s | Max: 19m 44s
    

👃 Inspect Changes

Modifications in project?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
|     | Thrust                  |
|     | CUDA Experimental       |
|     | python                  |
|     | CCCL C Parallel Library |
|     | Catch2Helper            |

Modifications in project or dependencies?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
|     | CUDA Experimental       |
| +/- | python                  |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper            |

🏃‍ Runner counts (total jobs: 222)

| #   | Runner                        |
|-----|-------------------------------|
| 184 | linux-amd64-cpu16             |
| 16  | linux-arm64-cpu16             |
| 13  | linux-amd64-gpu-v100-latest-1 |
| 9   | windows-amd64-cpu16           |

@bernhardmgruber merged commit 2f05ef3 into NVIDIA:main on Oct 30, 2024
236 checks passed
@bernhardmgruber deleted the transform_prefetch branch on October 30, 2024 13:31
fbusato pushed a commit to fbusato/cccl that referenced this pull request Nov 5, 2024
Labels
cub For all items related to CUB
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Re-evaluate the prefetching fallback kernel and compare it with DeviceFor as fallback
4 participants