feat: add reduce kernels #3136
Conversation
@ManasviGoyal in trying to implement query 4 from the analysis benchmarks, I found some nasty memory scaling. If you scroll to the bottom of the trace below, you'll see what happens when attempting to execute:

This is processing ~53M rows from the input file all at once; the data fit in the GPU itself with no problem, and the same goes for the histogram being filled. For this test I merged #3123, #3142, and this PR (merged last) on top of awkward main. However, for the ak.sum where the calculation fails, it is attempting to allocate 71 terabytes of RAM on the device. This seems excessive and is indicative of poor memory-scaling properties in the implementation. You'll see that this fails in the

Here's the full stack trace:
Hi, I am still working on these kernels and need to fix a few things. I will update once I am done with this PR. The issue is most likely due to the use of
No worries - just reporting what I'm finding with things as they are. Thanks!
Yes, it's very helpful, since I can only test a limited number of cases, so knowing how it works for actual data helps in identifying the issues. Thanks! I will keep you updated.
The change to an accumulator with atomics certainly fixed the memory issue, though it's a bit slower than I expected for a sum: ~250 MHz throughput for summing bools into int64. As an optimization for sums over the last dimension, couldn't you write this without atomics or any race conditions by having each thread sum over the last dimension into an array of one fewer dimension? Or is the thread divergence too bad, so that atomics are still faster? With the atomic implementation you're guaranteed to have access contention, because every element hits the same output position to make the sum. I don't have good intuition for whether that's better or worse than thread divergence. @jpivarski maybe?
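For concreteness, here is a minimal numba.cuda sketch of the two strategies being discussed, not the PR's actual kernels: it assumes a flattened values buffer with a hypothetical `parents` index per element (for the atomic version) and `offsets` per list (for the per-thread version).

```python
import numpy as np
from numba import cuda


@cuda.jit
def reduce_sum_atomic(values, parents, out):
    # One thread per input element; every element of the same list
    # contends for the same output slot via an atomic add.
    i = cuda.grid(1)
    if i < values.shape[0]:
        cuda.atomic.add(out, parents[i], values[i])


@cuda.jit
def reduce_sum_per_list(values, offsets, out):
    # One thread per output list; each thread walks its own slice,
    # so there are no atomics and no write contention, at the cost of
    # divergence when list lengths differ a lot.
    j = cuda.grid(1)
    if j < out.shape[0]:
        acc = 0
        for k in range(offsets[j], offsets[j + 1]):
            acc += values[k]
        out[j] = acc
```

The per-list version would launch only one thread per output list, e.g. `reduce_sum_per_list[(n_lists + 255) // 256, 256](values, offsets, out)` for an assumed `n_lists` outputs, which trades the guaranteed write contention of the atomic version for possible divergence within a warp.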
In any case - with this latest change I've now got query 4 done on the ADL benchmarks. The rest seem to require combinations, so I'll wait for that!
a98503d to 38d314d
I do not see any changes in this PR that would cause the test failures or cause NumPy to use int32 for sum operations on boolean arrays (that depends on the platform, 32-bit vs 64-bit). We could explicitly cast the result of the NumPy sum to int64 to ensure the test doesn't fail:

`numpy_sum = np.sum(array, axis=-1).astype(np.int64)`
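A small illustration of the platform dependence the cast is guarding against (the array contents are illustrative; the int32 case applies on platforms whose default NumPy integer is 32-bit, such as 64-bit Windows):

```python
import numpy as np

mask = np.array([True, False, True])

# np.sum over booleans accumulates in NumPy's default integer type,
# which is int32 on some platforms and int64 on most 64-bit Linux/macOS
# builds; the explicit cast makes the test comparison platform-independent.
print(np.sum(mask).dtype)                   # platform-dependent
print(np.sum(mask).astype(np.int64).dtype)  # int64 everywhere
```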
I'd say merge it - @jpivarski?
I'm checking it with #3158
This is excellent!!! As I understand it, this enables all `axis=-1` reducers, with tests for crossing block boundaries. As we talked about in our meeting, it could have more tests of block-boundary crossing and integration tests (converted from `tests` to `tests-cuda`, particularly `test_0115_generic_reducer_operation.py`).
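A rough sketch of the kind of `tests-cuda` integration test meant here (the array contents, length, and backend round-trip are illustrative assumptions, not the repository's actual tests; running it requires the CUDA backend and CuPy):

```python
import awkward as ak

# Build a ragged array long enough that the reduction spans many thread
# blocks, move it to the GPU, and compare the axis=-1 sum to the CPU result.
cpu = ak.Array([[1, 2, 3], [], [4, 5]] * 100_000)
gpu = ak.to_backend(cpu, "cuda")

cpu_sums = ak.sum(cpu, axis=-1)
gpu_sums = ak.to_backend(ak.sum(gpu, axis=-1), "cpu")

assert cpu_sums.to_list() == gpu_sums.to_list()
```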
The one failing test is unrelated, and it's failing in `main` also, so it isn't a blocker to merging this PR. (We should never introduce failing tests into `main`, but it's already there. @ianna, if you need elevated permissions to bypass the red warning about merging with a failing test, I can give you those permissions.)

What is a blocker, however, is that these need to be tested on more than one GPU. @ianna will be able to test it on Sunday, and it can be merged if it passes on her GPU. I'll be able to test it on Tuesday, and I'll just test it in `main` (after merging).
@ManasviGoyal - all tests pass except for `tests-cuda-kernels/test_cudaawkward_BitMaskedArray_to_ByteMaskedArray.py`, which relies on NumPy. I've added an import. Please double-check. Thanks!
@ManasviGoyal - all tests pass on my local computer with the updated branch that has the `numpy` import! If it works fine on yours, the PR is good to be merged. Thanks.
@ianna All macOS tests are cancelled in the CI, so I am unable to merge.
@jpivarski - I think we need some other macOS node. Run Tests (macos-11, 3.8, x64, full) reports: "This is a scheduled macOS-11 brownout. The macOS-11 environment is deprecated and will be removed on June 28th, 2024."
@jpivarski #3162 adds all the
@ianna and I have both tested with GPUs and all is good, so I'll merge this now. (By merging before June 28, we avoid complications with macOS > 11.)
Kernels tested for different block sizes