
CeedVector/Preconditioning: fix CeedInt loop vars to CeedSize #1241

Merged: 7 commits into main on Jul 7, 2023

Conversation

@jedbrown (Member)

While 32-bit is sufficient for CeedElemRestriction, a Vector is used to store matrix entries and the number of entries can overflow 32-bit even for a small number of dofs. For example, 85k Q3 fluid elements is enough to overflow.

Reported-by: Ken Jansen

@jedbrown jedbrown requested a review from jrwrigh June 23, 2023 02:50
@jeremylt (Member)

Should we update the GPU backend impls for the vec functions too?

@jedbrown (Member Author)

We'll need to, but it's harder to test because most devices aren't big enough to overflow.

@nbeams (Contributor) commented Jun 23, 2023

Adding some notes about HIP/CUDA here, though we can move this discussion to a separate PR if we want to go ahead and merge this one.

I was able to reproduce the issue on MI250X and resolve it by modifying:

  • Functions in ceed-hip-ref-vector.c: these were still using CeedInt for the size even though the "official" type of a CeedVector size is now CeedSize
  • Kernels in hip-ref-vector.hip.cpp (again, length/size still using CeedInt)
  • Kernels for operator assembly and operator linear diagonal assembly (indices)

But there are maybe some minor issues. One is the use of cuBLAS/hipBLAS for the norm routines (CeedVectorNorm_[Hip/Cuda]) -- they only take 32-bit integers for the length (and the index, for I*amax). cuBLAS 12 does have 64-bit integer routines for the norms, but it seems hipBLAS doesn't yet.
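For reference, the 32-bit restriction is visible right in the BLAS signatures; a rough illustration (paraphrased from the hipBLAS and cuBLAS headers, so treat the exact spellings as an assumption):

  // hipBLAS (and pre-12 cuBLAS): the length and increment are plain int,
  // so a CeedVector longer than INT_MAX cannot be passed directly.
  hipblasStatus_t hipblasDnrm2(hipblasHandle_t handle, int n, const double *x, int incx, double *result);

  // cuBLAS 12 adds a 64-bit integer interface (the "_64"-suffixed routines).
  cublasStatus_t cublasDnrm2_64(cublasHandle_t handle, int64_t n, const double *x, int64_t incx, double *result);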

The other potential issue is in simple kernel launches like the ones in hip-ref-vector.hip.cpp, e.g. CeedDeviceSetValue_Hip:

  const int bsize    = 512;
  const CeedInt vecsize  = length;
  int       gridsize = vecsize / bsize;
  
  if (bsize * gridsize < vecsize) gridsize += 1;
  hipLaunchKernelGGL(setValueK, dim3(gridsize), dim3(bsize), 0, 0, d_array, length, val);

vecsize should be promoted to CeedSize, since the length of the vector is already allowed to be bigger than what fits in CeedInt. But we would then be assuming that vecsize / bsize always fits into int (otherwise we would be asking for too big a grid anyway, in addition to the value not fitting in int). For the kernels whose grid size is tied to the number of elements, we've felt pretty confident that no one would have so many local elements that the calculated grid size becomes a concern. But for these vector routines, we've already said Vectors may need to be bigger than int, so how much bigger do we want to allow them to be, technically, when it comes to any more sophisticated launching of these kernels?
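To make that concrete, a promoted version of the snippet above might look like the following (just a sketch, not the code in this PR; the cast and its comment mark the assumption that vecsize / bsize still fits in int):

  const int      bsize    = 512;
  const CeedSize vecsize  = length;  // promoted: the vector length may exceed CeedInt
  CeedSize       gridsize = vecsize / bsize;

  if (bsize * gridsize < vecsize) gridsize += 1;
  // dim3 components are 32-bit, so this cast still assumes vecsize / bsize fits
  // in int; anything larger needs a host-side check or a different launch strategy.
  hipLaunchKernelGGL(setValueK, dim3((CeedInt)gridsize), dim3(bsize), 0, 0, d_array, length, val);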

But realistically, I think, the norm situation is the real concern.

@jedbrown (Member Author)

Can we just check for overflow on the host and error explaining how larger sizes are unsupported in hipBLAS? I don't think it's useful to take the norm of these vectors (basically just arrays of entries with some redundancy), but we should have a useful error if someone tries.
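A minimal sketch of that kind of guard, following the backend's usual error pattern (the surrounding names and message are illustrative, not the code that ended up being merged; INT_MAX is from <limits.h>):

  // Inside CeedVectorNorm_Hip (sketch): reject lengths hipBLAS cannot take.
  CeedSize length;
  Ceed     ceed;
  CeedCallBackend(CeedVectorGetCeed(vec, &ceed));
  CeedCallBackend(CeedVectorGetLength(vec, &length));
  if (length > (CeedSize)INT_MAX) {
    return CeedError(ceed, CEED_ERROR_UNSUPPORTED, "CeedVectorNorm not supported for lengths exceeding INT_MAX with hipBLAS");
  }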

@jedbrown jedbrown requested a review from nbeams June 26, 2023 17:28
@jedbrown (Member Author)

@nbeams Would you like to fold your work into this PR or make it a new PR?

@nbeams (Contributor) commented Jun 26, 2023

Can we just check for overflow on the host and error explaining how larger sizes are unsupported in hipBLAS? I don't think it's useful to take the norm of these vectors (basically just arrays of entries with some redundancy), but we should have a useful error if someone tries.

You mean just by comparing the size to INT_MAX, or something more sophisticated?

I wasn't thrilled with the idea of a fairly "standard" vector action being unavailable for perfectly valid CeedVectors, but I guess that does seem like the fastest way to get it fixed for the fluids app, and we can improve later if we want (maybe adding our own kernels that can handle larger vectors?).

I don't really have a preference for where to put the changes. I'd like to test a few things with the assembly kernels before we officially merge, but I should be able to make it a priority tomorrow.

@jedbrown (Member Author)

The alternative is to have a simple loop around the hipblas calls, bumping the base pointers on each iteration with a length of min(length_remaining, INT_MAX). Those will be plenty big to cover kernel launch latency. For NORM_2, we need to sum the squares from each block.
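Roughly, for the 2-norm that loop could look like this (a sketch only; everything besides the hipblasDnrm2 call itself is illustrative, and it needs <limits.h> for INT_MAX and <math.h> for sqrt):

  double        norm_sq   = 0.0;
  CeedSize      remaining = length;
  const double *ptr       = d_array;
  while (remaining > 0) {
    // hipBLAS takes a 32-bit length, so process at most INT_MAX entries per call.
    const int chunk = (int)(remaining < (CeedSize)INT_MAX ? remaining : (CeedSize)INT_MAX);
    double    partial;
    hipblasDnrm2(handle, chunk, ptr, 1, &partial);
    norm_sq += partial * partial;  // combine the block norms by summing their squares
    ptr += chunk;
    remaining -= chunk;
  }
  *norm = sqrt(norm_sq);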

@nbeams (Contributor) commented Jun 27, 2023

The alternative is to have a simple loop around the hipblas calls, bumping the base pointers on each iteration with a length of min(length_remaining, INT_MAX). Those will be plenty big to cover kernel launch latency. For NORM_2, we need to sum the squares from each block.

Yeah, makes sense. For CUDA, should we also check for CUDA >= 12 and call the 64-bit integer interface if we can? Can we always assume CeedSize <=> int64_t (by which I mean, safe to cast to what CUDA uses), since we would already be getting failures at these vector sizes if CeedSize were only 32-bit on a particular system?
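For the version check, the shape would presumably be something like this (a sketch; the CUDA_VERSION threshold and the casts are assumptions, and the handle/array names are illustrative):

  #if CUDA_VERSION >= 12000
    // cuBLAS 12: the 64-bit integer interface can take the full CeedSize length.
    cublasDnrm2_64(handle, (int64_t)length, d_array, 1, norm);
  #else
    // Older cuBLAS: 32-bit interface, with the length checked or chunked on the host.
    cublasDnrm2(handle, (int)length, d_array, 1, norm);
  #endif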

I did a little playing around on MI250X today. For the linear diagonal assembly kernel in the Q3 fluids example, switching from CeedInt to CeedSize does increase the total register usage slightly (as expected). In this case, it's not enough to change the occupancy of the kernel, though of course it could for other kernels.

I was thinking, for both linear diagonal assembly and operator assembly, we always assume the user has passed in a CeedVector of the correct size for holding the assembled output, right? (And we also know the size of the assembled QFunction when setting up the operator assembly.) Would we want to use that information to pick the integer type in the assembly kernels defined at the time of JIT, so we can use fewer registers when we don't need CeedSize?

E.g. if we add a new compile-time-defined variable to the kernels, like CEEDSIZE, which is set to 1 or 0, then in the kernels do

#if CEEDSIZE
typedef CeedSize IndexType;
#else 
typedef CeedInt IndexType;
#endif 

[...all integer types are IndexType in the kernel itself]

but if that seems too complicated, we can ignore for now and just take the slight hit on the assembly kernels in cases that only need ints.
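On the host side, the choice could then be as simple as something like the following (purely hypothetical, following the proposal above rather than any existing code; the length expression is illustrative):

  // Hypothetical: pick the kernel's index type at JIT time based on whether
  // the assembled output can exceed INT_MAX entries.
  const CeedSize assembled_length = (CeedSize)num_elem * elem_size * num_comp;  // illustrative
  const int      ceedsize_macro   = assembled_length > (CeedSize)INT_MAX ? 1 : 0;
  // ...then pass "-DCEEDSIZE=<ceedsize_macro>" (or equivalent) when compiling the kernel.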

@jedbrown (Member Author)

CeedSize is ptrdiff_t, which is the standard array index type. It's 32-bit on a machine with a 32-bit address space. Not many of those are interesting to HPC, but I can't promise they don't exist.

Choosing loop variable type in JIT seems interesting if we think the performance impact is concerning. The alternative would be to have multiple kernel launches, though I'm not sure that even works for these operations.

@nbeams (Contributor) commented Jun 27, 2023

It's 32-bit on a machine with a 32-bit address space. Not many of those are interesting to HPC, but I can't promise they don't exist.

Right, but I meant that if CeedSize is only 32-bit, then trying to handle cases where the vector length is more than INT_MAX will be doomed from the start (from trying to create a vector with that size), no? Just for the purposes of handling things inside the norm function.

@jedbrown (Member Author)

Yeah, but you also just can't allocate or address that much memory on a 32-bit arch.

@nbeams (Contributor) commented Jun 28, 2023

I added some proposed changes for HIP. Once we are happy with these, I will add similar changes for CUDA. I may have either over- or under-done things in some places with the casts to make sure the compiler would use CeedSize and get rid of integer overflow warnings.

I did some local testing on the norm with a very large vector, but I assumed we didn't want to try adding that to the unit tests.

@nbeams (Contributor) commented Jun 28, 2023

(Sorry, not used to the new style check yet, and didn't have the right command prior to first push)

@jedbrown (Member Author) left a comment


This all looks reasonable to me. Just a couple issues that should be simple to resolve.

Review threads on backends/hip-ref/ceed-hip-ref-vector.c and backends/hip-ref/kernels/hip-ref-vector.hip.cpp (outdated, resolved)
jedbrown and others added 6 commits July 5, 2023 22:18
@nbeams (Contributor) commented Jul 5, 2023

I got the chance to test the CUDA implementation locally today with a very large vector (> INT_MAX size), for CUDA 11.6 and 12.1. For now, I just always used the 64-bit interface if CUDA >=12, but we can change this.

I rebased locally since it says there is a conflict with the main branch -- is it okay if I force-push to this branch now?

@jedbrown (Member Author) commented Jul 5, 2023

Great! Go ahead and force-push, then we can merge this.

@nbeams nbeams force-pushed the jed/fix-vec-size-loop-vars branch from 76cf8ac to f6f49ad on July 5, 2023 22:28
@nbeams (Contributor) commented Jul 5, 2023

Not sure what's going on with the build failure on Noether, but I see it on main, as well...?

@sebastiangrimberg (Collaborator) commented Jul 6, 2023

Not sure what's going on with the build failure on Noether, but I see it on main, as well...?

I’m sorry, I think this is my fault. Looks like there was a transitive #include dependency on hipblas.h via ceed-hip-common.h in ceed-hip-ref.h; it wasn't caught by IWYU, and the include was removed in #1244 when trying to make these patterns consistent with ceed-cuda-common/ref.h.

The solution is to add the hipblas.h include block from https://github.com/CEED/libCEED/blob/main/backends/hip/ceed-hip-common.h#L15C1-L15C1 to here: https://github.com/CEED/libCEED/blob/main/backends/hip-ref/ceed-hip-ref.h#L15.
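Concretely, that means something along these lines near the top of ceed-hip-ref.h (the plain include path here is an assumption; the real block at the first link may carry version guards):

  // alongside the existing HIP includes in backends/hip-ref/ceed-hip-ref.h
  #include <hipblas.h>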

My apologies again. I’m not sure why the PR CI didn’t catch this before merging.

@nbeams (Contributor) commented Jul 6, 2023

No worries, looks like we were missing the hipblas header in ceed-hip-ref.h and also the ceed-hip-common.h header (for the Ceed_Hip struct) in ceed-hip-ref-qfunction-load.cpp. (For the latter, we do something extra that cuda-ref doesn't, which is why this file needs that header while the CUDA version does not.)

My local build on Noether was clean now, so hopefully CI passes this time. I think this is ready to merge, if so.

@nbeams (Contributor) commented Jul 6, 2023

...er, once I make the style check happy, that is. I didn't realize the order of headers mattered...

@nbeams nbeams force-pushed the jed/fix-vec-size-loop-vars branch from 1c5e820 to 05c335c on July 6, 2023 20:57
@nbeams (Contributor) commented Jul 6, 2023

🐇 🎩 "This time, for sure!"

@jedbrown jedbrown merged commit b3d4ed2 into main Jul 7, 2023
@jedbrown jedbrown deleted the jed/fix-vec-size-loop-vars branch July 7, 2023 03:27
@jedbrown (Member Author) commented Jul 7, 2023

Thanks @nbeams!
