
CeedVector/Preconditioning: fix CeedInt loop vars to CeedSize #1241

Merged: 7 commits into main on Jul 7, 2023

Conversation

@jedbrown (Member)

While 32-bit is sufficient for CeedElemRestriction, a Vector is used to store matrix entries and the number of entries can overflow 32-bit even for a small number of dofs. For example, 85k Q3 fluid elements is enough to overflow.

Reported-by: Ken Jansen

@jedbrown jedbrown requested a review from jrwrigh June 23, 2023 02:50
@jeremylt (Member)

Should we update the GPU backend impls for the vec functions too?

@jedbrown (Member Author)

We'll need to, but it's harder to test because most devices aren't big enough to overflow.

@nbeams (Contributor) commented Jun 23, 2023

Adding some notes about HIP/CUDA here, though we can move this discussion to a separate PR if we want to go ahead and merge this one.

I was able to reproduce the issue on MI250X and resolve it by modifying:

  • Functions in ceed-hip-ref-vector.c: these were still using CeedInt for the size even though the "official" type of a CeedVector size is now CeedSize
  • Kernels in hip-ref-vector.hip.cpp (again, length/size still using CeedInt)
  • Kernels for operator assembly and operator linear diagonal assembly (indices)

But there are maybe some minor issues. One is the use of cuBLAS/hipBLAS for the norm routines (CeedVectorNorm_[Hip/Cuda]) -- they only take 32-bit integers for the length (and the index, for I*amax). cuBLAS 12 does have 64-bit integer routines for the norms, but it seems hipBLAS doesn't yet.
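For reference, the 32-bit restriction is visible right in the BLAS signatures; a rough illustration (paraphrased from the hipBLAS and cuBLAS headers, so treat the exact spellings as an assumption):

  // hipBLAS (and pre-12 cuBLAS): the length and increment are plain int,
  // so a CeedVector longer than INT_MAX cannot be passed directly.
  hipblasStatus_t hipblasDnrm2(hipblasHandle_t handle, int n, const double *x, int incx, double *result);

  // cuBLAS 12 adds a 64-bit integer interface (the "_64"-suffixed routines).
  cublasStatus_t cublasDnrm2_64(cublasHandle_t handle, int64_t n, const double *x, int64_t incx, double *result);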

The other potential issue is in simple kernel launches like the ones in hip-ref-vector.hip.cpp, e.g. CeedDeviceSetValue_Hip:

  const int bsize    = 512;
  const CeedInt vecsize  = length;
  int       gridsize = vecsize / bsize;
  
  if (bsize * gridsize < vecsize) gridsize += 1;
  hipLaunchKernelGGL(setValueK, dim3(gridsize), dim3(bsize), 0, 0, d_array, length, val);

vecsize should be promoted to CeedSize, since the length of the vector is already allowed to be bigger than what fits in CeedInt. But we would then be assuming that vecsize / bsize always fits into int (otherwise we would be asking for too big a grid anyway, in addition to the value not fitting in int). For the kernels whose grid size is tied to the number of elements, we've felt pretty confident that no one would have so many local elements that the calculated grid size becomes a concern. But for these vector routines, we've already said Vectors may need to be bigger than int, so how much bigger do we want to allow them to be, technically, when it comes to any more sophisticated launching of these kernels?
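To make that concrete, a promoted version of the snippet above might look like the following (just a sketch, not the code in this PR; the cast and its comment mark the assumption that vecsize / bsize still fits in int):

  const int      bsize    = 512;
  const CeedSize vecsize  = length;  // promoted: the vector length may exceed CeedInt
  CeedSize       gridsize = vecsize / bsize;

  if (bsize * gridsize < vecsize) gridsize += 1;
  // dim3 components are 32-bit, so this cast still assumes vecsize / bsize fits
  // in int; anything larger needs a host-side check or a different launch strategy.
  hipLaunchKernelGGL(setValueK, dim3((CeedInt)gridsize), dim3(bsize), 0, 0, d_array, length, val);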

But realistically, I think, the norm situation is the real concern.

@jedbrown (Member Author)

Can we just check for overflow on the host and error explaining how larger sizes are unsupported in hipBLAS? I don't think it's useful to take the norm of these vectors (basically just arrays of entries with some redundancy), but we should have a useful error if someone tries.
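A minimal sketch of that kind of guard, following the backend's usual error pattern (the surrounding names and message are illustrative, not the code that ended up being merged; INT_MAX is from <limits.h>):

  // Inside CeedVectorNorm_Hip (sketch): reject lengths hipBLAS cannot take.
  CeedSize length;
  Ceed     ceed;
  CeedCallBackend(CeedVectorGetCeed(vec, &ceed));
  CeedCallBackend(CeedVectorGetLength(vec, &length));
  if (length > (CeedSize)INT_MAX) {
    return CeedError(ceed, CEED_ERROR_UNSUPPORTED, "CeedVectorNorm not supported for lengths exceeding INT_MAX with hipBLAS");
  }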

@jedbrown jedbrown requested a review from nbeams June 26, 2023 17:28
@jedbrown (Member Author)

@nbeams Would you like to fold your work into this PR or make it a new PR?

@nbeams (Contributor) commented Jun 26, 2023

Can we just check for overflow on the host and error explaining how larger sizes are unsupported in hipBLAS? I don't think it's useful to take the norm of these vectors (basically just arrays of entries with some redundancy), but we should have a useful error if someone tries.

You mean just by comparing the size to INT_MAX, or something more sophisticated?

I wasn't thrilled with the idea of a fairly "standard" vector action being unavailable for perfectly valid CeedVectors, but I guess that does seem like the fastest way to get it fixed for the fluids app, and we can improve later if we want (maybe adding our own kernels that can handle larger vectors?).

I don't really have a preference for where to put the changes. I'd like to test a few things with the assembly kernels before we officially merge, but I should be able to make it a priority tomorrow.

@jedbrown (Member Author)

The alternative is to have a simple loop around the hipblas calls, bumping the base pointers on each iteration with a length of min(length_remaining, INT_MAX). Those will be plenty big to cover kernel launch latency. For NORM_2, we need to sum the squares from each block.
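Roughly, for the 2-norm that loop could look like this (a sketch only; everything besides the hipblasDnrm2 call itself is illustrative, and it needs <limits.h> for INT_MAX and <math.h> for sqrt):

  double        norm_sq   = 0.0;
  CeedSize      remaining = length;
  const double *ptr       = d_array;
  while (remaining > 0) {
    // hipBLAS takes a 32-bit length, so process at most INT_MAX entries per call.
    const int chunk = (int)(remaining < (CeedSize)INT_MAX ? remaining : (CeedSize)INT_MAX);
    double    partial;
    hipblasDnrm2(handle, chunk, ptr, 1, &partial);
    norm_sq += partial * partial;  // combine the block norms by summing their squares
    ptr += chunk;
    remaining -= chunk;
  }
  *norm = sqrt(norm_sq);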

@nbeams (Contributor) commented Jun 27, 2023

The alternative is to have a simple loop around the hipblas calls, bumping the base pointers on each iteration with a length of min(length_remaining, INT_MAX). Those will be plenty big to cover kernel launch latency. For NORM_2, we need to sum the squares from each block.

Yeah, makes sense. For CUDA, should we also check for CUDA >= 12 and call the 64-bit integer interface if we can? Can we always assume CeedSize <=> int64_t (by which I mean, safe to cast to what CUDA uses), since we would already be getting failures at these vector sizes if CeedSize were only 32-bit on a particular system?
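For the version check, the shape would presumably be something like this (a sketch; the CUDA_VERSION threshold and the casts are assumptions, and the handle/array names are illustrative):

  #if CUDA_VERSION >= 12000
    // cuBLAS 12: the 64-bit integer interface can take the full CeedSize length.
    cublasDnrm2_64(handle, (int64_t)length, d_array, 1, norm);
  #else
    // Older cuBLAS: 32-bit interface, with the length checked or chunked on the host.
    cublasDnrm2(handle, (int)length, d_array, 1, norm);
  #endif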

I did a little playing around on MI250X today. For the linear diagonal assembly kernel in the Q3 fluids example, switching from CeedInt to CeedSize does increase the total register usage slightly (as expected). In this case, it's not enough to change the occupancy of the kernel, though of course it could for other kernels.

I was thinking, for both linear diagonal assembly and operator assembly, we always assume the user has passed in a CeedVector of the correct size for holding the assembled output, right? (And we also know the size of the assembled QFunction when setting up the operator assembly.) Would we want to use that information to pick the integer type in the assembly kernels defined at the time of JIT, so we can use fewer registers when we don't need CeedSize?

E.g. if we add a new compile-time-defined variable to the kernels, like CEEDSIZE, which is set to 1 or 0, then in the kernels do

#if CEEDSIZE
typedef CeedSize IndexType;
#else 
typedef CeedInt IndexType;
#endif 

[...all integer types are IndexType in the kernel itself]

but if that seems too complicated, we can ignore for now and just take the slight hit on the assembly kernels in cases that only need ints.
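On the host side, the choice could then be as simple as something like the following (purely hypothetical, following the proposal above rather than any existing code; the length expression is illustrative):

  // Hypothetical: pick the kernel's index type at JIT time based on whether
  // the assembled output can exceed INT_MAX entries.
  const CeedSize assembled_length = (CeedSize)num_elem * elem_size * num_comp;  // illustrative
  const int      ceedsize_macro   = assembled_length > (CeedSize)INT_MAX ? 1 : 0;
  // ...then pass "-DCEEDSIZE=<ceedsize_macro>" (or equivalent) when compiling the kernel.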

@jedbrown (Member Author)

CeedSize is ptrdiff_t, which is the standard array index type. It's 32-bit on a machine with a 32-bit address space. Not many of those are interesting to HPC, but I can't promise they don't exist.

Choosing loop variable type in JIT seems interesting if we think the performance impact is concerning. The alternative would be to have multiple kernel launches, though I'm not sure that even works for these operations.

@nbeams (Contributor) commented Jun 27, 2023

It's 32-bit on a machine with a 32-bit address space. Not many of those are interesting to HPC, but I can't promise they don't exist.

Right, but I meant that if CeedSize is only 32-bit, then trying to handle cases where the vector length is more than INT_MAX will be doomed from the start (from trying to create a vector with that size), no? Just for the purposes of handling things inside the norm function.

@jedbrown (Member Author)

Yeah, but you also just can't allocate or address that much memory on a 32-bit arch.

@nbeams (Contributor) commented Jun 28, 2023

I added some proposed changes for HIP. Once we are happy with these, I will add similar changes for CUDA. I may have either over- or under-done things in some places with the casts to make sure the compiler would use CeedSize and get rid of integer overflow warnings.

I did some local testing on the norm with a very large vector, but I assumed we didn't want to try adding that to the unit tests.

@nbeams (Contributor) commented Jun 28, 2023

(Sorry, not used to the new style check yet, and didn't have the right command prior to first push)

@jedbrown (Member Author) left a comment


This all looks reasonable to me. Just a couple issues that should be simple to resolve.

Review threads on backends/hip-ref/ceed-hip-ref-vector.c and backends/hip-ref/kernels/hip-ref-vector.hip.cpp (outdated, resolved)
jedbrown and others added 6 commits July 5, 2023 22:18
@nbeams (Contributor) commented Jul 5, 2023

I got the chance to test the CUDA implementation locally today with a very large vector (> INT_MAX size), for CUDA 11.6 and 12.1. For now, I just always used the 64-bit interface if CUDA >=12, but we can change this.

I rebased locally since it says there is a conflict with the main branch -- is it okay if I force-push to this branch now?

@jedbrown (Member Author) commented Jul 5, 2023

Great! Go ahead and force-push, then we can merge this.

@nbeams nbeams force-pushed the jed/fix-vec-size-loop-vars branch from 76cf8ac to f6f49ad on July 5, 2023 22:28
@nbeams (Contributor) commented Jul 5, 2023

Not sure what's going on with the build failure on Noether, but I see it on main, as well...?

@sebastiangrimberg (Collaborator) commented Jul 6, 2023

Not sure what's going on with the build failure on Noether, but I see it on main, as well...?

I’m sorry, I think this is my fault. Looks like there was a transitive #include dependency on hipblas.h via ceed-hip-common.h in ceed-hip-ref.h; it wasn't caught by IWYU, and the include was removed in #1244 when trying to make these patterns consistent with ceed-cuda-common/ref.h.

The solution is to add the hipblas.h include block from https://github.com/CEED/libCEED/blob/main/backends/hip/ceed-hip-common.h#L15C1-L15C1 to here: https://github.com/CEED/libCEED/blob/main/backends/hip-ref/ceed-hip-ref.h#L15.
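Concretely, that means something along these lines near the top of ceed-hip-ref.h (the plain include path here is an assumption; the real block at the first link may carry version guards):

  // alongside the existing HIP includes in backends/hip-ref/ceed-hip-ref.h
  #include <hipblas.h>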

My apologies again. I’m not sure why the PR CI didn’t catch this before merging.

@nbeams (Contributor) commented Jul 6, 2023

No worries, looks like we were missing the hipblas header in ceed-hip-ref.h and also the ceed-hip-common.h header (for the Ceed_Hip struct) in ceed-hip-ref-qfunction-load.cpp. (For the latter, we do something extra that cuda-ref doesn't, which is why this file needs that header while the CUDA version does not.)

My local build on Noether was clean now, so hopefully CI passes this time. I think this is ready to merge, if so.

@nbeams (Contributor) commented Jul 6, 2023

...er, once I make the style check happy, that is. I didn't realize the order of headers mattered...

@nbeams nbeams force-pushed the jed/fix-vec-size-loop-vars branch from 1c5e820 to 05c335c on July 6, 2023 20:57
@nbeams (Contributor) commented Jul 6, 2023

🐇 🎩 "This time, for sure!"

@jedbrown jedbrown merged commit b3d4ed2 into main Jul 7, 2023
@jedbrown jedbrown deleted the jed/fix-vec-size-loop-vars branch July 7, 2023 03:27
@jedbrown (Member Author) commented Jul 7, 2023

Thanks @nbeams!
