Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCCA SEGV when passing ROCm device memory #678

Closed
jedbrown opened this issue Dec 23, 2020 · 13 comments · Fixed by libocca/occa#435
Closed

OCCA SEGV when passing ROCm device memory #678

jedbrown opened this issue Dec 23, 2020 · 13 comments · Fixed by libocca/occa#435

Comments

@jedbrown
Copy link
Member

@dmed256 Perhaps you can identify what's going wrong here. I pass a device pointer here (and it works with all the other /gpu/hip/* backends). The self->modeMemory->dtype_ is NULL so we get a SEGV here. I'm using occa-1.1.0 and we're on ROCm-4.0, though this looks to all be plain C++ code.

Thread 1 "petsc-bps" received signal SIGSEGV, Segmentation fault.
0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
52            return ref ? *ref : *this;
(gdb) bt
#0  0x00007fffecee20e0 in occa::dtype_t::self (this=0x0) at /projects/occa/include/occa/dtype/dtype.hpp:52
#1  0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
#2  0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
#3  0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
#4  0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
#5  0x00007ffff4be2e88 in ceed::occa::Vector::useArrayPointer (this=0x555555fc9ac0, mtype=<optimized out>, array=array@entry=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:199
#6  0x00007ffff4be356c in ceed::occa::Vector::setArray (this=0x555555fc9ac0, mtype=mtype@entry=CEED_MEM_DEVICE, cmode=cmode@entry=CEED_USE_POINTER, array=array@entry=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:119
#7  0x00007ffff4be35ce in ceed::occa::Vector::ceedSetArray (vec=<optimized out>, mtype=CEED_MEM_DEVICE, cmode=CEED_USE_POINTER, array=0x7ffeb080f000)
    at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.cpp:366
#8  0x00007ffff4bb23b3 in CeedVectorSetArray (vec=0x555555fee7f0, mtype=<optimized out>, cmode=cmode@entry=CEED_USE_POINTER, array=<optimized out>) at /home/jeka2967/libceed/interface/ceed-vector.c:200
#9  0x000055555555c436 in RunWithDM (ceedresource=<optimized out>, dm=0x5555559ca330, rp=0x5555558ce510) at /home/jeka2967/libceed/examples/petsc/bps.c:196
#10 Run (rp=0x5555558ce510, num_resources=1, ceedresources=ceedresources@entry=0x7fffffffc120, num_bpchoices=<optimized out>, bpchoices=bpchoices@entry=0x7fffffffc020)
    at /home/jeka2967/libceed/examples/petsc/bps.c:421
#11 0x0000555555560009 in main (argc=<optimized out>, argv=<optimized out>) at /home/jeka2967/libceed/examples/petsc/bps.c:588
(gdb) up
#1  0x00007fffecf031aa in occa::dtype_t::bytes (this=0x0) at /projects/occa/src/dtype/dtype.cpp:91
91          return self().bytes_;
(gdb)
#2  0x00007fffecedf057 in occa::memory::slice (this=0x7fffffffbc30, offset=0, count=-1) at /projects/occa/src/core/memory.cpp:385
385         const int dtypeSize = modeMemory->dtype_->bytes();
(gdb) p modeMemory->dtype_
$1 = (const occa::dtype_t *) 0x0
(gdb) up
#3  0x00007fffecee1b85 in occa::memory::as (this=0x7fffffffbc30, dtype_=...) at /projects/occa/src/core/memory.cpp:536
536         occa::memory mem = slice(0);
(gdb)
#4  0x00007ffff4be413a in ceed::occa::arrayToMemory<double> (array=array@entry=0x7ffeb080f000) at /home/jeka2967/libceed/backends/occa/ceed-occa-vector.hpp:29
29              return mem.as(::occa::dtype::get<TM>());
(gdb) l
24        namespace occa {
25          template <class TM>
26          ::occa::memory arrayToMemory(const TM *array) {
27            if (array) {
28              ::occa::memory mem((::occa::modeMemory_t*) array);
29              return mem.as(::occa::dtype::get<TM>());
30            }
31            return ::occa::null;
32          }
33

I guess I'm confused about how the cast on line 28 is supposed to work. This array is just a pointer to (floating point) data on the device so I don't follow how it can be cast to modeMemory_t* and then dereference the dtype_ field.

jedbrown added a commit that referenced this issue Dec 23, 2020
We resolve default VECCUDA or VECHIP based on /gpu/cuda or /gpu/hip. For
others, one should pass -dm_vec_type kokkos or -dm_vec_type standard.

There is no longer a "libCEED User Requested MemType" field in the
output.

Enable CI testing using PETSc built locally with HIP. This should be
converted to building a specified version once some bugs are fixed and
the branch merges.

Exclude automatic selection of VECHIP for /gpu/hip/occa due to SEGV:
#678
@v-dobrev
Copy link
Member

I think this is the same issue that we have in MFEM trying to use OCCA's HIP backend from libCEED. For example, running Example 1 with

./ex1 -pa -d ceed-hip:/gpu/hip/occa

produces a segfault with the following backtrace:

#0  0x00002aaaabc9d6ce in occa::modeMemory_t::addMemoryRef(occa::memory*) ()
   from /<deleted-path>/occa/lib/libocca.so
#1  0x00002aaaac0cb993 in ceed::occa::Vector::useArrayPointer(CeedMemType, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#2  0x00002aaaac0cc3c8 in ceed::occa::Vector::setArray(CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#3  0x00002aaaac0cc458 in ceed::occa::Vector::ceedSetArray(CeedVector_private*, CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#4  0x00002aaaac08dd50 in CeedVectorSetArray ()
   from /<deleted-path>/libceed/lib/libceed.so
#5  0x0000000000782dec in mfem::InitCeedVector(mfem::Vector const&, CeedVector_private*&) ()
#6  0x000000000078315e in mfem::CeedPAAssemble(mfem::CeedPAOperator const&, mfem::CeedData&) ()
#7  0x00000000009df141 in mfem::CeedPADiffusionAssemble(mfem::FiniteElementSpace const&, mfem::IntegrationRule const&, mfem::CeedData&) ()
#8  0x000000000070ddb0 in mfem::PABilinearFormExtension::Assemble() ()
#9  0x000000000040c4b9 in main ()

I think we discussed this briefly with @dmed256 before the MFEM v4.2 release (that's when I first noticed this issue) but we did not pursue it further at the time. Basically, I think that ceed::occa::Vector::ceedSetArray does not work correctly when the CeedCopyMode is CEED_USE_POINTER. I think the external double* pointer needs to be wrapped as occa::memory using the backend-specific methods like occa::cuda::wrapMemory.

jedbrown added a commit that referenced this issue Dec 25, 2020
We resolve default VECCUDA or VECHIP based on /gpu/cuda or /gpu/hip. For
others, one should pass -dm_vec_type kokkos or -dm_vec_type standard.

There is no longer a "libCEED User Requested MemType" field in the
output.

Enable CI testing using PETSc built locally with HIP. This should be
converted to building a specified version once some bugs are fixed and
the branch merges.

Exclude automatic selection of VECHIP for /gpu/hip/occa due to SEGV:
#678
@dmed256
Copy link
Member

dmed256 commented Dec 29, 2020

I misunderstood what the getter and setter methods were for. I thought it was a way to get a handle to the data for the given backend (e.g. occa::modeMemory_t*) and not the underlying backend (e.g. CUDA pointer).

@jedbrown can you confirm we want to pass/retrieve C/C++/CUDA/HIP native pointers and not backend-specific pointers for these methods:

ceedSetArray
ceedGetArray

@jedbrown
Copy link
Member Author

Yup, that's what we want. The use case in current 'main' is that we're getting the array (as a host, CUDA, or ROCm device pointer) from PETSc and having the CeedVector wrap it for application of an operator. I believe the implementations in CeedVectorSetArrayDevice_{Cuda,Hip} are correct (and they work in all cases we've considered). We could consider something different, but this mode is useful and flexible.

@dmed256
Copy link
Member

dmed256 commented Dec 29, 2020

Cool, thank you for clarifying. Making it easy to wrap a native pointer has been a feature I've wanted to add so it helps everyone :) (:bird: :bird: :curling_stone:)

@jedbrown
Copy link
Member Author

Great, I hope this isn't too hard and we'll look forward to activating this usage with the OCCA backends.

@jedbrown
Copy link
Member Author

jedbrown commented Jan 2, 2021

I think this isn't fixed yet since libCEED will need an update, though it should be simple.

@jedbrown jedbrown reopened this Jan 2, 2021
@dmed256
Copy link
Member

dmed256 commented Jan 2, 2021

Oh weird, I think Github auto-closed this ticket when I merged the PR

image

@jedbrown
Copy link
Member Author

jedbrown commented Jan 2, 2021

Yeah, that's clearly what happened. I wonder if it would have done that if you didn't have write permission to this repo.

Are you able to update backends/occa/ or would you like us to give it a try?

@dmed256
Copy link
Member

dmed256 commented Jan 2, 2021

I started #688 but will need to re-tag the OCCA versions along with the PR changes. I'll tag 1.1.1 after these tests work on a fixed main commit.

@dmed256
Copy link
Member

dmed256 commented Jan 2, 2021

Actually, I'm not sure where I can set the version/commit 🤔

@jeremylt
Copy link
Member

jeremylt commented Jan 2, 2021

For testing, it's set in .gitlab-ci.yml

For documentation it should be on the README, I think

@dmed256
Copy link
Member

dmed256 commented Jan 2, 2021

Thank you!

karenlstengel pushed a commit that referenced this issue Feb 17, 2021
We resolve default VECCUDA or VECHIP based on /gpu/cuda or /gpu/hip. For
others, one should pass -dm_vec_type kokkos or -dm_vec_type standard.

There is no longer a "libCEED User Requested MemType" field in the
output.

Enable CI testing using PETSc built locally with HIP. This should be
converted to building a specified version once some bugs are fixed and
the branch merges.

Exclude automatic selection of VECHIP for /gpu/hip/occa due to SEGV:
#678
karenlstengel pushed a commit that referenced this issue Feb 24, 2021
We resolve default VECCUDA or VECHIP based on /gpu/cuda or /gpu/hip. For
others, one should pass -dm_vec_type kokkos or -dm_vec_type standard.

There is no longer a "libCEED User Requested MemType" field in the
output.

Enable CI testing using PETSc built locally with HIP. This should be
converted to building a specified version once some bugs are fixed and
the branch merges.

Exclude automatic selection of VECHIP for /gpu/hip/occa due to SEGV:
#678
@dmed256 dmed256 removed their assignment May 25, 2022
@jeremylt
Copy link
Member

I have not checked, but I imagine this is fixed now

@jeremylt jeremylt closed this as completed Aug 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants