OCCA SEGV when passing ROCm device memory #678
Comments
We resolve default VECCUDA or VECHIP based on /gpu/cuda or /gpu/hip. For others, one should pass -dm_vec_type kokkos or -dm_vec_type standard. There is no longer a "libCEED User Requested MemType" field in the output. Enable CI testing using PETSc built locally with HIP. This should be converted to building a specified version once some bugs are fixed and the branch merges. Exclude automatic selection of VECHIP for /gpu/hip/occa due to SEGV: #678
I think this is the same issue that we have in MFEM when trying to use OCCA's HIP backend from libCEED. For example, running Example 1 with ./ex1 -pa -d ceed-hip:/gpu/hip/occa produces a segfault with the following backtrace:
#0 0x00002aaaabc9d6ce in occa::modeMemory_t::addMemoryRef(occa::memory*) ()
   from /<deleted-path>/occa/lib/libocca.so
#1 0x00002aaaac0cb993 in ceed::occa::Vector::useArrayPointer(CeedMemType, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#2 0x00002aaaac0cc3c8 in ceed::occa::Vector::setArray(CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#3 0x00002aaaac0cc458 in ceed::occa::Vector::ceedSetArray(CeedVector_private*, CeedMemType, CeedCopyMode, double*) ()
   from /<deleted-path>/libceed/lib/libceed.so
#4 0x00002aaaac08dd50 in CeedVectorSetArray ()
   from /<deleted-path>/libceed/lib/libceed.so
#5 0x0000000000782dec in mfem::InitCeedVector(mfem::Vector const&, CeedVector_private*&) ()
#6 0x000000000078315e in mfem::CeedPAAssemble(mfem::CeedPAOperator const&, mfem::CeedData&) ()
#7 0x00000000009df141 in mfem::CeedPADiffusionAssemble(mfem::FiniteElementSpace const&, mfem::IntegrationRule const&, mfem::CeedData&) ()
#8 0x000000000070ddb0 in mfem::PABilinearFormExtension::Assemble() ()
#9 0x000000000040c4b9 in main ()
I think we discussed this briefly with @dmed256 before the MFEM v4.2 release (that's when I first noticed this issue), but we did not pursue it further at the time. Basically, I think that |
I misunderstood what the getter and setter methods were for. I thought it was a way to get a handle to the data for the given backend (e.g. @jedbrown can you confirm we want to pass/retrieve C/C++/CUDA/HIP native pointers and not backend-specific pointers for these methods:
|
Yup, that's what we want. The use case in current 'main' is that we're getting the array (as a host, CUDA, or ROCm device pointer) from PETSc and having the |
Cool, thank you for clarifying. Making it easy to wrap a native pointer has been a feature I've wanted to add so it helps everyone :) (:bird: :bird: :curling_stone:) |
Great, I hope this isn't too hard and we'll look forward to activating this usage with the OCCA backends. |
I think this isn't fixed yet since libCEED will need an update, though it should be simple. |
Yeah, that's clearly what happened. I wonder if it would have done that if you didn't have write permission to this repo. Are you able to update |
I started #688 but will need to re-tag the OCCA versions along with the PR changes. I'll tag |
Actually, I'm not sure where I can set the version/commit 🤔 |
For testing, it's set in For documentation it should be on the README, I think |
Thank you! |
I have not checked, but I imagine this is fixed now |
@dmed256 Perhaps you can identify what's going wrong here. I pass a device pointer here (and it works with all the other /gpu/hip/* backends). The self->modeMemory->dtype_ is NULL so we get a SEGV here. I'm using occa-1.1.0 and we're on ROCm-4.0, though this looks to all be plain C++ code.
I guess I'm confused about how the cast on line 28 is supposed to work. This array is just a pointer to (floating point) data on the device so I don't follow how it can be cast to modeMemory_t* and then dereference the dtype_ field.
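To make the concern concrete, here is a minimal sketch of the alternative to that cast: build a handle around the native pointer instead of reinterpreting the data buffer as a handle. Everything in it (DType, ModeMemory, wrapNative, dtypeName) is a hypothetical stand-in written for illustration. None of it is OCCA's or libCEED's actual code, and the real fix may look quite different.

```cpp
#include <cstddef>

// Hypothetical stand-ins for illustration; NOT OCCA's real definitions.
struct DType {
  const char *name;
  std::size_t bytes;
};

struct ModeMemory {
  const DType *dtype;  // type descriptor the backend would dereference
  void *ptr;           // native (host or device) pointer
  std::size_t size;    // allocation size in bytes
};

// Single shared descriptor for double-precision data (assumed for the sketch).
inline const DType &doubleDType() {
  static const DType d{"double", sizeof(double)};
  return d;
}

// Build a handle around a native pointer. The raw buffer only holds
// floating point data, so it is stored as `ptr` and never reinterpreted
// as a ModeMemory itself.
inline ModeMemory wrapNative(double *array, std::size_t entries) {
  return ModeMemory{&doubleDType(), array, entries * sizeof(double)};
}

// Accessor that checks the descriptor before dereferencing it.
inline const char *dtypeName(const ModeMemory &mem) {
  return mem.dtype ? mem.dtype->name : "unknown";
}
```

With something along these lines, the setter would receive a plain double* and the backend would own the handle that carries the dtype, so nothing has to read a dtype_ field out of user data.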