Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OCCA] Update array methods and remove OCCA internal methods #688

Closed
wants to merge 9 commits into from

Conversation

dmed256
Copy link
Member

@dmed256 dmed256 commented Jan 2, 2021

Description

  • Replaces the array getter/setter methods to use the occa::memory::ptr() and occa::device::wrapMemory methods
  • Removes use of OCCA internal methods

⚠️ Before merging, I need to make sure I change the following ⚠️

  • Git branch from main -> v1.1.1
  • The cache update step
    • && (cd $OCCA_VERSION && git reset --hard origin/main)

@dmed256 dmed256 force-pushed the occa/fix-array-methods branch 2 times, most recently from c995c4a to 898d813 Compare January 2, 2021 04:38
@dmed256
Copy link
Member Author

dmed256 commented Jan 2, 2021

Unfortunately I don't have a HIP machine to test these errors locally 😞
image

@jedbrown
Copy link
Member

jedbrown commented Jan 2, 2021

I'm getting clean runs from my account on that same machine. I'll investigate in the morning. I can give you access to the machine if you send me your ssh public key.

@jedbrown
Copy link
Member

jedbrown commented Jan 2, 2021

It's clean now on Noether. The occa-1.1.1 directory was missing the latest commit and the change you made to .gitlab-ci.yml won't update.

commit c596688eba56dc2c6c45aae04128be260fedc7c0 (HEAD -> main, origin/main)
Author: David Medina <[email protected]>
Date:   Fri Jan 1 23:36:02 2021 -0500

    [HIP] Update ptr ref hackery (#437)

commit d23f05d8b15c1af1c53272180facd65ace26df0c
Author: David Medina <[email protected]>
Date:   Fri Jan 1 22:42:21 2021 -0500

    [Refactor] Move modeIsEnabled back to public API (#436)

- >
cd ..
&& export OCCA_VERSION=occa-1.1.1 OCCA_OPENCL_ENABLED=0
&& { [[ -d $OCCA_VERSION ]] || { git clone --depth 1 --branch main https://github.com/libocca/occa.git $OCCA_VERSION && make -C $OCCA_VERSION -j$(nproc); }; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than hacking --branch main here, I'd rather set OCCA_VERSION to a commit hash if this comes up in the future. (I fixed the issue manually this time.) I suppose in this instance you plan to tag v1.1.1 on the current 'main'.

Copy link
Member

@jedbrown jedbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved once the OCCA_VERSION stuff in .gitlab-ci.yml is amended (and pipelines pass again).

@dmed256
Copy link
Member Author

dmed256 commented Jan 2, 2021

Oh awesome, I'll go ahead and tag 1.1.1 later this weekend and update main -> v1.1.1. Thanks for the help!

@jedbrown
Copy link
Member

jedbrown commented Jan 3, 2021

Hmm, it would appear that hipcc doesn't like what it's getting. I tried clearing the cache and rebuilding occa and libceed, but I still get failures in some tests.

not ok 3 t204-elemrestriction /gpu/hip/occa stderr
# +terminate called after throwing an instance of 'occa::exception'
# +what():
# +---[ Error ]--------------------------------------------------------------------
# +File     : /home/gitlab-runner/builds/N8GCsqus/0/libceed/occa-1.1.1/src/occa/internal/modes/hip/d
evice.cpp
# +Line     : 308
# +Function : compileKernel
# +Message  : Error compiling [applyRestriction], Command: [hipcc --genco -O3   --amdgpu-target=gfx9
06 -I/home/gitlab-runner/builds/N8GCsqus/0/libceed/occa-1.1.1/include -I/home/gitlab-runner/builds/N
8GCsqus/0/libceed/occa-1.1.1/include /home/gitlab-runner/.occa/cache/15fa00807d8b4d0f/source.cpp -o
/home/gitlab-runner/.occa/cache/15fa00807d8b4d0f/binary > /dev/null 2>&1]
# +Stack
# +15 occa::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > co
nst&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, s
td::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# +14 occa::hip::device::compileKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std:
:allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<c
har> > const&, occa::json const&, occa::io::lock_t&)
# +13 occa::hip::device::buildKernelFromProcessedSource(occa::hash_t, std::__cxx11::basic_string<cha
r, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char
_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char
>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allo
cator<char> > const&, bool, occa::lang::sourceMetadata_t&, occa::lang::sourceMetadata_t&, occa::json
 const&, occa::io::lock_t)
# +12 occa::launchedModeDevice_t::buildKernel(std::__cxx11::basic_string<char, std::char_traits<char
>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allo
cator<char> > const&, occa::hash_t, bool, occa::json const&)
# +11 occa::launchedModeDevice_t::buildKernel(std::__cxx11::basic_string<char, std::char_traits<char
>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allo
cator<char> > const&, occa::hash_t, occa::json const&)
# +10 occa::device::buildKernel(std::__cxx11::basic_string<char, std::char_traits<char>, std::alloca
tor<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
const&, occa::json const&) const
# +9 occa::device::buildKernelFromString(std::__cxx11::basic_string<char, std::char_traits<char>, st
d::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator
<char> > const&, occa::json const&) const
# +8 occa::kernelBuilder::build(occa::device, occa::hash_t const&, occa::json const&)
# +7 occa::kernelBuilder::build(occa::device, occa::json const&)
# +6 ceed::occa::ElemRestriction::apply(CeedTransposeMode, ceed::occa::Vector&, ceed::occa::Vector&)
# +5 ceed::occa::ElemRestriction::ceedApply(CeedElemRestriction_private*, CeedTransposeMode, CeedVec
tor_private*, CeedVector_private*, CeedRequest_private**)
# +4 /home/gitlab-runner/libceed/lib/libceed_test.so(CeedElemRestrictionApply+0x35)
# +3 build/t204-elemrestriction(+0x1368)
# +2 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)
# +1 build/t204-elemrestriction(+0x113a)

@dmed256
Copy link
Member Author

dmed256 commented Jan 3, 2021

Weird, I'm able to compile that file with the same command except different input/output files 🤔

davidmedina@noether:~$ cp /home/gitlab-runner/.occa/cache/15fa00807d8b4d0f/source.cpp .
davidmedina@noether:~$ hipcc --genco -O3   --amdgpu-target=gfx906 -I/home/gitlab-runner/builds/N8GCsqus/0/libceed/occa-1.1.1/include -I/home/gitlab-runner/builds/N8GCsqus/0/libceed/occa-1.1.1/include source.cpp -obinary
davidmedina@noether:~$ echo $?
0
davidmedina@noether:~$ ls
binary  occa  source.cpp

@dmed256
Copy link
Member Author

dmed256 commented Jan 3, 2021

Where did you see that error trace?

@jedbrown
Copy link
Member

jedbrown commented Jan 3, 2021

I tried make -j1 test BACKENDS=/gpu/hip/occa as gitlab-runner from ~gitlab-runner/libceed (trying to reduce variation without getting in the way of possible CI jobs).

@jedbrown
Copy link
Member

jedbrown commented Jan 5, 2021

make test on current OCCA main has some failures on Noether.

Failed examples:
  - HIP: cpp/01_add_vectors
  - HIP: cpp/02_background_device
  - HIP: cpp/03_inline_okl
  - HIP: cpp/04_reduction
  - HIP: cpp/05_building_kernels
  - HIP: cpp/06_unified_memory
  - HIP: cpp/07_dtypes
  - HIP: cpp/08_arrays
  - HIP: c/01_add_vectors
  - HIP: c/02_background_device
  - HIP: c/03_inline_okl
  - HIP: c/04_reduction

Do those pass for you somewhere? It seems like they should be fixed before trying to debug in libCEED.

@jedbrown
Copy link
Member

jedbrown commented Jan 5, 2021

The above tests did not fail with c596688eba56dc2c6c45aae04128be260fedc7c0, though some other OpenCL tests did fail. Perhaps you can bisect on Noether?

@dmed256
Copy link
Member Author

dmed256 commented Jan 6, 2021

🤦‍♂️ I think it's because OCCA can't find hipcc

[noether]> hipcc --genco -O3   --amdgpu-target=gfx906 -I/home/davidmedina/git/night/include -I/home/davidmedina/git/night/include /home/davidmedina/.occa/cache/2281ad69661fb068/source.cpp -o /home/davidmedina/.occa/cache/2281ad69661fb068/binary
-bash: hipcc: command not found

@dmed256
Copy link
Member Author

dmed256 commented Jan 6, 2021

Actually, it's still crashing for some reason 🤔

#1  0x00007ffff6476537 in __GI_abort () at abort.c:79
#2  0x00007ffff60a08df in rocr::core::Runtime::VMFaultHandler(long, void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007ffff60a29eb in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007ffff6050367 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#5  0x00007ffff661eea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007ffff654ed8f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@dmed256
Copy link
Member Author

dmed256 commented Jan 6, 2021

Ah, found the issue. Will fix soon, although I want to wait for 1 more big PR before tagging + releasing the next version.

@jedbrown
Copy link
Member

jedbrown commented Jan 6, 2021

Per libocca/occa#449 (comment)

Also, the HIP bug is on main now

I cleaned and rebuilt with cc9bd4fb1d66d6a377f7f9683eb7b38ffd3cae00 (current 'main') and got lots of errors due to the race conditions, ran the tests again and got lots of numeric diffs. This will show you the problematic ones quickly.

make test search=t3 BACKENDS=/gpu/hip/occa

@dmed256
Copy link
Member Author

dmed256 commented Jan 7, 2021

Seems like it fails for cuda on my machine as well, I'll do some bisect and see what else went wrong 😞

This error might help:

# +File     : /home/david/git/night/src/occa/internal/modes/cuda/memory.cpp
# +Line     : 109
# +Function : copyTo
# +Message  : Memory: Copy From
# +CUDA Error [ 700 ]: CUDA_ERROR_ILLEGAL_ADDRESS

@jedbrown
Copy link
Member

@dmed256 Have you had time to get back to this? I see OCCA 1.1.1 has not been released yet. We'd like to be able to include your updates in the upcoming libCEED-0.8 release (part of CEED 4.0, which should probably include new OCCA).

@jedbrown
Copy link
Member

@dmed256 Is there any chance you can get to this, at least to the extent that known bugs generate errors and thus we can merge? Will there be a new OCCA release in this year's CEED distribution?

jedbrown added a commit to spack/spack that referenced this pull request Apr 1, 2021
Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688
jedbrown added a commit to spack/spack that referenced this pull request Apr 1, 2021
Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688
jedbrown added a commit to spack/spack that referenced this pull request Apr 1, 2021
Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688
tzanio pushed a commit to spack/spack that referenced this pull request Jun 3, 2021
* petsc: add hip variant

* libceed: add 0.8, disable occa by default, and let autodetect AVX

Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688

* WIP: ceed: 4.0 release

* MFEM package updates (#19748)

* MFEM package updates

* mfem: flake8

* [mfem] Various fixes and tweaks.

[arpack-ng] Add a patch to fix building with IBM XL Fortran.

[libceed] Fix building with IBM XL C/C++.

[pumi] Add C++11 flag for version 2.2.3.

* [mfem] Fix the shared CUDA build.

Reported by: @MPhysXDev

* [mfem] Fix a TODO item

* [mfem] Tweak the AmgX dependencies

* [suite-sparse] Fix the version of the mpfr dependency

* MFEM: add initial HIP support using the ROCmPackage.

* MFEM: add 'slepc' variant.

* MFEM: update the patch for v4.2 for SLEPc.

* mfem: apply 'mfem-4.2-slepc.patch' just to v4.2.

* ceed: apply 'spack style'

* [mfem] Add a patch for mfem v4.2 to work with petsc v3.15.0.

[laghos] Add laghos version 3.1 based on the latest commit in
         the repository; this version works with mfem v4.2.

[ceed] For ceed v4.0 use laghos v3.1.

* [libceed] Explicitly set 'CC_VENDOR=icc' when using 'intel'
          compiler.

* [mfem] Allow pumi >= 2.2.3 with mfem >= 4.2.0.

[ceed] Use pumi v2.2.5 with ceed v4.0.0.

* [ceed] Explicitly use occa v1.1.0 with ceed v4.0.0.
       Use [email protected]+rocm with [email protected]+mfem+hip.

* [ceed] Add NekRS v21 as a dependency for ceed v4.0.0.

* [ceed] Fix NekRS version: 21 --> 21.0

* [ceed] Propagate +cuda variant to petsc for ceed v4.0.

* [mfem] Propagate '+rocm' variant to some other packages.

* [ceed] Use +rocm variant of nekrs instead of +hip.

* [ceed] Do not enable magma with [email protected]+hip.

* [libceed] Fix hip build with [email protected].

* [laghos] For v3.1, use the release .tar.gz file instead of commit.

* Remove cuda & hip variants as they are inherited

* [ceed] Remove comments and FIXMEs about 'magma+hip'.

* [ceed] [libceed] Remove TODOs about occa + hip.

* libceed: use ROCmPackage and +rocm

* petsc: use ROCmPackage for HIP

* libceed, petsc: use CudaPackage

* ceed: forward cuda_arch and amdgpu_target

* [mfem] Use Spack's CudaPackage as a base class; as a result,
       'cuda_arch' values should not include the 'sm_' prefix.
       Also, propagate 'cuda_arch' and 'amdgpu_target' variants
       to enabled dependencies.

* petsc: variant is +rocm, package name is hip

Co-authored-by: Jed Brown <[email protected]>
Co-authored-by: Thilina Rathnayake <[email protected]>
RikkiButler20 pushed a commit to RikkiButler20/spack that referenced this pull request Jun 9, 2021
* petsc: add hip variant

* libceed: add 0.8, disable occa by default, and let autodetect AVX

Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688

* WIP: ceed: 4.0 release

* MFEM package updates (spack#19748)

* MFEM package updates

* mfem: flake8

* [mfem] Various fixes and tweaks.

[arpack-ng] Add a patch to fix building with IBM XL Fortran.

[libceed] Fix building with IBM XL C/C++.

[pumi] Add C++11 flag for version 2.2.3.

* [mfem] Fix the shared CUDA build.

Reported by: @MPhysXDev

* [mfem] Fix a TODO item

* [mfem] Tweak the AmgX dependencies

* [suite-sparse] Fix the version of the mpfr dependency

* MFEM: add initial HIP support using the ROCmPackage.

* MFEM: add 'slepc' variant.

* MFEM: update the patch for v4.2 for SLEPc.

* mfem: apply 'mfem-4.2-slepc.patch' just to v4.2.

* ceed: apply 'spack style'

* [mfem] Add a patch for mfem v4.2 to work with petsc v3.15.0.

[laghos] Add laghos version 3.1 based on the latest commit in
         the repository; this version works with mfem v4.2.

[ceed] For ceed v4.0 use laghos v3.1.

* [libceed] Explicitly set 'CC_VENDOR=icc' when using 'intel'
          compiler.

* [mfem] Allow pumi >= 2.2.3 with mfem >= 4.2.0.

[ceed] Use pumi v2.2.5 with ceed v4.0.0.

* [ceed] Explicitly use occa v1.1.0 with ceed v4.0.0.
       Use [email protected]+rocm with [email protected]+mfem+hip.

* [ceed] Add NekRS v21 as a dependency for ceed v4.0.0.

* [ceed] Fix NekRS version: 21 --> 21.0

* [ceed] Propagate +cuda variant to petsc for ceed v4.0.

* [mfem] Propagate '+rocm' variant to some other packages.

* [ceed] Use +rocm variant of nekrs instead of +hip.

* [ceed] Do not enable magma with [email protected]+hip.

* [libceed] Fix hip build with [email protected].

* [laghos] For v3.1, use the release .tar.gz file instead of commit.

* Remove cuda & hip variants as they are inherited

* [ceed] Remove comments and FIXMEs about 'magma+hip'.

* [ceed] [libceed] Remove TODOs about occa + hip.

* libceed: use ROCmPackage and +rocm

* petsc: use ROCmPackage for HIP

* libceed, petsc: use CudaPackage

* ceed: forward cuda_arch and amdgpu_target

* [mfem] Use Spack's CudaPackage as a base class; as a result,
       'cuda_arch' values should not include the 'sm_' prefix.
       Also, propagate 'cuda_arch' and 'amdgpu_target' variants
       to enabled dependencies.

* petsc: variant is +rocm, package name is hip

Co-authored-by: Jed Brown <[email protected]>
Co-authored-by: Thilina Rathnayake <[email protected]>
RikkiButler20 pushed a commit to RikkiButler20/spack that referenced this pull request Jun 11, 2021
* petsc: add hip variant

* libceed: add 0.8, disable occa by default, and let autodetect AVX

Disabling OCCA because backend updates did not make this release and
there are some known bugs so most users won't have reason to use OCCA.

CEED/libCEED#688

* WIP: ceed: 4.0 release

* MFEM package updates (spack#19748)

* MFEM package updates

* mfem: flake8

* [mfem] Various fixes and tweaks.

[arpack-ng] Add a patch to fix building with IBM XL Fortran.

[libceed] Fix building with IBM XL C/C++.

[pumi] Add C++11 flag for version 2.2.3.

* [mfem] Fix the shared CUDA build.

Reported by: @MPhysXDev

* [mfem] Fix a TODO item

* [mfem] Tweak the AmgX dependencies

* [suite-sparse] Fix the version of the mpfr dependency

* MFEM: add initial HIP support using the ROCmPackage.

* MFEM: add 'slepc' variant.

* MFEM: update the patch for v4.2 for SLEPc.

* mfem: apply 'mfem-4.2-slepc.patch' just to v4.2.

* ceed: apply 'spack style'

* [mfem] Add a patch for mfem v4.2 to work with petsc v3.15.0.

[laghos] Add laghos version 3.1 based on the latest commit in
         the repository; this version works with mfem v4.2.

[ceed] For ceed v4.0 use laghos v3.1.

* [libceed] Explicitly set 'CC_VENDOR=icc' when using 'intel'
          compiler.

* [mfem] Allow pumi >= 2.2.3 with mfem >= 4.2.0.

[ceed] Use pumi v2.2.5 with ceed v4.0.0.

* [ceed] Explicitly use occa v1.1.0 with ceed v4.0.0.
       Use [email protected]+rocm with [email protected]+mfem+hip.

* [ceed] Add NekRS v21 as a dependency for ceed v4.0.0.

* [ceed] Fix NekRS version: 21 --> 21.0

* [ceed] Propagate +cuda variant to petsc for ceed v4.0.

* [mfem] Propagate '+rocm' variant to some other packages.

* [ceed] Use +rocm variant of nekrs instead of +hip.

* [ceed] Do not enable magma with [email protected]+hip.

* [libceed] Fix hip build with [email protected].

* [laghos] For v3.1, use the release .tar.gz file instead of commit.

* Remove cuda & hip variants as they are inherited

* [ceed] Remove comments and FIXMEs about 'magma+hip'.

* [ceed] [libceed] Remove TODOs about occa + hip.

* libceed: use ROCmPackage and +rocm

* petsc: use ROCmPackage for HIP

* libceed, petsc: use CudaPackage

* ceed: forward cuda_arch and amdgpu_target

* [mfem] Use Spack's CudaPackage as a base class; as a result,
       'cuda_arch' values should not include the 'sm_' prefix.
       Also, propagate 'cuda_arch' and 'amdgpu_target' variants
       to enabled dependencies.

* petsc: variant is +rocm, package name is hip

Co-authored-by: Jed Brown <[email protected]>
Co-authored-by: Thilina Rathnayake <[email protected]>
@jeremylt
Copy link
Member

Closed as stale for 533 days

@jeremylt jeremylt closed this Jun 23, 2022
@jeremylt jeremylt deleted the occa/fix-array-methods branch April 27, 2023 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants