illegal memory access error subchandra on CUDA #2818

Closed
zhichen3 opened this issue Apr 8, 2024 · 4 comments

@zhichen3
Collaborator

zhichen3 commented Apr 8, 2024

I'm getting a CUDA error on the very first step of the subchandra problem:
CUDA error 700 in file /home/zhi/github/amrex/Src/Base/AMReX_GpuDevice.cpp line 614: an illegal memory access was encountered

To reproduce, compile subchandra with:
make -f GNUmakefile.nse_net USE_CUDA=TRUE USE_SIMPLIFIED_SDC=TRUE NETWORK_DIR=subch_base

Backtrace:

1: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x8970da]
   _ZN5amrex11BLBackTrace7handlerEi
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_BLBackTrace.cpp:99:7

2: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x69f12c]
   _ZN5amrex18ParallelDescriptor5AbortEib
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_ParallelDescriptor.cpp:219:21

3: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x627918]
   _ZN5amrex10Error_hostEPKcS1_
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:243:1

4: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x627857]
   _ZN5amrex5AbortEPKc inlined at /global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:214:6 in _ZN5amrex5AbortERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.H:156:1
_ZN5amrex5AbortERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX.cpp:214:6

5: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x6fbec8]
   _ZN5amrex3Gpu6Device17streamSynchronizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_GpuDevice.cpp:613:464

6: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x430792]
   _ZN5amrex3Gpu17streamSynchronizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_GpuDevice.H:242:1

7: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x805178]
   _ZN5amrex6MFIter8FinalizeEv
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_MFIter.cpp:240:1

8: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x8050e0]
   _ZN5amrex6MFIterD2Ev
/global/homes/z/zhichen/Github/amrex/Src/Base/AMReX_MFIter.cpp:213:1

9: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x6200d1]
   _ZN6Castro11react_stateEdd
/global/homes/z/zhichen/Github/Castro/Source/reactions/Castro_react.cpp:816:39

10: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x61ee11]
   _ZN6Castro16do_new_reactionsEdd
/global/homes/z/zhichen/Github/Castro/Source/reactions/Castro_react.cpp:74:33

11: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x5b5d41]
   _ZN6Castro22post_advance_operatorsEdd
/global/homes/z/zhichen/Github/Castro/Source/sources/Castro_sources.cpp:618:44

12: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x497d36]
   _ZN6Castro14do_advance_ctuEdd
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance_ctu.cpp:122:68

13: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x498b60]
   _ZN6Castro20subcycle_advance_ctuEddii
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance_ctu.cpp:391:60

14: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x490bb0]
   _ZN6Castro7advanceEddii
/global/homes/z/zhichen/Github/Castro/Source/driver/Castro_advance.cpp:69:53

15: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0xa15604]
   _ZN5amrex3Amr8timeStepEidiid
/global/homes/z/zhichen/Github/amrex/Src/Amr/AMReX_Amr.cpp:2022:44

16: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0xa15ddb]
   _ZN5amrex3Amr14coarseTimeStepEd
/global/homes/z/zhichen/Github/amrex/Src/Amr/AMReX_Amr.cpp:2133:26

17: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x4cd009]
   main
/global/homes/z/zhichen/Github/Castro/Source/driver/main.cpp:165:29

18: /lib64/libc.so.6(__libc_start_main+0xef) [0x7fe40ea3e24d]

19: ./Castro2d.gnu.DEBUG.MPI.CUDA.SMPLSDC.ex() [0x40d12a]
   _start
../sysdeps/x86_64/start.S:122
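
The mangled symbols in the backtrace above can be decoded with the c++filt tool or, programmatically, with abi::__cxa_demangle from <cxxabi.h> (available with GCC and Clang). The following is a minimal sketch; the helper name demangle is an assumption made for illustration and is not part of Castro or AMReX.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from Castro/AMReX): turn an Itanium-ABI mangled
// symbol into a readable name, returning the input unchanged on failure.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc; we must free it.
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return std::string(mangled);
    }
    std::string readable(raw);
    std::free(raw);
    return readable;
}
```

For example, frame 9 (_ZN6Castro11react_stateEdd) decodes to Castro::react_state(double, double), which is the last Castro function on the stack before the MFIter destructor synchronizes the stream and reports the error.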
@zhichen3
Collaborator Author

zhichen3 commented Apr 8, 2024

In cuda-gdb, I get "CUDA Exception: Lane User Stack Overflow" when evaluating *(d_num_failed.copyToHost()) at line 814 of Castro_react.cpp.

@zingale
Member

zingale commented Apr 8, 2024

When I link, I get this message:

Stack size for entry function '_ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMinEEE4evalINS_10ReduceDataIJNS_10ValLocPairIdNS_7IntVectEEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_ZN6Castro13estdt_burningEiEUliiiiE_EENSt9enable_ifIXsr5amrex10IsFabArrayIT_vEE5valueEvE4typeERKSI_RKS8_RT0_OT1_EUliiiE_EEvRKNS_3BoxERSI_RKSP_EUlvE_EEvimP11CUstream_stRKT0_EUlvE_EEvS13_' cannot be statically determined

So the compiler is telling us that something is up in that function.

@yut23
Collaborator

yut23 commented Apr 8, 2024

I'm able to reproduce this on my workstation with inputs.N14.coarse (I don't have enough memory for the others).

zingale pushed a commit to AMReX-Astro/Microphysics that referenced this issue Apr 9, 2024
@zingale
Member

zingale commented Apr 9, 2024

Fixed by eliminating recursion.
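
The actual change lives in the referenced Microphysics commit and is not shown in this thread. The sketch below is only a generic illustration of why removing recursion helps: a recursive function's call depth depends on runtime data, which is consistent with a "stack size ... cannot be statically determined" message and can overflow the small per-thread stack on a GPU, whereas a loop uses a fixed amount of stack. Both function names are hypothetical, and the code is plain host C++ so it can be run anywhere.

```cpp
// Hypothetical recursive form: the number of nested calls depends on the
// runtime value of n, so stack usage cannot be bounded at compile time.
// Assumes n >= 0.
double power_recursive(double base, int n) {
    if (n == 0) {
        return 1.0;
    }
    double half = power_recursive(base, n / 2);
    return (n % 2 == 0) ? half * half : half * half * base;
}

// Equivalent iterative form: same result via repeated squaring, but with
// a single stack frame regardless of n. Assumes n >= 0.
double power_iterative(double base, int n) {
    double result = 1.0;
    double factor = base;
    while (n > 0) {
        if (n % 2 == 1) {
            result *= factor;
        }
        factor *= factor;
        n /= 2;
    }
    return result;
}
```

Both versions return the same values (for instance, 1024 for base 2 and n = 10); only the iterative one has a stack footprint that can be known ahead of time.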

zingale closed this as completed Apr 9, 2024