Stokhos + MueLu: Compile failure with CUDA + UVM #12767

Closed
sebrowne opened this issue Feb 21, 2024 · 19 comments
Labels: pkg: MueLu, pkg: Stokhos, type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@sebrowne
Contributor

Bug Report

@trilinos/stokhos
@trilinos/muelu
Compile-time errors occur in Stokhos and MueLu when building with UVM enabled by default. We want to add a PR build that exercises this configuration for Albany (@mperego), so these errors need to be fixed first. Alternatively, we could turn off the packages, but I'm guessing that's not the way we want to go.

Steps to Reproduce

  1. SHA1: develop HEAD 2024-02-21 (fe84214)
  2. Configure script:
source ../Trilinos/packages/framework/GenConfig/gen-config.sh rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_all ../Trilinos

This will expose MueLu and Stokhos build errors.

@sebrowne sebrowne added the type: bug The primary issue is a bug in Trilinos code or tests label Feb 21, 2024
@sebrowne sebrowne changed the title Stokhos + MueLue: Compile failure with CUDA + UVM Stokhos + MueLu: Compile failure with CUDA + UVM Feb 21, 2024

Automatic mention of the @trilinos/muelu team

@cgcgcg
Contributor

cgcgcg commented Feb 21, 2024

@sebrowne If the errors are specific to only a couple of spots, could you post them? Or is it all over the place?

@ndellingwood
Contributor

Do MueLu and/or Stokhos expect -DKokkos_ENABLE_CUDA_UVM=ON to be set as well?

@sebrowne
Contributor Author

@cgcgcg
The file that failed to compile in MueLu is (well, the output file): packages/muelu/src/CMakeFiles/muelu.dir/Utils/ExplicitInstantiation/ETI_MueLu_UtilitiesBase.cpp.o

@ndellingwood I stopped setting that option as per input from @csiefer2 and @mperego , but I can turn it back on if needed. My understanding is that that Kokkos option is going away?

@sebrowne
Contributor Author

Ignore that my build dir is named 'uvm_off'; I promise UVM was on...

/scratch/sebrown/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(1319): error: static assertion failed with "Incompatible View copy construction"
          detected during:
            instantiation of "Kokkos::View<DataType, Properties...>::View(const Kokkos::View<RT, RP...> &, std::enable_if_t<Kokkos::Impl::ViewMapping<Kokkos::View<DataType, Properties...>::traits, Kokkos::View<RT, RP...>::traits, Kokkos::ViewTraits<DataType, Properties...>::specialize>::is_assignable_data_type, void> *) [with DataType=size_t *, Properties=<Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, RT=size_t *, RP=<std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(205): here
            instantiation of "Teuchos::RCP<MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Vector> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::GetMatrixDiagonal(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &) [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

/scratch/sebrown/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp(3541): error: static assertion failed with "View assignment must have compatible spaces"
          detected during:
            instantiation of "void Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::assign(Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::DstType &, const Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::SrcType &, const Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::TrackType &) [with DstTraits=Kokkos::ViewTraits<size_t *, Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, SrcTraits=Kokkos::ViewTraits<size_t *, std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(1321): here
            instantiation of "Kokkos::View<DataType, Properties...>::View(const Kokkos::View<RT, RP...> &, std::enable_if_t<Kokkos::Impl::ViewMapping<Kokkos::View<DataType, Properties...>::traits, Kokkos::View<RT, RP...>::traits, Kokkos::ViewTraits<DataType, Properties...>::specialize>::is_assignable_data_type, void> *) [with DataType=size_t *, Properties=<Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, RT=size_t *, RP=<std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(205): here
            instantiation of "Teuchos::RCP<MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Vector> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::GetMatrixDiagonal(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &) [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

/scratch/sebrown/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(1319): error: static assertion failed with "Incompatible View copy construction"
          detected during:
            instantiation of "Kokkos::View<DataType, Properties...>::View(const Kokkos::View<RT, RP...> &, std::enable_if_t<Kokkos::Impl::ViewMapping<Kokkos::View<DataType, Properties...>::traits, Kokkos::View<RT, RP...>::traits, Kokkos::ViewTraits<DataType, Properties...>::specialize>::is_assignable_data_type, void> *) [with DataType=const size_t *, Properties=<Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, RT=size_t *, RP=<std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(206): here
            instantiation of "Teuchos::RCP<MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Vector> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::GetMatrixDiagonal(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &) [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

/scratch/sebrown/Trilinos/packages/kokkos/core/src/impl/Kokkos_ViewMapping.hpp(3541): error: static assertion failed with "View assignment must have compatible spaces"
          detected during:
            instantiation of "void Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::assign(Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::DstType &, const Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::SrcType &, const Kokkos::Impl::ViewMapping<DstTraits, SrcTraits, std::enable_if_t<<expression>, void>>::TrackType &) [with DstTraits=Kokkos::ViewTraits<const size_t *, Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, SrcTraits=Kokkos::ViewTraits<size_t *, std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp(1321): here
            instantiation of "Kokkos::View<DataType, Properties...>::View(const Kokkos::View<RT, RP...> &, std::enable_if_t<Kokkos::Impl::ViewMapping<Kokkos::View<DataType, Properties...>::traits, Kokkos::View<RT, RP...>::traits, Kokkos::ViewTraits<DataType, Properties...>::specialize>::is_assignable_data_type, void> *) [with DataType=const size_t *, Properties=<Kokkos::Device<Kokkos::CudaUVMSpace::execution_space, Kokkos::CudaUVMSpace::memory_space>, Kokkos::MemoryUnmanaged>, RT=size_t *, RP=<std::conditional_t<true, Kokkos::CudaSpace::execution_space, Kokkos::DefaultExecutionSpace>>]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(206): here
            instantiation of "Teuchos::RCP<MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Vector> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::GetMatrixDiagonal(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &) [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(1015): warning: missing return statement at end of non-void function "MueLu::DetectDirichletRows_kokkos<SC,LO,GO,NO,memory_space>(const Xpetra::Matrix<SC, LO, GO, NO> &, const Teuchos::ScalarTraits<SC>::magnitudeType &, __nv_bool) [with SC=double, LO=int, GO=longlong, NO=Tpetra_KokkosCompat_KokkosCudaWrapperNode, memory_space=Kokkos::HostSpace]"
          detected during:
            instantiation of "Kokkos::View<__nv_bool *, memory_space> MueLu::DetectDirichletRows_kokkos<SC,LO,GO,NO,memory_space>(const Xpetra::Matrix<SC, LO, GO, NO> &, const Teuchos::ScalarTraits<SC>::magnitudeType &, __nv_bool) [with SC=double, LO=int, GO=longlong, NO=Tpetra_KokkosCompat_KokkosCudaWrapperNode, memory_space=Kokkos::HostSpace]" 
(1032): here
            instantiation of "Kokkos::View<__nv_bool *, Kokkos::HostSpace> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::DetectDirichletRows_kokkos_host(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &, const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Magnitude &, __nv_bool) [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_UtilitiesBase_def.hpp(1015): warning: missing return statement at end of non-void function "MueLu::DetectDirichletRows_kokkos<SC,LO,GO,NO,memory_space>(const Xpetra::Matrix<SC, LO, GO, NO> &, const Teuchos::ScalarTraits<SC>::magnitudeType &, __nv_bool) [with SC=std_complex0double0, LO=int, GO=longlong, NO=Tpetra_KokkosCompat_KokkosCudaWrapperNode, memory_space=Kokkos::HostSpace]"
          detected during:
            instantiation of "Kokkos::View<__nv_bool *, memory_space> MueLu::DetectDirichletRows_kokkos<SC,LO,GO,NO,memory_space>(const Xpetra::Matrix<SC, LO, GO, NO> &, const Teuchos::ScalarTraits<SC>::magnitudeType &, __nv_bool) [with SC=std_complex0double0, LO=int, GO=longlong, NO=Tpetra_KokkosCompat_KokkosCudaWrapperNode, memory_space=Kokkos::HostSpace]" 
(1032): here
            instantiation of "Kokkos::View<__nv_bool *, Kokkos::HostSpace> MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::DetectDirichletRows_kokkos_host(const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Matrix &, const MueLu::UtilitiesBase<Scalar, LocalOrdinal, GlobalOrdinal, Node>::Magnitude &, __nv_bool) [with Scalar=std_complex0double0, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Tpetra_KokkosCompat_KokkosCudaWrapperNode]" 
/scratch/sebrown/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp(34): here

4 errors detected in the compilation of "/scratch/sebrown/build_uvm_off/packages/muelu/src/Utils/ExplicitInstantiation/ETI_MueLu_UtilitiesBase.cpp".
ninja: build stopped: cannot make progress due to previous errors.

@ndellingwood
Contributor

My understanding is that that Kokkos option is going away?

@sebrowne correct, the plan is for it to go away with the eventual Kokkos version 5. Testing builds without Kokkos_ENABLE_CUDA_UVM set (but using the package-specific options) is a good status check 👍

I think Tpetra had to introduce some additional code to enable and control templating of the Views on SharedSpace, in order to support the new configure option as a UVM replacement. Maybe other packages will need to do similar things, for example to get all the ETI working when it isn't automatically handled by -DKokkos_ENABLE_CUDA_UVM=ON?

@ndellingwood
Contributor

Adding references to the Tpetra PRs. It looks like there were already updates to address MueLu, so my guess about needing similar SharedSpace support must be wrong:

@cgcgcg
Contributor

cgcgcg commented Feb 21, 2024

@sebrowne Just double checking: does this Trilinos build already have the changes of PR #12738?

@mperego
Contributor

mperego commented Feb 22, 2024

@sebrowne we could disable Stokhos, as Albany does not use it, unless @etphipp is willing to keep it functional with UVM.

@sebrowne
Contributor Author

@cgcgcg yes, I have that merge commit in my checkout.
@mperego that's fine by me; I'll let you make the final decision.

@cgcgcg cgcgcg mentioned this issue Feb 22, 2024
@cgcgcg
Contributor

cgcgcg commented Feb 22, 2024

@sebrowne Ok, please pull and try again.

@etphipp
Contributor

etphipp commented Feb 22, 2024

I was able to reproduce this, and Christian's changes do seem to resolve the MueLu errors. I am working on the Stokhos errors. They shouldn't be hard to fix (I just need to modify how some partial specializations were done).

@csiefer2
Member

csiefer2 commented Feb 23, 2024

@etphipp I suspect that I didn't check Stokhos when I made this stuff work for @mperego. My bad.

@sebrowne
Contributor Author

sebrowne commented Feb 23, 2024

Link to today's results: https://trilinos-cdash.sandia.gov/build/1402053

MueLu errors fixed, 5 Stokhos errors remaining. Thank you all for looking at this!

@etphipp
Contributor

etphipp commented Feb 24, 2024

I think I got the Stokhos errors fixed with PR #12771.

@etphipp
Contributor

etphipp commented Feb 26, 2024

For PR #12771, one of the builds (Intel) failed, but it isn't showing up on trilinos-cdash. Is that on a different CDash?

@etphipp
Contributor

etphipp commented Feb 27, 2024

Stokhos errors should be fixed now.

@sebrowne
Contributor Author

Confirmed the nightly build is fixed, thank you all! I will be adding a new PR build for CUDA + UVM without tests very soon.

@cgcgcg
Contributor

cgcgcg commented Apr 3, 2024

Closing as fixed. If that's somehow not the case please reopen.

@cgcgcg cgcgcg closed this as completed Apr 3, 2024
Projects
Status: Done

6 participants