typeid / type_info equality check fails for clang/libc++ when using VSG in dynamic libraries #899

Open
martinweber opened this issue Aug 4, 2023 · 22 comments

Comments

@martinweber

Issue found

I am posting this here to discuss the following behavior I found:

I have been running into issues when using VSG in a dynamic library: the cast<> function on vsg::Object returned nullptr even when the type was correct. This is caused by how clang/libc++ generates type_info.hash_code() and the related type_info comparison operator.

For example, in a vsg::Visitor:
(Note: this also happens in another area where we are using vsg::Object::cast())

    void apply(vsg::Object& obj) override
    {
        auto* matrix_node = obj.cast<vsg::MatrixTransform>();
        // ...
    }

This always returned a nullptr even when the object was of type vsg::MatrixTransform. Using dynamic_cast<vsg::MatrixTransform*>(&obj) instead returned a pointer to the vsg::MatrixTransform.
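
For readers unfamiliar with this kind of cast, here is a minimal sketch of how a typeid-based cast with a recursive is_compatible() check can be modeled. Everything in the `sketch` namespace is hypothetical and simplified for illustration; it is not VSG's actual implementation, only the general shape implied by the Inherit<ParentClass, Subclass>::is_compatible(const std::type_info&) symbols in this thread.

```cpp
#include <typeinfo>

namespace sketch {

// Root of the hierarchy. cast<T>() asks the object whether it is, or derives
// from, T by comparing std::type_info objects instead of using dynamic_cast.
class Object {
public:
    virtual ~Object() = default;

    virtual bool is_compatible(const std::type_info& type) const noexcept {
        return typeid(Object) == type;
    }

    template <class T>
    T* cast() noexcept {
        return is_compatible(typeid(T)) ? static_cast<T*>(this) : nullptr;
    }
};

// CRTP helper: each level checks its own type, then defers to its parent.
template <class ParentClass, class Subclass>
class Inherit : public ParentClass {
public:
    bool is_compatible(const std::type_info& type) const noexcept override {
        return typeid(Subclass) == type || ParentClass::is_compatible(type);
    }
};

// Hypothetical example types.
class Node : public Inherit<Object, Node> {};
class Group : public Inherit<Node, Group> {};
class Light : public Inherit<Node, Light> {};

} // namespace sketch
```

In this model the whole cast depends on operator== of std::type_info. If two modules each carry their own type_info object for the same class and the runtime does not treat them as equal, every comparison in the chain fails and cast<T>() returns nullptr, which matches the behavior described here.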

I then logged the values for std::type_info in vsg::Inherit::is_compatible():

Subclass: N3vsg15MatrixTransformE, 4737813731 ? is_compatible: N3vsg15MatrixTransformE, 4577409791
Subclass: N3vsg9TransformE, 4737813807 ? is_compatible: N3vsg15MatrixTransformE, 4577409791
Subclass: N3vsg5GroupE, 4737813148 ? is_compatible: N3vsg15MatrixTransformE, 4577409791
Subclass: N3vsg4NodeE, 4737813255 ? is_compatible: N3vsg15MatrixTransformE, 4577409791
type_info object         : 4737813731  // at callsite of vsg::Object::cast()
type_info MatrixTransform: 4737813731 // at callsite of vsg::Object::cast()

(Note: the type after "? is_compatible:" is from the type parameter.)

  • The type_info.name() returned the same value (N3vsg15MatrixTransformE)
  • The type_info.hash_code() returned different values even though the type (type_info.name()) was identical
  • The type_info comparison operator returns false for matching types

The type_info.hash_code() is identical at the call site, but differs inside the type's is_compatible() function. The call site is in a different dynamic library than VSG, which is itself linked as a static library into another dynamic library.
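
A small diagnostic helper along these lines can make the three observations above visible side by side. This is only a sketch: `TypeInfoComparison` and `compare_type_info` are hypothetical names, and the helper uses only standard std::type_info members.

```cpp
#include <cstring>
#include <typeinfo>

// Result of comparing two std::type_info objects in several ways.
struct TypeInfoComparison {
    bool same_object;    // both references point at the same type_info object
    bool operator_equal; // std::type_info::operator==
    bool name_equal;     // strcmp() on name()
    bool hash_equal;     // hash_code() values match
};

inline TypeInfoComparison compare_type_info(const std::type_info& a,
                                            const std::type_info& b) {
    return TypeInfoComparison{
        &a == &b,
        a == b,
        std::strcmp(a.name(), b.name()) == 0,
        a.hash_code() == b.hash_code()};
}
```

To reproduce the situation in this report, one type_info reference would have to come from each module, for example by passing typeid(vsg::MatrixTransform) from the call site into a function compiled into the other library. The failure case is then name_equal being true while operator_equal and hash_equal are false.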

This seems to be an issue with clang/libc++ when using dynamic libraries. I have found discussions about this here and here.

The issue seems to be present when dynamic libraries are loaded using RTLD_LOCAL. Symbol tables are then local to each library, so type_info.hash_code() differs for the same type. The comparison operator on std::type_info also returns false in this case.

Environment

  • macOS 13.4.1 (c) (Ventura)
  • CPU: Apple M1 Pro
  • Apple clang version 14.0.3 (clang-1403.0.22.14.1)

The same code works correctly on Windows with MSVC!

Possible fixes?

  1. Using a strcmp() with type_info.name()? type_info.name() works correctly in this case, and this is the solution pybind went for. However, strcmp() is computationally much more expensive than the current code, especially since is_compatible() is called recursively for parent types whenever the types differ.

  2. Implement a type_hash<> template similar to type_name<> found in type_name.h that will guarantee to return an identical value for identical types?

  3. something else?
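
As an illustration of option 1, a name-based comparison could look roughly like the sketch below. `same_type` is a hypothetical helper written for this discussion; it is not taken from VSG or pybind. The pointer checks are a cheap fast path, so strcmp() only runs when the two type_info objects are physically distinct.

```cpp
#include <cstring>
#include <typeinfo>

// Hypothetical helper: treats two types as equal when their mangled names
// match, even if each module carries its own std::type_info object.
inline bool same_type(const std::type_info& lhs, const std::type_info& rhs) noexcept {
    if (&lhs == &rhs) return true; // same object, no string work needed

    const char* lhs_name = lhs.name();
    const char* rhs_name = rhs.name();
    return lhs_name == rhs_name || std::strcmp(lhs_name, rhs_name) == 0;
}
```

One caveat of comparing names: distinct types that happen to share a mangled name (for example types in anonymous namespaces in different modules) would be treated as the same type.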

Conclusion

We already have two known places where this breaks our application on macOS (and possibly Linux). For now, a dynamic_cast<> instead of using vsg::Object::cast() is a working alternative. Comparing type_name() values also would work.

I fear that this behavior of clang/libc++ will cause more issues though.

Thanks!

@timoore
Contributor

timoore commented Aug 4, 2023

I never understood the problem with just using dynamic_cast. In vsgCs I have:

    template<typename TSubclass, typename TParent>
    vsg::ref_ptr<TSubclass> ref_ptr_cast(const vsg::ref_ptr<TParent>& p)
    {
        return vsg::ref_ptr<TSubclass>(dynamic_cast<TSubclass*>(p.get()));
    }

and use it instead of ref_ptr::cast(). At the time I didn't understand how ref_ptr::cast() was supposed to work.

@robertosfield
Collaborator

robertosfield commented Aug 4, 2023

@martinweber That's an obscure and unwelcome finding. Object::cast<> exists to lower the CPU overhead of casting compared to dynamic_cast<>. Perhaps compatibility issues like this are partly why dynamic_cast<> is so slow.

As a short-term fix, falling back to dynamic_cast<> as the implementation on dynamic builds would perhaps be a workaround.

@timoore "I never understood the problem with just using dynamic_cast" the Elephant in the room any time you use dynamic_cast<> is how slow it is. When I introduced the VSG's RTTI functions I did benchmark them against dynamic_cast<> and they are 3.6 X faster.

I tweeted about it back in July 2020 when I introduced the functionality:

Took a detour from work on work on interleaved array support in the #Vulkan SceneGraph to implement a alternative to dynamic_cast<>. The new vsg::Object::cast<>()/vsg::cast<>() is 3.6x faster for dynamically casting between object/data/node types :-)

https://github.com/vsg-dev/Vulkan

Looking online perhaps the following might be another alternative:
https://kahncode.com/2019/09/24/c-tricks-fast-rtti-and-dynamic-cast/

@timoore
Contributor

timoore commented Aug 4, 2023

@timoore "I never understood the problem with just using dynamic_cast" the Elephant in the room any time you use dynamic_cast<> is how slow it is. When I introduced the VSG's RTTI functions I did benchmark them against dynamic_cast<> and they are 3.6 X faster.

I tweeted about it back in July 2020 when I introduced the functionality:

I understand that dynamic_cast is or can be slow, but is dynamic downcasting really in the hot path of anything in the VSG?

@martinweber
Author

@robertosfield

Looking online perhaps the following might be another alternative:
https://kahncode.com/2019/09/24/c-tricks-fast-rtti-and-dynamic-cast/

This looks very verbose. Unfortunately, it seems to suffer from the same issue of not working across module boundaries, as listed under limitations:

The static mechanic used to build the TypeID make this not safe to pass across module boundaries. This could be improved by generating a TypeID using a hash of the symbol name.

I was thinking of generating a compile time hash for the type that can be used by vsg::Inherit. I haven't explored that idea yet to see if there are limitations to such an approach.

@robertosfield
Collaborator

I think the best way to tackle this issue is to create a test example in vsgExamples that we can use to reproduce the problem and benchmark the performance of different solutions. Unfortunately it looks like I threw away the test program I wrote when I originally worked on this RTTI functionality back in July 2020, as it would have been a good starting place.

Once we can reliably reproduce the issue and benchmark performance we can iterate on different solutions.

I am rather stretched across tasks right now and can't immediately take on another round of investigation and trying different solutions, so help here would be appreciated.
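
As a starting point for such a benchmark, a timing harness could be sketched as follows. All names here are hypothetical and the types are stand-ins rather than VSG classes. It is deliberately naive: an optimizing compiler may hoist or simplify the casts in the loop, so any numbers it produces should be treated with caution.

```cpp
#include <chrono>
#include <cstddef>
#include <typeinfo>

namespace sketch {

struct Base { virtual ~Base() = default; };
struct Derived : Base {};

inline Derived* cast_with_dynamic_cast(Base& object) {
    return dynamic_cast<Derived*>(&object);
}

// Exact-type check only: unlike dynamic_cast, this does not accept classes
// derived from Derived.
inline Derived* cast_with_typeid(Base& object) {
    return typeid(object) == typeid(Derived) ? static_cast<Derived*>(&object) : nullptr;
}

// Average wall-clock time per call of `cast`, in nanoseconds.
template <class Cast>
double nanoseconds_per_cast(Cast cast, Base& object, std::size_t iterations) {
    std::size_t hits = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        if (cast(object) != nullptr) ++hits;
    }
    const auto stop = std::chrono::steady_clock::now();

    // Storing the counter in a volatile keeps the result observable.
    volatile std::size_t sink = hits;
    (void)sink;

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return iterations == 0 ? 0.0 : elapsed.count() / static_cast<double>(iterations);
}

} // namespace sketch
```

Calling nanoseconds_per_cast(sketch::cast_with_dynamic_cast, object, n) and the same with sketch::cast_with_typeid gives two numbers to compare. A real comparison would use the actual VSG types and vsg::Object::cast<>() in place of these stand-ins.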

@martinweber
Author

I created a minimal example to reproduce the issue on my fork of vsgExamples: martinweber/vsgExamples@a143725

Just returning a vsg::MatrixTransform::create() from the dynamic library function did not show the issue. Once I used the vsg::ObjectFactory to create a new object, the issue showed up.

This is the output I get on macOS: (I haven't tried it on Windows or Linux)

Local: type name: N3vsg15MatrixTransformE, type hash: 4296110999
Dylib: type name: N3vsg15MatrixTransformE, type hash: 4305375728
types are not compatible

I have an idea about using compile time generated hashes that I want to try. I'll keep you posted.

@robertosfield
Collaborator

Thanks, I have pulled the example into vsgExamples as the branch:

https://github.com/vsg-dev/vsgExamples/tree/martinweber-clang-typeid-issue

I will now investigate.

@robertosfield
Collaborator

Results so far:

VSG built gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, static build:

$ clang_typeid 
Local: type name: N3vsg15MatrixTransformE, type hash: 8883272728397651726
Dylib: type name: N3vsg15MatrixTransformE, type hash: 8883272728397651726
types are compatible

VSG built gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, dynamic library build:

$ clang_typeid 
Local: type name: N3vsg15MatrixTransformE, type hash: 8883272728397651726
Dylib: type name: N3vsg15MatrixTransformE, type hash: 8883272728397651726
types are compatible

Next I'll install and switch over to the clang compilers.

@robertosfield
Collaborator

I have installed clang-16 & clang++-16 from the Ubuntu 22.04 repo, and set CC and CXX in my env vars with:

export CC=/bin/clang-16
export CXX=/bin/clang++-16

But when I attempt to configure with CMake I get the error "/bin/ld: cannot find -lstdc++: No such file or directory":

$ cmake .
-- The CXX compiler identification is Clang 16.0.6
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - failed
-- Check for working CXX compiler: /bin/clang++-16
-- Check for working CXX compiler: /bin/clang++-16 - broken
CMake Error at /usr/share/cmake-3.22/Modules/CMakeTestCXXCompiler.cmake:62 (message):
  The C++ compiler

    "/bin/clang++-16"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeTmp
    
    Run Build Command(s):/bin/gmake -f Makefile cmTC_b09ee/fast && /bin/gmake  -f CMakeFiles/cmTC_b09ee.dir/build.make CMakeFiles/cmTC_b09ee.dir/build
    gmake[1]: Entering directory '/home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeTmp'
    Building CXX object CMakeFiles/cmTC_b09ee.dir/testCXXCompiler.cxx.o
    /bin/clang++-16    -MD -MT CMakeFiles/cmTC_b09ee.dir/testCXXCompiler.cxx.o -MF CMakeFiles/cmTC_b09ee.dir/testCXXCompiler.cxx.o.d -o CMakeFiles/cmTC_b09ee.dir/testCXXCompiler.cxx.o -c /home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeTmp/testCXXCompiler.cxx
    Linking CXX executable cmTC_b09ee
    /usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_b09ee.dir/link.txt --verbose=1
    /bin/clang++-16 CMakeFiles/cmTC_b09ee.dir/testCXXCompiler.cxx.o -o cmTC_b09ee 
    /bin/ld: cannot find -lstdc++: No such file or directory
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    gmake[1]: *** [CMakeFiles/cmTC_b09ee.dir/build.make:100: cmTC_b09ee] Error 1
    gmake[1]: Leaving directory '/home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeTmp'
    gmake: *** [Makefile:127: cmTC_b09ee/fast] Error 2
    
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:3 (project)


-- Configuring incomplete, errors occurred!
See also "/home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeOutput.log".
See also "/home/robert/Dev/VulkanSceneGraph/CMakeFiles/CMakeError.log".

This is how I previously used clang instead of gcc and I recall it working OK, and searches online haven't given me any useful pointers yet.

@martinweber
Author

Unfortunately I do not have a recent Linux installation ready to use. Our target is ancient CentOS 7 with gcc.

Looking at the clang meta package on 22.04LTS I would think clang-14 is the last supported version on 22.04LTS. clang-16 seems to be a 23.04 (lunar) package only, though even there the clang meta package still uses v15.

On macOS the current Apple clang version is 14.0.3 which is the one I use and which shows the issue.

@robertosfield
Collaborator

I tried installing clang-14 but the package was broken :-|

@robertosfield
Collaborator

clang++15 installs but I get the same "/bin/ld: cannot find -lstdc++: No such file or directory" issue when running cmake.

@martinweber
Author

I have implemented a first prototype that generates type hashes at compile time, using a simple FNV-1a hash. I got the hashing code from here.

The FNV-1a hash is very simple (one XOR and one multiplication per byte, starting from a fixed offset basis) and is therefore fast. It is vulnerable to long sequences of zeros, which is not going to happen with type name strings.

I implemented the templates necessary to generate the type hash value at compile time.

Now I get this output (macOS clang 14.0.3):

Local: type name: N3vsg15MatrixTransformE         , type hash: (type_info) 4343477127   | (FNV-1a) 4022647357068752390
Dylib: type name: N3vsg15MatrixTransformE         , type hash: (type_info) 4352661200   | (FNV-1a) 4022647357068752390
types are compatible

I haven't really tested this, as I treated it as a proof of concept! I am also not sure about the implementation for vsg::Data as well as vsg::Array, vsg::Array2D, vsg::Array3D, and vsg::Value, as they are implemented differently from other classes (they are templates). So that needs more testing.

This implementation will have an impact on compile time. Runtime performance and memory requirements should not really be affected. I still need to implement a test / benchmark for that.

My branch with the implementation is here. I updated the reproduction test example that generates the output above as well.
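
To illustrate the idea for readers who do not want to open the branch, here is a minimal sketch of a compile-time FNV-1a type hash. It is not the code from the branch: `fnv1a_64`, `type_name_of` and `type_hash` are hypothetical names, and the name string is supplied by hand through a trait specialization rather than through VSG's type_name<>. The constants are the standard 64-bit FNV offset basis and prime.

```cpp
#include <cstdint>

namespace sketch {

// 64-bit FNV-1a over a null-terminated string, usable in constant expressions.
constexpr std::uint64_t fnv1a_64(const char* text) noexcept {
    std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    while (*text != '\0') {
        hash ^= static_cast<std::uint64_t>(static_cast<unsigned char>(*text));
        hash *= 1099511628211ULL; // FNV prime
        ++text;
    }
    return hash;
}

// Hypothetical trait: every hashed type provides its name as a string literal.
template <class T>
struct type_name_of;

struct MatrixTransform {}; // stand-in type for the example

template <>
struct type_name_of<MatrixTransform> {
    static constexpr const char* value() noexcept { return "sketch::MatrixTransform"; }
};

// The hash depends only on the name string, so every module that compiles
// this code computes the same value for the same type.
template <class T>
constexpr std::uint64_t type_hash() noexcept {
    return fnv1a_64(type_name_of<T>::value());
}

// Hashing an empty string yields the offset basis unchanged.
static_assert(fnv1a_64("") == 14695981039346656037ULL, "offset basis");

} // namespace sketch
```

Because the value is derived from the name rather than from the address of a type_info object, it stays the same when the type's RTTI is duplicated across modules. The trade-off is that two different types with the same name string, or a hash collision between two names, would be indistinguishable.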

@martinweber
Author

clang++15 installs but I get the same "/bin/ld: cannot find -lstdc++: No such file or directory" issue when running cmake.

I just noticed that it has stdc++ as a -l link library parameter. That is not correct. The parameter should be -std=c++17 or -stdlib=libc++ when libc++ is not the system standard.

Did you delete the build folder after building with gcc? At least CMakeCache.txt should be deleted to avoid left-over configuration from a previous pass.

@robertosfield
Collaborator

I clobbered all my VSG projects before trying the clang build, so there was no CMakeCache.txt prior to running cmake. The error is for a CMake-generated testCXXCompiler.cxx file, and CMake is generating its own link lines.

@martinweber
Author

A quick follow-up. I have replaced a couple of failing vsg::Object::cast<> calls with dynamic_cast<>. Now I get several of these runtime warnings:

dynamic_cast error 2: One or more of the following type_info's has hidden visibility or is defined in more than one translation unit. They should all have public visibility. N3vsg6ObjectE, N3vsg15MatrixTransformE, N3vsg15MatrixTransformE.

I will investigate this as well.

I have been very busy this week with other tasks, but I plan to continue working on tests and benchmarks for the compile-time generated type hashes next week. I also ordered an SSD so I can install Ubuntu for testing.

@robertosfield
Collaborator

Thanks for continuing with the work. While I haven't been able to keep trying to get clang installed and keep testing, this is an area where I'm committed to seeing a solution checked in.

I have other work that I have to get on with right now, but as a TODO item for the next point release of the VSG I think we need a solution to these problems, so I will dive back into this topic prior to the next release.

@martinweber
Author

Unfortunately I also was busy with other work. I now found some time to do more testing.

The issue is indeed caused by type_info being defined in more than one translation unit. The main application and the dynamic library both link statically to VSG, so each has its own definition of type_info. After I added a dynamic_cast<> and ran the reproduction test under a debugger, I got an error from Clang indicating exactly that.

[clang-typeid-issue][~/code/forks/vsgExamples/build/bin]$ lldb -o run ./clang_typeid 
(lldb) target create "./clang_typeid"
Current executable set to '/Users/martin/code/forks/vsgExamples/build/bin/clang_typeid' (arm64).
(lldb) run
Local: type name: N3vsg15MatrixTransformE, type hash: 4295078983
Dylib: type name: N3vsg15MatrixTransformE, type hash: 4304343248
types are not compatible
2023-09-07 10:32:15.367749+0200 clang_typeid[16101:132932] dynamic_cast error 2: One or more of the following type_info's has hidden visibility or is defined in more than one translation unit. They should all have public visibility. N3vsg6ObjectE, N3vsg15MatrixTransformE, N3vsg15MatrixTransformE.
Process 16101 exited with status = 0 (0x00000000)
Process 16101 launched: '/Users/martin/code/forks/vsgExamples/build/bin/clang_typeid' (arm64)
(lldb) exit

dynamic_cast<> nevertheless worked correctly, while std::type_info.hash_code() returned different hashes.

Checking with nm also showed that each module had its own type_info. The symbols have the same name:

[clang-typeid-issue][~/code/forks/vsgExamples/build/bin]$ nm clang_typeid | grep "MatrixTransform.*type_info"
000000010000d92c t __ZNK3vsg7InheritINS_9TransformENS_15MatrixTransformEE13is_compatibleERKSt9type_info
000000010000d920 t __ZNK3vsg7InheritINS_9TransformENS_15MatrixTransformEE9type_infoEv
[clang-typeid-issue][~/code/forks/vsgExamples/build/lib]$ nm libclang_typeid_dylib.dylib | grep "MatrixTransform.*type_info"
0000000000019a4c t __ZNK3vsg7InheritINS_9TransformENS_15MatrixTransformEE13is_compatibleERKSt9type_info
0000000000019a40 t __ZNK3vsg7InheritINS_9TransformENS_15MatrixTransformEE9type_infoEv
00000000000c3b80 t __ZNKSt3__110__function6__funcIZN3vsg13ObjectFactory3addINS2_15MatrixTransformEEEvvEUlvE_NS_9allocatorIS6_EEFNS2_7ref_ptrINS2_6ObjectEEEvEE6targetERKSt9type_info

So after this, I compiled VulkanSceneGraph as a dynamic library and linked the executable and dynamic library against it. This solved the issue, as type_info is now only available from one translation unit (libvsg.dylib).

So going forward, this would be the solution to the Clang/libc++ type_info issues. I still have the code that generates type hashes at compile time if that would be of interest. That approach continued to work with duplicate definitions because the hash does not change between compilations: it is based on the type name string.

Ideally, we could add a build option to build VulkanSceneGraph as either a static or a dynamic library.

@robertosfield
Collaborator

I'm a bit lost on what approach works for you now. Do you still need your changes to be applied for the dynamic library version of the VSG to work OK on clang?

Ideally, we could add a build option to build VulkanSceneGraph as either a static or a dynamic library.

The VSG builds using the standard CMake approach with the BUILD_SHARED_LIBS option. We've been building static and dynamic libraries of the VSG this way since its inception, so do you just mean a solution for Clang and dynamic libraries?

@martinweber
Author

Using VSG as a dynamic library in all modules works in the reproduction example.
Statically linking VSG to multiple modules (my own dynamic libraries) results in errors with type_info.hash_code().

So I am now looking into changing our use of VSG to a dynamic library.

The VSG builds using the standard CMake approach using BUILD_SHARED_LIBS option.

Ah, thanks.

I am pretty confident that switching VSG to a dynamic library will solve my issues with Clang on macOS without any changes to VSG. That's the next step for me to verify.

@timoore
Contributor

timoore commented Sep 7, 2023

Chiming in with what is probably obvious advice to most. You need to either

  • Build VSG as a static library and every library that links to it as static too. Use modern CMake to construct the link line in your application that links all these libraries together;
  • or build VSG as a dynamic library.

@martinweber
Author

Yeah, in hindsight it seems obvious 😉
The issue is very specific to Clang/libc++ though. GCC and MSVC worked fine.


3 participants