Linking to both tensorflow and protobuf causes segmentation fault during static initializers #24976
Comments
I found a temporary workaround for myself, but it should still be possible to do this from released binaries without needing to rebuild. A local opt build from r1.12 at a6d8ffa works.
However, I get the segfault from https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-1.12.0.tar.gz with protobuf built locally, and also from https://storage.googleapis.com/tensorflow-nightly/github/tensorflow/lib_package/libtensorflow-cpu-linux-x86_64.tar.gz (Wed Jan 16 22:33:29 PST 2019) with protobuf built locally from head (3.6.1) around the same time.
I just hit this with the
I also stumbled upon this problem.
The problem is easy to replicate:
Link with -ltensorflow and it works fine. Uncomment the line to include protobuf and link with both -ltensorflow and -lprotobuf, and observe the segmentation fault on initialization.
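The snippet from that comment did not survive the copy into this page; below is a minimal sketch of the reproducer it describes. The file name and build flags are assumptions, not the commenter's originals.

```cpp
// repro.cc -- minimal sketch of the reproducer described above (names/flags assumed).
//
// Works:     g++ repro.cc -ltensorflow
// Segfaults: g++ repro.cc -ltensorflow -lprotobuf   (with the include below uncommented)
#include <cstdio>
#include <tensorflow/c/c_api.h>

// Uncomment to pull in the application's own copy of protobuf:
// #include <google/protobuf/message.h>

int main() {
  // The crash happens in libtensorflow's static initializers, i.e. before
  // main() even runs, so this line is never reached in the failing setup.
  std::printf("TensorFlow C library version: %s\n", TF_Version());
  return 0;
}
```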
@gunan @allenlavoie can either of you comment?
This was over a month ago and we're still having issues with it. An update or fix would be very much appreciated!
We are also hitting this problem on NVIDIA's Xavier and would appreciate an update or fix. If there are no plans to fix the bug, we will try to build TensorFlow with the hints from matth79.
Sounds like it must be a symbol conflict. And since it's the same library, it's not a case where we can just rename one of the symbols to avoid the conflict. The workarounds sound like (1) only load the second copy of protobuf in a .so that does not use TensorFlow, and you can use both that .so and TensorFlow's .so from your main program, (2) instead of linking normally, dlopen() TensorFlow with RTLD_DEEPBIND set so TensorFlow prefers its own symbols. I'm not sure what TensorFlow can do. Putting something in the global symbol table which conflicts with TensorFlow's protobuf usage isn't something we can easily work around. Unless someone has a suggestion?
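A minimal sketch of workaround (2), assuming a glibc system and that libtensorflow.so is on the loader's search path; entry points are resolved with dlsym() instead of being linked normally:

```cpp
// deepbind.cc -- load TensorFlow with RTLD_DEEPBIND so it resolves protobuf
// symbols against its own internal copy rather than the global symbol table.
// Assumed build line: g++ deepbind.cc -ldl
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_DEEPBIND is a glibc extension
#endif
#include <dlfcn.h>
#include <cstdio>

int main() {
  void* tf = dlopen("libtensorflow.so", RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
  if (!tf) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  // Resolve entry points explicitly instead of linking against them.
  using TF_VersionFn = const char* (*)();
  auto tf_version = reinterpret_cast<TF_VersionFn>(dlsym(tf, "TF_Version"));
  if (tf_version) std::printf("TF version: %s\n", tf_version());
  dlclose(tf);
  return 0;
}
```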
Hello, I get the same problem. The situation is this: I am using C++ to call Python's TensorFlow. The protobuf library is loaded in our own environment, and when we call Python's "import tensorflow as tf" from C++ in that environment, the problem above occurs. When the "import tensorflow as tf" is removed, the problem disappears. Do you know the reason? I think the protobuf in my environment conflicts with the protobuf in TensorFlow. Can you help me? Thanks.
This is indeed a problem with protobuf; there's not much TF itself can do, as @allenlavoie mentioned. We dealt with this by running TF operations in a separate process that talks over a UNIX socket, but @allenlavoie's solutions should work too.
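For what it's worth, the shape of that process-isolation workaround is roughly the toy sketch below: socketpair() plus fork(), with the request/reply protocol stubbed out as a byte echo. Only the child would load TensorFlow, so the parent can link its own protobuf freely. All names here are made up for illustration.

```cpp
// isolate.cc -- toy sketch of running TF in a separate process over a UNIX
// socket. A real version would serialize requests/replies, and only the
// child would dlopen/link TensorFlow.
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
    perror("socketpair");
    return 1;
  }
  pid_t pid = fork();
  if (pid == 0) {                       // child: the only process touching TF
    close(sv[0]);
    char buf[256];
    ssize_t n = read(sv[1], buf, sizeof(buf));
    if (n > 0) write(sv[1], buf, n);    // placeholder for "run op, send result"
    close(sv[1]);
    _exit(0);
  }
  close(sv[1]);                          // parent: links its own protobuf freely
  const char req[] = "run-graph";        // stand-in for a serialized request
  write(sv[0], req, strlen(req));
  char buf[256];
  ssize_t n = read(sv[0], buf, sizeof(buf));
  if (n > 0) std::printf("reply from TF process: %.*s\n", (int)n, buf);
  close(sv[0]);
  waitpid(pid, nullptr, 0);
  return 0;
}
```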
I hope readers of this thread have learned a valuable lesson about using static initializers this way.
I also have this issue. Reproduced with
While I do not want to close this issue, as @allenlavoie wrote in #24976 (comment), I am not sure what we can do. So, unfortunately, I can only offer #24976 (comment), and we should close this as "Infeasible".
I ran into a core-dump issue when calling import tensorflow through the C++ Python API.
Finally, I installed a Python protobuf that matches TensorFlow's protobuf version, 3.7.1, and it magically works. I don't know how to check the protobuf version inside the TensorFlow libraries libtensorflow_framework.so or _pywrap_tensorflow_internal.so. Since TensorFlow 1.14 requires protobuf >= 3.6.1, I installed 3.6.1 first, and my program threw an error. If I installed Python protobuf 3.11.3 instead, I got a segfault. Once I upgraded protobuf to 3.7.1, it worked.
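On the "how to check the protobuf version" question: for the copy you compile against in C++, the headers expose a version macro. A sketch is below; note it reports the header version your own code was built with, not the copy baked into libtensorflow_framework.so.

```cpp
// pbversion.cc -- print the protobuf version this translation unit was
// compiled against. GOOGLE_PROTOBUF_VERSION encodes
// major*1000000 + minor*1000 + patch, e.g. 3007001 for 3.7.1.
#include <cstdio>
#include <google/protobuf/stubs/common.h>

int main() {
  std::printf("compiled against protobuf %d\n", GOOGLE_PROTOBUF_VERSION);
  return 0;
}
```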
@gnossen gave a great overview in grpc/grpc#24992 of the overall problem: a Python process that uses protobuf _and_ another native library linking in libprotobuf can frequently crash. This seems to frequently affect tensorflow as well: tensorflow/tensorflow#8394, tensorflow/tensorflow#9525 (comment), tensorflow/tensorflow#24976, tensorflow/tensorflow#35573, https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/contrib/makefile/rename_protobuf.sh, tensorflow/tensorflow#16104. Testing locally, this fixes both crashes when linking in multiple versions of protobuf and fixes `DescriptorPool` clashes as well (e.g. Python and native code importing different versions of the same message). Co-authored-by: Roy Williams <[email protected]>
System information
Describe the current behavior
Aborts on SIGSEGV
Describe the expected behavior
Exits cleanly
Details
I want to create an application that calls the C API but can also parse protocol buffers on its own behalf. For that, I want to link dynamically to tensorflow and statically to protobuf. When I do this, it seems like protobuf may be tricking libtensorflow.so into thinking that it has run some static initializers that it in fact has not run (on the static variables needed by its own internal copy of protobuf).
The segfault is only on Linux. Linking the same way on Windows works fine.
I have varied libtensorflow and protobuf versions, and it seems to happen with all of them. It also happens whether I choose static or dynamic linking for my binary's copy of protobuf.
I also tried building my own liba.so that itself statically links protobuf, and then a binary that linked dynamically to "a" and statically to protobuf. This worked, which points away from this being a purely protobuf issue.
Code to reproduce the issue
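(The command block from the original report did not survive the copy into this page. A plausible reconstruction is sketched below; the source file name and the paths are assumptions.)

```cpp
// main.cc -- links dynamically to libtensorflow, plus the app's own protobuf.
// Assumed build line (placeholder paths):
//   g++ main.cc -I$LIBTENSORFLOW/include -L$LIBTENSORFLOW/lib \
//       -ltensorflow -lprotobuf -lpthread
#include <tensorflow/c/c_api.h>
#include <google/protobuf/message.h>

int main() { return 0; }  // never reached: the segfault is at load time
```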
Removing -lprotobuf from the above command will get rid of the segfault.
Other info / logs
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
   from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
(gdb) bt
#0  0x00007fffed8f20b8 in tensorflow::kernel_factory::OpKernelRegistrar::InitInternal(tensorflow::KernelDef const*, absl::string_view, std::unique_ptr<tensorflow::kernel_factory::OpKernelFactory, std::default_delete<tensorflow::kernel_factory::OpKernelFactory> >) ()
   from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#1  0x00007fffed88336a in tensorflow::kernel_factory::OpKernelRegistrar::OpKernelRegistrar(tensorflow::KernelDef const*, absl::string_view, tensorflow::OpKernel* (*)(tensorflow::OpKernelConstruction*)) ()
   from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#2  0x00007fffed85f806 in _GLOBAL__sub_I_dataset.cc ()
   from /usr/local/google/home/mattharvey/no_backup/libtensorflow/lib/libtensorflow_framework.so
#3  0x00007ffff7de88aa in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdc68, env=env@entry=0x7fffffffdc78) at dl-init.c:72
#4  0x00007ffff7de89bb in call_init (env=0x7fffffffdc78, argv=0x7fffffffdc68, argc=1, l=<optimized out>) at dl-init.c:30
#5  _dl_init (main_map=0x7ffff7ffe170, argc=1, argv=0x7fffffffdc68, env=0x7fffffffdc78) at dl-init.c:120
#6  0x00007ffff7dd9c5a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7  0x0000000000000001 in ?? ()
#8  0x00007fffffffdf2e in ?? ()
#9  0x0000000000000000 in ?? ()
0x00007fffed8f20a0 <+80>: mov 0x50(%r15),%rax
0x00007fffed8f20a4 <+84>: lea -0xa0(%rbp),%rbx
0x00007fffed8f20ab <+91>: mov %rbx,%rdi
0x00007fffed8f20ae <+94>: mov (%rax),%r8
0x00007fffed8f20b1 <+97>: mov 0x48(%r15),%rax
0x00007fffed8f20b5 <+101>: mov (%rax),%rsi
=> 0x00007fffed8f20b8 <+104>: mov -0x18(%r8),%r9
How did -0x18(%r8) get illegal?
(gdb) info register r8
r8 0x0 0
-0x18 is certainly illegal. Where did it come from? 0x50(%r15) if we trace through the above.
(gdb) info register r15
r15 0x555555768d10 93824994413840
(gdb) x/2 0x555555768d60
0x555555768d60: 0xee2c0bc0 0x00007fff
(gdb) x/2 0x00007fffee2c0bc0
0x7fffee2c0bc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000
... the 0x0 that ended up in r8.
Zoom out to find lots of stuff uninitialized:
(gdb) x/64x 0x7fffee4ddb00
0x7fffee4ddb00 <google::protobuf::_DoubleValue_default_instance_>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb10 <google::protobuf::_DoubleValue_default_instance_+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb20 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb30 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb40 <google::protobuf::internal::RepeatedPrimitiveDefaults::default_instance()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb50 <guard variable for google::protobuf::internal::RepeatedStringTypeTraits::GetDefaultRepeatedField()::instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb60 <guard variable for google::protobuf::internal::(anonymous namespace)::Register(google::protobuf::MessageLite const*, int, google::protobuf::internal::ExtensionInfo)::local_static_registry>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb70 <_ZStL8__ioinit>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb80 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddb90 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddba0 <google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::mu+32>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbb0 <guard variable for google::protobuf::internal::InitSCCImpl(google::protobuf::internal::SCCInfoBase*)::runner>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbc0 <google::protobuf::internal::fixed_address_empty_string>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbd0 <google::protobuf::internal::implicit_weak_message_default_instance>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbe0 <google::protobuf::internal::implicit_weak_message_default_instance+16>: 0x00000000 0x00000000 0x00000000 0x00000000
0x7fffee4ddbf0 <google::protobuf::ShutdownProtobufLibrary()::is_shutdown>: 0x00000000 0x00000000 0x00000000 0x00000000