Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testSmartProjectionFactor unit test fails with march=native #590

Closed
Tracked by #824
chrisbeall opened this issue Nov 9, 2020 · 27 comments
Closed
Tracked by #824

testSmartProjectionFactor unit test fails with march=native #590

chrisbeall opened this issue Nov 9, 2020 · 27 comments
Assignees
Labels
bug Bug report

Comments

@chrisbeall
Copy link
Member

Description

testSmartProjectionFactor fails on Ubuntu 16.04, gcc 5.4.0 with default configuration. The default configuration uses march=native, which is not exercised by CI. Test passes with GTSAM_BUILD_WITH_MARCH_NATIVE=OFF.

126/243 Test #126: testSmartProjectionFactor ..............***Failed    0.02 sec
/home/cbeall/git/gtsam/gtsam/slam/tests/testSmartProjectionFactor.cpp:492: Failure: "Exception: 
Indeterminant linear system detected while working near variable
1 (Symbol: 1).

Thrown when a linear system is ill-posed.  The most common cause for this
error is having underconstrained variables.  Mathematically, the system is
underdetermined.  See the GTSAM Doxygen documentation at
http://borg.cc.gatech.edu/ on gtsam::IndeterminantLinearSystemException for
more information." 
There were 1 failures

I'll try to look into this myself later, but wanted to make a report.

Steps to reproduce

Get latest develop, configure, make check.

Expected behavior

Test passes.

Environment

Ubuntu 16.04, gcc 5.4.0

@jlblancoc
Copy link
Member

This might even be a bug in the older gcc version... I'll try to run all tests with optimize native ON on an u20.04.
Anyway, if it's a bug in either gcc or eigen, it might depend on the specific CPU, so it's hard to reproduce... Please, report your flags with: grep flags /proc/cpuinfo to complete the bug report.

PS: GTSAM_BUILD_WITH_MARCH_NATIVE might be OFF by default, it's a recurrent topic, but in no case should that flag lead to a failed test, only to problems in 3rd party user programs if built with different combinations of eigen-version and/or optimize-native.

@chrisbeall
Copy link
Member Author

grep flags /proc/cpuinfo
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

A few more data points:

  • I don't get this failure with gcc 7.5 in a Ubuntu 18.04 VM on the same machine.
  • testSmartFactorBase, testSmartProjectionFactor, testImuFactorSerialization segfault on clang-9 with march=native. Problem appears to be in serialization. Unclear whether the root cause might be related. Can you please try to reproduce with clang-9, also?

@jlblancoc
Copy link
Member

jlblancoc commented Nov 10, 2020

Some tests:

  • Native (no VM) u18.04 + gcc 7.5 + all gtsam defaults (optimize native=ON, Embedded Eigen=ON) ==> testSmartProjectionFactor PASSES.
  • Same machine + clang-10 + all gtsam defaults ==> SEGFAULT

Crashes occur at:

0x00007ffff538f1ce in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x00007ffff538f1ce in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff74deac8 in boost::serialization::typeid_system::extended_type_info_typeid_0::type_unregister() () from /usr/lib/x86_64-linux-gnu/libboost_serialization.so.1.65.1
#2  0x0000000000598982 in boost::serialization::extended_type_info_typeid<Eigen::Matrix<double, 3, 1, 0, 3, 1> >::~extended_type_info_typeid() ()
#3  0x00007ffff4d0c161 in __run_exit_handlers (status=0, listp=0x7ffff50b4718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108

Rebuilding with optimize native=OFF for clang-10 also segfault's, so the "optimize native" doesn't seem to be the problem here (?).

Rebuilding with system Eigen also leads to segfault.

This seems like either a bug of clang (or old GCC) + boost serialization, right?

We are not alone:

@varunagrawal
Copy link
Collaborator

What version of Boost are y'all using? I believe this is a boost bug from before version 1.67, because I had faced a serialization segfault issue in the recent past. The basic problem is that the older boost serialization code would do a shallow serialization, so when de-serializing any complex factor, the test code would try to access something that wasn't initialized, hence segfault.

I am using clang-3.8 and with default GTSAM settings. My boost version is 1.67 and my tests pass.

@varunagrawal
Copy link
Collaborator

Just tried with both GTSAM_BUILD_WITH_MARCH_NATIVE=OFF and ON and I get the tests to pass.

@chrisbeall please do try running the tests with Boost 1.67+.

@jlblancoc
Copy link
Member

What version of Boost are y'all using? I believe this is a boost bug from before version 1.67,

1.65.1. I'll try 1.67+. Any painless way to install it w/o building boost from sources for u18.04?

@jlblancoc
Copy link
Member

PS: If boost is confirmed to be the cause, at least a note should be added to the README.

@varunagrawal varunagrawal self-assigned this Nov 13, 2020
@varunagrawal
Copy link
Collaborator

This link has a custom PPA: https://asciinema.org/a/199344

@varunagrawal
Copy link
Collaborator

@chrisbeall @jlblancoc any updates? I apologize for the pushes, but there is some stuff I wish to investigate so fixing this niggling issues should help my cause.

@varunagrawal
Copy link
Collaborator

Gentle reminder. If this is indeed a boost version issue, I'd like to update the CMake and the docs to reflect this ASAP. 🙂

@ProfFan
Copy link
Collaborator

ProfFan commented Dec 24, 2020

Maybe we should consider (seriously) getting rid of boost, as the biggest boost dependency has been eliminated (which coincidentally is the wrap C++ application)

@johnwlambert
Copy link
Contributor

i would like to see boost gone, since it is a large system dependency i really dont think we need, or even want

@varunagrawal
Copy link
Collaborator

That's not quite correct. We use boost for the OptionalJacobian amongst other things.

Also I don't quite see @johnwlambert's point about it being a big system dependency that we don't need. You install boost once and forget about it, plus it can be used for other C++ projects and scripts, e.g. for parsing command line arguments.

@varunagrawal
Copy link
Collaborator

Anyway, whether or not boost is needed is tangential to the point of this PR.

@jlblancoc @chrisbeall can y'all let me know if upgrading boost fixed this issue for you?

@johnwlambert
Copy link
Contributor

On a cluster where one does not have sudo access, boost installation can be difficult to get right using conda -- I've seen this recently on Skynet.

@varunagrawal
Copy link
Collaborator

Well that's not a Boost problem, that's a lazy admin problem.

Moreover the fix in that case is incredibly simple: install Boost to a location you have access to and just add that location to the Cmake and ldconfig env variables, and voila!

@varunagrawal
Copy link
Collaborator

FYI that's how I manage multiple versions of boost on my system, no issues.

@ProfFan
Copy link
Collaborator

ProfFan commented Dec 24, 2020

Boost is a burden for us because boost is a source of uncertainty - I think bugs in boost reflected on GTSAM have cost me 100+ work-hours to debug in the previous two years. I would say if we are targeting Ubuntu 18.04+ (which = GCC 7.4), then we should have most of the Boost functionalities in std.

@ProfFan
Copy link
Collaborator

ProfFan commented Dec 24, 2020

Anyway I agree this is tangential to this issue so let's move the discussion elsewhere.

@jlblancoc
Copy link
Member

@ProfFan :

Boost is a burden for us because boost is a source of uncertainty - I think bugs in boost reflected on GTSAM have cost me 100+ work-hours to debug in the previous two years. I would say if we are targeting Ubuntu 18.04+ (which = GCC 7.4), then we should have most of the Boost functionalities in std.

If it's acceptable to impose the minimum standard C++17 to users, then it would be a really easy port on boost::{optional,shared_ptr} -> std::*, but the remaining part would be serialization.
It could be either left as an optional feature implemented only if boost is avalable in the system (minimum gtsam effort), or port it to another serialization library (I'm the author of one, so I would better take a step back and give no opinions ;-) ).

@varunagrawal

Gentle reminder. If this is indeed a boost version issue, I'd like to update the CMake and the docs to reflect this ASAP. slightly_smiling_face

Thanks for the PPA. I had to remove ROS to allow apt to install boost 1.67, but it's ok.

However, I couldn't get cmake to be happy with this 1.67.0 version, I'll try to continue investigating later, but just in case someone knows the solution, here's the issue:

cmake -DBoost_ADDITIONAL_VERSIONS="1.67.0"  ..

...

-- Could NOT find Boost (missing: serialization system filesystem thread program_options date_time timer chrono regex) (found suitable version "1.67.0", minimum required is "1.58")
CMake Error at cmake/HandleBoost.cmake:33 (message):
  Missing required Boost components >= v1.58, please install/upgrade Boost or
  configure your search paths.
Call Stack (most recent call first):
  CMakeLists.txt:41 (include)

@dellaert
Copy link
Member

Moving to c++17 and removing boost, if we decide to do that, is not going to happen until the next major version change (5.X.Y), so let's not discuss too much.

@jlblancoc
Copy link
Member

u18.04 (amd64) + clang-10 + ...

  • boost 1.65 -> segfault
  • boost 1.66 -> (could not find this version on the PPA)
  • boost 1.67 -> no crash.
  • boost 1.70 -> no crash.
  • boost 1.74 -> build error (sigh)

So it seems that minimum Boost version should be 1.67, at least for clang.

cc: @varunagrawal Should you perhaps update the README?

@varunagrawal
Copy link
Collaborator

@chrisbeall encountered this issue for gcc as well, so if he can confirm that upgrading Boost to 1.67+ works, we can close this.

@jlblancoc created the PR! #681

@varunagrawal
Copy link
Collaborator

Quick update: I tried this on an Ubuntu 16.04 docker image with GCC 5.4.0 and Boost 1.58 and I cannot reproduce the issue.

@chrisbeall am I missing something?

@varunagrawal varunagrawal added the bug Bug report label Mar 17, 2021
@jlblancoc jlblancoc mentioned this issue Jul 22, 2021
14 tasks
@jlblancoc
Copy link
Member

Does #681 imply that this issue could be closed now?

@varunagrawal
Copy link
Collaborator

I believe that to be the case.

@varunagrawal
Copy link
Collaborator

Closing this due to lack of movement and since we've added it to the list of known issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report
Projects
None yet
Development

No branches or pull requests

6 participants