Note: this tutorial is mainly focused around ASAN but TSAN issues can be handled in similar way.
- Install all ROS 2 build dependencies by following the official instructions.
- Install and enable Colcon mixin package:
sudo apt-get install python3-colcon-mixin
colcon mixin add default https://raw.githubusercontent.com/colcon/colcon-mixin-repository/master/index.yaml
colcon mixin update default
Mixins are “configuration profiles” which can be used to compile ROS 2
with additional flags more easily. They are stored locally in
~/.colcon
. For instance, you can see the asan mixin configuration
file
here.
- Install colcon-sanitizer-reports (Recommended):
git clone https://github.com/colcon/colcon-sanitizer-reports.git
cd colcon-sanitizer-reports
sudo python3 setup.py install
The colcon-sanitizer-reports is a plugin for colcon test
that parses
sanitizer issues from stdout/stderr, deduplicates the issues, and outputs them
to a CSV. The CSV provides a much easier starting point for choosing sanitizer
issues to tackle and should contain enough information to debug the issue
completely.
- Setup
ccache
to speed-up development:
sudo apt install -y ccache
ccache -M 20G # increase cache size
# Add the following to your .bashrc or .zshrc file and restart your terminal:
export CC=/usr/lib/ccache/gcc
export CXX=/usr/lib/ccache/g++
- Create a ROS 2 workspace for ASAN related tasks (do not compile yet!)
cd ~/ros2_asan_ws
wget https://raw.githubusercontent.com/ros2/ros2/master/ros2.repos
vcs-import src < ros2.repos
Unless you are specifically testing these packages, you may ignore them to save time:
touch src/ros2/common_interfaces/actionlib_msgs/COLCON_IGNORE
touch src/ros2/common_interfaces/common_interfaces/COLCON_IGNORE
touch src/ros2/rosidl_typesupport_opensplice/opensplice_cmake_module/COLCON_IGNORE
touch src/ros2/rmw_fastrtps/rmw_fastrtps_dynamic_cpp/COLCON_IGNORE
touch src/ros2/rmw_opensplice/rmw_opensplice_cpp/COLCON_IGNORE
touch src/ros2/ros1_bridge/COLCON_IGNORE
touch src/ros2/rosidl_typesupport_opensplice/rosidl_typesupport_opensplice_c/COLCON_IGNORE
touch src/ros2/rosidl_typesupport_opensplice/rosidl_typesupport_opensplice_cpp/COLCON_IGNORE
touch src/ros2/common_interfaces/shape_msgs/COLCON_IGNORE
touch src/ros2/common_interfaces/stereo_msgs/COLCON_IGNORE
touch src/ros2/common_interfaces/trajectory_msgs/COLCON_IGNORE
- First, compile all packages:
cd ~/ros2_asan_ws # you will need to be exactly in this directory
# Those flags are similar to what the nightly job uses.
colcon build --build-base=build-asan --install-base=install-asan \
--cmake-args -DOSRF_TESTING_TOOLS_CPP_DISABLE_MEMORY_TOOLS=ON \
-DINSTALL_EXAMPLES=OFF -DSECURITY=ON --no-warn-unused-cli \
-DCMAKE_BUILD_TYPE=Debug \
--mixin asan-gcc \
--packages-up-to test_communication \
--symlink-install
# Equivalent command for tsan:
# You can either use different workspaces or the same one.
colcon build --build-base=build-tsan --install-base=install-tsan \
--cmake-args -DOSRF_TESTING_TOOLS_CPP_DISABLE_MEMORY_TOOLS=ON \
-DINSTALL_EXAMPLES=OFF -DSECURITY=ON --no-warn-unused-cli \
-DCMAKE_BUILD_TYPE=Debug \
--mixin tsan \
--packages-up-to test_communication \
--symlink-install
-DCMAKE_BUILD_TYPE=Debug
to
ensure that debugging symbols are generated. This allows ASAN/MSAN to
report files and line numbers in backtraces.
Once the compilation is finished, colcon will report that all packages have been compiled but some had “stderr output”. This is fine.
- To run the tests for ASan:
cd ~/ros2_asan_ws # you will need to be exactly in this directory
colcon test --build-base=build-asan --install-base=install-asan \
--event-handlers sanitizer_report+ --packages-up-to test_communication
- To run tests for TSan
cd ~/ros2_asan_ws # you will need to be exactly in this directory
colcon test --build-base=build-tsan --install-base=install-tsan \
--event-handlers sanitizer_report+ --packages-up-to test_communication
Omit the --event-handlers
flag if you did not install
colcon-sanitizer-reports.
Some tests may fail, this is OK. Once done, you can look at the test logs or sanitizer_report.csv. Examples from tests logs:
cd ~/ros2_asan_ws # you will need to be exactly in this directory
cd log/latest_test
# Displays three lines after the beginning of a ASAN reported issue.
grep -R '==.*==ERROR: .*Sanitizer' -A 3
[..]
--
rcpputils/stdout_stderr.log:1: ==32481==ERROR: LeakSanitizer: detected memory leaks
rcpputils/stdout_stderr.log-1:
rcpputils/stdout_stderr.log-1: Direct leak of 4 byte(s) in 1 object(s) allocated from:
rcpputils/stdout_stderr.log-1: #0 0x7f7d99dac458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
--
rcpputils/stdout.log:1: ==32481==ERROR: LeakSanitizer: detected memory leaks
rcpputils/stdout.log-1:
rcpputils/stdout.log-1: Direct leak of 4 byte(s) in 1 object(s) allocated from:
rcpputils/stdout.log-1: #0 0x7f7d99dac458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
--
rcpputils/streams.log:1: ==32481==ERROR: LeakSanitizer: detected memory leaks
rcpputils/streams.log-1:
rcpputils/streams.log-1: Direct leak of 4 byte(s) in 1 object(s) allocated from:
rcpputils/streams.log-1: #0 0x7f7d99dac458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
Note: if you re-run colcon test, latest_test symbolic link will be
updated and will point to a different category. Run ls -l log
to
identify which log directory you have been looking at (e.g.
latest_test → test_2019-04-05_18-03-24
) and make a note of the
string you are seeing (test_2019-04-05_18-03-24
) in my case. This
will ensure you don't forget which colcon test run contain the logs
you've been analyzing.
When we will have a stack-ranking for bugs this should be the source of
truth. For now, choose the package with the less amount of dependencies
(choose rcpputils
in priority, then rmw*
then rcl*
packages). You can print the packages dependency tree in your terminal
using the following command:
# We exclude ament packages as they are part of the buildchain / tooling
# and not part of ROS2.
$ colcon graph --packages-skip-regex 'ament.*' --packages-up-to rcl
fastcdr + * * * * *............
gtest_vendor + * . . ... . .. ..
osrf_pycommon + * . ... .
[... snip ...]
test_msgs +*
rcl +
You will want to choose a package after rcpputils
but as high as
possible in the pyramid.
For the sake of example, we will choose to work on rcpputils
(this
bug has been solved by this
PR). You can guess which
package has issues by looking at the grep output (if the bug appears in
rcpputils/*.log
, rcpputils
needs some investigations).
From this point, I suggest having two shells. I personally use i3+tmux and divide vertically the window in two. The “test shell” is to run the binaries, the “editor shell” one is to edit the source (I use vim).
Test shell: cd ~/ros2_asan_ws/build-asan/rcpputils
Source shell:
cd ~/ros2_asan_ws/src/ros2/rcpputils
In the test shell:
`cd ~/ros2_asan_ws/build-asan/rcpputils
`# We use CTEST_OUTPUT_ON_FAILURE to force CTest to display more information
# on failure.
CTEST_OUTPUT_ON_FAILURE=1 make test
# The previous line is equivalent to:
CTEST_OUTPUT_ON_FAILURE=1 ctest .
Running ctest
instead of make allows you to specify the exact flags
ctest
will use. For instance, ctest -R uncrustify
will only run
the tests containing the string uncrustify
(useful to fix C++ coding
style issues quickly).
After running the test suite, you can realize that test_basic
is
failing. Let's run the test manually:
`./test_basic
Running main() from ../../../install-asan/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from test_tsa
[ RUN ] test_tsa.libcxx_types
[ OK ] test_tsa.libcxx_types (0 ms)
[ RUN ] test_tsa.capability
[ OK ] test_tsa.capability (0 ms)
[ RUN ] test_tsa.ptr_guard
[ OK ] test_tsa.ptr_guard (0 ms)
[ RUN ] test_tsa.shared_capability
[ OK ] test_tsa.shared_capability (0 ms)
[ RUN ] test_tsa.return_capability
[ OK ] test_tsa.return_capability (0 ms)
[ RUN ] test_tsa.try_acquire
[ OK ] test_tsa.try_acquire (0 ms)
[ RUN ] test_tsa.acquire_ordering
[ OK ] test_tsa.acquire_ordering (0 ms)
[----------] 7 tests from test_tsa (0 ms total)
[----------] Global test environment tear-down
[==========] 7 tests from 1 test case ran. (0 ms total)
[ PASSED ] 7 tests.
=================================================================
==30859==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x7fbefcd0b458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
#1 0x5620b4c650a9 in FakeGuarded::FakeGuarded() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x190a9)
#2 0x5620b4c63444 in **test_tsa_shared_capability_Test**::TestBody() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x17444)
#3 0x5620b4cdc4fd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x904fd)
#4 0x5620b4cce1e7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x821e7)
#5 0x5620b4c79f0f in testing::Test::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2df0f)
#6 0x5620b4c7b33a in testing::TestInfo::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2f33a)
#7 0x5620b4c7bede in testing::TestCase::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2fede)
#8 0x5620b4c96fef in testing::internal::UnitTestImpl::RunAllTests() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x4afef)
#9 0x5620b4cdefb0 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x92fb0)
#10 0x5620b4cd04b0 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x844b0)
#11 0x5620b4c93d83 in testing::UnitTest::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x47d83)
#12 0x5620b4c672d2 in RUN_ALL_TESTS() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x1b2d2)
#13 0x5620b4c67218 in main (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x1b218)
#14 0x7fbefc09bb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
Direct leak of 4 byte(s) in 1 object(s) allocated from:
#0 0x7fbefcd0b458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
#1 0x5620b4c650a9 in FakeGuarded::FakeGuarded() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x190a9)
#2 0x5620b4c62d4b in **test_tsa_capability_Test**::TestBody() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x16d4b)
#3 0x5620b4cdc4fd in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x904fd)
#4 0x5620b4cce1e7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x821e7)
#5 0x5620b4c79f0f in testing::Test::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2df0f)
#6 0x5620b4c7b33a in testing::TestInfo::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2f33a)
#7 0x5620b4c7bede in testing::TestCase::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x2fede)
#8 0x5620b4c96fef in testing::internal::UnitTestImpl::RunAllTests() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x4afef)
#9 0x5620b4cdefb0 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x92fb0)
#10 0x5620b4cd04b0 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x844b0)
#11 0x5620b4c93d83 in testing::UnitTest::Run() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x47d83)
#12 0x5620b4c672d2 in RUN_ALL_TESTS() (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x1b2d2)
#13 0x5620b4c67218 in main (/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcpputils/test_basic+0x1b218)
#14 0x7fbefc09bb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)
SUMMARY: AddressSanitizer: 8 byte(s) leaked in 2 allocation(s). `
The stack trace explains where the problem happens. In this case we have
two tests leaking memory. We can guess this function is actually the
test defined in TEST(test_tsa, shared_capability)
.
You can now fix the issue by modifying the code. In the test shell, you can compile the test again and run the updated binary:
`cd ~/ros2_asan_ws/build-asan/rcpputils
make && ./test_basic`
Note: if you'd rather work from the root directory of the workspace,
you could run make -C build-asan/rcpputils
.
If the test suite is emitting too many warning, use --gtest_filter=
to only run the test case you are looking to fix. This will keep the
amount of ASAN errors under control.
Iterate until the bug is fixed.
Never send a PR without running make test
locally. If the
uncrustify
test fails, it means the coding style is not followed
anymore. This can be fixed easily:
# Extract the "ament_uncrustify" command line from the test:
cd ~/ros2_asan_ws/build-asan/.../ # In the package BUILD directory
ctest . -R '^uncrustify$' -V # Run only the uncrustify test and print test output.UpdateCTestConfiguration from :/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/CTestConfiguration.ini
# CTest output, the line you are looking for is in **BOLD**
Parse Config file:/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/CTestConfiguration.ini
Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/usr/bin/cmake
UpdateCTestConfiguration from :/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/CTestConfiguration.ini
Parse Config file:/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/CTestConfiguration.ini
Test project /home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 34
Start 34: uncrustify
34: Test command: /usr/bin/python3 "-u" "/home/ANT.AMAZON.COM/tmoulard/ros2_ws/install-asan/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/test_results/rcutils/uncrustify.xunit.xml" "--package-name" "rcutils" "--output-file" "/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/ament_uncrustify/uncrustify.
txt" "--command" "/home/ANT.AMAZON.COM/tmoulard/ros2_ws/install-asan/ament_uncrustify/bin/ament_uncrustify" "--xunit-file" "/home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/test_results/rcutils/uncrustify.xunit.xml"
34: Test timeout computed to be: 60
34: -- run_test.py: invoking following command in '/home/ANT.AMAZON.COM/tmoulard/ros2_ws/src/ros2/rcutils':
34: - **/home/ANT.AMAZON.COM/tmoulard/ros2_ws/install-asan/ament_uncrustify/bin/ament_uncrustify --xunit-file /home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/test_results/rcutils/uncrustify.xunit.xml**
# [...snip... more CTest output]
# Copy the line in **BOLD** before (from YOUR term as paths will differ).
# Go to your package SOURCE directory:
cd ~/ros2_asan_ws/src/.../
# Paste the line in your term, add --reformat as an extra flag and run the command.
/home/ANT.AMAZON.COM/tmoulard/ros2_ws/install-asan/ament_uncrustify/bin/ament_uncrustify --xunit-file /home/ANT.AMAZON.COM/tmoulard/ros2_ws/build-asan/rcutils/test_results/rcutils/uncrustify.xunit.xml \
** --reformat**
# Your code is now correctly formatted!
Once the bug is fixed and the PR is ready to be opened:
git add -p
git commit -s # Always include a DCO in your commits.
git push origin master:thomas-moulard/fix-my-bug
Open a PR and you're done!
- Test log output:
ERROR: LeakSanitizer: detected memory leaks
- XML tag: detected-memory-leaks
At the end of the execution, ASAN reports as a single error the list of all memory chunks which never were freed. This happens as a final single error message because until the process exits, there is a possibility this memory can be freed.
Example:
1: ==32481==ERROR: LeakSanitizer: detected memory leaks
1:
1: Direct leak of 4 byte(s) in 1 object(s) allocated from:
1: #0 0x7f7d99dac458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
[snip]
1:
1: Direct leak of 4 byte(s) in 1 object(s) allocated from:
1: #0 0x7f7d99dac458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
[snip]
1:
1: SUMMARY: AddressSanitizer: 8 byte(s) leaked in 2 allocation(s).
- Test log output:
==12958==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c0000006b8 at pc 0x* bp 0x* sp 0x*
- XML tag: heap-buffer-overflow
This happens when the program accesses an unallocated chunk of memory.
Example:
56: =================================================================
56: ==12958==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c0000006b8 at pc 0x7f1d2a9d7733 bp 0x7ffc9668be70 sp 0x7ffc9668b618
56: READ of size 511 at 0x60c0000006b8 thread T0
56: #0 0x7f1d2a9d7732 (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79732)
[snip]
56:
56: 0x60c0000006b8 is located 0 bytes to the right of 120-byte region [0x60c000000640,0x60c0000006b8)
56: allocated by thread T0 here:
56: #0 0x7f1d2aa3e458 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.4+0xe0458)
[snip]
56:
56: SUMMARY: AddressSanitizer: heap-buffer-overflow (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x79732)
56: Shadow bytes around the buggy address:
56: 0x0c187fff8080: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
56: 0x0c187fff8090: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
56: 0x0c187fff80a0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
56: 0x0c187fff80b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
56: 0x0c187fff80c0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
56: =>0x0c187fff80d0: 00 00 00 00 00 00 00**[fa]**fa fa fa fa fa fa fa fa
56: 0x0c187fff80e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
56: 0x0c187fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
56: 0x0c187fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
56: 0x0c187fff8110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
56: 0x0c187fff8120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
56: Shadow byte legend (one shadow byte represents 8 application bytes):
56: Addressable: 00
56: Partially addressable: 01 02 03 04 05 06 07
56: Heap left redzone: fa
56: Freed heap region: fd
56: Stack left redzone: f1
56: Stack mid redzone: f2
56: Stack right redzone: f3
56: Stack after return: f5
56: Stack use after scope: f8
56: Global redzone: f9
56: Global init order: f6
56: Poisoned by user: f7
56: Container overflow: fc
56: Array cookie: ac
56: Intra object redzone: bb
56: ASan internal: fe
56: Left alloca redzone: ca
56: Right alloca redzone: cb
56: ==12958==ABORTING
In this case, we can see that this is an off-by-on issue where the memory in BOLD is illegally accessed. The position of this cell (first cell after the end of the allocated memory) is leading us to think that this is an off-by-one bug.
- Test log output:
WARNING: ThreadSanitizer: data race (pid=*)
- XML tag:
data-races
Two threads writes to (or read and write) the same chunk of memory without holding a lock. The result of this operation is non-deterministic and can lead to crashes, dead-lock, etc.
Refactoring the code using Thread Safety Annotations is the easiest way forward. Those extensions will prevent bad patterns.
Example of error:
30: ==================
30: WARNING: ThreadSanitizer: data race (pid=23489)
30: Write of size 8 at 0x7ba000000060 by main thread (mutexes: write M3845, write M3798):
30: #0 close <null> (libtsan.so.0+0x2f447)
[snip]
30:
30: Previous read of size 8 at 0x7ba000000060 by thread T2:
30: #0 recvmsg <null> (libtsan.so.0+0x3b48b)
[snip]
30: Location is file descriptor 6 created by main thread at:
30: #0 socket <null> (libtsan.so.0+0x2ee83)
[snip]
30: Mutex M3845 (0x7fea26ffe800) created at:
30: #0 pthread_mutex_lock <null> (libtsan.so.0+0x3faeb)
[snip]
30: Mutex M3798 (0x7fea26ffe320) created at:
30: #0 pthread_mutex_lock <null> (libtsan.so.0+0x3faeb)
[snip]
30: Thread T2 (tid=23492, finished) created by main thread at:
30: #0 pthread_create <null> (libtsan.so.0+0x2bcfe)
[snip]
30:
30: SUMMARY: ThreadSanitizer: data race (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x2f447) in __interceptor_close
30: ==================
This error messages describes that the main thread wrote 8 bytes while T2 reads it. TSAN explains that 0x7ba000000060 is actually mapped to the file descriptor number 6 and includes where the file descriptor was open. When the data race happened, mutexes M3845 and M3798 were locked. TSAN also mentions where those mutexes have been created.
TSAN doc details what are some typical data races. This race condition is documented under the sections “Race on a file descriptor” and “Race during exit”.
- Test log output:
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=*)
- XML tag:
lock-order-inversion
Lock order inversions happen when a code path allows the same mutex to be locked twice before the original lock can be released. If the code path happens, the process will enter dead-lock.
Refactoring the code using Thread Safety Annotations is the easiest way forward. Those extensions will prevent bad patterns.
Example of error:
34: ==================
34: WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock) (pid=23572)
34: Cycle in lock order graph: M8016590817765616 (0x000000000000) => M56430286811999312 (0x000000000000) => M8016590817765616
34:
34: Mutex M56430286811999312 acquired here while holding mutex M8016590817765616 in thread T12:
34: #0 pthread_mutex_lock <null> (libtsan.so.0+0x3faeb)
[snip]
34:
34: Mutex M8016590817765616 acquired here while holding mutex M56430286811999312 in thread T12:
34: #0 pthread_mutex_lock <null> (libtsan.so.0+0x3faeb)
[snip]
34:
34: Thread T12 (tid=23609, running) created by main thread at:
34: #0 pthread_create <null> (libtsan.so.0+0x2bcfe)
[snip]
34:
34: SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x3faeb) in __interceptor_pthread_mutex_lock
34: ==================
TSAN detected a code path where the mutex M8016590817765616 is locked twice without releasing the lock in between. The error message details the stack trace when each mutex is being locked. In order:
- Mutex M8016590817765616 gets locked a first time (this is OK - we don't know where this happens)
- Mutex M56430286811999312 gets locked (this is OK - the first stack trace explains where this happens)
- Mutex M8016590817765616 gets locked a second time (this is bad - the second stack trace explains where this happens)
Additionally, the error message explains that all locks happen in threat T12 (the last stack trace details which thread is T12).
- Test log output:
==5587==ERROR: AddressSanitizer: heap-use-after-free on address 0x61400000fe44 at pc 0x47b55f bp 0x7ffc36b28200 sp 0x7ffc36b281f8
- XML tag:
heap-use-after-free
An already freed memory block is accessed.
The code must be rewritten to avoid this problem. For instance, memory ownership can be improved to avoid misuse. unique_ptr should be used whenever a memory chunk has a single owner (over raw C pointers). Thread safety annotations are not useful in this case.
- Test log output:
WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=*)
- XML tag:
signal-unsafe-call-inside-signal
See man 7 signal-safety. Only some functions are allowed in signal handlers, unsafe functions must never be used.
Signal handlers should contain as little code as possible. If some work happening in the callback can be moved back into an application thread, this should help solving this problem.
Example of error:
32: ==================
32: WARNING: ThreadSanitizer: signal-unsafe call inside of a signal (pid=5630)
32: #0 malloc <null> (libtsan.so.0+0x2ae13)
32: #1 _IO_file_doallocate <null> (libc.so.6+0x7e18b)
32: #2 rcl_logging_multiple_output_handler ../../src/ros2/rcl/rcl/src/rcl/logging.c:132 (librcl.so+0x1560f)
32: #3 rcutils_log ../../src/ros2/rcutils/src/logging.c:407 (librcutils.so+0xabf0)
32: #4 rclcpp::SignalHandler::signal_handler(int, siginfo_t*, void*) ../../src/ros2/rclcpp/rclcpp/src/rclcpp/signal_handler.cpp:221 (librclcpp.so+0x157c93)
[snip]
32:
32: SUMMARY: ThreadSanitizer: signal-unsafe call inside of a signal (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x2ae13) in malloc
32: ==================
In this example, the signal handler is logging a string. This logging
code is allocating memory using malloc()
which is not an allowed
function in signal handlers.
- Test log output:
WARNING: ThreadSanitizer: signal handler spoils errno
- XML tag:
signal-handler-spoils-errno
errno
value must never be modified in signal handlers. See “unsafe
call inside signal callback”. The best solution is to avoid placing
complex code in signal callbacks. Example of bug this problem can
introduce (in pseudo-code):
void signal_handler() {
my_function_modifying_errno(); // Will fail and set errno to ABCD
if (errno == ABCD) { /* do something */ }
}
int main() {
install_signal_handler(&signal_handler);
my_other_function_modifying_errno(); // Will fail and set errno to EFGH.
// A signal is received! signal_handler() gets executed here.
if (errno == ABCD) { /* do something */ } // <- this gets executed
else if (errno == EFGH) {
/* do something else */ } // <- this should have been executed
}
As you can see, modifying errno value in signal handlers is creating
issues because after the signal finishes executing, any errno
checking code will think its checking the errno
value of the
function the thread just executed whereas it is actually checking the
errno value set by the signal handler call.unsafe-call-ins