Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Linux ThreadSanitizer job to CI #73777

Merged
merged 1 commit into from
Aug 4, 2023
Merged

Add a Linux ThreadSanitizer job to CI #73777

merged 1 commit into from
Aug 4, 2023

Conversation

myaaaaaaaaa
Copy link
Contributor

@myaaaaaaaaa myaaaaaaaaa commented Feb 22, 2023

Having ThreadSanitizer enabled in CI allows it to automatically catch many types of race conditions (specifically, data races, where a variable is accessed by two threads at the same time). Since Godot is moving to a fully threaded model, this is basically a requirement for stable behavior.

Some errors have been suppressed for now. See the diff for details.

@myaaaaaaaaa myaaaaaaaaa changed the title Enable ThreadSanitizer in CI (Linux/Editor with clang sanitizers) Add a Linux ThreadSanitizer job to CI Feb 22, 2023
@Calinou Calinou added this to the 4.x milestone Feb 22, 2023
@rcorre
Copy link
Contributor

rcorre commented Apr 30, 2023

@myaaaaaaaaa since it sounds like you've been playing with TSAN, have you been able to run it on an actual project? If I just run godot.linuxbsd.editor.x86_64.llvm.san, I get a lot of warnings but eventually the project manager opens. If I try to actually open a project (even a simple empty project), I get:

ThreadSanitizer: CHECK failed: tsan_interceptors_posix.cpp:1987 "((thr->slot)) != (0)" (0x0, 0x0) (tid=36945)
Segmentation fault (core dumped)

@myaaaaaaaaa
Copy link
Contributor Author

@myaaaaaaaaa since it sounds like you've been playing with TSAN, have you been able to run it on an actual project?

My motivation at the moment is just to improve CI's ability to test for threading bugs, to keep things at least somewhat manageable. Trying to get something like ThreadSanitizer working immediately on a codebase as large as Godot's is probably a hopeless endeavor.

If I just run godot.linuxbsd.editor.x86_64.llvm.san, I get a lot of warnings but eventually the project manager opens. If I try to actually open a project (even a simple empty project), I get:

ThreadSanitizer: CHECK failed: tsan_interceptors_posix.cpp:1987 "((thr->slot)) != (0)" (0x0, 0x0) (tid=36945)
Segmentation fault (core dumped)

This may be caused by the fact that ThreadSanitizer does not work with other sanitizers (ASan, UBSan, etc) and has to be enabled by itself.

@myaaaaaaaaa myaaaaaaaaa marked this pull request as ready for review June 22, 2023 13:56
@myaaaaaaaaa myaaaaaaaaa requested review from a team as code owners June 22, 2023 13:56
tests/test_main.cpp Outdated Show resolved Hide resolved
tests/test_main.cpp Outdated Show resolved Hide resolved
@myaaaaaaaaa myaaaaaaaaa requested a review from a team as a code owner June 26, 2023 12:04
@YuriSizov YuriSizov modified the milestones: 4.x, 4.2 Jul 25, 2023
@YuriSizov YuriSizov requested a review from akien-mga July 25, 2023 19:50
Copy link
Member

@akien-mga akien-mga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great now with the suppressions.

Just added a question about the need for debug_symbols, if it's confirmed they're needed, this can be merged.

@YuriSizov YuriSizov merged commit 16a9356 into godotengine:master Aug 4, 2023
@YuriSizov
Copy link
Contributor

Thanks!

@myaaaaaaaaa myaaaaaaaaa deleted the enable-tsan branch August 4, 2023 20:25
@YuriSizov
Copy link
Contributor

@myaaaaaaaaa We are starting to get some weird issues with this added workflow.

See this and this runs, which fail on the unit tests step without actually failing any unit tests or reporting any errors from the sanitizer (there are warnings though).

This seems like it's a false positive failure, and it's inconsistent (between these two there are several other runs which didn't fail).

@myaaaaaaaaa
Copy link
Contributor Author

ThreadSanitizer is complaining about races in NavigationServer. A quick fix (depending on how tolerable the spurious CI errors are) would be:

diff --git a/modules/navigation/nav_map.h b/modules/navigation/nav_map.h
index 0a82f595c7..2188df8981 100644
--- a/modules/navigation/nav_map.h
+++ b/modules/navigation/nav_map.h
@@ -98,15 +98,15 @@ class NavMap : public NavRid {
 
 	/// Physics delta time
 	real_t deltatime = 0.0;
 
 	/// Change the id each time the map is updated.
 	uint32_t map_update_id = 0;
 
-	bool use_threads = true;
+	bool use_threads = false;
 	bool avoidance_use_multiple_threads = true;
 	bool avoidance_use_high_priority_threads = true;
 
 	// Performance Monitor
 	int pm_region_count = 0;
 	int pm_agent_count = 0;
 	int pm_link_count = 0;

A more robust fix would be to merge the parallel_for code in #79490 and then apply this:

diff --git a/modules/navigation/nav_map.cpp b/modules/navigation/nav_map.cpp
index 1e55bd2314..592c4da1a9 100644
--- a/modules/navigation/nav_map.cpp
+++ b/modules/navigation/nav_map.cpp
@@ -1181,21 +1181,28 @@ void NavMap::step(real_t p_deltatime) {
 
 	bool use_parallel_for = use_threads && avoidance_use_multiple_threads;
 
 	for_range(0, active_2d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents2D"), [&](const int i) {
 		NavAgent *agent = active_2d_avoidance_agents[i];
 		agent->get_rvo_agent_2d()->computeNeighbors(&rvo_simulation_2d);
 		agent->get_rvo_agent_2d()->computeNewVelocity(&rvo_simulation_2d);
+	});
+	for_range(0, active_2d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents2D"), [&](const int i) {
+		NavAgent *agent = active_2d_avoidance_agents[i];
 		agent->get_rvo_agent_2d()->update(&rvo_simulation_2d);
 		agent->update();
 	});
+
 	for_range(0, active_3d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents3D"), [&](const int i) {
 		NavAgent *agent = active_3d_avoidance_agents[i];
 		agent->get_rvo_agent_3d()->computeNeighbors(&rvo_simulation_3d);
 		agent->get_rvo_agent_3d()->computeNewVelocity(&rvo_simulation_3d);
+	});
+	for_range(0, active_3d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents3D"), [&](const int i) {
+		NavAgent *agent = active_3d_avoidance_agents[i];
 		agent->get_rvo_agent_3d()->update(&rvo_simulation_3d);
 		agent->update();
 	});
 }
 
 void NavMap::dispatch_callbacks() {
 	for (NavAgent *agent : active_2d_avoidance_agents) {

Although it is concerning that the failures are random. @Scony Do you think it's possible to write NavigationServer tests that would find this issue more consistently?

@YuriSizov
Copy link
Contributor

I'm more concerned that these warnings fail the CI. If it's not an error, why does it do that?

@myaaaaaaaaa
Copy link
Contributor Author

myaaaaaaaaa commented Aug 7, 2023

From the point of view of Godot CI, they are errors. My understanding is that sanitizers typically refer to their output as warnings, and reserve the word "error" for very critical errors.

@YuriSizov
Copy link
Contributor

Seems like UB sanitizer reports everything as SUMMARY and it doesn't fail the CI. At the very least we have an inconsistency. Of course it's also important that this thread sanitizer error is only hit sometimes.

@Scony
Copy link
Contributor

Scony commented Aug 7, 2023

ThreadSanitizer is complaining about races in NavigationServer. A quick fix (depending on how tolerable the spurious CI errors are) would be:

diff --git a/modules/navigation/nav_map.h b/modules/navigation/nav_map.h
index 0a82f595c7..2188df8981 100644
--- a/modules/navigation/nav_map.h
+++ b/modules/navigation/nav_map.h
@@ -98,15 +98,15 @@ class NavMap : public NavRid {
 
 	/// Physics delta time
 	real_t deltatime = 0.0;
 
 	/// Change the id each time the map is updated.
 	uint32_t map_update_id = 0;
 
-	bool use_threads = true;
+	bool use_threads = false;
 	bool avoidance_use_multiple_threads = true;
 	bool avoidance_use_high_priority_threads = true;
 
 	// Performance Monitor
 	int pm_region_count = 0;
 	int pm_agent_count = 0;
 	int pm_link_count = 0;

A more robust fix would be to merge the parallel_for code in #79490 and then apply this:

diff --git a/modules/navigation/nav_map.cpp b/modules/navigation/nav_map.cpp
index 1e55bd2314..592c4da1a9 100644
--- a/modules/navigation/nav_map.cpp
+++ b/modules/navigation/nav_map.cpp
@@ -1181,21 +1181,28 @@ void NavMap::step(real_t p_deltatime) {
 
 	bool use_parallel_for = use_threads && avoidance_use_multiple_threads;
 
 	for_range(0, active_2d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents2D"), [&](const int i) {
 		NavAgent *agent = active_2d_avoidance_agents[i];
 		agent->get_rvo_agent_2d()->computeNeighbors(&rvo_simulation_2d);
 		agent->get_rvo_agent_2d()->computeNewVelocity(&rvo_simulation_2d);
+	});
+	for_range(0, active_2d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents2D"), [&](const int i) {
+		NavAgent *agent = active_2d_avoidance_agents[i];
 		agent->get_rvo_agent_2d()->update(&rvo_simulation_2d);
 		agent->update();
 	});
+
 	for_range(0, active_3d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents3D"), [&](const int i) {
 		NavAgent *agent = active_3d_avoidance_agents[i];
 		agent->get_rvo_agent_3d()->computeNeighbors(&rvo_simulation_3d);
 		agent->get_rvo_agent_3d()->computeNewVelocity(&rvo_simulation_3d);
+	});
+	for_range(0, active_3d_avoidance_agents.size(), use_parallel_for, SNAME("RVOAvoidanceAgents3D"), [&](const int i) {
+		NavAgent *agent = active_3d_avoidance_agents[i];
 		agent->get_rvo_agent_3d()->update(&rvo_simulation_3d);
 		agent->update();
 	});
 }
 
 void NavMap::dispatch_callbacks() {
 	for (NavAgent *agent : active_2d_avoidance_agents) {

Although it is concerning that the failures are random. @Scony Do you think it's possible to write NavigationServer tests that would find this issue more consistently?

When threads are involved that's going to happen. I don't think we can do much from tests perspective. I'm also not sure if we should - test cases assume simple execution and they test particular things, threading is out of scope in that sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants