Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between Qt and Ogre2 in presentation #304

Closed
5 tasks done
darksylinc opened this issue Apr 12, 2021 · 8 comments · Fixed by gazebosim/gz-sim#774
Closed
5 tasks done

Race condition between Qt and Ogre2 in presentation #304

darksylinc opened this issue Apr 12, 2021 · 8 comments · Fixed by gazebosim/gz-sim#774
Assignees
Labels
bug Something isn't working

Comments

@darksylinc
Copy link
Contributor

darksylinc commented Apr 12, 2021

ign-rendering uses the same pattern as provided textureinthread sample from Qt. (the code can be seen in repo)

What should happen

The pattern is the following (ideal, i.e. what should happen):

  1. Worker Thread (e.g. Ogre2) is producer
  2. GUI thread (Qt) is consumer
  3. There are two FBOs, A and B
1. Qt requests Ogre2 to start rendering
2. Ogre renders to FBO B, signals it's done to Qt thread
3. Qt thread atomically swaps A and B
4. Qt thread presents B
5. Qt requests Ogre2 to start rendering
6. Ogre renders to FBO A, signals it's done to Qt thread
7. Qt thread atomically swaps A and B
8. Qt thread presents A
9. Repeat step 1

This is basically a double-buffer scheme maintained by hand.

It is even described in Qt's sample (emphasis mine):
QtSample

What is actually happening

  1. Qt requests Ogre2 to start rendering
  2. Ogre renders to FBO A, signals it's done to Qt thread
  3. Qt thread presents A
  4. Repeat step 1

This is causing a race condition because A is both being rendered to while also being used internally as a texture by Qt. This can cause all sorts of weird behaviors or maybe even crashes when Qt UI is used.

Most likely how bad this bug is depends on which driver is being used (e.g. Mesa radeonsi vs Mesa Intel, vs Windows Intel, vs proprietary AMD, vs NVIDIA proprietary, vs Nouveau) and the workload balance and processing speed of the CPU (like all race conditions)

Why it went unnoticed

This bug went unnoticed because Gazebo defaults to using 8x MSAA (but due to bugs in setup, MSAA is not actually used, that's another problem) and the FBO passed to Qt is the resolved one. Due to this, the window for the race condition is extremely small (because FBO is only written to during the resolve operation, which is only a fraction of the whole render).

However once I disable MSAA the problem becomes much more apparent, notice the background flicker:

2021-04-12.12-48-45.mkv.mp4

Environment

  • OS Version: All affected
  • Source or binary build?
    Source. Probably all branches where Ogre2 is used

Is Ogre1 affected?

I don't know. I did not evaluate. The possibility is high, given that part of the buggy behavior comes from ign-gui, which is agnostic to both Ogre1 and Ogre2

Is OptiX affected?

Unknown.

Tasks

Affected files

Most of the bugs affect the files coming from:

  • Ogre2RenderTarget.cc
  • Ogre2Camera.cc (since Ogre2Camera::RenderTextureGLId redirects to Ogre2RenderTarget's GLId)
  • Scene3D.cc from ign-gui

Other

I will be the one fixing this bug. This ticket is for tracking progress

@darksylinc darksylinc added the bug Something isn't working label Apr 12, 2021
@iche033
Copy link
Contributor

iche033 commented Apr 12, 2021

thanks for uncovering this bug. Note that there is also a Scene3D.cc in ign-gazebo and that's the one currently being used when ign gazebo is run. There is a goal to merge the two Scene3Ds.

darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 18, 2021
This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-rendering that referenced this issue Apr 18, 2021
This fix also depends on on a fix in ign-gazebo so that it now can call
SwapFromThread, otherwise the race condition will still happen.

At the current moment, the implementation is very wasteful creating many
more RenderTargets than needed (because 2 workspaces are created).
This issue will be fixed in the next commit

Affects gazebosim#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-rendering that referenced this issue Apr 18, 2021
This fix also depends on on a fix in ign-gazebo so that it now can call
SwapFromThread, otherwise the race condition will still happen.

At the current moment, the implementation is very wasteful creating many
more RenderTargets than needed (because 2 workspaces are created).
This issue will be fixed in the next commit

Affects gazebosim#304

Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc
Copy link
Contributor Author

darksylinc commented Apr 18, 2021

Ooofff!!

This is going to be one complex merge.

My fixes are still WIP but still I submitted the PR.

I'm mostly interested in hearing feedback about PR gazebosim/gz-sim#774 because I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

This suggests the classes I touched were never exposed to the user and therefore it is safe to alter them. However I may be wrong.

I'll be waiting on feedback for this.
Cheers!

Update: It seems the tests are failing on ign-rendering PR, that was unexpected but I can see why I missed it. I made some local changes to trigger the race condition more aggressively and forgot to test the code without these changes. The module is basically crashing because we're attempting to create the same node definition twice, which should not be happening.

darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 19, 2021
This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
@iche033
Copy link
Contributor

iche033 commented Apr 21, 2021

I know for certain that ign-rendering is currently breaking the ABI and I know how to fix it (I just have to move SwapFromThread to the derived classes and then have ign-gazebo dynamic cast these classes and call them directly)

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x). I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

  • ign-gazebo's Scene3D.hh is not installed - you are free to make changes to this file without worrying about API/ABI compatibility
  • ign-rendering's Scene.hh is installed

@darksylinc
Copy link
Contributor Author

ign-rendering's ogre2 rendering engine is currently dynamically loaded as a plugin library from ign-gazebo. If a downcast is needed in ign-gazebo, it means we'll need to explicitly link against the ogre2 target, correct? It's doable but that means we'll need to add a few cmake checks and ifdefs to make sure the user has ogre 2.x installed (it's not required since users can use ign-gazebo with just ogre 1.x).

Mmm. You are right. It will be troublesome to backport.

To put another way: We just need a way to signal that a particular camera needs to do special work to swap its internal state.

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

I'm talking about something along the lines:

class CameraWorkaround
{
    virtual void SwapFromThread( Camera *camera ) = 0;
}

Is this even feasible?
It doesn't have to be a class. It could just be a C function we load dynamically at runtime. Hacky, but it preserves the ABI (as long as it's acceptable that upgrading ign-gazebo forces upgrading ign-rendering too)

I'm leaning towards just merging this into the unstable branch (main branch) in ign-gazebo and ign-rendering for our next release Ignition F. So you can just cherry pick your current changes and apply them to the main branch. We can always backport this change later if you think it's going to help with other bugfixes you're working on.

As for how affects my other bugfixes: The current code is bugged and can corrupt GL state (apparently in non-crashing ways at least with what's we've seen so far with Mesa drivers).

That means that:

  1. Sometimes things workout just fine
  2. Occassionally rendering breaks for a single frame
  3. Sometimes everything is broken

Basically whenever we encounter a bug, it's going to be the question: "is this code broken because I made a mistake, or just the race condition again?".
It's similar to "blaming the compiler" when a bug crashes. It's almost never the compiler, except this time you have proof the compiler can occasionally misbehave.

However I'm more inexperienced about ign-gazebo. I checked out the install folder generated by the make folders, and I could not find Scene.hh anywhere.

Do you mean ign-gazebo's Scene3D.hh (as opposed to ign-rendering's Scene.hh)?

  • ign-gazebo's Scene3D.hh is not installed - you are free to make changes to this file without worrying about API/ABI compatibility
  • ign-rendering's Scene.hh is installed

Yes, sorry. I meant Scene3D.hh. Great that it's not installed, as I suspected

@iche033
Copy link
Contributor

iche033 commented Apr 21, 2021

If adding forcing the user to upgrade both ign-gazebo and ign-rendering at the same time is an option, then this can be fixed by adding a new class to ign-rendering which takes care of the downcasting. This will preserve ABI compatibility, but will cause trouble if the user upgrades ign-gazebo without upgrading ign-rendering (which could be detected at runtime because the symbol will be missing, and warn the user to upgrade ign-rendering as well).

yes the example you gave is feasible. We've actually done something similar before in other libraries for critical bugfixes.

We can also require users to upgrade ign-rendering dependency version when they upgrade ign-gazebo. Currently ign-gazebo and ign-rendering are on version 5.0.0. Once the changes get in, we can bump both versions to 5.1.0 and tell ign-gazebo to look for ign-rednering 5.1.0 by add VERSION 5.1 here.

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

@darksylinc
Copy link
Contributor Author

Just throwing in another idea. Is it possible to do this through ign-gazebo's Scene3D plugin by using 2 cameras that take turn to render? I imagine the difficulty here is keeping the 2 camera pose control (pan, orbit, zoom, etc) in sync.

That may not be a bad idea. I'll have to think a few details about on how feasible are other improvements I have to do (basically saving VRAM)

@darksylinc
Copy link
Contributor Author

darksylinc commented Apr 21, 2021

Hi!

I've been thinking about this and I realized I missed a simple element, which I couldn't discuss in the middle the weekend and wait for a reply, but NOW is the right time to discuss this:

Do we even need multithreading here?

I assume rendering is currently happening in a different thread because "that's what the sample we found in the Qt wiki did".

There are pros and cons of this approach:

Pros

  • If CPU bound (but there are cores available), threading increases UI responsiveness because Qt can finish and display the UI while Ogre is still rendering a single frame. So the simulation may display at e.g. 1 FPS while the UI responds immediately.
  • If GPU bound, threading won't help in UI responsiveness because both threads are still competing for access to the same GPU.
  • Having a separate GL context in a different thread helps avoiding cryptic OpenGL and/or UI errors (because we must restore GL state and context after we're done, so that Qt can continue normally)

Cons

  • Parallel rendering increases VRAM consumption, because we need a double buffer scheme to cycle RenderTargets
  • Increased code complexity
  • Increased debuggability difficulty
  • Always-responsive UI is still hard to keep despite being threaded, depending on how things are coded. For example if for every mouse click the UI thread sends a message to the Ogre thread by locking a mutex, the UI thread will become blocked; and things won't feel responsive at all.

Best of both worlds?

I recommend to keep rendering in another thread to avoid the cryptic GL errors, but serializing the operations.

That means the UI thread is blocked until Ogre is done. I've done this successfully in order to test RenderDoc debugging and it worked.
By serializing rendering:

  • Debuggability becomes easier
  • VRAM consumption doesn't double, because we no longer need a double buffer scheme
  • No changes are needed in ign-rendering
  • ABI won't break, as the changes are only needed in ign-gazebo which is not exposed to the user

It actually gets better, because newer versions at any time can add support for a double buffer scheme on-demand, where at launch-time the user can leverage whether to enable parallel rendering, sacrificing some VRAM for a potential increase in UI responsiveness, or disable parallel rendering.

I was aiming to do parallel rendering by default, but add a command line argument to request serialized rendering for RenderDoc debugging.
However it seems that we should do serialized rendering by default, defer parallel rendering for another time (it can be added any time in future versions) and move on to the next problem.

Thoughts?

@iche033
Copy link
Contributor

iche033 commented Apr 22, 2021

I'm also leaning towards doing serialized rendering by default and keeping the code less complicated (I hope). This sounds like the changes will mainly be done in Scene3D class in the way we integrate ogre3d with Qt. And yes we did parallel rendering as that was what worked at the time, so it's good to hear that you were already able to have something working already.

As for parallel rendering, I think the VRAM sacrifice is relatively small compared to some our typical use cases. But I agree we can leave this for the future if we find limitations in serialized rendering, thanks.

darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 24, 2021
This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 28, 2021
This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 28, 2021
This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 28, 2021
This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue Apr 28, 2021
This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue May 9, 2021
This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
darksylinc added a commit to darksylinc/ign-gazebo that referenced this issue May 9, 2021
This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
iche033 added a commit to gazebosim/gz-sim that referenced this issue May 27, 2021
* Fix race condition when rendering the UI

This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>

* Force serialization of render commands

This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>

* Fix deadlock on initialization & shutdown

Also fixes preexisting race condition when shutting down and improper
uninitialization of the worker thread

Signed-off-by: Matias N. Goldberg <[email protected]>

* Fix deadlock when using MoveTo modifier

Fix coding style

Signed-off-by: Matias N. Goldberg <[email protected]>

* Reimplemented synchronization mechanism to fix deadlocks

It's a good thing we went for serializing rendering.

THe way Qt implements the double buffer scheme using signals & slots is
fundamentally flawed because it assumes the worker thread never needs to
synchronize (e.g. to invalidate FBOs if window resolution changes).

Trying to synchronize can easily cause deadlocks if Qt thread has
spurious updates which don't end up emitting TextureInUse, as the worker
thread is running slower than Qt thread.

A way to fix this could be to use a different synchronization mechanism
where the main thread increases a request counter and the worker thread
is constantly looping but only wakes up when that counter is > 0.

For now, this will do.

Signed-off-by: Matias N. Goldberg <[email protected]>

* Document RenderSync

Turn some pointers into references

Signed-off-by: Matias N. Goldberg <[email protected]>

* Document missing parameter

Signed-off-by: Matias N. Goldberg <[email protected]>

* Document better our use of emit TextureInUse

Signed-off-by: Matias N. Goldberg <[email protected]>

* Change Q_SLOTS to slots for consistency

I changed it because it didn't work, but it must've been old code
because it works now.

Signed-off-by: Matias N. Goldberg <[email protected]>

Co-authored-by: Ian Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants