Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition when rendering the UI #774

Merged

Conversation

darksylinc
Copy link
Contributor

@darksylinc darksylinc commented Apr 18, 2021

This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

The PR for ign-rendering can be found at gazebosim/gz-rendering#307
Forget about that PR for now

🦟 Bug fix

Fixes gazebosim/gz-rendering#304

Checklist

  • Signed all commits for DCO

@github-actions github-actions bot added the 🏢 edifice Ignition Edifice label Apr 18, 2021
@darksylinc darksylinc force-pushed the matias-fix-race-condition-main branch from 2a597a4 to cdf6d32 Compare April 19, 2021 00:52
@darksylinc darksylinc force-pushed the matias-fix-race-condition-main branch 2 times, most recently from f884431 to 4ea254b Compare April 28, 2021 23:26
@darksylinc darksylinc marked this pull request as ready for review April 29, 2021 01:59
@darksylinc
Copy link
Contributor Author

darksylinc commented Apr 29, 2021

Ooofff!

This was quite the Pandora's box. Fix a bug, 3 new hidden/pre-existing race conditions appear. I hope I got them all (or at least the major ones).
I think some random crashes on shutdown are explained by a race condition I fixed.

It's ready for review. It does not depend on anything, but may require additional testing.

If everything goes fine, I think this fix can be (back-)ported to pretty much all branches!

Btw: What's the deal with ign-gui? @iche033 said the goal was to merge these two files; but I don't know when/where is ign-gui's Scene3D.cc file is used.

Considering they're almost identical, I suspect these patches have to be ported there too, but I don't know how to test it because I how to trigger ign-gui's version

@chapulina
Copy link
Contributor

What's the deal with ign-gui? @iche033 said the goal was to merge these two files

Yup, here's the issue for that: gazebosim/gz-gui#137. We're currently studying the possibility of doing it as part of #556.

I don't know when/where is ign-gui's Scene3D.cc file is used.

ign gui -c <path to ign-gui>/examples/config/scene3d.config instantiates ign-gui's Scene3D, for example. It's not a very interesting scene though.

The ign_rviz project makes a more interesting use of that Scene3D.

Copy link
Contributor

@iche033 iche033 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some minor coding style comments.

Here're a few things I tested and working:

  • adding and removing shapes and lights
  • moving models with transform tool and component inspector
  • image display
  • lidar visualization
  • collisions visualization
  • setting transparency
  • video recording
  • follow mode

I ran into a couple of situations where GUI froze

  1. Using the Move To function (right click on object and select Move To)
  2. Moving the camera when no Qt GUI elements are being updated. For example, when the simulation is paused, the Real Time Factor (RTF) % number on the bottom right of the screen is static - in this case sometimes camera movement locks up. You can also right click on this Qt widget, select close, then move the camera to reproduce the issue.
    • I have not found the cause of this freeze yet.

src/gui/plugins/scene3d/Scene3D.cc Outdated Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.cc Outdated Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.cc Outdated Show resolved Hide resolved
@darksylinc
Copy link
Contributor Author

darksylinc commented Apr 30, 2021

I fixed all mentioned issues except this one:

Moving the camera when no Qt GUI elements are being updated. For example, when the simulation is paused, the Real Time Factor (RTF) % number on the bottom right of the screen is static - in this case sometimes camera movement locks up

How do I repro that? Just keyboard movement? (I can't see the UI right now because it's compiling...)

Update: Compiling done. Steps to repro?

@iche033
Copy link
Contributor

iche033 commented May 3, 2021

How do I repro that? Just keyboard movement? (I can't see the UI right now because it's compiling...)

Here are the steps I did:

  1. Run ign gazebo as usual (paused)

    ign gazebo -v 4 shapes.sdf
    
  2. Press with left mouse and drag to pan the camera -> sometimes camera movement works sometimes it does not for me

  3. Press Play button and repeat camera movement -> camera movement always works for me

Here's a gif showing the issue. The first time I tried to move the camera when simulation is paused, it does not work. After hitting play, I did not have problem. I then paused the simulation again - this time the first mouse movement worked but subsequent mouse drags didn't.

mouse_qt_lock

@darksylinc
Copy link
Contributor Author

darksylinc commented May 8, 2021

What version of Ubuntu are you on?
And what's the output of qmake -query QT_VERSION and what Qt versions do you have in /opt/Qt?
uname -r? Kernel version ruled out.

I tried on these environments:

Ubuntu 18.04 LTS
qmake -query QT_VERSION Qt 5.9.5
/opt/Qt has Qt 5.12.2
5.11.0-051100-generic
AMD Ryzen 5900X

Results: Could not reproduce

Ubuntu 20.04 LTS
qmake -query QT_VERSION Qt 5.12.8
/opt/Qt not installed
Tried 5.8.0-50-generic and 5.11.0-051100-generic
Intel i7 7700

Results: Can reproduce, very easily

Update: No need to provide this info.

I nailed down to the problem that WaitForWorkerThread will make the main thread continue as normal if worker thread just called ReleaseQtThreadFromBlock but didn't call yet RequestQtThreadToBlock

This would normally be OK but due to how Qt run loop works, Qt will now think there is nothing else to update (until the worker thread sends more info, but the worker thread will get stuck in the next RequestQtThreadToBlock).
Qt gets unstuck when an external element (like highlighting a UI element or maximizing) forces a redraw

@darksylinc
Copy link
Contributor Author

darksylinc commented May 9, 2021

I'm tired.

A whole day of debugging. Re implemented the code. Deadlocks still appear again in different forms.

Impossible to repro in a standalone case. Only repros in the Intel machine (20.04), cannot repro at all in the AMD machine (18.04).

By now I'm starting to question if it's a HW or library bug.

A quick Google revealed there may be a glibc bug in condition variables. I've not confirmed this is our case and not just a bug in our code.

I'm posting the link for reference for myself for tomorrow.

@darksylinc
Copy link
Contributor Author

Ooof!!! Finally.

It was our own fault after all (i.e. not the alleged glibc bug).

The problem is that rendering changed between Qt versions and newer versions offload some work to a worker thread that was previously in the main thread. This causes potential deadlock scenarios that would not happen (or extremely rarely) in older Qt versions.

I've redone the synchronization code. It seems to work fine in both machines environments now.

It should be ready for review and hopefully merge.

This fix depends on a fix in ign-rendering module, because it
depends on the new Camera::SwapFromThread function
Without it, compilation will fail

Affects gazebosim/gz-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
This avoids us to break ign-rendering ABI while also simplifying the
amount of work to be done

Serializing work is easier to maintain and debug
Only CPU-bound scenario would potentially benefit from parallel command
generation (in terms of UI responsiveness)
Parallel command generation can be added back later

Also fixed coding style

Refer to
gazebosim/gz-rendering#304 (comment)
for discussion

Affects ign-rendering#304

Signed-off-by: Matias N. Goldberg <[email protected]>
Also fixes preexisting race condition when shutting down and improper
uninitialization of the worker thread

Signed-off-by: Matias N. Goldberg <[email protected]>
Fix coding style

Signed-off-by: Matias N. Goldberg <[email protected]>
It's a good thing we went for serializing rendering.

THe way Qt implements the double buffer scheme using signals & slots is
fundamentally flawed because it assumes the worker thread never needs to
synchronize (e.g. to invalidate FBOs if window resolution changes).

Trying to synchronize can easily cause deadlocks if Qt thread has
spurious updates which don't end up emitting TextureInUse, as the worker
thread is running slower than Qt thread.

A way to fix this could be to use a different synchronization mechanism
where the main thread increases a request counter and the worker thread
is constantly looping but only wakes up when that counter is > 0.

For now, this will do.

Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc darksylinc force-pushed the matias-fix-race-condition-main branch from fa34ea7 to c236828 Compare May 9, 2021 17:44
Copy link
Contributor

@chapulina chapulina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to use smart pointers to pass renderSync around? Right now RenderWindow instantiates the object and passes its pointer around to other classes. TextureNode even keeps a pointer as a member variable. But the ownership of that pointer is not documented. I think it's very possible that someone comes some months from now and adds some delete renderSyncs where they shouldn't.

I think that passing a shared pointer should make it explicit that the receiving classes shouldn't assume ownership and can safely release it when no longer needed.

src/gui/plugins/scene3d/Scene3D.cc Show resolved Hide resolved
@darksylinc
Copy link
Contributor Author

Regarding smart ptrs: Yes and no.

The relationships are the following

  • RenderWindowItemPrivate owns the pointer. It's not even new'ed. It's not a pointer. i.e. it was declared as RenderSync renderSync
  • TextureNode gets a pointer because it needs access to RenderSync. AFAIK theoretically TextureNode is owned by the RenderWindowItem. This could be changed to a reference instead of pointer so that it is never deleted.
  • TextureNode passes the pointer via emit TextureInUse signal to RenderThread::Render every time. I wanted to use references but Qt slots had trouble with them.

I think it's very possible that someone comes some months from now and adds some delete renderSyncs where they shouldn't.
I think that passing a shared pointer should make it explicit that the receiving classes shouldn't assume ownership and can safely release it when no longer needed.

I have a bigger concern about this (the 'no' part of my answer), which is that the RenderThread, TextureNode and RenderWindowItem are meant to work in tandem. If the pointers are somehow allowed to be different, then there are no threading safety guarantees because they'll be using different mutexes.

Turn some pointers into references

Signed-off-by: Matias N. Goldberg <[email protected]>
@iche033
Copy link
Contributor

iche033 commented May 10, 2021

Thanks for tracking that down! The deadlock issue is fixed for me. I'm actually on Ubuntu Bionic with Qt 5.9.5 (and Nvidia driver) but bug happened quite often.

The changes look good to me, including the use of reference for RenderSync

Copy link
Contributor

@chapulina chapulina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be changed to a reference instead of pointer so that it is never deleted.

Thanks for the changes, TextureNode holding a reference should make this more future-proof 👍


Thank you for iterating, this LGTM, although I can't reproduce the original issue.

I just spotted some missing docs.

src/gui/plugins/scene3d/Scene3D.hh Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.hh Outdated Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.hh Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.hh Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.hh Show resolved Hide resolved
src/gui/plugins/scene3d/Scene3D.cc Outdated Show resolved Hide resolved
Signed-off-by: Matias N. Goldberg <[email protected]>
I changed it because it didn't work, but it must've been old code
because it works now.

Signed-off-by: Matias N. Goldberg <[email protected]>
@darksylinc
Copy link
Contributor Author

OK all outstanding issues fixed. Should be (hopefully) ready for merge.

@iche033
Copy link
Contributor

iche033 commented May 27, 2021

@osrf-jenkins run tests please

@iche033
Copy link
Contributor

iche033 commented May 27, 2021

test failures look unrelated. latest changes look good to me. Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏢 edifice Ignition Edifice
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition between Qt and Ogre2 in presentation
3 participants