Vulkan: Add explicit synchronization on frame boundaries #1290

goeiecool9999 · 2024-08-13T07:55:14Z

Context

Since #1166, Cemu no longer waits for vkAcquireNextImageKHR fences. The reasoning behind this was that one driver vendor reported that this fence could only be implemented with an operation equivalent to vkDeviceWaitIdle. However, this shouldn't happen, and good drivers do not have this problem. The removal of the fence wait was poorly motivated and has led to the regression described in issue #1239.

The gist of what caused the regression is that, on some drivers, the Latte thread no longer has any point where it explicitly synchronizes with the GPU (barring occlusion queries and texture readbacks, which not all games use). This is fine when the fps-limiting factor is Cemu's built-in fps limiter. However, if the limiting factor is the GPU or the display, the Latte thread can outpace them, which can lead to input lag. This went (mostly) unnoticed because most of the time the GPU and the display are not the limiting factors.

There are two scenarios where the CPU should wait for other things to catch up:

The GPU is too slow.
The display is slower than the fps limiter and FIFO present mode is used.

I'll talk about how this PR handles both.

Display-limited with FIFO present mode

Because Cemu's FPS limiter runs at 60*1.001 fps, it can exactly match any display refresh rate below that when using the FIFO present mode. But to do so, the Latte thread has to block somewhere to match the display refresh rate. So what ways of blocking does Vulkan provide?

AcquireNextImageKHR fences

One way to block is to use a fence from vkAcquireNextImageKHR to let the CPU wait until a swapchain image is released by the presentation engine. This is what Cemu used to do. The downside is that Vulkan doesn't specify which image will be acquired. In practice, with FIFO you often acquire the least recently used image, which could be many frames back. So if a driver has a minimum image count of n, Cemu can queue up to n images before blocking. That means there will always be about as many frames of input lag as there are swapchain images (this could be off by one, I'm not sure).
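To put a number on the off-by-one question, here is a toy model of the situation. It is plain C++, not Vulkan code, and not how Cemu implements anything. It assumes the CPU renders instantly, only blocks when no swapchain image is free, and that the display shows exactly one queued frame per vsync. The name `SimulateFifoLatency` and its parameters are made up for illustration.

```cpp
#include <deque>

// Toy model (assumption: CPU work is instant and the only blocking point is
// "no free swapchain image"). Returns how many vsync intervals the most
// recently shown frame spent between being queued and being displayed.
int SimulateFifoLatency(int imageCount, int vsyncs)
{
    std::deque<int> presentQueue; // for each queued frame: the interval in which it was queued
    int freeImages = imageCount;
    bool onScreen = false;        // is an image currently being scanned out?
    int lastLatency = 0;

    for (int v = 0; v < vsyncs; ++v)
    {
        // The CPU runs ahead and queues frames until it runs out of images.
        while (freeImages > 0)
        {
            --freeImages;
            presentQueue.push_back(v);
        }
        // Vsync at the end of interval v: the display flips to the oldest
        // queued frame and releases the image it was showing before.
        if (!presentQueue.empty())
        {
            lastLatency = (v + 1) - presentQueue.front();
            presentQueue.pop_front();
            if (onScreen)
                ++freeImages;
            onScreen = true;
        }
    }
    return lastLatency;
}
```

Under these assumptions, `SimulateFifoLatency(3, 20)` settles at 2 vsync intervals, that is n - 1, because one image is always on screen. Real drivers may hand out images in a different order, so treat this as a rough picture only.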

vkAcquireNextImageKHR with infinite timeout

What else? Well, the Vulkan spec says:

If timeout is UINT64_MAX, the timeout period is treated as infinite, and vkAcquireNextImageKHR will block until an image is acquired or an error occurs.

Hmmm. That sounds like we may not have to wait for a fence at all. We could just wait for vkAcquireNextImageKHR to return. Right?
No. In a note[1], the Vulkan spec warns against metering rendering speed to the presentation rate by relying on vkAcquireNextImageKHR to block.

Asynchronous presentation engines

These two statements seem contradictory. If vkAcquireNextImageKHR must block until an image is acquired, why is waiting for vkAcquireNextImageKHR discouraged?
Some clarity may be found in another note near the beginning of the WSI chapter:

The presentation engine may be synchronous or asynchronous with respect to the application and/or logical device.
Some implementations may use the device’s graphics queue or dedicated presentation hardware to perform presentation.

Speculating based on this, there seem to be two different kinds of drivers: one where the results and blocking duration of vkAcquireNextImageKHR depend on the state of the GPU, and another where the GPU state is entirely ignored. In the former, vkAcquireNextImageKHR would need to block because images have to be presented before they become available, limiting the thread to the refresh rate.
What would happen in the latter? Well, one user found out for themselves in #1239. On NVIDIA there was simply a little more input lag. On their Adreno driver, input lag kept increasing gradually, "seemingly without upper limits".
Since there is only a finite number of swapchain images and a swapchain image cannot be queued twice, that must mean that an image becoming available for acquisition, as in "vkAcquireNextImageKHR will block until an image is acquired", is not actually tied to vsync in any way. It just means that the driver has put the image presentation into an internal queue and allows the CPU to continue queuing more work on the same image.

So after removing the fence wait, the behaviour was identical to before on some systems, while on other systems there was big input lag. So let's just add the fence wait back, but move the wait to SwapBuffer(), which is conventionally the place to block. One driver vendor saying their implementation was slow in 2022 shouldn't stop us; they acknowledged it was really bad and have likely fixed it by now.
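The "seemingly without upper limits" behaviour is what you would expect if nothing ever blocks the producer. A minimal arithmetic sketch, assuming (as speculated above) that the driver keeps accepting presents into an internal queue of unlimited depth; `BacklogWithoutBlocking` is a hypothetical helper, not part of Cemu or Vulkan:

```cpp
#include <algorithm>
#include <cstdint>

// Number of frames sitting in the (assumed unbounded) driver queue after
// `elapsedUs` microseconds, when the CPU submits one frame every `cpuFrameUs`
// and the display consumes one frame every `displayFrameUs`.
int64_t BacklogWithoutBlocking(int64_t elapsedUs, int64_t cpuFrameUs, int64_t displayFrameUs)
{
    const int64_t submitted = elapsedUs / cpuFrameUs;
    // The display cannot show more frames than were submitted.
    const int64_t presented = std::min(submitted, elapsedUs / displayFrameUs);
    return submitted - presented;
}
```

For example, with a 16650 µs frame time on the CPU side (roughly the 60*1.001 limiter) and a 50 Hz display (20000 µs), the backlog is 10 frames after one second and 100 frames after ten seconds. Any explicit wait per frame, such as the fence wait, caps this growth.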

Present Wait

Like I said before, there's no way to limit the number of images that you queue with the core Vulkan API.
Is there any way to ensure low latency, even when there are a lot of swapchain images? (besides using different present modes)
Yes, there is: VK_KHR_present_wait[2].
While queuing, it allows you to give an image an ID and later use vkWaitForPresentKHR to, well... wait for the image to be presented.
We can use this to wait for the previous frame to be presented before queuing the current frame for presentation. That keeps the queue as shallow as it can theoretically be, which in theory makes the input lag even lower than it ever was with the double-buffered VSync option.
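The effect on queue depth can be sketched with another toy model. This is plain C++ rather than Vulkan code, with the same simplifying assumptions as before (instant CPU work, one frame shown per vsync), and `MaxQueuedFrames` is a made-up name. The `usePresentWait` flag stands in for waiting on the previous frame's present ID, which in real code would be a vkWaitForPresentKHR call.

```cpp
#include <algorithm>
#include <cstdint>

// Returns the largest number of frames that were ever waiting in the
// present queue at the same time.
int MaxQueuedFrames(int imageCount, int vsyncs, bool usePresentWait)
{
    int freeImages = imageCount;
    int queued = 0;               // frames queued but not yet shown
    bool onScreen = false;
    uint64_t nextPresentId = 1;   // ID that the next queued frame will get
    uint64_t lastPresentedId = 0; // highest ID the display has shown so far
    int maxQueued = 0;

    for (int v = 0; v < vsyncs; ++v)
    {
        while (freeImages > 0)
        {
            // Present wait: frame N is only queued once frame N-1 was shown.
            if (usePresentWait && lastPresentedId + 1 < nextPresentId)
                break;
            --freeImages;
            ++queued;
            ++nextPresentId;
            maxQueued = std::max(maxQueued, queued);
        }
        // Vsync: show the oldest queued frame and release the previous image.
        if (queued > 0)
        {
            --queued;
            ++lastPresentedId;
            if (onScreen)
                ++freeImages;
            onScreen = true;
        }
    }
    return maxQueued;
}
```

In this model `MaxQueuedFrames(3, 10, false)` reaches 3, while `MaxQueuedFrames(3, 10, true)` never exceeds 1, regardless of how many swapchain images exist.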

GPU limited

It's quite simple to prevent the CPU from outpacing the GPU. Simply keep track of which command buffer ID contains the last command for a swapchain image and wait for it in swapbuffers. If the GPU is fast enough, the thread never has to wait.
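A minimal sketch of that bookkeeping, following the commit message "track last cmdbuffer of frame and wait for it in swapbuffer of the next frame". All names here (`FakeGpu`, `FrameBoundarySync`, `WaitForCommandBuffer`) are hypothetical stand-ins and do not reflect Cemu's actual classes; the fake GPU simply records whether a wait would have stalled.

```cpp
#include <cstdint>

// Stand-in for the GPU side. A real renderer would block on a fence or a
// similar primitive here instead of just bumping a counter.
struct FakeGpu
{
    uint64_t finishedCmdBufferId = 0; // highest command buffer ID the GPU has completed
    int blockingWaits = 0;            // how often the CPU actually had to stall

    void WaitForCommandBuffer(uint64_t id)
    {
        if (finishedCmdBufferId >= id)
            return;               // GPU is already past this point, no stall
        ++blockingWaits;
        finishedCmdBufferId = id; // pretend we blocked until the GPU caught up
    }
};

class FrameBoundarySync
{
public:
    // Called whenever a command buffer is submitted; returns its ID.
    uint64_t SubmitCommandBuffer() { return ++m_lastSubmittedId; }

    // Called once per frame: wait for the previous frame's last command
    // buffer, then remember where the current frame ends.
    void SwapBuffers(FakeGpu& gpu)
    {
        if (m_lastIdOfPreviousFrame != 0)
            gpu.WaitForCommandBuffer(m_lastIdOfPreviousFrame);
        m_lastIdOfPreviousFrame = m_lastSubmittedId;
    }

private:
    uint64_t m_lastSubmittedId = 0;
    uint64_t m_lastIdOfPreviousFrame = 0;
};
```

Waiting on the previous frame rather than the current one lets the CPU stay at most one frame ahead, so the GPU still has queued work while the CPU is blocked.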

Notes

[1] vkAcquireNextImageKHR note

Applications should not rely on vkAcquireNextImageKHR blocking in order to meter their rendering speed. The implementation may return from this function immediately regardless of how many presentation requests are queued, and regardless of when queued presentation requests will complete relative to the call. Instead, applications can use fence to meter their frame generation work to match the presentation rate.

[2] Present Wait section of Window System Integration chapter

Applications wanting to control the pacing of the application by monitoring when presentation processes have completed to limit the number of outstanding images queued for presentation, need to have a method of being signaled during the presentation process.
Providing a mechanism which allows applications to block, waiting for a specific step of the presentation process to complete allows them to control the amount of outstanding work (and hence the potential lag in responding to user input or changes in the rendering environment).

Fixes #1239


goeiecool9999 added 8 commits October 25, 2023 09:11

initial implementation

ecb7108

fix implementation

d05b000

Merge branch 'refs/heads/main' into present_wait

6d9f64f

fix renamed variable

f969894

Merge branch 'refs/heads/main' into present_wait

b9775b2

move WaitForPresent to the proper moment and remove debug logging

f9c03d5

change debug logging to something more user-friendly

31f1df7

simplify if statements

c641234

goeiecool9999 changed the title ~~Vulkan: use present_wait to explicitly limit CPU run-ahead for FIFO present mode~~ Vulkan: use present_wait to limit CPU run-ahead for FIFO present mode Aug 13, 2024

goeiecool9999 mentioned this pull request Aug 13, 2024

Latency regression in v2.0-77 #1239

Closed

wait _after_ queuing next event so the present queue is less likely to run dry.

4692adc

goeiecool9999 marked this pull request as draft August 13, 2024 09:32


goeiecool9999 marked this pull request as ready for review August 13, 2024 10:08


goeiecool9999 changed the title ~~Vulkan: use present_wait to limit CPU run-ahead for FIFO present mode~~ Vulkan: use present_wait to limit present queue for FIFO present mode Aug 13, 2024

goeiecool9999 added 3 commits August 13, 2024 15:54

name variable according to style guide

c795002

change the code to be more semantically correct

4e6589f

Increment on all cases that aren't an error.

be1dedd

goeiecool9999 force-pushed the present_wait branch from 128f664 to be1dedd Compare August 14, 2024 15:07

goeiecool9999 marked this pull request as draft August 14, 2024 15:12

goeiecool9999 added 2 commits August 14, 2024 17:29

Actually fix it

32c57f6

I've made up my mind. This is right.

cbfa722

goeiecool9999 force-pushed the present_wait branch from 3bcbb0a to cbfa722 Compare August 14, 2024 15:45

goeiecool9999 marked this pull request as ready for review August 14, 2024 16:00

goeiecool9999 marked this pull request as draft August 14, 2024 16:08

goeiecool9999 added 2 commits August 16, 2024 23:11

to prevent CPU from outpacing GPU, track last cmdbuffer of frame and wait for it in swapbuffer of the next frame

1e56ed4

request different number of swapchain images depending on present mode

0f73502

goeiecool9999 force-pushed the present_wait branch from 74d0ddd to 05bc893 Compare August 16, 2024 22:46

submit first so the GPU already has work by the time the CPU resumes

e171fc8

goeiecool9999 force-pushed the present_wait branch from 05bc893 to e171fc8 Compare August 16, 2024 22:48

the return of the fence

53ea539

goeiecool9999 marked this pull request as ready for review August 17, 2024 01:06

goeiecool9999 marked this pull request as draft August 17, 2024 08:26

Revert "request different number of swapchain images depending on present mode" This reverts commit 0f73502.

426844b

goeiecool9999 marked this pull request as ready for review September 15, 2024 18:06

goeiecool9999 changed the title ~~Vulkan: use present_wait to limit present queue for FIFO present mode~~ Vulkan: add explicit synchronization on frame boundaries. Sep 15, 2024

goeiecool9999 changed the title ~~Vulkan: add explicit synchronization on frame boundaries.~~ Vulkan: add explicit synchronization on frame boundaries Sep 15, 2024

goeiecool9999 changed the title ~~Vulkan: add explicit synchronization on frame boundaries~~ Vulkan: Add explicit synchronization on frame boundaries Sep 15, 2024

goeiecool9999 merged commit a05bdb1 into cemu-project:main Sep 15, 2024
5 checks passed
