
More fps limiter tweaks #1728

Open
gendlin wants to merge 11 commits into master

Conversation

gendlin
Contributor

@gendlin gendlin commented Jun 6, 2024

This branch modifies the frame limiter, improving latency, CPU usage, and accuracy at the cost of occasional higher frame time spikes (due to yielding the current thread). Given that someone using the in-game limiter is probably interested in the lowest possible latency, I think this may be a worthwhile compromise. In combination with #1733 the changes in this branch provide great latency results for me. And c50582c fixes the flaky, unpredictable performance that I've been seeing in general.

Screenshot 2024-06-08 070832

@gendlin gendlin marked this pull request as draft June 6, 2024 10:07
}
else
{
I_Sleep(0); // yield
Collaborator

@rfomin rfomin Jun 6, 2024

Do we need I_Sleep(0)? On Windows, SDL will call Sleep(0) which makes some sense, but I don't think it's portable.

Contributor Author

@gendlin gendlin Jun 6, 2024

Yes, on Windows it reduces CPU usage for me and gives a nice latency improvement (which directly yielding the thread via e.g. SwitchToThread does not provide). Do you know if there is a POSIX counterpart to Sleep(0)?

At least this should be no worse than the existing implementation, which has many calls to Sleep(0) (when remaining time < 2ms and > 1ms).

Collaborator

I think there is a pthread_yield, but I'd rather avoid that complication. How about I_SleepUS(100) instead of I_Sleep(0) or something similar?

Contributor Author

It's too coarse: in my tests, I_SleepUS(100) often sleeps for 1 ms, whereas I_Sleep(0) sleeps for only 10-30 us or so.

Collaborator

@rfomin rfomin Jun 6, 2024

Okay, maybe it makes sense to add an I_ThreadYield function: with POSIX it's sched_yield, on Windows it's just Sleep(0). How do you test this?
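
Roughly, I'm imagining something like this untested sketch (the I_ThreadYield name and its placement are only a suggestion, not code from this PR):

```c
// Hypothetical I_ThreadYield sketch: give up the rest of the time slice
// without imposing a minimum sleep duration.
#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#endif

static void I_ThreadYield(void)
{
#ifdef _WIN32
    Sleep(0);      // Windows: relinquish the remainder of the time slice
#else
    sched_yield(); // POSIX: let another runnable thread be scheduled
#endif
}
```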

Contributor Author

At the moment I am only able to test on Windows, so I was hoping someone else would try linux and let me know if I broke anything. The only new parts are the lack of a pure busy-wait at the end and the change to I_SleepUS (both of which improved latency / reduced CPU usage for me, at the cost of slightly higher frametime variance).

Collaborator

try linux and let me know if I broke anything

I only ran a quick test with mangohud --dlsym woof but the frametimes were identical between the master branch and this PR. OpenGL render driver, 9900K, RTX3080, Arch.

src/i_video.c (outdated)
This option (included in /O1 and /O2 by default) leads to difficult-to-diagnose
performance issues on older machines, and makes the linker and optimizer
behavior very unpredictable while providing no performance benefits in return.
This change increases binary size by 2-3% but otherwise has no negative impact.

This is consistent with stuff like RTSS and helps the fps counter to
more often match the limiter setting at high fps.

# Disable function-level linking
_checked_add_compile_option(/Gy-)
endif()

Contributor Author

Adding this compiler flag robustly fixes the latency gremlins I've been seeing (see discussion in #1712)

Collaborator

Do you mind if I push a commit to your branch so that you can test the results? Namely reverting this "fix" to test the theory here.

Contributor Author

@gendlin gendlin Jun 8, 2024

I should have mentioned that I did test that earlier, and it does fix the problem - no #1712 required. It also fixes the joy_enable weirdness.

Collaborator

Interesting, thanks for going through that exercise. I'll let the others comment on the actual fix you're proposing.

Owner

Adding this compiler flag robustly fixes the latency gremlins I've been seeing (see discussion in #1712)

Is this for a specific version of MSVC?

Collaborator

I don't find it that hard to believe. It drastically alters the layout of the binary, different optimizer and linker behavior, etc. -ffunction-sections is apparently the gcc equivalent

Maybe, but I don't see any difference in performance on my machines. I would rather use -O2 than touch those options.

Collaborator

@rfomin rfomin Jun 8, 2024

From this answer, I don't see why we should disable it: https://stackoverflow.com/questions/1834597/what-is-the-comdat-section-used-for

I found it in ZDoom; the commit is from 2008 without much explanation: ZDoom/gzdoom@fb50df2#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR17

Contributor Author

In that case you should add '-ffunction-sections -Wl,--gc-sections' to gcc/clang builds. Those flags are disabled by default. This change is just moving MSVC behavior into line with the other compilers.

Collaborator

I think MSVC/GCC developers have their own reasons for setting these defaults. I have a feeling that your latency issues are an SDL/Windows problem; maybe the Woof changes affect this for some reason, but the true problem is not here.

Contributor Author

The MSVC defaults are designed for large C++ projects. The COMDAT settings make more sense in that context than they do for us here.

@gendlin gendlin changed the title More fps limiter tweaks (experimental) More fps limiter tweaks Jun 8, 2024
@gendlin gendlin marked this pull request as ready for review June 8, 2024 11:30
src/i_video.c
@ceski-1
Collaborator

ceski-1 commented Jun 9, 2024

Screenshot 2024-06-08 070832

Is this from CapFrameX or something similar? I see two graphs here. The first is raw frametimes which look about the same, and the second is a quantile function of that data, showing a small improvement in the 99.9% frametimes where microstutters can be annoying (sometimes called 0.1% frametimes). But the change is so small it looks like typical run-to-run measurement error.

Would you mind comparing this PR (merge in the latest master changes) against a baseline without your recent improvements? Meaning without this PR and without the two changes here, and using the framerate limiter at some desired value. That should show a nice improvement with your system, since it will reflect the complete set of improvements so far. It would also be nice to see msvc profiling results.

@gendlin
Contributor Author

gendlin commented Jun 10, 2024

@ceski-1 I can check, but I wouldn't expect those other two changes to impact frame times so I would probably just be reproducing the graph above (the input latency issues are unrelated to framerate).

looks like typical run-to-run measurement error

I mainly added the CapFrameX graph to show there weren't significant regressions in frame pacing with these changes - if there were no real changes in frame times I would be happy because I'm mainly hunting latency here.

Speaking of latency, with these changes playing Woof is now what I would describe as telepathic. I am pretty latency sensitive and I would say Woof (in large part thanks to raw_input 👍) achieves parity with a good hardware-accelerated Vulkan renderer on my machine, and it's significantly better than any other software renderers I've tried (although Odamex gets close).

So I'm very happy with this now and I promise not to spam any more anti-latency PRs (unless there are regressions🙂). I appreciate you guys' patience and willingness to humor me on this stuff, despite not being able to repro locally. I mainly write these PRs for my own benefit, but hopefully someone else out there with an old machine will see improvements as well.

@rfomin
Collaborator

rfomin commented Jun 10, 2024

The only way to reduce input latency on the Woof side is to reduce frame time. And we are only talking about player view rotation; all other inputs are only registered 35 times per second, which is a Doom engine limitation that we have to keep for gameplay/demo compatibility.

But you say that your changes don't affect frame time even on your machine, so how is it possible that they affect input latency? I think the problem is on the SDL/Windows side.

@gendlin
Contributor Author

gendlin commented Jun 10, 2024

@rfomin The input latency is about getting the latest mouse events from the Windows message pump; it has nothing to do with frame rates - I can detect it just as easily at 60 fps as at 500. That's why the Sleep(0) helps here: it allows other threads to run on the CPU, one of which is presumably tasked with filling in the data structures that store input events for the Woof process to consume later on.

To test this hypothesis, I tried reverting #1505. That PR increases the priority of the main Woof thread, which I would expect to further starve input processing of the cycles it needs to do its job, and that's exactly what it appears to do - reverting #1505 lowers latency even further for me, on top of the changes in this branch. I think this is one of those unintended side effects that leads people to recommend not messing with thread priority (or at least not increasing it)🙂.

@rfomin
Collaborator

rfomin commented Jun 10, 2024

The input latency is about getting the latest mouse events from the Windows message pump; it has nothing to do with frame rates - I can detect it just as easily at 60 fps as at 500.

My point is that there is nothing we can do about the Windows message pump. All we can do is call SDL_PumpEvents and register inputs more often, and for that we need to reduce the frame time. I think you're trying to solve input problems in the wrong place.

I think this is one of those unintended side effects that leads people to recommend not messing with thread priority (or at least not increasing it)🙂.

I agree that we shouldn't mess with thread priority without a verifiable result. That change did nothing for me either.

@gendlin
Contributor Author

gendlin commented Jun 10, 2024

My point is that there is nothing we can do about the Windows message pump.

We can allow it the resources to do its job by not thrashing the CPU.

All we can do is call SDL_PumpEvents and register inputs more often

No, that won't help. I think you missed my point that SDL is getting its events downstream from an asynchronous kernel process, which needs CPU resources independent of the Woof thread to fully do its job.

I think you're trying to solve input problems in the wrong place.

Do you suggest we rewrite the Windows kernel? 😋

That change did nothing for me either.

Do you have a hyperthreaded CPU? That might make you immune to this problem, since multiple threads can run concurrently on the same core. You could maybe try turning it off and testing, although I think that requires going into the BIOS, so you may not want to bother.

@rfomin
Collaborator

rfomin commented Jun 10, 2024

All we can do is call SDL_PumpEvents and register inputs more often

No, that won't help. I think you missed my point that SDL is getting its events downstream from an asynchronous kernel process, which needs CPU resources independent of the Woof thread to fully do its job.

I think you're trying to solve input problems in the wrong place.

Do you suggest we rewrite the Windows kernel? 😋

My suggestion is to try rewriting SDL. Do you really know how SDL receives events? If you improve this, it will help a lot of projects.

@gendlin
Contributor Author

gendlin commented Jun 10, 2024

Do you really know how SDL receives events?

Not really (and I don't want to🙂) but in the debugger I don't see any SDL event processing thread, only one for timers. So I'm pretty sure SDL_PumpEvents happens synchronously in the same thread as the caller. In that case it can't be responsible for this problem - it can only process all the events it has available to it at the time of the call. Even if it did have an asynchronous thread running in the background, we would be back to the problem of priority starvation, etc.
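
For illustration, the typical SDL2 polling pattern looks roughly like this (a sketch, not Woof's actual input code; SDL_PollEvent itself calls SDL_PumpEvents internally):

```c
// Sketch only: SDL2 event polling runs entirely in the calling thread,
// so it can only return events the OS has already delivered by this point.
#include <SDL.h>

static void PollInputOnce(void)
{
    SDL_PumpEvents(); // synchronously drains the OS/window message queue

    SDL_Event ev;
    while (SDL_PollEvent(&ev)) // returns whatever is queued right now
    {
        // translate ev into game input here
    }
}
```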

@rfomin
Collaborator

rfomin commented Jun 10, 2024

So I'm pretty sure SDL_PumpEvents happens synchronously in the same thread as the caller. In that case it can't be responsible for this problem - it can only process all the events it has available to it at the time of the call.

I think that is correct. So we can't do anything about it, right?

We can allow it the resources to do its job by not thrashing the CPU.

But how can changing compiler flags or NOINLINE fix it? Maybe we can improve it by changing the I_Sleep calls, but anything else shouldn't affect it.

@gendlin
Contributor Author

gendlin commented Jun 10, 2024

So we can't do anything about it, right?

Not by modifying SDL, no. This is a kernel level thing.

But how can changing compiler flags or NOINLINE fix it?

My guess would be fewer cache misses and more opportunities for the OS to context-switch into event processing, making more input data available later on for the SDL_PumpEvents call. I don't know the microarchitectural details, but you'll have to take my word for it that it actually makes a difference, because I've seen it with my own eyes. It's very reproducible on my machine.

@fabiangreffrath
Owner

reverting #1505 lowers latency even further for me, on top of the changes in this branch.

Does reverting this commit alone achieve anything for you? Because, honestly, I'm starting to get confused which combination of changes is best and which is even better and which is bestest, you know. 😉

@gendlin
Contributor Author

gendlin commented Jun 11, 2024

Does reverting this commit alone achieve anything for you?

Good question🙂. I tested this and the answer is a resounding "no". It's dependent on the changes in this branch.

I'm starting to get confused which combination of changes is best and which is even better and which is bestest, you know. 😉

I think the general rule of thumb here is that more changes = better. No one improvement supersedes another, and using the whole suite of changes provides the best result. The only exception is #1712, which is no longer strictly necessary with the addition of the /Gy- flag.

@gendlin
Contributor Author

gendlin commented Jun 11, 2024

After discussion with @rfomin clarified the issue in my mind, I'm starting to think that maybe the material difference here is not old CPU vs. new CPU, but CPUs with hyperthreading vs. those without. Has anyone else happened to test on a CPU without hyperthreading (or with it disabled in BIOS)?

@rfomin
Collaborator

rfomin commented Jun 11, 2024

Has anyone else happened to test on a CPU without hyperthreading (or with it disabled in BIOS)?

My old 2012 laptop doesn't have hyperthreading; there is no difference.

Sorry, but I just don't believe that the /Gy- flag or "cache misses" can have any effect on the input latency in this case. Perhaps we can improve the frame limiter somehow, for example by changing the I_Sleep function or calling the limiter after input processing instead of before. I'm sure compiler flags or cache optimisation have nothing to do with it.

@ceski-1
Collaborator

ceski-1 commented Jun 11, 2024

After discussion with @rfomin clarified the issue in my mind, I'm starting to think that maybe the material difference here is not old CPU vs. new CPU, but CPUs with hyperthreading vs. those without. Has anyone else happened to test on a CPU without hyperthreading (or with it disabled in BIOS)?

Like Fabian, I am a little bit lost at this point. I don't mind running some tests later with HT on/off, but what build(s) should be tested and what is being measured?

@gendlin
Contributor Author

gendlin commented Jun 13, 2024

But what if a future change to the code in, say, half a year leads to just another binary layout? Wouldn't the micro-optimizations that we apply to the code for the specific layout as of today become counterproductive then?

The changes here help to prevent that scenario. Putting the frame limiter in its own (out-of-line) function makes the alignment independent of changes upstream in I_FinishUpdate, and /Gy- makes layout and optimization much more predictable going forward. Both of these fix weird little gremlins that made performance brittle during my testing (#1712, the joy_enable thing, sensitivity to unrelated I_FinishUpdate changes, etc.).
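
The out-of-line idea boils down to something like this sketch (the macro spelling and function names here are assumptions, not necessarily how the tree defines them):

```c
// Sketch: keep the wait loop in a separate, never-inlined function so its
// code layout doesn't shift whenever I_FinishUpdate changes.
#ifdef _MSC_VER
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

static NOINLINE void LimitFrameRate(void)
{
    // the wait/yield loop lives here, isolated from the caller's optimization
}

void I_FinishUpdate(void)
{
    // ... present the frame ...
    LimitFrameRate();
}
```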

@gendlin
Contributor Author

gendlin commented Jun 13, 2024

My rationale against NOINLINE #1728 (comment)

I don't like the /Gy- flag because there is very little official documentation on it. I am against any optimisation without a measurable difference.

Right, those are the arguments from conservatism which I referred to in my comment. That's fine, but in the face of such conservatism, nothing interesting can be done here.

Could you remove /Gy- and NOINLINE from this PR?

I believe I've answered that. However, you guys control the repository so I suppose I'm at your mercy there.

@rfomin
Collaborator

rfomin commented Jun 13, 2024

Right, those are the arguments from conservatism which I referred to in my comment. That's fine, but in the face of such conservatism, nothing interesting can be done here.

For any optimisation work we need to measure results. Where is the conservatism here?

Could you remove /Gy- and NOINLINE from this PR?

I believe I've answered that. However, you guys control the repository so I suppose I'm at your mercy there.

So will you do it?

@gendlin
Contributor Author

gendlin commented Jun 13, 2024

Reached my limit (heh) on this one. Thanks everyone for the testing and review

@gendlin gendlin closed this Jun 13, 2024
@ceski-1
Collaborator

ceski-1 commented Jun 13, 2024

This PR was closed but I still wanted data, so I ran some benchmarks with CapFrameX. Here's a summary of the results:

                                      Desktop (i9-9900K)    Laptop (i5-2520M)
CPU usage and power                   Lower than master     Same as master
Frame pacing                          Worse than master     Same as master
Input lag approximation               Same as master        Same as master
Toggling fpslimit_busywait            No change             No change
Reverting "raised priority" commit    No change             Worse frame pacing

Conclusion:

  • The results vary greatly depending on CPU architecture.
  • For the two systems tested, frame pacing is either the same or worse, and CPU usage is either the same or lower.
  • The current framerate limiter in the master branch could be improved, but the results indicate that more research is needed.
  • The "raised priority" commit is beneficial for older computers.

Relevant graphs only:

No change in framerate for an older laptop: [graph 01]

Worse frame pacing for a desktop: [graph 01]

High CPU usage with framerate limiter from master branch: [graph 02]

Lower CPU usage with framerate limiter from this PR: [graph 03]

@gendlin
Contributor Author

gendlin commented Jun 14, 2024

Re-opening this after cooling off a bit, and not wanting to close the door on fruitful investigation.

I ran some benchmarks with CapFrameX. Here's a summary of the results:

Thanks, this is interesting. It looks like the latency of I_SleepUS is architecture dependent, and not really more accurate than I_Sleep in some cases. It may be best to just I_Sleep(1) when remaining time > 2ms and busy-wait the rest, when not in CPU-preserving I_Sleep(0) mode.
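
Concretely, the strategy I mean is roughly this (a sketch only; I_GetTimeUS, I_Sleep, and I_CpuPause are assumed helper names and signatures, not necessarily how they're spelled in the tree):

```c
#include <stdint.h>

// Assumed helpers for the sketch:
extern uint64_t I_GetTimeUS(void); // current time in microseconds
extern void I_Sleep(int ms);       // millisecond sleep
extern void I_CpuPause(void);      // pause/yield hint for spin loops

// Sketch: coarse 1 ms sleeps while there is comfortably more than 2 ms left,
// then a short busy-wait to hit the target time precisely.
static void WaitForFrameTarget(uint64_t target_us)
{
    while (I_GetTimeUS() + 2000 < target_us)
    {
        I_Sleep(1); // cheap on the CPU, but only ~1 ms granularity
    }

    while (I_GetTimeUS() < target_us)
    {
        I_CpuPause(); // spin the last stretch for accuracy
    }
}
```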

@gendlin gendlin reopened this Jun 14, 2024
@gendlin
Contributor Author

gendlin commented Jun 14, 2024

Could you remove /Gy- and NOINLINE from this PR?

I believe I've answered that. However, you guys control the repository so I suppose I'm at your mercy there.

So will you do it?

No, why would I? It provides real benefits to me, and no real detriments to anyone else. By the way, if you don't like premature optimization, you should like these changes because they both remove optimizations. 🙂

@rfomin
Collaborator

rfomin commented Jun 14, 2024

By the way, if you don't like premature optimization, you should like these changes because they both remove optimizations. 🙂

These optimisations are made by compilers without our intervention and can be changed by compiler developers. If we add these changes to our code, we will increase complexity for no benefit.

It looks like the latency of I_SleepUS is architecture dependent, and not really more accurate than I_Sleep in some cases. It may be best to just I_Sleep(1) when remaining time > 2ms and busy-wait the rest, when not in CPU-preserving I_Sleep(0) mode.

Yes, I_SleepUS is not precise. CREATE_WAITABLE_TIMER_HIGH_RESOLUTION was added in Windows 10, version 1803; see the implementation in SDL3: https://github.com/libsdl-org/SDL/blob/51902d4ac53ca7ef46baff38e75deec98321667d/src/timer/windows/SDL_systimer.c#L85 We have to support at least Windows 7.
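
For reference, the SDL3-style approach looks roughly like this on Windows (a sketch of the idea, not code proposed for this PR given the Windows 7 constraint; the fallback path is an assumption):

```c
// Sketch: microsecond sleep using a high-resolution waitable timer
// (flag requires Windows 10 1803+), falling back to plain Sleep() otherwise.
#include <windows.h>

#ifndef CREATE_WAITABLE_TIMER_HIGH_RESOLUTION
#define CREATE_WAITABLE_TIMER_HIGH_RESOLUTION 0x00000002
#endif

static void SleepUS_HighRes(LONGLONG usec)
{
    HANDLE timer = CreateWaitableTimerExW(NULL, NULL,
        CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);

    if (timer == NULL)
    {
        Sleep((DWORD)(usec / 1000)); // fallback on pre-1803 systems
        return;
    }

    LARGE_INTEGER due;
    due.QuadPart = -(usec * 10); // relative due time in 100 ns units

    SetWaitableTimer(timer, &due, 0, NULL, NULL, FALSE);
    WaitForSingleObject(timer, INFINITE);
    CloseHandle(timer);
}
```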

@fabiangreffrath
Owner

Does anyone know about the MSVC /OPT and /ORDER parameters that are mentioned in the /Gy switch documentation?

@gendlin
Contributor Author

gendlin commented Jun 14, 2024

Does anyone know about the MSVC /OPT and /ORDER parameters that are mentioned in the /Gy switch documentation?

I've never seen anyone use /ORDER, and it's not enabled by default. /OPT:REF and /OPT:ICF are enabled by default and act on COMDATs (the per-function binary sections created mainly by /Gy, which give the linker more leeway to do what it likes with function layout). Since /Gy- reduces the number of COMDATs, it reduces the impact of those two optimizations, which remove redundant/unreferenced COMDATs. This is why /Gy- slightly increases the size of the binary (by 2% or so).

gcc doesn't do the /Gy or /OPT equivalents unless you specifically ask it to, via -ffunction-sections and -Wl,--gc-sections, respectively.

@gendlin
Contributor Author

gendlin commented Jun 16, 2024

@ceski-1 Do you use 'prefer maximum performance' power management mode in your nvidia settings? I've noticed that Woof makes so little demand on the GPU that otherwise it can sometimes downclock itself, which can add 0.5+ms to render submit according to Special K.

I added results from my Nehalem machine below. I actually prefer playing in the second scenario (normal priority) due to the reduced latency. High priority is somewhat subjectively smoother, but not as much as I would expect given the numbers here. The exception is that with youtube playing (or even just sitting paused), normal priority has noticeably degraded performance.

fps-compare

This fixes the last little bit of latency flakiness I was seeing.
@@ -731,10 +731,16 @@ static void I_ResetTargetRefresh(void);
#define I_CpuPause()
#endif

#ifdef _MSC_VER
#pragma optimize("s", on) // Don't unroll the wait loop
Owner

Could this be the single most important change in this PR?

Contributor Author

I had the same question but no, this doesn't supersede the other changes in this PR. Testing the various permutations of the three de-optimizations, I still need all three.

@ceski-1
Collaborator

ceski-1 commented Jun 16, 2024

@ceski-1 Do you use 'prefer maximum performance' power management mode in your nvidia settings?

Yes. For testing I start with the default nvcpl settings, then change that setting, turn Vsync off, and set Low Latency Mode on. The latter doesn't matter though, since Doom is never GPU limited with this system, so the queue is skipped.

As an alternative to CapFrameX, which is slightly outdated, you should try a newer version of the underlying PresentMon tool (2.0+) directly since it has a "Click To Photon Latency" metric in both the overlay and console logging app. The name is misleading as it doesn't include physical hardware latency but supposedly it's a good measurement of total PC latency. So if there's an improvement with your changes, you'd see it in that number.

Edit: The name is really misleading because you only need to move the mouse to get measurements; clicking is useless, since it's tied to Doom's ticrate for demo compatibility, just like player movement, weapon switching, etc.

@ceski-1 ceski-1 mentioned this pull request Jun 16, 2024
@ceski-1
Collaborator

ceski-1 commented Jun 16, 2024

Here's a web page from 2002 that describes a framerate limiter implementation. Scroll to the bottom. Look familiar? It's the limiter from the master branch*, and then this PR adds in Sleep(0), making it essentially the same thing.

*Maybe also influenced by this and this.

@gendlin
Contributor Author

gendlin commented Jun 16, 2024

@ceski-1 Good find. There is also some good exploration here. His robustSleep function is essentially what we are doing here. It may be good to allow a couple microseconds tolerance, which I can add.

SDL calls timeBeginPeriod(1) for us. You can actually set it lower via an undocumented ntdll function (to 0.5 ms), but 1 ms seems to work ok. I'm not sure why your frame times are slightly worse on this branch. Does the game play noticeably worse for you? I'd prefer to keep the current I_SleepUS limiter loop if possible, as it plays much better on my machine (versus using I_Sleep(1) and/or only waiting when time remaining > 2 ms).
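
The tolerance I mentioned would be a small tweak to the wait loop, roughly like this sketch (the constant name is hypothetical and the helper signatures are assumed, as above):

```c
#include <stdint.h>

// Sketch: stop the timed waits a few microseconds early so an overshooting
// sleep doesn't push us past the target; the tiny remainder is absorbed by
// the present call or a short spin afterwards.
#define LIMITER_TOLERANCE_US 2 /* hypothetical constant, value to be tuned */

extern uint64_t I_GetTimeUS(void); // assumed helper names
extern void I_SleepUS(uint64_t us);

static void WaitUntilWithTolerance(uint64_t target_us)
{
    while (I_GetTimeUS() + LIMITER_TOLERANCE_US < target_us)
    {
        I_SleepUS(100); // or I_Sleep(0) in the yielding mode
    }
}
```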

@ceski-1
Collaborator

ceski-1 commented Jun 17, 2024

There is also some good exploration here. His robustSleep function is essentially what we are doing here. It may be good to allow a couple microseconds tolerance, which I can add.

The high resolution timer solution is interesting too.

SDL calls timeBeginPeriod(1) for us. You can actually set it lower via an undocumented ntdll function (to 0.5 ms), but 1 ms seems to work ok. I'm not sure why your frame times are slightly worse on this branch.

I'm not sure either. See #1744 (comment).

plays much better on my machine

Can you check if there is a measurable latency improvement on your system using PresentMon? Overlay: woof_bench_raw.json. Copy it to %USERPROFILE%\Documents\PresentMon\Loadouts. It looks like this:

[overlay screenshot]

You can also measure input-to-present latency like this: ceski-1@939f66f. Use the showfps cheat and then move the mouse. It looks like this:

[latency screenshot]

@gendlin
Contributor Author

gendlin commented Jun 18, 2024

Vsync Off

Forcing vsync off in the driver may not be totally benign - doing this increases frame times by 0.2-0.3 ms for me, strangely enough (on two different driver revisions, including the most recent). I also noticed that turning off "Enable G-SYNC, G-SYNC compatible" under "Set up G-SYNC" results in a small but noticeable latency improvement for me when using the direct3d9 backend.

The high resolution timer solution is interesting too.

This one plays worse for me, unfortunately.

Can you check if there is a measureable latency improvement on your system using PresentMon?

No differences. My click-to-photon averages around 2.6 ms with maxes in the 4-5 range, regardless of what I do. Even turning render batching on/off (which is a huge difference for me) has no apparent impact on the PresentMon numbers.

You can also measure input-to-present latency like this:

Thanks, it's cool to see these numbers in real time. I tend to hover around 1 ms at 2x render scale w/ 500 fps cap. It's obviously better to be below 1 ms for this "race" from input to present, but none of the changes I tested have a substantive impact on this number (as you would expect).

@ceski-1
Collaborator

ceski-1 commented Jun 18, 2024

@gendlin Okay, I have one more request. Please bear with me, the baselines have all shifted and I need to see the data to form an opinion.

Test these commits:

  1. "before fps-limiter-tweaks" gendlin@e4ec2ab
  2. "after fps-limiter-tweaks" gendlin@f421192
  3. "current master" https://github.com/fabiangreffrath/woof/commits/master/

For (2), pick your preferred cpu_priority value and make sure it's noted. Then provide your subjective impressions for input latency and frame pacing for each of the three commits.

Additionally, please show me these CapFrameX results, configured as 60-second runs:

  1. Bar chart comparing the three commits (default layout with average fps, 1% fps, and 0.2% fps).
  2. Frametime graph for each commit (every "additional graphs" option checked, manually set y-axis to 0-10ms).
  3. Input lag approximation, just the numbers are fine (lower bound, expected, upper bound).

@gendlin
Contributor Author

gendlin commented Jun 18, 2024

@ceski-1 Perceived latency, from best to worst:

  1. after tweaks (f421192) [cpu_priority=0]
  2. after tweaks [cpu_priority=1]
  3. master (80e7fa3) with pause re-added
  4. master
  5. before tweaks (e4ec2ab)

master with the "normal priority" change added to it has the same latency as it does without.

I recorded while playing through e1m1 (render batching disabled in all cases). I didn't include data for the cpu_priority=0 test because CapFrameX causes it to stutter too much. Otherwise, it feels about as smooth as the other tests, with better latency.

There are no significant differences between the CapFrameX reported latency numbers.

fps-test-compare

Individual graphs

fps-test-80e7fa3

fps-test-e4ec2ab

fps-test-f421192

@ceski-1
Collaborator

ceski-1 commented Jun 18, 2024

For each test case, what does the sensor data (the tab just below the frametimes graph) say for "average CPU max thread load(%)"?

@gendlin
Contributor Author

gendlin commented Jun 18, 2024

They're all roughly the same: (61, 63, 57) for (master, before, after) respectively.

@gendlin
Contributor Author

gendlin commented Jun 18, 2024

Also, master with the pause instruction restored has a latency in between master and "after tweaks [cpu_priority=1]". I'm not sure why that was removed.

@ceski-1
Collaborator

ceski-1 commented Jun 19, 2024

Not much insight from the results, unfortunately. There's slightly lower CPU max thread load for "after tweaks". It's small but consistent across all three systems (~5%). You also perceive input latency to be lower with "after tweaks" but all attempts to measure it have failed.

After reviewing SDL's input code, my theory is that the changes in this PR help because of SDL's poor raw input handling under certain conditions (8 kHz mice with modern systems, 1 kHz mice on older ones). SDL3 fixes this and the improvements are significant. Unfortunately, there are no plans to backport the fix to SDL2, so we wait or hope for a contributor to backport it.

My point is that there are SDL performance improvements on the horizon, so I'm not sure where to go with this PR. Attempt some profiling? Check the results of this change? I'm out of suggestions.
