More fps limiter tweaks #1728
Conversation
```c
}
else
{
    I_Sleep(0); // yield
```
Do we need I_Sleep(0)? On Windows, SDL will call Sleep(0), which makes some sense, but I don't think it's portable.
Yes, on Windows it reduces CPU usage for me and gives a nice latency improvement (which directly yielding the thread via e.g. SwitchToThread does not provide). Do you know if there is a POSIX counterpart to Sleep(0)?

At least this should be no worse than the existing implementation, which has many calls to Sleep(0) (when the remaining time is between 1 ms and 2 ms).
I think there is a pthread_yield, but I'd rather avoid that complication. How about I_SleepUS(100) instead of I_Sleep(0), or something similar?
It's too coarse. In my tests, I_SleepUS(100) often sleeps for 1 ms, whereas I_Sleep(0) sleeps for only 10-30 us or so.
Okay, maybe it makes sense to add an I_ThreadYield function: with POSIX it's sched_yield, on Windows it's just Sleep(0). How do you test this?
At the moment I am only able to test on Windows, so I was hoping someone else would try Linux and let me know if I broke anything. The only new parts are the lack of a pure busy-wait at the end and the change to I_SleepUS (both of which improved latency and reduced CPU usage for me, at the cost of slightly higher frame time variance).
> try linux and let me know if I broke anything

I only ran a quick test with mangohud --dlsym woof, but the frametimes were identical between the master branch and this PR. OpenGL render driver, 9900K, RTX 3080, Arch.
This option (included in /O1 and /O2 by default) leads to difficult-to-diagnose performance issues on older machines and makes the linker and optimizer behavior very unpredictable, while providing no performance benefit in return. This change increases binary size by 2-3% but otherwise has no negative impact.

This is consistent with tools like RTSS and helps the fps counter match the limiter setting more often at high fps.
```cmake
# Disable function-level linking
_checked_add_compile_option(/Gy-)
endif()
```
Adding this compiler flag robustly fixes the latency gremlins I've been seeing (see discussion in #1712)
Do you mind if I push a commit to your branch so that you can test the results? Namely reverting this "fix" to test the theory here.
I should have mentioned that I tested that earlier, and it does fix the problem; no #1712 required. It also fixes the joy_enable weirdness.
Interesting, thanks for going through that exercise. I'll let the others comment on the actual fix you're proposing.
> Adding this compiler flag robustly fixes the latency gremlins I've been seeing (see discussion in #1712)

Is this for a specific version of MSVC?
I don't find it that hard to believe. It drastically alters the layout of the binary: different optimizer and linker behavior, etc. -ffunction-sections is apparently the gcc equivalent.

Maybe, but I don't see any difference in performance on my machines. I would rather use -O2 than touch those options.
From this answer, I don't see why we should disable it: https://stackoverflow.com/questions/1834597/what-is-the-comdat-section-used-for

I found it in ZDoom; the commit is from 2008, without much commentary: ZDoom/gzdoom@fb50df2#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR17
In that case you should add -ffunction-sections -Wl,--gc-sections to the gcc/clang builds. Those flags are disabled by default, so this change just brings MSVC behavior into line with the other compilers.
I think the MSVC/GCC developers have their own reasons for setting these defaults. I have a feeling that your latency issues are an SDL/Windows problem; maybe the Woof changes affect this for some reason, but the true problem is not here.
The MSVC defaults are designed for large C++ projects. The COMDAT settings make more sense in that context than they do for us here.
Is this from CapFrameX or something similar? I see two graphs here. The first is raw frametimes, which look about the same. The second is a quantile function of that data, showing a small improvement in the 99.9% frametimes, where microstutters can be annoying (sometimes called 0.1% frametimes). But the change is so small it looks like typical run-to-run measurement error.

Would you mind comparing this PR (with the latest master changes merged in) against a baseline without your recent improvements? Meaning without this PR and without the two changes here, using the framerate limiter at some desired value. That should show a nice improvement on your system, since it will reflect the complete set of improvements so far. It would also be nice to see MSVC profiling results.
@ceski-1 I can check, but I wouldn't expect those other two changes to impact frame times, so I would probably just be reproducing the graph above (the input latency issues are unrelated to framerate).
I mainly added the CapFrameX graph to show there weren't significant regressions in frame pacing with these changes; if there were no real changes in frame times I would be happy, because I'm mainly hunting latency here. Speaking of latency, I am pretty latency sensitive, and with these changes playing Woof is now what I would describe as telepathic.

So I'm very happy with this now, and I promise not to spam any more anti-latency PRs (unless there are regressions 🙂). I appreciate your patience and willingness to humor me on this stuff despite not being able to repro locally. I mainly write these PRs for my own benefit, but hopefully someone else out there with an old machine will see improvements as well.
The only way to reduce input latency on the Woof side is to reduce frame time. And we are only talking about player view rotation; all other inputs are only registered 35 times per second, a Doom engine limitation that we have to keep for gameplay/demo compatibility. But you say that your changes don't affect frame time even on your machine, so how is it possible that they affect input latency? I think the problem is on the SDL/Windows side.
@rfomin The input latency is about getting the latest mouse events from the Windows message pump; it has nothing to do with frame rates. I can detect it just as easily at 60 fps as I can at 500.

To test this hypothesis, I tried reverting #1505. That PR increases the priority of the main Woof thread, which I would expect to further starve input processing of the cycles it needs to do its job, and that's exactly what it appears to do: reverting #1505 lowers latency even further for me, on top of the changes in this branch. I think this is one of those unintended side effects that leads people to recommend not messing with thread priority (or at least not increasing it) 🙂.
My point is that there is nothing we can do about the Windows message pump. All we can do is call

I agree that we shouldn't mess with thread priority without a verifiable result. That change did nothing for me either.
We can allow it the resources to do its job by not thrashing the CPU.
No, that won't help. I think you missed my point: SDL is getting its events downstream from an asynchronous kernel process, which needs CPU resources independent of the Woof thread to fully do its job.
Do you suggest we rewrite the Windows kernel? 😋
Do you have a hyperthreaded CPU? That might make you immune to this problem, since multiple threads can run concurrently on the same core. You could maybe try turning this off and testing, although I think it requires messing with the BIOS, so you may not want to bother.
My suggestion is to try rewriting SDL. Do you really know how SDL receives events? If you improve this, it will help a lot of projects.
Not really (and I don't want to 🙂), but in the debugger I don't see any SDL event processing thread, only one for timers. So I'm pretty sure SDL_PumpEvents happens synchronously in the same thread as the caller. In that case it can't be responsible for this problem: it can only process whatever events are available to it at the time of the call. Even if it did have an asynchronous thread running in the background, we would be back to the problem of priority starvation, etc.
I think that is correct. So we can't do anything about it, right?
But how can changing compiler flags or
Not by modifying SDL, no. This is a kernel level thing.
Fewer cache misses and more opportunities for the OS to context switch into event processing, making more input data available for the later SDL_PumpEvents call, would be my guess. I don't know the microarchitectural details, but you'll have to take my word for it that it actually makes a difference, because I've seen it with my own eyes. It's very reproducible on my machine.
Does reverting this commit alone achieve anything for you? Because, honestly, I'm starting to get confused about which combination of changes is best, which is even better, and which is bestest, you know. 😉
Good question🙂. I tested this and the answer is a resounding "no". It's dependent on the changes in this branch.
I think the general rule of thumb here is that more changes = better. No one improvement supersedes another, and using the whole suite of changes provides the best result. The only exception is #1712, which is no longer strictly necessary with the addition of the
After discussion with @rfomin clarified the issue in my mind, I'm starting to think that maybe the material difference here is not old CPU vs. new CPU, but CPUs with hyperthreading vs. those without. Has anyone else happened to test on a CPU without hyperthreading (or with it disabled in the BIOS)?
My old 2012 laptop doesn't have hyperthreading, and there is no difference. Sorry, but I just don't believe that the
Like Fabian, I am a little bit lost at this point. I don't mind running some tests later with HT on/off, but which build(s) should be tested, and what is being measured?
The changes here help to prevent that scenario. Putting the frame limiter in its own (out-of-line) function makes its alignment independent of changes upstream in I_FinishUpdate, and
Right, those are the arguments from conservatism which I referred to in my comment. That's fine, but in the face of such conservatism, nothing interesting can be done here.
I believe I've answered that. However, you guys control the repository, so I suppose I'm at your mercy there.
For any optimisation work we need to measure results. Where is the conservatism here?
So will you do it?
Reached my limit (heh) on this one. Thanks everyone for the testing and review.
This PR was closed but I still wanted data, so I ran some benchmarks with CapFrameX. Here's a summary of the results:
Conclusion:
Relevant graphs only:
Re-opening this after cooling off a bit, and not wanting to close the door on fruitful investigation.
Thanks, this is interesting. It looks like the latency of
No, why would I? It provides real benefits to me, and no real detriments to anyone else. By the way, if you don't like premature optimization, you should like these changes, because they both remove optimizations. 🙂
These optimisations are made by compilers without our intervention and can be changed by compiler developers. If we add these changes to our code, we will increase complexity for no benefit.
Yes,

Does anyone know about the MSVC
I've never seen anyone use

gcc doesn't do the
@ceski-1 Do you use the 'prefer maximum performance' power management mode in your NVIDIA settings? I've noticed that Woof makes so little demand on the GPU that it can otherwise sometimes downclock itself, which can add 0.5+ ms to render submit according to Special K.

I added results from my Nehalem machine below. I actually prefer playing in the second scenario (normal priority) due to the reduced latency. High priority is somewhat subjectively smoother, but not as much as I would expect given the numbers here. The exception is that with YouTube playing (or even just sitting paused), normal priority has noticeably degraded performance.
This fixes the last little bit of latency flakiness I was seeing.
```diff
@@ -731,10 +731,16 @@ static void I_ResetTargetRefresh(void);
 #define I_CpuPause()
 #endif

+#ifdef _MSC_VER
+#pragma optimize("s", on) // Don't unroll the wait loop
```
Could this be the single most important change in this PR?
I had the same question, but no, this doesn't supersede the other changes in this PR. Testing the various permutations of the three de-optimizations, I still need all three.
Yes. For testing I start with the default nvcpl settings, then I change that, Vsync Off, and Low Latency Mode On. The latter doesn't matter though, since Doom is never GPU limited on this system, so the queue is skipped.

As an alternative to CapFrameX, which is slightly outdated, you should try a newer version of the underlying PresentMon tool (2.0+) directly, since it has a "Click To Photon Latency" metric in both the overlay and the console logging app. The name is misleading, as it doesn't include physical hardware latency, but supposedly it's a good measurement of total PC latency. So if there's an improvement with your changes, you'd see it in that number.

Edit: The name is really misleading, because you only need to move the mouse to get measurements. Clicking is useless, since it's tied to Doom's ticrate for demo compatibility, just like player movement, weapon switching, etc.
@ceski-1 Good find. There is also some good exploration here. His robustSleep function is essentially what we are doing here. It may be good to allow a couple of microseconds of tolerance, which I can add. SDL calls
The high resolution timer solution is interesting too.
I'm not sure either. See #1744 (comment).
Can you check if there is a measurable latency improvement on your system using PresentMon? Overlay: woof_bench_raw.json. Copy it to

You can also measure input-to-present latency like this: ceski-1@939f66f. Use the
Forcing vsync off in the driver may not be totally benign - doing this increases frame times by 0.2-0.3 ms for me, strangely enough (on two different driver revisions, including the most recent). I also noticed that turning off "Enable G-SYNC, G-SYNC compatible" under "Set up G-SYNC" results in a small but noticeable latency improvement for me when using the direct3d9 backend.
This one plays worse for me, unfortunately.
No differences. My click-to-photon averages around 2.6 ms with maxes in the 4-5 range, regardless of what I do. Even turning render batching on/off (which is a huge difference for me) has no apparent impact on the PresentMon numbers.
Thanks, it's cool to see these numbers in real time. I tend to hover around 1 ms at 2x render scale with a 500 fps cap. It's obviously better to be below 1 ms for this "race" from input to present, but none of the changes I tested have a substantive impact on this number (as you would expect).
@gendlin Okay, I have one more request. Please bear with me, the baselines have all shifted and I need to see the data to form an opinion. Test these commits:
For (2), pick your preferred

Additionally, please show me these CapFrameX results, configured as 60-second runs:
@ceski-1 Perceived latency, from best to worst:
master with the "normal priority" change added to it has the same latency as it does without. I recorded while playing through E1M1 (render batching disabled in all cases). I didn't include data for the

There are no significant differences between the CapFrameX reported latency numbers.
For each test case, what does the sensor data (the tab just below the frametimes graph) say for "average CPU max thread load (%)"?
They're all roughly the same: (61, 63, 57) for (master, before, after) respectively.
Also, master with the
Not much insight from the results, unfortunately. There's slightly lower CPU max thread load for "after tweaks"; it's small but consistent across all three systems (~5%). You also perceive input latency to be lower with "after tweaks", but all attempts to measure it have failed.

After reviewing SDL's input code, I have a theory that the changes in this PR seem to help due to SDL's poor raw input handling under certain conditions (8 kHz mice on modern systems, 1 kHz mice on older ones). SDL3 fixes this and the improvements are significant. Unfortunately, there are no plans to backport the fix to SDL2, so we wait or hope for a contributor to backport it.

My point is that there are SDL performance improvements on the horizon, so I'm not sure where to go with this PR. Attempt some profiling? Check the results of this change? I'm out of suggestions.
This branch modifies the frame limiter, improving latency, CPU usage, and accuracy at the cost of occasional higher frame time spikes (due to yielding the current thread). Given that someone using the in-game limiter is probably interested in the lowest possible latency, I think this is a worthwhile compromise. In combination with #1733, the changes in this branch provide great latency results for me. And c50582c fixes the flaky, unpredictable performance that I've been seeing in general.