Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<chrono>: Consider caching QueryPerformanceFrequency() #448

Closed
StephanTLavavej opened this issue Jan 24, 2020 · 9 comments · Fixed by #646
Closed

<chrono>: Consider caching QueryPerformanceFrequency() #448

StephanTLavavej opened this issue Jan 24, 2020 · 9 comments · Fixed by #646
Labels
fixed Something works now, yay! performance Must go faster

Comments

@StephanTLavavej
Copy link
Member

steady_clock::now() says:

STL/stl/inc/chrono

Lines 612 to 614 in a83d8c0

_NODISCARD static time_point now() noexcept { // get current time
const long long _Freq = _Query_perf_frequency(); // doesn't change after system boot
const long long _Ctr = _Query_perf_counter();

STL/stl/src/xtime.cpp

Lines 92 to 96 in a83d8c0

_CRTIMP2_PURE long long __cdecl _Query_perf_frequency() { // get frequency of performance counter
LARGE_INTEGER li;
QueryPerformanceFrequency(&li); // always succeeds
return li.QuadPart;
}

As the comment indicates, QueryPerformanceFrequency documentation says: "The frequency of the performance counter is fixed at system boot and is consistent across all processors. Therefore, the frequency need only be queried upon application initialization, and the result can be cached."

This code originally used magic statics, but TFS checkin 1586419 on March 16, 2016 removed that. My checkin notes claimed (emphasis added):

Remove intentional but unnecessary use of magic statics in steady_clock::now() calling _Query_perf_frequency(). Calling QPC is very fast (when I originally profiled now(), IIRC I could call it 6-7 times before the tick changed). Calling QPF will be comparably fast, so we may as well just do that instead of paying the magic statics cost. I've left the "system boot" comment, since it's helpful to know.

I had no evidence for this performance assumption and it was incorrect. In DevCom-505019, where this issue was originally reported, Damian Zwoliński noted that while QPC is indeed efficient on most platforms (aside: IIRC, on certain VMs it is expensive), QPF is not efficient.

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users), but we should investigate whether it's possible to restore caching without TLS and without breaking ABI. For example, we could have a static long long initialized to 0 (no magic) and use interlocked operations to cache QPF, since 0 is never a valid value for it.

@StephanTLavavej StephanTLavavej added enhancement Something can be improved performance Must go faster labels Jan 24, 2020
@jwtowner
Copy link

jwtowner commented Jan 24, 2020

Hi! Agree, with the use of a global long long, but no need to use interlocked operations or any form of hardware memory fences, since you don't need to synchronize access to any other state in relation to the variable. All you need to do is guarantee atomicity of the load and store operations. It's alright if multiple threads end up racing independently in an attempt to initialize it.

So, it should be sufficient to use the equivalent of only the load() and store() operations on an std::atomic<long long> with std::memory_order_relaxed ordering.

Edit: On 32-bit CPU architectures, the above may end up being implemented in terms of a CAS or LL/SC loop, but on 64-bit architectures, should expect it to be optimized to plain old load/store instructions so long as the architecture guarantees atomicity of such operations.

I believe .NET Core uses this same technique deep in the internals of System.Threading.SpinWait to record a measurement of the time that the x86 PAUSE instruction takes, since this can vary depending on the architecture (for example, Skylake's PAUSE takes much longer than previous Intel microarchitectures). It doesn't use interlocked operations, atomics, or locking since I believe the C# memory model for loads/stores on non-volatile variables of types bool, char, byte, sbyte, short, ushort, uint, int and float is basically the same as std::memory_order_relaxed.

@StephanTLavavej StephanTLavavej removed the enhancement Something can be improved label Feb 6, 2020
@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Mar 27, 2020

Other thing to consider for this area:

STL/stl/inc/chrono

Lines 616 to 618 in a83d8c0

const long long _Whole = (_Ctr / _Freq) * period::den;
const long long _Part = (_Ctr % _Freq) * period::den / _Freq;
return time_point(duration(_Whole + _Part));

There's _mul128 and _div128 for x64.

By replacing highlighted code with the following:

#ifdef _M_X64
            long long _High;
            long long _Rem_unused;
            long long _Low = _mul128(period::den, _Ctr, &_High);
            long long _Res = _div128(_High, _Low, _Freq, &_Rem_unused);
            return time_point(duration(_Res));
#else
          // old code
#endif

Can have one division instead of two.

But on my system QPC takes most of the time.
I'm proposing to consider this, since there are apparently systems where QPC is faster.

It also may be not worth of trouble bringing intrinsic functions to that header.
And digging it into .cpp is API change.

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Mar 27, 2020
@cbezault cbezault linked a pull request Mar 27, 2020 that will close this issue
4 tasks
AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Mar 29, 2020
@MikeGitb
Copy link

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users)

Maybe a bit off topic, but why do magic statics need thread local storage?

@BillyONeal
Copy link
Member

@MikeGitb The 'Magic Statics' algorithm uses a thread-local read to see 'did this thread already see that this value' to avoid synchronization if the current thread has already seen that the value is initialized.

@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label Apr 1, 2020
@AlexGuteniev
Copy link
Contributor

(If we started this off-topic) what are reasons to avoid TLS ? I recall there are issues with .NET, but this can be ruled out by preprocessor. There was dll in XP problem, but XP is no longer there. Some more exotic, like kernel mode drivers, executable packers / protectors, malware ? This STL is not suited for such scenarios anyway.

@BillyONeal
Copy link
Member

Adding a .TLS section to a DLL that previously didn't have one can push a program over the TLS slot limit, so it's a breaking change.

@AlexGuteniev
Copy link
Contributor

TLS slot limit? I though the algorithm that exists since Vista has no limits, it would reallocate TLS as many times as needed.

@AlexGuteniev
Copy link
Contributor

(Sure such TLS reallocation may be heap expensive for a program with many threads, and many DLLs already loaded, and may cause reaching heap limit on x86, but then any adding change can cause that)

@BillyONeal
Copy link
Member

@AlexGuteniev My understanding is the limit in Vista was changed from ~58 to ~1000ish but that there is still a limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants