<chrono>: Consider caching QueryPerformanceFrequency() #448

StephanTLavavej · 2020-01-24T01:38:41Z

steady_clock::now() says:

Lines 612 to 614 in a83d8c0

    _NODISCARD static time_point now() noexcept { // get current time
        const long long _Freq = _Query_perf_frequency(); // doesn't change after system boot
        const long long _Ctr  = _Query_perf_counter();

STL/stl/src/xtime.cpp

Lines 92 to 96 in a83d8c0

    _CRTIMP2_PURE long long __cdecl _Query_perf_frequency() { // get frequency of performance counter
        LARGE_INTEGER li;
        QueryPerformanceFrequency(&li); // always succeeds
        return li.QuadPart;
    }

As the comment indicates, QueryPerformanceFrequency documentation says: "The frequency of the performance counter is fixed at system boot and is consistent across all processors. Therefore, the frequency need only be queried upon application initialization, and the result can be cached."

This code originally used magic statics, but TFS checkin 1586419 on March 16, 2016 removed that. My checkin notes claimed (emphasis added):

Remove intentional but unnecessary use of magic statics in steady_clock::now() calling _Query_perf_frequency(). Calling QPC is very fast (when I originally profiled now(), IIRC I could call it 6-7 times before the tick changed). Calling QPF will be comparably fast, so we may as well just do that instead of paying the magic statics cost. I've left the "system boot" comment, since it's helpful to know.

I had no evidence for this performance assumption and it was incorrect. In DevCom-505019, where this issue was originally reported, Damian Zwoliński noted that while QPC is indeed efficient on most platforms (aside: IIRC, on certain VMs it is expensive), QPF is not efficient.

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users), but we should investigate whether it's possible to restore caching without TLS and without breaking ABI. For example, we could have a static long long initialized to 0 (no magic) and use interlocked operations to cache QPF, since 0 is never a valid value for it.

jwtowner · 2020-01-24T03:08:58Z

Hi! I agree with using a global long long, but there is no need for interlocked operations or any form of hardware memory fence, since you don't need to synchronize access to any other state in relation to the variable. All you need is to guarantee atomicity of the load and store operations. It's all right if multiple threads end up racing independently to initialize it.

So, it should be sufficient to use the equivalent of only the load() and store() operations on an std::atomic<long long> with std::memory_order_relaxed ordering.

Edit: On 32-bit CPU architectures, the above may end up being implemented in terms of a CAS or LL/SC loop, but on 64-bit architectures you should expect it to be optimized to plain old load/store instructions, so long as the architecture guarantees atomicity of such operations.
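The relaxed load/store approach described above can be sketched as follows. This is only an illustration under stated assumptions, not the code that was merged: `query_frequency_slow`, `cached_frequency`, and the 10 MHz value are hypothetical stand-ins, so that the sketch builds without the Windows headers.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for QueryPerformanceFrequency(). It counts calls so
// the effect of caching is observable. 10 MHz is an assumed example value.
static std::atomic<int> g_query_calls{0};

long long query_frequency_slow() {
    g_query_calls.fetch_add(1, std::memory_order_relaxed);
    return 10'000'000;
}

// 0 means "not cached yet", since 0 is never a valid frequency.
static std::atomic<long long> g_cached_freq{0};

long long cached_frequency() {
    long long freq = g_cached_freq.load(std::memory_order_relaxed);
    if (freq == 0) {
        // Several threads may reach this point at the same time. Each one
        // queries and stores the same value, so the race is benign.
        freq = query_frequency_slow();
        assert(freq != 0);
        g_cached_freq.store(freq, std::memory_order_relaxed);
    }
    return freq;
}
```

Relaxed ordering is enough here because no other data is published through the cached value; a reader either sees 0 and queries again, or sees the final frequency.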

I believe .NET Core uses this same technique deep in the internals of System.Threading.SpinWait to record a measurement of the time that the x86 PAUSE instruction takes, since this can vary depending on the architecture (for example, Skylake's PAUSE takes much longer than previous Intel microarchitectures). It doesn't use interlocked operations, atomics, or locking since I believe the C# memory model for loads/stores on non-volatile variables of types bool, char, byte, sbyte, short, ushort, uint, int and float is basically the same as std::memory_order_relaxed.

AlexGuteniev · 2020-03-27T08:55:55Z

Another thing to consider in this area:

STL/stl/inc/chrono

Lines 616 to 618 in a83d8c0

    const long long _Whole = (_Ctr / _Freq) * period::den;
    const long long _Part  = (_Ctr % _Freq) * period::den / _Freq;
    return time_point(duration(_Whole + _Part));

There are _mul128 and _div128 intrinsics for x64.

The highlighted code could be replaced with the following:

#ifdef _M_X64
            long long _High;
            long long _Rem_unused;
            long long _Low = _mul128(period::den, _Ctr, &_High);
            long long _Res = _div128(_High, _Low, _Freq, &_Rem_unused);
            return time_point(duration(_Res));
#else
            // old code
#endif

This needs one division instead of two.

But on my system QPC takes most of the time.
I'm proposing this for consideration, since there are apparently systems where QPC is faster.

It also may not be worth the trouble of bringing intrinsic functions into that header.
And moving it into a .cpp is an API change.
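To make the arithmetic concrete, here is a rough, portable sketch of the two conversions. The function names are hypothetical, the period is assumed to be nanoseconds, and the GCC/Clang `__int128` extension stands in for the `_mul128`/`_div128` pair, so this is not the header code itself. The split form exists because `_Ctr * period::den` does not fit in 64 bits for long: at an assumed 10 MHz counter it overflows after roughly 15 minutes of ticks.

```cpp
#include <cassert>

// Nanoseconds per second, playing the role of period::den (an assumption).
constexpr long long kDen = 1'000'000'000;

// Two divisions: split into whole seconds and the sub-second remainder so
// that ctr * kDen is never formed in 64 bits.
inline long long ticks_to_ns_split(long long ctr, long long freq) {
    assert(freq > 0);
    const long long whole = (ctr / freq) * kDen;
    const long long part  = (ctr % freq) * kDen / freq;
    return whole + part;
}

#if defined(__SIZEOF_INT128__)
// One division: widen the product to 128 bits, then divide once.
inline long long ticks_to_ns_wide(long long ctr, long long freq) {
    assert(freq > 0);
    return static_cast<long long>(static_cast<__int128>(ctr) * kDen / freq);
}
#endif
```

For non-negative inputs both forms compute the floor of `ctr * kDen / freq`, so they return the same value; for example, 25,000,000 ticks at 10 MHz is 2.5 seconds in either form.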

MikeGitb · 2020-03-29T07:53:04Z

We're avoiding magic statics for a reason (its use of Thread Local Storage is problematic for some users)

Maybe a bit off topic, but why do magic statics need thread local storage?

BillyONeal · 2020-04-01T00:40:20Z

@MikeGitb The 'Magic Statics' algorithm uses a thread-local read to check 'has this thread already seen this value initialized?', so it can skip synchronization when the current thread has already seen that the value is initialized.
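For readers unfamiliar with the term, "magic statics" are function-local statics with thread-safe initialization (guaranteed since C++11). A minimal sketch of what a magic-static cache looks like at the source level is shown below; `query_frequency_once` and the 10 MHz value are hypothetical stand-ins, and the thread-local check described above happens in compiler-generated code that is not visible here.

```cpp
#include <cassert>

// Hypothetical stand-in for the real frequency query; counts invocations.
static int g_init_calls = 0;

long long query_frequency_once() {
    ++g_init_calls;
    return 10'000'000; // assumed example value (10 MHz)
}

long long magic_static_frequency() {
    // Initialized exactly once, on the first call. The compiler inserts the
    // guard that makes concurrent first calls safe.
    static const long long freq = query_frequency_once();
    assert(freq > 0);
    return freq;
}
```

The source is simpler than a hand-written cache, which is why the cost discussed in this thread sits entirely in the hidden guard.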

AlexGuteniev · 2020-04-01T02:01:50Z

(Since we started this off-topic) What are the reasons to avoid TLS? I recall there are issues with .NET, but that can be ruled out by the preprocessor. There was the problem with DLLs on XP, but XP is no longer around. Something more exotic, like kernel-mode drivers, executable packers/protectors, or malware? This STL is not suited for such scenarios anyway.

BillyONeal · 2020-04-01T02:09:56Z

Adding a .TLS section to a DLL that previously didn't have one can push a program over the TLS slot limit, so it's a breaking change.

AlexGuteniev · 2020-04-01T02:22:38Z

TLS slot limit? I thought the algorithm that has existed since Vista has no limit; it would reallocate TLS as many times as needed.

AlexGuteniev · 2020-04-01T04:43:33Z

(Sure, such TLS reallocation may be heap-expensive for a program with many threads and many DLLs already loaded, and may cause it to reach the heap limit on x86, but then any change that adds something can cause that.)

BillyONeal · 2020-04-02T04:14:36Z

@AlexGuteniev My understanding is the limit in Vista was changed from ~58 to ~1000ish but that there is still a limit.

StephanTLavavej added the enhancement ("Something can be improved") and performance ("Must go faster") labels Jan 24, 2020

StephanTLavavej removed the enhancement ("Something can be improved") label Feb 6, 2020

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Mar 27, 2020

caching QueryPerformanceFrequency() microsoft#448

a456502

AlexGuteniev mentioned this issue Mar 27, 2020

<chrono>: Cache QueryPerformanceFrequency() #646

Merged

4 tasks

cbezault linked a pull request Mar 27, 2020 that will close this issue

<chrono>: Cache QueryPerformanceFrequency() #646

Merged

4 tasks

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Mar 29, 2020

More aggressive on microsoft#448

42c42bd

BillyONeal closed this as completed in #646 Apr 1, 2020

StephanTLavavej added the fixed ("Something works now, yay!") label Apr 1, 2020

AlexGuteniev mentioned this issue Apr 2, 2020

Reconsider magic static usage, as there should be no portability issues since XP dropping #673

Closed
