
Random effect of Intel JCC Errata on micro optimizations #2405

Closed
AlexGuteniev opened this issue Dec 11, 2021 · 5 comments
Labels
performance Must go faster

Comments

@AlexGuteniev
Contributor

AlexGuteniev commented Dec 11, 2021

I encountered an issue while optimizing vector_algorithm.cpp.

With otherwise good optimizations, I suddenly run into a bad branching pattern that is affected by the Intel JCC erratum. The impact can be severe enough to reverse the effect of the optimization, and just random NOPs can make the optimization great again, since the issue is triggered by bad alignment of branch instructions.

The compiler has the flag /QIntel-jcc-erratum. If I add it to CMakeLists.txt, the optimizations start behaving predictably.

I'm worried, though, that this flag may hurt other CPUs, specifically older AMD CPUs or Atom CPUs.
@fsb4000 confirmed that enabling /QIntel-jcc-erratum causes terrible performance on an AMD FX 8300.

I can try splitting vector_algorithm.cpp into two translation units, one compiled with /QIntel-jcc-erratum and the other without, so that each function branches into the /QIntel-jcc-erratum translation unit when running on an affected CPU. This seems to be the best option, but it slightly impacts binary size and introduces a lot of complexity.

What can we do here?

See the table in #2386 for an example of the results. Note that enabling /QIntel-jcc-erratum both makes the results better and avoids unpredictable variations!

@AlexGuteniev added the question label Dec 11, 2021
@StephanTLavavej
Member

We talked about this at our weekly maintainer meeting - there don't seem to be any great answers here. Someone needs to investigate:

  • Exactly how to detect CPUs that would benefit from vs. would be harmed by /QIntel-jcc-erratum (do we need to add a bit to the logic in vcstartup's cpu_disp.c that currently detects AVX2/AVX512/etc.?).
  • The resulting perf impacts and how significant the code complexity would be (in principle, having two separately compiled TUs, one with the compiler option and one without, isn't too complicated, although some build system work would be needed for both CMake and MSBuild)

We realize that asking for such investigation, before deciding whether to accept such changes, isn't great. 😹 😿

If it looks like there isn't a reasonably low-complexity way to get good optimization benefits without harming certain processors, then the path forward is probably just not to attempt such optimizations (i.e. nobody will complain if we just ship the classic loop and don't try to do magical things).

@StephanTLavavej added the info needed and performance labels and removed the question label Dec 15, 2021
@AlexGuteniev
Contributor Author

AlexGuteniev commented Dec 16, 2021

Exactly how to detect CPUs that would benefit from vs. would be harmed by /QIntel-jcc-erratum?

Intel's document has a definitive list at the end.

The detection works by identifying Intel via the GenuineIntel vendor string (see the example in the __cpuid docs), then, if the CPU is Intel, matching against the list by querying Model, Family, and Stepping.
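
A minimal sketch of such a check, assuming the Model/Family/Stepping values are filled in from Intel's document; the two models below are an illustrative subset, not the full list:

#include <intrin.h>
#include <cstring>

bool is_jcc_erratum_affected_cpu() {
    int regs[4];
    __cpuid(regs, 0); // leaf 0: vendor string in EBX, EDX, ECX
    char vendor[13] = {};
    std::memcpy(vendor, &regs[1], 4);
    std::memcpy(vendor + 4, &regs[3], 4);
    std::memcpy(vendor + 8, &regs[2], 4);
    if (std::strcmp(vendor, "GenuineIntel") != 0) {
        return false; // not Intel, not affected
    }
    __cpuid(regs, 1); // leaf 1: Family/Model/Stepping in EAX
    const int family = (regs[0] >> 8) & 0xF;
    int model        = (regs[0] >> 4) & 0xF;
    if (family == 0x6 || family == 0xF) {
        model |= ((regs[0] >> 16) & 0xF) << 4; // extended model bits
    }
    const int stepping = regs[0] & 0xF;
    // Illustrative subset: 06_8E/06_9E (Kaby Lake / Coffee Lake client);
    // a real check must cover every entry in Intel's document.
    return family == 0x6 && (model == 0x8E || model == 0x9E) && stepping >= 9;
}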

Any CPU that does not benefit can safely be assumed to be harmed, but the harm may vary from negligible to severe. (Technically, the affected CPUs are also harmed, but for them the harm is negligible compared to the positive effect.)

This has to be done at startup, as the detection itself is expensive (several calls to cpuid). Detecting lazily with an atomic could be considered, though:

#include <atomic>

enum class jcc_state { not_detected, negative, positive };
std::atomic<jcc_state> global_state{jcc_state::not_detected};

auto some_vector_algorithm(auto params)
{
    auto state = global_state.load(std::memory_order_relaxed);
    if (state == jcc_state::not_detected) {
        state = detect_and_save(); // runs cpuid and stores the result in global_state
    }
    if (state == jcc_state::positive) {
        return tail_call_to_other_tu_impl(params); // implementation built with /QIntel-jcc-erratum
    }
    // ... default implementation for unaffected CPUs
}

The resulting perf impacts?

Varies.

The JCC Erratum Mitigation in MSVC blog post claims that the benefit is 3% and that the impact on unaffected CPUs is negligible.

However:

  • I see a 20% benefit for a particular case on an affected processor (i7-8750H).
  • @cbezault observed a 20% penalty for a particular case on a 3600X.
  • In a pathological case (when reverse copy is not vectorized), @fsb4000 reported an order-of-magnitude penalty on an AMD FX 8300.

The expectation (see also the SO question):

  • It indeed helps the affected CPUs, without significant negative impact on them
  • No significant impact on mainstream Intel CPUs
  • Significant impact on older AMD CPUs in certain cases (depending on the number of prefixes)
  • Significant impact on non-mainstream Intel CPUs (Atom)

How significant would the code complexity be?

I propose the following (see the sketch after this list):

  • When building the TU with /QIntel-jcc-erratum, some macro is defined
  • The entry point of each vector algorithm is macroized: with the JCC macro, some prefix is added to its name
  • When building without the JCC macro, each vector algorithm begins with a check for an affected CPU; if the check succeeds, it tail-calls into the other TU
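
A minimal sketch of what that macroization could look like; _BUILDING_JCC_TU, _VEC_ENTRY, _Is_jcc_affected_cpu, and __std_some_algorithm are hypothetical names for illustration, not actual STL identifiers:

#ifdef _BUILDING_JCC_TU // defined only when compiling with /QIntel-jcc-erratum
#define _VEC_ENTRY(name) name##_jcc // entry points get a distinguishing suffix
#else
#define _VEC_ENTRY(name) name
#endif

extern "C" void __std_some_algorithm_jcc(void* first, void* last); // defined in the JCC TU

extern "C" void _VEC_ENTRY(__std_some_algorithm)(void* first, void* last) {
#ifndef _BUILDING_JCC_TU
    if (_Is_jcc_affected_cpu()) { // cached result of the startup detection
        return __std_some_algorithm_jcc(first, last); // tail-call into the mitigated TU
    }
#endif
    // ... shared implementation, compiled twice from the same .cpp
}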

So the resulting code complexity:

  • Detection of affected CPUs
  • Macro magic for the entry points that are selected for differentiation
  • Build system changes to produce two translation units with different options from some of the .cpp files

When would it help?

Most of the STL's performance-critical code is templated and lives in headers, and it doesn't use explicit vectorization, so it is hard to optimize this way. We depend on the user to compile it with or without /QIntel-jcc-erratum.

Only separately compiled STL code may benefit. I see the following affected code:

@AlexGuteniev
Contributor Author

There's also a connection between this issue and the /Os flag.

Observed with /Os:

  • Jump tables are not generated
  • Loops are not padded to a 16-byte boundary

Both result in a greater likelihood of badly aligned JCC instructions.

Example (https://godbolt.org/z/Mv6P9EWc5):

void loop(int i, char a[], char b[])
{
    char* stop = a + i;
    while (a != stop){
        *b++ = *a++;
    }
}

void jump_table(int i, char a[], char b[])
{
    switch (i)
    {
                            case 7: 
            a[6] = b[6];    case 6: 
            a[5] = b[5];    case 5: 
            a[4] = b[4];    case 4: 
            a[3] = b[3];    case 3: 
            a[2] = b[2];    case 2: 
            a[1] = b[1];    case 1: 
            a[0] = b[0];    case 0:  break;
            default: __assume(false);
    }
}

@AlexGuteniev
Contributor Author

TL;DR: I think this issue should be closed without any further action.

I experimented with this option and the minmax benchmark after #4401.

I added /QIntel-jcc-erratum to the _implib_objects compile options.

I know I have an affected CPU, but the option still makes the benchmark show slightly worse figures.

My results
------------------------------------------------------------------------------------------------
Benchmark                              Old Time         Old  CPU     New Time          New CPU   
------------------------------------------------------------------------------------------------
bm<uint8_t, 8021, Op::Min>              358 ns          357 ns       394 ns          396 ns     
bm<uint8_t, 8021, Op::Max>              364 ns          366 ns       357 ns          352 ns     
bm<uint8_t, 8021, Op::Both>             563 ns          558 ns       557 ns          544 ns     
bm<uint8_t, 8021, Op::Min_val>          278 ns          279 ns       279 ns          276 ns     
bm<uint8_t, 8021, Op::Max_val>          267 ns          267 ns       305 ns          301 ns     
bm<uint8_t, 8021, Op::Both_val>       20525 ns        20403 ns     20501 ns        20403 ns     
bm<uint16_t, 8021, Op::Min>             783 ns          785 ns       816 ns          820 ns     
bm<uint16_t, 8021, Op::Max>             830 ns          837 ns       838 ns          837 ns     
bm<uint16_t, 8021, Op::Both>           1036 ns         1025 ns      1070 ns         1074 ns     
bm<uint16_t, 8021, Op::Min_val>         320 ns          322 ns       330 ns          330 ns     
bm<uint16_t, 8021, Op::Max_val>         319 ns          322 ns       324 ns          321 ns     
bm<uint16_t, 8021, Op::Both_val>      13400 ns        12451 ns     11935 ns        11998 ns     
bm<uint32_t, 8021, Op::Min>            1295 ns         1287 ns      1403 ns         1413 ns     
bm<uint32_t, 8021, Op::Max>            1296 ns         1311 ns      1377 ns         1395 ns     
bm<uint32_t, 8021, Op::Both>           2103 ns         2131 ns      2050 ns         2040 ns     
bm<uint32_t, 8021, Op::Min_val>        1045 ns         1025 ns      1066 ns         1074 ns     
bm<uint32_t, 8021, Op::Max_val>         614 ns          614 ns       634 ns          628 ns     
bm<uint32_t, 8021, Op::Both_val>        653 ns          642 ns       650 ns          642 ns     
bm<uint64_t, 8021, Op::Min>            4600 ns         4604 ns      4616 ns         4653 ns     
bm<uint64_t, 8021, Op::Max>            4706 ns         4649 ns      4850 ns         4918 ns     
bm<uint64_t, 8021, Op::Both>           4955 ns         5000 ns      5009 ns         5000 ns     
bm<uint64_t, 8021, Op::Min_val>        4047 ns         3990 ns      4066 ns         4081 ns     
bm<uint64_t, 8021, Op::Max_val>        4059 ns         4081 ns      4070 ns         4081 ns     
bm<uint64_t, 8021, Op::Both_val>       4163 ns         4143 ns      4080 ns         4143 ns     
bm<int8_t, 8021, Op::Min>               353 ns          353 ns       370 ns          369 ns     
bm<int8_t, 8021, Op::Max>               357 ns          353 ns       379 ns          384 ns     
bm<int8_t, 8021, Op::Both>              582 ns          594 ns       611 ns          614 ns     
bm<int8_t, 8021, Op::Min_val>           277 ns          273 ns       290 ns          289 ns     
bm<int8_t, 8021, Op::Max_val>           282 ns          285 ns       271 ns          270 ns     
bm<int8_t, 8021, Op::Both_val>        20115 ns        20089 ns     20230 ns        20403 ns     
bm<int16_t, 8021, Op::Min>              788 ns          767 ns       785 ns          795 ns     
bm<int16_t, 8021, Op::Max>              816 ns          820 ns       779 ns          785 ns     
bm<int16_t, 8021, Op::Both>            1042 ns         1046 ns      1063 ns         1074 ns     
bm<int16_t, 8021, Op::Min_val>          542 ns          531 ns       544 ns          530 ns     
bm<int16_t, 8021, Op::Max_val>          537 ns          547 ns       534 ns          530 ns     
bm<int16_t, 8021, Op::Both_val>       13160 ns        13393 ns     12963 ns        12870 ns     
bm<int32_t, 8021, Op::Min>             1289 ns         1283 ns      1422 ns         1381 ns     
bm<int32_t, 8021, Op::Max>             1289 ns         1311 ns      1421 ns         1413 ns     
bm<int32_t, 8021, Op::Both>            2107 ns         2100 ns      2138 ns         2148 ns     
bm<int32_t, 8021, Op::Min_val>          622 ns          628 ns       626 ns          628 ns     
bm<int32_t, 8021, Op::Max_val>         1070 ns         1088 ns      1049 ns         1067 ns     
bm<int32_t, 8021, Op::Both_val>        1054 ns         1050 ns      1023 ns         1025 ns     
bm<int64_t, 8021, Op::Min>             4648 ns         4604 ns      4625 ns         4653 ns     
bm<int64_t, 8021, Op::Max>             4747 ns         4708 ns      4742 ns         4813 ns     
bm<int64_t, 8021, Op::Both>            4902 ns         5000 ns      4928 ns         5000 ns     
bm<int64_t, 8021, Op::Min_val>         4032 ns         3990 ns      4039 ns         4081 ns     
bm<int64_t, 8021, Op::Max_val>         4046 ns         4081 ns      4057 ns         4081 ns     
bm<int64_t, 8021, Op::Both_val>        4162 ns         4081 ns      4062 ns         4081 ns     
bm<float, 8021, Op::Min>               2425 ns         2400 ns      2403 ns         2407 ns     
bm<float, 8021, Op::Max>               2392 ns         2407 ns      2392 ns         2400 ns     
bm<float, 8021, Op::Both>              2555 ns         2550 ns      2826 ns         2825 ns     
bm<float, 8021, Op::Min_val>           1987 ns         2009 ns      1981 ns         1967 ns     
bm<float, 8021, Op::Max_val>           1998 ns         1995 ns      1991 ns         1967 ns     
bm<float, 8021, Op::Both_val>          2028 ns         2040 ns      2020 ns         1995 ns     
bm<double, 8021, Op::Min>              4354 ns         4395 ns      4317 ns         4297 ns     
bm<double, 8021, Op::Max>              4061 ns         4049 ns      4044 ns         4081 ns     
bm<double, 8021, Op::Both>             5167 ns         5162 ns      5136 ns         5162 ns     
bm<double, 8021, Op::Min_val>          4027 ns         3990 ns      4037 ns         4098 ns     
bm<double, 8021, Op::Max_val>          4043 ns         3990 ns      4028 ns         3990 ns     
bm<double, 8021, Op::Both_val>         4038 ns         3990 ns      4058 ns         3990 ns     

It looks like the option could be tuned for better results via officially unsupported variations, but that seems too time-consuming for a questionable gain.

I can't explain why this option isn't helpful. From the disassembly, I observe that it works correctly. Maybe the erratum's impact is smaller than the overhead of the mitigation, or another microcode update improved the situation.

@StephanTLavavej removed the info needed label Feb 22, 2024
@StephanTLavavej
Member

Sounds good to me. Thanks for looking into this.

@StephanTLavavej closed this as not planned Feb 22, 2024