Lyra2 performance paradox #225
Comments
I have a theory for this apparent paradox. The Lyra2 optimization was intended to reduce memory access and targeted algos that use a ... Correction: the following is incorrect as a blend instruction, not an insert, was used. Blend is fast. This theory has some issues. The gain with other algos was significant as well as the loss with ... I saw some similar behaviour when trying to reduce memory accesses in preparation for the ... The large differential makes it impossible to choose one over the other. A possible solution would be to support both versions and choose the appropriate one for ... More investigation is required first. Meanwhile I found a little more speed in Lyra2 for v3.11.3.
Things are getting weird. I'm trying to implement both functions where most algos can choose ... It doesn't work. Just the presence of the new code slows x25x. If I comment it out ... I'm starting to suspect GCC. Changes to a function that isn't executed shouldn't affect other functions, but it seems they do in this case. The only way that is possible is if the simple existence ... This points to the GCC optimizer.
I can clearly define the problem but have no solution. v1 is the code from v3.11.1 where x25x and x22i are faster and allium, lyra2z etc. are slower. Two interfaces are provided: the x25x and x22i interface uses the v1 code and other algos use v2. When both interfaces point to v1 code, x25x is faster and, as expected, allium is now slower. I tried changing the v1 interface, changing the order of function arguments, moving code around ... Simply put, the presence of the v2 function code and the presence of a call to it, even if not ... Now what?
Major developments, but first some background. The original issue is due to data divergence when hashing Lyra2 2-way parallel. Lyra2 parallel ... Another twist is that one lane may overlap with the out pointer. In such cases an intermediate ... This creates 3 levels of performance: unified, which is identical to linear hashing and is the fastest. The problem is with the midstream update in the overlap case. V1 uses __m256i pointer aliasing to selectively refresh only the overlapping lane. And the results were paradoxical. More to come.
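As a rough illustration of the cases described above, the sketch below classifies one step by comparing the row each lane reads with the row being written. The enum, the function name, and the label used for the case that is neither unified nor overlapping are assumptions made for this example; they are not taken from sponge-2way.c, and the real code may order its checks differently.

```c
#include <stdint.h>

/* Hypothetical labels for the three cases; not the names used in cpuminer. */
typedef enum {
    ROWS_UNIFIED,   /* both lanes read the same row, same as linear hashing */
    ROWS_DIVERGENT, /* lanes read different rows, neither is the output row */
    ROWS_OVERLAP    /* lanes differ and one of them is also the output row  */
} row_case_t;

/* Classify one step from the row index read by each lane and the row written. */
static row_case_t classify_rows(uint64_t row_in0, uint64_t row_in1,
                                uint64_t row_out)
{
    if (row_in0 == row_in1)
        return ROWS_UNIFIED;
    if (row_in0 == row_out || row_in1 == row_out)
        return ROWS_OVERLAP;
    return ROWS_DIVERGENT;
}
```

For instance, `classify_rows(3, 4, 4)` reports the overlap case because lane 1 reads the row that is about to be written, which is the situation where a midstream refresh of a single lane would be needed.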
GCC is starting to piss me off. It won't let me implement both versions. When I saw slow results on x25x using v1 I put a printf in the v2 path to confirm I was on ... Now I have to figure out how to outsmart GCC so it doesn't override my explicit code.
Putting a NOP just before the call to v1 seems to have done the trick to work around GCC's ... Now I have to clean up and get both working simultaneously.
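The thread does not show how the NOP was inserted. One plausible way to do it with GCC is an inline `asm volatile` statement, sketched below. `mix_v1` and `call_v1_behind_nop` are hypothetical stand-ins, not functions from cpuminer, and whether a NOP changes the optimizer's choices in any given build has to be measured rather than assumed.

```c
#include <stdint.h>

/* Hypothetical stand-in for the v1 routine; the arithmetic is arbitrary. */
static uint64_t mix_v1(uint64_t x)
{
    return x * 3u + 1u;
}

/* Emit a NOP immediately before the call. The volatile qualifier tells GCC
 * not to delete the asm statement even though it has no outputs. It does
 * not, by itself, guarantee any particular code layout around the call. */
static uint64_t call_v1_behind_nop(uint64_t x)
{
    __asm__ volatile ("nop");
    return mix_v1(x);
}
```

The result is unchanged by the NOP (`call_v1_behind_nop(2)` still returns 7); only the generated code around the call site can differ, which is why the effect has to be confirmed by benchmarking each affected algo.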
I now have a working build where both versions are present and selected appropriately.
Things are starting to settle down. The final implementation includes 2 copies of essentially the ... One version, preferred by x25x, uses 256 bit memory accesses, the other 512 bit memory accesses ... There are still 2 remaining questions. 1 Why does it have such a contradictory effect? The size of the change is not a surprise but ...
Meanwhile v3.11.3 is released with workarounds to address both questions.
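A minimal sketch of keeping two copies of a routine and choosing one per algo might look like the following. It uses plain `memcpy` in 32-byte and 64-byte chunks as a portable stand-in for 256 bit and 512 bit vector accesses. All names here are hypothetical, and the selection rule is only an assumption based on the thread's statement that x25x and x22i prefer the narrower version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*row_copy_fn)(uint8_t *dst, const uint8_t *src, size_t len);

/* Copy len bytes in fixed-size chunks; len is assumed to be a multiple of
 * chunk, so any trailing partial chunk is not copied. */
static void copy_in_chunks(uint8_t *dst, const uint8_t *src, size_t len,
                           size_t chunk)
{
    for (size_t off = 0; off + chunk <= len; off += chunk)
        memcpy(dst + off, src + off, chunk);
}

/* Stand-in for the version using 256 bit (32 byte) accesses. */
static void row_copy_256(uint8_t *dst, const uint8_t *src, size_t len)
{
    copy_in_chunks(dst, src, len, 32);
}

/* Stand-in for the version using 512 bit (64 byte) accesses. */
static void row_copy_512(uint8_t *dst, const uint8_t *src, size_t len)
{
    copy_in_chunks(dst, src, len, 64);
}

/* Pick a version by algo name: x25x and x22i get the narrow one,
 * everything else gets the wide one. */
static row_copy_fn select_row_copy(const char *algo)
{
    if (strcmp(algo, "x25x") == 0 || strcmp(algo, "x22i") == 0)
        return row_copy_256;
    return row_copy_512;
}
```

Selecting once at algo registration, for example `row_copy_fn copy = select_row_copy("allium");`, keeps the choice out of the hashing loop. As the thread notes, though, the mere presence of both versions in the binary affected performance, so a dispatch like this would still need to be benchmarked.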
CPU design effects are bizarre: https://youtu.be/ICKIMHCw--Y
I wouldn't call them bizarre but I am familiar with competition between the compiler and the CPU. I've seen this before. I discovered a bug in the branch predictor of a specific CPU model thanks ... I can't go to the level described in the video. Even though cpuminer only supports x86_64 there ... Some things like instruction reordering and data prefetching are hard to code because the ... Even vectorizing isn't an obvious win. If the application is I/O bound reducing the number ... Finally, I don't think the CPU is at fault here, I blame the compiler. The CPU will reorder instructions ...
Still pondering this issue. If I send a bug report to GCC I have to do a lot more work first. That won't be any time soon. I'm still monitoring the performance for any unexpected changes following some upcoming ...
I do not understand the problem. All that I remember from when I studied a little bit about CPUs from the x86 family is that most Intel CPUs like memory to be aligned.
Maybe the comment was for another weird issue where the compiler optimized beyond its guaranteed data alignment. This issue is not about alignment but another optimization quirk that affects performance when accessing scattered data using large vectors. I don't think I'll be spending any more time on it.
Changes to the AVX512 Lyra2 code in sponge-2way.c for v3.11.2 produced improvements of
between 6% for x21s and 47% for lyra2z. However, performance dropped 9% for x22i and
5% for x25x. It's easily reproducible.