
Lyra2 performance paradox #225

Open
JayDDee opened this issue Jan 8, 2020 · 13 comments


JayDDee commented Jan 8, 2020

Changes to the AVX512 Lyra2 code in sponge-2way.c for v3.11.2 produced improvements of
between 6% for x21s and 47% for lyra2z. However, performance dropped 9% for x22i and
5% for x25x. It's easily reproducible.


JayDDee commented Jan 9, 2020

I have a theory for this apparent paradox.

The Lyra2 optimization was intended to reduce memory access and targeted algos that use a
larger Lyra2 matrix. x25x uses a smaller matrix and should be expected to gain less, all else
being equal.

Correction: the following is incorrect, as a blend instruction, not an insert, was used. Blend is fast.
But inserting data into the high half of a vector is an expensive operation even though it is only
one instruction. If the data is readily available in the L1 cache it may be quicker to load from
memory than to do a vector insert.
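The distinction between the two operations can be sketched with a scalar model of a 512-bit vector as eight 64-bit lanes. The function names here are illustrative stand-ins, not the real Intel intrinsics:

```c
#include <stdint.h>

// Scalar model: a blend selects per lane under a mask bit, while an
// insert replaces an entire 256-bit half wholesale. Names are
// illustrative, not the actual _mm512_* intrinsics.
static void blend64x8(uint64_t dst[8], const uint64_t a[8],
                      const uint64_t b[8], uint8_t mask) {
    // Per-lane select: mask bit set -> take the lane from b, else from a.
    for (int i = 0; i < 8; i++)
        dst[i] = ((mask >> i) & 1) ? b[i] : a[i];
}

static void insert_hi256(uint64_t dst[8], const uint64_t src[4]) {
    // Replace the high 256-bit half (lanes 4..7) of dst with src.
    for (int i = 0; i < 4; i++)
        dst[4 + i] = src[i];
}
```

In hardware the blend is a cheap single-cycle select, while the insert crosses the 256-bit lane boundary, which is the costly part being discussed.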

This theory has some issues. The gain with other algos was significant, as was the loss with
x25x and x22i. The effective differential is larger than I would expect.

I saw some similar behaviour when trying to reduce memory accesses in preparation for the
increases associated with AVX512. Some efforts actually reduced performance when data
was already cached.

The large differential makes it impossible to choose one over the other.

A possible solution would be to support both versions and choose the appropriate one for
the selected algo.
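That per-algo selection could be as simple as a function-pointer table. This is a minimal sketch with hypothetical names and stub bodies; the real interface lives in sponge-2way.c:

```c
// Hypothetical dispatch between the two Lyra2 kernel variants.
typedef enum { ALGO_LYRA2Z, ALGO_ALLIUM, ALGO_X22I, ALGO_X25X } algo_t;
typedef int (*lyra2_kernel_t)(void);

// Stand-ins for the v3.11.1 and v3.11.2 implementations.
static int lyra2_v1(void) { return 1; } // 256-bit accesses, faster for x22i/x25x
static int lyra2_v2(void) { return 2; } // 512-bit masked accesses, faster elsewhere

static lyra2_kernel_t select_kernel(algo_t algo) {
    switch (algo) {
    case ALGO_X22I:
    case ALGO_X25X: return lyra2_v1;
    default:        return lyra2_v2;
    }
}
```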

More investigation is required first. Meanwhile I found a little more speed in Lyra2 for v3.11.3.


JayDDee commented Jan 10, 2020

Things are getting weird. I'm trying to implement both functions, where most algos can choose
the implementation from the current release and x25x can use the one from
the previous release.

It doesn't work. Just the presence of the new code slows x25x. If I comment it out
the performance returns; if I uncomment it, it slows again. Meanwhile x25x is running a different
function with the old code.

I'm starting to suspect GCC. Changes to a function that isn't executed shouldn't affect other
functions, but it seems they do in this case. The only way that is possible is if the simple existence
of the source code changed the compiled binary beyond the changed function.

This points to the GCC optimizer.


JayDDee commented Jan 10, 2020

I can clearly define the problem but have no solution.

v1 is the code from v3.11.1, where x25x and x22i are faster; allium, lyra2z, etc. are slower.
v2 is the code from v3.11.2, where x25x and x22i are slower; allium, lyra2z, etc. are faster.

Two interfaces are provided: x25x and x22i use the v1 code and the other algos use v2.
The result is that x25x is slower and allium is faster.

When both interfaces point to the v1 code x25x is faster and, as expected, allium is now slower.

I tried changing the v1 interface, changing the order of function arguments, and moving code
around without breaking functionality, but no luck.

Simply put, the presence of the v2 function code and of a call to it, even if not
executed by x25x, will cause x25x to hash slower. Removing either the function call or
commenting out the function body will restore the x25x hash rate.

Now what?


JayDDee commented Jan 10, 2020

Major developments, but first some background.

The original issue is due to data divergence when hashing Lyra2 2-way parallel. Parallel Lyra2
with AVX512 is hashed in two 256-bit lanes. Most of the time the data is contiguous, but in one
phase each lane uses data blocks from different rows. This requires gathering the two lanes into a
contiguous vector and then writing the data back to their respective locations, with performance
penalties.

Another twist is that one lane may overlap with the out pointer. In such cases an intermediate
refresh of the local data is required, also with a performance penalty.

This creates 3 levels of performance: unified, which is identical to linear hashing and is the fastest;
overlap, which requires the initial merge and final split plus an additional update in the loop;
and normal, which has just the initial merge and final split.

The problem is with the midstream update in the overlap case.
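The three cases above can be sketched as a simple classification, assuming each lane's source row and the output row are known by pointer (all names here are hypothetical):

```c
#include <stdint.h>

typedef enum { CASE_UNIFIED, CASE_OVERLAP, CASE_NORMAL } lane_case_t;

// Classify one 2-way parallel step: row0/row1 are the rows read by the
// two 256-bit lanes, out is the row being written.
static lane_case_t classify_step(const uint64_t *row0, const uint64_t *row1,
                                 const uint64_t *out) {
    if (row0 == row1)               // contiguous data, same as linear hashing
        return CASE_UNIFIED;
    if (out == row0 || out == row1) // one lane aliases the output row
        return CASE_OVERLAP;        // needs the midstream refresh
    return CASE_NORMAL;             // initial merge and final split only
}
```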

V1 uses __m256i pointer aliasing to selectively refresh only the overlapping lane.
V2 uses a 512-bit masked blend (_mm512_mask_blend_epi64) on the full vector to perform the
same task.
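In scalar terms (modelling the 512-bit state as eight 64-bit words, with lane 1 in words 4..7), the two refresh strategies are equivalent in effect; this is a sketch of the contrast, not the project's actual code:

```c
#include <stdint.h>

// v1-style refresh: treat the 512-bit state as two 256-bit halves and
// copy only the lane that overlaps the output.
static void refresh_v1(uint64_t state[8], const uint64_t fresh[8], int lane) {
    for (int i = 0; i < 4; i++)
        state[lane * 4 + i] = fresh[lane * 4 + i];
}

// v2-style refresh: a full-width masked blend, with the mask selecting
// the overlapping lane's words.
static void refresh_v2(uint64_t state[8], const uint64_t fresh[8], int lane) {
    uint8_t mask = (lane == 0) ? 0x0F : 0xF0;
    for (int i = 0; i < 8; i++)
        if ((mask >> i) & 1)
            state[i] = fresh[i];
}
```

Both produce identical results; the paradox is that their relative speed flips between algos.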

And the results were paradoxical.

More to come.


JayDDee commented Jan 10, 2020

GCC is starting to piss me off. It won't let me implement both versions.

When I saw slow results on x25x using v1 I put a printf in the v2 path to confirm I was on
the right path. The printf wasn't hit, but the hashrate rose. Once again a change in unexecuted
source code affected run time performance.

Now I have to figure out how to outsmart GCC so it doesn't override my explicit code.


JayDDee commented Jan 10, 2020

Putting a NOP just before the call to v1 seems to have done the trick to work around GCC's
optimizer and allow both versions to coexist.
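The NOP trick can be expressed as an inline-asm statement, which GCC must preserve and cannot reorder across, so it stops the optimizer from treating the two call sites as interchangeable. A sketch with a hypothetical stub callee; the placement in the real code may differ:

```c
// An asm statement with no output operands is implicitly volatile:
// GCC has to keep it, making it an opaque barrier at this point.
#define OPT_BARRIER() __asm__ __volatile__ ("nop" ::: "memory")

static int v1_stub(void) { return 7; } // stand-in for the v1 kernel

static int call_v1_guarded(int (*v1)(void)) {
    OPT_BARRIER(); // placed just before the v1 call, as described above
    return v1();
}
```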

Now I have to clean up and get both working simultaneously.


JayDDee commented Jan 10, 2020

I now have a working build where both versions are present and selected appropriately.
This was a lot more difficult than it should have been.


JayDDee commented Jan 11, 2020

Things are starting to settle down. The final implementation includes 2 copies of essentially the
same function. They differ only in how they manage non-contiguous data.

One version, preferred by x25x, uses 256-bit memory accesses; the other uses 512-bit memory
accesses with masking and blending. This version is faster for allium and most other Lyra2 algos.

There are still 2 remaining questions.

1. Why does it have such a contradictory effect? The size of the change is not a surprise but
   the contrast is. Some consistency is expected.

2. Why did GCC interfere with the source code? It was proven that making a change in one
   version had a run time effect on the other's performance. It is concluded GCC saw the 2
   versions as the same and optimized one out in favour of the other. This prevented supporting
   both versions simultaneously. There may be follow up.

Meanwhile v3.11.3 is released with workarounds to address both questions.

@Vid0Vid0

CPU design effects are bizarre: https://youtu.be/ICKIMHCw--Y


JayDDee commented Jan 11, 2020

I wouldn't call them bizarre but I am familiar with competition between the compiler and the CPU.
The compiler clearly misjudged the effect of the differences in the 2 functions even though they
were functionally identical.

I've seen this before. I discovered a bug in the branch predictor of a specific CPU model thanks
to a number of crashes that resulted from it. The obvious workaround was to disable branch
prediction, but management was worried about the loss in performance (customers were always
pushing things as far as possible). But, surprisingly, disabling branch prediction improved
performance by about 3%. It turned out a compiler optimization was conflicting with the CPU's
branch prediction, causing more misses than hits.

I can't go to the level described in the video. Even though cpuminer only supports x86_64 there
are still too many variations to tune the SW to the HW. I have to assume a generic machine
and focus only on those things that help generically, for example reducing the number of
instructions, reducing memory usage, reducing branches, etc.

Some things, like instruction reordering and data prefetching, are hard to code because the
compiler and CPU mess with both, so the resulting machine code is not as intended and could
perform worse. I am still learning this the hard way.

Even vectorizing isn't an obvious win. If the application is I/O bound, reducing the number
of instructions won't help but will actually hurt due to the CPU's lower clock rate for vector
instructions. Some vector instructions don't scale well and actually perform worse with larger
vectors than with smaller ones.

Finally, I don't think the CPU is at fault here; I blame the compiler. The CPU will reorder
instructions but doesn't substitute other instructions like the compiler does.


JayDDee commented Jan 25, 2020

Still pondering this issue.

If I send a bug report to GCC I have to do a lot more work first. That won't be any time soon.

I'm still monitoring the performance for any unexpected changes following some upcoming
updates.

@JayDDee JayDDee added the rant label Feb 10, 2020

sumariva commented Apr 4, 2024

I do not understand the problem. All I remember from when I studied a little bit about CPUs from
the x86 family is that most Intel CPUs like memory to be aligned.
It depends on the assembler instruction being used; some work faster on aligned data.
GCC, I guess, will not align memory.
But I am just a PHP programmer now, not a C developer.


JayDDee commented Apr 5, 2024

Maybe the comment was for another weird issue where the compiler optimized beyond its guaranteed data alignment.
Specifically, vector optimization replaced a loop with AVX2 code, which requires 256-bit alignment, but alignment was only guaranteed to 128 bits. Had the data been defined in source as __m256i it would have been properly aligned, but it was an integer array optimized by the compiler to __m256i, so I blame the optimizer for not properly aligning the data to the new type it chose to use.
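The alignment side of that story can be made explicit in source, so the array is safe whichever vector width the optimizer picks. A sketch with a hypothetical array name, not the project's actual declaration:

```c
#include <stdint.h>

// Declaring the array with the widest alignment the vectorizer might
// assume (64 bytes covers AVX-512, and therefore also the 32 bytes
// AVX2 needs) removes the mismatch: the compiler may still rewrite the
// loop with __m256i loads, but they are now legal.
static uint64_t hash_state[16] __attribute__((aligned(64)));
```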

This issue is not about alignment but another optimization quirk that affects performance when
accessing scattered data using large vectors. I don't think I'll be spending any more time on it.