Unfinished work #266

JayDDee · 2020-05-21T04:03:51Z

This issue is opened to document architectural changes that require changes to the scanhash
function of each algo. These changes may not have been propagated to all algos for various
reasons.

The reason for most of the changes is to streamline the code by reducing instructions.
Stale share reduction is the goal of one change, and the generic scanhash will reduce the
work of propagating other changes.

One change is defining a series of generic scanhash functions that can be used by multiple algos
to replace their individual custom scanhash functions for specific cases:
- one way linear hashing
- N way hashing for each N (4, 8, 16) and the format of the hash to be tested: 64 bit interleaved,
  32 bit interleaved, or de-interleaved.

The remaining scanhash changes are automatically implemented for algos that can use a
generic scanhash function.
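
As a rough illustration of the idea, the sketch below shows a one way generic scanhash where the nonce loop is shared and only the hash function differs per algo. The names `hash_fn`, `toy_hash` and `scanhash_generic` are hypothetical, the hash is a toy stand-in, and the target test is simplified to the most significant 32 bit word only. It assumes the usual 80 byte block header with the nonce in word 19.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-algo hash: 80 byte header in, 32 byte hash out.
   Returns 0 if the work was abandoned (e.g. thread restart), else 1. */
typedef int (*hash_fn)(void *out, const void *in, int thr_id);

/* Toy stand-in for a real algorithm, for illustration only. */
static int toy_hash(void *out, const void *in, int thr_id)
{
    (void)thr_id;
    const uint8_t *p = in;
    uint32_t h[8];
    uint32_t s = 2166136261u;
    for (int i = 0; i < 80; i++) s = (s ^ p[i]) * 16777619u;
    for (int i = 0; i < 8; i++) { s = s * 1664525u + 1013904223u; h[i] = s; }
    memcpy(out, h, 32);
    return 1;
}

/* Generic one way scanhash: the loop is shared, only hash() is algo specific.
   header is 20 x 32 bit words, the nonce is word 19.
   Returns the number of nonces that passed the simplified target test. */
static int scanhash_generic(hash_fn hash, uint32_t *header,
                            uint32_t target_hi, uint32_t first_nonce,
                            uint32_t max_nonce, int thr_id,
                            uint32_t *last_found)
{
    assert(hash != NULL && header != NULL && last_found != NULL);
    uint32_t hash32[8];
    int found = 0;
    for (uint32_t n = first_nonce; n < max_nonce; n++) {
        header[19] = n;
        if (!hash(hash32, header, thr_id)) break;   /* work abandoned */
        if (hash32[7] <= target_hi) {               /* simplified test */
            found++;            /* a real miner would submit here and continue */
            *last_found = n;
        }
    }
    return found;
}
```

An algo that fits this pattern would only need to supply its own `hash_fn`; the loop, nonce handling and share test are written once.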

- vectored byte swap and interleaving of input data, for various N ways, for 32 and 64 bit data.
- byte-swap the nonce only when necessary, when a valid share is found, instead of
  byte-swapping every nonce tested.
- implement the new hash test, including a pre-test before de-interleaving an N way hash.
- submit shares in the scanhash loop, then continue hashing, instead of returning to the main
  thread loop to submit shares.
- thread id argument added to the hash call to enable restart flag checking.
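
The pre-test and the deferred nonce byte swap can be sketched as follows for the 4x32 case. This is a scalar model under assumptions, not the project's code: the function names are hypothetical, the layout assumed is that word w of lane l sits at index `w * 4 + l`, and the pre-test is assumed to compare only the most significant hash word against the target.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4   /* 4x32: four lanes of 32 bit words */

/* Interleave four 8 word hashes: word w of lane l goes to vhash[w*LANES + l],
   so one 128 bit vector would hold word w of every lane. */
static void intrlv_hash_4x32(uint32_t *vhash, uint32_t hash[LANES][8])
{
    for (int w = 0; w < 8; w++)
        for (int l = 0; l < LANES; l++)
            vhash[w * LANES + l] = hash[l][w];
}

/* Pre-test on the still interleaved hash: only the top word of each lane
   is examined. Returns a bitmask of lanes worth de-interleaving. */
static unsigned pretest_4x32(const uint32_t *vhash, uint32_t target_hi)
{
    unsigned mask = 0;
    for (int l = 0; l < LANES; l++)
        if (vhash[7 * LANES + l] <= target_hi)
            mask |= 1u << l;
    return mask;
}

/* De-interleave a single lane, done only for lanes that passed the pre-test. */
static void extract_lane_4x32(uint32_t *out, const uint32_t *vhash, int lane)
{
    assert(lane >= 0 && lane < LANES);
    for (int w = 0; w < 8; w++)
        out[w] = vhash[w * LANES + lane];
}

/* Portable 32 bit byte swap, applied to the nonce only when a share is found. */
static uint32_t bswap_32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}
```

In a scanhash loop the common case is that `pretest_4x32` returns 0, so neither the lane extraction nor the byte swap runs for most nonces.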

There are also changes to the hash functions of each algo:

- use union overlay instead of struct for the context holder for algos that use a lot of contexts.
- implement midstate prehash when the first function uses a block size of 64 bytes or less.
- use full versions of chained hash functions instead of the 3 step init, update & close.
- write the final hash directly to the output buffer instead of using an intermediate buffer and
  memcpy.
- implement intermediate stale work detection for low hash rate algos to reduce stale shares.
- use rintrlv instead of 2 step dintrlv, intrlv when interleaved data needs to be interleaved in a
  different format.
- ensure the hash function returns a default 1 if thread restart checking is not used.
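
The union overlay can be illustrated with a small sketch. The context types below are hypothetical placeholders, not the real hash contexts. The overlay relies on the functions in a chain running one after another, so only one context is live at a time and they can share the same storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context types standing in for real hash contexts. */
typedef struct { uint64_t h[8];  uint8_t buf[128]; } ctx_a_t;
typedef struct { uint32_t h[8];  uint8_t buf[64];  } ctx_b_t;
typedef struct { uint64_t s[25]; }                   ctx_c_t;

/* Struct holder: every context has its own storage, sizes add up. */
struct chain_ctx_struct { ctx_a_t a; ctx_b_t b; ctx_c_t c; };

/* Union overlay: all contexts share storage, size is that of the largest. */
union chain_ctx_union { ctx_a_t a; ctx_b_t b; ctx_c_t c; };

/* Bytes saved per holder by using the overlay. */
static size_t overlay_savings(void)
{
    assert(sizeof(union chain_ctx_union) <= sizeof(struct chain_ctx_struct));
    return sizeof(struct chain_ctx_struct) - sizeof(union chain_ctx_union);
}
```

A context that must survive across calls, such as a saved midstate, cannot live in the overlay and would need separate storage.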

JayDDee · 2020-05-31T17:43:47Z

The implementation of a generic scanhash is complicated by n-way parallel hashing with
chained algorithms. Each function in the chain may use interleaved 64 bit words, interleaved
32 bit words, or no interleaving. The first and last functions may have different interleaving,
which must be handled differently by scanhash.

This results in up to 9 different generic scanhash functions to handle each situation for each
architecture. The full requirement is 22 individual scanhash functions. Calling them generic
may seem a bit ambitious, but they can be used by most of the chained algorithms and
still represent a significant reduction in code duplication.

SSE2: 4 way 32 bit words (4 cases)
AVX2: 8 way 32 bit words, 4 way 32 bit words, 4 way 64 bit words (9 cases)
AVX512: 16 way 32 bit words, 8 way 32 bit words, 8 way 64 bit words (9 cases)

Algorithms that perform a midstate prehash are not considered at this time. Support would
require a gate function for prehash, as each algo has its own custom prehash.

JayDDee · 2020-06-01T02:42:30Z

x17, xevan and sonoa algorithms are currently up to date with all mods, including generic
scanhash.

JayDDee · 2022-06-29T15:01:31Z

Allium & Lyra2Z AVX512 & AVX2 are up to date with 2 stage blake256 prehash optimization using linear SIMD for the first
stage and Nway parallel for the second.
X17 AVX512 & AVX2 have blake512 second stage prehash, first stage not possible.
Generic scanhash is not used with prehashing.

JayDDee · 2022-07-12T16:26:52Z

Many chained algorithms have redundant endian byte swaps that can be eliminated. Blake is often the first hash function in a
chain and it either performs a bswap32 (blake256) or bswap64 (blake512). Prior to calling blake a bswap32 is done on the
block header.

In the case of blake256 it's fully redundant and both can be eliminated. In the case of blake512 it results in a simple swapping
of 32 bits in each 64 bit word which also results in the nonce shifting.

An "LE" version of the blake transform functions is added to implement this optimization, as well as associated changes to
scanhash.
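
The arithmetic behind the redundancy can be checked with a short sketch. The helper names are hypothetical and the byte swaps are written portably rather than with intrinsics. Two bswap32 operations cancel, and a bswap64 applied after a bswap32 on each half leaves the two 32 bit halves of the 64 bit word exchanged.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t bswap_32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u) | (x << 24);
}

static uint64_t bswap_64(uint64_t x)
{
    return ((uint64_t)bswap_32((uint32_t)x) << 32)
         | bswap_32((uint32_t)(x >> 32));
}

/* bswap32 applied to each 32 bit half, as the header byte swap does. */
static uint64_t bswap_32x2(uint64_t x)
{
    return ((uint64_t)bswap_32((uint32_t)(x >> 32)) << 32)
         | bswap_32((uint32_t)x);
}

/* Exchange the two 32 bit halves of a 64 bit word. */
static uint64_t swap_halves(uint64_t x)
{
    return (x << 32) | (x >> 32);
}

/* Self-check of both cases for one input word. */
static void check_redundancy(uint64_t x)
{
    /* blake256 case: header bswap32 then blake's bswap32 is the identity. */
    assert(bswap_32(bswap_32((uint32_t)x)) == (uint32_t)x);
    /* blake512 case: header bswap32 then blake's bswap64 swaps the halves. */
    assert(bswap_64(bswap_32x2(x)) == swap_halves(x));
}
```

Because the halves trade places in the blake512 case, the 32 bit nonce ends up in the other half of its 64 bit word, which is why scanhash has to account for the shift.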

JayDDee · 2022-08-25T21:10:04Z

The blake family of core hash functions can be optimized with linear vectoring (one way). Blake256 & blake2s can use SSE2 while blake512 & blake2b can use SSE2 or AVX2. For practical reasons only blake256 and blake2b have been so optimized at this time.
With the exception of midstate prehashing, only possible with small blakes, parallel N-way is usually preferable.

Edit: blake2s is included in v3.21.3

EDIT: No, blake2s won't be included. Testing has shown a negative impact from prehashing blake2s using serial SIMD over parallel hashing. Other algos have not had this problem. blake2s was also slower with centralized prehash, serial and parallel, so that won't be implemented for blake2s either.

JayDDee · 2023-03-09T15:22:52Z

Another midstate optimization.

Centralize midstate prehash by doing it in stratum thread or when a miner thread returns from getwork and sharing the result with all miner threads. Previously each miner thread would do the prehash for itself.
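
A minimal sketch of the centralized approach is shown below, using a toy streaming hash (FNV-1a) in place of a real algorithm. All names are hypothetical, and thread synchronization around the shared midstate is deliberately omitted. It assumes an 80 byte header whose first 64 bytes are constant for a given work and whose nonce occupies the last 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy streaming hash context (FNV-1a), standing in for a real hash. */
typedef struct { uint64_t state; } toy_ctx;

static void toy_init(toy_ctx *c) { c->state = 14695981039346656037ull; }

static void toy_update(toy_ctx *c, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++)
        c->state = (c->state ^ p[i]) * 1099511628211ull;
}

/* Shared by all miner threads, written once per new work.
   A real miner would need locking or ordered publication here. */
static toy_ctx shared_midstate;
static int     midstate_valid = 0;

/* Called once when new work arrives: hash the constant first 64 bytes. */
static void prehash_work(const uint8_t *header)
{
    toy_init(&shared_midstate);
    toy_update(&shared_midstate, header, 64);
    midstate_valid = 1;
}

/* Called per nonce by each miner thread: copy the midstate and hash
   only the 16 byte tail containing the nonce. */
static uint64_t hash_with_midstate(const uint8_t *header, uint32_t nonce)
{
    assert(midstate_valid);
    uint8_t tail[16];
    memcpy(tail, header + 64, 12);
    memcpy(tail + 12, &nonce, 4);
    toy_ctx c = shared_midstate;
    toy_update(&c, tail, 16);
    return c.state;
}

/* Reference: hash the whole 80 byte header with no prehash. */
static uint64_t hash_full(const uint8_t *header, uint32_t nonce)
{
    uint8_t buf[80];
    memcpy(buf, header, 76);
    memcpy(buf + 76, &nonce, 4);
    toy_ctx c;
    toy_init(&c);
    toy_update(&c, buf, 80);
    return c.state;
}
```

Both paths produce the same hash; the difference is that the first 64 bytes are processed once per work rather than once per thread or once per nonce.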

JayDDee · 2023-03-09T15:26:31Z

Some old algos have been found not to have proper stats reporting when using an old CPU (#392). Some will be fixed in v3.21.3 but there may be more remaining. They will be fixed as discovered if they can be tested. Testing these algos is difficult, pun intended.

YetAnotherRussian · 2023-04-08T06:59:23Z

There's a good candidate to add (pufferfish2bmb) https://github.com/De-Crypted/dcrptd-miner/tree/master/Algorithms if there are any plans on adding new algos. I see some new (not really) sha algos in the latest release.

JayDDee · 2024-05-29T17:43:46Z

The use of Nway notation in hash functions is being changed to Nx64 or Nx32 where appropriate. This notation is already used for interleave functions.
This is needed for algos that have implementations using different data sizes. For example Hamsi can be implemented using Nx32 or Nx64. Cubehash can be implemented as pure parallel using Nx32 or a hybrid serial-parallel using Nx128.
Nx64 requires larger vectors, and therefore higher features, than Nx32.
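
The relationship between the notation and the required vector size is just lanes times element width. The sketch below uses hypothetical helper names and maps the result to the smallest x86 integer vector extension that can hold it.

```c
#include <assert.h>

enum feature { FEAT_SSE2, FEAT_AVX2, FEAT_AVX512, FEAT_NONE };

/* Vector width in bits needed for N lanes of the given element size. */
static int vector_bits(int lanes, int elem_bits)
{
    assert(lanes > 0 && elem_bits > 0);
    return lanes * elem_bits;
}

/* Smallest x86 integer vector extension that holds N x elem_bits. */
static enum feature min_feature(int lanes, int elem_bits)
{
    int bits = vector_bits(lanes, elem_bits);
    if (bits <= 128) return FEAT_SSE2;     /* 128 bit vectors */
    if (bits <= 256) return FEAT_AVX2;     /* 256 bit vectors */
    if (bits <= 512) return FEAT_AVX512;   /* 512 bit vectors */
    return FEAT_NONE;
}
```

For example, 4x32 fits in 128 bits while 4x64 needs 256 bits, so for the same N the Nx64 form needs the next feature level up.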
