Optimize reverse for 8 and 16-bit trivial types #2386

Closed
wants to merge 4 commits into from

@AlexGuteniev AlexGuteniev commented Dec 10, 2021

Speeds up the tail handling of reverse for 8-bit and 16-bit elements.
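To make "tail" concrete, here is a minimal sketch of a block-wise reverse, assuming a hypothetical block width `kBlock` and hypothetical helper names (`blockwise_reverse`, `matches_std_reverse`). It is not the code from this PR or from the STL's vectorized implementation; it only illustrates the general shape: mirrored blocks are swapped from both ends, and whatever is left in the middle (the tail) is handled by a scalar loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block width; a real vectorized implementation would derive it
// from the SIMD register size. Chosen here only for illustration.
constexpr std::size_t kBlock = 16;

// Reverses [first, first + n) in place.
// Main loop: swaps mirrored kBlock-sized blocks from both ends, reversing each.
// Tail: the fewer-than-2*kBlock elements left in the middle go through std::reverse.
inline void blockwise_reverse(std::uint8_t* first, std::size_t n) {
    assert(first != nullptr || n == 0);
    std::uint8_t* lo = first;
    std::uint8_t* hi = first + n;
    while (static_cast<std::size_t>(hi - lo) >= 2 * kBlock) {
        hi -= kBlock;
        std::array<std::uint8_t, kBlock> left;
        std::array<std::uint8_t, kBlock> right;
        std::copy(lo, lo + kBlock, left.begin());
        std::copy(hi, hi + kBlock, right.begin());
        // The reversed right block lands on the left side, and vice versa.
        std::reverse_copy(right.begin(), right.end(), lo);
        std::reverse_copy(left.begin(), left.end(), hi);
        lo += kBlock;
    }
    // Tail: this is the part whose cost matters most for short, random lengths.
    std::reverse(lo, hi);
}

// Checks the sketch against std::reverse semantics for a given length.
inline bool matches_std_reverse(std::size_t n) {
    std::vector<std::uint8_t> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<std::uint8_t>(i);
    }
    const std::vector<std::uint8_t> expected(data.rbegin(), data.rend());
    blockwise_reverse(data.data(), data.size());
    return data == expected;
}
```

With random lengths, as in the benchmark, a large share of calls spend most of their time in the tail rather than the main loop, which is presumably why the tail is worth measuring separately.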

The benchmark uses random lengths:

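#include <algorithm> // std::reverse, std::reverse_copy
#include <chrono>
#include <cstddef>   // std::size_t
#include <cstdint>
#include <iostream>
#include <iterator>  // std::begin
#include <random>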

constexpr std::size_t N = 2048;
constexpr std::size_t NS = 8192;
constexpr std::size_t R = 10'000;

alignas(64) std::uint8_t    a8[N];
alignas(64) std::uint16_t   a16[N / 2];
alignas(64) std::uint32_t   a32[N / 4];
alignas(64) std::uint64_t   a64[N / 8];

alignas(64) std::uint8_t    d8[N];
alignas(64) std::uint16_t   d16[N / 2];
alignas(64) std::uint32_t   d32[N / 4];
alignas(64) std::uint64_t   d64[N / 8];

template<typename T, std::size_t S>
void rev(bool c, T(&a)[S], T(&d)[S], const char* name) {
    std::mt19937 gen(65521);
    std::uniform_int_distribution<std::size_t> dis(0, S);
    std::size_t sizes[NS];
    for (auto& s : sizes) {
        s = dis(gen);
    }

    auto t1 = std::chrono::steady_clock::now();
    if (c) {
        for (std::size_t i = 0; i < R; i++) {
            for (std::size_t s = 0; s < NS; s++) {
                std::reverse_copy(std::begin(a), std::begin(a) + sizes[s], std::begin(d));
            }
        }
    }
    else {
        for (std::size_t i = 0; i < R; i++) {
            for (std::size_t s = 0; s < NS; s++) {
                std::reverse(std::begin(a), std::begin(a) + sizes[s]);
            }
        }
    }
    auto t2 = std::chrono::steady_clock::now();
    std::cout << name << ":\t" << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() << "s\n";
}

int main()
{
    rev(false, a8, d8, "reverse 8");
    rev(false, a16, d16, "reverse 16");
    rev(false, a32, d32, "reverse 32");
    rev(false, a64, d64, "reverse 64");
    rev(true, a8, d8, "rev. copy 8");
    rev(true, a16, d16, "rev. copy 16");
    rev(true, a32, d32, "rev. copy 32");
    rev(true, a64, d64, "rev. copy 64");
}

Results on my Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz:

| Name | Before | After | Before JCC mtg | After JCC mtg |
|---|---|---|---|---|
| reverse 8 | 3.46938s | 2.45349s | 2.74506s | 2.41771s |
| reverse 16 | 2.91174s | 2.49834s | 2.5235s | 2.36707s |
| reverse 32 | 2.44931s | 2.46176s | 2.4739s | 2.50781s |
| reverse 64 | 2.18105s | 2.32562s | 2.21836s | 2.22949s |
| rev. copy 8 | 2.80264s | 2.4998s | 2.81799s | 2.51851s |
| rev. copy 16 | 2.83642s | 2.4392s | 2.71336s | 2.43827s |
| rev. copy 32 | 2.56736s | 2.71499s | 2.46718s | 2.468s |
| rev. copy 64 | 2.21659s | 2.32269s | 2.21912s | 2.21747s |

JCC mtg = JCC erratum mitigation, enabled by adding /QIntel-jcc-erratum to the root makefile
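To read the table as relative changes rather than raw seconds, the arithmetic can be sketched as below. The timings are copied from the "Before" and "After" columns (without JCC mitigation); the names `Row`, `kRows`, `percent_change`, and `print_changes` are hypothetical helpers, not part of the PR.

```cpp
#include <cassert>
#include <cstdio>

struct Row {
    const char* name;
    double before; // seconds, "Before" column
    double after;  // seconds, "After" column
};

// Timings copied from the results table (no JCC mitigation).
constexpr Row kRows[] = {
    {"reverse 8",    3.46938, 2.45349},
    {"reverse 16",   2.91174, 2.49834},
    {"reverse 32",   2.44931, 2.46176},
    {"reverse 64",   2.18105, 2.32562},
    {"rev. copy 8",  2.80264, 2.4998},
    {"rev. copy 16", 2.83642, 2.4392},
    {"rev. copy 32", 2.56736, 2.71499},
    {"rev. copy 64", 2.21659, 2.32269},
};

// Relative change in percent; negative means "After" is faster.
constexpr double percent_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Prints one line per benchmark with its relative change.
inline void print_changes() {
    for (const Row& r : kRows) {
        std::printf("%-12s %+6.1f%%\n", r.name, percent_change(r.before, r.after));
    }
}
```

Calling `print_changes()` from `main` prints one signed percentage per row. The 8-bit and 16-bit rows come out negative (faster), while the 32-bit and 64-bit rows come out slightly positive, and a single run like this says nothing about run-to-run noise.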

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 10, 2021 21:16
@AlexGuteniev AlexGuteniev marked this pull request as draft December 10, 2021 22:14
@CaseyCarter CaseyCarter added the performance Must go faster label Dec 10, 2021
@AlexGuteniev
Contributor Author

The gains are small and ambiguous; I'm afraid I cannot prove that this PR helps much.
If anyone disagrees and is confident in this direction, feel free to pick it up.
