
ZSTD_copy16() uses ZSTD_memcpy() #2836

Merged
merged 1 commit into from Nov 11, 2021

Conversation

@ghost (Author) commented Oct 28, 2021

ZSTD_copy16() now uses ZSTD_memcpy() instead of ZSTD_memmove(), which speeds up MSVC builds.

The decompression speed of the MSVC 2019 build improves substantially.

 1#silesia.tar  326.4 MB/s,  695.2 MB/s  ->  332.1 MB/s,  774.1 MB/s
 2#silesia.tar  261.4 MB/s,  623.0 MB/s  ->  259.7 MB/s,  698.8 MB/s
 3#silesia.tar  185.5 MB/s,  606.4 MB/s  ->  186.7 MB/s,  681.9 MB/s
 4#silesia.tar  172.7 MB/s,  593.7 MB/s  ->  174.6 MB/s,  673.6 MB/s
 5#silesia.tar   98.3 MB/s,  607.7 MB/s  ->   98.8 MB/s,  686.9 MB/s
 6#silesia.tar   67.0 MB/s,  612.9 MB/s  ->   67.6 MB/s,  693.2 MB/s
 7#silesia.tar   58.8 MB/s,  644.1 MB/s  ->   58.9 MB/s,  726.5 MB/s
 8#silesia.tar   47.3 MB/s,  651.4 MB/s  ->   48.0 MB/s,  736.3 MB/s
 9#silesia.tar   42.4 MB/s,  659.8 MB/s  ->   42.6 MB/s,  743.5 MB/s
10#silesia.tar   37.0 MB/s,  651.2 MB/s  ->   36.8 MB/s,  730.3 MB/s
11#silesia.tar   28.3 MB/s,  658.1 MB/s  ->   28.7 MB/s,  741.5 MB/s
12#silesia.tar   19.7 MB/s,  665.2 MB/s  ->   19.5 MB/s,  749.2 MB/s

There is almost no change in the gcc build; perhaps gcc can already prove that the data does not overlap.

 1#silesia.tar  375.1 MB/s, 1296.9 MB/s  ->  374.7 MB/s, 1300.7 MB/s
 2#silesia.tar  286.8 MB/s, 1144.8 MB/s  ->  285.9 MB/s, 1143.0 MB/s
 3#silesia.tar  209.1 MB/s, 1090.9 MB/s  ->  209.6 MB/s, 1093.5 MB/s
 4#silesia.tar  194.0 MB/s, 1057.7 MB/s  ->  191.4 MB/s, 1058.4 MB/s
 5#silesia.tar  106.4 MB/s, 1077.6 MB/s  ->  106.2 MB/s, 1079.3 MB/s
 6#silesia.tar   71.4 MB/s, 1070.5 MB/s  ->   71.4 MB/s, 1075.0 MB/s
 7#silesia.tar   62.6 MB/s, 1135.0 MB/s  ->   62.4 MB/s, 1132.9 MB/s
 8#silesia.tar   50.6 MB/s, 1149.2 MB/s  ->   50.5 MB/s, 1151.7 MB/s
 9#silesia.tar   44.9 MB/s, 1159.0 MB/s  ->   44.8 MB/s, 1162.5 MB/s
10#silesia.tar   38.4 MB/s, 1130.3 MB/s  ->   38.2 MB/s, 1128.2 MB/s
11#silesia.tar   29.7 MB/s, 1142.0 MB/s  ->   29.7 MB/s, 1133.5 MB/s
12#silesia.tar   20.8 MB/s, 1149.4 MB/s  ->   20.7 MB/s, 1152.9 MB/s

@ghost (Author) commented Oct 28, 2021

This PR basically only recovers the reduced performance of 6a7ede3 (by @binhdvo ).

@Cyan4973 (Contributor):

Well, there might be some good reasons to use memmove() rather than memcpy(). Aka, it's quite possible that, in some specific circumstances, dst and src overlap, making memcpy() incorrect (UB). cc @binhdvo .

@binhdvo (Contributor) commented Oct 28, 2021

This PR basically only recovers the reduced performance of 6a7ede3 (by @binhdvo ).

The changes in that PR require that we use memmove here, since the literal buffer can now be located within the dst buffer. In circumstances where the op "catches up" to where the literal buffer is, there can be partial overlaps in this call on the final copy if the literal is being shifted by less than 16 bytes. This is surfaced in the currently failing oss-fuzz test for this PR.

There weren't regressions for clang or gcc, and I don't believe we are actively optimizing for msvc2019 build. It might be possible to recover this by adding some conditional definition for ZSTD_copy16() to use ZSTD_memcpy() for this specific build, and some corresponding additional logic (also conditional on building for msvc2019) within the execSequence call to account for this case, but it's unclear if the cost of that logic would outweigh the gain.

@Cyan4973 (Contributor) commented Oct 28, 2021

Yes, it's unclear if the gain is worth the complexity.

But I think this PR points to a need to document why we use memmove() instead of memcpy() at this place in the code. Maybe some additional comments in the code would be welcome.

@terrelln (Contributor):

Another thing to try might be something like:

void ZSTD_copy16(void* dst, void const* src)
{
    uint64_t tmp[2];
    ZSTD_memcpy(tmp, src, 16);
    ZSTD_memcpy(dst, tmp, 16);
}

If that is helpful for msvc, we could use that definition under an #ifdef.

@ghost (Author) commented Oct 29, 2021

The second commit (try atomic operation) tests an SSE2 instruction; it passed the tests.
The third commit (test general code) tests terrelln's code; it also passed.

@binhdvo, please create a PR for this, since I don't know how to do it:

But I think this PR points to a need to document why we use memmove() instead of memcpy() at this place in the code. Maybe some additional code comments would be welcome.

Changes:

 static void ZSTD_copy16(void* dst, const void* src) {
 #if defined(ZSTD_ARCH_ARM_NEON)
     vst1q_u8((uint8_t*)dst, vld1q_u8((const uint8_t*)src));
+#elif defined(ZSTD_ARCH_X86_SSE2)
+    _mm_storeu_si128((__m128i*)dst, _mm_loadu_si128((const __m128i*)src));
 #else
-    ZSTD_memmove(dst, src, 16);
+    {
+        U64 tmp[2];
+        ZSTD_memcpy(tmp, src, 16);
+        ZSTD_memcpy(dst, tmp, 16);
+    }
 #endif
 }
 #define COPY16(d,s) { ZSTD_copy16(d,s); d+=16; s+=16; }

This accelerates the decompression speed of the MSVC build.
@ghost (Author) commented Nov 4, 2021

I simplified the change so that it only adds the speedup for the SSE2 platform.
It's mainly for MSVC; GCC already handles the other platforms properly.
MSVC can also compile ARM code, and there is already a ZSTD_ARCH_ARM_NEON branch here.

edit: I think ZSTD_copy8() doesn't need such a change.
Or does ZSTD_copy8() need it too? It may have the same problem on 32-bit platforms:

static void ZSTD_copy8(void* dst, const void* src) {
#if defined(ZSTD_ARCH_ARM_NEON)
    vst1_u8((uint8_t*)dst, vld1_u8((const uint8_t*)src));
#else
    ZSTD_memcpy(dst, src, 8);
#endif
}
#define COPY8(d,s) { ZSTD_copy8(d,s); d+=8; s+=8; }

@Cyan4973 (Contributor) left a review comment:

The proposed change looks good to me.
It likely improves performance for x86 compilers unable to produce good optimization from memmove(), while not compromising correctness even in scenarios where input is erroneous.

@ghost (Author) commented Nov 11, 2021

It restores the performance of the MSVC build.

MSVC emits a call to the memmove() function, and does the same for 8-byte copies.
I have reported this problem to MSVC.

edit: In MSVC, memcpy() can use an SSE2 instruction for 16-byte copies.

@Cyan4973 Cyan4973 merged commit 9ba0790 into facebook:dev Nov 11, 2021
@ghost ghost deleted the copy16 branch November 12, 2021 05:57