ZSTD_copy16() uses ZSTD_memcpy() #2836
Conversation
Well, there might be some good reasons to use ZSTD_memmove() here.
The changes in that PR require that we use memmove here, since the literal buffer can now be located within the dst buffer. In circumstances where the op "catches up" to where the literal buffer is, there can be partial overlaps in this call on the final copy if the literal is being shifted by less than 16 bytes. This is surfaced in the currently failing oss-fuzz test for this PR. There weren't regressions for clang or gcc, and I don't believe we are actively optimizing for the msvc2019 build. It might be possible to recover this by adding a conditional definition of ZSTD_copy16() that uses ZSTD_memcpy() for this specific build, plus some corresponding additional logic (also conditional on building for msvc2019) within the execSequence call to account for this case, but it's unclear whether the cost of that logic would outweigh the gain.
Yes, it's unclear if the gain is worth the complexity. But I think this PR points to a need to document why we use ZSTD_memmove() here.
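As background for that documentation, here is a minimal standalone sketch of the overlap scenario described above (the buffer layout and the 8-byte offset are purely illustrative, not taken from the zstd sources): once the output pointer has caught up to within 16 bytes of the literal buffer, the source and destination of the 16-byte copy overlap, which memmove() handles but memcpy() does not guarantee.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Illustrative layout: the literals start only 8 bytes ahead of the
     * current write position, so a 16-byte copy overlaps by 8 bytes. */
    char buf[32] = "0123456789ABCDEFGHIJKLMNOPQRSTU";
    char* op = buf;              /* output write position */
    const char* lit = buf + 8;   /* literal data, less than 16 bytes away */

    memmove(op, lit, 16);        /* well defined despite the partial overlap */
    /* memcpy(op, lit, 16) would be undefined behavior here. */

    printf("%.16s\n", op);       /* prints 89ABCDEFGHIJKLMN */
    return 0;
}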
Another thing to try might be something like:

void ZSTD_copy16(void* dst, void const* src)
{
    uint64_t tmp[2];
    ZSTD_memcpy(tmp, src, 16);   /* read all 16 source bytes first...           */
    ZSTD_memcpy(dst, tmp, 16);   /* ...then write, so a partial overlap is safe */
}

If that is helpful for msvc, we could use that definition under an MSVC-specific #if.
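If that route were taken, one possible shape for the guard might be the sketch below. This is only an illustration of the suggestion above; the _MSC_VER check and its placement are my assumptions, and U64, ZSTD_memcpy() and ZSTD_memmove() are assumed to come from zstd's internal headers, as in the surrounding code.

static void ZSTD_copy16(void* dst, const void* src)
{
#if defined(_MSC_VER)
    /* MSVC: copy through a local buffer so the compiler emits two plain
     * 16-byte loads/stores instead of calling the memmove() library routine.
     * Reading all 16 source bytes before writing keeps partial overlaps safe. */
    U64 tmp[2];
    ZSTD_memcpy(tmp, src, 16);
    ZSTD_memcpy(dst, tmp, 16);
#else
    ZSTD_memmove(dst, src, 16);
#endif
}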
The second commit ("try atomic operation") tests the SSE2 instructions, and it passed the tests. binhdvo, please create a PR for this problem, since I don't know how to do this:
Changes:

 static void ZSTD_copy16(void* dst, const void* src) {
 #if defined(ZSTD_ARCH_ARM_NEON)
     vst1q_u8((uint8_t*)dst, vld1q_u8((const uint8_t*)src));
+#elif defined(ZSTD_ARCH_X86_SSE2)
+    _mm_storeu_si128((__m128i*)dst, _mm_loadu_si128((const __m128i*)src));
 #else
-    ZSTD_memmove(dst, src, 16);
+    {
+        U64 tmp[2];
+        ZSTD_memcpy(tmp, src, 16);
+        ZSTD_memcpy(dst, tmp, 16);
+    }
 #endif
 }
 #define COPY16(d,s) { ZSTD_copy16(d,s); d+=16; s+=16; }
This accelerates the decompression speed of the MSVC build.
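For anyone who wants to convince themselves that both new paths keep the overlap tolerance of the old ZSTD_memmove() call, a small standalone harness along these lines could be used. It is an illustrative test sketched here, not part of the PR; plain memcpy stands in for ZSTD_memcpy, and the SSE2 branch is only compiled where the compiler advertises SSE2.

#include <stdio.h>
#include <string.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define HAS_SSE2 1
#endif

typedef unsigned long long U64;

static void copy16(void* dst, const void* src)
{
#if defined(HAS_SSE2)
    /* One 16-byte load, then one 16-byte store: overlap-safe for exactly 16 bytes. */
    _mm_storeu_si128((__m128i*)dst, _mm_loadu_si128((const __m128i*)src));
#else
    /* Fallback: read everything into a local buffer before writing. */
    U64 tmp[2];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
#endif
}

int main(void)
{
    char buf[32] = "0123456789ABCDEFGHIJKLMNOPQRSTU";
    copy16(buf, buf + 8);    /* overlapping copy, shifted by 8 bytes */
    puts(memcmp(buf, "89ABCDEFGHIJKLMN", 16) == 0 ? "OK" : "FAIL");
    return 0;
}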
I simplified the change, only speedup for ZSTD_copy16(). edit: I think ZSTD_copy8() already uses ZSTD_memcpy():

static void ZSTD_copy8(void* dst, const void* src) {
#if defined(ZSTD_ARCH_ARM_NEON)
    vst1_u8((uint8_t*)dst, vld1_u8((const uint8_t*)src));
#else
    ZSTD_memcpy(dst, src, 8);
#endif
}
#define COPY8(d,s) { ZSTD_copy8(d,s); d+=8; s+=8; }
The proposed change looks good to me.
It likely improves performance for x86 compilers unable to produce good optimization from memmove(), while not compromising correctness even in scenarios where input is erroneous.
It restores the performance of the MSVC build. MSVC calls memmove(). edit: In MSVC, ZSTD_copy16() uses ZSTD_memcpy() instead of ZSTD_memmove(); this speeds up MSVC builds. The decompression speed of the msvc2019 build has been improved a lot. There is almost no change in the gcc build; maybe gcc knows that there is no overlap in the data.
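To put rough numbers on claims like this, a throwaway micro-benchmark along the following lines could be compiled with MSVC and with gcc. It is entirely illustrative and not the measurement behind the statement above; the iteration counts are arbitrary, plain memcpy/memmove stand in for the ZSTD_* wrappers, and the indirect call is meant to discourage the compiler from specializing the loop, though results will still vary by compiler and flags.

#include <stdio.h>
#include <string.h>
#include <time.h>

/* 16-byte copy through a local buffer: two plain loads/stores,
 * still correct when src and dst partially overlap. */
static void copy16_tmp(void* dst, const void* src)
{
    unsigned long long tmp[2];
    memcpy(tmp, src, 16);
    memcpy(dst, tmp, 16);
}

/* 16-byte copy via memmove(), which MSVC tends to leave as a library call. */
static void copy16_memmove(void* dst, const void* src)
{
    memmove(dst, src, 16);
}

static char buf[1 << 16];

static double bench(void (*copy)(void*, const void*))
{
    clock_t t0 = clock();
    for (int iter = 0; iter < 20000; ++iter)
        for (size_t i = 0; i + 24 <= sizeof(buf); i += 16)
            copy(buf + i, buf + i + 8);   /* overlapping 16-byte copies */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("memmove copy: %.3f s\n", bench(copy16_memmove));
    printf("tmp copy    : %.3f s\n", bench(copy16_tmp));
    return 0;
}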