
ZSTD_copy16() uses ZSTD_memcpy() #2836

Merged
merged 1 commit into from Nov 11, 2021

Conversation

@ghost (Author) commented Oct 28, 2021

ZSTD_copy16() now uses ZSTD_memcpy() instead of ZSTD_memmove(), which speeds up MSVC builds.

The decompression speed of the MSVC 2019 build improves substantially.

 1#silesia.tar  326.4 MB/s,  695.2 MB/s  ->  332.1 MB/s,  774.1 MB/s
 2#silesia.tar  261.4 MB/s,  623.0 MB/s  ->  259.7 MB/s,  698.8 MB/s
 3#silesia.tar  185.5 MB/s,  606.4 MB/s  ->  186.7 MB/s,  681.9 MB/s
 4#silesia.tar  172.7 MB/s,  593.7 MB/s  ->  174.6 MB/s,  673.6 MB/s
 5#silesia.tar   98.3 MB/s,  607.7 MB/s  ->   98.8 MB/s,  686.9 MB/s
 6#silesia.tar   67.0 MB/s,  612.9 MB/s  ->   67.6 MB/s,  693.2 MB/s
 7#silesia.tar   58.8 MB/s,  644.1 MB/s  ->   58.9 MB/s,  726.5 MB/s
 8#silesia.tar   47.3 MB/s,  651.4 MB/s  ->   48.0 MB/s,  736.3 MB/s
 9#silesia.tar   42.4 MB/s,  659.8 MB/s  ->   42.6 MB/s,  743.5 MB/s
10#silesia.tar   37.0 MB/s,  651.2 MB/s  ->   36.8 MB/s,  730.3 MB/s
11#silesia.tar   28.3 MB/s,  658.1 MB/s  ->   28.7 MB/s,  741.5 MB/s
12#silesia.tar   19.7 MB/s,  665.2 MB/s  ->   19.5 MB/s,  749.2 MB/s

There is almost no change in the gcc build; perhaps gcc can already prove that the data does not overlap.

 1#silesia.tar  375.1 MB/s, 1296.9 MB/s  ->  374.7 MB/s, 1300.7 MB/s
 2#silesia.tar  286.8 MB/s, 1144.8 MB/s  ->  285.9 MB/s, 1143.0 MB/s
 3#silesia.tar  209.1 MB/s, 1090.9 MB/s  ->  209.6 MB/s, 1093.5 MB/s
 4#silesia.tar  194.0 MB/s, 1057.7 MB/s  ->  191.4 MB/s, 1058.4 MB/s
 5#silesia.tar  106.4 MB/s, 1077.6 MB/s  ->  106.2 MB/s, 1079.3 MB/s
 6#silesia.tar   71.4 MB/s, 1070.5 MB/s  ->   71.4 MB/s, 1075.0 MB/s
 7#silesia.tar   62.6 MB/s, 1135.0 MB/s  ->   62.4 MB/s, 1132.9 MB/s
 8#silesia.tar   50.6 MB/s, 1149.2 MB/s  ->   50.5 MB/s, 1151.7 MB/s
 9#silesia.tar   44.9 MB/s, 1159.0 MB/s  ->   44.8 MB/s, 1162.5 MB/s
10#silesia.tar   38.4 MB/s, 1130.3 MB/s  ->   38.2 MB/s, 1128.2 MB/s
11#silesia.tar   29.7 MB/s, 1142.0 MB/s  ->   29.7 MB/s, 1133.5 MB/s
12#silesia.tar   20.8 MB/s, 1149.4 MB/s  ->   20.7 MB/s, 1152.9 MB/s

@ghost (Author) commented Oct 28, 2021

This PR basically only recovers the reduced performance of 6a7ede3 (by @binhdvo ).

@Cyan4973 (Contributor):

Well, there might be some good reasons to use memmove() rather than memcpy(). Aka, it's quite possible that, in some specific circumstances, dst and src overlap, making memcpy() incorrect (UB). cc @binhdvo .

@binhdvo (Contributor) commented Oct 28, 2021

This PR basically only recovers the reduced performance of 6a7ede3 (by @binhdvo ).

The changes in that PR require that we use memmove here, since the literal buffer can now be located within the dst buffer. In circumstances where the op "catches up" to where the literal buffer is, there can be partial overlaps in this call on the final copy if the literal is being shifted by less than 16 bytes. This is surfaced in the currently failing oss-fuzz test for this PR.

There weren't regressions for clang or gcc, and I don't believe we are actively optimizing for msvc2019 build. It might be possible to recover this by adding some conditional definition for ZSTD_copy16() to use ZSTD_memcpy() for this specific build, and some corresponding additional logic (also conditional on building for msvc2019) within the execSequence call to account for this case, but it's unclear if the cost of that logic would outweigh the gain.

@Cyan4973 (Contributor) commented Oct 28, 2021

Yes, it's unclear if the gain is worth the complexity.

But I think this PR points to a need to document why we use memmove() instead of memcpy() at this place in the code. Maybe some additional comments in the code would be welcome.

@terrelln (Contributor):

Another thing to try might be something like:

void ZSTD_copy16(void* dst, void const* src)
{
    uint64_t tmp[2];
    ZSTD_memcpy(tmp, src, 16);
    ZSTD_memcpy(dst, tmp, 16);
}

If that is helpful for msvc, we could use that definition under an #ifdef.

@ghost (Author) commented Oct 29, 2021

The second commit (try atomic operation) tests an SSE2 instruction; it passed the tests.
The third commit (test general code) tests terrelln's code; it also passed.

@binhdvo, please create a PR for this, since I don't know how to do it:

But I think this PR points to a need to document why we use memmove() instead of memcpy() at this place in the code. Maybe some additional code comments would be welcome.

Changes:

 static void ZSTD_copy16(void* dst, const void* src) {
 #if defined(ZSTD_ARCH_ARM_NEON)
     vst1q_u8((uint8_t*)dst, vld1q_u8((const uint8_t*)src));
+#elif defined(ZSTD_ARCH_X86_SSE2)
+    _mm_storeu_si128((__m128i*)dst, _mm_loadu_si128((const __m128i*)src));
 #else
-    ZSTD_memmove(dst, src, 16);
+    {
+        U64 tmp[2];
+        ZSTD_memcpy(tmp, src, 16);
+        ZSTD_memcpy(dst, tmp, 16);
+    }
 #endif
 }
 #define COPY16(d,s) { ZSTD_copy16(d,s); d+=16; s+=16; }

This accelerates the decompression speed of the MSVC build.
@ghost (Author) commented Nov 4, 2021

I simplified the change so that it only adds the speedup for the SSE2 platform.
It's mainly for MSVC; GCC already handles the other platforms properly.
MSVC can also compile ARM code, and there is already a ZSTD_ARCH_ARM_NEON branch here.

edit: I think ZSTD_copy8() doesn't need such a change.
Or does ZSTD_copy8() need it too? It may have the same problem on 32-bit platforms:

static void ZSTD_copy8(void* dst, const void* src) {
#if defined(ZSTD_ARCH_ARM_NEON)
    vst1_u8((uint8_t*)dst, vld1_u8((const uint8_t*)src));
#else
    ZSTD_memcpy(dst, src, 8);
#endif
}
#define COPY8(d,s) { ZSTD_copy8(d,s); d+=8; s+=8; }

@Cyan4973 (Contributor) left a review comment:

The proposed change looks good to me.
It likely improves performance for x86 compilers unable to produce good optimization from memmove(), while not compromising correctness even in scenarios where input is erroneous.

@ghost (Author) commented Nov 11, 2021

It restores the performance of the MSVC build.

MSVC emits a call to the memmove() function, and does the same for 8-byte copies.
I have reported this problem to MSVC.

edit: In MSVC, memcpy() can use an SSE2 instruction for 16-byte copies.

@Cyan4973 Cyan4973 merged commit 9ba0790 into facebook:dev Nov 11, 2021
@ghost ghost deleted the copy16 branch November 12, 2021 05:57