ORC-1041: Use memcpy during LZO decompression #958

Conversation
Thank you for making a PR, @guiyanakuang.
cc @wgtmac, @stiga-huang, and @williamhyun
```diff
@@ -312,13 +312,11 @@ namespace orc {
       output += SIZE_OF_INT;
       matchAddress += increment32;

-      *reinterpret_cast<int32_t*>(output) =
-          *reinterpret_cast<int32_t*>(matchAddress);
+      memcpy(output, matchAddress, SIZE_OF_INT);
```
The combination of `reinterpret_cast` + assignment looks cheaper than a `memcpy` function invocation. I'm wondering if we need to pay some performance penalty here.
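For reference, the two forms under discussion can be put side by side; a minimal sketch (the function names are illustrative, not from the ORC sources) that can be pasted into https://godbolt.org/ to compare the generated code:

```c++
#include <cstdint>
#include <cstring>

// Cast-based 4-byte copy, as in the code being replaced.
void copy_with_cast(char* output, const char* matchAddress) {
  *reinterpret_cast<int32_t*>(output) =
      *reinterpret_cast<const int32_t*>(matchAddress);
}

// memcpy-based 4-byte copy, as in this PR. With optimization enabled,
// mainstream compilers typically lower a fixed-size memcpy like this
// to the same single load/store as the cast version.
void copy_with_memcpy(char* output, const char* matchAddress) {
  std::memcpy(output, matchAddress, sizeof(int32_t));
}
```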
> The combination of `reinterpret_cast` + assignment looks cheaper than a `memcpy` function invocation. I'm wondering if we need to pay some performance penalty here.
I'll do some performance tests later. `reinterpret_cast` + assignment makes direct use of registers, while `memcpy` is usually used for larger copies of data, so I'm not sure yet whether there is any performance loss.
The compiler may optimize the `memcpy` call. BTW, should we wrap a `bit_cast` function which uses `memcpy` before C++20 and uses the native one if C++20 is available?
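A minimal sketch of what such a wrapper could look like (hypothetical, not the ORC implementation; the name `orc_bit_cast` and the feature-test check are assumptions):

```c++
#include <cstring>
#include <type_traits>
#if __cplusplus >= 202002L
#include <bit>
#endif

// Bit-for-bit copy between trivially copyable types of equal size.
// Uses std::bit_cast when the C++20 library provides it, memcpy otherwise.
template <typename To, typename From>
To orc_bit_cast(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  static_assert(std::is_trivially_copyable<To>::value &&
                std::is_trivially_copyable<From>::value,
                "bit_cast requires trivially copyable types");
#if defined(__cpp_lib_bit_cast)
  return std::bit_cast<To>(from);
#else
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
#endif
}
```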
@wgtmac You are right, the compiler does optimize `memcpy`; the performance of the two approaches is similar across compilers, and only on older compilers is the expanded assignment faster.
I agree with wrapping a `bit_cast` function for binary copies between different types.
@dongjoon-hyun, so I don't think there is any performance loss here compared to the original.
```c++
#include <benchmark/benchmark.h>
#include <string.h>

// Copy 8 bytes at a time with memcpy.
static void use_memcpy(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      memcpy(output, matchAddress, 8);
      matchAddress += 8;
      output += 8;
    }
  }
}

// Copy 8 bytes at a time with eight explicit byte assignments from matchAddress.
static void use_expanded_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      output[0] = *matchAddress;
      output[1] = *(matchAddress + 1);
      output[2] = *(matchAddress + 2);
      output[3] = *(matchAddress + 3);
      output[4] = *(matchAddress + 4);
      output[5] = *(matchAddress + 5);
      output[6] = *(matchAddress + 6);
      output[7] = *(matchAddress + 7);
      matchAddress += 8;
      output += 8;
    }
  }
}

// Copy 8 bytes at a time through a reinterpret_cast'ed int64_t store.
static void use_reinterpret_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      *reinterpret_cast<int64_t*>(output) =
          *reinterpret_cast<int64_t*>(matchAddress);
      matchAddress += 8;
      output += 8;
    }
  }
}

BENCHMARK(use_memcpy)->Arg(100000);
BENCHMARK(use_expanded_assignment)->Arg(100000);
BENCHMARK(use_reinterpret_assignment)->Arg(100000);
```
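To run the snippet outside https://quick-bench.com/ (a sketch, assuming Google Benchmark is installed locally; the file name and flags are illustrative), an explicit main entry point is needed:

```c++
#include <benchmark/benchmark.h>

// Generates a main() that runs all registered benchmarks.
BENCHMARK_MAIN();
```

Build and run with something like `g++ -O2 -std=c++17 bench.cc -lbenchmark -lpthread -o bench && ./bench`.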
Thanks. Could you put this investigation result into the PR description?
No problem, I've updated the PR description.
Do we have one remaining comment to address?
@wgtmac Do you want to use the wrapped `bit_cast` in this PR instead of `memcpy`?
I am OK with not addressing the `bit_cast` wrapper in this patch.
@wgtmac Thanks for the review and approval.
+1, LGTM. Thank you so much, @guiyanakuang and @wgtmac.
BTW, this will be tested more in the branch and released as 1.8.0/1.7.2/1.6.13 to be safe.
Please participate in the current ongoing votes without considering this.
ORC-1041: Use memcpy during LZO decompression
Also, cc @williamhyun.
### What changes were proposed in this pull request?

This PR is aimed at fixing the implementation of copying data blocks during LZO decompression. `*reinterpret_cast<int64_t*>(output) = *reinterpret_cast<int64_t*>(matchAddress);` can lead to unexpected behavior, and in failed test cases it does not appear to behave as an atomic operation. This PR uses `memcpy` instead of the above statement.

Here are the performance benchmarks, where `memcpy` is basically the same as `reinterpret_cast` + assignment. With newer compilers, both outperform the manually unrolled assignment, so here is a screenshot of the results with only some of the parameters, which you can reproduce with the following test code at https://quick-bench.com/

![WX20211104-104627](https://user-images.githubusercontent.com/4069905/140250827-6282739b-c060-43fa-b348-87ede15129fc.png)
![WX20211104-105010](https://user-images.githubusercontent.com/4069905/140250854-cf6da388-18d8-42f0-8cd6-18468633acc3.png)
![WX20211104-105348](https://user-images.githubusercontent.com/4069905/140250863-6c99cfcb-0b72-4ee0-a6b0-ac31344ac771.png)

```c++
#include <string.h>

static void use_memcpy(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      memcpy(output, matchAddress, 8);
      matchAddress += 8;
      output += 8;
    }
  }
}

static void use_expanded_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      output[0] = *matchAddress;
      output[1] = *(matchAddress + 1);
      output[2] = *(matchAddress + 2);
      output[3] = *(matchAddress + 3);
      output[4] = *(matchAddress + 4);
      output[5] = *(matchAddress + 5);
      output[6] = *(matchAddress + 6);
      output[7] = *(matchAddress + 7);
      matchAddress += 8;
      output += 8;
    }
  }
}

static void use_reinterpret_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      *reinterpret_cast<int64_t*>(output) =
          *reinterpret_cast<int64_t*>(matchAddress);
      matchAddress += 8;
      output += 8;
    }
  }
}

BENCHMARK(use_memcpy)->Arg(100000);
BENCHMARK(use_expanded_assignment)->Arg(100000);
BENCHMARK(use_reinterpret_assignment)->Arg(100000);
```

### Why are the changes needed?

Fix the bug in LZO decompression.

### How was this patch tested?

Pass the CIs.

(cherry picked from commit 502661a)
Signed-off-by: Dongjoon Hyun <[email protected]>
This is backported to branch-1.7 for 1.7.2.
### What changes were proposed in this pull request?

This PR is aimed at fixing the implementation of copying data blocks during LZO decompression. `*reinterpret_cast<int64_t*>(output) = *reinterpret_cast<int64_t*>(matchAddress);` can lead to unexpected behavior, and in failed test cases it does not appear to behave as an atomic operation. This PR uses `memcpy` instead of the above statement.
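For illustration, a minimal sketch of one reason the cast form is risky (an example with a deliberately misaligned pointer, not code from the ORC sources): `memcpy` has no alignment or aliasing requirements, while dereferencing a `reinterpret_cast`'ed pointer does.

```c++
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  alignas(8) char buf[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
  char* src = buf + 1;  // deliberately not aligned for int64_t

  // Well defined: memcpy may copy to/from any address, and compilers
  // lower a fixed 8-byte copy to a single load/store where possible.
  int64_t v;
  std::memcpy(&v, src, sizeof(v));
  std::cout << v << std::endl;

  // Undefined behavior in ISO C++: src is not suitably aligned for
  // int64_t, and the access also violates strict aliasing rules.
  // int64_t w = *reinterpret_cast<int64_t*>(src);
  return 0;
}
```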
Here are the performance benchmarks, where `memcpy` is basically the same as `reinterpret_cast` + assignment. With newer compilers, both outperform the manually unrolled assignment, so here is a screenshot of the results with only some of the parameters, which you can reproduce with the test code above at https://quick-bench.com/

### Why are the changes needed?

Fix the bug in LZO decompression.

### How was this patch tested?

Pass the CIs.