ORC-1041: Use memcpy during LZO decompression #958

Merged 2 commits on Nov 4, 2021
15 changes: 5 additions & 10 deletions c++/src/LzoDecompressor.cc
@@ -312,13 +312,11 @@ namespace orc {
       output += SIZE_OF_INT;
       matchAddress += increment32;

-      *reinterpret_cast<int32_t*>(output) =
-          *reinterpret_cast<int32_t*>(matchAddress);
+      memcpy(output, matchAddress, SIZE_OF_INT);
Member:
The combination of reinterpret_cast + assignment looks cheaper than a memcpy function call. I'm wondering whether we need to pay a performance penalty here.

Member Author:

> The combination of reinterpret_cast + assignment looks cheaper than a memcpy function call. I'm wondering whether we need to pay a performance penalty here.

I'll run some performance tests later. reinterpret_cast + assignment makes direct use of registers, while memcpy is usually used for larger copies; I'm not yet sure whether there's any loss.

Member:
The compiler may optimize the memcpy call. BTW, should we wrap a bit_cast function that uses memcpy before C++20 and the native std::bit_cast when C++20 is available?

Member Author:
@wgtmac you are right, the compiler does optimize memcpy: the performance of the two approaches is similar across compilers, though on older compilers the expanded assignment is faster. I agree with wrapping a bit_cast function for binary copies between different types.

@dongjoon-hyun, so I don't think there's any performance loss here compared to the original.

[Benchmark result screenshots: WX20211104-104627, WX20211104-105010, WX20211104-105348]

#include <string.h>
#include <benchmark/benchmark.h>

static void use_memcpy(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      memcpy(output, matchAddress, 8);
      matchAddress += 8;
      output += 8;
    }
  }
}

static void use_expanded_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      output[0] = matchAddress[0];
      output[1] = matchAddress[1];
      output[2] = matchAddress[2];
      output[3] = matchAddress[3];
      output[4] = matchAddress[4];
      output[5] = matchAddress[5];
      output[6] = matchAddress[6];
      output[7] = matchAddress[7];
      matchAddress += 8;
      output += 8;
    }
  }
}

static void use_reinterpret_assignment(benchmark::State& state) {
  auto size = state.range(0);
  char buf[size];
  for (int i = 0; i < 8; ++i) {
    buf[i] = 'a';
  }
  for (auto _ : state) {
    char *output = buf + 8;
    char *matchAddress = buf;
    char *matchOutputLimit = buf + size;
    while (output < matchOutputLimit) {
      *reinterpret_cast<int64_t*>(output) =
                *reinterpret_cast<int64_t*>(matchAddress);
      matchAddress += 8;
      output += 8;
    }
  }
}

BENCHMARK(use_memcpy)->Arg(100000);

BENCHMARK(use_expanded_assignment)->Arg(100000);

BENCHMARK(use_reinterpret_assignment)->Arg(100000);

BENCHMARK_MAIN();

Member:
Thanks. Could you put this investigation result into the PR description?

Member Author:
No problem, I've updated the PR description.

       output += SIZE_OF_INT;
       matchAddress -= decrement64;
     } else {
-      *reinterpret_cast<int64_t*>(output) =
-          *reinterpret_cast<int64_t*>(matchAddress);
+      memcpy(output, matchAddress, SIZE_OF_LONG);
       matchAddress += SIZE_OF_LONG;
       output += SIZE_OF_LONG;
     }
@@ -329,8 +327,7 @@ namespace orc {
       }

       while (output < fastOutputLimit) {
-        *reinterpret_cast<int64_t*>(output) =
-            *reinterpret_cast<int64_t*>(matchAddress);
+        memcpy(output, matchAddress, SIZE_OF_LONG);
         matchAddress += SIZE_OF_LONG;
         output += SIZE_OF_LONG;
       }
@@ -340,8 +337,7 @@ namespace orc {
       }
     } else {
       while (output < matchOutputLimit) {
-        *reinterpret_cast<int64_t*>(output) =
-            *reinterpret_cast<int64_t*>(matchAddress);
+        memcpy(output, matchAddress, SIZE_OF_LONG);
         matchAddress += SIZE_OF_LONG;
         output += SIZE_OF_LONG;
       }
@@ -366,8 +362,7 @@ namespace orc {
       // fast copy. We may over-copy but there's enough room in input
       // and output to not overrun them
       do {
-        *reinterpret_cast<int64_t*>(output) =
-            *reinterpret_cast<const int64_t*>(input);
+        memcpy(output, input, SIZE_OF_LONG);
         input += SIZE_OF_LONG;
         output += SIZE_OF_LONG;
       } while (output < literalOutputLimit);