GH-40431: [C++] Move key_hash/key_map/light_array related files to internal for prevent using by users #40484

ZhangHuiGui · 2024-03-12T10:30:06Z

Rationale for this change

These files expose implementation details and APIs that are not meant for third-party use. This PR explicitly marks them internal, which also avoids having them installed.

Are these changes tested?

By existing builds and tests.

Are there any user-facing changes?

No, except hiding some header files that were not supposed to be included externally.

GitHub Issue: [C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently #40431

ZhangHuiGui · 2024-03-12T10:36:16Z

cc @kou, this is a temporary fix, and I haven't added unit tests yet. PTAL?

kou · 2024-03-12T20:33:14Z

cpp/src/arrow/compute/key_hash.cc

+    const int64_t alloc_size1 =
+        2 * (alloc_entry_length * sizeof(uint32_t) + util::TempVectorStack::meta_size());


Is this for hash_temp_buf and null_hash_temp_buf?

kou · 2024-03-12T20:33:33Z

cpp/src/arrow/compute/key_hash.cc

+    const int64_t alloc_size2 =
+        alloc_entry_length * sizeof(uint16_t) + util::TempVectorStack::meta_size();


Is this for null_indices_buf?

kou · 2024-03-12T20:35:12Z

cpp/src/arrow/compute/key_hash.cc

+  const int64_t alloc_entry_length = column_arrays[0].length();
+  auto estimate_size = [&] {
+    // An estimate TempVectorStack usage size for Hashing32::HashMultiColumm.
+    const int64_t alloc_size1 =
+        2 * (alloc_entry_length * sizeof(uint32_t) + util::TempVectorStack::meta_size());
+    const int64_t alloc_size2 =
+        alloc_entry_length * sizeof(uint16_t) + util::TempVectorStack::meta_size();
+    return alloc_size1 + alloc_size2;
+  };


Why do you want to do this in HashBatch() rather than HashMultiColumn() (which has the real allocation logic)?

Yes, the code here was unreasonable. It was temporary and has been refactored.

kou · 2024-03-12T20:35:38Z

cpp/src/arrow/compute/key_hash.cc

+  if (!temp_stack) {
+    util::TempVectorStack stack;
+    RETURN_NOT_OK(stack.Init(default_memory_pool(), estimate_size()));
+    ctx.stack = std::move(&stack);


Is this safe?
I think that stack is invalid outside of this block.

Oh... I lost my mind!
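The lifetime problem raised here can be illustrated with a minimal sketch. ToyStack, ToyContext and HashWithFallbackStack are hypothetical stand-ins invented for illustration, not Arrow types or functions. The idea is only that the fallback stack has to be declared in a scope that outlives every use of the pointer stored in the context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for util::TempVectorStack and LightContext.
struct ToyStack {
  std::vector<uint8_t> buffer;
  void Init(int64_t size) { buffer.assign(static_cast<std::size_t>(size), 0); }
};

struct ToyContext {
  ToyStack* stack = nullptr;
};

// Returns the number of bytes available to the (omitted) hashing work.
inline int64_t HashWithFallbackStack(ToyStack* caller_stack, int64_t needed_bytes) {
  assert(needed_bytes >= 0);
  ToyContext ctx;
  // Declared at function scope, so it outlives the `if` block below.
  // Declaring it inside the block would leave ctx.stack dangling afterwards.
  ToyStack fallback;
  if (caller_stack == nullptr) {
    fallback.Init(needed_bytes);
    ctx.stack = &fallback;  // valid until the function returns
  } else {
    ctx.stack = caller_stack;
  }
  return static_cast<int64_t>(ctx.stack->buffer.size());
}
```

Note that std::move on a raw pointer, as in the quoted diff, only copies the pointer; it does not extend the lifetime of the object it points to.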

kou · 2024-03-12T20:36:15Z

cpp/src/arrow/compute/key_hash.cc

+    return alloc_size1 + alloc_size2;
+  };
+
+  if (!temp_stack) {


You want to pass nullptr for temp_stack for your use case, right?

Yes. I think users who call HashBatch as an independent API shouldn't need to care about the stack size.

kou · 2024-03-12T20:38:26Z

cpp/src/arrow/compute/key_hash.cc

+        << temp_stack->buffer_size() << "Bytes, expect " << estimate_alloc_size
+        << "Bytes)";
+    ctx.stack = temp_stack;
+  }


We have similar code in Hashing64::HashBatch(). Can we avoid the duplication?

kou · 2024-03-12T20:38:39Z

cpp/src/arrow/compute/util.h

@@ -97,6 +97,9 @@ class ARROW_EXPORT TempVectorStack {
    return Status::OK();
  }

+  const int64_t buffer_size() const { return buffer_size_; }


Suggested change

-  const int64_t buffer_size() const { return buffer_size_; }
+  int64_t buffer_size() const { return buffer_size_; }

kou · 2024-03-12T20:40:05Z

cpp/src/arrow/compute/util.h

@@ -97,6 +97,9 @@ class ARROW_EXPORT TempVectorStack {
    return Status::OK();
  }

+  const int64_t buffer_size() const { return buffer_size_; }
+  static int64_t meta_size() { return kPadding + 2 * sizeof(uint64_t); }


Can we provide RequiredSize(), EstimateSize() or something instead of providing this?

ZhangHuiGui · 2024-03-14T01:57:22Z

cpp/src/arrow/compute/util.h

@@ -89,16 +89,23 @@ class ARROW_EXPORT TempVectorStack {
  Status Init(MemoryPool* pool, int64_t size) {
    num_vectors_ = 0;
    top_ = 0;
-    buffer_size_ = PaddedAllocationSize(size) + kPadding + 2 * sizeof(uint64_t);


PaddedAllocationSize already adds kPadding, so it's unnecessary to add it again.

ZhangHuiGui · 2024-03-25T04:26:08Z

@kou how about this refactor?

kou · 2024-03-25T04:32:11Z

cpp/src/arrow/compute/key_hash.cc

+  const uint32_t alloc_batch_size = std::min(num_rows, max_batch_size);
+  const int64_t estimate_alloc_size = EstimateBatchStackSize<uint32_t>(alloc_batch_size);


Can we use auto here?

Suggested change

-  const uint32_t alloc_batch_size = std::min(num_rows, max_batch_size);
-  const int64_t estimate_alloc_size = EstimateBatchStackSize<uint32_t>(alloc_batch_size);
+  const auto alloc_batch_size = std::min(num_rows, max_batch_size);
+  const auto estimate_alloc_size = EstimateBatchStackSize<uint32_t>(alloc_batch_size);

kou · 2024-03-25T04:34:57Z

cpp/src/arrow/compute/key_hash.cc

+  util::TempVectorStack temp_stack;
+  if (!ctx->stack) {
+    ARROW_CHECK_OK(temp_stack.Init(default_memory_pool(), estimate_alloc_size));
+    ctx->stack = &temp_stack;


Could you reset ctx->stack to nullptr before this function exits?

kou · 2024-03-25T04:35:19Z

cpp/src/arrow/compute/key_hash.cc

@@ -472,6 +483,7 @@ Status Hashing32::HashBatch(const ExecBatch& key_batch, uint32_t* hashes,
  LightContext ctx;
  ctx.hardware_flags = hardware_flags;
  ctx.stack = temp_stack;
+


Could you revert this needless change?

kou · 2024-03-25T04:39:25Z

cpp/src/arrow/compute/util.cc

@@ -35,7 +35,7 @@ void TempVectorStack::alloc(uint32_t num_bytes, uint8_t** data, int* id) {
  int64_t new_top = top_ + PaddedAllocationSize(num_bytes) + 2 * sizeof(uint64_t);
  // Stack overflow check (see GH-39582).


Could you move this comment to CheckAllocSizeValid()?

kou · 2024-03-25T04:40:17Z

cpp/src/arrow/compute/util.cc

@@ -58,6 +58,13 @@ void TempVectorStack::release(int id, uint32_t num_bytes) {
  --num_vectors_;
 }

+void TempVectorStack::CheckAllocSizeValid(int64_t estimate_alloc_size) {


Could you return arrow::Status instead of void here?

TempVectorStack::alloc() will not be able to use it for now but Hashing32::HashBatch() can use it.

Yes, reasonable. Refactored!

kou · 2024-03-25T04:49:25Z

cpp/src/arrow/compute/util.cc

@@ -58,6 +58,13 @@ void TempVectorStack::release(int id, uint32_t num_bytes) {
  --num_vectors_;
 }

+void TempVectorStack::CheckAllocSizeValid(int64_t estimate_alloc_size) {
+  ARROW_DCHECK_LE(estimate_alloc_size, buffer_size_)


I think that we should receive the additional allocation size instead of the total new allocation size here:

Suggested change

-  ARROW_DCHECK_LE(estimate_alloc_size, buffer_size_)
+  ARROW_DCHECK_LE(top_ + alloc_size, buffer_size_)
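The difference between the two checks can be sketched as follows. ToyStack and its member names are hypothetical and model only the bookkeeping, not the real TempVectorStack. The sketch shows a request that fits the whole buffer but not the space remaining above the current top.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for util::TempVectorStack; only the counters are modeled.
struct ToyStack {
  int64_t top = 0;          // bytes already handed out
  int64_t buffer_size = 0;  // total capacity in bytes

  // Check suggested in the review: account for what is already in use.
  bool FitsAdditional(int64_t alloc_size) const {
    assert(alloc_size >= 0);
    return top + alloc_size <= buffer_size;
  }

  // Earlier check: compares the request against the whole buffer only.
  bool FitsIgnoringTop(int64_t alloc_size) const {
    assert(alloc_size >= 0);
    return alloc_size <= buffer_size;
  }
};
```

With 80 of 100 bytes in use, a 30-byte request passes FitsIgnoringTop but is rejected by FitsAdditional, which is the case the suggested change is meant to catch.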

kou · 2024-03-25T04:50:34Z

cpp/src/arrow/compute/util.h

@@ -89,16 +89,23 @@ class ARROW_EXPORT TempVectorStack {
  Status Init(MemoryPool* pool, int64_t size) {
    num_vectors_ = 0;
    top_ = 0;
-    buffer_size_ = PaddedAllocationSize(size) + kPadding + 2 * sizeof(uint64_t);
+    buffer_size_ = PaddedAllocationSize(size) + 2 * sizeof(uint64_t);


Can we use EstimateAllocSize() here?

kou · 2024-03-25T04:50:47Z

cpp/src/arrow/compute/util.h

+    return PaddedAllocationSize(size) + 2 * sizeof(uint64_t);
+  }
+
+  int64_t StackBufferSize() const { return buffer_size_; }


Do we need this?

kou · 2024-03-25T04:51:14Z

cpp/src/arrow/compute/key_hash_test.cc

@@ -311,5 +311,32 @@ TEST(VectorHash, FixedLengthTailByteSafety) {
  HashFixedLengthFrom(/*key_length=*/19, /*num_rows=*/64, /*start_row=*/63);
 }

+TEST(HashBatch, AllocTempStackAsNeeded) {
+  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
+  const int32_t batch_size = static_cast<int32_t>(arr->length());


Suggested change

-  const int32_t batch_size = static_cast<int32_t>(arr->length());
+  const auto batch_size = static_cast<int32_t>(arr->length());

kou · 2024-03-25T04:56:58Z

cpp/src/arrow/compute/key_hash.h

@@ -219,5 +219,24 @@ class ARROW_EXPORT Hashing64 {
                      const uint8_t* keys, uint64_t* hashes);
 };

+template <typename T = uint32_t>
+static int64_t EstimateBatchStackSize(int32_t batch_size) {


Do we need to export this?

Actually, I wanted to unify the logic in Hashing32 and Hashing64, but that seems unnecessary.

ZhangHuiGui · 2024-03-26T14:32:30Z

@kou Thank you for your review!
@westonpace PTAL? This is a refactoring to make the HashBatch-related APIs easier to use for users who need them.

mapleFU · 2024-03-26T14:55:47Z

cc @zanmato1984 if you're interested in this

zanmato1984

The idea of generalizing the hashing APIs is nice. Some suggestions.

zanmato1984 · 2024-03-26T17:06:54Z

cpp/src/arrow/compute/util.cc

  // XXX cannot return a regular Status because most consumers do not either.
-  ARROW_CHECK_LE(new_top, buffer_size_) << "TempVectorStack::alloc overflow";
+  ARROW_DCHECK_OK(CheckAllocOverflow(estimate_size));


We probably should use ARROW_CHECK_OK here?

Yes, you're right. ARROW_DCHECK_OK does not seem to work in NDEBUG mode.

zanmato1984 · 2024-03-26T17:08:29Z

cpp/src/arrow/compute/util.h

 private:
-  int64_t PaddedAllocationSize(int64_t num_bytes) {
+  static int64_t PaddedAllocationSize(int64_t num_bytes) {


Maybe we can align all the function names to either XxxAllocSize or XxxAllocationSize?

zanmato1984 · 2024-03-26T17:18:01Z

cpp/src/arrow/compute/key_hash.cc

  constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
+  const auto alloc_batch_size = std::min(num_rows, max_batch_size);


Maybe it would be clearer if we combined these two lines into const uint32_t max_batch_size = std::min(num_rows, util::MiniBatch::kMiniBatchLength);.

zanmato1984 · 2024-03-26T17:23:36Z

cpp/src/arrow/compute/key_hash.cc

+  const auto alloc_hash_temp_buf =
+      util::TempVectorStack::EstimateAllocSize(alloc_batch_size * sizeof(uint32_t));
+  const auto alloc_for_null_indices_buf =
+      util::TempVectorStack::EstimateAllocSize(alloc_batch_size * sizeof(uint16_t));
+  const auto alloc_size = alloc_hash_temp_buf * 2 + alloc_for_null_indices_buf;


The scope of these three variables doesn't have to be this function, right? We can put them into the if statement below?

Actually, alloc_size is used in both the if and the else branches, so these three variables can't be moved into the if statement.

You are right, thanks.
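The size estimate discussed in this thread can be worked through with a small sketch. The Toy* names are hypothetical, and the padding value (64 bytes) and 8-byte rounding are assumptions made for illustration only; the thread itself only shows that the estimate is the padded size plus two uint64_t of metadata.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration; not taken from this thread.
constexpr int64_t kToyPadding = 64;

// Assumed behavior: round up to a multiple of 8 bytes, then add padding.
inline int64_t ToyPaddedAllocationSize(int64_t num_bytes) {
  assert(num_bytes >= 0);
  return ((num_bytes + 7) / 8) * 8 + kToyPadding;
}

// Same shape as the quoted diff: padded size plus two uint64_t of metadata.
inline int64_t ToyEstimateAllocSize(int64_t num_bytes) {
  return ToyPaddedAllocationSize(num_bytes) +
         2 * static_cast<int64_t>(sizeof(uint64_t));
}

// Two uint32_t hash buffers plus one uint16_t null-index buffer,
// matching the three temporary buffers named in the review.
inline int64_t ToyHashStackSize(int64_t batch_size) {
  const int64_t hash_buf =
      ToyEstimateAllocSize(batch_size * static_cast<int64_t>(sizeof(uint32_t)));
  const int64_t null_indices_buf =
      ToyEstimateAllocSize(batch_size * static_cast<int64_t>(sizeof(uint16_t)));
  return hash_buf * 2 + null_indices_buf;
}
```

Under these assumptions, a batch of 1024 rows needs 2 * (4096 + 64 + 16) + (2048 + 64 + 16) = 10480 bytes.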

zanmato1984 · 2024-03-26T17:27:00Z

cpp/src/arrow/compute/key_hash.cc

+      util::TempVectorStack::EstimateAllocSize(alloc_batch_size * sizeof(uint16_t));
+  const auto alloc_size = alloc_hash_temp_buf * 2 + alloc_for_null_indices_buf;
+
+  std::shared_ptr<util::TempVectorStack> temp_stack(nullptr);


Suggested change

-  std::shared_ptr<util::TempVectorStack> temp_stack(nullptr);
+  auto stack = ctx->stack;
+  std::unique_ptr<util::TempVectorStack> temp_stack(nullptr);

The point is that you don't really have to set the temp_stack pointer into the ctx. A regular temporary variable will do, so you don't have to clear ctx->stack at the end.

Yes, totally agree!
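The pattern agreed on here can be sketched in isolation. ToyStack, ToyContext and UseStack are hypothetical names, not Arrow code. The sketch shows a local pointer that either borrows the caller's stack or points at a function-owned std::unique_ptr, so the context is never modified and nothing has to be reset on exit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for util::TempVectorStack and LightContext.
struct ToyStack {
  std::vector<uint8_t> buffer;
  void Init(int64_t size) { buffer.assign(static_cast<std::size_t>(size), 0); }
};

struct ToyContext {
  ToyStack* stack = nullptr;
};

// Returns the capacity of whichever stack ends up being used.
inline int64_t UseStack(ToyContext* ctx, int64_t alloc_size) {
  assert(ctx != nullptr);
  ToyStack* stack = ctx->stack;         // borrowed, may be null
  std::unique_ptr<ToyStack> temp_stack; // owns the fallback, if one is needed
  if (stack == nullptr) {
    temp_stack = std::make_unique<ToyStack>();
    temp_stack->Init(alloc_size);
    stack = temp_stack.get();
  }
  // All further work goes through `stack`; ctx->stack is left untouched.
  return static_cast<int64_t>(stack->buffer.size());
}
```

Because temp_stack is destroyed when the function returns and ctx->stack was never written, the caller's context cannot end up holding a dangling pointer.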

zanmato1984 · 2024-03-26T17:28:01Z

cpp/src/arrow/compute/key_hash.cc

+    RETURN_NOT_OK(temp_stack->Init(default_memory_pool(), alloc_size));
+    ctx->stack = temp_stack.get();
+  } else {
+    RETURN_NOT_OK(ctx->stack->CheckAllocOverflow(alloc_size));


Suggested change

-    RETURN_NOT_OK(ctx->stack->CheckAllocOverflow(alloc_size));
+    RETURN_NOT_OK(stack->CheckAllocOverflow(alloc_size));

zanmato1984 · 2024-03-26T17:28:13Z

cpp/src/arrow/compute/key_hash.cc


-  auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, max_batch_size);
+  auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);


Suggested change

-  auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);
+  auto hash_temp_buf = util::TempVectorHolder<uint32_t>(stack, alloc_batch_size);

zanmato1984 · 2024-03-26T17:28:23Z

cpp/src/arrow/compute/key_hash.cc

  uint32_t* hash_temp = hash_temp_buf.mutable_data();

-  auto null_indices_buf = util::TempVectorHolder<uint16_t>(ctx->stack, max_batch_size);
+  auto null_indices_buf = util::TempVectorHolder<uint16_t>(ctx->stack, alloc_batch_size);


Suggested change

-  auto null_indices_buf = util::TempVectorHolder<uint16_t>(ctx->stack, alloc_batch_size);
+  auto null_indices_buf = util::TempVectorHolder<uint16_t>(stack, alloc_batch_size);

zanmato1984 · 2024-03-26T17:28:54Z

cpp/src/arrow/compute/key_hash.cc

  uint16_t* null_indices = null_indices_buf.mutable_data();
  int num_null_indices;

-  auto null_hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, max_batch_size);
+  auto null_hash_temp_buf =
+      util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);


Suggested change

-      util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);
+      util::TempVectorHolder<uint32_t>(stack, alloc_batch_size);

zanmato1984 · 2024-03-26T17:29:10Z

cpp/src/arrow/compute/key_hash.cc

+  if (temp_stack) {
+    ctx->stack = nullptr;
+  }


Suggested change

-  if (temp_stack) {
-    ctx->stack = nullptr;
-  }

zanmato1984 · 2024-03-26T17:34:21Z

cpp/src/arrow/compute/key_hash.cc

+
+  std::shared_ptr<util::TempVectorStack> temp_stack(nullptr);
+  if (!ctx->stack) {
+    temp_stack = std::make_shared<util::TempVectorStack>();


Suggested change

-    temp_stack = std::make_shared<util::TempVectorStack>();
+    temp_stack = std::make_unique<util::TempVectorStack>();

ZhangHuiGui · 2024-03-27T14:05:20Z

@zanmato1984 Thank you very much for your suggestion, the code looks clearer!

zanmato1984 · 2024-03-27T15:26:24Z

cpp/src/arrow/compute/key_hash.cc

+Status Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
+                                  LightContext* ctx, uint32_t* hashes) {
+  auto num_rows = static_cast<uint32_t>(cols[0].length());
+  const auto alloc_batch_size =


I would suggest keeping the name max_batch_size. It carries the meaning of how many rows to process in each iteration. In addition, this name is used everywhere in the hash-join-related code, so keeping it would be more consistent with the existing code base. Lastly, it doesn't belong to the same category as the three alloc variables that follow: any alloc variable can be thought of as existing solely to make sure the stack is large enough.

zanmato1984 · 2024-03-27T15:32:28Z

cpp/src/arrow/compute/key_hash_test.cc

+
+  // alloc stack overflow in HashBatch
+  ASSERT_OK(stack.Init(default_memory_pool(), batch_size));
+  ASSERT_NOT_OK(arrow::compute::Hashing32::HashBatch(


Maybe we can use ASSERT_RAISES_WITH_MESSAGE to check the detailed error message.

Yes, I've considered this. But the message contains specific numbers that depend on the internal allocation size, which is inconvenient for future maintenance (for example, if some variables that require stack allocation are removed from HashMultiColumn, this test would need to be modified).

Yeah, you are right. Thanks.

zanmato1984

Some minor suggestions.

zanmato1984 · 2024-03-27T15:39:28Z

cpp/src/arrow/compute/key_hash_test.cc

+  auto ctx = arrow::compute::default_exec_context();
+  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
+
+  // alloc stack by HashBatch internal


Suggested change

-  // alloc stack by HashBatch internal
+  // HashBatch using internally allocated buffer.

zanmato1984 · 2024-03-27T15:41:33Z

cpp/src/arrow/compute/key_hash_test.cc

+  util::TempVectorStack stack;
+  std::vector<uint32_t> h2(batch_size);
+
+  // alloc stack overflow in HashBatch


Suggested change

-  // alloc stack overflow in HashBatch
+  // HashBatch using pre-allocated buffer of insufficient size raises stack overflow.

zanmato1984 · 2024-03-27T15:42:41Z

cpp/src/arrow/compute/key_hash_test.cc

+      exec_batch, h2.data(), temp_column_arrays, ctx->cpu_info()->hardware_flags(),
+      &stack, 0, batch_size));
+
+  // alloc stack normally in HashBatch


Suggested change

-  // alloc stack normally in HashBatch
+  // HashBatch using big enough pre-allocated buffer.

zanmato1984

My one last suggestion :)

zanmato1984 · 2024-03-27T17:15:20Z

cpp/src/arrow/compute/util.cc

+    return Status::Invalid("TempVectorStack alloc overflow. (Actual ", buffer_size_,
+                           "Bytes, expect ", alloc_size, "Bytes)");


Suggested change

-    return Status::Invalid("TempVectorStack alloc overflow. (Actual ", buffer_size_,
-                           "Bytes, expect ", alloc_size, "Bytes)");
+    return Status::Invalid("TempVectorStack allocation overflow: capacity ", buffer_size_, ", current size ", top, ", attempt allocating ", alloc_size);

ZhangHuiGui · 2024-03-28T01:44:26Z

My one last suggestion :)

Thanks!

felipecrv · 2024-03-28T18:39:51Z

cpp/src/arrow/compute/key_hash.cc

+    temp_stack = std::make_unique<util::TempVectorStack>();
+    RETURN_NOT_OK(temp_stack->Init(default_memory_pool(), alloc_size));
+    stack = temp_stack.get();
+  } else {
+    RETURN_NOT_OK(stack->CheckAllocationOverflow(alloc_size));
+  }


If there is a possibility that ctx->stack is nullptr, then it's better to declare a TempVectorStack * parameter explicitly so the caller can allocate a stack with the right memory pool instead of this function internally relying on the global default_memory_pool(). Most calls would be passing ctx, ctx->stack except for the ones that for some reason don't have a stack in the context.

Thanks, you're right. HashBatch is used internally the way you said!

ZhangHuiGui · 2024-03-30T05:40:07Z

@pitrou PTAL!
The new commit includes two things:

Move the key_hash/key_map/light_array code to internal. It also seems unnecessary to put it in an internal namespace; the goal is just to prevent users from calling it.
A simple refactor to simplify some code in TempVectorStack.

westonpace

I agree these files were meant to be internal. Thanks for cleaning this up :)

pitrou · 2024-04-02T12:40:10Z

@github-actions crossbow submit -g cpp

cpp/src/arrow/acero/schema_util.h

pitrou · 2024-04-02T13:08:13Z

@github-actions crossbow submit -g cpp

pitrou · 2024-04-02T14:41:39Z

@github-actions crossbow submit -g cpp

github-actions · 2024-04-02T14:44:09Z

Revision: 0722f88

Submitted crossbow builds: ursacomputing/crossbow @ actions-5141a1fc14

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

pitrou · 2024-04-02T16:01:41Z

CI failures are unrelated, I'll merge.

conbench-apache-arrow · 2024-04-03T08:47:16Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 8163d02.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them.

… to internal for prevent using by users (apache#40484)

### Rationale for this change

These files expose implementation details and APIs that are not meant for third-party use. This PR explicitly marks them internal, which also avoids having them installed.

### Are these changes tested?

By existing builds and tests.

### Are there any user-facing changes?

No, except hiding some header files that were not supposed to be included externally.

* GitHub Issue: apache#40431

Lead-authored-by: ZhangHuiGui <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

ZhangHuiGui force-pushed the try-fix-40431 branch from 560de84 to c5dd7be Compare March 13, 2024 09:20

ZhangHuiGui requested a review from westonpace as a code owner March 26, 2024 14:26

ZhangHuiGui force-pushed the try-fix-40431 branch from 6903766 to 6cff6f2 Compare March 26, 2024 14:50

zanmato1984 requested changes Mar 26, 2024

ZhangHuiGui force-pushed the try-fix-40431 branch from b8db90c to 8f262af Compare March 28, 2024 01:43

mark key_hash/key_map/light_array to internal and some simple refactor

2cc866f

ZhangHuiGui force-pushed the try-fix-40431 branch from b13eeb5 to 2cc866f Compare March 30, 2024 05:32

ZhangHuiGui changed the title ~~GH-40431: [C++] Try to check/alloc the TempVectorStack size as HashBatch needed~~ GH-40431: [C++] Move key_hash/key_map/light_array related files to internal for prevent using by users Mar 30, 2024

westonpace approved these changes Apr 2, 2024

pitrou requested changes Apr 2, 2024

ZhangHuiGui and others added 2 commits April 2, 2024 21:00

fix

e15d1e5

Avoid DCHECK in non-internal header file

5180559

Merge branch 'main' into try-fix-40431

0722f88

pitrou merged commit 8163d02 into apache:main Apr 2, 2024
34 of 35 checks passed

pitrou removed the awaiting merge Awaiting merge label Apr 2, 2024

pitrou mentioned this pull request Apr 2, 2024

[C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently #40431

Closed

		const int64_t alloc_size1 =
		2 * (alloc_entry_length * sizeof(uint32_t) + util::TempVectorStack::meta_size());

		const int64_t alloc_size2 =
		alloc_entry_length * sizeof(uint16_t) + util::TempVectorStack::meta_size();

	const int64_t buffer_size() const { return buffer_size_; }
	int64_t buffer_size() const { return buffer_size_; }

		const uint32_t alloc_batch_size = std::min(num_rows, max_batch_size);
		const int64_t estimate_alloc_size = EstimateBatchStackSize<uint32_t>(alloc_batch_size);

		@@ -35,7 +35,7 @@ void TempVectorStack::alloc(uint32_t num_bytes, uint8_t** data, int* id) {
		int64_t new_top = top_ + PaddedAllocationSize(num_bytes) + 2 * sizeof(uint64_t);
		// Stack overflow check (see GH-39582).

	ARROW_DCHECK_LE(estimate_alloc_size, buffer_size_)
	ARROW_DCHECK_LE(top_ + alloc_size, buffer_size_)

	const int32_t batch_size = static_cast<int32_t>(arr->length());
	const auto batch_size = static_cast<int32_t>(arr->length());

		constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
		const auto alloc_batch_size = std::min(num_rows, max_batch_size);

	std::shared_ptr<util::TempVectorStack> temp_stack(nullptr);
	auto stack = ctx->stack;
	std::unique_ptr<util::TempVectorStack> temp_stack(nullptr);

	RETURN_NOT_OK(ctx->stack->CheckAllocOverflow(alloc_size));
	RETURN_NOT_OK(stack->CheckAllocOverflow(alloc_size));


		auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, max_batch_size);
		auto hash_temp_buf = util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);

	auto null_indices_buf = util::TempVectorHolder<uint16_t>(ctx->stack, alloc_batch_size);
	auto null_indices_buf = util::TempVectorHolder<uint16_t>(stack, alloc_batch_size);

	util::TempVectorHolder<uint32_t>(ctx->stack, alloc_batch_size);
	util::TempVectorHolder<uint32_t>(stack, alloc_batch_size);

	temp_stack = std::make_shared<util::TempVectorStack>();
	temp_stack = std::make_unique<util::TempVectorStack>();

	// alloc stack by HashBatch internal
	// HashBatch using internally allocated buffer.

	// alloc stack overflow in HashBatch
	// HashBatch using pre-allocated buffer of insufficient size raises stack overflow.

	// alloc stack normally in HashBatch
	// HashBatch using big enough pre-allocated buffer.

		return Status::Invalid("TempVectorStack alloc overflow. (Actual ", buffer_size_,
		"Bytes, expect ", alloc_size, "Bytes)");

GH-40431: [C++] Move key_hash/key_map/light_array related files to internal for prevent using by users #40484

GH-40431: [C++] Move key_hash/key_map/light_array related files to internal for prevent using by users #40484

Conversation

ZhangHuiGui commented Mar 12, 2024 • edited by github-actions bot Loading

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

ZhangHuiGui commented Mar 12, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ZhangHuiGui commented Mar 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ZhangHuiGui commented Mar 26, 2024

mapleFU commented Mar 26, 2024

zanmato1984 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zanmato1984 Mar 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ZhangHuiGui commented Mar 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zanmato1984 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zanmato1984 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zanmato1984 left a comment

Choose a reason for hiding this comment

ZhangHuiGui commented Mar 28, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ZhangHuiGui commented Mar 30, 2024

westonpace left a comment

Choose a reason for hiding this comment

pitrou commented Apr 2, 2024

ZhangHuiGui commented Mar 12, 2024 •

edited by github-actions bot

Loading

zanmato1984 Mar 26, 2024 •

edited

Loading