GH-42247: [C++] Support casting to and from utf8_view/binary_view #43302

Merged
merged 26 commits into from
Sep 12, 2024

felipecrv
Contributor

@felipecrv felipecrv commented Jul 17, 2024

Rationale for this change

We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent.

What changes are included in this PR?

  • Add is_binary_view_like() type predicate
  • Add BinaryViewTypes() list including STRING_VIEW/BINARY_VIEW
  • New cast kernels
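
As a rough illustration of what a predicate like `is_binary_view_like()` does, here is a minimal, self-contained sketch. The `TypeId` enum and the `IsBinaryViewLike`, `IsOffsetBinaryLike`, and `IsStringLikeCast` helpers are hypothetical stand-ins written for this example. They are not the Arrow declarations, which live in `cpp/src/arrow/type_traits.h` and are not reproduced on this page.

```cpp
#include <cstdint>

// Hypothetical stand-in for a type-id enum; only the ids needed here are listed.
enum class TypeId : std::uint8_t {
  kString,
  kBinary,
  kLargeString,
  kLargeBinary,
  kStringView,
  kBinaryView,
  kInt32
};

// Sketch of a view-type predicate: true only for the two view-based types.
constexpr bool IsBinaryViewLike(TypeId id) {
  return id == TypeId::kStringView || id == TypeId::kBinaryView;
}

// Sketch of the offset-based counterpart (string/binary and their large variants).
constexpr bool IsOffsetBinaryLike(TypeId id) {
  return id == TypeId::kString || id == TypeId::kBinary ||
         id == TypeId::kLargeString || id == TypeId::kLargeBinary;
}

// A cast is "string-like to string-like" when both sides fall in either family.
constexpr bool IsStringLikeCast(TypeId from, TypeId to) {
  return (IsBinaryViewLike(from) || IsOffsetBinaryLike(from)) &&
         (IsBinaryViewLike(to) || IsOffsetBinaryLike(to));
}

static_assert(IsBinaryViewLike(TypeId::kStringView), "string view is view-like");
static_assert(!IsBinaryViewLike(TypeId::kString), "offset-based string is not view-like");
```

Keeping the predicates `constexpr` lets kernel-dispatch code branch on them at compile time, which is presumably why such helpers are written as small type-id checks.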

Are these changes tested?

Yes, but test coverage might be improved.

Are there any user-facing changes?

More casts are available.

@felipecrv felipecrv changed the title GH-43010: [C++] Support casting to and from utf8_view/binary_view GH-42247: [C++] Support casting to and from utf8_view/binary_view Jul 17, 2024
@felipecrv felipecrv marked this pull request as ready for review July 21, 2024 17:48
@felipecrv felipecrv requested a review from bkietz July 21, 2024 18:47
cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated, resolved)
cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated, resolved)
BinaryViewType::kPrefixSize);
// out_view.ref.buffer_index = 0;
out_view.ref.offset = static_cast<int32_t>(data_offset);
// TODO(felipecrv): validate data_offsets can't overflow
Member

Is this a TODO for this PR? Otherwise, perhaps create a GH issue for it.

Contributor Author

Fixing it now.

Contributor Author

@mapleFU this one needs the same fix I added at line 477 (`// Check against offset overflow`). I forgot that there were two places with this TODO.
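
For readers following the thread, the concern is that `static_cast<int32_t>(data_offset)` silently wraps when the 64-bit offset does not fit in the 32-bit view field. The fix actually committed (the "Check against offset overflow" at line 477) is not shown on this page, so the following is only a minimal sketch of that kind of guard. `CheckedViewOffset` is a hypothetical helper name, not an Arrow function.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: narrow a 64-bit data offset to a 32-bit view offset.
// Returns false (and leaves *out untouched) when the value would not fit.
inline bool CheckedViewOffset(std::int64_t data_offset, std::int32_t* out) {
  if (data_offset < 0 ||
      data_offset > std::numeric_limits<std::int32_t>::max()) {
    return false;
  }
  *out = static_cast<std::int32_t>(data_offset);
  return true;
}
```

In a kernel, a `false` result would typically be turned into an error status instead of writing a wrapped offset into the output view.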

cpp/src/arrow/type_traits.h (resolved)
Member

@mapleFU mapleFU left a comment

Great work! The memset to 0 really handles a tricky problem in the protocol layer. Thanks for your effort!
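
One plausible reading of the comment above is that zero-filling a view before writing its fields keeps the unused padding bytes deterministic, so two views holding the same short string are byte-for-byte identical. The sketch below illustrates that idea only. The `View` struct, `MakeInlineView`, and `ViewsBitwiseEqual` are hypothetical names for this example, and the 16-byte layout (4-byte size followed by up to 12 inline bytes) is an assumption taken from the Arrow columnar format description, not from the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

constexpr std::size_t kViewSize = 16;    // assumed total size of one view
constexpr std::size_t kInlineSize = 12;  // assumed maximum inline payload

// Hypothetical raw view: 4-byte size followed by inline data and padding.
struct View {
  unsigned char bytes[kViewSize];
};

// Builds an inline view for a short string. Returns false if it does not fit.
inline bool MakeInlineView(const std::string& s, View* out) {
  if (s.size() > kInlineSize) return false;
  // Zero everything first so padding after the payload is always 0.
  std::memset(out->bytes, 0, sizeof(out->bytes));
  const std::int32_t size = static_cast<std::int32_t>(s.size());
  std::memcpy(out->bytes, &size, sizeof(size));
  std::memcpy(out->bytes + sizeof(size), s.data(), s.size());
  return true;
}

// With zeroed padding, equality of inline views reduces to a memcmp.
inline bool ViewsBitwiseEqual(const View& a, const View& b) {
  return std::memcmp(a.bytes, b.bytes, kViewSize) == 0;
}
```

Without the `memset`, the trailing bytes of `out->bytes` would hold whatever was in memory before, and the bytewise comparison could report two equal strings as different.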

cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated, resolved)
@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Jul 23, 2024
Member

@bkietz bkietz left a comment

Looking good, just a few nits

cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated, resolved)
cpp/src/arrow/type_traits.h (resolved)
cpp/src/arrow/visit_data_inline.h (outdated, resolved)
@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 5, 2024
@felipecrv felipecrv requested a review from pitrou August 6, 2024 00:10
@github-actions github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Aug 6, 2024
InitializeUTF8();
ArraySpanVisitor<I> visitor;
Utf8Validator validator;
RETURN_NOT_OK(visitor.Visit(input, &validator));
Member

(Unrelated to this PR: this reminds me of the UTF-8 checking in arrow-rs. Maybe we can use the same algorithm? apache/arrow-rs#6009)

Member

That sounds like a good [Parquet] issue to open

Member

Oh, I think I see what you mean: we could similarly assemble larger contiguous byte ranges on which we run a single Utf8 validation pass.

For the common case of views whose out-of-line data directly follows the previous out-of-line bytes, this would yield one long byte range for Utf8 validation.

Runs of inline views could be validated the same way: the 4-byte size field is one small (ASCII-range) byte plus 3 zero bytes, followed by the inline data and zero padding bytes, so a run of inline views is valid Utf8 as a whole exactly when the inline data in it is.
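
The idea in the comments above can be sketched as follows. This is not code from the PR or from arrow-rs. `ViewRef`, `IsValidUtf8`, and `ValidateViewsCoalesced` are hypothetical names, and the views are assumed to point into a single data buffer. One detail is added here as an assumption: validating a merged range is only equivalent to validating each string if every string also starts on a character boundary, so the sketch rejects any view whose first byte is a UTF-8 continuation byte.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical out-of-line view: bytes [offset, offset + size) of one buffer.
struct ViewRef {
  std::size_t offset;
  std::size_t size;
};

// Minimal UTF-8 validator (rejects overlong forms, surrogates, > U+10FFFF).
inline bool IsValidUtf8(const unsigned char* p, std::size_t n) {
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = p[i];
    if (c < 0x80) { ++i; continue; }
    std::size_t len = 0;
    std::uint32_t cp = 0, min = 0;
    if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else return false;
    if (n - i < len) return false;
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    i += len;
  }
  return true;
}

// Merges runs of views whose bytes directly follow the previous view's bytes
// and runs one validation pass per merged range. *num_passes (optional)
// receives the number of passes when validation succeeds.
inline bool ValidateViewsCoalesced(const std::string& buffer,
                                   const std::vector<ViewRef>& views,
                                   std::size_t* num_passes) {
  const auto* data = reinterpret_cast<const unsigned char*>(buffer.data());
  std::size_t passes = 0;
  std::size_t i = 0;
  while (i < views.size()) {
    const std::size_t begin = views[i].offset;
    std::size_t end = begin;
    while (i < views.size() && views[i].offset == end) {
      const ViewRef& v = views[i];
      // Bounds check before touching the buffer.
      if (v.offset > buffer.size() || v.size > buffer.size() - v.offset) return false;
      // Each string must begin on a character boundary.
      if (v.size > 0 && (data[v.offset] & 0xC0) == 0x80) return false;
      end += v.size;
      ++i;
    }
    if (!IsValidUtf8(data + begin, end - begin)) return false;
    ++passes;
  }
  if (num_passes != nullptr) *num_passes = passes;
  return true;
}
```

When views are laid out in buffer order, the whole column collapses into a single validation pass. Views that jump around the buffer simply produce more, shorter passes.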

cpp/src/arrow/compute/kernels/scalar_cast_string.cc (outdated, resolved)
Member

pitrou commented Sep 12, 2024

@github-actions crossbow submit -g cpp

Revision: 2416d19

Submitted crossbow builds: ursacomputing/crossbow @ actions-aaf4a45dfc

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

Member

pitrou commented Sep 12, 2024

CI failures are unrelated, I'll merge

@pitrou pitrou merged commit 85fc3eb into apache:main Sep 12, 2024
38 of 39 checks passed
@pitrou pitrou removed the awaiting change review label Sep 12, 2024
@felipecrv felipecrv deleted the str2str_casts branch September 12, 2024 18:24
Member

mapleFU commented Sep 13, 2024

Thanks!

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 85fc3eb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 97 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…ew (apache#43302)

* GitHub Issue: apache#42247

Lead-authored-by: Felipe Oliveira Carvalho
Co-authored-by: mwish
Signed-off-by: Antoine Pitrou