Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] Supports write BinaryView/StringView to Parquet file #43244

Open
mapleFU opened this issue Jul 13, 2024 · 7 comments
Open

[C++][Parquet] Supports write BinaryView/StringView to Parquet file #43244

mapleFU opened this issue Jul 13, 2024 · 7 comments

Comments

@mapleFU
Copy link
Member

mapleFU commented Jul 13, 2024

Describe the enhancement requested

This part requires write arrow's StringView/BinaryView/LargeStringView/LargeBinaryView to Parquet file.

The Parquet library has the layers below:

  1. arrow/parquet: which accepts the input "Array" and able to write them to the underlying Parquet api
  2. parquet: the core Parquet libs

In this part, we should:

  1. Check and test View types in parquet/arrow/schema module, and add tests. Dictionary<View> might also be test
  2. Update TypedColumnWriterImpl<ByteArrayType>::WriteArrowDense and allowing it write view type
  3. Allowing DeltaLengthByteArrayEncoder<DType>::Put, DeltaByteArrayEncoder<DType>::Put, DictEncoderImpl<ByteArrayType>::Put and PlainEncoder<ByteArrayType>::Put supports view type

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Jul 13, 2024

Besides, can we separate it into two tasks:

  • Support Put for encoder
  • Support put and encoding schema in the Arrow Parquet layer

I think this making finishing the task more smothly

@IndifferentArea
Copy link
Contributor

IndifferentArea commented Jul 17, 2024

I'm currently trying on part of this feature.
I found each encoder uses an assert function AssertBaseBinary which wraps on is_base_binary_like

void AssertBaseBinary(const ::arrow::Array& values) {
if (!::arrow::is_base_binary_like(values.type_id())) {
throw ParquetException("Only BaseBinaryArray and subclasses supported");
}
}

However, this trait will return false on {String, Binary}ViewArray

/// \brief Check for a base-binary-like type
///
/// This predicate doesn't match fixed-size binary types and will otherwise
/// match all binary- and string-like types regardless of offset width.
///
/// \param[in] type_id the type-id to check
/// \return whether type-id is a base-binary-like type one
constexpr bool is_base_binary_like(Type::type type_id) {
switch (type_id) {
case Type::BINARY:
case Type::LARGE_BINARY:
case Type::STRING:
case Type::LARGE_STRING:
return true;
default:
break;
}
return false;
}

I didn't find a trait that represents is_base_binary_like + {String, Binary}View or just represents {String, Binary}View, do we need a new trait? @mapleFU @felipecrv

@mapleFU
Copy link
Member Author

mapleFU commented Jul 17, 2024

@IndifferentArea Nice catch, I think we can do some changes like:

 void AssertBaseBinary(const ::arrow::Array& values) { 
   if (!::arrow::is_base_binary_like(values.type_id())) { 
     throw ParquetException("Only BaseBinaryArray and subclasses supported"); 
   } 
 } 

void AssertViewBinary(const ::arrow::Array& values);

Then dispatch by view or binary type?

@IndifferentArea
Copy link
Contributor

IndifferentArea commented Jul 17, 2024

For now, I implement a local is_binary_view_like trait in encoding.cc.

bool is_binary_view_like(::arrow::Type::type type_id){
  return type_id == ::arrow::Type::BINARY_VIEW || type_id == ::arrow::Type::STRING_VIEW;
}

But I still think this trait may be necessary in type_traits.h. eg, in current implementation, BinaryArray's constructor is implemented as:

BinaryArray::BinaryArray(const std::shared_ptr<ArrayData>& data) {
ARROW_CHECK(is_binary_like(data->type->id()));
SetData(data);
}

However, BinaryViewArray's constructor is implemented as:
BinaryViewArray::BinaryViewArray(std::shared_ptr<ArrayData> data) {
ARROW_CHECK_EQ(data->type->id(), Type::BINARY_VIEW);
SetData(std::move(data));
}

As a result, we can't build a BinaryViewArray from StringViewArray. Comparing to BinaryArray and StringArray, I think it's not expected?

@mapleFU
Copy link
Member Author

mapleFU commented Jul 17, 2024

As a result, we can't build a BinaryViewArray from StringViewArray.

Lol, this is funny but StringViewArray just call default ctor in BinaryViewArray and construct like below:

StringViewArray::StringViewArray(std::shared_ptr<ArrayData> data) {
  ARROW_CHECK_EQ(data->type->id(), Type::STRING_VIEW);
  SetData(std::move(data));
}

@mapleFU
Copy link
Member Author

mapleFU commented Jul 17, 2024

#42247
@felipecrv Hi, just wonder during our testing, currently we don't support reading string_views from arrow, so maybe "cast" or something is required for testing. Would we have cast or there is some workaround?

@felipecrv
Copy link
Contributor

@mapleFU here is my draft PR: #43302

Blocked on dictionary casts. I will see if I can make it pass the tests and keep dictionary casts as a TODO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants