Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40695 [C++] Expand Substrait type support #40696

Merged
merged 8 commits into from
Apr 11, 2024

Conversation

westonpace
Copy link
Member

@westonpace westonpace commented Mar 20, 2024

Rationale for this change

See #40695

What changes are included in this PR?

This PR does a few things:

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like to_char(my_timestamp, "America/New York") instead of to_char(cast(my_timestamp, timestamp("nanos", "America/New York").

However, the current implementation makes it impossible to produce or consume a plan with to_char(my_timestamp, "America/New York") because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of DICTIONARY<INT32> will convert to INT32. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be.

Are these changes tested?

Yes, I added new unit tests

Are there any user-facing changes?

Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".

Copy link

⚠️ GitHub issue #40695 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

format/substrait/extension_types.yaml Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/extension_set.cc Outdated Show resolved Hide resolved
format/substrait/extension_types.yaml Outdated Show resolved Hide resolved
python/pyarrow/tests/test_substrait.py Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/type_internal.cc Outdated Show resolved Hide resolved
# of values. For example, unsigned integer types are very similar to their integer
# counterparts, but have a different range of values. These types are defined here
# as extension types.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why large_binary should be an extension type but binary_view should only be an encoding? I think it'd provide a useful guide for future authors who need to pick where to put a type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I explain this above (I have updated the wording slightly)?

# Certain Arrow data types are, from Substrait's point of view, encodings.
# These include dictionary, the view types (e.g. binary view, list view),
# and REE.
#
# These types are not logically distinct from the type they are encoding.
# Specifically, the types meet the following criteria:
#  *  There is no value in the decoded type that cannot be represented
#     as a value in the encoded type and vice versa.
#  *  Functions have the same meaning when applied to the encoded type
#
# Note: if two types have a different range (e.g. string and large_string) then
# they do not satisfy the above criteria and are not encodings. 
#
# These types will never have a Substrait equivalent.  In the Substrait point
# of view these are execution details.

So large_string and string are different types because concat(<string-with-2B-characters>, 'x') will have a different output for string and large_string (it will output an error given string and a valid value given large_string). However, there are no possible inputs that could lead to a different function output between string and string_view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks. I had forgotten that substrait specifies that strings may not be longer than 2GB.

cpp/src/arrow/engine/substrait/expression_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/extension_set.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes awaiting change review Awaiting change review labels Mar 21, 2024
@westonpace westonpace requested a review from bkietz March 22, 2024 14:28
@westonpace westonpace force-pushed the feat/expand-substrait-type-support branch from 67839a1 to 962b2a8 Compare April 8, 2024 13:57
@raulcd raulcd modified the milestones: 16.0.0, 17.0.0 Apr 8, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Apr 9, 2024
Copy link
Contributor

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks.

google::protobuf::UInt64Value value;
if (!user_defined_->value().UnpackTo(&value)) {
return Status::Invalid(
"Failed to unpack user defined integer literal to UInt64Value");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a function to create this Status taking arguments like "integer literal" and "UInt64Value" would reduce the string literal bloat in the binary because "Failed to unpack user defined " wouldn't have to be inlined so many times in the literals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding ARROW_PREDICT_FALSE to the condition would also reduce the inlining of code in the block because it becomes cold code code for the compiler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return EncodeUserDefined(*s.type, value);
}
Status Visit(const UInt32Scalar& s) {
google::protobuf::UInt64Value value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be UInt32Value? And what about the 16-bit version above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In protobuf all unsigned integers are encoded the same way and there is no uint16 or uint8 so I figured it would be simplest and most consistent to just use uint64 for everything. It's something of an arbitrary decision and, even if the receiver decodes it as a uint32, it will still work.

Comment on lines +817 to +822
auto user_defined = std::make_unique<Lit::UserDefined>();
user_defined->set_type_reference(anchor);
auto value_any = std::make_unique<google::protobuf::Any>();
value_any->PackFrom(value);
user_defined->set_allocated_value(value_any.release());
lit_->set_allocated_user_defined(user_defined.release());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use destination-passing style to avoid extra allocations with protobuf types:

    auto *user_defined = lit_->mutable_user_defined();
    user_defined->set_type_reference(anchor);
    value.PackTo(user_defined->mutable_value());

A change in style for the whole file really, so not a blocker for this PR specifically.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpp/src/arrow/engine/substrait/type_internal.cc Outdated Show resolved Hide resolved
@westonpace westonpace force-pushed the feat/expand-substrait-type-support branch from 962b2a8 to ab8237a Compare April 11, 2024 02:06
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Apr 11, 2024
@westonpace
Copy link
Member Author

Thanks @felipecrv

The whole-file optimizations I've added follow-up PRs for. I do plan on adding support for parameterized types (e.g. decimal256, etc.) in the next release and I'll take a stab at those when I do that. Otherwise, I believe I have addressed your comments and will merge on green.

@felipecrv
Copy link
Contributor

Thanks @felipecrv

The whole-file optimizations I've added follow-up PRs for. I do plan on adding support for parameterized types (e.g. decimal256, etc.) in the next release and I'll take a stab at those when I do that. Otherwise, I believe I have addressed your comments and will merge on green.

Sounds great!

@westonpace westonpace merged commit 7f64fff into apache:main Apr 11, 2024
39 checks passed
@westonpace westonpace removed the awaiting changes Awaiting changes label Apr 11, 2024
@jorisvandenbossche jorisvandenbossche modified the milestones: 17.0.0, 16.0.0 Apr 11, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7f64fff.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 11 possible false positives for unstable benchmarks that are known to sometimes produce them.

raulcd pushed a commit that referenced this pull request Apr 11, 2024
### Rationale for this change

See #40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: #40695

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

See apache#40695 

### What changes are included in this PR?

This PR does a few things:

 * Substrait is upgraded to the latest version
 * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611).
 * Support is added for the following arrow-specific types:
   * fp16
   * date_millis
   * time_seconds
   * time_millis
   * time_nanos
   * large_string
   * large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column.  Substrait puts time zone information in the function and not the type.  In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`.

However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone.  With this latest change, we treat any non-empty timezone as a timezone_tz type.

In addition, I have enabled conversions from "encoded types" to their unencoded representation.  E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`.  At a logical expression / plan perspective these encodings are irrelevant.  If anything, they may belong in a more physical plan representation.  Should a need for them arise we can dig into it more later.  However, I believe it is better to err on the side of generating "something" rather than failing in these cases.  I don't consider this last change critical and can back it out if need be.

### Are these changes tested?

Yes, I added new unit tests

### Are there any user-facing changes?

Yes, via the Substrait conversion.  These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".
* GitHub Issue: apache#40695

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants