GH-40695 [C++] Expand Substrait type support #40696

westonpace · 2024-03-20T23:51:33Z

Rationale for this change

What changes are included in this PR?

This PR does a few things:

- Substrait is upgraded to the latest version.
- Support is added for the parameterized timestamp type (but not for literals, because of substrait-io/substrait#611, "If a precision timestamp literal is encountered there is no way to know the precision").
- Support is added for the following Arrow-specific types:
  - fp16
  - date_millis
  - time_seconds
  - time_millis
  - time_nanos
  - large_string
  - large_binary

When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like to_char(my_timestamp, "America/New York") instead of to_char(cast(my_timestamp, timestamp("nanos", "America/New York"))).

However, the current implementation makes it impossible to produce or consume a plan with to_char(my_timestamp, "America/New York"), because the type would be rejected for having a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timestamp_tz type.
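
To illustrate the relaxed rule, here is a minimal standalone sketch. It does not use the actual Arrow or Substrait APIs: `SketchTimestampKind` and `ClassifyTimestamp` are hypothetical names invented for this example, and the real conversion in the PR may be structured differently.

```cpp
#include <string>

// Hypothetical stand-in for the two Substrait timestamp flavors.
// This is NOT a type from Arrow or Substrait.
enum class SketchTimestampKind { kTimestamp, kTimestampTz };

// Hypothetical helper: an empty Arrow time zone maps to the plain timestamp
// kind; any non-empty time zone (UTC or not) maps to the tz-aware kind.
// The zone name is not carried in the Substrait type, since Substrait
// expects it to be passed to the function instead.
inline SketchTimestampKind ClassifyTimestamp(const std::string& arrow_timezone) {
  return arrow_timezone.empty() ? SketchTimestampKind::kTimestamp
                                : SketchTimestampKind::kTimestampTz;
}
```

Under this sketch, a column with time zone "UTC" and a column with time zone "America/New_York" both classify as the tz-aware kind, and neither is rejected.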

In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of DICTIONARY<INT32> will convert to INT32. From a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be.
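
The idea of dropping an encoding can be sketched as follows. This is only an illustration with hypothetical names (`SketchType`, `StripEncodings`); it does not use Arrow's `DataType` classes, and unwrapping nested encodings in a loop is an assumption of the sketch rather than something stated in the PR.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified type description. NOT Arrow's DataType.
struct SketchType {
  std::string name;  // e.g. "int32" or "dictionary"
  // Non-null only when this type is an encoding of another type.
  std::shared_ptr<SketchType> encoded_value_type;
};

// Hypothetical helper: follow the chain of encodings until a type that is
// not an encoding is reached, and return that type.
inline const SketchType& StripEncodings(const SketchType& type) {
  const SketchType* current = &type;
  while (current->encoded_value_type != nullptr) {
    current = current->encoded_value_type.get();
  }
  return *current;
}
```

With this sketch, a `SketchType` named "dictionary" whose `encoded_value_type` is "int32" resolves to "int32", while a plain "int32" is returned unchanged.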

Are these changes tested?

Yes, I added new unit tests.

Are there any user-facing changes?

Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported".

GitHub Issue: [C++] Add Substrait support for arrow-specific types (non-parameterized) #40695

github-actions · 2024-03-20T23:51:59Z

⚠️ GitHub issue #40695 has been automatically assigned in GitHub to PR creator.

bkietz

Thanks for working on this!

format/substrait/extension_types.yaml

cpp/src/arrow/engine/substrait/extension_set.cc

python/pyarrow/tests/test_substrait.py

cpp/src/arrow/engine/substrait/type_internal.cc

bkietz · 2024-03-21T13:02:35Z

format/substrait/extension_types.yaml

+# of values.  For example, unsigned integer types are very similar to their integer
+# counterparts, but have a different range of values.  These types are defined here
+# as extension types.
+#


Could you explain why large_binary should be an extension type but binary_view should only be an encoding? I think it'd provide a useful guide for future authors who need to pick where to put a type.

I think I explain this above (I have updated the wording slightly)?

# Certain Arrow data types are, from Substrait's point of view, encodings.
# These include dictionary, the view types (e.g. binary view, list view),
# and REE.
#
# These types are not logically distinct from the type they are encoding.
# Specifically, the types meet the following criteria:
#  * There is no value in the decoded type that cannot be represented
#    as a value in the encoded type and vice versa.
#  * Functions have the same meaning when applied to the encoded type
#
# Note: if two types have a different range (e.g. string and large_string) then
# they do not satisfy the above criteria and are not encodings.
#
# These types will never have a Substrait equivalent.  In the Substrait point
# of view these are execution details.

So large_string and string are different types because concat(<string-with-2B-characters>, 'x') will have a different output for string and large_string (it will output an error given string and a valid value given large_string). However, there are no possible inputs that could lead to a different function output between string and string_view.
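
A rough sketch of the concat example, assuming Arrow's layout in which string uses 32-bit offsets and large_string uses 64-bit offsets. `ConcatFits` is a hypothetical helper written for this illustration, not an Arrow function, and it assumes non-negative byte counts.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: reports whether a concatenated value of
// left_bytes + right_bytes bytes can be addressed by offsets of type OffsetT.
// Both arguments are assumed to be non-negative.
template <typename OffsetT>
constexpr bool ConcatFits(std::int64_t left_bytes, std::int64_t right_bytes) {
  // Written as a subtraction so the check itself cannot overflow.
  return left_bytes <= std::numeric_limits<OffsetT>::max() - right_bytes;
}
```

Here `ConcatFits<std::int32_t>(2147483647, 1)` is false (the string case errors), while `ConcatFits<std::int64_t>(2147483647, 1)` is true (the large_string case succeeds), so the same function call has different outcomes for the two types.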

Makes sense, thanks. I had forgotten that Substrait specifies that strings may not be longer than 2GB.

cpp/src/arrow/engine/substrait/extension_set.cc