Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for StringView / character functions, fix regexp_like and regexp_match to work with StringView #11753

Merged
merged 7 commits into from
Aug 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 3 additions & 6 deletions datafusion/functions/src/regex/regexplike.rs
Original file line number Diff line number Diff line change
Expand Up @@ -75,13 +75,10 @@ impl ScalarUDFImpl for RegexpLikeFunc {
use DataType::*;

Ok(match &arg_types[0] {
LargeUtf8 | Utf8 => Boolean,
Null => Null,
other => {
return plan_err!(
"The regexp_like function can only accept strings. Got {other}"
);
}
// Type coercion is done by DataFusion based on signature, so if we
// get here, the first argument is always a string
_ => Boolean,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, regexp_like fails with an error about unsupported types if passed a StringView

With this change it succeeds (though the StringView is cast to a String)

@tshauck and I are going to organize a bunch of changes to get StringView working across functions as part of #11790

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we return error for non-string type?

LargeUtf8 | Utf8 | Utf8View => Boolean
Null => Null
other => not supported

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the type coercion already handles, in theory, ensuring the correct types are passed in, so the extra check was not necessary. I can put back the explicit check if you prefer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems #11618 is striking again. I think @alamb is right, but it is very counter intuitive to return a Boolean because that case shouldn't exists.

Should we use 'unreachable!' ?

Copy link
Contributor

@jayzhan211 jayzhan211 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the type coercion already handles, in theory, ensuring the correct types are passed in, so the extra check was not necessary. I can put back the explicit check if you prefer

I think you are right, we don't need another type check in return_type.

it is very counter intuitive to return a Boolean because that case shouldn't exists.

The type check should be handled in signature before return_type. We could improve docs about how the types and length check are handled in UDF/UDAF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about I'll at least add some comments to make it clear why this isn't doing the typechecks here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use 'unreachable!' ?

I prefer using internal error when possible as unreachable! will panic. But now that you mention this, the function is going to panic at runtime anyways 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad I mixed up two things. In the thread about better error handling I think we agreed to return a planning error in this scenario, and some form of unreachable within the invoke function to signal that branch should never be executed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments in ead4a7c

})
}
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
Expand Down
12 changes: 2 additions & 10 deletions datafusion/functions/src/regex/regexpmatch.rs
Original file line number Diff line number Diff line change
Expand Up @@ -74,17 +74,9 @@ impl ScalarUDFImpl for RegexpMatchFunc {
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;

Ok(match &arg_types[0] {
LargeUtf8 => List(Arc::new(Field::new("item", LargeUtf8, true))),
Utf8 => List(Arc::new(Field::new("item", Utf8, true))),
Null => Null,
other => {
return plan_err!(
"The regexp_match function can only accept strings. Got {other}"
);
}
DataType::Null => DataType::Null,
other => DataType::List(Arc::new(Field::new("item", other.clone(), true))),
})
}
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
Expand Down
2 changes: 1 addition & 1 deletion datafusion/functions/src/unicode/character_length.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ impl CharacterLengthFunc {
Self {
signature: Signature::uniform(
1,
vec![Utf8, LargeUtf8],
vec![Utf8, LargeUtf8, Utf8View],
Volatility::Immutable,
),
aliases: vec![String::from("length"), String::from("char_length")],
Expand Down
Loading