-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Support Dict types in in_list
physical plans
#10031
Changes from all commits
f8e5cdf
70f5481
9c279f7
5e593bb
07a95b7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr { | |
} | ||
} | ||
|
||
/// Checks if two types are logically equal, dictionary types are compared by their value types. | ||
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool { | ||
match (lhs, rhs) { | ||
(DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => { | ||
v1.as_ref().eq(v2.as_ref()) | ||
} | ||
(DataType::Dictionary(_, l), _) => l.as_ref().eq(rhs), | ||
(_, DataType::Dictionary(_, r)) => lhs.eq(r.as_ref()), | ||
_ => lhs.eq(rhs), | ||
} | ||
} | ||
|
||
/// Creates a unary expression InList | ||
pub fn in_list( | ||
expr: Arc<dyn PhysicalExpr>, | ||
|
@@ -426,7 +438,7 @@ pub fn in_list( | |
let expr_data_type = expr.data_type(schema)?; | ||
for list_expr in list.iter() { | ||
let list_expr_data_type = list_expr.data_type(schema)?; | ||
if !expr_data_type.eq(&list_expr_data_type) { | ||
if !is_logically_eq(&expr_data_type, &list_expr_data_type) { | ||
return internal_err!( | ||
"The data type inlist should be same, the value type is {expr_data_type}, one of list expr type is {list_expr_data_type}" | ||
); | ||
|
@@ -499,7 +511,21 @@ mod tests { | |
macro_rules! in_list { | ||
($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr, $COL:expr, $SCHEMA:expr) => {{ | ||
let (cast_expr, cast_list_exprs) = in_list_cast($COL, $LIST, $SCHEMA)?; | ||
let expr = in_list(cast_expr, cast_list_exprs, $NEGATED, $SCHEMA).unwrap(); | ||
in_list_raw!( | ||
$BATCH, | ||
cast_list_exprs, | ||
$NEGATED, | ||
$EXPECTED, | ||
cast_expr, | ||
$SCHEMA | ||
); | ||
}}; | ||
} | ||
|
||
// applies the in_list expr to an input batch and list without cast | ||
macro_rules! in_list_raw { | ||
($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr, $COL:expr, $SCHEMA:expr) => {{ | ||
let expr = in_list($COL, $LIST, $NEGATED, $SCHEMA).unwrap(); | ||
let result = expr | ||
.evaluate(&$BATCH)? | ||
.into_array($BATCH.num_rows()) | ||
|
@@ -540,7 +566,7 @@ mod tests { | |
&schema | ||
); | ||
|
||
// expression: "a not in ("a", "b")" | ||
// expression: "a in ("a", "b", null)" | ||
let list = vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))]; | ||
in_list!( | ||
batch, | ||
|
@@ -551,7 +577,7 @@ mod tests { | |
&schema | ||
); | ||
|
||
// expression: "a not in ("a", "b")" | ||
// expression: "a not in ("a", "b", null)" | ||
let list = vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))]; | ||
in_list!( | ||
batch, | ||
|
@@ -1314,4 +1340,96 @@ mod tests { | |
|
||
Ok(()) | ||
} | ||
|
||
#[test] | ||
fn in_list_utf8_with_dict_types() -> Result<()> { | ||
fn dict_lit(key_type: DataType, value: &str) -> Arc<dyn PhysicalExpr> { | ||
lit(ScalarValue::Dictionary( | ||
Box::new(key_type), | ||
Box::new(ScalarValue::new_utf8(value.to_string())), | ||
)) | ||
} | ||
|
||
fn null_dict_lit(key_type: DataType) -> Arc<dyn PhysicalExpr> { | ||
lit(ScalarValue::Dictionary( | ||
Box::new(key_type), | ||
Box::new(ScalarValue::Utf8(None)), | ||
)) | ||
} | ||
|
||
let schema = Schema::new(vec![Field::new( | ||
"a", | ||
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)), | ||
true, | ||
)]); | ||
let a: UInt16DictionaryArray = | ||
vec![Some("a"), Some("d"), None].into_iter().collect(); | ||
let col_a = col("a", &schema)?; | ||
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?; | ||
|
||
// expression: "a in ("a", "b")" | ||
let lists = [ | ||
vec![lit("a"), lit("b")], | ||
vec![ | ||
dict_lit(DataType::Int8, "a"), | ||
dict_lit(DataType::UInt16, "b"), | ||
], | ||
]; | ||
for list in lists.iter() { | ||
in_list_raw!( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't understand this change -- the tests now call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The function For the sqllogictest part, jayzhan11 has already pointed out in #9530 (comment), the type coercion is happened in the optimization phase.
I think that's why the E2E tests already pass without this PR's fix. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it -- what I don't understand is how these validate the Comet use case. I expect them to call What I was expected to see was a test that mirrors what comet does: call Given this case current errors, we have no test coverage, even if the Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There might be some misunderstanding here.
No worries. I think it's exactly the spirit we need towards a better software engineering practice. And totally agree that it's important to make sure we get it right and won't break it in the future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense. |
||
batch, | ||
list.clone(), | ||
&false, | ||
vec![Some(true), Some(false), None], | ||
col_a.clone(), | ||
&schema | ||
); | ||
} | ||
|
||
// expression: "a not in ("a", "b")" | ||
for list in lists.iter() { | ||
in_list_raw!( | ||
batch, | ||
list.clone(), | ||
&true, | ||
vec![Some(false), Some(true), None], | ||
col_a.clone(), | ||
&schema | ||
); | ||
} | ||
|
||
// expression: "a in ("a", "b", null)" | ||
let lists = [ | ||
vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))], | ||
vec![ | ||
dict_lit(DataType::Int8, "a"), | ||
dict_lit(DataType::UInt16, "b"), | ||
null_dict_lit(DataType::UInt16), | ||
], | ||
]; | ||
for list in lists.iter() { | ||
in_list_raw!( | ||
batch, | ||
list.clone(), | ||
&false, | ||
vec![Some(true), None, None], | ||
col_a.clone(), | ||
&schema | ||
); | ||
} | ||
|
||
// expression: "a not in ("a", "b", null)" | ||
for list in lists.iter() { | ||
in_list_raw!( | ||
batch, | ||
list.clone(), | ||
&true, | ||
vec![Some(false), None, None], | ||
col_a.clone(), | ||
&schema | ||
); | ||
} | ||
|
||
Ok(()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, I think we have a similar method in https://github.com/apache/arrow-datafusion/blob/f55c1d90215614ce531a4103c7dbebf318de1cfd/datafusion/common/src/dfschema.rs#L643 basically the schema struct has a lot of equal checks between datatypes. may want to move them into separate utils struct as they mostly working with datatypes rather than schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we could use
datatype_is_logically_equal
instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PRThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me promote and use
datatype_is_logically_equal
in a follow up PR.I noticed that method earlier, but thought it might be a bit overkill for cases here in the
in_list
.