Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Support Dict types in in_list physical plans #10031

Merged
merged 5 commits into from
Apr 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
126 changes: 122 additions & 4 deletions datafusion/physical-expr/src/expressions/in_list.rs
Original file line number Diff line number Diff line change
Expand Up @@ -415,6 +415,18 @@ impl PartialEq<dyn Any> for InListExpr {
}
}

/// Checks if two types are logically equal, dictionary types are compared by their value types.
fn is_logically_eq(lhs: &DataType, rhs: &DataType) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I think we have a similar method in https://github.com/apache/arrow-datafusion/blob/f55c1d90215614ce531a4103c7dbebf318de1cfd/datafusion/common/src/dfschema.rs#L643 basically the schema struct has a lot of equal checks between datatypes. may want to move them into separate utils struct as they mostly working with datatypes rather than schema

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we could use datatype_is_logically_equal instead here. It also seems like a good idea to make that function more discoverable -- I also think it would be fine to do as a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me promote and use datatype_is_logically_equal in a follow up PR.

I noticed that method earlier, but thought it might be a bit overkill for cases here in the in_list.

match (lhs, rhs) {
(DataType::Dictionary(_, v1), DataType::Dictionary(_, v2)) => {
v1.as_ref().eq(v2.as_ref())
}
(DataType::Dictionary(_, l), _) => l.as_ref().eq(rhs),
(_, DataType::Dictionary(_, r)) => lhs.eq(r.as_ref()),
_ => lhs.eq(rhs),
}
}

/// Creates a unary expression InList
pub fn in_list(
expr: Arc<dyn PhysicalExpr>,
Expand All @@ -426,7 +438,7 @@ pub fn in_list(
let expr_data_type = expr.data_type(schema)?;
for list_expr in list.iter() {
let list_expr_data_type = list_expr.data_type(schema)?;
if !expr_data_type.eq(&list_expr_data_type) {
if !is_logically_eq(&expr_data_type, &list_expr_data_type) {
return internal_err!(
"The data type inlist should be same, the value type is {expr_data_type}, one of list expr type is {list_expr_data_type}"
);
Expand Down Expand Up @@ -499,7 +511,21 @@ mod tests {
macro_rules! in_list {
($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr, $COL:expr, $SCHEMA:expr) => {{
let (cast_expr, cast_list_exprs) = in_list_cast($COL, $LIST, $SCHEMA)?;
let expr = in_list(cast_expr, cast_list_exprs, $NEGATED, $SCHEMA).unwrap();
in_list_raw!(
$BATCH,
cast_list_exprs,
$NEGATED,
$EXPECTED,
cast_expr,
$SCHEMA
);
}};
}

// applies the in_list expr to an input batch and list without cast
macro_rules! in_list_raw {
($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr, $COL:expr, $SCHEMA:expr) => {{
let expr = in_list($COL, $LIST, $NEGATED, $SCHEMA).unwrap();
let result = expr
.evaluate(&$BATCH)?
.into_array($BATCH.num_rows())
Expand Down Expand Up @@ -540,7 +566,7 @@ mod tests {
&schema
);

// expression: "a not in ("a", "b")"
// expression: "a in ("a", "b", null)"
let list = vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))];
in_list!(
batch,
Expand All @@ -551,7 +577,7 @@ mod tests {
&schema
);

// expression: "a not in ("a", "b")"
// expression: "a not in ("a", "b", null)"
let list = vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))];
in_list!(
batch,
Expand Down Expand Up @@ -1314,4 +1340,96 @@ mod tests {

Ok(())
}

#[test]
fn in_list_utf8_with_dict_types() -> Result<()> {
fn dict_lit(key_type: DataType, value: &str) -> Arc<dyn PhysicalExpr> {
lit(ScalarValue::Dictionary(
Box::new(key_type),
Box::new(ScalarValue::new_utf8(value.to_string())),
))
}

fn null_dict_lit(key_type: DataType) -> Arc<dyn PhysicalExpr> {
lit(ScalarValue::Dictionary(
Box::new(key_type),
Box::new(ScalarValue::Utf8(None)),
))
}

let schema = Schema::new(vec![Field::new(
"a",
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
true,
)]);
let a: UInt16DictionaryArray =
vec![Some("a"), Some("d"), None].into_iter().collect();
let col_a = col("a", &schema)?;
let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?;

// expression: "a in ("a", "b")"
let lists = [
vec![lit("a"), lit("b")],
vec![
dict_lit(DataType::Int8, "a"),
dict_lit(DataType::UInt16, "b"),
],
];
for list in lists.iter() {
in_list_raw!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change -- the tests now call in_list_raw directly. But there is no way that in_list_raw will be called outside of this module 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is no way that in_list_raw will be called outside of this module

The function in_list is public, so it's possible that in_list is called directly without any type coercion, which is exactly what Comet tries to do in the first place to leverage the static filter optimization. Since there's no type coercion, it's possible the value type and the list type are mixed with dictionary types. I think this test simulates exactly the issue I encountered in the Comet's case.

For the sqllogictest part, jayzhan11 has already pointed out in #9530 (comment), the type coercion is happened in the optimization phase.

In datafusion, we have done it in the optimization step, when we reach in_list here, we can ensure the types are consistent, so we just go ahead without type checking.

I think that's why the E2E tests already pass without this PR's fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list (instead they calling the in_list_raw! macro)

What I was expected to see was a test that mirrors what comet does: call in_list with a Dictionary column but string literals (that haven't ben type cerced).

Given this case current errors, we have no test coverage, even if the in_list implementation does actually support it.

Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- what I don't understand is how these validate the Comet use case. I expect them to call in_list (instead they calling the in_list_raw! macro)

There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not. These two macros both call the in_list method though. The comet case calls the in_list method(not the test macro) directly without type coercion, which is exactly the in_list_utf8_with_dict_types test trying to simulate by calling the in_list_raw test macro.

Sorry to be so pedantic about this, but I think it is somewhat subtle so making sure we get it right (and don't accidentally break it in the future) I think is important

No worries. I think it's exactly the spirit we need towards a better software engineering practice. And totally agree that it's important to make sure we get it right and won't break it in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some misunderstanding here. in_list and in_list_raw are both test macros, the main difference is that the former performs type coercion first and the latter does not.

Ahh! Yes I was missing exactly this point. Sorry about that. Makes total sense.

batch,
list.clone(),
&false,
vec![Some(true), Some(false), None],
col_a.clone(),
&schema
);
}

// expression: "a not in ("a", "b")"
for list in lists.iter() {
in_list_raw!(
batch,
list.clone(),
&true,
vec![Some(false), Some(true), None],
col_a.clone(),
&schema
);
}

// expression: "a in ("a", "b", null)"
let lists = [
vec![lit("a"), lit("b"), lit(ScalarValue::Utf8(None))],
vec![
dict_lit(DataType::Int8, "a"),
dict_lit(DataType::UInt16, "b"),
null_dict_lit(DataType::UInt16),
],
];
for list in lists.iter() {
in_list_raw!(
batch,
list.clone(),
&false,
vec![Some(true), None, None],
col_a.clone(),
&schema
);
}

// expression: "a not in ("a", "b", null)"
for list in lists.iter() {
in_list_raw!(
batch,
list.clone(),
&true,
vec![Some(false), None, None],
col_a.clone(),
&schema
);
}

Ok(())
}
}
39 changes: 39 additions & 0 deletions datafusion/sqllogictest/test_files/dictionary.slt
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,22 @@ f3 Utf8 YES
f4 Float64 YES
time Timestamp(Nanosecond, None) YES

# in list with dictionary input
query BBB
SELECT
tag_id in ('1000'), '1000' in (tag_id, null), arrow_cast('999','Dictionary(Int32, Utf8)') in (tag_id, null)
FROM m1
----
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL
true true NULL

# Table m2 with a tag columns `tag_id` and `type`, a field column `f5`, and `time`
statement ok
Expand Down Expand Up @@ -165,6 +181,29 @@ order by date_bin('30 minutes', time) DESC
3 400 600 500 2023-12-04T00:30:00
3 100 300 200 2023-12-04T00:00:00

# query with in list
query BBBBBBBB
SELECT
type in ('active', 'passive')
, 'active' in (type)
, 'active' in (type, null)
, arrow_cast('passive','Dictionary(Int8, Utf8)') in (type, null)
, tag_id in ('1000', '2000')
, tag_id in ('999')
, '1000' in (tag_id, null)
, arrow_cast('999','Dictionary(Int16, Utf8)') in (tag_id, null)
FROM m2
----
true true true NULL true false true NULL
true true true NULL true false true NULL
true true true NULL true false true NULL
true true true NULL true false true NULL
true true true NULL true false true NULL
true true true NULL true false true NULL
true false NULL true true false true NULL
true false NULL true true false true NULL
true false NULL true true false true NULL
true false NULL true true false true NULL


# Reproducer for https://github.com/apache/arrow-datafusion/issues/8738
Expand Down
Loading