Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avro_to_arrow: Handle avro nested nullable struct (union) #7663

Merged
merged 1 commit into from
Oct 4, 2023

Conversation

Samrose-Ahmed
Copy link
Contributor

Corrects handling of a nullable struct union.

Which issue does this PR close?

Closes #7662.

Rationale for this change

Bug fix

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 27, 2023
@alamb
Copy link
Contributor

alamb commented Sep 28, 2023

@Samrose-Ahmed -- is it possible to add a test to ensure this functionality is not broken in the future?

@alamb
Copy link
Contributor

alamb commented Sep 28, 2023

(Thank you for the contribution, BTW)

@Samrose-Ahmed
Copy link
Contributor Author

Of course I can add a test.

@Samrose-Ahmed
Copy link
Contributor Author

I have added a test to verify this behavior.

@alamb
Copy link
Contributor

alamb commented Sep 29, 2023

I don't think the CI failures are related to your work (they are fixed in #7701)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @Samrose-Ahmed

@alamb
Copy link
Contributor

alamb commented Sep 29, 2023

I took the liberty of merging up from main to get the CI to pass cleanly.

@sarutak
Copy link
Member

sarutak commented Sep 29, 2023

@Samrose-Ahmed Thank you for catching this issue! I should have fixed this in #7525 .
@alamb Do we need end-to-end test for this change?

r.put("col1", AvroValue::Union(0, Box::new(AvroValue::Null)));

let mut w = apache_avro::Writer::new(&schema, vec![]);
w.append(r).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to add one more record which contains non-null col1 to ensure it works even in this case?

@alamb
Copy link
Contributor

alamb commented Sep 29, 2023

@alamb Do we need end-to-end test for this change?

Hi @sarutak -- end to end test coverage would definitely be better ❤️ -- I am not familiar enough with the avro spec to know how hard/easy that would be to create. Is it something you can help with?

@sarutak
Copy link
Member

sarutak commented Sep 29, 2023

Is it something you can help with?

@alamb
Of course! I'll update nested_record.avro in arow-testing to add nullable record schema.

@Samrose-Ahmed
After I have finished the change, could you add end-to-end test to avro.slt?

@Samrose-Ahmed
Copy link
Contributor Author

Sure I can do that

@sarutak
Copy link
Member

sarutak commented Sep 29, 2023

@Samrose-Ahmed Or, do you want to change the test data too?
If you want to do it, this PR might help you.

@Samrose-Ahmed
Copy link
Contributor Author

I would appreciate if you can do it.

@sarutak
Copy link
Member

sarutak commented Sep 29, 2023

@Samrose-Ahmed
OK, let me do it.

@sarutak
Copy link
Member

sarutak commented Sep 29, 2023

I've opened a PR to update the test data.
apache/arrow-testing#94

alamb added a commit to apache/arrow-testing that referenced this pull request Oct 1, 2023
This PR proposes to update `nested_recods.avro` to support nullable
records.
This change is necessary for [this
PR](apache/datafusion#7663).

This change appends new fields `f3` and `f4` to the existing schema (the
last two fields).
`f3` is nullable record, and `f4` is array of nullable record.
```
{
    "name": "record1",
    "namespace": "ns1",
    "type": "record",
    "fields": [
        {
            "name": "f1",
            "type": {
                "name": "record2",
                "namespace": "ns2",
                "type": "record",
                "fields": [
                    {
                        "name": "f1_1",
                        "type": "string"
                    },  {
                        "name": "f1_2",
                        "type": "int"
                    },  {
                        "name": "f1_3",
                        "type": {
                            "name": "record3",
                            "namespace": "ns3",
                            "type": "record",
                            "fields": [
                                {
                                    "name": "f1_3_1",
                                    "type": "double"
                                }
                            ]
                        }
                    }
                ]
            }
        },  {
            "name": "f2",
            "type": "array",
            "items": {
                "name": "record4",
                "namespace": "ns4",
                "type": "record",
                "fields": [
                    {
                        "name": "f2_1",
                        "type": "boolean"
                    },  {
                        "name": "f2_2",
                        "type": "float"
                    }
                ]
            }
        },  {
            "name": "f3",
            "type": [
                "null",
                {
                    "name": "record5",
                    "namespace": "ns5",
                    "type": "record",
                    "fields": [
                        {
                            "name": "f3_1",
                            "type": "string"
                        }
                    ]
                }
            ],
            "default": null
        },  {
            "name": "f4",
            "type": "array",
            "items": [
                "null",
                {
                    "name": "record6",
                    "namespace": "ns6",
                    "type": "record",
                    "fields": [
                        {
                            "name": "f4_1",
                            "type": "long"
                        }
                    ]
                }
            ]
        }
    ]
}
```

And the data represented in JSON is as follows.
```
{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2000000476837158},{"f2_1":true,"f2_2":2.200000047683716}],"f3":{"f3_1":"xyz"},"f4":[{"f4_1":200},null]}
{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.199999809265137}],"f3":null,"f4":[null,{"f4_1":300}]}
```
@alamb
Copy link
Contributor

alamb commented Oct 1, 2023

apache/arrow-testing#94 is merged

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 2, 2023
@Samrose-Ahmed
Copy link
Contributor Author

I've updated the PR.

I've fixed a correctness issue needed for this PR with mapping the Avro and Arrow schema as well, the Avro Record name should not be used in Arrow (the code that was doing &format!("{}.{}", name.fullname(None), field.name), only the field names should be used, the record name is an Avro concept used for schema merging/references. I've added a test for that as well.

@Samrose-Ahmed
Copy link
Contributor Author

Actually this code is a bit weird I found more issues with nested types... will have to fix.

@alamb
Copy link
Contributor

alamb commented Oct 2, 2023

Actually this code is a bit weird I found more issues with nested types... will have to fix.

It might make sense to get this PR in as it improves things, even if it doesn't fix everything completely and then work on follow ups

Also, note that @tustvold I think is planning to work on adding upstream support in apache/arrow-rs#4886 arrow-rs (which might automatically handle nested types), though that won't land in datafusion for a week or two

@Samrose-Ahmed
Copy link
Contributor Author

Oh that's great, let me make another revision and see if theirs something improving we can get in.

@Samrose-Ahmed
Copy link
Contributor Author

I have made another revision, I had to edit some of the schema lookup logic to handle nested fields, didn't want to change the code too much.

@Samrose-Ahmed Samrose-Ahmed force-pushed the avro-struct-union branch 2 times, most recently from ce88ea4 to cf34a25 Compare October 2, 2023 23:14
Copy link
Member

@sarutak sarutak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me except one minor comment!

datafusion/core/Cargo.toml Outdated Show resolved Hide resolved
@sarutak
Copy link
Member

sarutak commented Oct 3, 2023

Oh, wait. Clippy raises some warnings.

error: this expression creates a reference which is immediately dereferenced by the compiler
   --> datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs:115:29
    |
115 | ...                   &parent_field_name,
    |                       ^^^^^^^^^^^^^^^^^^ help: change this to: `parent_field_name`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
    = note: `-D clippy::needless-borrow` implied by `-D warnings`

error: redundant closure
   --> datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs:595:38
    |
595 | ...                   .map(|v| maybe_resolve_union(v))
    |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^ help: replace the closure with the function itself: `maybe_resolve_union`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#redundant_closure
    = note: `-D clippy::redundant-closure` implied by `-D warnings`

error: could not compile `datafusion` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...
error: the borrowed expression implements the required traits
    --> datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs:1189:40
     |
1189 |           let r1 = apache_avro::to_value(&serde_json::json!({
     |  ________________________________________^
1190 | |             "headers": [
1191 | |                 {
1192 | |                     "name": "a",
...    |
1195 | |             ]
1196 | |         }))
     | |__________^
     |
     = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
help: change this to
     |
1189 ~         let r1 = apache_avro::to_value(serde_json::json!({
1190 +             "headers": [
1191 +                 {
1192 +                     "name": "a",
1193 +                     "value": "b"
1194 +                 }
1195 +             ]
1196 ~         }))
     |

error: the borrowed expression implements the required traits
    --> datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs:1407:40
     |
1407 |         let r1 = apache_avro::to_value(&jv1)
     |                                        ^^^^ help: change this to: `jv1`
     |
     = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow

error: could not compile `datafusion` (lib test) due to 4 previous errors

@Samrose-Ahmed Could you fix them? Also you can check the style locally by running ci/scripts/rust_clippy.sh and ci/scripts/rust_fmt.sh.

@sarutak
Copy link
Member

sarutak commented Oct 3, 2023

@alamb Could you trigger GA workflows?

Corrects handling of a nullable struct union.

Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Samrose-Ahmed and @sarutak

BTW @tustvold is working on porting / implementing this upstream: apache/arrow-rs#4888

@sarutak
Copy link
Member

sarutak commented Oct 4, 2023

@alamb Thank you for letting me know. I'll check it out.

@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

In the interim, I think this PR makes DataFusion demonstrably better, and increases test coverage so I think we should merge it in. Thanks again everyone who was involved

@alamb alamb merged commit 5361d2e into apache:main Oct 4, 2023
22 checks passed
@sarutak
Copy link
Member

sarutak commented Oct 4, 2023

@alamb
BTW, will this project still open for changes related to Avro?
Actually, I'm working on a type conversion feature currently marked as todo!().

@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

BTW, will this project still open for changes related to Avro?
Actually, I'm working on a type conversion feature currently marked as todo!().

Absolutely -- I view what @tustvold is doing as complementary (even if somewhat duplicative). I would expect that DataFusion will continue to support avro for reading / processing, and that all end to end tests (e.g. slt tests) would pass without modification, even if the internals of how avro reading / processing move upstream

Does that make sense @sarutak ?

@sarutak
Copy link
Member

sarutak commented Oct 4, 2023

@alamb
I understand. Thank you!

Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
…e#7663)

Corrects handling of a nullable struct union.

Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

avro_to_arrow: Nested nullable struct (union)
3 participants