
Make it possible for AvroArrowArrayReader to scan Avro-backed tables which contain nested records #7525

Merged 2 commits into apache:main on Sep 14, 2023

Conversation

@sarutak (Member) commented Sep 11, 2023

Which issue does this PR close?

Closes #7524

Rationale for this change

This PR fixes the issue that I explained in #7524.

What changes are included in this PR?

The causes are:

  1. schema_lookup considers only the root record's lookup table. Child records have their own lookup tables, so they should be considered too.
  2. The logic for reading arrays of records is wrong.

So, this change includes fixes for both; a simplified sketch of the recursive lookup idea is shown below.
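
To make the first point concrete, here is a minimal, self-contained sketch of a lookup table that is built recursively so that child records contribute their entries as well. The types and names are my own simplifications for illustration; they are not DataFusion's or the apache-avro crate's actual APIs.

```
use std::collections::HashMap;

// Illustrative stand-ins only -- not the types used by DataFusion or apache-avro.
enum FieldType {
    Primitive(&'static str),
    Record(Record),
}

struct Field {
    name: &'static str,
    ty: FieldType,
}

struct Record {
    namespace: &'static str,
    name: &'static str,
    fields: Vec<Field>,
}

/// Collect a field-position lookup for a record *and* every record nested
/// inside it, keyed by the fully qualified field name. Descending into child
/// records is the part that matters for nested schemas.
fn build_lookup(record: &Record, lookup: &mut HashMap<String, usize>) {
    for (position, field) in record.fields.iter().enumerate() {
        let key = format!("{}.{}.{}", record.namespace, record.name, field.name);
        lookup.insert(key, position);
        if let FieldType::Record(child) = &field.ty {
            build_lookup(child, lookup);
        }
    }
}

fn main() {
    // A cut-down version of the schema below: record1.f1 is itself a record.
    let schema = Record {
        namespace: "ns1",
        name: "record1",
        fields: vec![Field {
            name: "f1",
            ty: FieldType::Record(Record {
                namespace: "ns2",
                name: "record2",
                fields: vec![Field {
                    name: "f1_1",
                    ty: FieldType::Primitive("string"),
                }],
            }),
        }],
    };

    let mut lookup = HashMap::new();
    build_lookup(&schema, &mut lookup);
    // Contains both "ns1.record1.f1" and "ns2.record2.f1_1".
    println!("{lookup:?}");
}
```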

Are these changes tested?

I prepared this Avro file for the test.
The schema of the file is as follows.

{
    "name": "record1",
    "namespace": "ns1",
    "type": "record",
    "fields": [
        {
            "name": "f1",
            "type": {
                "name": "record2",
                "namespace": "ns2",
                "type": "record",
                "fields": [
                    {
                        "name": "f1_1",
                        "type": "string"
                    },  {
                        "name": "f1_2",
                        "type": "int"
                    },  {
                        "name": "f1_3",
                        "type": {
                            "name": "record3",
                            "namespace": "ns3",
                            "type": "record",
                            "fields": [
                                {
                                    "name": "f1_3_1",
                                    "type": "double"
                                }
                            ]
                        }
                    }
                ]
            }
        },  {
            "name": "f2",
            "type": "array",
            "items": {
                "name": "record4",
                "namespace": "ns4",
                "type": "record",
                "fields": [
                    {
                        "name": "f2_1",
                        "type": "boolean"
                    },  {
                        "name": "f2_2",
                        "type": "float"
                    }
                ]
            }
        }
    ]
}

And the JSON representation of the Avro format file is as follows.

{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2},{"f2_1":true,"f2_2":2.2}]}
{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.2}]}

Using this data, I create a table and scan it.

CREATE EXTERNAL TABLE mytbl STORED AS AVRO LOCATION '/path/to/nested_records.avro';
SELECT * FROM mytbl;

+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| f1                                                                                          | f2                                                                                                 |
+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| {ns2.record2.f1_1: aaa, ns2.record2.f1_2: 10, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: true, ns4.record4.f2_2: 1.2}, {ns4.record4.f2_1: true, ns4.record4.f2_2: 2.2}] |
| {ns2.record2.f1_1: bbb, ns2.record2.f1_2: 20, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: false, ns4.record4.f2_2: 10.2}]                                                |
+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

The result looks as expected.
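
The same scan can also be driven programmatically. This is a rough sketch and not part of the PR, assuming DataFusion is built with the avro feature and using a placeholder file path:

```
// Requires tokio with the "macros" and "rt-multi-thread" features.
use datafusion::error::Result;
use datafusion::prelude::{AvroReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the Avro file as a table; the path is a placeholder.
    ctx.register_avro(
        "mytbl",
        "/path/to/nested_records.avro",
        AvroReadOptions::default(),
    )
    .await?;

    // Scan it; nested records come back as Arrow struct and list columns.
    let df = ctx.sql("SELECT * FROM mytbl").await?;
    df.show().await?;
    Ok(())
}
```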

After this change is merged, I'll open a PR to add the test data to arrow-testing. Then I'll open a follow-up PR to add tests to avro.slt.

Are there any user-facing changes?

No.

github-actions bot added the core (Core DataFusion crate) label on Sep 11, 2023
@alamb (Contributor) left a review comment

The code looks good to me, though I am not an expert. I think this PR needs a test so that we don't accidentally break the feature in the future during a refactor.

@alamb (Contributor) commented Sep 11, 2023

Thank you for the contribution @sarutak

@sarutak (Member, Author) commented Sep 11, 2023

@alamb
Test data suitable for this change is not present in testing, so I'm planning to add test data to arrow-testing if this change looks good and is merged. Then I'll open a follow-up PR to add a test to avro.slt.

Or, is it better to add the test data to arrow-testing first?

@alamb (Contributor) commented Sep 12, 2023

Thanks @sarutak -- that makes sense.

> Or, is it better to add the test data to arrow-testing first?

I suggest we add the data to arrow-testing first. I would feel much more comfortable merging code into DataFusion that is tested, not only to prevent regressions, but also because, when reviewing unfamiliar code, having a test that demonstrates it working is a major part of evaluating its suitability.

@sarutak (Member, Author) commented Sep 13, 2023

@alamb All right. I've opened a PR in arrow-testing:
apache/arrow-testing#91

After the test data is added, I'll modify this PR to add tests.

alamb added a commit to apache/arrow-testing that referenced this pull request on Sep 13, 2023:

This PR proposes to add Avro format test data which contains nested records. This data is necessary for testing the change proposed in [this PR](apache/datafusion#7525). The schema of the test data and the JSON representation of its records are the same as shown above.
@alamb (Contributor) commented Sep 13, 2023

apache/arrow-testing#91 has been merged.

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Sep 13, 2023
@sarutak (Member, Author) commented Sep 13, 2023

@alamb Thank you!
I've added a test for this change.

@alamb (Contributor) left a review comment

Looks good to me -- thank you @sarutak

@alamb merged commit 58ddcee into apache:main on Sep 14, 2023. 21 checks passed.
@andygrove added the enhancement (New feature or request) label on Oct 7, 2023
Labels: core (Core DataFusion crate), enhancement (New feature or request), sqllogictest (SQL Logic Tests (.slt))
Development

Successfully merging this pull request may close these issues.

Can't scan Avro format tables which contain nested records
3 participants