Support complex types in Arrow/Parquet/ORC #24341

Merged on Jun 21, 2021 (21 commits)

Avogar
Member

@Avogar Avogar commented May 20, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support structs and maps in Arrow/Parquet/ORC and dictionaries in Arrow input/output formats. Add the new setting output_format_arrow_low_cardinality_as_dictionary.

Detailed description / Documentation draft:
Support the Struct and Map types in the column-oriented input/output formats Arrow, Parquet, and ORC. ClickHouse Tuples and Maps (an experimental type) can now be read and written in these formats, and nested complex types are also supported. If the setting output_format_arrow_low_cardinality_as_dictionary is true, LowCardinality columns are written as Arrow dictionary columns; if it is false, LowCardinality columns are converted to full columns before output. The setting is false by default.
Closes #17240 and #21866
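
For readers unfamiliar with dictionary columns, the difference between a "full" column and a dictionary-encoded one can be sketched in plain Python. This is only an illustration of the general idea (unique values stored once, rows stored as indices); it is not ClickHouse's or Arrow's implementation, and the function names and sample values are hypothetical.

```python
def dictionary_encode(values):
    """Split a full column into (dictionary, indices), keeping first-seen order."""
    dictionary = []   # each distinct value, stored once
    positions = {}    # value -> its index in the dictionary
    indices = []      # one small integer per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the full column from the dictionary and the per-row indices."""
    return [dictionary[i] for i in indices]


# Hypothetical column with few distinct values, the typical LowCardinality case.
column = ["RU", "US", "RU", "DE", "US", "RU"]
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['RU', 'US', 'DE']
print(indices)     # [0, 1, 0, 2, 1, 0]
```

With the setting disabled, the output corresponds to the decoded form: every row carries its full value.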

@robot-clickhouse robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels May 20, 2021
@Avogar
Member Author

Avogar commented May 21, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented May 21, 2021

Command update: success

Branch has been successfully updated

@Avogar Avogar changed the title from "Support structs in Arrow/Parquet/ORC" to "Support structs in Arrow/Parquet/ORC and dictionaries for Arrow" May 25, 2021
@Avogar Avogar changed the title from "Support structs in Arrow/Parquet/ORC and dictionaries for Arrow" to "Support structs in Arrow/Parquet/ORC and dictionaries in Arrow" May 25, 2021
@nikitamikhaylov nikitamikhaylov self-assigned this May 26, 2021
@Avogar
Member Author

Avogar commented May 27, 2021

I am also going to add Map type support in this PR.

@Avogar Avogar changed the title from "Support structs in Arrow/Parquet/ORC and dictionaries in Arrow" to "Support complex types in Arrow/Parquet/ORC" May 27, 2021
@robot-ch-test-poll2 robot-ch-test-poll2 added the submodule changed (At least one submodule changed in this PR.) label May 28, 2021
@Avogar Avogar force-pushed the arrow branch 3 times, most recently from 8455492 to 32715b7 June 1, 2021 08:43
@Avogar
Member Author

Avogar commented Jun 1, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented Jun 1, 2021

Command update: success

Branch has been successfully updated

@buyology

buyology commented Jun 8, 2021

Thanks so much for working on this.

I tried this PR as I thought it could help us to load some of our Parquet files into ClickHouse.

The table

:) CREATE TABLE t ( v Nested(a String, b String) ) ENGINE=Memory()

An array of structs

» parquet-tools schema f1
message schema {
  required group v (LIST) {
    repeated group list {
      required group v {
        optional binary a (STRING);
        optional binary b (STRING);
      }
    }
  }
}
» ch --query="INSERT INTO t FORMAT Parquet" < f1
Code: 8. DB::Exception: Column "v.a" is not presented in input data.: data for INSERT was parsed from stdin

A struct of arrays

» parquet-tools schema f2
message schema {
  optional group v {
    required group a (LIST) {
      repeated group list {
        required binary a (STRING);
      }
    }
    required group b (LIST) {
      repeated group list {
        required binary b (STRING);
      }
    }
  }
}
» ch --query="INSERT INTO t FORMAT Parquet" < f2
Code: 8. DB::Exception: Column "v.a" is not presented in input data.: data for INSERT was parsed from stdin

This can be worked around by transforming the data with the input function (going through Tuples), but it would be amazing to be able to load both of these files directly into the destination table.
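
To make the two layouts concrete: a ClickHouse Nested(a String, b String) column named v is addressed as the parallel array columns v.a and v.b (the names in the error messages above), so either Parquet layout has to be reshaped into that form. The following plain-Python sketch shows the reshaping for a single row. It only illustrates the data transformation, not how ClickHouse reads Parquet; the helper names and sample values are hypothetical.

```python
def array_of_structs_to_nested(prefix, rows, fields):
    """Layout of f1: a list of structs -> one parallel array per field."""
    return {f"{prefix}.{name}": [row[name] for row in rows] for name in fields}


def struct_of_arrays_to_nested(prefix, struct, fields):
    """Layout of f2: a struct of lists -> the same parallel arrays."""
    lengths = {len(struct[name]) for name in fields}
    if len(lengths) > 1:
        # Arrays belonging to one Nested row must have equal lengths.
        raise ValueError("all arrays of a Nested row must have the same length")
    return {f"{prefix}.{name}": list(struct[name]) for name in fields}


# One hypothetical row in each layout, holding the same data.
f1_row = [{"a": "x", "b": "1"}, {"a": "y", "b": "2"}]
f2_row = {"a": ["x", "y"], "b": ["1", "2"]}

from_aos = array_of_structs_to_nested("v", f1_row, ["a", "b"])
from_soa = struct_of_arrays_to_nested("v", f2_row, ["a", "b"])
print(from_aos)  # {'v.a': ['x', 'y'], 'v.b': ['1', '2']}
print(from_aos == from_soa)  # True
```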

@Avogar
Member Author

Avogar commented Jun 15, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented Jun 15, 2021

Command update: success

Branch has been successfully updated

@Avogar
Member Author

Avogar commented Jun 15, 2021

@buyology thank you for your comment. I will try to support inserting into Nested from a struct of arrays or an array of structs in the next PR.

@Avogar
Member Author

Avogar commented Jun 15, 2021

AST fuzzer failure: #25293

@Avogar
Member Author

Avogar commented Jun 18, 2021

@Mergifyio update

@mergify
Contributor

mergify bot commented Jun 18, 2021

Command update: success

Branch has been successfully updated

@Avogar Avogar merged commit a54cbef into ClickHouse:master Jun 21, 2021
@sevirov
Contributor

sevirov commented Jun 24, 2021

Internal documentation ticket: DOCSUP-10557.

@alexey-milovidov
Member

Continued here: #36832

Labels
pr-feature (Pull request with new product feature), submodule changed (At least one submodule changed in this PR.)
Development

Successfully merging this pull request may close these issues.

The type "dictionary" ... is not supported for conversion from a Arrow data format
7 participants