util/parquet: support tuples #103589

Merged May 22, 2023 (1 commit)

Commits on May 22, 2023

  1. util/parquet: support tuples

    This change adds support for writing tuples. Implementation details below.
    
    The standard way to write a tuple in parquet is to use a group:
    ```
    message schema {                 -- toplevel schema
       optional group a (LIST) {
           optional T1 element;       -- physical column for the first field
           ...
           optional Tn element;       -- physical column for the nth field
       }
    }
    ```
    
    Because parquet is a columnar format, it does not write such a group
    as one column with all the fields adjacent to each other. Instead, it
    writes each field in the tuple as its own column. This 1:N mapping
    from CRDB datum to physical parquet columns violates the library's
    assumption that the mapping is 1:1.
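
    As a rough illustration of the 1:N mapping, the sketch below computes
    where each datum column starts among the physical columns. The `column`
    type and `physicalColumnStarts` function are hypothetical names used only
    for this example; they are not taken from the library. The sketch also
    assumes a tuple always has at least one field.

    ```go
    package main

    import "fmt"

    // column is a hypothetical description of one datum column.
    // A non-tuple column has no tupleFields; a tuple column lists one
    // entry per field (assumed non-empty for tuples in this sketch).
    type column struct {
        name        string
        tupleFields []string
    }

    // physicalColumnStarts returns the index of the first physical column
    // for each datum column, plus the total number of physical columns.
    func physicalColumnStarts(cols []column) (starts []int, total int) {
        for _, c := range cols {
            starts = append(starts, total)
            if n := len(c.tupleFields); n > 0 {
                total += n // a tuple occupies one physical column per field
            } else {
                total++ // a scalar occupies exactly one physical column
            }
        }
        return starts, total
    }

    func main() {
        cols := []column{
            {name: "id"},
            {name: "a", tupleFields: []string{"INT", "STRING"}},
            {name: "b"},
        }
        starts, total := physicalColumnStarts(cols)
        fmt.Println(starts, total) // prints: [0 1 3] 4
    }
    ```

    Here three datum columns map to four physical columns, so an index into
    one list can no longer be used as an index into the other.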
    
    This change updates the library to remove that assumption. First,
    there is now a clear distinction between a "datum column" and a "physical
    column". Second, the `Writer` can now write to multiple physical columns
    for a given datum, and the reader "squashes" physical columns into
    single tuple datums where needed. Finally, randomized testing and
    benchmarking are extended to cover tuples.
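
    The write and read paths described above can be sketched roughly as
    follows. This is a simplified model, not the library's code: `datum`,
    `flatten`, and `squash` are hypothetical names, values are held in memory
    rather than written to parquet, and NULL handling is left out.

    ```go
    package main

    import "fmt"

    // datum is a hypothetical stand-in for a CRDB datum: either a single
    // scalar value or a tuple of scalar values.
    type datum struct {
        isTuple bool
        scalar  any
        fields  []any
    }

    // flatten models the writer: it expands one row of datums into one
    // value per physical column, emitting each tuple field separately.
    func flatten(row []datum) []any {
        var phys []any
        for _, d := range row {
            if d.isTuple {
                phys = append(phys, d.fields...)
            } else {
                phys = append(phys, d.scalar)
            }
        }
        return phys
    }

    // squash models the reader: it regroups physical values into datums.
    // tupleWidths[i] is the field count of datum column i if it is a
    // tuple, or 0 if it is a scalar.
    func squash(phys []any, tupleWidths []int) []datum {
        row := make([]datum, 0, len(tupleWidths))
        pos := 0
        for _, w := range tupleWidths {
            if w == 0 {
                row = append(row, datum{scalar: phys[pos]})
                pos++
                continue
            }
            row = append(row, datum{isTuple: true, fields: phys[pos : pos+w]})
            pos += w
        }
        return row
    }

    func main() {
        row := []datum{
            {scalar: 1},
            {isTuple: true, fields: []any{2, "x"}},
            {scalar: true},
        }
        phys := flatten(row)
        fmt.Println(len(phys)) // prints: 4

        back := squash(phys, []int{0, 2, 0})
        fmt.Println(len(back)) // prints: 3
    }
    ```

    The reader needs the schema (here `tupleWidths`) to know which runs of
    physical columns belong to a single tuple datum.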
    
    Informs: cockroachdb#99028
    Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
    Release note: None
    jayshrivastava committed May 22, 2023
    Commit 198a5ad