Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include union type as dictionaries #44

Closed
Tracked by #61
chmp opened this issue Apr 5, 2023 · 7 comments
Closed
Tracked by #61

Include union type as dictionaries #44

chmp opened this issue Apr 5, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@chmp
Copy link
Owner

chmp commented Apr 5, 2023

Update: serialize unions as is, but include an additional {name}_type field that encodes the union type in human readable form.

At the moment field-less enums are serialized as a union of null fields. This is probably not expected (or desired). Probably it would be better to serialize it as a string-dictionary.

test_round_trip!(
    test_name = enums_union,
    tracing_options = TracingOptions::default().allow_null_fields(true),
    field = Field::new(
        "value",
        DataType::Union(
            vec![
                Field::new("A", DataType::Null, true),
                Field::new("B", DataType::Null, true),
            ],
            None,
            UnionMode::Dense
        ),
        false,
    ),
    ty = Item,
    values = [Item::A, Item::B,],
    define = {
        #[derive(Debug, PartialEq, Deserialize, Serialize)]
        enum Item {
            A,
            B,
        }
    },
);

Current idea:

  • add fields with strategy "enum_type" that only serialize the type of the enum
  • Q: allow both data and type to be stored in different fields? How to deal with inconsistencies?
    • on serialization no inconsistency can happen
    • on deserialization ignore the type if both are present?
@chmp chmp added the enhancement New feature or request label Apr 5, 2023
@chmp chmp mentioned this issue Apr 6, 2023
14 tasks
@droher
Copy link

droher commented May 26, 2023

Not sure whether this should be its own issue or under this one, but I am currently unable to serialize field-less enums - they all panic with a backtrace like this:

thread '<unnamed>' panicked at 'index out of bounds: the len is 2 but the index is 3', /Users/davidroher/.cargo/registry/src/github.com-1ecc6299db9ec823/serde_arrow-0.7.0/src/internal/generic_sinks/union.rs:85:45
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic_bounds_check
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:159:5
   3: <serde_arrow::internal::generic_sinks::union::UnionArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   4: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   5: <serde_arrow::internal::generic_sinks::struct::StructArrayBuilder<B> as serde_arrow::internal::sink::EventSink>::accept_variant
   6: <serde_arrow::internal::sink::EventSerializer<S> as serde::ser::Serializer>::serialize_unit_variant
   7: boxball_rs::event_file::game_state::_::<impl serde::ser::Serialize for boxball_rs::event_file::game_state::GameMetadata>::serialize

As soon as it gets to any struct which has a fieldless enum as one of its fields (when I remove all enum fields from the struct, i no longer get the error).

@chmp
Copy link
Owner Author

chmp commented May 27, 2023

@droher Thanks for posting. A separate issue would be better (as I have something different in mind for this topic :D). I created one here.

@chmp chmp changed the title Allow to store field-less enums as dictionaries? Include union type as dictionaries May 29, 2023
@droher
Copy link

droher commented Jun 2, 2023

I'm not sure if you would want to include this as an intermediate option, but my current workaround for this is to just serialize them as strings (with a serialize_with override) and not worry about the dictionary part. This is especially relevant for the Parquet write use-case, because the parquet crate A) isn't able to handle null union types and B) automatically writes string fields as dictionaries regardless of their arrow representation.

@chmp
Copy link
Owner Author

chmp commented Jun 3, 2023

The dictionary part is actually not the tricky part - as it is already implemented :). My issue is rather how to do all of this in a generic way. My feeling is the type as a string could be useful in addition to the data quite some often. So my current thinking is to allow maximum flexibility:

  • data only: the current approach
  • type only: store variant name as a dictionary encoded string
  • data + type: store the enum as separate fields (e.g., {field_name}_type and {field_name}_data)

What is keeping me from implementing it, is mostly, that I'm not sure whether I like this approach :) (plus time)

@chmp chmp mentioned this issue Jun 18, 2023
15 tasks
@chmp chmp mentioned this issue Jun 3, 2024
@chmp
Copy link
Owner Author

chmp commented Jun 3, 2024

Started work in #185

@chmp
Copy link
Owner Author

chmp commented Jun 3, 2024

See here for the current design: #183 (comment)

@chmp
Copy link
Owner Author

chmp commented Jun 4, 2024

Merged #185

@chmp chmp closed this as completed Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants