
Add support for decoding StringArray in LargeUtf8 schema #143

Closed
wants to merge 2 commits

Conversation

progval
Contributor

@progval progval commented Mar 6, 2024

`Vec::<_>::from_type::<Record>(TracingOptions::default())?` is very
convenient, but it always returns a `LargeUtf8` type for String fields
in `Record`, which prevents using it before decoding from a
`RecordBatch` which may contain a `StringArray` (instead of the expected
`LargeStringArray`).

This change adds a fallback on error when decoding what we expect to be
a `LargeUtf8`, to try decoding it as a `Utf8` as a last resort.

(this PR contains the commit from #142 to avoid a merge conflict)
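The fallback described above can be sketched with plain std types. This is a conceptual illustration only, using hypothetical stand-in names (`StringKind`, `Column`, `decode_as`), not serde_arrow's actual internals:

```rust
// Conceptual sketch of the fallback in this PR, with hypothetical
// stand-in types: try the expected LargeUtf8 decoding first, and
// retry as Utf8 only if that fails.

#[derive(Debug, PartialEq)]
enum StringKind {
    Utf8,      // arrow StringArray (i32 offsets)
    LargeUtf8, // arrow LargeStringArray (i64 offsets)
}

struct Column {
    kind: StringKind,
    values: Vec<String>,
}

// Decode only if the column's concrete array type matches `expected`.
fn decode_as(col: &Column, expected: StringKind) -> Result<Vec<String>, String> {
    if col.kind == expected {
        Ok(col.values.clone())
    } else {
        Err(format!("array does not match {:?}", expected))
    }
}

// The fallback: LargeUtf8 first, Utf8 as a last resort.
fn decode_string_column(col: &Column) -> Result<Vec<String>, String> {
    decode_as(col, StringKind::LargeUtf8).or_else(|_| decode_as(col, StringKind::Utf8))
}

fn main() {
    // A field traced as LargeUtf8 but actually backed by a plain
    // StringArray still decodes, thanks to the fallback.
    let col = Column { kind: StringKind::Utf8, values: vec!["hello".into()] };
    assert_eq!(decode_string_column(&col).unwrap(), vec!["hello".to_string()]);
    println!("ok");
}
```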

This can help debug incorrect data types.
@chmp
Owner

chmp commented Mar 6, 2024

Hey again :).

TBH, I have to think about this change from a design perspective. My original idea was that the supplied fields reflect the types of the supplied arrays one-to-one. The idea of a fallback would weaken this one-to-one correspondence.

As part of #138, I added the option to directly get the fields from the RecordBatch schema. That would simplify the common case to:

let fields = Vec::<Field>::from_value(&batch.schema())?;
let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;

I also plan to add a from_record_batch that would directly handle the common scenario and correspond to the code above.
I further plan to rewrite from_arrow to accept a [impl AsRef<Field>], which would support using from_arrow(batch.schema(), batch.columns())?. However, this option requires some more (breaking) changes to offer a coherent API and is a bit further down the line.

If it helps you, I could release the changes of #138 already now as a minor release.

@progval
Contributor Author

progval commented Mar 6, 2024

Thanks, I'll give it a try. No need for a release, I don't mind pulling directly from git

@chmp
Owner

chmp commented Mar 6, 2024

You should be able to include the changes as serde_arrow = { version = "0.10.1-rc.1", ... }. See also here.

@progval
Contributor Author

progval commented Mar 6, 2024

I'm a little confused by this. Your example here and the documentation show Vec::<Field>::from_value(&batch.schema())?; but from the code and the executable examples, it looks like from_value expects a value rather than a schema. And that value needs to implement Serialize, which means it isn't suitable for types that implement only Deserialize.

@chmp
Owner

chmp commented Mar 6, 2024

Def. need to include more docs :). The short version:

  • arrow::datatypes::Schema supports Serialize, when you include arrow with the serde flag.
    • E.g., the schema of a batch with fields first: Int32, second: Struct { a: Int32 } will serialize to the equivalent of [{"name": "first", "data_type": {"Int32": null}}, {"name": "second", "data_type": {"Struct": [{"name": "a", "data_type": {"Int32": null}}]}}]
    • To see this, you could include a println!("{}", serde_json::to_string(&batch.schema())?);
  • from_value interprets the corresponding serialization events to build the fields for this schema, e.g., [Field::new("first", DataType::Int32, false), Field::new("second", DataType::Struct(vec![Field::new("a", DataType::Int32, false)]), false)].

from_value and its requirement of Serialize are only related to how to obtain the field information. I.e., it replaces the from_type call in your code. That the resulting fields are used in deserialization is unrelated to how the fields are built initially. In my personal usage, I use from_value to reconstruct the schema from a serde_arrow-specific JSON format (the arrow serde representation of a schema is a bit more verbose). For example:

let fields = Vec::<Field>::from_value(&json!([
  {"name": "first", "data_type": "I32"},
  {"name": "second", "data_type": "Struct", "children": [{"name": "a", "data_type": "I32"}]},
]))?;

@chmp
Owner

chmp commented Mar 6, 2024

Oh. And regarding the name: from_value is probably misnamed (I am open to suggestions :)). I chose the name, as I am typically using it with serde_json::Value(s) and it is conceptually similar to serde_json::from_value. Except, SchemaLike::from_value accepts any Rust type that can be interpreted by serde, not only serde_json::Value(s). You could think of it implicitly passing the argument through serde_json::to_value first, then interpreting the resulting JSON object.

@progval
Contributor Author

progval commented Mar 6, 2024

Got it, thanks.

For what it's worth, I see a small (5 to 10%) slowdown of the deserialization code of my application when using this vs my PR. It is probably because I use a rather small batch size (256 or 512 records).

  • arrow::datatypes::Schema supports Serialize, when you include arrow with the serde flag.

It's arrow_schema, not arrow btw.

@progval progval closed this Mar 6, 2024
@chmp
Owner

chmp commented Mar 7, 2024

Re. slowdown: yes, this way of getting the schema is totally not optimized. If speed is important, you can have a look at the second code example in this issue. This way of getting the fields should be much faster.

The arrow crate exports various types, for example arrow::datatypes::Schema. I am only using the individual crates to keep the dependency tree small.
