
Add support for decoding StringArray in LargeUtf8 schema #143

Closed
wants to merge 2 commits

Conversation

progval
Contributor

@progval progval commented Mar 6, 2024

`Vec::<_>::from_type::<Record>(TracingOptions::default())?` is very
convenient, but it always returns a `LargeUtf8` type for String fields
in `Record`, which prevents using it before decoding from a
`RecordBatch` which may contain a `StringArray` (instead of the expected
`LargeStringArray`).

This change adds a fallback on error when decoding what we expect to be
a `LargeUtf8`, to try decoding it as a `Utf8` as a last resort.

(this PR contains the commit from #142 to avoid a merge conflict)
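The fallback described above can be sketched with plain std types. This is a conceptual illustration only, using hypothetical stand-in names (`StringKind`, `Column`, `decode_as`), not serde_arrow's actual internals:

```rust
// Conceptual sketch of the fallback in this PR, with hypothetical
// stand-in types: try the expected LargeUtf8 decoding first, and
// retry as Utf8 only if that fails.

#[derive(Debug, PartialEq)]
enum StringKind {
    Utf8,      // arrow StringArray (i32 offsets)
    LargeUtf8, // arrow LargeStringArray (i64 offsets)
}

struct Column {
    kind: StringKind,
    values: Vec<String>,
}

// Decode only if the column's concrete array type matches `expected`.
fn decode_as(col: &Column, expected: StringKind) -> Result<Vec<String>, String> {
    if col.kind == expected {
        Ok(col.values.clone())
    } else {
        Err(format!("array does not match {:?}", expected))
    }
}

// The fallback: LargeUtf8 first, Utf8 as a last resort.
fn decode_string_column(col: &Column) -> Result<Vec<String>, String> {
    decode_as(col, StringKind::LargeUtf8).or_else(|_| decode_as(col, StringKind::Utf8))
}

fn main() {
    // A field traced as LargeUtf8 but actually backed by a plain
    // StringArray still decodes, thanks to the fallback.
    let col = Column { kind: StringKind::Utf8, values: vec!["hello".into()] };
    assert_eq!(decode_string_column(&col).unwrap(), vec!["hello".to_string()]);
    println!("ok");
}
```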

This can help debug incorrect data types.
@chmp
Owner

chmp commented Mar 6, 2024

Hey again :).

TBH, I have to think about this change from a design perspective. My original idea was that the supplied fields reflect the types of the supplied arrays one-to-one. The idea of a fallback would weaken this one-to-one correspondence.

As part of #138, I added the option to directly get the fields from the RecordBatch schema. That would simplify the common case to:

let fields = Vec::<Field>::from_value(&batch.schema())?;
let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;

I also plan to add a from_record_batch that would directly handle the common scenario and correspond to the code above.
I further plan to rewrite from_arrow to accept a [impl AsRef<Field>], which would support using from_arrow(batch.schema(), batch.columns())?. However, this option requires some more (breaking) changes to offer a coherent API and is a bit further down the line.

If it helps you, I could release the changes of #138 already now as a minor release.

@progval
Contributor Author

progval commented Mar 6, 2024

Thanks, I'll give it a try. No need for a release, I don't mind pulling directly from git

@chmp
Owner

chmp commented Mar 6, 2024

You should be able to include the changes as serde_arrow = { version = "0.10.1-rc.1", ... }. See also here.

@progval
Contributor Author

progval commented Mar 6, 2024

I'm a little confused by this. Your example here and the documentation show Vec::<Field>::from_value(&batch.schema())?; but from the code and the executable examples, it looks like from_value expects a value rather than a schema. And that value needs to implement Serialize, which means it isn't suitable for types that implement only Deserialize.

@chmp
Owner

chmp commented Mar 6, 2024

Def. need to include more docs :). The short version:

  • arrow::datatypes::Schema supports Serialize, when you include arrow with the serde flag.
    • E.g., the schema of a batch with fields first: Int32, second: Struct { a: Int32 } will serialize to the equivalent of [{"name": "first", "data_type": {"Int32": null}}, {"name": "second", "data_type": {"Struct": [{"name": "a", "data_type": {"Int32": null}}]}}]
    • To see this, you could include a println!("{}", serde_json::to_string(&batch.schema())?);
  • from_value interprets the corresponding serialization events to build the fields for this schema, e.g., [Field::new("first", DataType::Int32, false), Field::new("second", DataType::Struct(vec![Field::new("a", DataType::Int32, false)]), false)].

from_value and its requirement of Serialize are only related to how to obtain the field information. I.e., it replaces the from_type call in your code. That the resulting fields are used in deserialization is unrelated to how the fields are built initially. In my personal usage, I use from_value to reconstruct the schema from a serde_arrow-specific JSON format (the arrow serde representation of a schema is a bit more verbose). For example:

let fields = Vec::<Field>::from_value(&json!([
  {"name": "first", "data_type": "I32"},
  {"name": "second", "data_type": "Struct", "children": [{"name": "a", "data_type": "I32"}]},
]))?;

@chmp
Owner

chmp commented Mar 6, 2024

Oh. And regarding the name: from_value is probably misnamed (I am open to suggestions :)). I chose the name, as I am typically using it with serde_json::Value(s) and it is conceptually similar to serde_json::from_value. Except, SchemaLike::from_value accepts any Rust type that can be interpreted by serde, not only serde_json::Value(s). You could think of it implicitly passing the argument through serde_json::to_value first, then interpreting the resulting JSON object.

@progval
Contributor Author

progval commented Mar 6, 2024

Got it, thanks.

For what it's worth, I see a small (5 to 10%) slowdown of the deserialization code of my application when using this vs my PR. It is probably because I use a rather small batch size (256 or 512 records).

  • arrow::datatypes::Schema supports Serialize, when you include arrow with the serde flag.

It's arrow_schema, not arrow btw.

@progval progval closed this Mar 6, 2024
@chmp
Owner

chmp commented Mar 7, 2024

Re. slowdown: yes, this way of getting the schema is totally not optimized. If speed is important, you can have a look at the second code example in this issue. This way of getting the fields should be much faster.

The arrow crate exports various types, for example arrow::datatypes::Schema. I am only using the individual crates to keep the dependency tree small.
