Add support for decoding StringArray in LargeUtf8 schema #143
This can help debug incorrect data types.
`Vec::<_>::from_type::<Record>(TracingOptions::default())?` is very convenient, but it always returns a `LargeUtf8` type for String fields in `Record`, which prevents using it before decoding from a `RecordBatch` which may contain a `StringArray` (instead of the expected `LargeStringArray`). This change adds a fallback on error when decoding what we expect to be a `LargeUtf8`, to try decoding it as a `Utf8` as a last resort.
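To make the proposed fallback concrete, here is a minimal, self-contained sketch of the idea. It deliberately does not use `serde_arrow` or `arrow`: `DataType`, `Column`, `decode_strict` and `decode_with_fallback` are hypothetical names for illustration only, and the actual change in this PR may be structured differently.

```rust
/// Hypothetical stand-in for the data type declared in the traced fields.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
}

/// Hypothetical stand-in for the arrays found in a record batch.
#[derive(Debug)]
enum Column {
    /// Models a `StringArray` (32-bit offsets).
    Utf8(Vec<String>),
    /// Models a `LargeStringArray` (64-bit offsets).
    LargeUtf8(Vec<String>),
}

/// Strict decoding: the declared type must match the array exactly.
fn decode_strict(expected: DataType, column: &Column) -> Result<Vec<String>, String> {
    match (expected, column) {
        (DataType::Utf8, Column::Utf8(values))
        | (DataType::LargeUtf8, Column::LargeUtf8(values)) => Ok(values.clone()),
        _ => Err(format!("expected {:?}, found {:?}", expected, column)),
    }
}

/// Decoding with the fallback described above: if `LargeUtf8` was expected
/// and strict decoding fails, retry as `Utf8` before giving up.
fn decode_with_fallback(expected: DataType, column: &Column) -> Result<Vec<String>, String> {
    decode_strict(expected, column).or_else(|err| {
        if expected == DataType::LargeUtf8 {
            // Keep the original error if the fallback fails as well.
            decode_strict(DataType::Utf8, column).map_err(|_| err)
        } else {
            Err(err)
        }
    })
}
```

In this toy model, a traced `LargeUtf8` field paired with a `Column::Utf8` fails under `decode_strict` but succeeds under `decode_with_fallback`. The fallback is intentionally one-directional, so a declared `Utf8` never silently accepts a `LargeUtf8` column.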
Hey again :). To be honest, I have to think about this change from a design perspective. My original idea was that the supplied fields reflect the types of the supplied arrays one-to-one. A fallback would weaken this one-to-one correspondence. As part of #138, I added the option to get the fields directly from the `RecordBatch` schema. That would simplify the common case to:
let fields = Vec::<Field>::from_value(&batch.schema())?;
let items: Vec<Record> = serde_arrow::from_arrow(&fields, batch.columns())?;
I also plan to add a [...] If it helps you, I could release the changes of #138 now as a minor release.
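The design argument above (fields mirror the arrays one-to-one, so derive them from the batch rather than from the Rust type) can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `Field`, `Column`, `fields_from_batch` and `check_one_to_one` are hypothetical and are not the `serde_arrow` API.

```rust
/// Hypothetical data types, standing in for Arrow's.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    I32,
    Utf8,
    LargeUtf8,
}

/// Hypothetical field description: a name plus a data type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Hypothetical columns of a record batch.
#[allow(dead_code)]
#[derive(Debug)]
enum Column {
    I32(Vec<i32>),
    Utf8(Vec<String>),
    LargeUtf8(Vec<String>),
}

impl Column {
    /// The data type this column actually has.
    fn data_type(&self) -> DataType {
        match self {
            Column::I32(_) => DataType::I32,
            Column::Utf8(_) => DataType::Utf8,
            Column::LargeUtf8(_) => DataType::LargeUtf8,
        }
    }
}

/// Derive the fields from the columns that are actually present, so the
/// declared types match the arrays by construction.
fn fields_from_batch(names: &[&str], columns: &[Column]) -> Vec<Field> {
    names
        .iter()
        .zip(columns)
        .map(|(name, column)| Field {
            name: name.to_string(),
            data_type: column.data_type(),
        })
        .collect()
}

/// Strict one-to-one check between declared fields and supplied columns.
fn check_one_to_one(fields: &[Field], columns: &[Column]) -> Result<(), String> {
    if fields.len() != columns.len() {
        return Err(format!(
            "{} fields but {} columns",
            fields.len(),
            columns.len()
        ));
    }
    for (field, column) in fields.iter().zip(columns) {
        if field.data_type != column.data_type() {
            return Err(format!(
                "field {:?}: declared {:?}, array is {:?}",
                field.name,
                field.data_type,
                column.data_type()
            ));
        }
    }
    Ok(())
}
```

With fields traced from the type (a string traced as `LargeUtf8`), the strict check rejects a batch holding a `Utf8` column. With fields produced by `fields_from_batch`, the same check passes and no fallback is needed.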
Thanks, I'll give it a try. No need for a release, I don't mind pulling directly from git.
You should be able to include the changes as
I'm a little confused by this. Your example here and the documentation show
I definitely need to include more docs :). The short version:
let fields = Vec::<Field>::from_value(&json!([
    {"name": "first", "data_type": "I32"},
    {"name": "second", "data_type": "Struct", "children": [{"name": "a", "data_type": "I32"}]},
]))?;
Oh. And regarding the name:
Got it, thanks. For what it's worth, I see a small (5 to 10%) slowdown of the deserialization code of my application when using this vs my PR. It is probably because I use a rather small batch size (256 or 512 records).
It's
Re. slowdown: yes, this way of getting the schema is not optimized at all. If speed is important, you can have a look at the second code example in this issue. This way of getting the fields should be much faster. The arrow crate exports various types, for example
(This PR contains the commit from #142 to avoid a merge conflict.)