Add Avro Support #4886

Open · tustvold opened this issue Oct 2, 2023 · 21 comments
Assignee: tustvold
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

@tustvold (Contributor) commented Oct 2, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Avro is a widely used binary, row-oriented data encoding. It is very similar to protobuf, and has seen very wide adoption in the data ecosystem, especially for streaming workloads.

Describe the solution you'd like

A new arrow_avro crate will provide vectorised support for reading and writing Avro data. The APIs should be designed in such a way as to work for the various container formats for Avro-encoded data, including single-object encoding, object container files, and messages, even if first-class support is not provided for all of these framing mechanisms.
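
As a sketch of the intended shape, reading an object container file might look something like this; `ReaderBuilder`, `with_batch_size` and `build` are hypothetical names used for illustration only, not a committed design:

```rust
use std::{error::Error, fs::File, io::BufReader};

use arrow_array::RecordBatch;

// Hypothetical arrow_avro API sketch: `ReaderBuilder`, `with_batch_size`
// and `build` are illustrative names only.
fn read_avro_file(path: &str) -> Result<Vec<RecordBatch>, Box<dyn Error>> {
    let file = BufReader::new(File::open(path)?);
    // The builder would map the Avro schema embedded in the container file
    // header to an Arrow schema, then decode rows into columnar batches.
    let reader = arrow_avro::ReaderBuilder::new()
        .with_batch_size(8192)
        .build(file)?;
    // The reader would act as an Iterator<Item = Result<RecordBatch, _>>.
    Ok(reader.collect::<Result<_, _>>()?)
}
```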

Describe alternatives you've considered

Additional context

DataFusion has some Avro support; however, it is based on the row-oriented apache_avro crate and is therefore likely highly sub-optimal.

FYI @Samrose-Ahmed @sarutak @devinjdangelo I intend to work on this, but any help with reviews / testing would be most welcome

@tustvold added the enhancement label and self-assigned this issue on Oct 2, 2023
@alamb (Contributor) commented Oct 2, 2023

I think having native Avro --> Arrow code will be good, and will continue to encourage additional use of Arrow in the ecosystem.

@bjchambers (Contributor):

This would be awesome!

@Samrose-Ahmed (Contributor) commented Oct 2, 2023

Awesome, happy to review any PR and help with tests.

@alamb (Contributor) commented Oct 30, 2023

I think this may have gotten bumped by other priorities; I believe @tustvold plans to wrap up whatever his current state is while he works on other things.

@tustvold (Contributor, Author):

I intend to keep working away at this in the background, but any help on reviews would be most appreciated

@Jefffrey (Contributor) commented Nov 13, 2023

FYI a duplicate issue also exists: #727

@alamb (Contributor) commented Nov 18, 2023

Today I learned that there is a version of an Avro reader/writer in arrow2:

https://github.com/jorgecarleitao/arrow2/tree/3ddc6a10c6fbc2d0f85a9f66eeb46112abd07029/src/io/avro

tustvold added a commit that referenced this issue on Jan 11, 2024:

* Map AvroSchema to Arrow (#4886)
* Docs
* Clippy
* Add unrecognized logicalType to metadata
* More docs
* Split out AvroDataType

@Ten0 commented Jan 15, 2024

Hello,
I have recently rewritten a Rust Avro crate from ~scratch due to performance and ergonomics issues with the apache-avro crate (it achieves roughly 10x the performance of the apache-avro crate).
This crate is soon hitting 1.0.
I suspect part of that work could be reused to solve this issue, and I could probably help with this.
Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?
Thanks 🙂

@alamb (Contributor) commented Jan 16, 2024

> Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?

I think this means "reading avro records into Arrays directly".

Here is the way it is implemented in DataFusion:
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/reader.rs

There are more sophisticated ways of implementing this feature, for example the tape-based methods of the JSON and CSV readers in this crate.

What I would personally recommend doing is:

  1. Make a PR with a relatively basic (can be missing features) Avro reader/writer and use that to work out the desired interface (@tustvold may already have this sitting on a branch somewhere)
  2. Implement a basic reader/writer, perhaps using apache-avro or perhaps another implementation, including broad test coverage (a sketch of this step appears below)
  3. Work on optimizing the implementation (using the existing coverage)
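
A minimal sketch of step 2, assuming the apache-avro crate's `Value` decoding API and a fixed record shape (a long `id` and a string `name`); a real reader would map the Avro schema to Arrow generically:

```rust
use std::sync::Arc;

use apache_avro::{types::Value, Reader};
use arrow_array::builder::{Int64Builder, StringBuilder};
use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

// Decode an Avro object container file into a single RecordBatch. For
// brevity this handles only one fixed record shape (long `id`, string
// `name`); a real reader would derive the Arrow schema from the Avro one.
fn avro_to_batch(bytes: &[u8]) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let mut ids = Int64Builder::new();
    let mut names = StringBuilder::new();
    for value in Reader::new(bytes)? {
        match value? {
            Value::Record(fields) => {
                for (name, field) in fields {
                    match (name.as_str(), field) {
                        ("id", Value::Long(v)) => ids.append_value(v),
                        ("name", Value::String(v)) => names.append_value(v),
                        (other, _) => return Err(format!("unexpected field {other}").into()),
                    }
                }
            }
            other => return Err(format!("expected record, got {other:?}").into()),
        }
    }
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![Arc::new(ids.finish()), Arc::new(names.finish())];
    Ok(RecordBatch::try_new(schema, columns)?)
}
```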

@tustvold (Contributor, Author):

> (@tustvold may already have this sitting on a branch somewhere)

I have most of the bits and pieces and hope to push something up in the coming week or two. The major missing thing at the moment is tests.

> Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?

I had a brief look at https://github.com/Ten0/serde_avro_fast and I am not sure it would necessarily be a good fit for arrow-rs, as it appears to rely on knowing the data schema at compile time, but I could be completely off base here.

@Ten0 commented Jan 16, 2024

Heyy,

Thanks for the quick answers.

> it appears to rely on knowing the data schema at compile time

It doesn't 😊
It goes through serde, but that also works with unknown structure: for example, you can convert any Avro with a dynamic schema to JSON (and vice versa) just by plugging them together via serde_transcode.
It seems that the main idea would be to have a custom implementation of DeserializeSeed for arrow records 🤔
(Most likely that would also enable plugging in any serde format pretty easily afterwards, since you could likely plug in any deserializer. You might want to read the schema in advance, but maybe that's not even necessary if you determine the structure as you receive the first records.)
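
For reference, the core serde_transcode pattern is just the following (JSON to JSON, as in that crate's documentation); the idea would be that an Avro Deserializer from serde_avro_fast, or an Arrow-building Serializer, could slot into either side:

```rust
use std::io;

// Pipe any serde Deserializer straight into any serde Serializer without
// building an intermediate value tree.
fn main() {
    let input = r#"{"id": 1, "name": "alice"}"#;
    let mut de = serde_json::Deserializer::from_str(input);
    let mut ser = serde_json::Serializer::new(io::stdout());
    serde_transcode::transcode(&mut de, &mut ser).unwrap();
}
```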

@tustvold (Contributor, Author):

That's an interesting idea. Typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc. This would obviously not be possible with the serde model, which is inherently value-oriented, but it is also possible that the nature of the Avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach.

I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

@Ten0 commented Jan 16, 2024

> Typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc. This would obviously not be possible with the serde model, which is inherently value-oriented, but it is also possible that the nature of the Avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach.

Oh, that's interesting! I would have imagined that we would prepare all the vectors and push to each of them as we read each field. What are you referring to with regard to per-row overheads? (I'd like to read documentation on this topic; I'm familiar with branch prediction but not this.)

That being said, with Avro's encoding, where you have to fully deserialize each field of an object before you know where the next object starts, plus the block encoding and compression, it's very hard for me to imagine that reading the data several times to extract a single field each time would be the most performant approach. (But even if that were the case, it would look very close to driving the deserializer multiple times, just ignoring all the fields but one each time.)

> I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

Wonderful! 😊

> Here is the way it is implemented in datafusion

https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion/core/src/datasource/avro_to_arrow/reader.rs#L160C1-L168C2

So IIUC the interface we'd want is basically something that converts an arbitrary (Buf)Read into something that yields RecordBatches? If somebody confirms that I'm not on the wrong track here, I may give a go at implementing something for this based on serde, on which we could then notably plug Avro support via serde_avro_fast (and if that doesn't work with just serde, maybe add a pluggable part for Arrow schema specification that could be populated by reading the Avro schema). 🙂
Side note: how many RecordBatches? Why not just one? How does one choose this? Is it because DataFusion wants to be able to process very large files by stream-processing the batches?
Side note 2: I will notably have a look at serde_arrow as well for that purpose. I'm not sure to what extent that implementation is currently optimal for this purpose, but it seems to be under active development, and it looks like fundamentally serde_avro_fast -> serde_transcode -> serde_arrow is precisely what I'd be looking for. If that is the case, it looks like the implementation would be pretty simple 😄 (My first glance has me wonder why the implementation is so complex, but then I don't know too much about constructing arrow values.)

@tustvold (Contributor, Author) commented Jan 16, 2024

> Is this because datafusion wants to be able to process very large files by stream-processing the batches?

Yes, whilst this is more important for file formats like parquet that achieve much higher compression ratios than avro, having streaming iterators is pretty standard practice.

> I will notably have a look at serde_arrow as well for that purpose - I'm not sure to what extent that implementation is optimal for this purpose currently

You might also be interested in https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.Decoder.html#method.serialize
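
A usage sketch of that method, against the arrow-json 50 API linked above (the `Row` struct here is just an illustrative example):

```rust
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    id: i64,
    name: String,
}

// Feed serde-serializable rows through the tape-based JSON decoder and
// flush them out as a columnar RecordBatch.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.serialize(&[
        Row { id: 1, name: "a".into() },
        Row { id: 2, name: "b".into() },
    ])?;
    let batch = decoder.flush()?.expect("rows were buffered");
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```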

> My first glance has me wonder why the implementation is so complex but then I don't know too much about constructing arrow values

Converting between row-oriented and columnar formats is very fiddly, especially where they encode nullability differently 😅

> What are you referring to with regard to per-row overheads?

The major downside of row-oriented approaches is that unless you know the types at compile time, or have a JIT, you are paying the overhead of type-dispatch for every field. The whole premise of vectorised execution is that by instead operating on columns, you can amortise this cost across thousands of values within a given column, as well as make it easier for the compiler to optimise the code.
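
A tiny illustration of the difference (generic Rust, not code from this crate): with types known only at runtime, a row-oriented decoder branches on the type of every value, while a columnar one branches once per column and then runs a tight, monomorphic loop the compiler can optimise:

```rust
// One runtime type tag per value: the match is paid on every element.
#[derive(Clone, Copy)]
enum Value {
    Int64(i64),
    Float64(f64),
}

fn sum_rows(rows: &[Value]) -> f64 {
    rows.iter()
        .map(|v| match v {
            Value::Int64(i) => *i as f64,
            Value::Float64(f) => *f,
        })
        .sum()
}

// One runtime type tag per column: match once, then a monomorphic loop.
enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

fn sum_column(col: &Column) -> f64 {
    match col {
        Column::Int64(v) => v.iter().map(|&i| i as f64).sum(),
        Column::Float64(v) => v.iter().sum(),
    }
}

fn main() {
    let rows = vec![Value::Int64(1), Value::Float64(2.5)];
    let col = Column::Float64(vec![1.0, 2.0, 3.0]);
    println!("{} {}", sum_rows(&rows), sum_column(&col));
}
```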

@Ten0 commented Jan 17, 2024

Thanks!

> My first glance has me wonder why the implementation is so complex

So I've asked, and it turns out that it might indeed be sub-optimal 🫤
That will probably have to be addressed before benchmarks with that pipeline are relevant. The good news is that the author appears to be very responsive and open to this 😊

> You might also be interested in https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.Decoder.html#method.serialize

The benchmarks here seem to suggest that this may be a less efficient approach than the one I drafted at chmp/serde_arrow#118 (comment), but I will also take a look; maybe the performance loss comes from elsewhere 🤔

@Ten0 commented Feb 1, 2024

> I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

Here's a quick POC for full-featured Avro to Arrow using serde_avro_fast, serde_arrow and serde_transcode:
https://github.com/Ten0/arrow_serde_avro/blob/0ea1292064f877b210211c09d001e7b7db02fbdf/tests/simple.rs#L60-L61
https://github.com/Ten0/arrow_serde_avro/blob/0ea1292064f877b210211c09d001e7b7db02fbdf/src/lib.rs#L8

It comes in at under 150 lines total ATM, and successfully loads Avro object container files into Arrow RecordBatches. (Schema conversion is pretty basic ATM, but it is straightforward to add more.)

Performance of serde_arrow should be very close to a zero-cost abstraction since chmp/serde_arrow#120.
There are just chmp/serde_arrow#120 (comment), chmp/serde_arrow#120 (comment) and chmp/serde_arrow#92 (comment) as clear areas of potential performance improvement for this particular integration, but those are reasonably quick to implement.

I'll probably PR that before benchmarks (if @chmp hasn't done it before 🚀)

> You might also be interested in arrow_json::Decoder::serialize

So I've checked, and it adds significant intermediate representations in the "tape" thing. It seems pretty clear that this is indeed why it's so far behind in the benchmarks.

@tustvold (Contributor, Author) commented Feb 1, 2024

Sounds promising, I'm afraid I'm not likely to have time to take a look for the next week or so, but I will do so when able.

I'm curious if you've tried comparing the performance of arrow-json vs serde_json + serde_transcode + serde_arrow. The reason I ask is that the motivation for the tape is to amortise dispatch overheads (among other things), so I am curious whether that is being outweighed by other factors.

@chmp commented Feb 1, 2024

Just a quick comment: the performance of arrow_json improved drastically compared to arrow2_convert in my latest benchmarks. The previous benchmarks ran with arrow_json=46, the most recent ones with arrow_json=50. Also, the benchmarks are somewhat sensitive to the structure of the records (e.g., nested structs vs. flat structs), so the exact numbers may change.

@tustvold (Contributor, Author) commented Feb 1, 2024

That's likely the result of #4861, which avoided needing to serialize numerics to strings. To be clear, the tape largely exists to serve the needs of the raw Arrow JSON decoder; it was only later hooked up to serde because we could do so with little additional effort, and only very basic effort has been made to optimise it for this use case.

I am, however, very interested in whether a combination of serde_json and serde_arrow is competitive with the raw JSON implementation, as that would open up some very interesting possibilities.

@chmp commented Feb 1, 2024

There are some limited tests for serde_json with serde_arrow, and my own use case is generating Arrow arrays from JSON (to then query them with DataFusion), so at least the basics should work. I will add writing benchmarks for this use case to my todo list.

@Ten0 commented Feb 17, 2024

A first draft of the benchmark seems to show that the tape achieves roughly the same performance as serde_arrow when plugged into serde_json, but according to the perf profiles that wouldn't hold once the BTreeMap lookups are removed, so it seems serde_arrow is faster than the tape as soon as serde is involved.

But more importantly, the specialized JSON-to-Arrow implementation in arrow_json performs much better than going through serde, and according to perf that seems to be notably because the serde path spends quite a lot of time validating that &strs are valid UTF-8, as serde requires this to be done eagerly, whereas the tailored JSON-to-Arrow implementation builds the Arrow array without doing any particular validation. The same would hold for a tailored Arrow/Avro implementation.

ATM it's using random inputs, though, which basically contain only escaped strings, so that might be hurting the benchmark's relevance.

ianmcook added a commit to apache/arrow that referenced this issue on Apr 26, 2024:

Updates the Implementation Status docs page to reflect that the Go implementation can read Avro files.

For the Rust implementation, I inferred from [this PR](apache/arrow-rs#4886) and [this comment](apache/arrow-rs#5562 (comment)) that we should hold off on indicating that the Rust implementation can read Avro files.

* GitHub Issue: #41386