Add Avro Support #4886

Open · tustvold opened this issue Oct 2, 2023 · 21 comments
Assignee: tustvold
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

@tustvold (Contributor) commented Oct 2, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Avro is a widely used binary, row-oriented data encoding. It is very similar to protobuf, and has seen very wide adoption in the data ecosystem, especially for streaming workloads.

Describe the solution you'd like

A new arrow_avro crate will provide vectorised support for reading and writing Avro data. The APIs should be designed in such a way as to work for the various container formats for Avro-encoded data, including single-object encoding, object container files, and messages, even if first-class support is not provided for all of these framing mechanisms.
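
As a sketch of the intended shape, reading an object container file might look something like this; `ReaderBuilder`, `with_batch_size` and `build` are hypothetical names used for illustration only, not a committed design:

```rust
use std::{error::Error, fs::File, io::BufReader};

use arrow_array::RecordBatch;

// Hypothetical arrow_avro API sketch: `ReaderBuilder`, `with_batch_size`
// and `build` are illustrative names only.
fn read_avro_file(path: &str) -> Result<Vec<RecordBatch>, Box<dyn Error>> {
    let file = BufReader::new(File::open(path)?);
    // The builder would map the Avro schema embedded in the container file
    // header to an Arrow schema, then decode rows into columnar batches.
    let reader = arrow_avro::ReaderBuilder::new()
        .with_batch_size(8192)
        .build(file)?;
    // The reader would act as an Iterator<Item = Result<RecordBatch, _>>.
    Ok(reader.collect::<Result<_, _>>()?)
}
```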

Describe alternatives you've considered

Additional context

DataFusion has some Avro support; however, it is based on the row-oriented apache_avro crate and is therefore likely highly sub-optimal.

FYI @Samrose-Ahmed @sarutak @devinjdangelo I intend to work on this, but any help with reviews / testing would be most welcome

@tustvold added the enhancement label and self-assigned this issue on Oct 2, 2023
@alamb (Contributor) commented Oct 2, 2023

I think having native Avro --> Arrow code will be good, and will continue to encourage additional use of Arrow in the ecosystem.

@bjchambers (Contributor):

This would be awesome!

@Samrose-Ahmed (Contributor) commented Oct 2, 2023

Awesome, happy to review any PR and help with tests.

@alamb (Contributor) commented Oct 30, 2023

I think this may have gotten bumped by other priorities; I believe @tustvold plans to wrap up whatever his current state is while he works on other things.

@tustvold (Contributor, Author):

I intend to keep working away at this in the background, but any help on reviews would be most appreciated

@Jefffrey (Contributor) commented Nov 13, 2023

FYI a duplicate issue also exists: #727

@alamb (Contributor) commented Nov 18, 2023

Today I learned that there is a version of an Avro reader/writer in arrow2:

https://github.com/jorgecarleitao/arrow2/tree/3ddc6a10c6fbc2d0f85a9f66eeb46112abd07029/src/io/avro

tustvold added a commit that referenced this issue on Jan 11, 2024:

* Map AvroSchema to Arrow (#4886)
* Docs
* Clippy
* Add unrecognized logicalType to metadata
* More docs
* Split out AvroDataType

@Ten0 commented Jan 15, 2024

Hello,
I have recently rewritten a Rust Avro crate from ~scratch due to performance and ergonomics issues with the apache-avro crate (it achieves roughly 10x the performance of the apache-avro crate).
This crate is soon hitting 1.0.
I suspect part of that work could be reused to solve this issue, and I could probably help with this.
Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?
Thanks 🙂

@alamb (Contributor) commented Jan 16, 2024

> Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?

I think this means "reading avro records into Arrays directly".

Here is the way it is implemented in DataFusion:
https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/reader.rs

There are more sophisticated ways of implementing this feature, for example the tape-based methods of the JSON and CSV readers in this crate.

What I would personally recommend doing is:

  1. Make a PR with a relatively basic (can be missing features) Avro reader/writer and use that to work out the desired interface (@tustvold may already have this sitting on a branch somewhere)
  2. Implement a basic reader/writer, perhaps using apache-avro or perhaps another implementation, including broad test coverage (a sketch of this step appears below)
  3. Work on optimizing the implementation (using the existing coverage)
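
A minimal sketch of step 2, assuming the apache-avro crate's `Value` decoding API and a fixed record shape (a long `id` and a string `name`); a real reader would map the Avro schema to Arrow generically:

```rust
use std::sync::Arc;

use apache_avro::{types::Value, Reader};
use arrow_array::builder::{Int64Builder, StringBuilder};
use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{DataType, Field, Schema};

// Decode an Avro object container file into a single RecordBatch. For
// brevity this handles only one fixed record shape (long `id`, string
// `name`); a real reader would derive the Arrow schema from the Avro one.
fn avro_to_batch(bytes: &[u8]) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let mut ids = Int64Builder::new();
    let mut names = StringBuilder::new();
    for value in Reader::new(bytes)? {
        match value? {
            Value::Record(fields) => {
                for (name, field) in fields {
                    match (name.as_str(), field) {
                        ("id", Value::Long(v)) => ids.append_value(v),
                        ("name", Value::String(v)) => names.append_value(v),
                        (other, _) => return Err(format!("unexpected field {other}").into()),
                    }
                }
            }
            other => return Err(format!("expected record, got {other:?}").into()),
        }
    }
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![Arc::new(ids.finish()), Arc::new(names.finish())];
    Ok(RecordBatch::try_new(schema, columns)?)
}
```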

@tustvold (Contributor, Author):

> (@tustvold may already have this sitting on a branch somewhere)

I have most of the bits and pieces and hope to push something up in the coming week or two. The major missing thing at the moment is tests.

> Would you mind explaining in a bit more detail what you mean by "vectorized support for reading and writing avro data", and pointing me to where that would plug into the code?

I had a brief look at https://github.com/Ten0/serde_avro_fast and I am not sure it would necessarily be a good fit for arrow-rs, as it appears to rely on knowing the data schema at compile time, but I could be completely off base here.

@Ten0 commented Jan 16, 2024

Heyy,

Thanks for the quick answers.

> it appears to rely on knowing the data schema at compile time

It doesn't 😊
It goes through serde, but that also works with unknown structure: for example, you can convert any Avro with a dynamic schema to JSON (and vice versa) just by plugging them together via serde_transcode.
It seems that the main idea would be to have a custom implementation of DeserializeSeed for arrow records 🤔
(Most likely that would also enable plugging in any serde format pretty easily afterwards, since you could likely plug in any deserializer. You might want to read the schema in advance, but maybe that's not even necessary if you determine the structure as you receive the first records.)
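
For reference, the core serde_transcode pattern is just the following (JSON to JSON, as in that crate's documentation); the idea would be that an Avro Deserializer from serde_avro_fast, or an Arrow-building Serializer, could slot into either side:

```rust
use std::io;

// Pipe any serde Deserializer straight into any serde Serializer without
// building an intermediate value tree.
fn main() {
    let input = r#"{"id": 1, "name": "alice"}"#;
    let mut de = serde_json::Deserializer::from_str(input);
    let mut ser = serde_json::Serializer::new(io::stdout());
    serde_transcode::transcode(&mut de, &mut ser).unwrap();
}
```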

@tustvold (Contributor, Author):

That's an interesting idea. Typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc. This would obviously not be possible with the serde model, which is inherently value-oriented, but it is also possible that the nature of the Avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach.

I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

@Ten0 commented Jan 16, 2024

> Typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc. This would obviously not be possible with the serde model, which is inherently value-oriented, but it is also possible that the nature of the Avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach.

Oh, that's interesting! I would have imagined that we would prepare all the vectors and push to each of them as we read each field. What are you referring to with regard to per-row overheads? (I'd like to read documentation on this topic; I'm familiar with branch prediction but not this.)

That being said, with Avro's encoding, where you have to fully deserialize each field of an object before you know where the next object starts, plus the block encoding and compression, it's very hard for me to imagine that reading the data several times to extract a single field each time would be the most performant approach. (But even if that were the case, it would look very close to driving the deserializer multiple times, just ignoring all the fields but one each time.)

> I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

Wonderful! 😊

> Here is the way it is implemented in datafusion

https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion/core/src/datasource/avro_to_arrow/reader.rs#L160C1-L168C2

So IIUC the interface we'd want is basically something that converts an arbitrary (Buf)Read into something that yields RecordBatches? If somebody confirms that I'm not on the wrong track here, I may give a go at implementing something for this based on serde, on which we could then notably plug Avro support via serde_avro_fast (and if that doesn't work with just serde, maybe add a pluggable part for Arrow schema specification that could be populated by reading the Avro schema). 🙂
Side note: how many RecordBatches? Why not just one? How does one choose this? Is it because DataFusion wants to be able to process very large files by stream-processing the batches?
Side note 2: I will notably have a look at serde_arrow as well for that purpose. I'm not sure to what extent that implementation is currently optimal for this purpose, but it seems to be under active development, and it looks like fundamentally serde_avro_fast -> serde_transcode -> serde_arrow is precisely what I'd be looking for. If that is the case, it looks like the implementation would be pretty simple 😄 (My first glance has me wonder why the implementation is so complex, but then I don't know too much about constructing arrow values.)

@tustvold (Contributor, Author) commented Jan 16, 2024

> Is this because datafusion wants to be able to process very large files by stream-processing the batches?

Yes, whilst this is more important for file formats like parquet that achieve much higher compression ratios than avro, having streaming iterators is pretty standard practice.

> I will notably have a look at serde_arrow as well for that purpose - I'm not sure to what extent that implementation is optimal for this purpose currently

You might also be interested in https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.Decoder.html#method.serialize
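
A usage sketch of that method, against the arrow-json 50 API linked above (the `Row` struct here is just an illustrative example):

```rust
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    id: i64,
    name: String,
}

// Feed serde-serializable rows through the tape-based JSON decoder and
// flush them out as a columnar RecordBatch.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.serialize(&[
        Row { id: 1, name: "a".into() },
        Row { id: 2, name: "b".into() },
    ])?;
    let batch = decoder.flush()?.expect("rows were buffered");
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```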

> My first glance has me wonder why the implementation is so complex but then I don't know too much about constructing arrow values

Converting between row-oriented and columnar formats is very fiddly, especially where they encode nullability differently 😅

> What are you referring to with regard to per-row overheads?

The major downside of row-oriented approaches is that unless you know the types at compile time, or have a JIT, you are paying the overhead of type-dispatch for every field. The whole premise of vectorised execution is that by instead operating on columns, you can amortise this cost across thousands of values within a given column, as well as make it easier for the compiler to optimise the code.
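
A tiny illustration of the difference (generic Rust, not code from this crate): with types known only at runtime, a row-oriented decoder branches on the type of every value, while a columnar one branches once per column and then runs a tight, monomorphic loop the compiler can optimise:

```rust
// One runtime type tag per value: the match is paid on every element.
#[derive(Clone, Copy)]
enum Value {
    Int64(i64),
    Float64(f64),
}

fn sum_rows(rows: &[Value]) -> f64 {
    rows.iter()
        .map(|v| match v {
            Value::Int64(i) => *i as f64,
            Value::Float64(f) => *f,
        })
        .sum()
}

// One runtime type tag per column: match once, then a monomorphic loop.
enum Column {
    Int64(Vec<i64>),
    Float64(Vec<f64>),
}

fn sum_column(col: &Column) -> f64 {
    match col {
        Column::Int64(v) => v.iter().map(|&i| i as f64).sum(),
        Column::Float64(v) => v.iter().sum(),
    }
}

fn main() {
    let rows = vec![Value::Int64(1), Value::Float64(2.5)];
    let col = Column::Float64(vec![1.0, 2.0, 3.0]);
    println!("{} {}", sum_rows(&rows), sum_column(&col));
}
```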

@Ten0 commented Jan 17, 2024

Thanks!

> My first glance has me wonder why the implementation is so complex

So I've asked, and it turns out that it might indeed be sub-optimal 🫤
That will probably have to be addressed before benchmarks with that pipeline are relevant. The good news is that the author appears to be very responsive and open to this 😊

> You might also be interested in https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.Decoder.html#method.serialize

The benchmarks here seem to suggest that this may be a less efficient approach than the one I drafted at chmp/serde_arrow#118 (comment), but I will also take a look; maybe the performance loss comes from elsewhere 🤔

@Ten0 commented Feb 1, 2024

> I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

Here's a quick POC for full-featured Avro to Arrow using serde_avro_fast, serde_arrow and serde_transcode:
https://github.com/Ten0/arrow_serde_avro/blob/0ea1292064f877b210211c09d001e7b7db02fbdf/tests/simple.rs#L60-L61
https://github.com/Ten0/arrow_serde_avro/blob/0ea1292064f877b210211c09d001e7b7db02fbdf/src/lib.rs#L8

It comes in at under 150 lines total ATM, and successfully loads Avro object container files into Arrow RecordBatches. (Schema conversion is pretty basic ATM, but it is straightforward to add more.)

Performance of serde_arrow should be very close to a zero-cost abstraction since chmp/serde_arrow#120.
There are just chmp/serde_arrow#120 (comment), chmp/serde_arrow#120 (comment) and chmp/serde_arrow#92 (comment) as clear areas of potential performance improvement for this particular integration, but those are reasonably quick to implement.

I'll probably PR that before benchmarks (if @chmp hasn't done it before 🚀)

> You might also be interested in arrow_json::Decoder::serialize

So I've checked, and it adds significant intermediate representations in the "tape" thing. It seems pretty clear that this is indeed why it's so far behind in the benchmarks.

@tustvold (Contributor, Author) commented Feb 1, 2024

Sounds promising, I'm afraid I'm not likely to have time to take a look for the next week or so, but I will do so when able.

I'm curious if you've tried comparing the performance of arrow-json vs serde_json + serde_transcode + serde_arrow. The reason I ask is that the motivation for the tape is to amortise dispatch overheads (among other things), so I am curious whether that is being outweighed by other factors.

@chmp commented Feb 1, 2024

Just a quick comment: the performance of arrow_json improved drastically compared to arrow2_convert in my latest benchmarks. The previous benchmarks ran with arrow_json=46, the most recent ones with arrow_json=50. Also, the benchmarks are somewhat sensitive to the structure of the records (e.g., nested structs vs. flat structs), so the exact numbers may change.

@tustvold (Contributor, Author) commented Feb 1, 2024

That's likely the result of #4861, which avoided needing to serialize numerics to strings. To be clear, the tape largely exists to serve the needs of the raw Arrow JSON decoder; it was only later hooked up to serde because we could do so with little additional effort, and only very basic effort has been made to optimise it for this use case.

I am, however, very interested in whether a combination of serde_json and serde_arrow is competitive with the raw JSON implementation, as that would open up some very interesting possibilities.

@chmp commented Feb 1, 2024

There are some limited tests for serde_json with serde_arrow, and my own use case is generating Arrow arrays from JSON (to then query them with DataFusion), so at least the basics should work. I will add writing benchmarks for this use case to my todo list.

@Ten0 commented Feb 17, 2024

A first draft of the benchmark seems to show that the tape achieves roughly the same performance as serde_arrow when plugged into serde_json, but according to the perf profiles that wouldn't hold once the BTreeMap lookups are removed, so it seems serde_arrow is faster than the tape as soon as serde is involved.

But more importantly, the specialized JSON-to-Arrow implementation in arrow_json performs much better than going through serde, and according to perf that seems to be notably because the serde path spends quite a lot of time validating that &strs are valid UTF-8, as serde requires this to be done eagerly, whereas the tailored JSON-to-Arrow implementation builds the Arrow array without doing any particular validation. The same would hold for a tailored Arrow/Avro implementation.

ATM it's using random inputs, though, which basically contain only escaped strings, so that might be hurting the benchmark's relevance.

ianmcook added a commit to apache/arrow that referenced this issue on Apr 26, 2024:

Updates the Implementation Status docs page to reflect that the Go implementation can read Avro files.

For the Rust implementation, I inferred from [this PR](apache/arrow-rs#4886) and [this comment](apache/arrow-rs#5562 (comment)) that we should hold off on indicating that the Rust implementation can read Avro files.

* GitHub Issue: #41386