Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Website]: DataFusion 26-34 blog post #457

Merged
merged 19 commits into from
Jan 19, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
345 changes: 345 additions & 0 deletions _posts/2024-01-25-datafusion-34.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,345 @@
---
layout: post
title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
date: "2024-01-01 00:00:00"
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

## Introduction

[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate creating other data centric systems, it has a
reasonably good experience out of the box as a [dataframe library] and
[command line SQL tool].

[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
[dataframe library]: https://arrow.apache.org/datafusion-python/
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html


[apache arrow datafusion]: https://arrow.apache.org/datafusion/
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/

We recently [released DataFusion 34.0.0]. This blog highlights some of the major
improvements since we [released DataFusion 26.0.0] (spoiler alert it is a lot)
and a preview of where the community will likely spend time in the next 6 months.

[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0

This may also be our last update blog post on the Apache Arrow Site. Future
updates will likely be on the DataFusion website as we are working to [graduate
to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
will help focus governance and project growth. Also exciting, our [first
DataFusion in person meetup] is planned for March 2024.

[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522

DataFusion is very much a community endeavor. The core thesis is that as a
community we can build much better and advanced technology than any of us a
individuals or companies could alone. In the last 6 months between `26.0.0` and
`34.0.0`, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different committers, created over 650 and closed 517
issues.

<!--
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
1009

$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
124

https://crates.io/crates/datafusion/26.0.0
DataFusion 26 released June 7, 2023

https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

Issues created in this time: 214 open, 437 closed
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17

Issues closes: 517
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+

PRs merged in this time 908
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
-->

The rest of this post highlights a small portion of how we have improved
DataFusion over the last 6 months previews where we are heading. You can
see a list of all changes in the detailed [CHANGELOG].

[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md

# Improved Performance 🚀

Performance is a key feature of DataFusion. We have made major improvements
since `25.0.0`, resulting in a 2x overall runtime improvement in the
[ClickBench] queries.

<!--
Source: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
-->

[ClickBench]: https://benchmark.clickhouse.com/

<figure style="text-align: center;">
<img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<figcaption>
Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0.
Note that several queries don't run on <code>25.0.0</code>, for various reasons such as requiring too much memory (Q33)
or unsupported SQL features.
</figcaption>
</figure>

Here are some of the specific enhancements we made:
* [2-3x Better aggregation performance with many distinct groups]
* Partially ordered grouping / streaming grouping
* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`]
* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
alamb marked this conversation as resolved.
Show resolved Hide resolved
* [Improved Join Performance]
* Eliminate redundant sorting with sort order aware optimizers
alamb marked this conversation as resolved.
Show resolved Hide resolved

[2-3x Better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
[Improved Join Performance]: https://github.com/apache/arrow-datafusion/pull/8126

alamb marked this conversation as resolved.
Show resolved Hide resolved
# New Features ✨

## DML / Insert / Creating Files
alamb marked this conversation as resolved.
Show resolved Hide resolved

DataFusion now supports writing data in parallel, to individual or multiple
files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats.

You can do this using [`CREATE EXTERNAL TABLE` statement] for example:

```sql
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW does any one have an example of writing to remote object storage (e.g. s3) handy that they could share so I can include it here?

@devinjdangelo do you have this setup ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COPY table to 's3://my_bucket/my_prefix' should work in datafusion-cli so long as the credentials are set up. I'll verify real quick...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok confirmed with caveat. Using COPY directly to object store from datafusion cli does not work, but insert to external table does. We probably need to add special logic to datafusion-cli to make copy to object store to work directly. That would be a neat feature to add.

For now this works:

export AWS_SECRET_ACCESS_KEY=...
export AWS_ACCESS_KEY_ID=...
export AWS_DEFAULT_REGION=...
datafusion-cli
❯ create external table remote_table2(a int, b int) stored as parquet location 's3://dfcli-test-bucket2/';
0 rows in set. Query took 0.001 seconds.

❯ insert into remote_table2 values (1,2);
+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.272 seconds.
❯ select * from remote_table2;
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+
1 row in set. Query took 0.131 seconds.

I see the parquet file in my s3 bucket as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @devinjdangelo -- I'll file a ticket about this in DataFusion later today

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8907 filed to make the COPY example work...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok confirmed with caveat. Using COPY directly to object store from datafusion cli does not work, but insert to external table does. We probably need to add special logic to datafusion-cli to make copy to object store to work directly. That would be a neat feature to add.

Thank you 🙏

❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
0 rows in set. Query took 0.003 seconds.

❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.024 seconds.
```

[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table

You can also create files using the [`COPY` command], similarly to [DuckDB’s `COPY`] command:

[`COPY` command]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html

```sql
❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.014 seconds.
```

```shell
$ cat /tmp/output.json
{"x":1}
{"x":2}
{"x":3}

$ python3
Python 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as ft
>>> table = ft.read_table("/tmp/output.arrow")
>>> print(table)
pyarrow.Table
x: int32
----
x: [[1,2,3]]
```

## Improved `STRUCT` and `ARRAY` support
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 is there any other improvements you think we should call out about struct/array support over the last 6 months?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope. I think all of them are new features compare to 6 months before.


DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY`
support, including a full range of [struct functions] and [array functions].

[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions

<!--
❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
-->

For example, you can now use `[]` syntax and `array_length`:
```sql
❯ SELECT column1,
column1[1] AS first_element,
array_length(column1) AS len
FROM my_table;
+-----------+---------------+-----+
| column1 | first_element | len |
+-----------+---------------+-----+
| [1, 2, 3] | 1 | 3 |
| [2] | 2 | 1 |
| [4, 5] | 4 | 2 |
+-----------+---------------+-----+
```

```sql
❯ SELECT column1, column1['c0'] FROM my_table;
+------------------+----------------------+
| column1 | my_table.column1[c0] |
+------------------+----------------------+
| {c0: foo, c1: 1} | foo |
| {c0: bar, c1: 2} | bar |
+------------------+----------------------+
2 rows in set. Query took 0.002 seconds.
```

## Other Features
* Support grouping on datasets that exceed memory size, with [Group by spill to disk]
* All operators now track and limit their memory consumption, including Joins

[Group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400

# Easier to Build Systems with DataFusion 🛠️

## Documentation
It is easier than ever to get started using DataFusion with
new [Library Users Guide] as well as significantly improved the [API documentation].

[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html
[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html

## User Defined Window and Table Functions
Also, in addition to DataFusion's [User Defined Scalar Functions], and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions]
and [User Defined Table Functions].

For example, the `datafusion-cli` implements the `parquet_metadata` function (TODO get a link to doc) command as a
user defined table function:

```sql
❯ SELECT
path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM
parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```


[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf
[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf
[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function


### Growth of DataFusion 📈
DataFusion has begun appearing more in the wild, such as:
* New projects built on DataFusion such as [lancedb], [GlareDB], and [Arroyo].
* Public talks such as [Apache Arrow Datafusion: Vectorized
Execution Framework For Maximum Performance] in [CommunityOverCode Asia 2023]
* Blogs posts such as [Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], [Apache Arrow, Arrow/DataFusion, AI-native Data Infra] and [A Guide to User-Defined Functions in Apache Arrow DataFusion]

[glaredb]: https://glaredb.com/
[lancedb]: https://lancedb.com/
[arroyo]: https://www.arroyo.dev/

[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I
[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178
[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan
[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/

We also [submitted a paper] to [SIGMOD 2024], one of the
premiere database conferences, describing DataFusion in a technically formal
style and making the case that it is possible to create a modular and extensive query engine
without sacrificing performance. We hope this paper will help people who may be
considering DataFusion to decide if it is a good fit for their needs.

[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782
[SIGMOD 2024]: https://2024.sigmod.org/

# DataFusion in 2024 🥳

This year some major initiatives contributors plan to focus on are:

1. *Modularity*: Make Datafusion even more modular, such as [unifying how
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ozankabak are there any plans you and your team may have that you want to share publically?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have anything to share just yet on features we will contribute in 2024 (but there will be many!). We will probably have something to publish in a month or two.

functions are defined], making it easier to customize DataFusion's behavior.

2. *Community Growth*: Graduate to our own top level Apache project, and
subsequently add more committers and PMC members to help the project grow.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We plan to write show-and-tell blog posts and videos that explain how one can use Datafusion in real-world use cases. We will try to partner with members of the community to create toy examples relating to their use cases and try to come up with demo scripts that offer guidance to others on how they can use Datafusion in similar contexts.

Maybe it could be a good idea to mention these upcoming show-and-tells as a near-future community growth effort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is great. I will include it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 95158bb


3. *Testing*: Improve CI infrastructure and test coverage, including fuzz
testing, and better functional and performance regression testing.

3. *Planning performance*: Reduce the time taken to plan queries, both [wide
tables of 1000s of columns], and in [general].

4. *Aggregate Performance*: Improve speed [aggregating "high cardinality"] data
when there are many (e.g. millions) of distinct groups

5. *Statistics*: [Better statistics handling] with an eye towards more
sophisticated analysis and cost models.

[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
[general]: https://github.com/apache/arrow-datafusion/issues/5637
[unifying how functions are defined]: https://github.com/apache/arrow-datafusion/issues/8045
Better statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227

# How to Get Involved

If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, documentation, bug reports, or
a PR with documentation, tests or code. A list of open issues
suitable for beginners is [here].

As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. Let us know if you are interested
in helping by dropping us a note via one of the methods listed in our [Communication Doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html
Binary file added img/datafusion-34.0.0/compare.png
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this plot is hard to read, mostly due to the wildly different numbers and because "execution time" often is an aggregate. For these kinds of plots, I see two possible ways to improve them:

  • log axis: use a log axis for time, because improvements are often not linear deltas but factors and a log-space would account for that nicely. That would also make the wildly different numbers easier to read. Drawback: people don't read log space very well.
  • relative factor: Only draw bars for DF v34 as a factor relative to v25 (which would be <1.0x in most cases) and on top of the bar (or at the base) print the seconds it took for v25. This tells the story of the change but also gives readers a baseline.

Copy link
Contributor Author

@alamb alamb Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this new chart. Here is what the page looks like rendered now:

Screenshot 2024-01-16 at 4 18 09 PM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's better. I think that the execution time (blue) shouldn't be a line plot because that type implies a connection between the neighboring points (like a time series where subsequent entries are indeed related). The linear interpolation between the measurements makes this even more misleading / "weird".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to make Google sheets do what I wanted, so I eventually just made two charts so that one shows the overall magnitude and one shoes the relative improvement

Screenshot 2024-01-19 at 7 06 09 AM

I am sure we can do better if we spent more time on this, but I think it is good enough for now

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.