Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Website]: DataFusion 26-34 blog post #457

Merged
merged 19 commits into from
Jan 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
369 changes: 369 additions & 0 deletions _posts/2024-01-19-datafusion-34.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,369 @@
---
layout: post
title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
date: "2024-01-19 00:00:00"
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

## Introduction

We recently [released DataFusion 34.0.0]. This blog highlights some of the major
improvements since we [released DataFusion 26.0.0] (spoiler alert there are many)
and a preview of where the community plans to focus in the next 6 months.

[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0

[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast data centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While [DataFusion’s primary design
goal] is to accelerate creating other data centric systems, it has a
reasonable experience directly out of the box as a [dataframe library] and
[command line SQL tool].

[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
[dataframe library]: https://arrow.apache.org/datafusion-python/
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html


[apache arrow datafusion]: https://arrow.apache.org/datafusion/
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/


This may also be our last update on the Apache Arrow Site. Future
updates will likely be on the DataFusion website as we are working to [graduate
to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
will help focus governance and project growth. Also exciting, our [first
DataFusion in person meetup] is planned for March 2024.

[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522

DataFusion is very much a community endeavor. Our core thesis is that as a
community we can build much more advanced technology than any of us as
individuals or companies could alone. In the last 6 months between `26.0.0` and
`34.0.0`, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different committers, created over 650 issues and closed 517
of them.
You can find a list of all changes in the detailed [CHANGELOG].

<!--
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
1009

$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
124

https://crates.io/crates/datafusion/26.0.0
DataFusion 26 released June 7, 2023

https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023

Issues created in this time: 214 open, 437 closed
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17

Issues closes: 517
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+

PRs merged in this time 908
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
-->



[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md

# Improved Performance 🚀

Performance is a key feature of DataFusion, DataFusion is
more than 2x faster on [ClickBench] compared to version `25.0.0`, as shown below:

<!--
Scripts: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34
Spreadsheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
-->

[ClickBench]: https://benchmark.clickhouse.com/

<figure style="text-align: center;">
<img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare-new.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<figcaption>
<b>Figure 1</b>: Performance improvement between <code>25.0.0</code> and <code>34.0.0</code> on ClickBench.
Note that DataFusion <code>25.0.0</code>, could not run several queries due to
unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33).
</figcaption>
</figure>

<figure style="text-align: center;">
<img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
<figcaption>
<b>Figure 2</b>: Total query runtime for DataFusion <code>34.0.0</code> and DataFusion <code>25.0.0</code>.
</figcaption>
</figure>


Here are some specific enhancements we have made to improve performance:
* [2-3x better aggregation performance with many distinct groups]
* Partially ordered grouping / streaming grouping
* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`]
* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
* [Improved join performance]
* Eliminate redundant sorting with sort order aware optimizers

[2-3x better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
[Improved join performance]: https://github.com/apache/arrow-datafusion/pull/8126
[Pushdown Filter Condition(s) into Cross join]: https://github.com/apache/arrow-datafusion/pull/8626
# New Features ✨

## DML / Insert / Creating Files

DataFusion now supports writing data in parallel, to individual or multiple
files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats.
[Benchmark results] show improvements up to 5x in some cases.

[Benchmark results]: https://github.com/apache/arrow-datafusion/pull/7655

Similarly to reading, data can now be written to any [`ObjectStore`]
implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local
files, and user defined implementations. While reading from [hive style
partitioned tables] has long been supported, it is now possible to write to such
tables as well.

[hive style partitioned tables]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#features

[`ObjectStore`]: https://docs.rs/object_store/0.9.0/object_store/index.html

For example, to write to a local file:

```sql
❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
0 rows in set. Query took 0.003 seconds.

❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.024 seconds.
```

[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table

You can also write to files with the [`COPY`], similarly to [DuckDB’s `COPY`]:

[`COPY`]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html

```sql
❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
+-------+
| count |
+-------+
| 3 |
+-------+
1 row in set. Query took 0.014 seconds.
```

```shell
$ cat /tmp/output.json
{"x":1}
{"x":2}
{"x":3}
```

## Improved `STRUCT` and `ARRAY` support

DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY`
support, including a full range of [struct functions] and [array functions].

[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions

<!--
❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
-->

For example, you can now use `[]` syntax and `array_length` to access and inspect arrays:
```sql
❯ SELECT column1,
column1[1] AS first_element,
array_length(column1) AS len
FROM my_table;
+-----------+---------------+-----+
| column1 | first_element | len |
+-----------+---------------+-----+
| [1, 2, 3] | 1 | 3 |
| [2] | 2 | 1 |
| [4, 5] | 4 | 2 |
+-----------+---------------+-----+
```

```sql
❯ SELECT column1, column1['c0'] FROM my_table;
+------------------+----------------------+
| column1 | my_table.column1[c0] |
+------------------+----------------------+
| {c0: foo, c1: 1} | foo |
| {c0: bar, c1: 2} | bar |
+------------------+----------------------+
2 rows in set. Query took 0.002 seconds.
```

## Other Features
Other notable features include:
* Support aggregating datasets that exceed memory size, with [group by spill to disk]
* All operators now track and limit their memory consumption, including Joins

[group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400

# Building Systems is Easier with DataFusion 🛠️

## Documentation
It is easier than ever to get started using DataFusion with the
new [Library Users Guide] as well as significantly improved the [API documentation].

[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html
[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html

## User Defined Window and Table Functions
In addition to DataFusion's [User Defined Scalar Functions], and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions]
and [User Defined Table Functions].

For example, [the `datafusion-cli`] implements a DuckDB style [`parquet_metadata`]
function as a user defined table function ([source code here]):

[the `datafusion-cli`]: https://arrow.apache.org/datafusion/user-guide/cli.html
[`parquet_metadata`]: https://arrow.apache.org/datafusion/user-guide/cli.html#supported-sql
[source code here]: https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion-cli/src/functions.rs#L222-L248

```sql
❯ SELECT
path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM
parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min | stats_max | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID" | 0 | 450560 | 4611687214012840539 | 9223369186199968220 | 3883759 |
| "WatchID" | 1 | 612174 | 4611689135232456464 | 9223371478009085789 | 5176803 |
| "WatchID" | 2 | 344064 | 4611692774829951781 | 9223363791697310021 | 3031680 |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```


[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf
[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf
[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function


### Growth of DataFusion 📈
DataFusion has been appearing more publically in the wild. For example
* New projects built using DataFusion such as [lancedb], [GlareDB], [Arroyo], and [optd].
* Public talks such as [Apache Arrow Datafusion: Vectorized
Execution Framework For Maximum Performance] in [CommunityOverCode Asia 2023]
* Blogs posts such as [Apache Arrow, Arrow/DataFusion, AI-native Data Infra],
[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], and
[A Guide to User-Defined Functions in Apache Arrow DataFusion]

[glaredb]: https://glaredb.com/
[lancedb]: https://lancedb.com/
[arroyo]: https://www.arroyo.dev/
[optd]: https://github.com/cmu-db/optd

[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I
[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178
[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan
[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/

We have also [submitted a paper] to [SIGMOD 2024], one of the
premiere database conferences, describing DataFusion in a technically formal
style and making the case that it is possible to create a modular and extensive query engine
without sacrificing performance. We hope this paper helps people
evaluating DataFusion for their needs understand it better.

[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782
[SIGMOD 2024]: https://2024.sigmod.org/

# DataFusion in 2024 🥳

Some major initiatives from contributors we know of this year are:

1. *Modularity*: Make DataFusion even more modular, such as [unifying
built in and user functions], making it easier to customize
DataFusion's behavior.

2. *Community Growth*: Graduate to our own top level Apache project, and
subsequently add more committers and PMC members to keep pace with project
growth.

5. *Use case white papers*: Write blog posts and videos explaining
how to use DataFusion for real-world use cases.

3. *Testing*: Improve CI infrastructure and test coverage, more fuzz
testing, and better functional and performance regression testing.

3. *Planning Time*: Reduce the time taken to plan queries, both [wide
tables of 1000s of columns], and in [general].

4. *Aggregate Performance*: Improve the speed of [aggregating "high cardinality"] data
when there are many (e.g. millions) of distinct groups.

5. *Statistics*: [Improved statistics handling] with an eye towards more
sophisticated expression analysis and cost models.

[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
[general]: https://github.com/apache/arrow-datafusion/issues/5637
[unifying built in and user functions]: https://github.com/apache/arrow-datafusion/issues/8045
[Improved statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227

# How to Get Involved

If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, documentation, bug reports, or
a PR with documentation, tests or code. A list of open issues
suitable for beginners is [here].

As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of
the methods listed in our [Communication Doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html
Binary file added img/datafusion-34.0.0/compare-new.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/datafusion-34.0.0/compare.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.