apache · alamb · Jan 19, 2024 · Jan 7, 2024 · Jan 7, 2024 · Jan 8, 2024
diff --git a/_posts/2024-01-25-datafusion-34.0.0.md b/_posts/2024-01-25-datafusion-34.0.0.md
@@ -0,0 +1,345 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024"
+date: "2024-01-01 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+## Introduction
+
+[Apache Arrow DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate creating other data centric systems, it has a
+reasonably good experience out of the box as a [dataframe library] and
+[command line SQL tool].
+
+[DataFusion’s primary design goal]: https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals
+[dataframe library]: https://arrow.apache.org/datafusion-python/
+[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html
+
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+We recently [released DataFusion 34.0.0]. This blog highlights some of the major
+improvements since we [released DataFusion 26.0.0] (spoiler alert it is a lot)
+and a preview of where the community will likely spend time in the next 6 months.
+
+[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
+[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0
+
+This may also be our last update blog post on the Apache Arrow Site. Future
+updates will likely be on the DataFusion website as we are working to [graduate
+to a top level project] (Apache Arrow DataFusion → Apache DataFusion!) which
+will help focus governance and project growth. Also exciting, our [first
+DataFusion in person meetup] is planned for March 2024.
+
+[graduate to a top level project]: https://github.com/apache/arrow-datafusion/discussions/6475
+[first DataFusion in person meetup]: https://github.com/apache/arrow-datafusion/discussions/8522
+
+DataFusion is very much a community endeavor. The core thesis is that as a
+community we can build much better and advanced technology than any of us a
+individuals or companies could alone. In the last 6 months between `26.0.0` and
+`34.0.0`, community growth has been strong. We accepted and reviewed over a
+thousand PRs from 124 different committers, created over 650 and closed 517
+issues.
+
+<!--
+$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
+     1009
+
+$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
+      124
+
+https://crates.io/crates/datafusion/26.0.0
+DataFusion 26 released June 7, 2023
+
+https://crates.io/crates/datafusion/34.0.0
+DataFusion 34 released Dec 17, 2023
+
+Issues created in this time: 214 open, 437 closed
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17
+
+Issues closes: 517
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+
+
+PRs merged in this time 908
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
+-->
+
+The rest of this post highlights a small portion of how we have improved
+DataFusion over the last 6 months previews where we are heading. You can
+see a list of all changes in the detailed [CHANGELOG].
+
+[CHANGELOG]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+
+# Improved Performance 🚀 
+
+Performance is a key feature of DataFusion. We have made major improvements
+since `25.0.0`, resulting in a 2x overall runtime improvement in the
+[ClickBench] queries.
+
+<!--
+  Source: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
+  Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
+  Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
+-->
+
+[ClickBench]: https://benchmark.clickhouse.com/
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview.">
+  <figcaption>
+    Fig 1: Performance improvements between DataFusion 25.0.0 and DataFusion 34.0.0. 
+    Note that several queries don't run on <code>25.0.0</code>, for various reasons such as requiring too much memory (Q33) 
+    or unsupported SQL features.
+  </figcaption>
+</figure>
+
+Here are some of the specific enhancements we made:
+* [2-3x Better aggregation performance with many distinct groups]
+* Partially ordered grouping / streaming grouping
+* [Specialized operator for "TopK" `ORDER BY LIMIT XXX`] 
+* [Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]
+* [Improved Join Performance]
+* Eliminate redundant sorting with sort order aware optimizers
+
+[2-3x Better aggregation performance with many distinct groups]: https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
+[Specialized operator for `min(col) GROUP BY .. ORDER by min(col) LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7192
+[Specialized operator for "TopK" `ORDER BY LIMIT XXX`]: https://github.com/apache/arrow-datafusion/pull/7721
+[Improved Join Performance]: https://github.com/apache/arrow-datafusion/pull/8126
+
+# New Features ✨
+
+## DML / Insert / Creating Files
+
+DataFusion now supports writing data in parallel, to individual or multiple
+files, using `Parquet`, `CSV`, `JSON`, `ARROW` and user defined formats. 
+
+You can do this using [`CREATE EXTERNAL TABLE` statement] for example:
+
+```sql
+❯ CREATE EXTERNAL TABLE awesome_table(x INT) STORED AS PARQUET LOCATION '/tmp/my_awesome_table';
+0 rows in set. Query took 0.003 seconds.
+
+❯ INSERT INTO awesome_table SELECT x * 10 FROM my_source_table;
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.024 seconds.
+```
+
+[`CREATE EXTERNAL TABLE` statement]: https://arrow.apache.org/datafusion/user-guide/sql/ddl.html#create-external-table
+
+You can also create files using the [`COPY` command], similarly to [DuckDB’s `COPY`] command:
+
+[`COPY` command]: https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy
+[DuckDB’s `COPY`]: https://duckdb.org/docs/sql/statements/copy.html
+
+```sql
+❯ COPY (SELECT x + 1 FROM my_source_table) TO '/tmp/output.json';
++-------+
+| count |
++-------+
+| 3     |
++-------+
+1 row in set. Query took 0.014 seconds.
+```
+
+```shell
+$ cat /tmp/output.json
+{"x":1}
+{"x":2}
+{"x":3}
+
+$ python3
+Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import pyarrow.feather as ft
+>>> table = ft.read_table("/tmp/output.arrow")
+>>> print(table)
+pyarrow.Table
+x: int32
+----
+x: [[1,2,3]]
+```
+
+## Improved `STRUCT` and `ARRAY` support
+
+DataFusion `34.0.0` has much improved `STRUCT` and `ARRAY`
+support, including a full range of [struct functions] and [array functions].
+
+[struct functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
+[array functions]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions
+
+<!--
+❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
+--> 
+
+For example, you can now use `[]` syntax and `array_length`:
+```sql
+❯ SELECT column1, 
+         column1[1] AS first_element, 
+         array_length(column1) AS len 
+  FROM my_table;
++-----------+---------------+-----+
+| column1   | first_element | len |
++-----------+---------------+-----+
+| [1, 2, 3] | 1             | 3   |
+| [2]       | 2             | 1   |
+| [4, 5]    | 4             | 2   |
++-----------+---------------+-----+
+```
+
+```sql
+❯ SELECT column1, column1['c0'] FROM  my_table;
++------------------+----------------------+
+| column1          | my_table.column1[c0] |
++------------------+----------------------+
+| {c0: foo, c1: 1} | foo                  |
+| {c0: bar, c1: 2} | bar                  |
++------------------+----------------------+
+2 rows in set. Query took 0.002 seconds.
+```
+
+## Other Features
+* Support grouping on datasets that exceed memory size, with [Group by spill to disk]
+* All operators now track and limit their memory consumption, including Joins
+
+[Group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400
+
+# Easier to Build Systems with DataFusion 🛠️
+
+## Documentation
+It is easier than ever to get started using DataFusion with 
+new [Library Users Guide] as well as significantly improved the [API documentation]. 
+
+[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html
+[API documentation]: https://docs.rs/datafusion/latest/datafusion/index.html
+
+## User Defined Window and Table Functions
+Also, in addition to DataFusion's [User Defined Scalar Functions], and [User Defined Aggregate Functions], DataFusion now supports [User Defined Window Functions] 
+ and [User Defined Table Functions].
+
+For example, the `datafusion-cli` implements the `parquet_metadata` function (TODO get a link to doc) command as a
+user defined table function:
+
+```sql
+❯ SELECT 
+      path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size 
+FROM 
+      parquet_metadata('hits.parquet')
+WHERE path_in_schema = '"WatchID"' 
+LIMIT 3;
+
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
+| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
+| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
++----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
+3 rows in set. Query took 0.053 seconds.
+```
+
+
+[User Defined Scalar Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf
+[User Defined Aggregate Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf
+[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
+[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function
+
+
+### Growth of DataFusion 📈
+DataFusion has begun appearing more in the wild, such as:
+* New projects built on DataFusion such as [lancedb], [GlareDB], and [Arroyo].
+* Public talks such as [Apache Arrow Datafusion: Vectorized
+  Execution Framework For Maximum Performance] in [CommunityOverCode Asia 2023] 
+* Blogs posts such as [Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0], [Apache Arrow, Arrow/DataFusion, AI-native Data Infra] and [A Guide to User-Defined Functions in Apache Arrow DataFusion]
+
+[glaredb]: https://glaredb.com/
+[lancedb]: https://lancedb.com/
+[arroyo]: https://www.arroyo.dev/
+
+[Apache Arrow Datafusion: Vectorized Execution Framework For Maximum Performance]: https://www.youtube.com/watch?v=AJU9rdRNk9I
+[CommunityOverCode Asia 2023]: https://www.bagevent.com/event/8432178
+[Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0]: https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
+[Apache Arrow, Arrow/DataFusion, AI-native Data Infra]: https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan
+[A Guide to User-Defined Functions in Apache Arrow DataFusion]: https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/
+
+We also [submitted a paper] to [SIGMOD 2024], one of the
+premiere database conferences, describing DataFusion in a technically formal
+style and making the case that it is possible to create a modular and extensive query engine 
+without sacrificing performance. We hope this paper will help people who may be
+considering DataFusion to decide if it is a good fit for their needs.
+
+[submitted a paper]: https://github.com/apache/arrow-datafusion/issues/6782
+[SIGMOD 2024]: https://2024.sigmod.org/
+
+# DataFusion in 2024 🥳
+
+This year some major initiatives contributors plan to focus on are:
+
+1. *Modularity*: Make Datafusion even more modular, such as [unifying how
+   functions are defined], making it easier to customize DataFusion's behavior.
+
+2. *Community Growth*: Graduate to our own top level Apache project, and
+   subsequently add more committers and PMC members to help the project grow.
+
+3. *Testing*: Improve CI infrastructure and test coverage, including fuzz
+   testing, and better functional and performance regression testing.
+
+3. *Planning performance*: Reduce the time taken to plan queries, both [wide
+   tables of 1000s of columns], and in [general].
+
+4. *Aggregate Performance*: Improve speed [aggregating "high cardinality"] data
+   when there are many (e.g. millions) of distinct groups
+
+5. *Statistics*: [Better statistics handling] with an eye towards more
+   sophisticated analysis and cost models.
+
+[aggregating "high cardinality"]: https://github.com/apache/arrow-datafusion/issues/7000
+[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
+[general]: https://github.com/apache/arrow-datafusion/issues/5637
+[unifying how functions are defined]: https://github.com/apache/arrow-datafusion/issues/8045
+Better statistics handling]: https://github.com/apache/arrow-datafusion/issues/8227
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion we would love to have you
+join us. You can try out DataFusion on some of your own data and projects and
+let us know how it goes, contribute suggestions, documentation, bug reports, or
+a PR with documentation, tests or code. A list of open issues
+suitable for beginners is [here].
+
+As the community grows, we are also looking to restart biweekly calls /
+meetings. Timezones are always a challenge for such meetings, but we hope to
+have two calls that can work for most attendees. Let us know if you are interested
+in helping by dropping us a note via one of the methods listed in our [Communication Doc].
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html
diff --git a/img/datafusion-34.0.0/compare.png b/img/datafusion-34.0.0/compare.png