Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blog post improvements #3

Merged
merged 1 commit into from
Jan 8, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 27 additions & 18 deletions _posts/2024-01-25-datafusion-34.0.0.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,10 +38,10 @@ intensive analytics.
[rust]: https://www.rust-lang.org/

We recently [released DataFusion 34.0.0]. This blog highlights some of the major
improvements since we released [DataFusion 26.0.0] -– spoiler alert it is a lot
improvements since we [released DataFusion 26.0.0] -– spoiler alert it is a lot
– and a preview of where the community plans to head in the next 6 months.

[Apache Arrow DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
[released DataFusion 26.0.0]: https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/.
[released DataFusion 34.0.0]: https://crates.io/crates/datafusion/34.0.0

This is also likely to be our last update blog post on the apache arrow site –
Expand Down Expand Up @@ -96,8 +96,8 @@ Some key improvements we made:

## DML / Insert / Creating Files

DataFusion now supports writing, in parallel to Parquet, CSV, JSON (ARROW?) and
soon AVRO. This includes writing in parallel to individual and multiple files
DataFusion now supports writing in parallel to Parquet, CSV, JSON (ARROW?) and
soon AVRO. This includes writing in parallel to individual and multiple files.

You can do this via `CREATE EXTERNAL TABLE`, for example:

Expand All @@ -118,8 +118,10 @@ As well as the COPY command (modeled after DuckDB’s copy (TODO duckdb copy blo
## Improved Struct/array support

26.0.0 had basic support for structs and arrays, but 34.0.0 has much improved
support, and a full range of functions (TODO link to docs) for working with
structs and arrays.
support and a full range of functions for working with [structs] and [arrays].

[structs]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions
[arrays]: https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions

(@izveigor ❤️ and jayzhan, and wejun) –

Expand All @@ -128,24 +130,30 @@ structs and arrays.
```

# Other New Features:
* Group by spill to disk (TODO link)
* Group by spill to disk
* Join memory limiting (TODO find when this wa done)

[Group by spill to disk]: https://github.com/apache/arrow-datafusion/pull/7400

# Easier to use DataFusion to build systems

DataFusion’s primary design goal is as the basis used to create other systems
DataFusion’s primary design goal is to be used as the basis to create other systems
(TODO link to doc explaining that), though it is a reasonably good experience
out of the box as a dataframe library (TODO link to datafusion-python) and
command line SQL tool (todo link to datafusion-cli docs)
out of the box as a [dataframe library] and as a [command line SQL tool].

[dataframe library]: https://arrow.apache.org/datafusion-python/
[command line SQL tool]: https://arrow.apache.org/datafusion/user-guide/cli.html

Part of making it easier to use DataFusion is improving the documentation. To
that end we have created a new [Library Users Guide] to help people build the
applications on top of DataFusion.

[Library Users Guide]:https://arrow.apache.org/datafusion/library-user-guide/index.html

We have also added User Defined Window Functions (TODO link to documentation) , User Defined Table Functions (TODO link to doc)
We have also added [User Defined Window Functions] and [User Defined Table Functions].

[User Defined Window Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf
[User Defined Table Functions]: https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function


### Growth of DataFusion in the Wild
Expand All @@ -171,22 +179,23 @@ considering DataFusion to decide if it is a good fit for their needs.

# DataFusion in 2024

This year, some major initiatives we aim to undertake are:
This year some major initiatives we aim to undertake are:

1. Split apart the system so it is even more modular, including unifying how functions are defined to make it easier to mix and match. Improved APIs (trait based) APIs for defining extensions for Scalar, Aggregate, Window functions
1. Split apart the system so it is even more modular, including unifying how functions are defined to make it easier to mix and match. Improved APIs (trait based) APIs for defining extensions for Scalar, Aggregate and Window functions.

2. Better testing and CI infrastructure, including fuzz testing, and better functional and performance regression testing.
2. Performance: Improved planning performance (both for wide tables of 1000s of columns, as well as in general improving efficiency of the planning process) TODO find ticket.
3. Performance of performance on high cardinality grouping Also, our aggregation is still about 2x slower than state of the art (e.g. DuckDB) for “very high cardinality” aggregates, and we hope to improve there as well
2. Performance: Improved planning performance (both for [wide tables of 1000s of columns], as well as in [general improving efficiency of the planning process]).
3. Performance of performance on high cardinality grouping Also, our aggregation is still about 2x slower than state of the art (e.g. DuckDB) for “very high cardinality” aggregates, and we hope to improve there as well.
4. Better statistics handling such as XXX, YYY with an eye towards more sophisticated analysis and cost models.
5. Further invest in advanced optimization techniques such as additional predicate pruning, bloom filtering, etc (TODO link). This largely takes the form of logic rules that can take arbitrary predicates (EXMAPLE) and prove they can not be true given information at hand, either statistics from parquet files, bloom filter,s or other sources). This is a key building block for both fast performance within datafusion as well as other systems that use datafusion expressions

5. Further invest in advanced optimization techniques such as additional predicate pruning, bloom filtering, etc (TODO link). This largely takes the form of logic rules that can take arbitrary predicates (EXMAPLE) and prove they can not be true given information at hand, either statistics from parquet files, bloom filters or other sources. This is a key building block for both fast performance within datafusion as well as other systems that use datafusion expressions.

[wide tables of 1000s of columns]: https://github.com/apache/arrow-datafusion/issues/7698
[general improving efficiency of the planning process]: https://github.com/apache/arrow-datafusion/issues/5637


# How to Get Involved

If you are interested in contributing to DataFusion, we would love to have you
If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes, contribute suggestions, documentation, bug reports, or
contribute a PR with documentation, tests or code. A list of open issues
Expand Down