Recursive CTEs: Stage 3 - add execution support #6

Draft
wants to merge 69 commits into
base: matt/feat/recursive-ctes/parser
9b78efa
Add serde support for Arrow FileTypeWriterOptions (#8850)
tushushu Jan 18, 2024
3a9e23d
Improve datafusion-cli print format tests (#8896)
alamb Jan 19, 2024
a786921
Recursive CTEs: Stage 2 - add support for sql -> logical plan generat…
matthewgapp Jan 19, 2024
72b81f1
remove null in array-append adn array-prepend (#8901)
Weijun-H Jan 19, 2024
ae0f401
Add support for FixedSizeList type in `arrow_cast`, hashing (#8344)
Weijun-H Jan 19, 2024
d0c84cc
aggregate_statistics should only optimize MIN/MAX when relation is no…
viirya Jan 20, 2024
e7c0482
support to_timestamp with optional chrono formats (#8886)
Omega359 Jan 20, 2024
a4a9429
Minor: Document third argument of `date_bin` as optional and default …
alamb Jan 20, 2024
95e739c
Minor: distinguish parquet row group pruning test type (#8921)
Ted-Jiang Jan 20, 2024
f5a97d5
Add hash_join_single_partition_threshold_rows config (#8720)
maruschin Jan 20, 2024
b7e13a0
Prepare 35.0.0-rc1 (#8924)
andygrove Jan 20, 2024
0116e2a
feat: support `stride` in `array_slice`, change indexes to be`1` base…
Weijun-H Jan 21, 2024
97441cc
fix: recursive initialize method (#8937)
waynexia Jan 22, 2024
c0a69a7
Fix expr partial ord test (#8908)
mustafasrepo Jan 22, 2024
2b218be
Simplify windows builtin functions return type (#8920)
comphead Jan 22, 2024
38d5f75
Fix handling of nested leaf columns in parallel parquet writer (#8923)
devinjdangelo Jan 22, 2024
f2e6701
feat: emitting partial join results in `HashJoinStream` (#8020)
korowa Jan 22, 2024
c9935ae
fix: common_subexpr_eliminate rule should not apply to short-circuit …
haohuaijin Jan 22, 2024
edec418
Support GroupsAccumulator accumulator for udaf (#8892)
guojidan Jan 22, 2024
e986e15
Refactor partitioned_csv tests (#8919)
simicd Jan 22, 2024
6816d3f
[CI] Fix RUSTFLAGS (#8929)
Jefffrey Jan 22, 2024
180cbfb
Minor: Update datafusion-cli README to explain why it is not in the w…
alamb Jan 22, 2024
f39841b
Add syntax highlight to datafusion-cli (#8918)
trungda Jan 22, 2024
795c71f
Update substrait requirement from 0.22.1 to 0.23.0 (#8943)
dependabot[bot] Jan 22, 2024
903ef94
Deprecate make_scalar_function (#8878)
viirya Jan 22, 2024
827668a
Update project links (#8954)
comphead Jan 22, 2024
31b9b48
fix issue #8922 make row group test more readable (#8941)
Lordworms Jan 23, 2024
084fdfb
feat:implement sql style 'ends_with' and 'instr' string function (#8862)
zy-kkk Jan 23, 2024
04e147b
[MINOR]: Extract aggregate topk function to `aggregate_topk.slt` (#8948)
mustafasrepo Jan 23, 2024
558b3d6
Combine multiple `IN` lists in `ExprSimplifier` (#8949)
jayzhan211 Jan 23, 2024
ee7ab0b
Fix clippy failures (#8972)
alamb Jan 23, 2024
19ca7d2
feat: Support parquet bloom filter pruning for decimal128 (#8930)
Ted-Jiang Jan 24, 2024
b5db718
[MINOR]: Update create_window_expr to refer only input schema (#8945)
mustafasrepo Jan 24, 2024
78ca43c
Don't error in simplify_expressions rule (#8957)
haohuaijin Jan 24, 2024
2b84877
avoid unwrap (#8956)
Luv-Ray Jan 24, 2024
5d70c32
Change `Accumulator::evaluate` and `Accumulator::state` to take `&mut…
alamb Jan 24, 2024
bc0ba6a
Enhance simplifier by adding Canonicalize (#8780)
yyy1000 Jan 24, 2024
7ad929a
Find the correct fields when using page filter on `struct` fields in …
manoj-inukolunu Jan 24, 2024
94a6192
fix: allow placeholders to be substited when coercible (#8977)
erratic-pattern Jan 24, 2024
90e61c7
Minor: improve CatalogProvider documentation with rationale (#8968)
alamb Jan 24, 2024
d81c82d
Improve to_timestamp docs (#8981)
Omega359 Jan 24, 2024
d6ab343
Add helper function for processing scalar function input (#8962)
viirya Jan 24, 2024
4ac7de1
Fix optimize projections bug (#8960)
mustafasrepo Jan 25, 2024
4a3986a
NOT operator not return internal error when args are not boolean valu…
guojidan Jan 25, 2024
928162f
Minor: Add new Extended ClickBench benchmark queries (#8950)
alamb Jan 25, 2024
80a42bf
Minor: Add comments to MSRV CI check to help if it fails (#8995)
alamb Jan 25, 2024
7a0af5b
Minor: Document memory management design on MemoryPool (#8966)
alamb Jan 25, 2024
5e9c9a1
Fix LEAD/LAG window functions when default value null (#8989)
comphead Jan 25, 2024
eb6d63f
Optimize MIN/MAX when relation is empty (#8940)
viirya Jan 25, 2024
b97daf7
[task #8203] Port tests in joins.rs to sqllogictest (#8996)
Tangruilin Jan 25, 2024
fa65c68
[task #8213]Port tests in select.rs to sqllogictest (#8967)
Tangruilin Jan 25, 2024
6e4abf5
test: Port (last) `repartition.rs` query to sqllogictest (#8936)
simicd Jan 25, 2024
4d02cc0
Update to sqlparser `0.42.0` (#9000)
alamb Jan 25, 2024
8a4bad4
Add new test (#8992)
mustafasrepo Jan 26, 2024
bee7136
Make Topk aggregate tests deterministic (#8998)
mustafasrepo Jan 26, 2024
bd38142
Add support for Postgres LIKE operators (#8894)
gruuya Jan 26, 2024
35c7b2c
bug: Datafusion doesn't respect case sensitive table references (#8964)
xhwhis Jan 26, 2024
7005e2e
Document parallelism and thread scheduling in the architecture guide …
alamb Jan 26, 2024
ec6abec
Fix None Projections in Projection Pushdown (#9005)
berkaysynnada Jan 26, 2024
b3fe6aa
Lead and Lag window functions should support default value with data …
viirya Jan 26, 2024
c42bf48
chore: fix license badge in README (#9008)
suyanhanx Jan 26, 2024
d9cae58
rebase all execution and preceding recursive cte work
matthewgapp Jan 11, 2024
efed900
error if recursive ctes are nested
matthewgapp Jan 25, 2024
38f847d
error if recursive cte is referenced multiple times within the recurs…
matthewgapp Jan 25, 2024
6121248
wip
matthewgapp Jan 26, 2024
80069f7
fix rebase
matthewgapp Jan 26, 2024
812d64f
move testing files into main repo
matthewgapp Jan 26, 2024
39ab371
update testing pin to main pin
matthewgapp Jan 26, 2024
d100913
tweaks
matthewgapp Jan 26, 2024
2 changes: 1 addition & 1 deletion .github/actions/setup-rust-runtime/action.yaml
@@ -37,5 +37,5 @@ runs:
echo "SCCACHE_GHA_ENABLED=true" >> $GITHUB_ENV
echo "RUST_BACKTRACE=1" >> $GITHUB_ENV
echo "RUST_MIN_STACK=3000000" >> $GITHUB_ENV
echo "RUST_FLAGS=-C debuginfo=line-tables-only -C incremental=false" >> $GITHUB_ENV
echo "RUSTFLAGS=-C debuginfo=line-tables-only -C incremental=false" >> $GITHUB_ENV

10 changes: 8 additions & 2 deletions .github/workflows/rust.yml
@@ -488,7 +488,7 @@ jobs:

# Verify MSRV for the crates which are directly used by other projects.
msrv:
name: Verify MSRV
name: Verify MSRV (Min Supported Rust Version)
runs-on: ubuntu-latest
container:
image: amd64/rust
@@ -500,7 +500,13 @@ jobs:
run: cargo install cargo-msrv
- name: Check datafusion
working-directory: datafusion/core
run: cargo msrv verify
run: |
# If you encounter an error with any of the commands below
# it means some crate in your dependency tree has a higher
# MSRV (Min Supported Rust Version) than the one specified
# in the `rust-version` key of `Cargo.toml`. Check your
# dependencies or update the version in `Cargo.toml`
cargo msrv verify
- name: Check datafusion-substrait
working-directory: datafusion/substrait
run: cargo msrv verify
26 changes: 13 additions & 13 deletions Cargo.toml
@@ -29,7 +29,7 @@ license = "Apache-2.0"
readme = "README.md"
repository = "https://github.com/apache/arrow-datafusion"
rust-version = "1.70"
version = "34.0.0"
version = "35.0.0"

[workspace.dependencies]
arrow = { version = "50.0.0", features = ["prettyprint"] }
@@ -45,17 +45,17 @@ bytes = "1.4"
chrono = { version = "0.4.31", default-features = false }
ctor = "0.2.0"
dashmap = "5.4.0"
datafusion = { path = "datafusion/core", version = "34.0.0" }
datafusion-common = { path = "datafusion/common", version = "34.0.0" }
datafusion-execution = { path = "datafusion/execution", version = "34.0.0" }
datafusion-expr = { path = "datafusion/expr", version = "34.0.0" }
datafusion-optimizer = { path = "datafusion/optimizer", version = "34.0.0" }
datafusion-physical-expr = { path = "datafusion/physical-expr", version = "34.0.0" }
datafusion-physical-plan = { path = "datafusion/physical-plan", version = "34.0.0" }
datafusion-proto = { path = "datafusion/proto", version = "34.0.0" }
datafusion-sql = { path = "datafusion/sql", version = "34.0.0" }
datafusion-sqllogictest = { path = "datafusion/sqllogictest", version = "34.0.0" }
datafusion-substrait = { path = "datafusion/substrait", version = "34.0.0" }
datafusion = { path = "datafusion/core", version = "35.0.0" }
datafusion-common = { path = "datafusion/common", version = "35.0.0" }
datafusion-execution = { path = "datafusion/execution", version = "35.0.0" }
datafusion-expr = { path = "datafusion/expr", version = "35.0.0" }
datafusion-optimizer = { path = "datafusion/optimizer", version = "35.0.0" }
datafusion-physical-expr = { path = "datafusion/physical-expr", version = "35.0.0" }
datafusion-physical-plan = { path = "datafusion/physical-plan", version = "35.0.0" }
datafusion-proto = { path = "datafusion/proto", version = "35.0.0" }
datafusion-sql = { path = "datafusion/sql", version = "35.0.0" }
datafusion-sqllogictest = { path = "datafusion/sqllogictest", version = "35.0.0" }
datafusion-substrait = { path = "datafusion/substrait", version = "35.0.0" }
doc-comment = "0.3"
env_logger = "0.10"
futures = "0.3"
@@ -70,7 +70,7 @@ parquet = { version = "50.0.0", default-features = false, features = ["arrow", "
rand = "0.8"
rstest = "0.18.0"
serde_json = "1"
sqlparser = { version = "0.41.0", features = ["visitor"] }
sqlparser = { version = "0.43.0", features = ["visitor"] }
tempfile = "3"
thiserror = "1.0.44"
url = "2.2"
19 changes: 19 additions & 0 deletions README.md
@@ -19,6 +19,25 @@

# DataFusion

[![Crates.io][crates-badge]][crates-url]
[![Apache licensed][license-badge]][license-url]
[![Build Status][actions-badge]][actions-url]
[![Discord chat][discord-badge]][discord-url]

[crates-badge]: https://img.shields.io/crates/v/datafusion.svg
[crates-url]: https://crates.io/crates/datafusion
[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg
[license-url]: https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt
[actions-badge]: https://github.com/apache/arrow-datafusion/actions/workflows/rust.yml/badge.svg
[actions-url]: https://github.com/apache/arrow-datafusion/actions?query=branch%3Amain
[discord-badge]: https://img.shields.io/discord/885562378132000778.svg?logo=discord&style=flat-square
[discord-url]: https://discord.com/invite/Qw5gKqHxUM

[Website](https://github.com/apache/arrow-datafusion) |
[Guides](https://github.com/apache/arrow-datafusion/tree/main/docs) |
[API Docs](https://docs.rs/datafusion/latest/datafusion/) |
[Chat](https://discord.com/channels/885562378132000778/885562378132000781)

<img src="https://arrow.apache.org/datafusion/_images/DataFusion-Logo-Background-White.png" width="256" alt="logo"/>

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
8 changes: 4 additions & 4 deletions benchmarks/Cargo.toml
@@ -18,7 +18,7 @@
[package]
name = "datafusion-benchmarks"
description = "DataFusion Benchmarks"
version = "34.0.0"
version = "35.0.0"
edition = { workspace = true }
authors = ["Apache Arrow <[email protected]>"]
homepage = "https://github.com/apache/arrow-datafusion"
@@ -33,8 +33,8 @@ snmalloc = ["snmalloc-rs"]

[dependencies]
arrow = { workspace = true }
datafusion = { path = "../datafusion/core", version = "34.0.0" }
datafusion-common = { path = "../datafusion/common", version = "34.0.0" }
datafusion = { path = "../datafusion/core", version = "35.0.0" }
datafusion-common = { path = "../datafusion/common", version = "35.0.0" }
env_logger = { workspace = true }
futures = { workspace = true }
log = { workspace = true }
@@ -49,4 +49,4 @@ test-utils = { path = "../test-utils/", version = "0.1.0" }
tokio = { version = "^1.0", features = ["macros", "rt", "rt-multi-thread", "parking_lot"] }

[dev-dependencies]
datafusion-proto = { path = "../datafusion/proto", version = "34.0.0" }
datafusion-proto = { path = "../datafusion/proto", version = "35.0.0" }
177 changes: 167 additions & 10 deletions benchmarks/queries/clickbench/README.md
@@ -11,23 +11,180 @@ ClickBench is focused on aggregation and filtering performance (though it has no
[ClickBench repository]: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql

## "Extended" Queries
The "extended" queries are not part of the official ClickBench benchmark.
Instead they are used to test other DataFusion features that are not
covered by the standard benchmark

Each description below is for the corresponding line in `extended.sql` (line 1
is `Q0`, line 2 is `Q1`, etc.)
The "extended" queries are not part of the official ClickBench benchmark.
Instead they are used to test other DataFusion features that are not covered by
the standard benchmark. Each description below is for the corresponding line in
`extended.sql` (line 1 is `Q0`, line 2 is `Q1`, etc.)

### Q0: Data Exploration

**Question**: "How many distinct searches, mobile phones, and mobile phone models are there in the dataset?"

**Important Query Properties**: multiple `COUNT DISTINCT`s, with low and high cardinality
distinct string columns.

```sql
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
FROM hits;
```


### Q1: Data Exploration

**Question**: "How many distinct "hit color", "browser country" and "language" are there in the dataset?"

**Important Query Properties**: multiple `COUNT DISTINCT`s. All three are small strings (length either 1 or 2).

### Q0
Models initial Data exploration, to understand some statistics of data.
Import Query Properties: multiple `COUNT DISTINCT` on strings

```sql
SELECT
COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
FROM hits;
```

### Q2: Top 10 analysis

**Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
including the distinct counts of "hit color", "browser language",
and "social action"."

**Important Query Properties**: GROUP BY on a short string column, multiple `COUNT DISTINCT`s. Several of the columns hold small strings (length either 1 or 2).

```sql
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
FROM hits
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```
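
For readers less familiar with the `GROUP BY 1 ORDER BY 2 DESC LIMIT 10` shape of Q2, the following is a minimal Python sketch of the same semantics, reduced to a single `COUNT(DISTINCT ...)` column. It is an illustration only and is not part of the benchmark: the toy rows and the function name `top_countries_by_distinct_networks` are hypothetical, and DataFusion does not execute the query this way.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the `hits` table:
# each tuple is (BrowserCountry, SocialNetwork).
rows = [
    ("US", "a"), ("US", "b"), ("US", "a"),
    ("DE", "a"),
    ("FR", "a"), ("FR", "b"), ("FR", "c"),
]


def top_countries_by_distinct_networks(rows, limit=10):
    """Group by country, count distinct networks, sort descending, keep `limit` rows."""
    distinct = defaultdict(set)
    for country, network in rows:
        # A set per group keeps each network only once (COUNT DISTINCT).
        distinct[country].add(network)
    counts = [(country, len(networks)) for country, networks in distinct.items()]
    # ORDER BY 2 DESC: sort on the second output column, largest first.
    counts.sort(key=lambda pair: pair[1], reverse=True)
    # LIMIT 10
    return counts[:limit]


result = top_countries_by_distinct_networks(rows)
print(result)
```

The per-group set is what makes `COUNT DISTINCT` on string columns memory-hungry at high cardinality, which is presumably why the extended queries focus on it.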


## Data Notes

Here are some interesting statistics about the data used in the queries
Max length of `"SearchPhrase"` is 1113 characters
```sql
❯ select min(length("SearchPhrase")) as "SearchPhrase_len_min", max(length("SearchPhrase")) "SearchPhrase_len_max" from 'hits.parquet' limit 10;
+----------------------+----------------------+
| SearchPhrase_len_min | SearchPhrase_len_max |
+----------------------+----------------------+
| 0 | 1113 |
+----------------------+----------------------+
```


Here is the schema of the data
```sql
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | NO |
| JavaEnable | Int16 | NO |
| Title | Utf8 | NO |
| GoodEvent | Int16 | NO |
| EventTime | Int64 | NO |
| EventDate | UInt16 | NO |
| CounterID | Int32 | NO |
| ClientIP | Int32 | NO |
| RegionID | Int32 | NO |
| UserID | Int64 | NO |
| CounterClass | Int16 | NO |
| OS | Int16 | NO |
| UserAgent | Int16 | NO |
| URL | Utf8 | NO |
| Referer | Utf8 | NO |
| IsRefresh | Int16 | NO |
| RefererCategoryID | Int16 | NO |
| RefererRegionID | Int32 | NO |
| URLCategoryID | Int16 | NO |
| URLRegionID | Int32 | NO |
| ResolutionWidth | Int16 | NO |
| ResolutionHeight | Int16 | NO |
| ResolutionDepth | Int16 | NO |
| FlashMajor | Int16 | NO |
| FlashMinor | Int16 | NO |
| FlashMinor2 | Utf8 | NO |
| NetMajor | Int16 | NO |
| NetMinor | Int16 | NO |
| UserAgentMajor | Int16 | NO |
| UserAgentMinor | Utf8 | NO |
| CookieEnable | Int16 | NO |
| JavascriptEnable | Int16 | NO |
| IsMobile | Int16 | NO |
| MobilePhone | Int16 | NO |
| MobilePhoneModel | Utf8 | NO |
| Params | Utf8 | NO |
| IPNetworkID | Int32 | NO |
| TraficSourceID | Int16 | NO |
| SearchEngineID | Int16 | NO |
| SearchPhrase | Utf8 | NO |
| AdvEngineID | Int16 | NO |
| IsArtifical | Int16 | NO |
| WindowClientWidth | Int16 | NO |
| WindowClientHeight | Int16 | NO |
| ClientTimeZone | Int16 | NO |
| ClientEventTime | Int64 | NO |
| SilverlightVersion1 | Int16 | NO |
| SilverlightVersion2 | Int16 | NO |
| SilverlightVersion3 | Int32 | NO |
| SilverlightVersion4 | Int16 | NO |
| PageCharset | Utf8 | NO |
| CodeVersion | Int32 | NO |
| IsLink | Int16 | NO |
| IsDownload | Int16 | NO |
| IsNotBounce | Int16 | NO |
| FUniqID | Int64 | NO |
| OriginalURL | Utf8 | NO |
| HID | Int32 | NO |
| IsOldCounter | Int16 | NO |
| IsEvent | Int16 | NO |
| IsParameter | Int16 | NO |
| DontCountHits | Int16 | NO |
| WithHash | Int16 | NO |
| HitColor | Utf8 | NO |
| LocalEventTime | Int64 | NO |
| Age | Int16 | NO |
| Sex | Int16 | NO |
| Income | Int16 | NO |
| Interests | Int16 | NO |
| Robotness | Int16 | NO |
| RemoteIP | Int32 | NO |
| WindowName | Int32 | NO |
| OpenerName | Int32 | NO |
| HistoryLength | Int16 | NO |
| BrowserLanguage | Utf8 | NO |
| BrowserCountry | Utf8 | NO |
| SocialNetwork | Utf8 | NO |
| SocialAction | Utf8 | NO |
| HTTPError | Int16 | NO |
| SendTiming | Int32 | NO |
| DNSTiming | Int32 | NO |
| ConnectTiming | Int32 | NO |
| ResponseStartTiming | Int32 | NO |
| ResponseEndTiming | Int32 | NO |
| FetchTiming | Int32 | NO |
| SocialSourceNetworkID | Int16 | NO |
| SocialSourcePage | Utf8 | NO |
| ParamPrice | Int64 | NO |
| ParamOrderID | Utf8 | NO |
| ParamCurrency | Utf8 | NO |
| ParamCurrencyID | Int16 | NO |
| OpenstatServiceName | Utf8 | NO |
| OpenstatCampaignID | Utf8 | NO |
| OpenstatAdID | Utf8 | NO |
| OpenstatSourceID | Utf8 | NO |
| UTMSource | Utf8 | NO |
| UTMMedium | Utf8 | NO |
| UTMCampaign | Utf8 | NO |
| UTMContent | Utf8 | NO |
| UTMTerm | Utf8 | NO |
| FromTag | Utf8 | NO |
| HasGCLID | Int16 | NO |
| RefererHash | Int64 | NO |
| URLHash | Int64 | NO |
| CLID | Int32 | NO |
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.034 seconds.

```
4 changes: 3 additions & 1 deletion benchmarks/queries/clickbench/extended.sql
@@ -1 +1,3 @@
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
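
The README in this change states that line 1 of `extended.sql` is `Q0`, line 2 is `Q1`, and so on. A minimal sketch of that numbering convention, assuming one query per non-empty line, might look like the following. The helper name `label_queries` is hypothetical and is not part of the benchmark tooling.

```python
def label_queries(sql_text):
    """Pair each non-empty line with a zero-based label: line 1 -> Q0, line 2 -> Q1, ..."""
    lines = [line.strip() for line in sql_text.splitlines() if line.strip()]
    return {f"Q{i}": line for i, line in enumerate(lines)}


# The three queries from extended.sql after this change, one per line.
extended_sql = """\
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
"""

labels = label_queries(extended_sql)
for name, query in labels.items():
    print(name, query[:40])
```

Because the labels depend only on line order, inserting a query in the middle of the file would shift every later label, so new queries are safest appended at the end.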