Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursive CTEs: Stage 3 - add execution support #6

Draft
wants to merge 69 commits into
base: matt/feat/recursive-ctes/parser
Choose a base branch
from

Conversation

matthewgapp
Copy link
Owner

@matthewgapp matthewgapp commented Jan 12, 2024

WIP. Adds execution support to recursive CTEs

Todos

  • Fill out docs and cleanup. Add more tests.

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@matthewgapp matthewgapp changed the title add config flag for recursive ctes Recursive CTEs: Stage 3 - add execution support Jan 12, 2024
time,
1 as "val"
FROM
(SELECT DISTINCT "time" FROM "beg_account_balance")
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails. But the following changes cause the test to pass

  • removing time or 1 as "val" from the sub_cte projection
  • removing the reference to beg_account_balance in the FROM clause (effectively removing the sub-query from the sub_cte)
  • removing the FULL JOIN "sub_cte" on 1=1

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fails with index out of bounds: the len is 0 but the index is 0 but other incarnations of the error after playing around with the repro were

Internal error: PhysicalExpr Column references column 'time' at index 0 (zero-based) but input schema only has 0 columns: [].
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] WITH RECURSIVE "recursive_cte" AS 
External error: query failed: DataFusion error: ProjectionPushdown
caused by
Internal error: PhysicalExpr Column references column 'time' at index 0 (zero-based) but input schema only has 0 columns: [].
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] WITH RECURSIVE "recursive_cte" AS (
External error: query failed: DataFusion error: ProjectionPushdown
caused by
Internal error: PhysicalExpr Column references column 'LEAD(beg_account_balance.time,Int64(1)) ORDER BY [beg_account_balance.time ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW' at index 1 (zero-based) but input schema only has 1 columns: ["LEAD(beg_account_balance.time,Int64(1)) ORDER BY [beg_account_balance.time ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW"].
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
[SQL] WITH RECURSIVE "recursive_cte" AS (

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the columns referenced in the error messages are different from the repro because the repro at the time that I collected those errors was different.

@@ -282,7 +282,7 @@ config_namespace! {
/// Should DataFusion support recursive CTEs
/// Defaults to false since this feature is a work in progress and may not
/// behave as expected
pub enable_recursive_ctes: bool, default = false
pub enable_recursive_ctes: bool, default = true
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to set back to false. Need to provide a method to set this at the root config level

@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/parser branch 2 times, most recently from 98f77ff to 531d541 Compare January 12, 2024 06:08
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/execution-support branch from e0ea336 to 5257fe3 Compare January 12, 2024 06:10
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/parser branch 2 times, most recently from 9615dde to 460561c Compare January 14, 2024 00:40
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/execution-support branch 2 times, most recently from a993efb to f67763a Compare January 15, 2024 22:04
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/parser branch from 22661af to 8d38c92 Compare January 16, 2024 17:56
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/execution-support branch 2 times, most recently from 06fcbea to 68c9fe7 Compare January 18, 2024 19:56
tushushu and others added 11 commits January 18, 2024 15:24
* refactor

* generated files

* feat

* feat

* feat

* feat

* tests

* clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
…ion (apache#8839)

* add config flag for recursive ctes

update docs from script

update slt test for doc change

* restore testing pin

* add sql -> logical plan support

* impl cte as work table

* move SharedState to continuance

* impl WorkTableState

wip: readying pr to implement only logical plan

fix sql integration test

wip: add sql test for logical plan

wip: format test assertion

* wip: remove uncessary with qualifier method

some docs

more docs

* Add comments to `RecursiveQuery`

* Update datfusion-cli Cargo.lock

* Fix clippy

* better errors and comments

* add doc comment with rationale for create_cte_worktable method

* wip: tweak

---------

Co-authored-by: Andrew Lamb <[email protected]>
)

* Add support for parsing FixedSizeList type

* fix fmt

* support cast fixedsizelist from list

* clean comment

* support cast between NULL and FixedSizedLisr

* add test for FixedSizeList hash

* add test for cast fixedsizelist
…t empty (apache#8914)

* Fix aggregate_statistics

* Add more test
* Support to_timestamp with chrono formatting apache#5398

* Updated user guide's to_timestamp to include chrono formatting information apache#5398

* Minor comment update.

* Small documentation updates for to_timestamp functions.

* Cargo fmt and clippy improvements.

* Switched to assert and unwrap_err based on feedback

* Fixed assert, code compiles and runs as expected now.

* Fix fmt (again).

* Add additional to_timestamp tests covering usage with tables with and without valid formats.

* to_timestamp documentation fixes.

* - Changed internal_err! -> exec_err! for unsupported data type errors.
- Extracted out to_timestamp_impl method to reduce code duplication as per PR feedback.
- Extracted out validate_to_timestamp_data_types to reduce code duplication as per PR feedback.
- Added additional tests for argument validation and invalid arguments.
- Removed unnecessary shim function 'string_to_timestamp_nanos_with_format_shim'

* Resolved merge conflict, updated toStringXXX methods to reflect upstream change

* prettier

* Fix clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
…value (apache#8912)

* Minor: Document third argument of date_bin as optional

# Rationale
@mhilton  noticed this as part of an internal discussion where we were confused about the output values of `date_bin` for timestamps with timezones (as the default argument is for the unix EPOCH in UTC)

# Changes:
Document the third parameter to `date_bin` better

* prettier
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/execution-support branch from 68c9fe7 to b11ca8b Compare January 21, 2024 08:49
Weijun-H and others added 4 commits January 21, 2024 07:30
apache#8829)

* support array slice

* fix argument

* fix typo

* support from and to is negative

* fix conflict

* modify user doc

* refactor code

* fix clippy

* add 1-index test

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Fix expr partial ord test

* Add comment

* Resolve linter error
* Simplify windows builtin functions

* add field comments
Omega359 and others added 26 commits January 24, 2024 15:13
* Updated documentation concerning to_timestamp scalar function including adding missing link in the datafusion-examples/README file. Resolves apache#8980

* Fix example url.

* Applied prettier.
* Add helper function for scalar function

* Update datafusion/physical-expr/src/functions.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Fix

* Fix

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Fix optimize projections bug

* Add new dataframe test
apache#8982)

* optimize NOT Expr logic

* fix Null type

* fix test case

* fmt
* Optimize MIN/MAX when relation is empty

* Fix clippy
* [task apache#8213]Part of  Port tests in select.rs to sqllogictest

Signed-off-by: tangruilin <[email protected]>

* test large strings as well

* Restore parameter tests

* make test deterministic

---------

Signed-off-by: tangruilin <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
* Migrate last repartition query

* Add reference to issue

Co-authored-by: Andrew Lamb <[email protected]>

* Fix missing statement

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Update to sqlparser `0.42.0`

* Update datafusion Cargo.lock
* Make tests deterministic

* Add duplicate timestamps
* Add support for PG LIKE operators

* Bump sqlparser dep from branch to merge commit
…he#8964)

* fix: table alias will process uppercase alias into lowercase

* test: uppercase table alias

* test: add table alias sqllogictest

* test: move tests to sqllogictests

* test: update cte.slt

* chore: add test descriptions
…pache#8986)

* Document parallelism and thread scheduling in the architecture guid

* Apply suggestions from code review

Co-authored-by: Raphael Taylor-Davies <[email protected]>

---------

Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Fix none projections

* Update select.slt
add config flag for recursive ctes

update docs from script

update slt test for doc change

restore testing pin

add sql -> logical plan support

* impl cte as work table

* move SharedState to continuance

* impl WorkTableState

wip: readying pr to implement only logical plan

fix sql integration test

wip: add sql test for logical plan

wip: format test assertion

wip: remove uncessary with qualifier method

some docs

more docs

Add comments to `RecursiveQuery`

Update datfusion-cli Cargo.lock

Fix clippy

better errors and comments

add sql -> logical plan support

* impl cte as work table

* move SharedState to continuance

* impl WorkTableState

wip: readying pr to implement only logical plan

fix sql integration test

wip: add sql test for logical plan

wip: format test assertion

wip: remove uncessary with qualifier method

some docs

more docs

impl execution support

add sql -> logical plan support

* impl cte as work table

* move SharedState to continuance

* impl WorkTableState

wip: readying pr to implement only logical plan

partway through porting over isidentical's work

Continuing implementation with fixes and improvements

Lint fixes

ensure that repartitions are not added immediately after RecursiveExec
in the physical-plan

add trivial sqllogictest

more recursive tests

remove test that asserts recursive cte should fail

additional cte test

wip: remove tokio from physical plan dev deps

format cargo tomls

fix issue where CTE could not be referenced more than 1 time

wip: fixes after rebase but tpcds_physical_q54 keeps overflowing its stack

Impl NamedRelation as CteWorkTable

* impl cte as work table

* move SharedState to continuance

* impl WorkTableState

* upd

* assign work table state

* upd

* upd

fix min repro but still broken on larger test case

set config in sql logic tests

clean up cte slt tests

fixes

fix option

add group by test case and more test case files

wip

add window function recursive cte example

simplify stream impl for recrusive query stream

add explain to trivial test case

move setting of recursive ctes to slt file and add test to ensure multiple record batches are produced each iteration

remove tokio dep and remove mut

lint, comments and remove tokio stream

update submodule pin to feat branch that contains csvs

update submodule pin to feat branch that contains csvs
@matthewgapp matthewgapp force-pushed the matt/feat/recursive-ctes/execution-support branch from e65caee to 80069f7 Compare January 26, 2024 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.