Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UNNEST should not be allowed within CASE statements #13466

Open
2 tasks done
ThomasRolfsnes-EP opened this issue Aug 19, 2024 · 11 comments
Open
2 tasks done

UNNEST should not be allowed within CASE statements #13466

ThomasRolfsnes-EP opened this issue Aug 19, 2024 · 11 comments

Comments

@ThomasRolfsnes-EP
Copy link

ThomasRolfsnes-EP commented Aug 19, 2024

What happens?

When building a case statement it seems the THEN part is allways evaluated. This became evident when I had unnest(generate_series(..)) in one of my case-stements, resulting in unexpected output.

To Reproduce

select
case
 when True then 1
 when False then unnest(generate_series(0,2))
end as x;

Results in:
CleanShot 2024-08-19 at 10 54 43

While I expected:
CleanShot 2024-08-19 at 10 55 06

OS:

aarch64

DuckDB Version:

1.0.0

DuckDB Client:

Python

Full Name:

Thomas Rolfsnes

Affiliation:

Story House Egmont

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@pkoppstein
Copy link

pkoppstein commented Aug 19, 2024

  1. I've reproduced the problem using v1.0.1-dev4226
  2. I believe the bug is closely associated with the use of unnest here. For example, there's no problem with the similar query:
select case
   when True then [1]
   when False then generate_series(0,2)
end as x;

@szarnyasg szarnyasg changed the title THEN part of CASE-statement allways beeing evaluated THEN part of CASE-statement always beeing evaluated Aug 20, 2024
@ThomasRolfsnes-EP
Copy link
Author

@pkoppstein yes indeed, unnest() surfaces the issue here, since it actually modifies the number of rows in the query it becomes evident that the THEN part was evaluated even though WHEN was false

Another way to prove that this is how duckdb currently works is to just throw in costly calculatations that should not be evaluated, like so:

# without generate_series
%timeit -n 10000 (duckdb.query("select case when True then [1] end as x"))
# 44.8 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

false_stmt = ["when False then generate_series(0,100000)"]
# 1 generate_series
%timeit -n 10000 (duckdb.query(f"select case when True then [1] {' '.join(false_stmt*1)} end as x"))
# 69.4 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# 10 generate_series
%timeit -n 10000 (duckdb.query(f"select case when True then [1] {' '.join(false_stmt*10)} end as x"))
# 369 µs ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@ThomasRolfsnes-EP
Copy link
Author

FYI if() behaves exactly the same:

duckdb.query("""
select
  if(True,1,unnest(generate_series(0,2))) as x
""")

┌───────┐
│   x   │
│ int64 │
├───────┤
│     1 │
│     1 │
│     1 │
└───────┘

@Mytherin
Copy link
Collaborator

Thanks for the report!

UNNEST is executed completely separately from the regular SELECT, similar to an aggregate, so I would say this is more or less expected. Postgres throws an exception here:

postgres=# select
case
 when True then 1
 when False then unnest(array[0,1,2])
end as x;
ERROR:  set-returning functions are not allowed in CASE
LINE 4:  when False then unnest(array[0,1,2])
                         ^
-- HINT:  You might be able to move the set-returning function into a LATERAL FROM item.

I think that is also sane and we might want to switch to throwing an exception.

Another way to prove that this is how duckdb currently works is to just throw in costly calculatations that should not be evaluated, like so:

DuckDB short-circuits case statements. In your example this most likely does not happen because the individual case statements are constants (i.e. generate_series(0, 100000) is a constant) - and the constant folding optimizer is executed before the case statement short-circuiting. Here's a showcase of the case statement short-circuiting:

D create table ranges as select 999999999999999 range_bound;
D select case when true then [] else generate_series(0, range_bound) end as r from ranges;
┌─────────┐
│    r    │
│ int64[] │
├─────────┤
│ []      │
└─────────┘

-- or with multiple values:
D create or replace table ranges as select 0 as range_bound union all select 999999999999999 range_bound;
D select case when range_bound > 10 then [] else generate_series(0, range_bound) end as r from ranges;
┌─────────┐
│    r    │
│ int64[] │
├─────────┤
│ [0]     │
│ []      │
└─────────┘

@Mytherin Mytherin changed the title THEN part of CASE-statement always beeing evaluated CASE-statement does not short-circuit UNNEST evaluation Aug 21, 2024
@ThomasRolfsnes-EP
Copy link
Author

@Mytherin

I see, beeing able to use unnest() as part of the select statement is nice (bigquery also does not allow this).
So optimally case/if would also short-circuit when unnest is used, but if not viable to implement, an exception should absolutely be raised. E.g., this query should either return 1, or raise an exception:

select
  if(True,1,unnest([1,2,3]))

From a users perspective I don't think its intuitive or expected at all that the False part here has any impact on the output.

Thanks!

@Mytherin
Copy link
Collaborator

I agree that an exception is better, but the current behavior is consistent with the way aggregates work in SQL, which is the other way of changing result cardinality from within the SELECT, e.g.:

postgres=# select case when true then 1 else count(*) end from generate_series(0,1000);
 count 
-------
     1
(1 row)

Now you can argue that is also confusing and I would agree :)

@ThomasRolfsnes-EP
Copy link
Author

wow, ok I see, that one is also definitely weird, puts a dent in SQL beeing declarative for me :)

I'm fine with raising an exception in the case of unnest. But I'll leave the use-case I had here in case it might influence discussions down the line.

For certain values of a given column, I needed to duplicate those rows with an incremental counter in a new column, e.g:

with input as (
  select 1 as x
  union all
  select 5 as x
)
select
  x,
  case
    when x = 1 then x
    when x > 1 then unnest(generate_series(0,x))
  end as y
from input 

CleanShot 2024-08-23 at 10 11 12

@Mytherin
Copy link
Collaborator

Perhaps that can be solved by moving the case statement into the unnest, e.g.:

with input as (
  select 1 as x
  union all
  select 5 as x
)
select
  x,
  unnest(case
    when x = 1 then [x]
    when x > 1 then generate_series(0,x)
  end) as y
from input ;

@ThomasRolfsnes-EP
Copy link
Author

It does indeed! Nice one, thanks :)

@Mytherin Mytherin changed the title CASE-statement does not short-circuit UNNEST evaluation UNNEST should not be allowed within CASE statements Aug 23, 2024
@pkoppstein
Copy link

@Mytherin - You changed the title from

CASE-statement does not short-circuit UNNEST evaluation

to

UNNEST should not be allowed within CASE statements

But is that the appropriate conclusion? Consider:

select case when True then 1 else (select * from generate_series(3, 6)) end ;

├───────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                         1 │
└───────────────────────────────────────────────────────────────────────────────────────────┘

D select case when 1=2 then 1 else (select * from generate_series(3, 6)) end ;

├──────────────────────────────────────────────────────────────────────────────┤
│                                                                            3 │
└──────────────────────────────────────────────────────────────────────────────┘
D 

This seems quite correct:

(1) the logic of CASE governs the results;
(2) if one of the branches of the CASE statement yields a sequence of values, select the first.

@soerenwolfers
Copy link

Related: #14012 (which should be titled something like "Errors during constant folding should be caught and only be raised where required" )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants