Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Setting prefer_column_name_to_alias=1 makes column names refer to the wrong sub-query column #47552

Closed
eblio opened this issue Mar 14, 2023 · 4 comments · Fixed by #62457
Closed

Comments

@eblio
Copy link

eblio commented Mar 14, 2023

What's wrong

Using the setting prefer_column_name_to_alias=1.

With multiple SELECT sub-queries in which columns are aliased to the same name, using those alias in the top query seems to refer columns of the bottom query.

Does it reproduce on recent release?

I first detected the bug on release v22.7.3.5. I was able to reproduce it on ClickHouse Fiddle using the latest release v23.2.3.17-stable.

How to reproduce

Step by step reproduction : https://fiddle.clickhouse.com/6ae1c08a-ac7c-4b64-8843-cfc9a824c151

  • ClickHouse server version to use : v22.7.3.5 or v23.2.3.17-stable (may be any) ;
  • Interface to use : doesn't seem to matter ;
  • Using setting prefer_column_name_to_alias=1 ;
  • Preparation statements :
CREATE TABLE clickhouse_alias_issue_1 (
    id bigint,
    column_1 Nullable(Float32)
) Engine=Memory;

CREATE TABLE clickhouse_alias_issue_2 (
    id bigint,
    column_2 Nullable(Float32)
) Engine=Memory;

INSERT INTO `clickhouse_alias_issue_1`
VALUES (1, 100), (2, 200), (3, 300);

INSERT INTO `clickhouse_alias_issue_2`
VALUES (1, 10), (2, 20), (3, 30);
  • Query to run that lead to the unexpected result :
SELECT `column_1` / `column_2`, `id`
FROM (
    SELECT
        max(`column_1`) AS `column_1`,
        max(`column_2`) AS `column_2`,
        `id`
    FROM (
        SELECT
          max(`column_1`) AS `column_1`,
          NULL AS `column_2`,
          `id`
        FROM `clickhouse_alias_issue_1`
        GROUP BY
          `id`
        UNION ALL
        SELECT
          NULL AS `column_1`,
          max(`column_2`) AS `column_2`,
          `id`
        FROM `clickhouse_alias_issue_2`
        GROUP BY
          `id`
        SETTINGS prefer_column_name_to_alias=1
        ) as T1
    GROUP BY `id`
    SETTINGS prefer_column_name_to_alias=1
) as T2
WHERE `column_1` IS NOT NULL AND `column_2` IS NOT NULL
SETTINGS prefer_column_name_to_alias=1;

Expected behavior

The expected result is :

column_1 / column_2 id
10 3
10 2
10 1

While the actual result is empty :

column_1 / column_2 id

It seems like the columns used in the WHERE condition are referencing T1 columns instead of T2.

For your convenience, here are the returned values of T1 and T2 :

T1

column_1 column_2 id
\N 30 3
\N 20 2
\N 10 1
300 \N 3
200 \N 2
100 \N 1

T2

column_1 column_2 id
300 30 3
200 20 2
100 10 1
@eblio eblio added the potential bug To be reviewed by developers and confirmed/rejected. label Mar 14, 2023
@eteresh
Copy link

eteresh commented Jun 16, 2023

I experience the same issue with enabled prefer_column_name_to_alias. This simple query return 2 rows (as expected), when prefer_column_name_to_alias set to zero. When prefer_column_name_to_alias set to 1 query return only single row.
This query

WITH first_table AS (
    SELECT '2023-06-02' AS day
), second_table AS (
    SELECT '2023-06-03' AS day
)
SELECT day, day >= '2023-06-01' AS condition_from_where_filter_is_always_true
FROM (
    SELECT
        CASE
            WHEN first_table.day != '' THEN first_table.day
            ELSE second_table.day
        END AS day
    FROM first_table
    FULL OUTER JOIN second_table
    ON (first_table.day = second_table.day)
) AS virtual_table
WHERE day >= '2023-06-01'
SETTINGS prefer_column_name_to_alias = 1

returns 1 row:

┌─day────────┬─condition_from_where_filter_is_always_true─┐
│ 2023-06-02 │                                          1 │
└────────────┴────────────────────────────────────────────┘

after changing settings to SETTINGS prefer_column_name_to_alias = 0 the same query returns 2 rows:

┌─day────────┬─condition_from_where_filter_is_always_true─┐
│ 2023-06-02 │                                          1 │
│ 2023-06-03 │                                          1 │
└────────────┴────────────────────────────────────────────┘

@HSEhomework
Copy link

It is expected that prefer_column_name_to_alias has to pick the first alias after group by statements, otherwise it works incorrect.
Any workaround for this please?

@eblio
Copy link
Author

eblio commented Sep 25, 2023

@HSEhomework The workaround we found is to insert the sub-query content into a Memory() table, and only then performing the actual computation on that temporary table.

We are looking to ditch that setting as it's behavior seems unreliable.

@nikitamikhaylov nikitamikhaylov added st-fixed unexpected behaviour and removed potential bug To be reviewed by developers and confirmed/rejected. labels Apr 9, 2024
@nikitamikhaylov
Copy link
Member

Fixed by Analyzer: https://fiddle.clickhouse.com/dba9ab3d-4a6b-4034-91be-ea108a4c1fa1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants