
[SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed column names duplication handling consistent with withColumnRenamed #45431

Closed
wants to merge 2 commits into apache:master from zhengruifeng:connect_renames

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

Make withColumnsRenamed's handling of duplicated column names consistent with withColumnRenamed.
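
For context, these are the two APIs involved (a minimal sketch; the DataFrame and names below are illustrative, not taken from this PR's tests):

```
# Assumes an active SparkSession named `spark` (e.g. the PySpark shell).
# Both calls below rename `id` to `user_id`; this PR also aligns how they
# handle duplicated column names.
df = spark.createDataFrame([(1, "a")], ["id", "value"])

df.withColumnRenamed("id", "user_id")       # renames a single column
df.withColumnsRenamed({"id": "user_id"})    # renames columns from a dict mapping
```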

Why are the changes needed?

withColumnsRenamed checks the output dataframe for duplicated column names, which is inconsistent with withColumnRenamed:
1. withColumnRenamed doesn't do this check and can output a dataframe with duplicated column names;
2. when the input dataframe has duplicated column names, withColumnsRenamed always fails, even if the columns sharing a name are not touched at all:

```
In [8]: df1 = spark.createDataFrame([(1, "id2"),], ["id", "value"])
   ...: df2 = spark.createDataFrame([(1, 'x', 'id1'), ], ["id", 'a', "value"])
   ...: join = df2.join(df1, on=['id'], how='left')
   ...: join
Out[8]: DataFrame[id: bigint, a: string, value: string, value: string]

In [9]: join.withColumnRenamed('id', 'value')
Out[9]: DataFrame[value: bigint, a: string, value: string, value: string]

In [10]: join.withColumnsRenamed({'id' : 'value'})
...
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [11]: join.withColumnRenamed('a', 'b')
Out[11]: DataFrame[id: bigint, b: string, value: string, value: string]

In [12]: join.withColumnsRenamed({'a' : 'b'})
...
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [13]: join.withColumnRenamed('x', 'y')
Out[13]: DataFrame[id: bigint, a: string, value: string, value: string]

In [14]: join.withColumnsRenamed({'x' : 'y'})
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [15]: join.withColumnRenamed('value', 'new_value')
Out[15]: DataFrame[id: bigint, a: string, new_value: string, new_value: string]

In [16]: join.withColumnsRenamed({'value' : 'new_value'})
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `new_value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
```
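
With this change, the failing withColumnsRenamed calls above are expected to mirror their withColumnRenamed counterparts instead of raising COLUMN_ALREADY_EXISTS. A sketch of the intended behavior (output schemas taken from the withColumnRenamed results above):

```
In [17]: join.withColumnsRenamed({'a' : 'b'})
Out[17]: DataFrame[id: bigint, b: string, value: string, value: string]

In [18]: join.withColumnsRenamed({'value' : 'new_value'})
Out[18]: DataFrame[id: bigint, a: string, new_value: string, new_value: string]
```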

Does this PR introduce any user-facing change?

Yes, withColumnsRenamed no longer raises COLUMN_ALREADY_EXISTS merely because the dataframe contains duplicated column names; it now behaves consistently with withColumnRenamed.

How was this patch tested?

updated tests
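
A minimal sketch of the kind of regression check the updated tests cover (the test name and the pytest-style spark fixture are hypothetical, not the actual test code in this PR):

```
def test_with_columns_renamed_duplicated_names(spark):
    # `spark` is an active SparkSession; build a DataFrame with duplicated
    # `value` columns via a join, as in the example above.
    df1 = spark.createDataFrame([(1, "id2")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "x", "id1")], ["id", "a", "value"])
    joined = df2.join(df1, on=["id"], how="left")

    # Renaming an untouched column should no longer raise COLUMN_ALREADY_EXISTS,
    # and should produce the same columns as withColumnRenamed.
    assert (joined.withColumnsRenamed({"a": "b"}).columns
            == joined.withColumnRenamed("a", "b").columns)
```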

Was this patch authored or co-authored using generative AI tooling?

no

@HyukjinKwon
Member

cc @cloud-fan

@zhengruifeng zhengruifeng changed the title [SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed duplicated column name handling consisten with withColumnRenamed [SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed column names duplication handling consistent with withColumnRenamed Mar 8, 2024
@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_renames branch March 8, 2024 12:49
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
… duplication handling consistent with `withColumnRenamed`

Closes apache#45431 from zhengruifeng/connect_renames.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>