
[SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed column names duplication handling consistent with withColumnRenamed #45431

Closed
wants to merge 2 commits into apache:master from zhengruifeng:connect_renames

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

Make withColumnsRenamed's handling of duplicated column names consistent with withColumnRenamed.
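
For context, these are the two APIs involved (a minimal sketch; the DataFrame and names below are illustrative, not taken from this PR's tests):

```
# Assumes an active SparkSession named `spark` (e.g. the PySpark shell).
# Both calls below rename `id` to `user_id`; this PR also aligns how they
# handle duplicated column names.
df = spark.createDataFrame([(1, "a")], ["id", "value"])

df.withColumnRenamed("id", "user_id")       # renames a single column
df.withColumnsRenamed({"id": "user_id"})    # renames columns from a dict mapping
```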

Why are the changes needed?

withColumnsRenamed checks the output dataframe for duplicated column names, which is inconsistent with withColumnRenamed:
1. withColumnRenamed doesn't do this check and can output a dataframe with duplicated column names;
2. when the input dataframe has duplicated column names, withColumnsRenamed always fails, even if the columns sharing a name are not touched at all:

```
In [8]: df1 = spark.createDataFrame([(1, "id2"),], ["id", "value"])
   ...: df2 = spark.createDataFrame([(1, 'x', 'id1'), ], ["id", 'a', "value"])
   ...: join = df2.join(df1, on=['id'], how='left')
   ...: join
Out[8]: DataFrame[id: bigint, a: string, value: string, value: string]

In [9]: join.withColumnRenamed('id', 'value')
Out[9]: DataFrame[value: bigint, a: string, value: string, value: string]

In [10]: join.withColumnsRenamed({'id' : 'value'})
...
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [11]: join.withColumnRenamed('a', 'b')
Out[11]: DataFrame[id: bigint, b: string, value: string, value: string]

In [12]: join.withColumnsRenamed({'a' : 'b'})
...
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [13]: join.withColumnRenamed('x', 'y')
Out[13]: DataFrame[id: bigint, a: string, value: string, value: string]

In [14]: join.withColumnsRenamed({'x' : 'y'})
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711

In [15]: join.withColumnRenamed('value', 'new_value')
Out[15]: DataFrame[id: bigint, a: string, new_value: string, new_value: string]

In [16]: join.withColumnsRenamed({'value' : 'new_value'})
AnalysisException: [COLUMN_ALREADY_EXISTS] The column `new_value` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
```
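
With this change, the failing withColumnsRenamed calls above are expected to mirror their withColumnRenamed counterparts instead of raising COLUMN_ALREADY_EXISTS. A sketch of the intended behavior (output schemas taken from the withColumnRenamed results above):

```
In [17]: join.withColumnsRenamed({'a' : 'b'})
Out[17]: DataFrame[id: bigint, b: string, value: string, value: string]

In [18]: join.withColumnsRenamed({'value' : 'new_value'})
Out[18]: DataFrame[id: bigint, a: string, new_value: string, new_value: string]
```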

Does this PR introduce any user-facing change?

Yes, withColumnsRenamed no longer raises COLUMN_ALREADY_EXISTS merely because the dataframe contains duplicated column names; it now behaves consistently with withColumnRenamed.

How was this patch tested?

updated tests
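
A minimal sketch of the kind of regression check the updated tests cover (the test name and the pytest-style spark fixture are hypothetical, not the actual test code in this PR):

```
def test_with_columns_renamed_duplicated_names(spark):
    # `spark` is an active SparkSession; build a DataFrame with duplicated
    # `value` columns via a join, as in the example above.
    df1 = spark.createDataFrame([(1, "id2")], ["id", "value"])
    df2 = spark.createDataFrame([(1, "x", "id1")], ["id", "a", "value"])
    joined = df2.join(df1, on=["id"], how="left")

    # Renaming an untouched column should no longer raise COLUMN_ALREADY_EXISTS,
    # and should produce the same columns as withColumnRenamed.
    assert (joined.withColumnsRenamed({"a": "b"}).columns
            == joined.withColumnRenamed("a", "b").columns)
```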

Was this patch authored or co-authored using generative AI tooling?

no

@HyukjinKwon
Member

cc @cloud-fan

@zhengruifeng zhengruifeng changed the title [SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed duplicated column name handling consisten with withColumnRenamed [SPARK-47322][PYTHON][CONNECT] Make withColumnsRenamed column names duplication handling consistent with withColumnRenamed Mar 8, 2024
@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the connect_renames branch March 8, 2024 12:49
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
… duplication handling consistent with `withColumnRenamed`

Closes apache#45431 from zhengruifeng/connect_renames.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>