Index.to_frame() seems not working properly. #1647

Closed
itholic opened this issue Jul 13, 2020 · 9 comments
Labels
bug Something isn't working

Comments

@itholic
Contributor

itholic commented Jul 13, 2020

>>> kdf = ks.DataFrame({"Koalas": [1, 2, 3]}, index=pd.Index([1, 2, 3]))
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
   0    NEW
1  1  200.0
3  3    NaN
2  2  300.0

The above code works well.

But a DataFrame of the same shape created with Index.to_frame() does not seem to work properly.

>>> kdf = ks.Index([1, 2, 3]).to_frame(name="Koalas")
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;
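
Spark resolves columns by name, so a frame carrying two columns that are both called `Koalas` cannot tell which one a reference means. Below is a minimal pure-Python sketch of that check. It assumes (the thread does not state this outright) that the index column and the data column end up sharing one Spark name after `Index.to_frame(name="Koalas")`; `find_ambiguous` is a hypothetical helper, not a Koalas or Spark function.

```python
from collections import Counter


def find_ambiguous(column_names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(column_names)
    return [name for name in counts if counts[name] > 1]


# Assumed layout after Index.to_frame(name="Koalas"): the index column and
# the data column share one name (an illustration, not Koalas internals).
index_columns = ["Koalas"]
data_columns = ["Koalas"]

print(find_ambiguous(index_columns + data_columns))
```

With distinct names for the index and data columns the helper returns an empty list, which is the situation in the first example that works.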
@itholic changed the title from "MultiIndex.to_frame() are not working properly" to "MultiIndex.to_frame() are not working properly internally" on Jul 13, 2020
@itholic
Contributor Author

itholic commented Jul 13, 2020

Let me fix this in order to complete #1630.

@ueshin
Collaborator

ueshin commented Jul 13, 2020

What's the problem?

@itholic
Contributor Author

itholic commented Jul 13, 2020

I thought we should always manage the data_spark_columns and index_spark_columns separately because the internal Spark columns should be changed when the external Koalas columns are changed.

For example,

>>> kdf = ks.DataFrame({"Koalas": [1, 2, 3]}, index=pd.Index([1, 2, 3]))
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
   0    NEW
1  1  200.0
3  3    NaN
2  2  300.0

The above code works well.

But a DataFrame of the same shape created with Index.to_frame() does not seem to work properly.

>>> kdf = ks.Index([1, 2, 3]).to_frame(name="Koalas")
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;

Hmm, but this probably isn't a problem related to the internal Spark frame. Let me investigate it more carefully.

@itholic changed the title from "MultiIndex.to_frame() are not working properly internally" to "Index.to_frame() seems not working properly." on Jul 13, 2020
@itholic
Contributor Author

itholic commented Jul 13, 2020

I just changed the title and description of this issue.

@itholic
Contributor Author

itholic commented Jul 13, 2020

Anyway, could I ask an additional question about InternalFrame, which I often find confusing?

Don't we need to update the values of the internal Spark columns after modifying the external values?

For example, let's say we have a Series like the one below.

>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64

>>> kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

Now let's filter it with where to make a new Series named new_kser.

>>> new_kser = kser.where(kser < 2)
>>> new_kser
0    1.0
1    NaN
2    NaN
Name: 0, dtype: float64

>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

Even though new_kser has the values [1.0, NaN, NaN], its internal Spark frame still has the values [1, 2, 3] from the original kser.

Is this behaviour not a problem because InternalFrame is only used for internal purposes?

@ueshin
Collaborator

ueshin commented Jul 13, 2020

InternalFrame works not only with spark_frame but with all of its metadata.
data_spark_columns contains all the changes made since spark_frame was last created.

Even if spark_frame shows the values of [1, 2, 3], data_spark_columns has the operation of kser.where(kser < 2).

>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

>>> new_kser._internal.data_spark_columns
[Column<b'CASE WHEN CASE WHEN ((0 < 2) IS NULL) THEN false ELSE (0 < 2) END AS `0` THEN 0 ELSE NaN END AS `0` AS `0`'>]

We always use both to show the actual values.

>>> new_kser._internal.spark_frame.select(new_kser._internal.data_spark_columns).show()
+---+
|  0|
+---+
|1.0|
|NaN|
|NaN|
+---+
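
The relationship described above can be modelled without Spark. The following is a toy sketch only: `ToyInternalFrame` and its attributes are hypothetical stand-ins for `spark_frame` and `data_spark_columns`, not the Koalas implementation. It keeps the stored rows untouched and records `where` as a pending expression that is applied only when values are read.

```python
import math


class ToyInternalFrame:
    """Toy model: stored rows plus one pending expression per data column."""

    def __init__(self, rows, data_columns):
        self.rows = rows                  # plays the role of spark_frame
        self.data_columns = data_columns  # plays the role of data_spark_columns

    def where(self, predicate):
        # Keep the stored rows untouched; only wrap the column expressions.
        wrapped = {
            name: (lambda row, expr=expr:
                   float(expr(row)) if predicate(row) else math.nan)
            for name, expr in self.data_columns.items()
        }
        return ToyInternalFrame(self.rows, wrapped)

    def values(self):
        # Similar in spirit to spark_frame.select(data_spark_columns).
        return [[expr(row) for expr in self.data_columns.values()]
                for row in self.rows]


rows = [{"0": 1}, {"0": 2}, {"0": 3}]
kser = ToyInternalFrame(rows, {"0": lambda row: row["0"]})
new_kser = kser.where(lambda row: row["0"] < 2)

print(new_kser.rows)      # stored rows are unchanged
print(new_kser.values())  # pending expression is applied on read
```

In this model, looking at `rows` alone is like calling `spark_frame.show()`: it never reflects the filter, because the filter lives in the column expressions.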

@ueshin
Collaborator

ueshin commented Jul 13, 2020

The Index.to_frame() case you mentioned above may be an issue in the function itself or in util.align_diff_frames.
I'll take a look.

@itholic
Contributor Author

itholic commented Jul 14, 2020

@ueshin Oh, now it's clear to me. Thanks!! :D

HyukjinKwon pushed a commit that referenced this issue Jul 15, 2020

Use `SPARK_INDEX_NAME_FORMAT` in `utils.combine_frames` to avoid ambiguity.

```py
>>> ks.options.compute.ops_on_diff_frames = True
>>> kdf = ks.DataFrame({"a": [1, 2, 3], "Koalas": [0, 1, 2]}).set_index("Koalas", drop=False)
>>> kdf.index.name = None
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;
```

Related to #1647 as well.
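
The idea in this commit can be sketched in plain Python: give every index column a reserved positional name so that it can never collide with a data column. The pattern `__index_level_{}__` is inferred from the `__index_level_0__` column visible in the `spark_frame` output earlier in the thread; treating it as the value of `SPARK_INDEX_NAME_FORMAT` is an assumption, and `rename_index_columns` is a hypothetical helper rather than the code in `utils.combine_frames`.

```python
# Assumed naming pattern, inferred from `__index_level_0__` in the thread.
INDEX_NAME_FORMAT = "__index_level_{}__".format


def rename_index_columns(index_columns, data_columns):
    """Return (old, new) pairs for the index columns and the data columns as-is."""
    renamed = [(old, INDEX_NAME_FORMAT(i)) for i, old in enumerate(index_columns)]
    return renamed, list(data_columns)


# Mirrors the commit's example: the index and a data column are both "Koalas".
renamed, data = rename_index_columns(["Koalas"], ["a", "Koalas"])
all_names = [new for _, new in renamed] + data
print(all_names)
```

After the rename every column name is unique, so a reference to `Koalas` can only mean the data column.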
ueshin added a commit that referenced this issue Jul 15, 2020
Consolidates the logic for `Index.to_frame()` and `MultiIndex.to_frame()` and renames Spark columns only when `index=False`.
Related to #1647 (comment), but does not fully fix it.
@itholic itholic added the bug Something isn't working label Aug 31, 2020
@itholic
Contributor Author

itholic commented Aug 9, 2021

Closing since this is resolved.

@itholic itholic closed this as completed Aug 9, 2021