Index.to_frame() seems not working properly. #1647

itholic · 2020-07-13T04:34:59Z

>>> kdf = ks.DataFrame({"Koalas": [1, 2, 3]}, index=pd.Index([1, 2, 3]))
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
   0    NEW
1  1  200.0
3  3    NaN
2  2  300.0

The above code works well.

But a DataFrame of the same shape created with Index.to_frame() does not seem to work properly.

>>> kdf = ks.Index([1, 2, 3]).to_frame(name="Koalas")
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;

itholic · 2020-07-13T04:36:26Z

Let me fix this in order to complete #1630.

ueshin · 2020-07-13T06:49:20Z

What's the problem?

itholic · 2020-07-13T08:28:19Z

I thought we should always manage the data_spark_columns and index_spark_columns separately because the internal Spark columns should be changed when the external Koalas columns are changed.

For example,

>>> kdf = ks.DataFrame({"Koalas": [1, 2, 3]}, index=pd.Index([1, 2, 3]))
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
   0    NEW
1  1  200.0
3  3    NaN
2  2  300.0

The above code works well.

But a DataFrame of the same shape created with Index.to_frame() does not seem to work properly.

>>> kdf = ks.Index([1, 2, 3]).to_frame(name="Koalas")
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;

Hmm, but it is probably not a problem related to the internal Spark frame. Let me investigate this more carefully.

itholic · 2020-07-13T08:30:44Z

I just changed the title and description of this issue.

itholic · 2020-07-13T09:19:07Z

Anyway, could I ask an additional question about InternalFrame, which I often find confusing?

Don't we need to update the values of the internal Spark columns after modifying the external values?

For example, let's say we have a Series like the below.

>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64

>>> kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

And let's modify this by filtering with where, and make a new Series named new_kser.

>>> new_kser = kser.where(kser < 2)
>>> new_kser
0    1.0
1    NaN
2    NaN
Name: 0, dtype: float64

>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

Even though new_kser has the values [1.0, NaN, NaN], the internal Spark frame still has the values [1, 2, 3], which come from the original kser.

Is this behaviour not a problem because InternalFrame is only used for internal purposes?

ueshin · 2020-07-13T23:27:06Z

InternalFrame works not only with spark_frame but with all of its metadata.
The data_spark_columns contains all the changes made since the last time spark_frame was created.

Even if spark_frame shows the values [1, 2, 3], data_spark_columns records the kser.where(kser < 2) operation.

>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+

>>> new_kser._internal.data_spark_columns
[Column<b'CASE WHEN CASE WHEN ((0 < 2) IS NULL) THEN false ELSE (0 < 2) END AS `0` THEN 0 ELSE NaN END AS `0` AS `0`'>]

We always use both to show the actual values.

>>> new_kser._internal.spark_frame.select(new_kser._internal.data_spark_columns).show()
+---+
|  0|
+---+
|1.0|
|NaN|
|NaN|
+---+
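
To make the split between the stored frame and the recorded column expressions concrete, here is a minimal pure-Python sketch. It is not Koalas code: `ToyInternalFrame`, `base_rows`, and `column_exprs` are hypothetical names that only mimic the roles of `spark_frame` and `data_spark_columns` described above.

```python
import math


class ToyInternalFrame:
    """Hypothetical illustration only; this is not how Koalas is implemented."""

    def __init__(self, base_rows, column_exprs):
        # base_rows plays the role of spark_frame: it is never rewritten.
        self.base_rows = base_rows
        # column_exprs plays the role of data_spark_columns: one deferred
        # expression per column, evaluated against a base row on demand.
        self.column_exprs = column_exprs

    def where(self, name, predicate):
        previous = self.column_exprs[name]

        def expr(row):
            value = previous(row)
            # Keep the value where the predicate holds, otherwise NaN.
            return float(value) if predicate(value) else math.nan

        new_exprs = dict(self.column_exprs)
        new_exprs[name] = expr
        # The new frame shares the same base rows; only the expression changes.
        return ToyInternalFrame(self.base_rows, new_exprs)

    def select(self):
        # The visible values come from applying the expressions to the base rows.
        return [
            {name: expr(row) for name, expr in self.column_exprs.items()}
            for row in self.base_rows
        ]


base = [{"0": 1}, {"0": 2}, {"0": 3}]
kser = ToyInternalFrame(base, {"0": lambda row: row["0"]})
new_kser = kser.where("0", lambda value: value < 2)

# The base rows are unchanged, while select() reflects the where() operation.
values = [row["0"] for row in new_kser.select()]
```

In this toy model, inspecting `new_kser.base_rows` alone still shows the original values, just as `spark_frame.show()` does above; the filtered result only appears once the expressions are applied in `select()`.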

ueshin · 2020-07-13T23:29:08Z

The case of Index.to_frame() you mentioned above may be an issue in the function itself or in util.align_diff_frames.
I'll take a look.

itholic · 2020-07-14T09:28:15Z

@ueshin Oh, now it's clear to me. Thanks!! :D

Use `SPARK_INDEX_NAME_FORMAT` in `utils.combine_frames` to avoid ambiguity.

```py
>>> ks.options.compute.ops_on_diff_frames = True
>>> kdf = ks.DataFrame({"a": [1, 2, 3], "Koalas": [0, 1, 2]}).set_index("Koalas", drop=False)
>>> kdf.index.name = None
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;
```

Related to #1647 as well.
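
The ambiguity can be pictured with a small pure-Python sketch. This is only an illustration under stated assumptions, not the actual change: `resolve` and `index_column_name` are hypothetical helpers, and the `__index_level_{}__` pattern is taken from the `__index_level_0__` column visible in the `spark_frame.show()` output earlier in this thread.

```python
def resolve(columns, name):
    """Return the position of `name`, failing if it is missing or duplicated."""
    matches = [i for i, column in enumerate(columns) if column == name]
    if len(matches) != 1:
        raise LookupError(
            "Reference {!r} matches {} columns".format(name, len(matches))
        )
    return matches[0]


def index_column_name(level):
    # Assumed naming pattern, based on the __index_level_0__ column shown above.
    return "__index_level_{}__".format(level)


# The index column and a data column share the name "Koalas".
clashing = ["Koalas", "a", "Koalas"]

# Renaming the index column to a reserved name removes the clash.
renamed = [index_column_name(0), "a", "Koalas"]

try:
    resolve(clashing, "Koalas")
    clash_detected = False
except LookupError:
    clash_detected = True

position = resolve(renamed, "Koalas")
```

With the index column stored under a reserved internal name, a lookup for `Koalas` can only match the data column, which is the kind of ambiguity the description above aims to avoid.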

Consolidates the logic for `Index.to_frame()` and `MultiIndex.to_frame()`, and renames Spark columns only when `index=False`. Related to #1647 (comment), but does not fully fix it.

itholic · 2021-08-09T04:38:50Z

Close since this is resolved.

itholic changed the title ~~MultiIndex.to_frame() are not working properly~~ MultiIndex.to_frame() are not working properly internally Jul 13, 2020

itholic mentioned this issue Jul 13, 2020

Fix MultiIndex.to_frame() work properly internally #1648

Closed

itholic changed the title ~~MultiIndex.to_frame() are not working properly internally~~ Index.to_frame() seems not working properly. Jul 13, 2020

This was referenced Jul 14, 2020

Rename spark columns only when index=False. #1649

Merged

Use SPARK_INDEX_NAME_FORMAT in combine_frames to avoid ambiguity. #1650

Merged

itholic added the bug Something isn't working label Aug 31, 2020

itholic closed this as completed Aug 9, 2021
