Index.to_frame() seems not working properly. #1647
Comments
Let me fix this to complete #1630 |
What's the problem? |
I thought we should always manage the … For example:

```py
>>> kdf = ks.DataFrame({"Koalas": [1, 2, 3]}, index=pd.Index([1, 2, 3]))
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
   Koalas    NEW
1       1  200.0
3       3    NaN
2       2  300.0
```

The above code works well, but the same shape of `DataFrame` made from `Index.to_frame()` does not seem to work properly:

```py
>>> kdf = ks.Index([1, 2, 3]).to_frame(name="Koalas")
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;
```

Hmm.. but it probably isn't a problem related to the internal Spark frame... let me investigate this more carefully. |
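For comparison (not part of the original thread), the equivalent plain-pandas code runs without error and simply aligns the assigned Series on the index, which is presumably the behaviour the Koalas frame built via `Index.to_frame()` should mirror:

```py
# Plain-pandas comparison (illustrative, not from the thread): the assignment
# aligns the Series on the index instead of raising an ambiguity error.
import pandas as pd

pdf = pd.Index([1, 2, 3]).to_frame(name="Koalas")
pdf["NEW"] = pd.Series([100, 200, 300])  # Series index is [0, 1, 2]
print(pdf)
#    Koalas    NEW
# 1       1  200.0
# 2       2  300.0
# 3       3    NaN
```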
I just changed the title and description of this issue. |
Anyway, could I ask an additional question? Don't we need to address the values of the internal Spark columns after modifying the external values? For example, let's say we have a `kser` like this:

```py
>>> kser
0    1
1    2
2    3
Name: 0, dtype: int64

>>> kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+
```

And let's modify it by filtering:

```py
>>> new_kser = kser.where(kser < 2)
>>> new_kser
0    1.0
1    NaN
2    NaN
Name: 0, dtype: float64

>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+
```

Even though the external values of `new_kser` have changed, the internal Spark frame still holds the original values. Isn't there any problem with such behaviour, given that the internal Spark frame no longer matches the external values? |
Even if the `spark_frame` itself is unchanged:

```py
>>> new_kser._internal.spark_frame.show()
+-----------------+---+-----------------+
|__index_level_0__|  0|__natural_order__|
+-----------------+---+-----------------+
|                0|  1|      25769803776|
|                1|  2|      60129542144|
|                2|  3|      94489280512|
+-----------------+---+-----------------+
```

the `data_spark_columns` carry the modified expressions:

```py
>>> new_kser._internal.data_spark_columns
[Column<b'CASE WHEN CASE WHEN ((0 < 2) IS NULL) THEN false ELSE (0 < 2) END AS `0` THEN 0 ELSE NaN END AS `0` AS `0`'>]
```

We always use both to show the actual values:

```py
>>> new_kser._internal.spark_frame.select(new_kser._internal.data_spark_columns).show()
+---+
|  0|
+---+
|1.0|
|NaN|
|NaN|
+---+
```
|
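To make the mechanism concrete, here is a minimal standalone sketch in plain PySpark (my own illustration under a local SparkSession, not code from Koalas): the base frame is never mutated, and the filtering lives in an unevaluated `Column` expression that only takes effect when it is selected.

```py
# Illustrative sketch only: mirrors the idea that the original Spark frame is
# kept as-is while transformations live in lazily applied Column expressions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
base = spark.createDataFrame([(0, 1), (1, 2), (2, 3)], ["idx", "0"])

# Unevaluated expression: keep the value where it is < 2, otherwise NaN.
expr = F.when(F.col("0") < 2, F.col("0").cast("double")).otherwise(float("nan")).alias("0")

base.show()               # still shows the original values 1, 2, 3
base.select(expr).show()  # shows the "actual" values 1.0, NaN, NaN
```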
The case of |
@ueshin Oh, now it's clear to me. Thanks!! :D |
Use `SPARK_INDEX_NAME_FORMAT` in `utils.combine_frames` to avoid ambiguity.

```py
>>> ks.options.compute.ops_on_diff_frames = True
>>> kdf = ks.DataFrame({"a": [1, 2, 3], "Koalas": [0, 1, 2]}).set_index("Koalas", drop=False)
>>> kdf.index.name = None
>>> kdf["NEW"] = ks.Series([100, 200, 300])
>>> kdf
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: Reference 'Koalas' is ambiguous, could be: Koalas, Koalas.;
```

Related to #1647 as well.
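The general technique, sketched in plain PySpark under my own assumptions (this is not the actual `utils.combine_frames` implementation): keep index columns under reserved, position-based names such as `__index_level_0__` while combining frames, so that a data column which happens to share the index's user-facing name can no longer collide with it.

```py
# Illustrative sketch only -- not the Koalas utils.combine_frames code.
from pyspark.sql import SparkSession, functions as F

SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format  # reserved index names

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A frame whose index is stored under the reserved name and which also has a
# data column named "Koalas" (the index's user-facing name).
sdf = spark.createDataFrame([(0, 0, 1), (1, 1, 2), (2, 2, 3)],
                            [SPARK_INDEX_NAME_FORMAT(0), "Koalas", "a"])

# Ambiguous: exposing the index under its user-facing name duplicates "Koalas".
ambiguous = sdf.select(F.col(SPARK_INDEX_NAME_FORMAT(0)).alias("Koalas"), "Koalas", "a")
# ambiguous.select("Koalas")  # would raise: Reference 'Koalas' is ambiguous

# Safe: keep the index under the reserved name and rename only on output.
safe = sdf.select(SPARK_INDEX_NAME_FORMAT(0), "Koalas", "a")
safe.select("Koalas").show()  # unambiguous
```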
Consolidates the logic for `Index.to_frame()` and `MultiIndex.to_frame()` and renames Spark columns only when `index=False`. Related to #1647 (comment), but does not fully fix it.
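For reference, a quick illustration of the pandas semantics that `Index.to_frame()` follows (not code from the PR): with `index=True` (the default) the original Index also becomes the resulting frame's index, so the same values back both the index and the column, while `index=False` gives the frame a fresh default index.

```py
# Pandas behaviour of Index.to_frame (illustrative only).
import pandas as pd

idx = pd.Index([1, 2, 3], name="Koalas")
print(idx.to_frame())             # index [1, 2, 3], column "Koalas"
print(idx.to_frame(index=False))  # index [0, 1, 2], column "Koalas"
print(idx.to_frame(name="x"))     # column renamed to "x"
```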
Closing since this is resolved. |