BUG: modin can't share categories between multiple frames when pandas can #5722
Comments
pandas-dev/pandas#51362 might be related to this.
Apparently, casting to categories using a full-axis function fixes everything:

diff --git a/modin/core/dataframe/pandas/dataframe/dataframe.py b/modin/core/dataframe/pandas/dataframe/dataframe.py
index 30925e40..1e38bde6 100644
--- a/modin/core/dataframe/pandas/dataframe/dataframe.py
+++ b/modin/core/dataframe/pandas/dataframe/dataframe.py
@@ -1201,6 +1201,7 @@ class PandasDataframe(ClassLogger):
         columns = col_dtypes.keys()
         # Create Series for the updated dtypes
         new_dtypes = self.dtypes.copy()
+        full_axis_cast = False
         for i, column in enumerate(columns):
             dtype = col_dtypes[column]
             if (
@@ -1220,6 +1221,7 @@ class PandasDataframe(ClassLogger):
                 # We cannot infer without computing the dtype if
                 elif isinstance(new_dtype, str) and new_dtype == "category":
                     new_dtypes = None
+                    full_axis_cast = True
                     break
                 else:
                     new_dtypes[column] = new_dtype
@@ -1228,9 +1230,15 @@ class PandasDataframe(ClassLogger):
             """Compute new partition frame with dtypes updated."""
             return df.astype({k: v for k, v in col_dtypes.items() if k in df})

-        new_frame = self._partition_mgr_cls.map_partitions(
-            self._partitions, astype_builder
-        )
+        if full_axis_cast:
+            new_frame = self._partition_mgr_cls.map_axis_partitions(
+                0, self._partitions, astype_builder, keep_partitioning=True
+            )
+        else:
+            new_frame = self._partition_mgr_cls.map_partitions(
+                self._partitions, astype_builder
+            )
+
         return self.__constructor__(
             new_frame,
             self._index_cache,

This happens because each partition will now have the whole set of categorical values, meaning that
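A rough way to see the difference between a partition-wise cast and a full-axis cast is to emulate row partitions in plain pandas. This is only a sketch: the `bounds` list and the chunking by `iloc` are assumptions made for illustration and are not how Modin stores or maps its partitions.

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "a", "b", "c", "c", "d", "b"]})

# Hypothetical row partitions with lengths [2, 2, 2, 1] (plain pandas slices).
bounds = [(0, 2), (2, 4), (4, 6), (6, 7)]

# Partition-wise cast: each chunk infers categories from its own rows only.
per_partition = [df.iloc[lo:hi].astype({"a": "category"}) for lo, hi in bounds]
local_categories = [list(part["a"].cat.categories) for part in per_partition]

# Full-axis cast: categories are inferred once from the whole column,
# so every chunk sliced afterwards carries the same category set.
full = df.astype({"a": "category"})
full_axis = [full.iloc[lo:hi] for lo, hi in bounds]
global_categories = [list(part["a"].cat.categories) for part in full_axis]

print(local_categories)   # one category list per chunk, all different
print(global_categories)  # the same four categories in every chunk
```

With the partition-wise cast the codes of different chunks are not comparable, while after the full-axis cast code 1 means "b" in every chunk.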
But the approach above has significant downsides:
@dchigarev Is it possible that modin will give incorrect results when working with category data due to this problem (during join or groupby)? For instance, when I perform groupby on category columns
Not sure about join and groupby. There may be cases where they break, but simple scenarios should work fine; most of the time modin just discards the categorical dtype. However, we already had such a case, shown in this example:

import pandas
from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

def run_scenario(df):
    print(f"\n== running scenario with {type(df)=} ==")
    cat_df = df.astype({"a": "category"})
    # in pandas 'fi' and 'se' would still share the same categorical values,
    # whereas with modin they are now completely independent
    fi = cat_df.loc[df.b == 0]
    se = cat_df.loc[df.b == 1]
    fi["a"] = fi["a"].cat.codes
    se["a"] = se["a"].cat.codes
    print(fi["a"])  # pandas: [0, 0, 1] | modin: [0, 0, 1]
    print(se["a"])  # pandas: [2, 2, 3, 1] | modin: [1, 1, 2, 0]
    se_groupby = se.groupby("a").sum().squeeze(axis=1)
    print(fi["a"].map(se_groupby))  # pandas: [nan, nan, 1.0] | modin: [1, 1, 2]

pandas_df = pandas.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1],
    }
)
# split the frame into row partitions of lengths 2, 2, 2 and 1
modin_df = construct_modin_df_by_scheme(
    pandas_df, {"row_lengths": [2, 2, 2, 1], "column_widths": [2]}
)
run_scenario(pandas_df)
run_scenario(modin_df)
…ory" Signed-off-by: Igoshev, Iaroslav <[email protected]>
Modin version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)
Reproducible Example
Issue Description
pandas shares categorical values between offspring frames, which makes it possible to treat their codes as comparable to each other.
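The pandas side of this can be checked with a small sketch that reuses the data from the scenario discussed in this thread. The variable names `fi_codes` and `se_codes` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1],
    }
)
cat_df = df.astype({"a": "category"})

# Two offspring frames selected from the same categorical parent.
fi = cat_df.loc[df["b"] == 0]
se = cat_df.loc[df["b"] == 1]

# Both slices keep the parent's categorical dtype, including unused categories.
same_dtype = fi["a"].dtype == se["a"].dtype

fi_codes = fi["a"].cat.codes.tolist()
se_codes = se["a"].cat.codes.tolist()
print(same_dtype)  # True
print(fi_codes)    # [0, 0, 1]
print(se_codes)    # [2, 2, 3, 1]
```

Because both slices use the same category table, code 1 refers to "b" in `fi` as well as in `se`.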
In modin, a categorical frame often has no defined categorical values at all, because the .dtypes are often unknown, so each partition has its own categorical values. One consequence is that when building a full-axis column partition we lose the categorical type in the result: there is no full-axis table of categories, and it was found more beneficial to discard categories altogether rather than build categories for the whole column (#2513). We can still ask a full-axis function to rebuild categories on demand (however, this still doesn't allow us to share them between multiple frames):
modin/modin/core/storage_formats/pandas/query_compiler.py
Lines 3451 to 3453 in bb0950d
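The loss of the categorical type when pieces with different category sets are combined can be reproduced in plain pandas. The sketch below is an illustration only and does not show the Modin code referenced above; `union_categoricals` stands in for the idea of rebuilding categories on demand, and `part_1` and `part_2` are made-up stand-ins for partitions.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two pieces of one column, each cast to category independently.
part_1 = pd.Series(["a", "a", "b"], dtype="category")
part_2 = pd.Series(["c", "c", "d", "b"], dtype="category")

# The category sets differ, so concatenation drops the categorical dtype.
combined = pd.concat([part_1, part_2], ignore_index=True)
print(combined.dtype)

# Rebuilding on demand: merge the category sets and recode all values.
rebuilt = pd.Series(union_categoricals([part_1, part_2]))
print(rebuilt.dtype)
print(list(rebuilt.cat.categories))
```

The rebuilt column is categorical again, but its category table belongs to that one result and is not shared with any other frame.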
Expected Behavior
We would apparently like to have a shareable full-axis categorical object common to all partitions, and the ability to share it with other frames.
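One possible reading of this expectation, sketched in plain pandas under the assumption that the unique values of every partition can be collected first: build a single CategoricalDtype and cast each partition with it. The `partitions` list and `shared_dtype` name are hypothetical, and this is not a proposal for how Modin should implement it.

```python
import pandas as pd

# Hypothetical row partitions of one column.
partitions = [
    pd.Series(["a", "a"]),
    pd.Series(["b", "c"]),
    pd.Series(["c", "d"]),
    pd.Series(["b"]),
]

# Gather unique values per partition, then build one shared dtype.
all_values = set().union(*(part.unique() for part in partitions))
shared_dtype = pd.CategoricalDtype(categories=sorted(all_values))

# Every partition is cast with the same dtype object, so codes agree everywhere.
cast = [part.astype(shared_dtype) for part in partitions]

# Identical categorical dtypes survive concatenation.
combined = pd.concat(cast, ignore_index=True)
codes = combined.cat.codes.tolist()
print(combined.dtype == shared_dtype)
print(codes)
```

The same `shared_dtype` object could be handed to another frame, which is the sharing that the per-partition categories cannot provide.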
Error Logs
Installed Versions