Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: modin can't share categories between multiple frames when pandas can #5722

Closed
3 tasks done
dchigarev opened this issue Mar 1, 2023 · 5 comments · Fixed by #6222
Closed
3 tasks done

BUG: modin can't share categories between multiple frames when pandas can #5722

dchigarev opened this issue Mar 1, 2023 · 5 comments · Fixed by #6222
Labels
bug 🦗 Something isn't working Internals Internal modin functionality P1 Important tasks that we should complete soon pandas concordance 🐼 Functionality that does not match pandas

Comments

@dchigarev
Copy link
Collaborator

dchigarev commented Mar 1, 2023

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import pandas

df = pandas.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1]
    }
)

cat_df = df.astype({"a": "category"})

fi = cat_df.loc[df.b == 0]
se = cat_df.loc[df.b == 1]

print(fi.dtypes["a"].categories.equals(se.dtypes["a"].categories)) # True

print(fi["a"].cat.codes) # [0, 0, 1]    
print(se["a"].cat.codes) # [2, 2, 3, 1] <-- shares categories with 'fi', thus codes start from 1

from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

df = construct_modin_df_by_scheme(df, {"row_lengths": [2, 2, 2, 1], "column_widths": [2]})
cat_df = df.astype({"a": "category"})

fi = cat_df.loc[df.b == 0]
se = cat_df.loc[df.b == 1]
print(fi.dtypes["a"].categories.equals(se.dtypes["a"].categories)) # False

print(fi["a"].cat.codes) # [0, 0, 1]
print(se["a"].cat.codes) # [1, 1, 2, 0] <-- have independent categories, thus codes start from 0

Issue Description

Pandas do share categorical values between offspring frames thus making it possible to treat their codes as comparable to each other.

In modin, a categorical frame often doesn't even have defined categorical values at all as the .dtypes are often unknown so each partition has its own categorical values. One of the consequences of this is that when building a full-axis column partition we lose categorical types at the result because there's just no full-axis table for categories and was found more beneficial to discard categories at all rather than building categories for the whole column (#2513).

Although we still can ask for a full-axis function to re-build categories on demand (however it still doesn't allows us to share them between multiple frames):

if ser.dtype != "category":
ser = ser.astype("category", copy=False)
return ser.cat.codes.to_frame(name=MODIN_UNNAMED_SERIES_LABEL)

Expected Behavior

We apparently would like to have a sharable full-axis categorical object between all partitions and an ability to share it with other frames.

Error Logs

There is no error logs, so putting here a plain output of the reproducer
True
0    0
1    0
2    1
dtype: int8
3    2
4    2
5    3
6    1
dtype: int8
False
0    0
1    0
2    1
dtype: int8
3    1
4    1
5    2
6    0
dtype: int8

Installed Versions

Replace this line with the output of pd.show_versions()

@dchigarev dchigarev added bug 🦗 Something isn't working pandas concordance 🐼 Functionality that does not match pandas P1 Important tasks that we should complete soon Internals Internal modin functionality labels Mar 1, 2023
@YarShev
Copy link
Collaborator

YarShev commented Mar 1, 2023

pandas-dev/pandas#51362 might be related to this.

@dchigarev
Copy link
Collaborator Author

Apparently casting to categories using full-axis function fixes everything:

diff --git a/modin/core/dataframe/pandas/dataframe/dataframe.py b/modin/core/dataframe/pandas/dataframe/dataframe.py
index 30925e40..1e38bde6 100644
--- a/modin/core/dataframe/pandas/dataframe/dataframe.py
+++ b/modin/core/dataframe/pandas/dataframe/dataframe.py
@@ -1201,6 +1201,7 @@ class PandasDataframe(ClassLogger):
         columns = col_dtypes.keys()
         # Create Series for the updated dtypes
         new_dtypes = self.dtypes.copy()
+        full_axis_cast = False
         for i, column in enumerate(columns):
             dtype = col_dtypes[column]
             if (
@@ -1220,6 +1221,7 @@ class PandasDataframe(ClassLogger):
                 # We cannot infer without computing the dtype if
                 elif isinstance(new_dtype, str) and new_dtype == "category":
                     new_dtypes = None
+                    full_axis_cast = True
                     break
                 else:
                     new_dtypes[column] = new_dtype
@@ -1228,9 +1230,15 @@ class PandasDataframe(ClassLogger):
             """Compute new partition frame with dtypes updated."""
             return df.astype({k: v for k, v in col_dtypes.items() if k in df})

-        new_frame = self._partition_mgr_cls.map_partitions(
-            self._partitions, astype_builder
-        )
+        if full_axis_cast:
+            new_frame = self._partition_mgr_cls.map_axis_partitions(
+                0, self._partitions, astype_builder, keep_partitioning=True
+            )
+        else:
+            new_frame = self._partition_mgr_cls.map_partitions(
+                self._partitions, astype_builder
+            )
+
         return self.__constructor__(
             new_frame,
             self._index_cache,

This happens because now each partition will have the whole categorical values, meaning that pd.concat will not discard categorical dtypes when building a full-axis column partition.

@dchigarev
Copy link
Collaborator Author

dchigarev commented Mar 1, 2023

But the approach above have significant downsides:

  1. The cast itself becomes more expensive as it's now performed in a full-axis function
  2. Each partition is now obliged to store the whole encoding table, which can cause severe memory issues (the same as we're fixing with multiindex right now PERF-#5247: Decrease memory consumption for MultiIndex #5632)

@Egor-Krivov
Copy link
Contributor

Egor-Krivov commented Mar 28, 2023

@dchigarev Is it possible that modin will give incorrect results, while working with category data due to this problem (during join or groupby)?

For instance, when I perform groupby on category columns

@dchigarev
Copy link
Collaborator Author

dchigarev commented Mar 28, 2023

not sure about join and groupby (maybe there are cases where it could be broken, however, simple scenarios should work fine, most of the time it just discards categorical dtype) but we already had a case with the .map() function that was affected by this malfunction:

example
import pandas
from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

def run_scenario(df):
    print(f"\n== running scenario with {type(df)=} ==")

    cat_df = df.astype({"a": "category"})

    # in pandas 'fi' and 'se' would still share the same categorical values,
    # when with modin they will now completely independent
    fi = cat_df.loc[df.b == 0]
    se = cat_df.loc[df.b == 1]

    fi["a"] = fi["a"].cat.codes
    se["a"] = se["a"].cat.codes

    print(fi["a"]) # pandas: [0, 0, 1] | modin: [0, 0, 1]
    print(se["a"]) # pandas: [2, 2, 3, 1] | modin: [1, 1, 2, 0]

    se_groupby = se.groupby("a").sum().squeeze(axis=1)

    print(fi["a"].map(se_groupby)) # pandas: [nan, nan, 1.0] | modin: [1, 1, 2]

pandas_df = pandas.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1]
    }
)
modin_df = construct_modin_df_by_scheme(pandas_df, {"row_lengths": [2, 2, 2, 1], "column_widths": [2]})

run_scenario(pandas_df)
run_scenario(modin_df)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working Internals Internal modin functionality P1 Important tasks that we should complete soon pandas concordance 🐼 Functionality that does not match pandas
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants