BUG: modin can't share categories between multiple frames when pandas can #5722

dchigarev · 2023-03-01T14:57:28Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import pandas

df = pandas.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1]
    }
)

cat_df = df.astype({"a": "category"})

fi = cat_df.loc[df.b == 0]
se = cat_df.loc[df.b == 1]

print(fi.dtypes["a"].categories.equals(se.dtypes["a"].categories)) # True

print(fi["a"].cat.codes) # [0, 0, 1]
print(se["a"].cat.codes) # [2, 2, 3, 1] <-- shares categories with 'fi', so the codes index into the same table and the lowest one here is 1

from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

df = construct_modin_df_by_scheme(df, {"row_lengths": [2, 2, 2, 1], "column_widths": [2]})
cat_df = df.astype({"a": "category"})

fi = cat_df.loc[df.b == 0]
se = cat_df.loc[df.b == 1]
print(fi.dtypes["a"].categories.equals(se.dtypes["a"].categories)) # False

print(fi["a"].cat.codes) # [0, 0, 1]
print(se["a"].cat.codes) # [1, 1, 2, 0] <-- has independent categories, so the codes restart from 0

Issue Description

Pandas shares categorical values between offspring frames, which makes it possible to treat their codes as comparable to each other.

In Modin, a categorical frame often has no frame-wide categorical values defined at all: the .dtypes are often unknown, so each partition has its own categorical values. One consequence is that when building a full-axis column partition we lose the categorical dtype in the result, because there is no full-axis table of categories, and it was found more beneficial to discard categories altogether than to build categories for the whole column (#2513).
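To illustrate the dtype loss described above, here is a minimal sketch in plain pandas. It only simulates partitions with two separate Series (the names `part0` and `part1` are made up for the example) and does not touch Modin internals.

```python
import pandas as pd

# Two "partitions" of one column, each cast to category on its own,
# the way a per-partition astype would do it.
part0 = pd.Series(["a", "a", "b"]).astype("category")
part1 = pd.Series(["c", "c", "d", "b"]).astype("category")

# Each partition only knows the values it has seen.
print(list(part0.cat.categories))  # ['a', 'b']
print(list(part1.cat.categories))  # ['b', 'c', 'd']

# Concatenating categoricals whose categories differ does not keep
# the categorical dtype; pandas falls back to the dtype of the values.
combined = pd.concat([part0, part1], ignore_index=True)
print(isinstance(combined.dtype, pd.CategoricalDtype))  # False
```

The codes of `part1` are `[1, 1, 2, 0]` here, which matches the Modin output in the reproducer.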

We can still ask a full-axis function to rebuild categories on demand (although that still doesn't allow us to share them between multiple frames):

modin/modin/core/storage_formats/pandas/query_compiler.py

Lines 3451 to 3453 in bb0950d

    if ser.dtype != "category":
        ser = ser.astype("category", copy=False)
    return ser.cat.codes.to_frame(name=MODIN_UNNAMED_SERIES_LABEL)

Expected Behavior

Ideally, we would have a single full-axis categorical object shared between all partitions of a frame, along with the ability to share it with other frames.
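As a rough illustration of what a shared categorical object could look like, the sketch below uses plain pandas and `pandas.api.types.union_categoricals` to build one category table and re-encode each simulated partition against it. This is only an assumption about one possible approach, not how Modin implements or plans to implement it; `part0`, `part1` and `shared_dtype` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Simulated partitions, each with its own independent categories.
part0 = pd.Series(["a", "a", "b"]).astype("category")
part1 = pd.Series(["c", "c", "d", "b"], index=[3, 4, 5, 6]).astype("category")

# Build one category table that covers every partition ...
merged = union_categoricals([part0, part1], sort_categories=True)
shared_dtype = pd.CategoricalDtype(merged.categories)

# ... and re-encode each partition against that shared table.
part0 = part0.astype(shared_dtype)
part1 = part1.astype(shared_dtype)

print(part0.cat.categories.equals(part1.cat.categories))  # True
print(part0.cat.codes.tolist())  # [0, 0, 1]
print(part1.cat.codes.tolist())  # [2, 2, 3, 1]

# With identical dtypes, concatenation keeps the categorical dtype.
print(pd.concat([part0, part1]).dtype == shared_dtype)  # True
```

With a shared table the codes match what pandas prints in the reproducer, so they become comparable across frames.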

Error Logs

There are no error logs, so here is the plain output of the reproducer:

True
0    0
1    0
2    1
dtype: int8
3    2
4    2
5    3
6    1
dtype: int8
False
0    0
1    0
2    1
dtype: int8
3    1
4    1
5    2
6    0
dtype: int8

Installed Versions

Replace this line with the output of pd.show_versions()


YarShev · 2023-03-01T15:23:36Z

pandas-dev/pandas#51362 might be related to this.

dchigarev · 2023-03-01T15:33:27Z

Apparently, casting to categories using a full-axis function fixes everything:

diff --git a/modin/core/dataframe/pandas/dataframe/dataframe.py b/modin/core/dataframe/pandas/dataframe/dataframe.py
index 30925e40..1e38bde6 100644
--- a/modin/core/dataframe/pandas/dataframe/dataframe.py
+++ b/modin/core/dataframe/pandas/dataframe/dataframe.py
@@ -1201,6 +1201,7 @@ class PandasDataframe(ClassLogger):
         columns = col_dtypes.keys()
         # Create Series for the updated dtypes
         new_dtypes = self.dtypes.copy()
+        full_axis_cast = False
         for i, column in enumerate(columns):
             dtype = col_dtypes[column]
             if (
@@ -1220,6 +1221,7 @@ class PandasDataframe(ClassLogger):
                 # We cannot infer without computing the dtype if
                 elif isinstance(new_dtype, str) and new_dtype == "category":
                     new_dtypes = None
+                    full_axis_cast = True
                     break
                 else:
                     new_dtypes[column] = new_dtype
@@ -1228,9 +1230,15 @@ class PandasDataframe(ClassLogger):
             """Compute new partition frame with dtypes updated."""
             return df.astype({k: v for k, v in col_dtypes.items() if k in df})

-        new_frame = self._partition_mgr_cls.map_partitions(
-            self._partitions, astype_builder
-        )
+        if full_axis_cast:
+            new_frame = self._partition_mgr_cls.map_axis_partitions(
+                0, self._partitions, astype_builder, keep_partitioning=True
+            )
+        else:
+            new_frame = self._partition_mgr_cls.map_partitions(
+                self._partitions, astype_builder
+            )
+
         return self.__constructor__(
             new_frame,
             self._index_cache,

This works because each partition now holds the complete set of categorical values, so pd.concat does not discard the categorical dtype when building a full-axis column partition.
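The effect can be reproduced in plain pandas, without Modin: cast the whole column once, split it into row chunks, and concatenate them back. This is a simplified sketch that mimics the `[2, 2, 2, 1]` row partitioning from the reproducer; the `chunks` list stands in for partitions and is not a Modin API.

```python
import pandas as pd

col = pd.Series(["a", "a", "b", "c", "c", "d", "b"])

# Cast once over the whole column (the "full-axis" way), then split
# into row chunks of lengths [2, 2, 2, 1].
cat_col = col.astype("category")
chunks = [cat_col.iloc[0:2], cat_col.iloc[2:4], cat_col.iloc[4:6], cat_col.iloc[6:7]]

# Every chunk carries the complete category table, used or not.
print([list(c.cat.categories) for c in chunks])

# Because all chunks have the same dtype, concat keeps it.
rebuilt = pd.concat(chunks)
print(isinstance(rebuilt.dtype, pd.CategoricalDtype))  # True
print(rebuilt.cat.codes.tolist())  # [0, 0, 1, 2, 2, 3, 1]
```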

dchigarev · 2023-03-01T15:36:59Z

But the approach above has significant downsides:

- The cast itself becomes more expensive, as it is now performed in a full-axis function.
- Each partition is now obliged to store the whole encoding table, which can cause severe memory issues (the same problem we're fixing for MultiIndex right now in PERF-#5247: Decrease memory consumption for MultiIndex #5632).
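The memory downside can be seen on a small scale in plain pandas. The sketch below uses a hypothetical column with 10,000 distinct strings and compares a 100-row chunk that keeps the full category table with one that only keeps its own categories. Within a single process the slices may share the table in memory, so treat the numbers as an indication of what each partition would have to hold if it were stored or shipped separately.

```python
import pandas as pd

# Hypothetical column with many distinct string values.
values = pd.Series([f"value_{i}" for i in range(10_000)])

# One category table for the whole column; a 100-row chunk still
# references all 10_000 categories.
full = values.astype("category")
chunk_full = full.iloc[:100]

# Casting the chunk on its own keeps only its 100 categories.
chunk_local = values.iloc[:100].astype("category")

print(len(chunk_full.cat.categories), len(chunk_local.cat.categories))  # 10000 100
print(chunk_full.memory_usage(deep=True) > chunk_local.memory_usage(deep=True))  # True
```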

Egor-Krivov · 2023-03-28T13:45:05Z

@dchigarev Is it possible that Modin gives incorrect results when working with categorical data because of this problem (for example, during join or groupby)?

For instance, when I perform groupby on category columns

dchigarev · 2023-03-28T13:51:22Z

Not sure about join and groupby. There may be cases where they break, but simple scenarios should work fine, since most of the time Modin just discards the categorical dtype. However, we already had a case where the .map() function was affected by this malfunction:

example

import pandas
from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

def run_scenario(df):
    print(f"\n== running scenario with {type(df)=} ==")

    cat_df = df.astype({"a": "category"})

    # in pandas 'fi' and 'se' still share the same categorical values,
    # whereas in modin they are now completely independent
    fi = cat_df.loc[df.b == 0]
    se = cat_df.loc[df.b == 1]

    fi["a"] = fi["a"].cat.codes
    se["a"] = se["a"].cat.codes

    print(fi["a"]) # pandas: [0, 0, 1] | modin: [0, 0, 1]
    print(se["a"]) # pandas: [2, 2, 3, 1] | modin: [1, 1, 2, 0]

    se_groupby = se.groupby("a").sum().squeeze(axis=1)

    print(fi["a"].map(se_groupby)) # pandas: [nan, nan, 1.0] | modin: [1, 1, 2]

pandas_df = pandas.DataFrame(
    {
        "a": ["a", "a", "b", "c", "c", "d", "b"],
        "b": [0, 0, 0, 1, 1, 1, 1]
    }
)
modin_df = construct_modin_df_by_scheme(pandas_df, {"row_lengths": [2, 2, 2, 1], "column_widths": [2]})

run_scenario(pandas_df)
run_scenario(modin_df)

dchigarev added bug 🦗 Something isn't working pandas concordance 🐼 Functionality that does not match pandas P1 Important tasks that we should complete soon Internals Internal modin functionality labels Mar 1, 2023

YarShev added a commit to YarShev/modin that referenced this issue May 31, 2023

FIX-modin-project#5722: Use full axis function when casting to "categ…

cb40fc5

…ory" Signed-off-by: Igoshev, Iaroslav <[email protected]>

YarShev added a commit to YarShev/modin that referenced this issue May 31, 2023

FIX-modin-project#5722: Use full axis function when casting to "categ…

c6c9015

…ory" Signed-off-by: Igoshev, Iaroslav <[email protected]>

YarShev mentioned this issue May 31, 2023

FIX-#5722: Use full axis function when casting to "category" #6222

Merged

7 tasks

dchigarev closed this as completed in #6222 Jun 1, 2023

dchigarev pushed a commit that referenced this issue Jun 1, 2023

FIX-#5722: Use full axis function when casting to "category" (#6222)

930b8fb

Signed-off-by: Igoshev, Iaroslav <[email protected]>

dchigarev mentioned this issue Jan 19, 2024

FEAT-#5925: Enable grouping on categoricals with range-partitioning impl #6862

Merged

7 tasks

dchigarev mentioned this issue Feb 27, 2024

FEAT-#6965: Implement '.merge()' using range-partitioning implementation #6966

Merged

7 tasks
