PERF-#5247: Decrease memory consumption for MultiIndex #5632
Conversation
Signed-off-by: Dmitry Chigarev <[email protected]>
```python
return [
    # A sliced MultiIndex still stores all encoded values of the original index;
    # explicitly ask it to drop unused values in order to save memory.
    chunk.set_axis(chunk.axes[axis].remove_unused_levels(), axis=axis, copy=False)
```
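For context on what the added remove_unused_levels() call does, here is a small standalone pandas illustration (not code from the PR itself):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([range(1_000), list("abcd")])
sliced = mi[:8]  # only labels 0 and 1 of the first level are actually used

# The slice still stores every value of the original first level...
print(len(sliced.levels[0]))  # 1000
# ...and remove_unused_levels() shrinks it to what the slice really needs.
print(len(sliced.remove_unused_levels().levels[0]))  # 2
```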
@dchigarev Looks promising! How does this affect the reverse concatenation operation?
Now I've run some benchmarks for this approach. What I was measuring is the time for splitting a MultiIndex into pieces and then restoring the original index from those pieces, with and without the extra remove_unused_levels() step, and with and without putting the data into Ray. When Ray was not involved, I saw performance degradation only at the splitting stage (because of the extra step of removing unused levels); there were zero differences in concatenating pieces with and without unused levels being dropped. When measuring with Ray involved, there was also a slight slowdown at the splitting stage, though it was eliminated by increasing the length of the values stored in the index (str_len). Just in case you're interested, here's the table that includes testing without Ray:

[benchmark results table]

The script I used to measure all this:

```python
import pandas as pd
import numpy as np
from timeit import default_timer as timer
from modin.core.execution.ray.common.utils import initialize_ray
import ray
initialize_ray()
import uuid
def get_random_str(length):
    res = str(uuid.uuid1())
    # Truncate the uuid if it is long enough, otherwise repeat it to reach the target length
    return res[:length] if len(res) >= length else (res * (length // len(res) + 1))[:length]
def get_level_values(level_len, str_len):
    return [get_random_str(str_len) for _ in range(level_len)]
def get_mi(idx_len, nlevels, str_len, is_highly_unique):
    if is_highly_unique:
        levels = [get_level_values(idx_len, str_len) for _ in range(nlevels)]
        return pd.MultiIndex.from_arrays(levels)
    else:
        unique_rate = 0.2
        levels = [
            np.random.choice(get_level_values(int(idx_len * unique_rate), str_len), idx_len)
            for _ in range(nlevels)
        ]
        return pd.MultiIndex.from_arrays(levels)
def split_and_restore(mi, nparts=112, drop_levels=True, use_ray=False):
    print(f"running with {drop_levels=}")
    one_part = len(mi) // nparts
    refs = [None] * nparts
    t1 = timer()
    for idx in range(nparts):
        split = mi[idx * one_part : (idx + 1) * one_part]
        if drop_levels:
            split = split.remove_unused_levels()
        if use_ray:
            refs[idx] = ray.put(split)
        else:
            refs[idx] = split
    splitting_time = timer() - t1
    print("\tsplitting took", splitting_time)
    t1 = timer()
    if use_ray:
        refs = ray.get(refs)
    # Restore the original index by appending the remaining pieces to the first one
    refs[0].append(refs[1:])
    restoring_time = timer() - t1
    print("\trestoring took", restoring_time)
    return splitting_time, restoring_time
idx_lens = [
    100_000,
    1_000_000,
    10_000_000,
]
nlevels = [3]
str_lens = [10, 100]
result = pd.DataFrame(
    index=np.array([(f"{x} split", f"{x} concat") for x in idx_lens]).flatten(),
    columns=pd.MultiIndex.from_product(
        [[0, 1], ["sparsed", "dense"], str_lens, [0, 1]],
        names=["use ray", "data density", "str_len", "remove unused levels"],
    ),
)
for use_ray in [False, True]:
    print(f"========== USE RAY={use_ray} ===========")
    for idx_len in idx_lens:
        for nlevel in nlevels:
            for str_len in str_lens:
                print(f"==== {idx_len=} {nlevel=} {str_len=} sparsed data ====")
                key = (idx_len, nlevel, str_len, "sparsed")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=False)
                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 1)] = c_time
                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 0)] = c_time
                print(f"==== {idx_len=} {nlevel=} {str_len=} highly unique data ====")
                key = (idx_len, nlevel, str_len, "dense")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=True)
                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 1)] = c_time
                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 0)] = c_time
result.to_excel("mi_res.xlsx")  # store the result in case OOM occurs at further iterations
```
@dchigarev Great results, I see no reason not to add this change. However, for the sake of completeness, I suggest making the same table for a lower number of splits.
Good point. Here are the measurements for the case of a lower number of splits; overall, the trend stayed the same. I've also run memory consumption tests. They showed that removing unused levels has a noticeable benefit only in cases of large strings; otherwise, the memory consumption reduction is not significant, but we still lose time on the extra step of removing levels. So I guess we need some kind of switch between removing and not removing these unused levels, although I don't know yet what the condition could be.
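To make the idea of such a switch concrete, here is a minimal sketch of one possible condition; the maybe_remove_unused_levels helper and the UNUSED_RATIO_THRESHOLD value are hypothetical illustrations, not anything this PR implements:

```python
import numpy as np
import pandas as pd

# Hypothetical threshold: only pay the cost of remove_unused_levels()
# when a large share of the stored level values is no longer referenced.
UNUSED_RATIO_THRESHOLD = 0.5

def maybe_remove_unused_levels(index):
    """Drop unused MultiIndex levels only when it is likely to pay off."""
    if not isinstance(index, pd.MultiIndex):
        return index
    stored = sum(len(level) for level in index.levels)
    used = sum(len(np.unique(codes)) for codes in index.codes)
    if 1 - used / stored > UNUSED_RATIO_THRESHOLD:
        return index.remove_unused_levels()
    return index
```

Counting the used codes is itself a scan over the whole slice, so a real condition would probably need a cheaper proxy, for example comparing the slice length against the stored level sizes.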
The biggest slowdown is about 0.2 seconds (when serialization is also involved, which is our main case for the pandas backend), while we save thousands of megabytes and avoid out-of-memory cases, which can happen very often when using a MultiIndex on machines with less than 100 GB of RAM. Considering also that Modin aims to run workloads on 100 GB and larger datasets, this looks like a great deal.

P.S. It is also possible that there is room for performance improvement in remove_unused_levels itself.

+1 to this change as it is.
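For a sense of scale, the kind of saving being described can be reproduced directly in pandas (an illustrative sketch, not one of the benchmarks from this thread):

```python
import numpy as np
import pandas as pd

# A long string-valued MultiIndex, then a small slice of it (one "partition").
mi = pd.MultiIndex.from_arrays(
    [[f"value_{i}" * 10 for i in range(1_000_000)], np.arange(1_000_000)]
)
part = mi[:1_000]

# The slice still references all one million level values...
print(part.memory_usage(deep=True))
# ...until the unused levels are explicitly dropped.
print(part.remove_unused_levels().memory_usage(deep=True))
```

The exact numbers depend on string length and index size, which matches the observation above that the benefit is most pronounced for large strings.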
LGTM!
Signed-off-by: Dmitry Chigarev <[email protected]>
What do these changes do?
This change should decrease memory consumption for dataframes that use MultiIndex (see this comment).
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- docs/development/architecture.rst is up-to-date