
PERF-#5247: Decrease memory consumption for MultiIndex #5632

Merged 1 commit into master on Mar 1, 2023

Conversation

@dchigarev (Collaborator) commented on Feb 7, 2023

Signed-off-by: Dmitry Chigarev [email protected]

What do these changes do?

This change should decrease memory consumption for dataframes that use MultiIndex (see this comment).
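
For context, a minimal sketch (my illustration, not part of the PR) of the pandas behavior this relies on: slicing a MultiIndex keeps all level values of the original index until remove_unused_levels() is called.

import numpy as np
import pandas as pd

# A MultiIndex with one million unique values per level.
full = pd.MultiIndex.from_arrays(
    [np.arange(1_000_000), np.arange(1_000_000).astype(str)]
)

# A small slice still carries every level value of the original index.
chunk = full[:100]
print(len(chunk.levels[0]), len(chunk.levels[1]))      # 1000000 1000000

# remove_unused_levels() rebuilds the levels from the values actually used.
trimmed = chunk.remove_unused_levels()
print(len(trimmed.levels[0]), len(trimmed.levels[1]))  # 100 100

# The difference is visible in memory_usage(deep=True).
print(chunk.memory_usage(deep=True), "->", trimmed.memory_usage(deep=True))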

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit so that the CI job that checks the PR title picks up the new title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves MultiIndex takes up a huge amount of storage space #5247
  • tests are passing
  • module layout described at docs/development/architecture.rst is up-to-date

@dchigarev changed the title from "PERF-#5247: Make MultiIndex use memory more efficiently" to "PERF-#5247: Make MultiIndex to use memory more efficiently" on Feb 7, 2023
@dchigarev marked this pull request as ready for review on February 7, 2023 15:22
@dchigarev requested a review from a team as a code owner on February 7, 2023 15:22
@dchigarev changed the title from "PERF-#5247: Make MultiIndex to use memory more efficiently" to "PERF-#5247: Decrease memory consumption for MultiIndex" on Feb 7, 2023
return [
# Sliced MultiIndex still stores all encoded values of the original index, explicitly
# asking it to drop unused values in order to save memory.
chunk.set_axis(chunk.axes[axis].remove_unused_levels(), axis=axis, copy=False)
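
For readers outside the Modin codebase, here is a hypothetical standalone sketch of the pattern the snippet above applies (split_rows and its shape are mine, not Modin's internals): when a frame is split into row chunks, each chunk's MultiIndex is asked to drop the level values it no longer uses.

import pandas as pd

def split_rows(df, nparts):
    # Hypothetical helper mirroring the idea above, not Modin's actual code.
    chunk_len = max(len(df) // nparts, 1)
    chunks = [df.iloc[i * chunk_len : (i + 1) * chunk_len] for i in range(nparts)]
    return [
        # A sliced MultiIndex still stores all encoded values of the original
        # index, so explicitly drop the unused ones to save memory.
        chunk.set_axis(chunk.axes[0].remove_unused_levels(), axis=0, copy=False)
        if isinstance(chunk.axes[0], pd.MultiIndex)
        else chunk
        for chunk in chunks
    ]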
A collaborator commented:

@dchigarev Looks promising! How does this affect the reverse operation, concatenation?

@dchigarev (Collaborator, Author)

I've now run some benchmarks for this approach. I measured the time to split a MultiIndex into pieces and then restore the original index from those pieces, with and without the .remove_unused_levels method applied.

I measured both with and without putting the data into Ray. When Ray was not involved, I saw performance degradation only at the splitting stage (because of the extra step of removing unused levels); there was no difference in concatenating the pieces whether or not unused levels had been dropped.

When Ray was involved, there was also a slight slowdown at the splitting stage, although it disappeared as the length of the values stored in the index (str_len) increased:

[image: benchmark results with Ray]

Just in case you're interested, here's the complete table that also includes testing without Ray:

Complete table with the results:

[image: complete results table, with and without Ray]

Script I used to measure all this:
import pandas as pd
import numpy as np
from timeit import default_timer as timer

from modin.core.execution.ray.common.utils import initialize_ray

import ray
initialize_ray()

import uuid

def get_random_str(length):
    # Repeat/truncate a UUID string so the result is exactly `length` characters
    # (the original comparison produced empty strings for short lengths).
    res = str(uuid.uuid1())
    return res[:length] if len(res) >= length else (res * (length // len(res) + 1))[:length]

def get_level_values(level_len, str_len):
    return [get_random_str(str_len) for _ in range(level_len)]

def get_mi(idx_len, nlevels, str_len, is_highly_unique):
    if is_highly_unique:
        levels = [get_level_values(idx_len, str_len) for _ in range(nlevels)]
        return pd.MultiIndex.from_arrays(levels)
    else:
        unique_rate = 0.2
        levels = [
            np.random.choice(get_level_values(int(idx_len * unique_rate), str_len), idx_len)
            for _ in range(nlevels)
        ]
        return pd.MultiIndex.from_arrays(levels)

def split_and_restore(mi, nparts=112, drop_levels=True, use_ray=False):
    print(f"running with {drop_levels=}")
    one_part = len(mi) // nparts
    refs = [None] * nparts

    t1 = timer()
    for idx in range(nparts):
        split = mi[idx * one_part : (idx + 1) * one_part]
        if drop_levels:
            split = split.remove_unused_levels()
        
        if use_ray:
            refs[idx] = ray.put(split)
        else:
            refs[idx] = split
    
    splitting_time = timer() - t1
    print("\tsplitting took", splitting_time)

    t1 = timer()
    if use_ray:
        refs = ray.get(refs)
    refs[0].append(refs[1:])
    restoring_time = timer() - t1
    print("\trestoring took", restoring_time)
    return splitting_time, restoring_time


idx_lens = [
    100_000,
    1_000_000,
    10_000_000
]
nlevels = [3]
str_lens = [10, 100]

result = pd.DataFrame(
    index=np.array([(f"{x} split", f"{x} concat") for x in idx_lens]).flatten(),
    columns=pd.MultiIndex.from_product(
        [[0, 1], ["sparsed", "dense"], str_lens, [0, 1]],
        names=["use ray", "data density", "str_len", "remove unused levels"]
    )
)

for use_ray in [False, True]:
    print(f"========== USE RAY={use_ray} ===========")
    for idx_len in idx_lens:
        for nlevel in nlevels:
            for str_len in str_lens:
                print(f"==== {idx_len=} {nlevel=} {str_len=} sparsed data ====")
                key = (idx_len, nlevel, str_len, "sparsed")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=False)

                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 1)] = c_time

                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 0)] = c_time

                print(f"==== {idx_len=} {nlevel=} {str_len=} higly unique data ====")
                key = (idx_len, nlevel, str_len, "dense")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=True)

                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 1)] = c_time
        
                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 0)] = c_time

                result.to_excel("mi_res.xlsx") # store the result in case OOM occurs at further iterations

@anmyachev (Collaborator)

@dchigarev Great results, I see no reason not to add this change. However, for the sake of completeness, I suggest making the same table for nparts=16 (a situation closer to what users typically see), if you don't mind.

@dchigarev (Collaborator, Author)

I suggest making the same table for nparts=16 (a situation closer to what users typically see)

Good point.

Here are the measurements for the smaller number of splits; overall, the trend stayed the same:

Time for num_splits=16:

[image: timing results for num_splits=16]

I've also run memory consumption tests, measuring ray memory before and after putting the index pieces into Ray's object store:

Memory consumption tests:

[image: memory consumption results]

It showed that removing unused levels gives a noticeable benefit only in cases with large strings; otherwise, the memory reduction is not significant, yet we still lose time on the extra step of removing the levels.
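
To get a feel for the per-piece effect without a Ray cluster, one rough proxy (my sketch, not the measurement above) is the pickled size of a piece before and after trimming, e.g. for the highly unique case:

import pickle
import pandas as pd

# Highly unique case: each of the three levels holds one million ~54-char strings.
mi = pd.MultiIndex.from_arrays(
    [[f"{lvl}_{i:07d}" * 6 for i in range(1_000_000)] for lvl in "abc"]
)
piece = mi[: len(mi) // 16]  # one of 16 splits

raw_size = len(pickle.dumps(piece))
trimmed_size = len(pickle.dumps(piece.remove_unused_levels()))
print(f"pickled piece: {raw_size:_} -> {trimmed_size:_} bytes")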

So I guess we need some kind of switch between removing and not removing these unused levels, although I don't know yet what the condition could be.
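
Purely to illustrate the idea (a hypothetical sketch, not something this PR implements), such a switch could compare how many level values each chunk actually uses against how many it carries, and trim only when the unused fraction is large enough:

import pandas as pd

def maybe_remove_unused_levels(index, min_unused_fraction=0.5):
    # Hypothetical heuristic: trim only when at least one level carries a large
    # share of values the chunk no longer references.
    if not isinstance(index, pd.MultiIndex):
        return index
    unused_fraction = max(
        # `codes` holds one integer per row per level; compare the number of
        # distinct codes in use against the size of the corresponding level.
        1 - len(pd.unique(codes)) / max(len(level), 1)
        for level, codes in zip(index.levels, index.codes)
    )
    return index.remove_unused_levels() if unused_fraction >= min_unused_fraction else index

Of course, the check itself scans the codes, so whether it pays off is exactly the open question above.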

@anmyachev (Collaborator)

It showed that removing unused levels gives a noticeable benefit only in cases with large strings; otherwise, the memory reduction is not significant, yet we still lose time on the extra step of removing the levels.
So I guess we need some kind of switch between removing and not removing these unused levels, although I don't know yet what the condition could be.

The biggest slowdown is about 0.2 seconds (with serialization included, which is our main case for the pandas backend), while we save thousands of megabytes and avoid out-of-memory situations, which happen very often when using a MultiIndex on machines with less than 100 GB of RAM. Considering also that Modin aims to run workloads on datasets of 100 GB and larger, this looks like a great deal.

P.S. It is also possible that there is room for performance improvement in the remove_unused_levels function itself.

+1 to this change as it is

@anmyachev (Collaborator) left a comment

LGTM!

@YarShev merged commit 6afee33 into master on Mar 1, 2023
@dchigarev deleted the issue_5247 branch on March 1, 2023 17:47
Successfully merging this pull request may close these issues.

MultiIndex takes up a huge amount of storage space