PERF-#5247: Decrease memory consumption for MultiIndex #5632
Conversation
Signed-off-by: Dmitry Chigarev <[email protected]>
```python
return [
    # A sliced MultiIndex still stores all encoded values of the original index;
    # explicitly ask it to drop unused values in order to save memory.
    chunk.set_axis(chunk.axes[axis].remove_unused_levels(), axis=axis, copy=False)
```
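For context on what the added remove_unused_levels() call does, here is a small standalone pandas illustration (not code from the PR itself):

```python
import pandas as pd

mi = pd.MultiIndex.from_product([range(1_000), list("abcd")])
sliced = mi[:8]  # only labels 0 and 1 of the first level are actually used

# The slice still stores every value of the original first level...
print(len(sliced.levels[0]))  # 1000
# ...and remove_unused_levels() shrinks it to what the slice really needs.
print(len(sliced.remove_unused_levels().levels[0]))  # 2
```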
@dchigarev Looks promising! How does this affect the reverse concatenation operation?
Now I've run some benchmarks for this approach. What I was measuring is the time for splitting a MultiIndex into pieces and then restoring the original index from those pieces, with and without the extra remove_unused_levels() step, and with and without putting the data into Ray. When Ray was not involved, I saw performance degradation only at the splitting stage (because of the extra step of removing unused levels); there were zero differences in concatenating pieces with and without unused levels being dropped. When measuring with Ray involved, there was also a slight slowdown at the splitting stage, though it was eliminated by increasing the length of the values stored in the index (str_len). Just in case you're interested, here's the table that includes testing without Ray:

[benchmark results table]

The script I used to measure all this:

```python
import pandas as pd
import numpy as np
from timeit import default_timer as timer
from modin.core.execution.ray.common.utils import initialize_ray
import ray
initialize_ray()
import uuid
def get_random_str(length):
    res = str(uuid.uuid1())
    # Truncate the uuid if it is long enough, otherwise repeat it to reach the target length
    return res[:length] if len(res) >= length else (res * (length // len(res) + 1))[:length]
def get_level_values(level_len, str_len):
    return [get_random_str(str_len) for _ in range(level_len)]
def get_mi(idx_len, nlevels, str_len, is_highly_unique):
    if is_highly_unique:
        levels = [get_level_values(idx_len, str_len) for _ in range(nlevels)]
        return pd.MultiIndex.from_arrays(levels)
    else:
        unique_rate = 0.2
        levels = [
            np.random.choice(get_level_values(int(idx_len * unique_rate), str_len), idx_len)
            for _ in range(nlevels)
        ]
        return pd.MultiIndex.from_arrays(levels)
def split_and_restore(mi, nparts=112, drop_levels=True, use_ray=False):
    print(f"running with {drop_levels=}")
    one_part = len(mi) // nparts
    refs = [None] * nparts
    t1 = timer()
    for idx in range(nparts):
        split = mi[idx * one_part : (idx + 1) * one_part]
        if drop_levels:
            split = split.remove_unused_levels()
        if use_ray:
            refs[idx] = ray.put(split)
        else:
            refs[idx] = split
    splitting_time = timer() - t1
    print("\tsplitting took", splitting_time)
    t1 = timer()
    if use_ray:
        refs = ray.get(refs)
    # Restore the original index by appending the remaining pieces to the first one
    refs[0].append(refs[1:])
    restoring_time = timer() - t1
    print("\trestoring took", restoring_time)
    return splitting_time, restoring_time
idx_lens = [
    100_000,
    1_000_000,
    10_000_000,
]
nlevels = [3]
str_lens = [10, 100]
result = pd.DataFrame(
    index=np.array([(f"{x} split", f"{x} concat") for x in idx_lens]).flatten(),
    columns=pd.MultiIndex.from_product(
        [[0, 1], ["sparsed", "dense"], str_lens, [0, 1]],
        names=["use ray", "data density", "str_len", "remove unused levels"],
    ),
)
for use_ray in [False, True]:
    print(f"========== USE RAY={use_ray} ===========")
    for idx_len in idx_lens:
        for nlevel in nlevels:
            for str_len in str_lens:
                print(f"==== {idx_len=} {nlevel=} {str_len=} sparsed data ====")
                key = (idx_len, nlevel, str_len, "sparsed")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=False)
                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 1)] = c_time
                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "sparsed", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "sparsed", str_len, 0)] = c_time
                print(f"==== {idx_len=} {nlevel=} {str_len=} highly unique data ====")
                key = (idx_len, nlevel, str_len, "dense")
                mi = get_mi(idx_len, nlevels=nlevel, str_len=str_len, is_highly_unique=True)
                print("index done")
                s_time, c_time = split_and_restore(mi, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 1)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 1)] = c_time
                s_time, c_time = split_and_restore(mi, drop_levels=False, use_ray=use_ray)
                result.loc[f"{idx_len} split", (int(use_ray), "dense", str_len, 0)] = s_time
                result.loc[f"{idx_len} concat", (int(use_ray), "dense", str_len, 0)] = c_time
result.to_excel("mi_res.xlsx")  # store the result in case OOM occurs at further iterations
```
@dchigarev Great results, I see no reason not to add this change. However, for the sake of completeness, I suggest making the same table for a lower number of splits.
Good point. Here are the measurements for the case of a lower number of splits; overall, the trend stayed the same. I've also run memory consumption tests. They showed that removing unused levels has a noticeable benefit only in cases of large strings; otherwise, the memory consumption reduction is not significant, but we still lose time on the extra step of removing levels. So I guess we need some kind of switch between removing and not removing these unused levels, although I don't know yet what the condition could be.
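To make the idea of such a switch concrete, here is a minimal sketch of one possible condition; the maybe_remove_unused_levels helper and the UNUSED_RATIO_THRESHOLD value are hypothetical illustrations, not anything this PR implements:

```python
import numpy as np
import pandas as pd

# Hypothetical threshold: only pay the cost of remove_unused_levels()
# when a large share of the stored level values is no longer referenced.
UNUSED_RATIO_THRESHOLD = 0.5

def maybe_remove_unused_levels(index):
    """Drop unused MultiIndex levels only when it is likely to pay off."""
    if not isinstance(index, pd.MultiIndex):
        return index
    stored = sum(len(level) for level in index.levels)
    used = sum(len(np.unique(codes)) for codes in index.codes)
    if 1 - used / stored > UNUSED_RATIO_THRESHOLD:
        return index.remove_unused_levels()
    return index
```

Counting the used codes is itself a scan over the whole slice, so a real condition would probably need a cheaper proxy, for example comparing the slice length against the stored level sizes.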
The biggest slowdown is about 0.2 seconds (when serialization is also involved, which is our main case for the pandas backend), while we save thousands of megabytes and avoid out-of-memory cases, which can happen very often when using a MultiIndex on machines with less than 100 GB of RAM. Considering also that Modin aims to run workloads on 100 GB and larger datasets, this looks like a great deal.

P.S. It is also possible that there is room for performance improvement in remove_unused_levels itself.

+1 to this change as it is.
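For a sense of scale, the kind of saving being described can be reproduced directly in pandas (an illustrative sketch, not one of the benchmarks from this thread):

```python
import numpy as np
import pandas as pd

# A long string-valued MultiIndex, then a small slice of it (one "partition").
mi = pd.MultiIndex.from_arrays(
    [[f"value_{i}" * 10 for i in range(1_000_000)], np.arange(1_000_000)]
)
part = mi[:1_000]

# The slice still references all one million level values...
print(part.memory_usage(deep=True))
# ...until the unused levels are explicitly dropped.
print(part.remove_unused_levels().memory_usage(deep=True))
```

The exact numbers depend on string length and index size, which matches the observation above that the benefit is most pronounced for large strings.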
LGTM!
Signed-off-by: Dmitry Chigarev <[email protected]>
What do these changes do?
This change should decrease memory consumption for dataframes that use MultiIndex (see this comment).
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- docs/development/architecture.rst is up-to-date