
fix_alloc memory leak in Erlang 24 and 25 #6947

Closed
nickva opened this issue Mar 1, 2023 · 11 comments

@nickva
Contributor

nickva commented Mar 1, 2023

Describe the bug

There seems to be a memory leak in the fix_alloc allocator, coming from nsched_pid_ref_entry allocations.

Nodes upgraded from OTP 23 to 24 started to show a steady increase in fix_alloc memory usage. Over the course of a month or so, some nodes consumed as much as 35 GB of memory:

[graph "fixallocleak": steady growth of fix_alloc memory usage over roughly a month]

It was determined to be fix_alloc using these diagnostic calls:

> proplists:get_value(fix_alloc, recon_alloc:memory(allocated_types, current)).
24744067072

> f(L), {ok, {512, L}} = instrument:carriers(#{allocator_types=>[fix_alloc]}), L, lists:sum([T || {fix_alloc, _, T, _, _, _} <- L]).
24744067072
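
That is roughly 23 GiB:

> 24744067072 div (1 bsl 30).
23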

Enabling tagging for fix_alloc allows some additional stats to be shown in the instrument:allocations() result. However, that requires restarting the nodes, so I opted to gather stats with the erlang:system_info({memory_internal, Ref, [fix_alloc]}) call on nodes which had already leaked about 20 GB of memory:

(Apologies for the ugly script below; it's most likely wrong and underestimates the amount.)

(fun() ->
     Ref = make_ref(),
     erlang:system_info({memory_internal, Ref, [fix_alloc]}),
     Res = (fun Recv(Acc) ->
                receive
                    {Ref, _N, [{fix_alloc, _N, [{fix_types, _Mask, Props} | _]}]} ->
                        Recv([[{PName, A} || {PName, A, U} <- Props, A > 0, A < 1 bsl 48] | Acc])
                after 1000 ->
                    Acc
                end
            end)([]),
     {Map, Total} = lists:foldl(fun({N, A}, {R, T}) ->
                                    {maps:update_with(N, fun(V) -> V + A end, A, R), T + A}
                                end, {#{}, 0}, lists:flatten(Res)),
     io:format("~n", []),
     lists:foreach(fun({K, V}) ->
                       io:format("~p  = ~p MB~n", [K, V bsr 20])
                   end, lists:reverse(lists:keysort(2, maps:to_list(Map)))),
     io:format("~nTOTAL: ~p MB~n", [Total bsr 20])
 end)().
nsched_pid_ref_entry  = 14674 MB
proc  = 1981 MB
monitor  = 461 MB
link  = 259 MB
bif_timer  = 125 MB
msg_ref  = 73 MB
receive_marker_block  = 6 MB
ll_ptimer  = 4 MB
sl_thr_q_element  = 0 MB
magic_indirection  = 0 MB
driver_select_data_state  = 0 MB
TOTAL: 17588 MB
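
With tagging enabled, a similar per-type breakdown should also be available directly from instrument:allocations(). Roughly something like this (a sketch only; not verified on these nodes, since tags weren't enabled there):

> instrument:allocations(#{allocator_types => [fix_alloc]}).
%% => {ok, {HistogramStart, UnscannedSize, #{Origin => #{Type => SizeHistogram}}}}
%%    or {error, not_enabled} when allocation tags are off for fix_alloc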

The output above pointed me to the nsched_pid_ref_entry allocations, which in turn led me to the new aliases feature in OTP 24.

I noticed that implementation was subsequently removed in 26rc1, which might explain why I couldn't reproduce the issue there.

To Reproduce

I think I managed to reproduce it with a modified multi-process pid-pinger script that uses aliases:

https://gist.github.com/nickva/9762b8dca22af82e12203b71cc254bcc

I tested 24 and 25 and both show the problem; 26rc1 doesn't. Tested on both Linux and macOS.

The dist part in the reproducer may not be necessary; I just adapted an older leak reproducer for dist binary memory as a starting point.
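
The core idea is roughly along these lines (a much-simplified, hypothetical sketch without the dist part; module and helper names are made up, the full script is in the gist):

%% alias_churn.erl -- hypothetical sketch: short-lived processes create
%% aliases and terminate without calling unalias/1, which on OTP 24/25
%% leaves nsched_pid_ref_entry blocks behind.
-module(alias_churn).
-export([go/0]).

go() ->
    churn(0).

churn(N) ->
    Parent = self(),
    spawn(fun() ->
              Alias = erlang:alias(),       %% OTP 24+
              Parent ! {ping, Alias},
              receive pong -> ok after 100 -> ok end
              %% the process exits here with the alias still active
          end),
    receive {ping, A} -> A ! pong end,
    case N rem 100000 of
        0 -> io:format("fix_alloc carriers: ~p MB~n", [fix_alloc_mb()]);
        _ -> ok
    end,
    churn(N + 1).

fix_alloc_mb() ->
    %% same instrument:carriers/1 call as in the diagnostics above
    {ok, {_, Cs}} = instrument:carriers(#{allocator_types => [fix_alloc]}),
    lists:sum([T || {fix_alloc, _, T, _, _, _} <- Cs]) bsr 20.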

Expected behavior

Memory should not leak.

Affected versions

OTP 24 and 25. OTP 26rc1 doesn't seem to be affected.

@nickva nickva added the bug Issue is reported as a bug label Mar 1, 2023
@rickard-green rickard-green self-assigned this Mar 1, 2023
@rickard-green rickard-green added the team:VM Assigned to OTP team VM label Mar 1, 2023
@rickard-green
Contributor

Thanks for an excellent bug report! The leak is due to active aliases not being cleaned up in the alias table when a process terminates. The alias table is not necessary in OTP 26, and has, as you noted, been completely removed there, which is why the issue does not exist in OTP 26. I'll soon make a PR with a fix.

rickard-green added a commit to rickard-green/otp that referenced this issue Mar 1, 2023
… into rickard/alias-cleanup-fix/25.2.3/erlangGH-6947/OTP-18496

* rickard/alias-cleanup-fix/24.3/erlangGH-6947/OTP-18496:
  [erts] Ensure cleanup of alias table on alias destruction
@rickard-green
Contributor

PR #6953 fixes this. It should be part of the next OTP 25 and 24 patches unless something unexpected happens.

@nickva
Contributor Author

nickva commented Mar 1, 2023

Thanks for investigating and the quick fix @rickard-green.

The reproducer script shows that memory stays stable at around 93-115 MB:

% export OTP_GITHUB_URL="https://github.com/nickva/otp.git"

% asdf install erlang ref:rickard-alias-cleanup-fix

% asdf shell erlang ref:rickard-alias-cleanup-fix

% erl -name [email protected] +MFatags true

([email protected])1> c(fixalloc), fixalloc:go('[email protected]').
PidRefHist:{1,0,0,0} AllocSize(MB):0
PidRefHist:{81090,0,0,0} AllocSize(MB):93
PidRefHist:{17330,0,0,0} AllocSize(MB):93
PidRefHist:{85236,0,0,0} AllocSize(MB):87
...
PidRefHist:{16640,0,0,0} AllocSize(MB):100
PidRefHist:{41802,0,0,0} AllocSize(MB):108

rickard-green added a commit that referenced this issue Mar 2, 2023
@rickard-green
Contributor

I've merged the fix into the maint branch now. This will be released in OTP 25.3 next week.

@nickva
Contributor Author

nickva commented Mar 2, 2023

@rickard-green thank you, much appreciated!

We run 24 in production currently. Any chance there might be an OTP 24 point release with a fix?

@rickard-green
Contributor

We will soon release an OTP 24.3.4.10 patch with this fix included as well. I'm not sure if it will be this week or next week, though.

Commit 419356c contains the fix for OTP 24 which is based on OTP 24.3.4 and should merge cleanly to any OTP 24.3.4.* version if you want to apply the fix temporarily by yourself before the patch is released.

@nickva
Contributor Author

nickva commented Mar 6, 2023

> We will soon release an OTP 24.3.4.10 patch with this fix included as well. I'm not sure if it will be this week or next week, though.

Thank you, that would be perfect!

> Commit 419356c contains the fix for OTP 24 which is based on OTP 24.3.4 and should merge cleanly to any OTP 24.3.4.* version if you want to apply the fix temporarily by yourself before the patch is released.

That's what we did, and we will be updating our production environment soon.

rickard-green pushed a commit that referenced this issue Mar 17, 2023
…maint-24

* rickard/alias-cleanup-fix/24.3/GH-6947/OTP-18496:
  [erts] Ensure cleanup of alias table on alias destruction
@rickard-green
Contributor

The patch package OTP 24.3.4.10 including this fix has now been released.

@nickva
Contributor Author

nickva commented Mar 17, 2023

Thank you, Rickard. That's much appreciated.

@pallix
Contributor

pallix commented Jan 8, 2024

My team experienced a memory leak last year with OTP 25.3.2.6, so the fix for this issue was definitely not sufficient. It was tricky to analyze because it would take days before the machine ran out of memory. We observed that upgrading to OTP 26 fixed the leak. Unfortunately, once a solution was found, we did not find much time to do a deep analysis and find the cause of the leak in OTP 25.3.2.6.

The leak could not be observed in all our environments, only in the ones where the database connectivity was not great. I see now that there is another leak mentioned in #7834, so the leak my team observed could be an instance of it. If that's the case then all is good; otherwise there may be a third leak.

If we do find time to reproduce it (no guarantees, since the environment where we observed the leak has changed since then) or to gather new information, I will post it here.

Thank you for the fast responses and fixes for all the issues!

@nickva
Contributor Author

nickva commented Jan 8, 2024

@pallix if you can use a remsh on one of the nodes, you could try to identify the type of memory that leaks.

Here is what I used for this issue:

proplists:get_value(fix_alloc, recon_alloc:memory(allocated_types, current)).

Or

f(L), {ok, {512, L}} = instrument:carriers(#{allocator_types=>[fix_alloc]}), L, lists:sum([T || {fix_alloc, _, T, _, _, _} <- L]).

If that's a large proportion of the total memory used, then it could be related to this issue.

You can try narrowing it down even further by restarting the node with +MFatags true and running:

(fun() ->
     Ref = make_ref(),
     erlang:system_info({memory_internal, Ref, [fix_alloc]}),
     Res = (fun Recv(Acc) ->
                receive
                    {Ref, _N, [{fix_alloc, _N, [{fix_types, _Mask, Props} | _]}]} ->
                        Recv([[{PName, A} || {PName, A, U} <- Props, A > 0, A < 1 bsl 48] | Acc])
                after 1000 ->
                    Acc
                end
            end)([]),
     {Map, Total} = lists:foldl(fun({N, A}, {R, T}) ->
                                    {maps:update_with(N, fun(V) -> V + A end, A, R), T + A}
                                end, {#{}, 0}, lists:flatten(Res)),
     io:format("~n", []),
     lists:foreach(fun({K, V}) ->
                       io:format("~p  = ~p MB~n", [K, V bsr 20])
                   end, lists:reverse(lists:keysort(2, maps:to_list(Map)))),
     io:format("~nTOTAL: ~p MB~n", [Total bsr 20])
 end)().

Otherwise, if you think it's most likely the issue in #7834, you can also look at the types of memory used with instrument:allocations() and erlang:memory().
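
For example, something along these lines (a rough sketch; instrument:allocations() needs allocation tags enabled for the allocator types of interest, e.g. via the atags flag as above, otherwise it returns {error, not_enabled}):

> erlang:memory().
%% overall breakdown: total, processes, binary, ets, ...

> instrument:allocations().
%% => {ok, {HistogramStart, UnscannedSize, #{Origin => #{Type => SizeHistogram}}}}
%%    where Origin is 'system', a driver/NIF name, etc.; the dominant Type
%%    usually points at what is leaking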
