-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix_alloc memory leak in Erlang 24 and 25 #6947
Comments
Thanks for an excellent bug report! The leak is due to active aliases not being cleaned up in the alias table when a process terminates. The alias table is not necessary in OTP 26, and has, as you noted, been completely removed there, which is why the issue does not exist in OTP 26. I'll soon make a PR with a fix. |
… into rickard/alias-cleanup-fix/25.2.3/erlangGH-6947/OTP-18496 * rickard/alias-cleanup-fix/24.3/erlangGH-6947/OTP-18496: [erts] Ensure cleanup of alias table on alias destruction
PR #6953 fixes this. It should be part of the next OTP 25 and 24 patches unless something unexpected happens. |
Thanks for investigating and the quick fix @rickard-green. The reproducer script shows the memory stays stable around 93-115MB
|
…/25.2.3/GH-6947/OTP-18496 Alias cleanup fix
I've merged the fix into the |
@rickard-green thank you, much appreciated! We run 24 in production currently. Any chance there might be an OTP 24 point release with a fix? |
We will soon release an OTP 24.3.4.10 patch with this fix included as well. I'm not sure if it will be this week or next week, though. Commit 419356c contains the fix for OTP 24 which is based on OTP 24.3.4 and should merge cleanly to any OTP 24.3.4.* version if you want to apply the fix temporarily by yourself before the patch is released. |
Thank you, that would be perfect!
That's what we did and will be updating our production environment soon |
…maint-24 * rickard/alias-cleanup-fix/24.3/GH-6947/OTP-18496: [erts] Ensure cleanup of alias table on alias destruction
The patch package OTP 24.3.4.10 including this fix has now been released. |
Thank you, Rickard. That's much appreciated. |
My team experienced memory leak last year with OTP 25.3.2.6 so the fix for this issue was definitely not sufficient. It was tricky to analyze because it would take days before the machine would run out of memory. We could observe that upgrading to OTP 26 fixed the leak. Unfortunately, once a solution was found, we did not find much time to do a deep analysis and find the cause of the leak in OTP 25.3.2.6. The leak could not be observed in all our environments but only the ones where the database connectivity was not great. I see now that there is another leak mentioned in #7834 so that the leak my team observed could be an instance of it. If it's the case then all is good, otherwise there may be a third leak. If we do find time to reproduce it (no guarantees, since the environment where we observed the leak has changed since we observed it) or gather new information, I will post the new information we gathered. Thank you for the fast responses and fixes for all the issues! |
@pallix if you can use remsh on one of the nodes, could try to identify the type of memory which leaks. Here is what I used for this issue:
Or
And if that's a large proportion compared to the total memory used then it could be related. Can try narrowing it down even further by restarting the node with
Otherwise if you think it's most likely the issue in #7834 can also see the types of memory used there with |
Describe the bug
There seems to be a memory leak in fix_alloc allocator from
nsched_pid_ref_entry
Nodes upgraded from OTP 23 to 24 started to show a steady increase in
fix_alloc
memory usage. Over the course a month, or so, some nodes have consumed as much as 35GB of memory:It was determined to be
fix_alloc
from a few of these diagnostic calls:Enabling tagging for
fix_alloc
allowed some additional stats to be shown ininstrument:allocations()
result. However, that required restarting the nodes so I opted to gather some stats witherlang:system_info({memory_internal, Ref, [fix_alloc]})
call on nodes which already had about 20GB of leaked memory:(apologies for the ugly script, it's most likely wrong and underestimates the amount)
That pointed me to the
nsched_pid_ref_entry
allocations. That in turn lead me to the new aliases feature in OTP 24.I noticed that implementation was subsequently removed in 26rc1 which might explain why I couldn't reproduce the issue in 26rc1.
To Reproduce
I think I managed to reproduce it with a modified multi-process pid pinger script using aliases
https://gist.github.com/nickva/9762b8dca22af82e12203b71cc254bcc
I tested 24 and 25 and they show the problem. 26rc1 doesn't show it. Tested on both Linux and MacOS.
The dist part in the reproducer may not be necessary I just adapted an older leak reproducer for dist binary memory as a starting point.
Expected behavior
Memory not to leak
Affected versions
OTP 24 and 25. OTP 26rc1 doesn't seem to be affected.
The text was updated successfully, but these errors were encountered: