Multiple 'System.OutOfMemoryException' errors in .NET 7 #78959

Open
theolivenbaum opened this issue Nov 29, 2022 · 69 comments
@theolivenbaum

I'm seeing an issue very similar to this one when running a memory-heavy app in a Linux container with a memory limit of >128 GB RAM.

Since we migrated to .NET 7, the app has started throwing random OutOfMemoryExceptions in many unexpected places while under no memory pressure (usually with more than 30% of memory free).

I can see the original issue was closed, but I'm not sure whether it was fixed in the final .NET 7 release or whether the suggestion to set COMPlus_GCRegionRange=10700000000 is the expected workaround.
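
(For context, a sketch of how that suggested setting could be passed to a container for a test run; the value is the one quoted above, and the image name is a placeholder, not something from this thread:)

# Sketch: try the workaround suggested in the earlier issue on one test container.
docker run -e COMPlus_GCRegionRange=10700000000 my-app-image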

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged (New issue has not been triaged by the area owner) label Nov 29, 2022
@ghost

ghost commented Nov 29, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: theolivenbaum
Assignees: -
Labels: area-GC-coreclr, untriaged
Milestone: -

@mangod9 mangod9 removed the untriaged (New issue has not been triaged by the area owner) label Nov 29, 2022
@mangod9 mangod9 added this to the 8.0.0 milestone Nov 29, 2022
@mangod9
Member

mangod9 commented Nov 29, 2022

Thanks for reporting this issue. This looks like it's separate from the original issue -- we are investigating something similar with another customer. Would it be possible to share a dump when the OOM happens?

@theolivenbaum
Author

Unfortunately not, as this is running within a customer's infrastructure and the dump would most probably contain confidential data. Is there an issue here on GitHub I can subscribe to?

@mangod9
Member

mangod9 commented Nov 29, 2022

We don't have an issue yet, so we'll use this one to provide updates. It's most likely something that is already fixed in main and might need porting to 7: #77478

@mangod9
Member

mangod9 commented Nov 29, 2022

Hi @theolivenbaum, would it be possible for you to try out a private build to ensure the fix resolves your issue?

Thanks

@theolivenbaum
Author

That might be hard, as it would involve changing how we build our Docker images. But we're fine waiting until this is backported to 7 - any idea on a timeline for the next servicing release?

@Quppa

Quppa commented Dec 1, 2022

We're also seeing a lot of OOM exceptions since migrating to .NET 7 from .NET 5 (we're now testing .NET 6). In our case, we're running under Windows via Azure App Services. Reported memory usage is low - perhaps lower than what it was under .NET 5. The project in question loads large-ish files in memory.

@mangod9
Member

mangod9 commented Dec 1, 2022

Can you try whether setting COMPlus_GCName=clrgc.dll (Windows) or COMPlus_GCName=libclrgc.so (Linux) makes the OOMs go away? We are working on a fix, but hoping this could be a temporary workaround. Thx.
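
For reference, a minimal sketch of applying this workaround in a container image via the Dockerfile (the base image, paths, and app name below are placeholders, not from this thread):

# Sketch: fall back to the segments-based GC that ships alongside the new regions GC.
# Use clrgc.dll on Windows hosts and libclrgc.so on Linux, as suggested above.
FROM mcr.microsoft.com/dotnet/aspnet:7.0
ENV COMPlus_GCName=libclrgc.so
WORKDIR /app
COPY ./publish .
ENTRYPOINT ["dotnet", "MyApp.dll"]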

@Quppa

Quppa commented Dec 1, 2022

We'll try to find time to test this.

@jeremyosterhoudt

@mangod9 We're seeing a similar issue with .NET 7 on Ubuntu when loading larger files (10+ MB) with File.ReadAllBytes. This works fine on .NET 6.

Setting COMPlus_GCName=libclrgc.so resolves the issue for our setup with .NET7

@Maoni0
Member

Maoni0 commented Dec 2, 2022

it'd be helpful to see what !ao displays (it's an SOS extension). would that be possible? that's always the first step if you have a dump.
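
In case it helps anyone following along, a sketch of getting that output from a dump with the dotnet-dump global tool (the dump file name is a placeholder; analyzeoom is the SOS command that !ao abbreviates):

# Sketch: run the OOM analysis from a dump file.
dotnet tool install -g dotnet-dump        # once
dotnet-dump analyze ./coredump.12345      # placeholder dump file name
# inside the analysis prompt:
#   > analyzeoom    # same as !ao: prints the last OOM reason per heap
#   > eeheap -gc    # managed heap sizes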

@mangod9
Member

mangod9 commented Dec 2, 2022

Setting COMPlus_GCName=libclrgc.so resolves the issue for our setup with .NET7

ok good to know. yeah, like Maoni suggests, getting a dump or trace can help confirm whether it's the same issue. We hope to get it fixed in an upcoming servicing release.

@jeremyosterhoudt

Hopefully I did this right. I followed this guide. Here is the output:

---------Heap 1 ---------
Managed OOM occurred after GC #4 (Requested to allocate 6028264 bytes)
Reason: Could not do a full GC

@mangod9
Member

mangod9 commented Dec 2, 2022

thanks, it does look similar to other cases we have seen.

@Maoni0
Member

Maoni0 commented Dec 5, 2022

would it be possible to try out a private fix? we could deliver a libclrgc.so to you and you could use it the same way you used the shipped version. that would be really helpful.

@theolivenbaum
Author

That would probably be possible!
Also while we're at it, is there any recommendation on how to get memory dumps from within a container?

@mangod9
Member

mangod9 commented Dec 5, 2022

I have copied a private libcoreclr.so at https://1drv.ms/u/s!AtaveiZOervriJhkWC64gVEV8dAHug?e=IyBaP3, if you want to give that a try. You will want to remove the COMPlus_GCName config.
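
For anyone else who needs to test a private runtime binary like this, a sketch of overlaying it in a container image (the base image, runtime version folder, and app name are placeholders; the idea is simply that the private file replaces the libcoreclr.so the app would otherwise load):

# Sketch: overlay a private libcoreclr.so, for testing only.
FROM mcr.microsoft.com/dotnet/aspnet:7.0
COPY ./private/libcoreclr.so /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.0/libcoreclr.so
WORKDIR /app
COPY ./publish .
ENTRYPOINT ["dotnet", "MyApp.dll"]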

@theolivenbaum
Author

@mangod9 @Maoni0 just got the chance today to test the library you sent, and after a day of usage under load there have been no issues so far!

@mangod9
Member

mangod9 commented Dec 6, 2022

ok thanks for trying it out. We will do additional validation and add it to a .NET 7 servicing release (due to the holidays it might be in Feb).

@theolivenbaum
Author

@mangod9 meanwhile what would you recommend? Keep using the version you shared, or use one of the flags suggested above?

@mangod9
Member

mangod9 commented Dec 6, 2022

you could keep using the private build if that works for your scenario. If you pick up a new servicing release it might not work, however. Using COMPlus_GCName is OK as a temporary workaround too.

@theolivenbaum
Author

Thanks! Will keep that in mind then! Just out of curiosity, how come the COMPlus_GCName flag is a workaround? Does the runtime include two copies of the GC?

@mangod9
Member

mangod9 commented Dec 6, 2022

in .NET 7 we have enabled the new regions functionality within the GC. Here are the details: #43844. Since this was a foundational change, we also shipped a separate GC which keeps the previous "segments" functionality -- in case there are issues like this one. Going forward, we plan to use a similar mechanism to release newer GC changes, and could have multiple GC implementations at some point in the future.

@theolivenbaum
Author

That makes a lot of sense and is what I imagined had happened! Looking forward to the servicing release next year, then!

@qwertoyo

qwertoyo commented Jan 3, 2023

👋 Hello! We recently upgraded a series of console apps/generic hosts (and one ASP.NET web host), running on Alpine Linux, from .NET 6 to .NET 7, and this issue (OOM while there's plenty of memory available) started happening on some of them when under load.

From what I can tell, it is not happening in the apps where we have set the GC mode to server with
ENV DOTNET_gcServer 1 in the Dockerfile, but only in console apps that don't have that flag set (=> workstation GC mode). It is also not happening in the ASP.NET one, which has that flag set by default AFAIK.

I will now try setting ENV COMPlus_GCName libclrgc.so on the apps that are suffering and retest, but a question: do you think enabling server GC mode on those could be another workaround?
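
For reference, the two settings being compared here sit side by side in a Dockerfile like this (a sketch only; whether server GC alone avoids the OOMs is exactly the open question above):

# Sketch: the two knobs discussed in this thread, for a .NET 7 container image.
# Server GC (the ASP.NET templates enable this by default):
ENV DOTNET_gcServer=1
# Fall back to the segments-based GC shipped alongside the new regions GC:
ENV COMPlus_GCName=libclrgc.so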

@Maoni0
Member

Maoni0 commented Apr 14, 2023

I'm confused, if you are still using libclrgc and not getting OOM, and you want to know why it gets OOM without libclrgc, wouldn't you want to get rid of libclrgc and repro the OOM, and then do analysis there?

the corresponding name of coreclr on linux would be libcoreclr.so. so if you want to look at this in windbg, you'd do libcoreclr instead of coreclr.

@dave-yotta

Sorry for the confusion, let me try to clear it up.

Using the older libclrgc solved the issue described in this comment above.

But we have another problem: we are still using libclrgc on .NET 7.0.4, and have a lot of allocated native memory and GC time we can't pin down, as seen in this comment. eeheap is giving:

GC Allocated Heap Size:    Size: 0x182061e8 (404775400) bytes.
GC Committed Heap Size:    Size: 0x29f58000 (703954944) bytes.

for a process with around 1.8 GB resident (same scenario as above).

I also noticed (for one of our other processes in this scenario) that using workstation GC gave better GC performance (or at least the observed resident memory did not fluctuate to high values).

This led me to wonder whether we're actually seeing a problem common to both GCs. Not sure if this has been helpful in the end, however! Windbg is showing 0 for all those gc_heap values:

0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit
SVR::region_free_list [3] 0x00007fdb`44ffaea0
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 
0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit[0]
SVR::region_free_list
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 
0:000> ?? libcoreclr!SVR::gc_heap::global_regions_to_decommit[1]
SVR::region_free_list
   +0x000 num_free_regions : 0
   +0x008 size_free_regions : 0
   +0x010 size_committed_in_free_regions : 0
   +0x018 num_free_regions_added : 0
   +0x020 num_free_regions_removed : 0
   +0x028 head_free_region : (null) 
   +0x030 tail_free_region : (null) 

Sorry if this is unrelated/unhelpful - I can open a different issue. I'll double-check against .NET 6; it's very possibly something we've caused here too.
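
One low-effort way to separate managed heap growth from native growth is to watch the runtime counters next to the process working set (a sketch using the dotnet-counters global tool; the process id is a placeholder):

# Sketch: compare GC heap size against total working set over time.
dotnet tool install -g dotnet-counters    # once
dotnet-counters monitor -p <pid> --counters System.Runtime
# If "Working Set" keeps climbing while "GC Heap Size" stays flat,
# the growth is coming from native allocations rather than the GC heap.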

@Maoni0
Member

Maoni0 commented Apr 17, 2023

hi @dave-yotta, if your heap is actually growing, then it's a distinctly different issue from what I mentioned above. if you could open a new issue so we can track them better, that'd be great!

would it be possible to capture a top level GC trace? that's the first step in diagnosing a memory problem. it's described here. it's very low overhead so you can keep it on for a long time. if this problem shows up pretty quickly you could start capturing right before the process is started and stop tracing once it has exhibited the "memory not being released and the heap size is too large" behavior.

if you cannot repro with libclrgc, that's most likely a problem in the GC, so we'd like to track this down with your help. thanks!
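
A sketch of what collecting such a top-level GC trace looks like with the dotnet-trace global tool (the process id and output name are placeholders; gc-collect is dotnet-trace's low-overhead GC profile):

# Sketch: capture a GC-collections-only trace of a running process.
dotnet tool install -g dotnet-trace       # once
dotnet-trace ps                           # find the target process id
dotnet-trace collect -p <pid> --profile gc-collect -o gc.nettrace
# Stop with Ctrl+C once the bad behavior has shown up; the .nettrace file
# can then be opened in PerfView or Visual Studio.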

@dave-yotta

dave-yotta commented Apr 28, 2023

hey @Maoni0, the (used) heap isn't growing; unmanaged memory is growing. Not sure if that's actually the heap free space or something else, but there's a lot of GC time, and we found a lot of allocations/deallocations totalling 12 GB (but never exceeding about 300 MB at any one point). Will try reducing the memory traffic... and I'll run that gc-collect trace before I do. It'll take a while to get around to though, sorry! :D

@Maoni0
Member

Maoni0 commented Apr 28, 2023

no worries. whenever you get a chance, a gc-collect trace would be very helpful to us.

@theolivenbaum
Author

@Maoni0 quick update: just tested the latest runtime without setting COMPlus_GCName=libclrgc.so, and the container in question always crashes with OOM when starting (there's a memory-intensive load phase at startup, but there's also enough memory available for it). With libclrgc.so it starts without issues.

@Maoni0
Member

Maoni0 commented May 5, 2023

@theolivenbaum do you have a dump when it gets OOM that you could share? if there's privacy concerns, could you capture a top level GC trace so we can at least understand if "when starting" means "when starting and still in the initialization phase" or "after it's done some GCs"?

@theolivenbaum
Author

theolivenbaum commented May 5, 2023

@Maoni0 I'm having issues capturing a dump inside a container. Managed to install the dotnet tools, but gcdump gives incomplete results, and dump just fails with an error related to not running as the root user.

Update: This is the error message from dotnet-dump:
Problem launching createdump (may not have execute permissions): execve(/app/createdump) FAILED Permission denied (13)

@Maoni0
Member

Maoni0 commented May 5, 2023

what about dotnet trace?

@theolivenbaum
Author

theolivenbaum commented May 5, 2023

How can I get a memory dump using dotnet-trace?

@Maoni0
Member

Maoni0 commented May 5, 2023

you don't. you capture a GC trace -

if there's privacy concerns, could you capture a top level GC trace so we can at least understand if "when starting" means "when starting and still in the initialization phase" or "after it's done some GCs"?

@hoyosjs
Member

hoyosjs commented May 9, 2023

@Maoni0 I'm having issues capturing a dump inside a container. Managed to install the dotnet tools but gcdump gives incomplete results, and dump just fails with an error related to not running as root user

Update: This is the error message from dotnet-dump: Problem launching createdump (may not have execute permissions): execve(/app/createdump) FAILED Permission denied (13)

@theolivenbaum can you make sure /app/createdump (see the shell sketch after the list):

  • Has the executable bit set.
  • Is owned by the same user that's running the app.
  • /app is also owned by the same user.
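
Something along these lines, run as root during the image build or from an exec shell (the user/group name is a placeholder for whatever account the app runs under):

# Sketch: make createdump runnable by the app user.
chmod +x /app/createdump
chown appuser:appuser /app /app/createdump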

@theolivenbaum
Author

@Maoni0 @hoyosjs good news: found the issue, and it was not related to the .NET runtime. The memory allocator RocksDB uses by default on Linux can leak memory severely, and switching to jemalloc fixed the issue on the server where we were observing the problem. Thanks again for the support, and we can close the issue now!
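
For anyone hitting the same RocksDB behavior, a sketch of preloading jemalloc on a Debian-based image (the package name and library path are the usual Debian ones and are assumptions here, not something confirmed in this thread):

# Sketch: route RocksDB's native allocations through jemalloc via LD_PRELOAD.
FROM mcr.microsoft.com/dotnet/aspnet:7.0
RUN apt-get update && apt-get install -y libjemalloc2 && rm -rf /var/lib/apt/lists/*
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2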

@NKnusperer

Has this really been resolved? We observed the same issue and have mitigated it since then using COMPlus_GCName=libclrgc.so, and we are not using RocksDB (or is it some kind of embedded dependency of the .NET runtime?).

@Maoni0
Member

Maoni0 commented Jun 23, 2023

have you tried preview 5? if you are still seeing OOM without using libclrgc.so, is it possible to share a dump with us?

@NKnusperer

Do you mean .NET 8 Preview 5? I'm talking about .NET 7. If this has been fixed in .NET 8, will we get a backport to .NET 7?

@Maoni0
Member

Maoni0 commented Jun 23, 2023

yeah, .net 8 preview 7. if you cannot try it, could you share a dump from .net 7 but without using libclrgc.so? you may or may not be hitting the same issue that other people hit, so there's no guarantee that even if we backported the fix it would resolve the issue you're hitting.

you could also look at the symbols I mentioned above in a dump yourself.

@mangod9
Member

mangod9 commented Jun 23, 2023

Also @NKnusperer, it might make sense to create a separate issue for this, since there could be different reasons for OOMs.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2023
@markples
Member

Reopening - the repro given in #78959 (comment) and derivatives of it (all 16 MB allocations) are not all solved.

@markples markples reopened this Aug 15, 2024
@markples markples modified the milestones: 8.0.0, 9.0.0 Aug 15, 2024
@markples markples self-assigned this Aug 15, 2024
@markples markples modified the milestones: 9.0.0, 10.0.0 Sep 12, 2024