GC: Fix CheckPromoted for frozen objects #76251
Conversation
Tagging subscribers to this area: @dotnet/gc
Ah, I assume it either has to be moved to
I am not sure about the best place to add this check. I would like to hear @Maoni0's opinion. I suspect that we may have a similar problem with weak handles and dependent handles, which would also lead to bad failure modes.
Looks similar to my previous change in gc.cpp - https://github.com/dotnet/runtime/pull/73110/files
@jkotas, I had a chat with @Maoni0 and I have the impression that the optimization we decided to use - avoiding managed pinned handles (for both RuntimeType and string literals) - is the reason. Previously, all RuntimeType objects were handled by that pinned table, so they were always marked, while now a RuntimeType can be completely unused for some period of time and then used again (with its header containing an out-of-date syncblock index).
Repro with strings:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System;
using System.Runtime.CompilerServices;
public class Class1
{
public static void Main()
{
Test();
// "hello" is unused now, SyncBlock is cleared
GC.Collect();
// Now we need "hello" again!
Test();
}
static void Test()
{
Say("hello");
}
static void Say(string str)
{
lock (str)
{
Console.WriteLine(str);
}
}
}

Didn't test it with Regions yet; I am still on an x86 build with the old GC.
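To make the failure mode concrete, here is a minimal C++ sketch of the stale-index hazard, not runtime code: every type in it (the object header with a syncblock index, the syncblock table, the weak scan) is invented for illustration. The point is only the shape of the bug: the scan frees the slot while the header keeps its index.

#include <cstdio>
#include <vector>

struct SyncBlock { bool inUse; };
struct ObjectHeader { int syncBlockIndex; };   // 0 = no syncblock

std::vector<SyncBlock> g_syncBlocks(1);        // slot 0 is reserved/unused

// lock(obj) lazily allocates a syncblock and stores its index in the header.
int EnsureSyncBlock(ObjectHeader& h) {
    if (h.syncBlockIndex == 0) {
        g_syncBlocks.push_back(SyncBlock{true});
        h.syncBlockIndex = (int)g_syncBlocks.size() - 1;
    }
    return h.syncBlockIndex;
}

// The weak scan frees syncblocks whose owner object was not marked. For a
// frozen object that is merely unreferenced right now, the header keeps
// pointing at the freed (and possibly recycled) slot: the stale-index hazard.
void WeakScanSketch(ObjectHeader& h, bool ownerMarked) {
    if (!ownerMarked && h.syncBlockIndex != 0)
        g_syncBlocks[h.syncBlockIndex].inUse = false;
}

int main() {
    ObjectHeader frozenString{0};            // stands in for the "hello" literal
    EnsureSyncBlock(frozenString);           // first Test(): lock(str)
    WeakScanSketch(frozenString, false);     // GC: the literal is unreferenced
    // Second Test(): the header still references a freed syncblock slot.
    printf("index=%d inUse=%d\n", frozenString.syncBlockIndex,
           (int)g_syncBlocks[frozenString.syncBlockIndex].inUse);
    return 0;
}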
Possible solutions (after some chatting with Maoni):
Implemented the 3rd option.
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
I am not sure I understand, but are we saying we are going to mark/unmark all frozen objects for every GC, as if they were roots?
The fix for this problem should also cover objects from file-mapped frozen heaps. File-mapped frozen heaps do not have GC handles or an extra array pointing to the frozen heap objects. It looks like the whole point of the registry that you have implemented is to mark every single object on the FOH. It is cheaper to just walk the FOH and mark everything on it directly; we do not need an extra array to do that. What prevents us from doing that, if we believe that marking all objects on the FOH during the GC is the right tradeoff?
Is there a better solution? We used to mark/unmark the same objects before FOH too, so the main savings here are that the JIT is able to bake direct references to various objects into codegen and omit write barriers when we assign them to fields/arrays. Also, we allocate fewer pinned handles in the VM. My initial fix was … Basically, I want to fix at least two cases:
Yes, that's the point. What you suggest makes sense to me, but I am not sure I understand how to do it. I assume that somewhere in gc.cpp, during the mark phase, the GC should just walk all objects in the FOH and mark (and then unmark) them, right?
no, we do not mark every object on FOH. they are only marked (if anyone refers to them) on the FOH segments that are in range. this naturally happens with GC's mark process.
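For readers following along, a self-contained sketch of that "in range" distinction, with invented simplified types rather than the real gc.cpp structures: segments outside the condemned range are treated as promoted wholesale, while in-range segments have their mark bits consulted, which is exactly why an in-range ro segment needs its objects marked.

#include <cstdint>
#include <cstdio>

struct Segment { uintptr_t start, end; bool readOnly; };
struct GcRange { uintptr_t low, high; };    // condemned range for this GC

bool InRange(const Segment& s, const GcRange& r) {
    return s.start < r.high && s.end > r.low;
}

// markBit stands in for the object's bit in the mark array.
bool IsPromotedSketch(const Segment& s, const GcRange& r, bool markBit) {
    if (!InRange(s, r))
        return true;    // out of range: never examined, treated as alive
    return markBit;     // in range: only marked objects count as promoted
}

int main() {
    GcRange condemned{0x1000, 0x9000};
    Segment frozen{0x2000, 0x3000, true};   // an ro segment that is in range
    // Nothing referenced the frozen object this GC, so its mark bit is clear:
    printf("promoted=%d\n", (int)IsPromotedSketch(frozen, condemned, false));
    // Prints 0: the weak scan would then free its syncblock, as in #76219.
    return 0;
}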
Right, but previously all the objects we have now moved to FOH were always marked, because they were referenced by some special array objects in the VM. Now these objects are not handled by any special internal arrays/handlers, and are instead saved to some unmanaged data structures, e.g. the string literal hash table or
Oh, this is a segment-only problem? Frozen segments are always out of range for regions?
I might be wrong, but yes, the problem only reproduces with segments; the gcstress crash we had and the WinForms issue are both on x86, where regions are not enabled.
Just in case it is a segment-only issue, we can potentially optimize all of this away when we are running under regions, which is going to be the default anyway. Setting the mark bit or not, having the
let's have a quick meeting to discuss this and get to a final solution instead of trying various things.
I think the original intent and promise for frozen segments is that these objects are NOT managed by the GC at all; we just did minimal work to make the runtime happy with that (e.g. the write barrier won't crash, we don't accidentally leave mark bits, ...). It would be unfortunate if we broke that intent and spent time proportional to the number of frozen objects per GC. In the abstract, the bug is that the GC is inadvertently freeing the "associated resources" of a frozen object if it is "in range". A natural way of fixing this would be to prevent that from happening.
The problem with fixing this directly is that it might be expensive to check whether the "associated resource" belongs to a frozen object or not. This issue could be side-stepped if the "associated resource" were labeled as such. With some thought, it is not hard to see that the labeling must be done by the VM. (It would be meaningless if the GC needed to run the expensive check during GC to determine the label.) If we were changing the VM anyway, we would also have the opportunity to segregate those resources so that they are not presented to the GC at all. Suppose that when the sync-block was created, we knew it belonged to a frozen object, and so we stored it somewhere else; then the GC wouldn't be looping through them, truly achieving zero GC cost for frozen objects.
Weak references and dependent handles are particularly easy to handle if we were to change the VM. We can simply replace them with a strong handle underneath when we detect the target is a frozen object, and we are done. A dependent handle depending on a frozen object is just really dumb and costly.
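A sketch of that substitution idea, with invented handle types and a stubbed IsOnFrozenSegment predicate (a real implementation would consult the frozen segment list): when the target is frozen, hand back a strong handle, which behaves identically because the target can never die, and which the GC never has to scan and clear.

#include <cstdio>

enum class HandleKind { Weak, Strong };
struct Handle { void* target; HandleKind kind; };

// Hypothetical predicate, stubbed so the sketch is self-contained.
bool IsOnFrozenSegment(void* obj) { (void)obj; return true; }

Handle CreateWeakHandleSketch(void* obj) {
    if (IsOnFrozenSegment(obj))
        return Handle{obj, HandleKind::Strong};  // immortal target: strong is free
    return Handle{obj, HandleKind::Weak};
}

int main() {
    int pretendFrozen = 0;    // pretend this lives on a frozen segment
    Handle h = CreateWeakHandleSketch(&pretendFrozen);
    printf("strong=%d\n", (int)(h.kind == HandleKind::Strong));
    return 0;
}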
That intent is broken already (see the existing
It is still valuable to free the non-precious syncblocks, even when they get attached to a frozen object. I do not see a problem here.
Well, we would need to introduce a new type of handle to implement your idea. The new type of handle would be somewhat expensive too. It would be worth it only if dependent handles pointing to frozen objects were very common, and I am pretty sure that is not the case. I believe that having a few dependent handles pointing to frozen objects is the best among the available options.
So presumably the question is where to fix the problem: on the GC side or on the VM (two VMs) side.
Do I understand you correctly that you want to leave everything as is (as of main) and fix the problem on the VM side, e.g.:
At least this is how I understand the desire to collect unused SyncBlocks for frozen objects. PS: SyncBlocks should be rare beasts for frozen objects, right? I mean, the general pattern we promote is to have a
The change you have right now should allow the non-precious syncblocks to be collected. I like the change you have in the PR right now; I think it is the best path to fix this issue.
Right. SyncBlocks are rare in general. The non-precious syncblocks handle the case where you have a little bit of contention on a long-lived instance: a syncblock gets created and then never used again. We do not want to be stuck with this syncblock forever; we want to be able to free it while the long-lived instance is still alive.
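A sketch of the precious/non-precious distinction described here, with invented field names: only a syncblock carrying state that cannot be recomputed needs to survive, while a monitor-only syncblock can be reclaimed even though its owner object lives on.

#include <cstdio>

struct SyncBlockSketch {
    bool hasHashCode;     // identity hash stored in the syncblock
    bool hasInteropInfo;  // e.g. COM interop data
    int  monitorHeld;     // current lock recursion count
};

// "Precious" state cannot be recomputed, so such syncblocks must be kept.
bool IsPrecious(const SyncBlockSketch& sb) {
    return sb.hasHashCode || sb.hasInteropInfo;
}

// Reclaimable even while the owner object is still alive.
bool CanReclaim(const SyncBlockSketch& sb) {
    return !IsPrecious(sb) && sb.monitorHeld == 0;
}

int main() {
    SyncBlockSketch sb{false, false, 0};   // created under brief contention
    printf("reclaimable=%d\n", (int)CanReclaim(sb));   // prints 1
    return 0;
}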
+ moved where we are calling mark_ro_segments to the right places - for BGC this needs to be in the 2nd non concurrent phase; for blocking GCs I just moved it to a more appropriate place
+ for full blocking GCs it's not enough to do mark_ro_segments only when the gen start seg is not ephemeral. when we are doing a gen2 GC, even if we have acquired a new seg for that heap, we still need to mark all in range ro segs, because our logic in IsPromoted expects that when there are in range ro segs, the mark bit tells us these ro objects are marked.
+ got rid of some unnecessary checks

Note that I do make the assumption that ro segs are always threaded at the beginning of gen2. We actually always thread these into heap 0's gen2 seg list, but I can see a chance of that changing if we want a better balancing of the mark_ro_segments work.
Thanks for the pushed changes! Looks like there are two test failures I am able to reproduce locally - a frozen string is not marked when ro_segments_in_range is TRUE. What is weird is that this code is executed right after the "mark all ro segments" logic -
…isabled via env.var but we'll get rid of that soon)
@@ -11009,7 +11009,7 @@ void gc_heap::seg_set_mark_array_bits_soh (heap_segment* seg)
     if (bgc_mark_array_range (seg, FALSE, &range_beg, &range_end))
     {
         size_t beg_word = mark_word_of (align_on_mark_word (range_beg));
-        size_t end_word = mark_word_of (align_on_mark_word (range_beg));
+        size_t end_word = mark_word_of (align_on_mark_word (range_end));

(The removed line computed end_word from range_beg, a copy-paste bug that made the marked word range empty.)
You pasted that commit message from another PR, right? :-)
@Maoni0 @cshung the tests seem to be passing now (except
/azp run runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 2 pipeline(s).
The outerloop failures are #76511; gcstress is pretty clean except for one build error/timeout. Reverted the test change that disabled GC regions.
I've slightly changed what I wrote in the commit msgs I pushed to reflect the final state. please use this when you merge -
+ we need to proactively go mark all the in range ro segs because these objects' lifetime isn't accurately
expressed. The expectation is that all objects on ro segs are reported as marked.
+ fixed an existing bug - for full blocking GCs it's not enough to handle ro segs when gen start seg is
not ephemeral, it needs to be checking for condemned gen because you can still have ro segs in range
even with one segment per heap.
+ got rid of some unnecessary checks.
+ I do make the assumption that ro segs are always threaded at
the beginning of gen2 so we can exit as soon as we see an rw seg. we actually always thread these into
heap 0's gen2 seg list but I can see a chance of that changing, if we want a better balancing of work
done for ro segs.
+ a bit of code refactoring since now we have more code related to ro segs.
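A runnable sketch of the walk those commit messages describe, with invented data structures (the real logic lives in gc.cpp): starting from the gen2 segment list, mark everything on each in-range ro segment and stop at the first rw segment, relying on the stated assumption that ro segs are threaded at the front.

#include <cstdio>
#include <vector>

struct SegSketch {
    bool readOnly;
    std::vector<bool> markBits;   // stands in for the mark array
    SegSketch* next;
};

bool InCondemnedRange(const SegSketch&) { return true; }   // assume in range

void mark_ro_segments_sketch(SegSketch* gen2Head) {
    for (SegSketch* seg = gen2Head; seg != nullptr; seg = seg->next) {
        if (!seg->readOnly)
            break;       // ro segs are threaded at the front, so we can stop
        if (!InCondemnedRange(*seg))
            continue;    // out-of-range segs are never consulted anyway
        // Frozen object lifetime is not tracked: report everything as marked.
        for (size_t i = 0; i < seg->markBits.size(); i++)
            seg->markBits[i] = true;
    }
}

int main() {
    SegSketch rw{false, std::vector<bool>(4, false), nullptr};
    SegSketch ro{true,  std::vector<bool>(4, false), &rw};
    mark_ro_segments_sketch(&ro);   // marks all of ro, stops at rw
    printf("ro[0]=%d rw[0]=%d\n", (int)ro.markBits[0], (int)rw.markBits[0]);
    return 0;
}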
…en objects" (#76649) This PR reverts #76235 with a few manual modifications: The main issue why the initial PRs (#75573 and #76135) were reverted has just been resolved via #76251: GC could collect some associated (with frozen objects) objects as unreachable, e.g. it could collect a SyncBlock, WeakReferences and Dependent handles associated with frozen objects which could (e.g. for a short period of time) be indeed unreachable but return back to life after. Co-authored-by: Jan Kotas <[email protected]> Co-authored-by: Jakob Botsch Nielsen <[email protected]>
Fixes #76219
The problem is: when the GC scans SyncBlocks in GCWeakPtrScanElement and meets a SyncBlock holding a weak pointer to a frozen object (RuntimeType in my case), it invokes this handler, which is effectively CheckPromoted, where that weak ref is recognized as unreachable because it doesn't pass if (!g_theGCHeap->IsPromoted(*pRef)), so the syncblock is cleared/removed/reused.

cc @jkotas @cshung @Maoni0
Easily reproduces with this small snippet on Windows x86 Checked - #76219 (comment)
(not on the latest main, since frozen type objects were just reverted in #76235 because of this issue)
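To make the failing path concrete, here is a condensed, self-contained sketch of it (simplified signatures, not the actual VM code): the weak scan nulls a syncblock's weak pointer whenever IsPromoted says no, which, before this fix, it did for any unmarked frozen object on an in-range segment.

#include <cstdio>

struct Object {};

// Stand-in for g_theGCHeap->IsPromoted. Before this fix, a frozen object on
// an in-range ro segment with a clear mark bit would answer false here.
bool IsPromotedSketch(Object* o) { (void)o; return false; }

// Condensed shape of CheckPromoted as described above.
void CheckPromotedSketch(Object** pRef) {
    if (!IsPromotedSketch(*pRef))
        *pRef = nullptr;   // syncblock entry cleared/recycled: the bug
}

int main() {
    Object frozen;                // e.g. a RuntimeType on a frozen segment
    Object* weakRef = &frozen;    // the weak pointer held by the syncblock
    CheckPromotedSketch(&weakRef);
    printf("cleared=%d\n", (int)(weakRef == nullptr));   // 1 before the fix
    return 0;
}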