Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider splitting up System.Text.Json tests to allow running them completely with wasm/AOT #87078

Closed
radical opened this issue Jun 2, 2023 · 24 comments
Labels
area-System.Text.Json test-enhancement Improvements of test source code
Milestone

Comments

@radical
Copy link
Member

radical commented Jun 2, 2023

Running System.Text.Json.Tests with wasm/AOT crashes V8, and chrome. Issue: #86164

[22:06:45] info: [STRT] System.Text.Json.Serialization.Tests.StreamTests_DeserializeAsyncEnumerable.DeserializeAsyncEnumerable_ReadSimpleObjectAsync(count: 1000, bufferSize: 1000)
[22:06:56] info: 
[22:06:56] info: <--- Last few GCs --->
[22:06:56] info: 
[22:06:56] info: [70:0x560a9068c280]    63266 ms: Mark-Compact (reduce) 2048.2 (2071.2) -> 2048.2 (2071.2) MB, 667.36 / 0.00 ms  (average mu = 0.728, current mu = 0.001) allocation failure; scavenge might not succeed
[22:06:56] info: 
[22:06:56] info: 
[22:06:56] info: <--- JS stacktrace --->
[22:06:56] info: 
[22:06:56] info: 
[22:06:56] info: #
[22:06:56] info: # Fatal JavaScript out of memory: Reached heap limit
[22:06:56] info: #
[22:06:56] info: 
[22:07:23] info: /home/helixbot/.jsvu/bin/v8: line 2:    70 Trace/breakpoint trap   (core dumped) "/home/helixbot/.jsvu/engines/v8/v8" --snapshot_blob="/home/helixbot/.jsvu/engines/v8/snapshot_blob.bin" "$@"
[22:07:23] info: Process v8 exited with 133

We run into limits while trying to run this test suite, and try to work around that in different ways. It would be extremely useful to possibly run the tests in parts. But I'm not sure how we could split this up. We could either split this into multiple projects if there is a good logical boundary for that. Or we could pick handful of big classes, and explicitly run them in a separate helix job.

cc @kg @pavelsavara for more details on the issues we are running into.
cc @eiriktsarpalis

@radical radical added area-System.Text.Json test-enhancement Improvements of test source code labels Jun 2, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 2, 2023
@ghost
Copy link

ghost commented Jun 2, 2023

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.

Issue Details

Running System.Text.Json.Tests with wasm/AOT crashes V8, and chrome. Issue: #86164

[22:06:45] info: [STRT] System.Text.Json.Serialization.Tests.StreamTests_DeserializeAsyncEnumerable.DeserializeAsyncEnumerable_ReadSimpleObjectAsync(count: 1000, bufferSize: 1000)
[22:06:56] info: 
[22:06:56] info: <--- Last few GCs --->
[22:06:56] info: 
[22:06:56] info: [70:0x560a9068c280]    63266 ms: Mark-Compact (reduce) 2048.2 (2071.2) -> 2048.2 (2071.2) MB, 667.36 / 0.00 ms  (average mu = 0.728, current mu = 0.001) allocation failure; scavenge might not succeed
[22:06:56] info: 
[22:06:56] info: 
[22:06:56] info: <--- JS stacktrace --->
[22:06:56] info: 
[22:06:56] info: 
[22:06:56] info: #
[22:06:56] info: # Fatal JavaScript out of memory: Reached heap limit
[22:06:56] info: #
[22:06:56] info: 
[22:07:23] info: /home/helixbot/.jsvu/bin/v8: line 2:    70 Trace/breakpoint trap   (core dumped) "/home/helixbot/.jsvu/engines/v8/v8" --snapshot_blob="/home/helixbot/.jsvu/engines/v8/snapshot_blob.bin" "$@"
[22:07:23] info: Process v8 exited with 133

We run into limits while trying to run this test suite, and try to work around that in different ways. It would be extremely useful to possibly run the tests in parts. But I'm not sure how we could split this up. We could either split this into multiple projects if there is a good logical boundary for that. Or we could pick handful of big classes, and explicitly run them in a separate helix job.

cc @kg @pavelsavara for more details on the issues we are running into.
cc @eiriktsarpalis

Author: radical
Assignees: -
Labels:

area-System.Text.Json, test-enhancement

Milestone: -

@kg
Copy link
Contributor

kg commented Jun 2, 2023

S.T.J is also very close to the limit in the wasm interpreter configuration. If you raise the Jiterpreter limit by a few more megabytes the test suite will also hit V8's heap size limit and crash. So I think we've been just barely staying below this 2GB threshold for a while.

@eiriktsarpalis
Copy link
Member

In order to understand the problem correctly, is this an issue of assembly size, number of tests, or specific tests having a very deep stack?

@MichalStrehovsky
Copy link
Member

@eiriktsarpalis this is also a problem for Native AOT. The Json tests have been failing in the runtime-extra-platforms runs ever since they got enabled:

artifacts/bin/coreclr/linux.arm64.Release/build/Microsoft.NETCore.Native.targets(271,5): error MSB3073: (NETCORE_ENGINEERING_TELEMETRY=Build) The command ""/__w/1/s/artifacts/bin/coreclr/linux.arm64.Release/x64/ilc/ilc" @"/__w/1/s/artifacts/obj/System.Text.Json.SourceGeneration.Roslyn4.4.Tests/Release/net8.0/native/System.Text.Json.SourceGeneration.Roslyn4.4.Tests.ilc.rsp"" exited with code 137

It's an OOM. The test does compile for me locally but takes forever, consumes gigabytes of memory, and compiles into a 150 MB executable with a 1.5 GB PDB. That's rather large.

There are three issues:

  1. Generic recursion. The compiler does detect this and cuts it off, but it still causes quite a bit of damage because the cutoff only kicks in for deep nesting to prevent compiling forever/until memory runs out. It also produces warnings about this but unfortunately we disable them for test builds so it's not surfaced:
    <SuppressAotAnalysisWarnings>true</SuppressAotAnalysisWarnings>
  2. There are tens of megabytes of JsonConverter<T> instances. Instances over value types are not shared and there's a lot of valuetypes being serialized. Some of them might not be intentional - for example ClassWithReadOnlyPropertyQueue says it's a class but it's in fact a struct - there are dozens of these ClassWith that are structs. Each unique JsonConverter<T> instance over a struct costs more than 100 kB in size in the test. The 90% of this 100+ kB is caused by:
  3. CreateCastingConverter generic method. This is a generic virtual method on a generic type. This patterns causes an N * M generic expansion in AOT. The test ends up calling the method with about 250 unique T's (we only count valuetypes). This means that each of the JsonConverter<T> instances in the test get extra 250 specializations. There's hundreds of JsonConverter instances in the app so this multiplies into unreasonable numbers (tens of thousands of CreateCastingConverter method bodies) very quickly.

Getting rid of the generic recursion would be probably the best first step. If we can turn some of the structs in the test into classes, that would help too. The generic virtual method is a huge problem because of the recursion but it will probably scale pretty poorly for large JSON models in general.

I have reasons to believe the recursion is still around handling of Queue<T> because the warnings look like this:

ILC: AOT analysis warning IL3054: System.Text.Json.Serialization.Metadata.JsonTypeInfo`1.<DeserializeAsyncEnumerable>d__4<Queue`1<Queue`1<Queue`1<Queue`1<SByte[]>>>>>: Generic expansion to 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1<Queue`1<Queue`1<Queue`1<Queue`1<Queue`1<SByte[]>>>>>>' was aborted due to generic recursion. An exception will be thrown at runtime if this codepath is ever reached. Generic recursion also negatively affects compilation speed and the size of the compilation output. It is advisable to remove the source of the generic recursion by restructuring the program around the source of recursion. The source of generic recursion might include: 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1', 'System.Text.Json.Serialization.JsonConverter`1', 'System.Text.Json.Serialization.JsonResumableConverter`1', 'System.Text.Json.Serialization.JsonDictionaryConverter`1', 'System.Text.Json.Serialization.JsonConverter.CreateCastingConverter<TTarget>()', 'System.Text.Json.Serialization.JsonConverterFactory.CreateCastingConverter<TTarget>()', 'System.Text.Json.Serialization.JsonConverter`1.CreateCastingConverter<TTarget>()', 'System.Text.Json.Serialization.Converters.CastingConverter`1', 'System.Text.Json.Serialization.Converters.JsonMetadataServicesConverter`1', 'System.Text.Json.Serialization.Metadata.DefaultJsonTypeInfoResolver.DeterminePropertyAccessors<T>(JsonPropertyInfo`1<T>,MemberInfo)', 'System.Text.Json.Serialization.Metadata.JsonPropertyInfo`1', 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1.<DeserializeAsync>d__1', 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1.<DeserializeAsyncEnumerable>d__4', 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1.<DeserializeAsObjectAsync>d__6', 'System.Text.Json.Serialization.Metadata.JsonTypeInfo`1.<SerializeAsync>d__10', 'System.Text.Json.Serialization.Metadata.JsonParameterInfo`1'

@MichalStrehovsky
Copy link
Member

If you want to see the extent of it, dotnet tool install sizoscope --global --framework net7.0-windows, run sizoscope and open this file in the GUI:

System.Text.Json.SourceGeneration.Roslyn4.4.Tests.zip

You'll see something like this:

image

@kg
Copy link
Contributor

kg commented Jun 7, 2023

In order to understand the problem correctly, is this an issue of assembly size, number of tests, or specific tests having a very deep stack?

AFAIK stack depth is not the problem here, for the wasm interp scenario the native heap as a whole is growing until it hits a limit. If I had to guess it's probably a mix of type info, generated interpreter code, jitted wasm code (for the interp+jiterp scenario), etc. I don't know offhand how to get measurements out of the mono runtime on this stuff but if you need measurements I can probably figure it out. In the V8 console scenario, the native heap has to approach 2GB of allocated space before this happens. I'm not sure why the wasm AOT version is running out of memory, to be honest - I would expect it to use less memory, but it's possible that loaded AOT binaries count against V8's heap size limit.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Jun 7, 2023
We're getting OOM-killed in the CI. See dotnet#87078.

I'm also changing how we do warning suppressions. We still need to suppress most of AOT warnings because xUnit/tests are not warning clean, but the warning about generic recursion is critical for the product and should fail the build.
MichalStrehovsky added a commit that referenced this issue Jun 7, 2023
We're getting OOM-killed in the CI. See #87078.

I'm also changing how we do warning suppressions. We still need to suppress most of AOT warnings because xUnit/tests are not warning clean, but the warning about generic recursion is critical for the product and should fail the build.
@eiriktsarpalis eiriktsarpalis removed the untriaged New issue has not been triaged by the area owner label Jun 7, 2023
@eiriktsarpalis eiriktsarpalis added this to the 8.0.0 milestone Jun 7, 2023
@eiriktsarpalis
Copy link
Member

There are tens of megabytes of JsonConverter instances. Instances over value types are not shared and there's a lot of valuetypes being serialized. Some of them might not be intentional - for example ClassWithReadOnlyPropertyQueue says it's a class but it's in fact a struct - there are dozens of these ClassWith that are structs.

I think that's unavoidable given the nature of testing serializers, effectively we tend to define one new type per test and many of these tests need to include coverage for structs (to make sure that each facet of the serializer correctly accounts for struct semantics). I think we could try cutting a few struct definitions here and there when not strictly necessary from a test coverage perspective but I wouldn't expect huge gains from that exercise.

CreateCastingConverter generic method. This is a generic virtual method on a generic type. This patterns causes an N * M generic expansion in AOT.

I recently merged #80755 that made changes to CreateCastingConverter out of similar size concerns, and it appears that this method no longer needs to defined on JsonConverter<T>. I'll follow-up with a PR.

@eiriktsarpalis
Copy link
Member

@MichalStrehovsky new sizes after #87211 got merged.

image

@eiriktsarpalis
Copy link
Member

With #87276 it looks like the total size has dropped to 60 MB. Should we try re-enabling the tests once merged?

@radical
Copy link
Member Author

radical commented Jun 8, 2023

With #87276 it looks like the total size has dropped to 60 MB. Should we try re-enabling the tests once merged?

I have opened #87256 which we can try once #87276 is merged.

@eiriktsarpalis
Copy link
Member

PR is merged, feel free to rebase your PR branch.

@radical
Copy link
Member Author

radical commented Jun 12, 2023

With #87276 it looks like the total size has dropped to 60 MB. Should we try re-enabling the tests once merged?

It still fails:

[19:22:37] info: [STRT] System.Text.Json.Serialization.Tests.EnumConverterTests+UInt16EnumDictionary.SerilizeDictionaryWhenCacheIsFull
[19:22:43] info: 
[19:22:43] info: <--- Last few GCs --->
[19:22:43] info: 
[19:22:43] info: [92:0x563b327ef2f0]    89765 ms: Mark-Compact 2034.6 (2069.8) -> 2034.6 (2069.8) MB, 672.20 / 0.00 ms  (average mu = 0.913, current mu = 0.006) allocation failure; scavenge might not succeed
[19:22:43] info: [92:0x563b327ef2f0]    90441 ms: Mark-Compact 2037.5 (2072.8) -> 2037.5 (2072.8) MB, 674.53 / 0.00 ms  (average mu = 0.827, current mu = 0.002) allocation failure; scavenge might not succeed
[19:22:43] info: 
[19:22:43] info: 
[19:22:43] info: <--- JS stacktrace --->
[19:22:43] info: 
[19:22:43] info: 
[19:22:43] info: #
[19:22:43] info: # Fatal JavaScript out of memory: Reached heap limit
[19:22:43] info: #
[19:22:43] info: 
[19:23:11] info: /home/helixbot/.jsvu/bin/v8: line 2:    92 Trace/breakpoint trap   (core dumped) "/home/helixbot/.jsvu/engines/v8/v8" --snapshot_blob="/home/helixbot/.jsvu/engines/v8/snapshot_blob.bin" "$@"

@pavelsavara
Copy link
Member

Adding pre-test yield to few test classes will allow JS engine to run timer loops. It may give our GC chance to run.

xunit syntax is

        public async Task InitializeAsync()
        {
            await Task.Yield();
        }

But v8 timer loop is strange, still worth trying, I think.

@lewing
Copy link
Member

lewing commented Jul 12, 2023

cc @BrzVlad @vargaz

@vargaz
Copy link
Contributor

vargaz commented Jul 13, 2023

Did some experiments:
Running with -s MAXIMUM_MEMORY=1073741824 (1GB) succeeds.
Running with -s -s MAXIMUM_MEMORY=536870912 (512mb) fails in a controller manner:

  �[40m�[32minfo�[39m�[22m�[49m: [STRT] System.Text.Json.Serialization.Tests.StreamTests_AsyncWithSmallBuffer.VeryLargeJsonFileTest(payloadSize: 1000, ignoreNull: False, writeIndented: False)
  �[40m�[32minfo�[39m�[22m�[49m: console.debug: [MONO] GC_MINOR: (Nursery full) time 3.62ms, stw 3.62ms promoted 3302K major size: 41984K in use: 33033K los size: 62016K in use: 51968K
  �[40m�[32minfo�[39m�[22m�[49m: console.debug: [MONO] GC_MINOR: (Nursery full) time 3.92ms, stw 3.92ms promoted 3710K major size: 44608K in use: 36825K los size: 62016K in use: 51968K
  �[40m�[32minfo�[39m�[22m�[49m: console.debug: [MONO] GC_MINOR: (Nursery full) time 3.67ms, stw 3.67ms promoted 3709K major size: 48128K in use: 40615K los size: 62016K in use: 51968K
  �[41m�[30mfail�[39m�[22m�[49m: [FAIL] System.Text.Json.Serialization.Tests.StreamTests_AsyncWithSmallBuffer.VeryLargeJsonFileTest(payloadSize: 1000, ignoreNull: False, writeIndented: False)
  �[40m�[32minfo�[39m�[22m�[49m: System.OutOfMemoryException : Out of memory
  �[40m�[32minfo�[39m�[22m�[49m:    at System.Text.Json.JsonReaderHelper.TranscodeHelper(ReadOnlySpan`1 )
  �[40m�[32minfo�[39m�[22m�[49m:    at System.Text.Json.JsonSerializer.WriteString[List`1](List`1& , JsonTypeInfo`1 )
  �[40m�[32minfo�[39m�[22m�[49m:    at System.Text.Json.Serialization.Tests.StreamTests.VeryLargeJsonFileTest(Int32 payloadSize, Boolean ignoreNull, Boolean writeIndented)
  �[40m�[32minfo�[39m�[22m�[49m:    at System.Text.Json.JsonReaderHelper.TranscodeHelper(ReadOnlySpan`1 )
  �[40m�[32minfo�[39m�[22m�[49m:    at System.Text.Json.JsonSerializer.WriteString[List`1](List`1& , JsonTypeInfo`1 )

So it looks like the linear memory usage is somewhere between 512mb and 1gb. Not sure what causes v8 to consume 2gb of memory.

@vargaz
Copy link
Contributor

vargaz commented Jul 13, 2023

I can reproduce this when running with v8 --max-old-space-size=2024.

@lewing
Copy link
Member

lewing commented Jul 13, 2023

@pavelsavara @kg are there js objects we could be leaking/retaining?

@kg
Copy link
Contributor

kg commented Jul 13, 2023

Could this be heap growth? i.e. growing from 512mb to 1gb requires being able to allocate contiguously 512mb + 1gb in the v8 perm gen, and it's not big enough

@pavelsavara
Copy link
Member

@pavelsavara @kg are there js objects we could be leaking/retaining?

That unit test doesn't do any JS interop. The only one is that we hold Task/Promise pair for the main method.

I'm thinking about emscripten's posix emulation leaking.
If there are VFS files created as part of the test, they would consume JS heap.
What else ?

And I guess I should better understand how we allocate/free memory. I don't know what #ifdef in major_free_swept_blocks means yet.

major_free_swept_blocks (size_t section_reserve)
{
SGEN_ASSERT (0, sweep_state == SWEEP_STATE_SWEPT, "Sweeping must have finished before freeing blocks");
#if defined(HOST_WIN32) || defined(HOST_ORBIS) || defined (HOST_WASM)
/*
* sgen_free_os_memory () asserts in mono_vfree () because windows doesn't like freeing the middle of
* a VirtualAlloc ()-ed block.
*/
return;

@BrzVlad
Copy link
Member

BrzVlad commented Jul 13, 2023

A major GC block has 16kb and when we allocate blocks we allocate them in chunks (of 32 blocks at a time). Each block will be in a free list and will be used by the GC when needed. After a collection, in major_free_swept_blocks we attempt to free blocks also in various chunk sizes, that is not identical to the original allocation. So we attempt to unmap in the middle of mapped memory, which I know doesn't work on Windows. Not sure how much benefit we get for this complication of avoiding to call into OS map/unmap.

@pavelsavara
Copy link
Member

To do more testing see #88825 (comment)

@eiriktsarpalis
Copy link
Member

It's unlikely we'll be able to address this in .NET 8, moving to future.

@pavelsavara
Copy link
Member

From #88825 (comment)

It would be useful to get some GC log for those offending test suites. Env: MONO_LOG_LEVEL=debug and MONO_LOG_MASK=gc

@lewing
Copy link
Member

lewing commented Feb 9, 2024

cc @vitek-karas @artl93

closing as resolved but the large trim unfriendly test assemblies are still very much a thing

@lewing lewing closed this as completed Feb 9, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.Json test-enhancement Improvements of test source code
Projects
None yet
Development

No branches or pull requests

8 participants