Reliability issue on ARM64 Stage1 #86929

MichalStrehovsky · 2023-05-30T23:50:35Z

Stage1 Ampere Linux dashboard is showing drops in RPS with NativeAOT on ARM64. This seems to be exacerbated when enabling speed optimizations:

Similarly, socket errors:

Cc @VSadov

ghost · 2023-05-30T23:50:44Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	MichalStrehovsky
Assignees:	-
Labels:	`area-NativeAOT-coreclr`
Milestone:	8.0.0

EgorBo · 2023-05-31T13:11:16Z

We usually wait for more data points before drawing conclusions. In this case it's just a single data point, and it could be an infra issue (that happens from time to time).

MichalStrehovsky · 2023-05-31T21:09:23Z

> We usually wait for more data points before drawing conclusions. In this case it's just a single data point, and it could be an infra issue (that happens from time to time).

I see at least 3 data points - 1 for blended mode and 2 for speedopt. For comparison, we never saw this outside of NativeAOT.

EgorBo · 2023-05-31T21:12:50Z

> > We usually wait for more data points before drawing conclusions. In this case it's just a single data point, and it could be an infra issue (that happens from time to time).
>
> I see at least 3 data points - 1 for blended mode and 2 for speedopt. For comparison, we never saw this outside of NativeAOT.

I mean the graph for speedopt only; it has just 1 data point in the lowest state:

At this point it seems to me more like an infra issue, with the job scheduled to run only once a week, than a codegen issue.

MichalStrehovsky · 2023-05-31T21:15:18Z

This is correlated with the Bad Responses + Socket error chart below. We're getting low RPS because there's a problem with the response.

I'm not saying it's a codegen issue - just that it seems to be exacerbated with speedopts.
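The correlation described above (low RPS points coinciding with bad responses and socket errors) could be checked mechanically on exported dashboard data. This is a minimal sketch, not how the dashboard works: the `Run` record, its field names, the 50% drop threshold, and all numbers are hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    # Hypothetical per-run record; field names are assumptions.
    label: str
    rps: float
    bad_responses: int
    socket_errors: int


def low_rps_runs(runs, drop_fraction=0.5):
    """Return runs whose RPS is below drop_fraction * median RPS."""
    baseline = median(r.rps for r in runs)
    return [r for r in runs if r.rps < drop_fraction * baseline]


def drops_explained_by_errors(runs, drop_fraction=0.5):
    """True if there is at least one low-RPS run and every such run
    also reported bad responses or socket errors."""
    drops = low_rps_runs(runs, drop_fraction)
    return bool(drops) and all(
        r.bad_responses + r.socket_errors > 0 for r in drops
    )


# Made-up sample data, not taken from the Stage1 dashboard.
runs = [
    Run("run-1", 1000.0, 0, 0),
    Run("run-2", 990.0, 0, 0),
    Run("run-3", 200.0, 15, 40),
    Run("run-4", 1010.0, 0, 0),
]
```

With this sample, `low_rps_runs(runs)` picks out only `run-3`, and `drops_explained_by_errors(runs)` is `True` because that run also has errors.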

EgorBo · 2023-05-31T21:19:54Z

I see. I only wanted to note that we usually give CI more time to produce more than one data point before we start investigating: the TE benchmarks are too volatile (plus rare infra failures) to compare only two data points (although the same is true for dotnet/performance microbenchmarks) 🙂 But it seems that at the current rate we need to wait a few weeks for that (or run locally to validate).
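The "wait for more than one data point" rule of thumb could be written down as a simple check. This is only a sketch under stated assumptions: the function name, the `min_points` and `tolerance` defaults, and the idea of comparing against the median of earlier runs are all hypothetical and not taken from any existing tooling.

```python
from statistics import median


def confirmed_regression(history, recent, min_points=2, tolerance=0.8):
    """Flag a regression only when at least min_points recent runs
    fall below tolerance * median(history).

    history: RPS values from earlier, presumed-healthy runs.
    recent:  RPS values from the runs under suspicion.
    """
    if not history:
        raise ValueError("history must not be empty")
    floor = tolerance * median(history)
    low = [x for x in recent if x < floor]
    return len(low) >= min_points
```

With these defaults a single low run is treated as possible noise or an infra failure, while two or more low runs are flagged for investigation.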

VSadov · 2023-06-01T00:25:21Z

If I try running this benchmark (via crank), every few times there are socket errors and sometimes the app just crashes (reported as "Connection refused").

VSadov · 2023-06-01T01:00:19Z

In the first chart, the first notch for regular Stage1AOT was when every platform had issues. Since then there has been only a single point where the regular run had issues.
The problem is clearly sensitive to `/p:OptimizationPreference=Speed`.

I guess it only tells us it is unlikely to be in the native runtime (including GC), since that is unaffected by OptimizationPreference.

I'll try running libraries tests with /p:OptimizationPreference=Speed to see if we get a local repro.

EgorBo · 2023-06-01T11:26:58Z

@VSadov can you please share the exact crank query? I want to validate it's not caused by recent GDV changes.

VSadov · 2023-06-01T15:19:29Z

I have run the libraries tests a few times with `/p:OptimizationPreference=Speed` locally, but saw no failures. That was on OSX-arm64.
It is possible that the failure is specific to Linux or Ampere, but I had hoped it was just arm64.
It is also possible that it is a bug in the app itself (or in some code that is not used much by the libraries) that comes up only when optimized for speed.

I guess we need to run the actual test on the actual machine to get the repro. Or obtain a crashdump from a lab run.

VSadov · 2023-06-01T15:22:47Z

> can you please share the exact crank query?

I have sent instructions. If GDV is your worry, it would be interesting to try running that with GDV disabled via some build setting, if that is possible. That could help determine whether this is GDV-specific.

I've already tried things like server/workstation GC, disabling concurrent GC, or using conservative stackwalking. None of that made any difference. It is likely something with managed code or with how it is compiled.

VSadov · 2023-06-01T15:48:18Z

If I do not pass `/p:OptimizationPreference=Speed` to the benchmark, it passes. I tried quite a few times.

agocke · 2023-06-23T19:29:09Z

Looking at the history of the benchmarks, I think the standard results are fine. The question is OptPref=Speed -- we may need more data to rule it out.

MichalStrehovsky · 2024-01-12T07:12:27Z

Haven't seen any failures since at least October.

MichalStrehovsky added the area-NativeAOT-coreclr label May 30, 2023

MichalStrehovsky added this to the 8.0.0 milestone May 30, 2023

agocke modified the milestones: 8.0.0, 9.0.0 Aug 9, 2023

MichalStrehovsky mentioned this issue Sep 11, 2023

ilc crash while compiling System.Text.Json.SourceGeneration.Roslyn tests #91885

Closed

MichalStrehovsky closed this as not planned Jan 12, 2024

github-actions bot locked and limited conversation to collaborators Feb 11, 2024
