Reliability issue on ARM64 Stage1 #86929

Closed
MichalStrehovsky opened this issue May 30, 2023 · 14 comments

Comments

@MichalStrehovsky
Member

Stage1 Ampere Linux dashboard is showing drops in RPS with NativeAOT on ARM64. This seems to be exacerbated when enabling speed optimizations:

image

Similarly, socket errors:

image

Cc @VSadov

@ghost

ghost commented May 30, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Author: MichalStrehovsky
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: 8.0.0

@EgorBo
Member

EgorBo commented May 31, 2023

We usually wait for more data points before drawing conclusions. In this case it's just a single data point, so it could be an infra issue (that happens from time to time).

@MichalStrehovsky
Member Author

> We usually wait for more data points before drawing conclusions. In this case it's just a single data point, so it could be an infra issue (that happens from time to time).

I see at least 3 data points - 1 for blended mode and 2 for speedopt. For comparison, we never saw this outside of NativeAOT.

@EgorBo
Member

EgorBo commented May 31, 2023

> > We usually wait for more data points before drawing conclusions. In this case it's just a single data point, so it could be an infra issue (that happens from time to time).
>
> I see at least 3 data points - 1 for blended mode and 2 for speedopt. For comparison, we never saw this outside of NativeAOT.

I mean the graph for speedopt only, it has just 1 data point in the lowest state:
image

At this point it seems to me that the run has some infra issues and is scheduled only once a week, rather than this being a codegen issue.

@MichalStrehovsky
Member Author

This is correlated with the Bad Responses + Socket error chart below. We're getting low RPS because there's a problem with the response.

I'm not saying it's a codegen issue - just that it seems to be exacerbated with speedopts.

@EgorBo
Member

EgorBo commented May 31, 2023

I see. I only wanted to note that we usually give CI more time to produce more than one data point before we start investigating: the TE benchmarks are too volatile (plus rare infra failures) to compare just two data points (although the same is true for dotnet/performance microbenchmarks) 🙂 But at the current rate of runs we would need to wait a few weeks for that (or run locally to validate).
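
The "wait for more than one data point" rule described above can be sketched roughly as follows. This is only an illustration of the idea, not how the dashboard or crank actually flags regressions. The function name `sustained_drop`, the 10% tolerance, the three-run window, and the RPS numbers are all hypothetical.

```python
from statistics import median


def sustained_drop(rps_history, recent=3, tolerance=0.10):
    """Return True only if every one of the last `recent` runs is more than
    `tolerance` below the median of the runs that came before them.

    A single low run is treated as noise (for example an infra hiccup).
    """
    if recent < 1 or len(rps_history) <= recent:
        return False  # not enough history to judge
    baseline = median(rps_history[:-recent])
    floor = baseline * (1.0 - tolerance)
    return all(value < floor for value in rps_history[-recent:])


# Hypothetical RPS numbers, not taken from the Stage1 dashboard.
one_bad_run = [1000, 1010, 990, 1005, 600]
three_bad_runs = [1000, 1010, 990, 1005, 600, 620, 610]

print(sustained_drop(one_bad_run))     # False: only the last run is low
print(sustained_drop(three_bad_runs))  # True: three consecutive low runs
```

With runs scheduled about once a week, a three-run window like this one would take a few weeks to fill, which is the trade-off mentioned above.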

@VSadov
Member

VSadov commented Jun 1, 2023

If I try running this benchmark (via crank), every few times there are socket errors and sometimes the app just crashes (reported as "Connection refused").

@VSadov
Member

VSadov commented Jun 1, 2023

In the first chart, the first notch for regular Stage1AOT was when every platform had issues. Since then there has been only a single point where the regular run had issues.
The problem is clearly sensitive to /p:OptimizationPreference=Speed.

I guess it only tells us it is unlikely to be in the native runtime (including GC), since that is unaffected by OptimizationPreference.

I'll try running libraries tests with /p:OptimizationPreference=Speed to see if we get a local repro.

@EgorBo
Member

EgorBo commented Jun 1, 2023

@VSadov can you please share the exact crank query? I want to validate it's not caused by recent GDV changes.

@VSadov
Member

VSadov commented Jun 1, 2023

I have run the libraries tests a few times with /p:OptimizationPreference=Speed locally, but saw no failures. That was on OSX-arm64.
It is possible that the failure is specific to Linux or Ampere, but I hoped it was just arm64.
It is also possible that it is a bug in the app itself (or in some code that is not used much by the libraries) that comes up only when optimized for speed.

I guess we need to run the actual test on the actual machine to get the repro. Or obtain a crashdump from a lab run.

@VSadov
Member

VSadov commented Jun 1, 2023

> can you please share the exact crank query?

I have sent instructions. If GDV is your worry, it would be interesting to try running that with GDV disabled via some build setting, if that is possible. It could help determine whether this is GDV-specific.

I've already tried things like server/workstation GC, disabling concurrent GC, and using conservative stackwalking. None of that made any difference. It is likely something with managed code or with how it is compiled.

@VSadov
Member

VSadov commented Jun 1, 2023

If I do not pass /p:OptimizationPreference=Speed to the benchmark, it passes. I tried quite a few times.

@agocke
Member

agocke commented Jun 23, 2023

Looking at the history of the benchmarks, I think the standard results are fine. The question is OptPref=Speed -- we may need more data to rule it out.

@MichalStrehovsky
Member Author

Haven't seen any failures since at least October.

@MichalStrehovsky closed this as not planned on Jan 12, 2024
@github-actions bot locked and limited conversation to collaborators on Feb 11, 2024