
Frequent sporadic CI failures due to agents not responding #1617

Closed
StephanTLavavej opened this issue Feb 3, 2021 · 2 comments
Labels
infrastructure (Related to repository automation) · not reproducible (We can’t reproduce the described behavior)

Comments

@StephanTLavavej
Member

For a long time, CI runs have been sporadically failing when agents stop responding. (It usually goes away after a rerun, but sometimes it takes several tries.) The error messages are of the form:

##[error]We stopped hearing from agent BUILD000382. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

We aren't absolutely sure about the root cause, but we suspect that it's CPU starvation during our intensely multithreaded test runs. We've tried to mitigate this by using only 30 out of 32 cores:

$testParallelism = $env:NUMBER_OF_PROCESSORS - 2
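
For illustration, a minimal Python sketch of the same idea, assuming the computed value is handed to lit via its standard -j/--workers option (the actual pipeline invocation differs and is driven from PowerShell):

import os
import subprocess

# Leave two cores free for the agent process itself.
test_parallelism = max(os.cpu_count() - 2, 1)

# Hypothetical lit invocation; the real test-suite path and arguments differ.
subprocess.run(
    ["python", "llvm-project/llvm/utils/lit/lit.py", "-j", str(test_parallelism), "tests"],
    check=True,
)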

However, one of our tests builds with /MP, so it'll consume all available cores (I haven't checked whether agents that stop responding are highly correlated with machines that run a configuration of this test):

exportHeaderOptions = ['/exportHeader', '/Fo', '/MP']
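
If that over-subscription turns out to matter, MSVC's /MP also accepts an explicit process cap (/MPn), so the option list could be built with a bound instead of the unbounded default. A hedged sketch (the limit is an assumption, not anything from our configuration):

# Hypothetical: cap the header-export build at a fixed number of cl.exe processes.
mp_limit = 8  # assumed value
exportHeaderOptions = ['/exportHeader', '/Fo', f'/MP{mp_limit}']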

It's unclear what we should do to fix this, but it's a recurring productivity drain. Some ideas:

  • Contact the Azure Pipelines team and investigate whether agents can run at High priority.
  • Investigate using our "jobify" machinery so we can run the compiler and test executables at Low priority (see the sketch below).
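
For the Low priority idea, here's a minimal illustrative sketch of launching a test process below normal priority on Windows with Python's subprocess creation flags; our actual "jobify" machinery is a separate mechanism, and the executable name is a placeholder:

import subprocess

# Launch a (hypothetical) test executable below normal priority so the
# Azure Pipelines agent process is less likely to be starved for CPU.
proc = subprocess.Popen(
    [r"tests\some_test.exe"],
    creationflags=subprocess.BELOW_NORMAL_PRIORITY_CLASS,  # Windows-only, Python 3.7+
)
proc.wait()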
@StephanTLavavej StephanTLavavej added the infrastructure Related to repository automation label Feb 3, 2021
@cbezault
Contributor

cbezault commented Mar 23, 2021

There is a LIT-specific mitigation that we could try.
LIT has a concept of parallelism groups, where at most n tests belonging to a given parallelism group are allowed to run at once.
We could define a "parallel test" parallelism group and assign the parallel tests to it. That would limit potential over-subscription of the VMs to roughly 2x the number of cores.
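
A rough sketch of what that could look like in a lit config, using lit's parallelism-group support (the group name and limit below are placeholders, not the STL's actual configuration):

# In lit.cfg (hypothetical group name and limit): allow at most 4 tests
# from the "parallel-tests" group to run simultaneously.
lit_config.parallelism_groups['parallel-tests'] = 4

# In the relevant lit.local.cfg: tests that are themselves heavily
# multithreaded join that group.
config.parallelism_group = 'parallel-tests'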

@StephanTLavavej StephanTLavavej added the help wanted Extra attention is needed label Sep 28, 2021
@StephanTLavavej StephanTLavavej added not reproducible We can’t reproduce the described behavior and removed help wanted Extra attention is needed labels Apr 16, 2022
@StephanTLavavej
Member Author

Since switching to Standard_D32ads_v5 in #2474 followed by Windows Server 2022 in #2496, the sporadic CI failures have become exceptionally rare. We've had 2 classic stalled agents in the last ~3 months. ("Publish Tests" once took 10 minutes and timed out; that hasn't happened again.) I was even able to remove the N-2 parallelism hack in #2611.

While we still need to watch out for stalled agents and rerun the affected jobs, this is no longer common or urgent enough to deserve any significant amount of investigation. Closing for now.
