
Frequent sporadic CI failures due to agents not responding #1617

Closed
StephanTLavavej opened this issue Feb 3, 2021 · 2 comments
Labels
infrastructure (Related to repository automation) · not reproducible (We can’t reproduce the described behavior)

Comments

@StephanTLavavej
Member

For a long time, CI runs have been sporadically failing when agents stop responding. (It usually goes away after a rerun, but sometimes it takes several tries.) The error messages are of the form:

##[error]We stopped hearing from agent BUILD000382. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

We aren't absolutely sure about the root cause, but we suspect that it's CPU starvation during our intensely multithreaded test runs. We've tried to mitigate this by using only 30 out of 32 cores:

$testParallelism = $env:NUMBER_OF_PROCESSORS - 2
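
For illustration, a minimal Python sketch of the same idea, assuming the computed value is handed to lit via its standard -j/--workers option (the actual pipeline invocation differs and is driven from PowerShell):

import os
import subprocess

# Leave two cores free for the agent process itself.
test_parallelism = max(os.cpu_count() - 2, 1)

# Hypothetical lit invocation; the real test-suite path and arguments differ.
subprocess.run(
    ["python", "llvm-project/llvm/utils/lit/lit.py", "-j", str(test_parallelism), "tests"],
    check=True,
)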

However, one of our tests builds with /MP, so it'll consume all available cores (I haven't checked whether agents that stop responding are highly correlated with machines that run a configuration of this test):

exportHeaderOptions = ['/exportHeader', '/Fo', '/MP']
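
If that over-subscription turns out to matter, MSVC's /MP also accepts an explicit process cap (/MPn), so the option list could be built with a bound instead of the unbounded default. A hedged sketch (the limit is an assumption, not anything from our configuration):

# Hypothetical: cap the header-export build at a fixed number of cl.exe processes.
mp_limit = 8  # assumed value
exportHeaderOptions = ['/exportHeader', '/Fo', f'/MP{mp_limit}']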

It's unclear what we should do to fix this, but it's a recurring productivity drain. Some ideas:

  • Contact the Azure Pipelines team and investigate whether agents can run at High priority.
  • Investigate using our "jobify" machinery so we can run the compiler and test executables at Low priority (see the sketch below).
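
For the Low priority idea, here's a minimal illustrative sketch of launching a test process below normal priority on Windows with Python's subprocess creation flags; our actual "jobify" machinery is a separate mechanism, and the executable name is a placeholder:

import subprocess

# Launch a (hypothetical) test executable below normal priority so the
# Azure Pipelines agent process is less likely to be starved for CPU.
proc = subprocess.Popen(
    [r"tests\some_test.exe"],
    creationflags=subprocess.BELOW_NORMAL_PRIORITY_CLASS,  # Windows-only, Python 3.7+
)
proc.wait()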
@StephanTLavavej StephanTLavavej added the infrastructure Related to repository automation label Feb 3, 2021
@cbezault
Contributor

cbezault commented Mar 23, 2021

There is a LIT-specific mitigation that we could try.
LIT has a concept of parallelism groups, where at most n tests belonging to a given parallelism group are allowed to run at once.
We could define a "parallel test" parallelism group and assign the parallel tests to it. That would limit potential over-subscription of the VMs to roughly 2x the number of cores.
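
A rough sketch of what that could look like in a lit config, using lit's parallelism-group support (the group name and limit below are placeholders, not the STL's actual configuration):

# In lit.cfg (hypothetical group name and limit): allow at most 4 tests
# from the "parallel-tests" group to run simultaneously.
lit_config.parallelism_groups['parallel-tests'] = 4

# In the relevant lit.local.cfg: tests that are themselves heavily
# multithreaded join that group.
config.parallelism_group = 'parallel-tests'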

@StephanTLavavej StephanTLavavej added the help wanted Extra attention is needed label Sep 28, 2021
@StephanTLavavej StephanTLavavej added not reproducible We can’t reproduce the described behavior and removed help wanted Extra attention is needed labels Apr 16, 2022
@StephanTLavavej
Member Author

Since switching to Standard_D32ads_v5 in #2474 followed by Windows Server 2022 in #2496, the sporadic CI failures have become exceptionally rare. We've had 2 classic stalled agents in the last ~3 months. ("Publish Tests" once took 10 minutes and timed out; that hasn't happened again.) I was even able to remove the N-2 parallelism hack in #2611.

While we still need to watch out for stalled agents and rerun the affected jobs, this is no longer common or urgent enough to deserve any significant amount of investigation. Closing for now.
