Frequent sporadic CI failures due to agents not responding #1617
Labels
infrastructure
Related to repository automation
not reproducible
We can’t reproduce the described behavior
For a long time, CI runs have been sporadically failing when agents stop responding. (It usually goes away after a rerun, but sometimes it takes several tries.) The error messages are of the form:
We aren't absolutely sure about the root cause, but we suspect that it's CPU starvation during our intensely multithreaded test runs. We've tried to mitigate this by using only 30 out of 32 cores:
STL/azure-devops/cmake-configure-build.yml
Line 27 in c7b1405
However, one of our tests builds with
/MP
, so it'll consume all available cores (I haven't checked whether agents that stop responding are highly correlated with machines that run a configuration of this test):STL/tests/std/tests/P1502R1_standard_library_header_units/custom_format.py
Line 104 in c7b1405
It's unclear what we should do to fix this, but it's a recurring productivity drain. Some ideas:
The text was updated successfully, but these errors were encountered: