Taskcluster Chrome runs for commit 71fe0a6c3d (epochs/daily) failed #13989
Comments
This is the same as web-platform-tests/results-collection#625. Looking at the logs, it seems like we are seeing Chrome crash on startup (historically that's where I've seen the BadStatusLine error) and then we aren't handling it properly. But there are also some worrying errors earlier in the log that look like maybe the stash is failing or something. It's not impossible that the environment is broken in some way (e.g. OOM), but we should certainly ensure that we handle the failure more gracefully rather than just stopping.
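For what it's worth, here is a hypothetical sketch (plain Python, not the actual wptrunner code) of the kind of retry-on-crash handling "handle the failure more gracefully" suggests, assuming the BadStatusLine comes from the browser's debugging endpoint dying before it answers:

```python
# Hypothetical sketch, not the actual wptrunner code: retry the browser
# connection instead of letting a BadStatusLine from a crashed browser
# abort the entire run.
import time
from http.client import BadStatusLine


def connect_with_retries(connect, attempts=3, delay=5):
    """Call `connect()` (any callable that talks to the browser), retrying
    when the browser appears to have crashed before responding."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except BadStatusLine:
            # The browser (or its remote debugging endpoint) likely died
            # before sending a valid HTTP response; restart and retry
            # rather than failing the whole chunk.
            if attempt == attempts:
                raise
            time.sleep(delay)
```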
https://taskcluster-artifacts.net/OFUJ8juCTeW4ZJ8xksovXw/0/public/logs/live_backing.log is a Firefox failure, which suspiciously involves the same CSP tests. It looks a lot like the server died in that case, which could be a bug in the server or, again, could be OOM or similar.
This appears to be a race condition in wptserve's "Stash" feature. I've filed a pull request to fix the bug and included more detail there.
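For context, here is a minimal, hypothetical sketch (a plain dict plus threading.Lock, not the actual wptserve code) of the kind of check-then-act race a shared stash can hit, and the lock-based fix:

```python
# Hypothetical illustration of a check-then-act race on a shared key/value
# store; not the actual wptserve Stash implementation.
import threading


class RacyStash:
    """Unsynchronized: two threads can both see the key as absent and both write."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        if key in self._data:          # check ...
            raise KeyError("%r already taken" % (key,))
        self._data[key] = value        # ... then act, with no lock in between


class LockedStash:
    """The check and the write happen atomically under one lock."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            if key in self._data:
                raise KeyError("%r already taken" % (key,))
            self._data[key] = value

    def take(self, key):
        # Pop so that each value is handed out at most once.
        with self._lock:
            return self._data.pop(key, None)
```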
Results for commit fc1a5b7 are also missing. It was triggered 3 times:
On the master push, it's the cryptography package that's failing to build, ultimately with "fatal error: openssl/opensslv.h: No such file or directory". @jgraham, halp?
What's the status of this issue? It's labelled urgent.
Friendly ecosystem infra rotation nudge for @Hexcles
We are using the "m5.large" instance type on AWS, which has 8GB of memory. It is possible that under certain circumstances browsers may use a significant portion of that. Considering the Python runner itself also needs a couple of GB (for the manifest), running into OOM is plausible. @jugglinmike also recently reported on IRC that the tasks might run out of memory during manifest generation (especially if somehow the code decides to generate the manifest from scratch). I don't quite understand why that'd happen; manifest generation doesn't seem to need more than 8GB. Regardless, some logging would be very helpful for figuring out the root cause (#14290). And then regarding the breakage of the …
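As a sketch of the kind of logging meant here (reading /proc/meminfo on the Linux worker; hypothetical, not necessarily what #14290 landed):

```python
# Hypothetical helper for periodically logging memory headroom on a Linux
# worker; not necessarily what #14290 actually landed.
import logging
import time


def read_meminfo():
    """Return /proc/meminfo fields (in kB) as a dict, e.g. {"MemAvailable": 123456}."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])
    return info


def log_memory(interval=30):
    """Log total and available memory every `interval` seconds."""
    while True:
        info = read_meminfo()
        logging.info("memory: total=%d kB available=%d kB",
                     info.get("MemTotal", 0), info.get("MemAvailable", 0))
        time.sleep(interval)
```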
Hmm, according to the log, we are using "m3.xlarge", which has 12GB of memory: https://taskcluster-artifacts.net/Fif-hWo-SH6FK3tR07dIcA/0/public/logs/live_backing.log
Sorry, clicked the wrong button...
I misunderstood the resource model of Taskcluster. Apparently each worker type can assign a capacity N to a certain EC2 instance type, which allows N of that workload to execute on the same instance. I talked to @imbstack. It turns out we had "c3.xlarge" (8GB memory) with capacity = 2, which means on average (more on this later) each container gets 4GB. That's a real risk for OOM and could well explain why generating the manifest from scratch may run out of memory.

Now, regarding "on average": Taskcluster doesn't limit each container's memory usage, so each container sees the total amount of memory on the host, and containers race each other for that shared memory. This also makes monitoring (e.g. #14290) less useful.

After talking to @imbstack, we adjusted the capacity setting to 1 so that we guarantee a minimum of 8GB of available memory and remove the interference between containers. Hopefully this will help.
Ping @Hexcles, any updates on this?
Let me actually close this issue. I'm pretty confident that at least OOM shouldn't happen anymore. Though, again, because of the lack of logging originally, we don't know for sure whether the failures were because of OOM. If OOM were to happen again, the logging I added should help us investigate. Please feel free to reopen if anyone suspects this is happening again. P.S. Monitoring Taskcluster itself is tracked by #14210.
Commit 71fe0a6 was the previous epochs/daily commit, and Taskcluster did run for it:
https://tools.taskcluster.net/groups/FJTQgT1GQ1qVCsAwYXVKzg
However, it failed, so no Taskcluster results are available in wpt.fyi:
https://wpt.fyi/test-runs?sha=71fe0a6c3d
The Chrome Dev runs (found via API) also failed:
https://tools.taskcluster.net/groups/RoFpb0S_SB2-KBrk-6wGcg
In both cases it was testharness chunk 13 that failed, which seems suspicious.
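For reference, a hypothetical sketch of the kind of API lookup used to find these runs, assuming the Taskcluster Queue's listTaskGroup endpoint of that era and the task group IDs above:

```python
# Hypothetical sketch: list the tasks in a Taskcluster task group and report
# the ones that failed, assuming the Queue API's listTaskGroup endpoint of
# that era (https://queue.taskcluster.net/v1/task-group/<id>/list).
import requests


def failed_tasks(task_group_id):
    url = "https://queue.taskcluster.net/v1/task-group/%s/list" % task_group_id
    params = {}
    failed = []
    while True:
        data = requests.get(url, params=params).json()
        for entry in data.get("tasks", []):
            state = entry["status"]["state"]
            if state in ("failed", "exception"):
                failed.append((entry["task"]["metadata"]["name"], state))
        token = data.get("continuationToken")
        if not token:
            return failed
        params = {"continuationToken": token}


if __name__ == "__main__":
    # Task group IDs taken from the runs linked above.
    for group in ("FJTQgT1GQ1qVCsAwYXVKzg", "RoFpb0S_SB2-KBrk-6wGcg"):
        print(group, failed_tasks(group))
```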
Looking at https://github.com/web-platform-tests/wpt/commits/master, it looks like Taskcluster reliability has been very, very bad recently.
@Hexcles @jgraham @jugglinmike, can any of you look into this?
@Hexcles, what sort of monitoring should we have in place to notice this immediately when it begins to happen? Quite likely there's some change that was landed days ago that should be reverted.