Taskcluster Chrome runs for commit 71fe0a6c3d (epochs/daily) failed #13989

Closed
foolip opened this issue Nov 8, 2018 · 14 comments

@foolip (Member) commented Nov 8, 2018

Commit 71fe0a6 was the previous epochs/daily commit, and Taskcluster did run for it:
https://tools.taskcluster.net/groups/FJTQgT1GQ1qVCsAwYXVKzg

However, it failed, so no Taskcluster results are available in wpt.fyi:
https://wpt.fyi/test-runs?sha=71fe0a6c3d

The Chrome Dev runs (found via API) also failed:
https://tools.taskcluster.net/groups/RoFpb0S_SB2-KBrk-6wGcg

In both cases it was testharness chunk 13 that failed, which seems suspicious.

Looking at https://github.com/web-platform-tests/wpt/commits/master, it looks like Taskcluster reliability has been very, very bad recently.

@Hexcles @jgraham @jugglinmike can any of you look into this?

@Hexcles, what sort of monitoring should we have in place so that we notice this immediately when it begins to happen? Quite likely some change that landed days ago should be reverted.

@jgraham (Contributor) commented Nov 9, 2018

This is the same as web-platform-tests/results-collection#625

Looking at the logs, it seems like we are seeing Chrome crash on startup (historically that's where I've seen the BadStatusLine error) and then not handling it properly. But there are also some worrying errors earlier in the log that look like the stash might be failing. It's not impossible that the environment is broken in some way (e.g. OOM), but we should certainly ensure that we handle the failure more gracefully rather than just stopping.
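
(Purely to illustrate the kind of "more graceful" handling meant here, a minimal sketch follows; it is not the actual wptrunner code, and `browser.run` / `browser.restart` are hypothetical names.)

```python
# Hypothetical sketch, not the actual wptrunner code: treat BadStatusLine
# (typically raised when the browser process dies mid-connection) as a
# recoverable crash instead of letting it abort the whole run.
try:
    from http.client import BadStatusLine  # Python 3
except ImportError:
    from httplib import BadStatusLine      # Python 2

def run_with_restart(browser, test, max_restarts=1):
    """Run one test, restarting the browser if it appears to have crashed."""
    for _ in range(max_restarts + 1):
        try:
            return browser.run(test)        # hypothetical API
        except BadStatusLine:
            # The connection to the browser returned an empty/invalid status
            # line, which usually means the process crashed; restart it.
            browser.restart()               # hypothetical API
    # Still failing after restarts: record a crash rather than stopping.
    return "CRASH"
```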

@jgraham (Contributor) commented Nov 9, 2018

https://taskcluster-artifacts.net/OFUJ8juCTeW4ZJ8xksovXw/0/public/logs/live_backing.log is a Firefox failure, which suspiciously occurs in the same CSP tests. It looks a lot like the server died in that case, which could be a bug in the server or, again, could be OOM or similar.

@jugglinmike (Contributor)

This appears to be a race condition in wptserve's "Stash" feature. I've filed a pull request to fix the bug and included more detail there.
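
(For context, a minimal sketch of the kind of race being described, assuming the fix amounts to serializing access to the shared stash; this is illustrative only and not the actual wptserve code or the filed fix.)

```python
# Illustrative only -- not the actual wptserve Stash implementation.
# Unsynchronized read-modify-write on a shared dict from concurrent request
# handlers can lose or duplicate entries; a lock closes that window.
import threading

class Stash(object):
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def take(self, key):
        # Atomically remove and return the value (None if absent), so two
        # concurrent takers cannot both observe the same entry.
        with self._lock:
            return self._data.pop(key, None)
```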

@foolip (Member, Author) commented Nov 13, 2018

@foolip (Member, Author) commented Nov 13, 2018

The master push shows the BadStatusLine failure, but for the other two, Chrome stable/beta are failing inside ./wpt run at a pip install step:
https://tools.taskcluster.net/groups/NbgtLtGzSK6koii6xOfZOQ/tasks/bgnqDOm5RYWE7-KS8TPoOg/runs/0/logs/public%2Flogs%2Flive.log#L510
https://tools.taskcluster.net/groups/dQZqm60lTvagxmKggDsRJw/tasks/fzXhRFaZQOiEt1A__ux50Q/runs/0/logs/public%2Flogs%2Flive.log#L533

It's the cryptography package that's failing to build, ultimately with "fatal error: openssl/opensslv.h: No such file or directory".

@jgraham, halp?

@lukebjerring (Contributor)

What's the status of this issue? It's labelled urgent.

@foolip (Member, Author) commented Nov 16, 2018

It's presumably still happening and causing missing results. @Hexcles, I'll assign this to you since you have the end-to-end knowledge of how Taskcluster runs end up on wpt.fyi and can verify when it works. @jgraham, will you be able to help investigate?

@mdittmer (Contributor)

Friendly ecosystem infra rotation nudge for @Hexcles

@Hexcles (Member) commented Nov 28, 2018

We are using the "m5.large" instance type on AWS, which has 8GB of memory. It is possible that under certain circumstances browsers use a significant portion of that. Considering the Python runner itself also needs a couple of GB (for the manifest), running into OOM is plausible. @jugglinmike also recently reported on IRC that the tasks might run out of memory during manifest generation (especially if somehow the code decides to regenerate the manifest from scratch). I don't quite understand why that would happen; manifest generation doesn't seem to need more than 8GB. Regardless, some logging would be very helpful for figuring out the root cause (#14290).
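
(As a rough illustration of the kind of logging meant here, a sketch using psutil follows; it is not the actual change in #14290, and `update_manifest` is a hypothetical placeholder.)

```python
# Sketch of memory logging around a suspected-OOM step (illustrative only;
# not the change that landed in #14290). Requires the psutil package.
import logging
import psutil

logger = logging.getLogger("memory")

def log_memory(label):
    # Record how much system memory is available at this point in the task,
    # so an OOM shows up clearly in the live log.
    mem = psutil.virtual_memory()
    logger.info("%s: %.1f GB available of %.1f GB total",
                label, mem.available / 1e9, mem.total / 1e9)

# Hypothetical usage around the manifest step:
# log_memory("before manifest update")
# update_manifest()
# log_memory("after manifest update")
```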

As for the breakage of the cryptography PyPI package noticed by @foolip: that was just an unfortunate, unrelated upstream breakage: https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst#241---2018-11-11 (it was fixed within a day). We could consider freezing all dependencies if this happens often. WDYT, @jgraham?

@Hexcles (Member) commented Nov 28, 2018

Hmm, according to the log, we are using "m3.xlarge" which has 12GB of memory. https://taskcluster-artifacts.net/Fif-hWo-SH6FK3tR07dIcA/0/public/logs/live_backing.log

@Hexcles closed this as completed Nov 28, 2018

@Hexcles (Member) commented Nov 28, 2018

Sorry, clicked the wrong button...

@Hexcles reopened this Nov 28, 2018

@Hexcles (Member) commented Nov 29, 2018

I misunderstood Taskcluster's resource model. Apparently each worker type can assign a capacity N to a certain EC2 instance type, which allows N of those workloads to execute on the same instance.

I talked to @imbstack. It turns out we had "c3.xlarge" (8GB memory) with capacity = 2, which means on average (more on this below) each container gets 4GB. That's a real risk of OOM and could well explain why generating the manifest from scratch might run out of memory. Regarding "on average": Taskcluster doesn't limit each container's memory usage, so every container sees the total memory on the host and races the others for it. This also makes monitoring (e.g. #14290) less useful.

After talking to @imbstack, we adjusted the capacity setting to 1, so each container is guaranteed a minimum of 8GB of memory and there is no interference between containers. Hopefully this will help.
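
(The arithmetic behind the change, spelled out; the numbers are the ones quoted above.)

```python
# Illustrative arithmetic only, using the figures from the comment above.
instance_memory_gb = 8  # c3.xlarge, as quoted above

# Before: capacity = 2 -> two wpt containers share one instance, so each
# gets ~4 GB on average (and, with no per-container limit, they also race
# each other for the full 8 GB).
print(instance_memory_gb / 2.0)  # 4.0

# After: capacity = 1 -> each container has the whole 8 GB to itself.
print(instance_memory_gb / 1.0)  # 8.0
```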

@lukebjerring (Contributor)

Ping @Hexcles, any updates on this?

@Hexcles (Member) commented Dec 3, 2018

Let me actually close this issue. I'm pretty confident that at least OOM shouldn't happen anymore. Though, again, because of the original lack of logging, we don't know for sure whether the failures were caused by OOM. If OOM were to happen again, the logging I added should help us investigate.

Please feel free to reopen if anyone suspects this happens again.

P.S. Monitoring Taskcluster itself is tracked in #14210.
