Taskcluster Chrome runs for commit 71fe0a6c3d (epochs/daily) failed #13989

Closed
foolip opened this issue Nov 8, 2018 · 14 comments

@foolip (Member) commented Nov 8, 2018

Commit 71fe0a6 was the previous epochs/daily commit, and Taskcluster did run for it:
https://tools.taskcluster.net/groups/FJTQgT1GQ1qVCsAwYXVKzg

However, it failed, so no Taskcluster results are available in wpt.fyi:
https://wpt.fyi/test-runs?sha=71fe0a6c3d

The Chrome Dev runs (found via API) also failed:
https://tools.taskcluster.net/groups/RoFpb0S_SB2-KBrk-6wGcg

In both cases it was testharness chunk 13 that failed, which seems suspicious.

Looking at https://github.com/web-platform-tests/wpt/commits/master, it looks like Taskcluster reliability has been very, very bad recently.

@Hexcles @jgraham @jugglinmike can any of you look into this?

@Hexcles, what sort of monitoring should we have in place so that we notice this immediately when it begins to happen? Quite likely some change that landed days ago should be reverted.

@jgraham (Contributor) commented Nov 9, 2018

This is the same as web-platform-tests/results-collection#625

Looking at the logs, it seems like we are seeing Chrome crash on startup (historically that's where I've seen the BadStatusLine error) and then not handling it properly. But there are also some worrying errors earlier in the log that look like the stash might be failing. It's not impossible that the environment is broken in some way (e.g. OOM), but we should certainly ensure that we handle the failure more gracefully rather than just stopping.
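
(Purely to illustrate the kind of "more graceful" handling meant here, a minimal sketch follows; it is not the actual wptrunner code, and `browser.run` / `browser.restart` are hypothetical names.)

```python
# Hypothetical sketch, not the actual wptrunner code: treat BadStatusLine
# (typically raised when the browser process dies mid-connection) as a
# recoverable crash instead of letting it abort the whole run.
try:
    from http.client import BadStatusLine  # Python 3
except ImportError:
    from httplib import BadStatusLine      # Python 2

def run_with_restart(browser, test, max_restarts=1):
    """Run one test, restarting the browser if it appears to have crashed."""
    for _ in range(max_restarts + 1):
        try:
            return browser.run(test)        # hypothetical API
        except BadStatusLine:
            # The connection to the browser returned an empty/invalid status
            # line, which usually means the process crashed; restart it.
            browser.restart()               # hypothetical API
    # Still failing after restarts: record a crash rather than stopping.
    return "CRASH"
```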

@jgraham (Contributor) commented Nov 9, 2018

https://taskcluster-artifacts.net/OFUJ8juCTeW4ZJ8xksovXw/0/public/logs/live_backing.log is a Firefox failure, which suspiciously occurs in the same CSP tests. It looks a lot like the server died in that case, which could be a bug in the server or, again, could be OOM or similar.

@jugglinmike (Contributor)

This appears to be a race condition in wptserve's "Stash" feature. I've filed a pull request to fix the bug and included more detail there.
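
(For context, a minimal sketch of the kind of race being described, assuming the fix amounts to serializing access to the shared stash; this is illustrative only and not the actual wptserve code or the filed fix.)

```python
# Illustrative only -- not the actual wptserve Stash implementation.
# Unsynchronized read-modify-write on a shared dict from concurrent request
# handlers can lose or duplicate entries; a lock closes that window.
import threading

class Stash(object):
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def take(self, key):
        # Atomically remove and return the value (None if absent), so two
        # concurrent takers cannot both observe the same entry.
        with self._lock:
            return self._data.pop(key, None)
```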

@foolip (Member, Author) commented Nov 13, 2018

@foolip (Member, Author) commented Nov 13, 2018

The master push shows the BadStatusLine failure, but for the other two, Chrome stable/beta are failing inside ./wpt run at a pip install step:
https://tools.taskcluster.net/groups/NbgtLtGzSK6koii6xOfZOQ/tasks/bgnqDOm5RYWE7-KS8TPoOg/runs/0/logs/public%2Flogs%2Flive.log#L510
https://tools.taskcluster.net/groups/dQZqm60lTvagxmKggDsRJw/tasks/fzXhRFaZQOiEt1A__ux50Q/runs/0/logs/public%2Flogs%2Flive.log#L533

It's the cryptography package that's failing to build, ultimately with "fatal error: openssl/opensslv.h: No such file or directory".

@jgraham, halp?

@lukebjerring (Contributor)

What's the status of this issue? It's labelled urgent.

@foolip (Member, Author) commented Nov 16, 2018

It's presumably still happening and causing missing results. @Hexcles, I'll assign this to you since you have the end-to-end knowledge of how Taskcluster runs end up on wpt.fyi and can verify when it works. @jgraham, will you be able to help investigate?

@mdittmer (Contributor)

Friendly ecosystem infra rotation nudge for @Hexcles

@Hexcles (Member) commented Nov 28, 2018

We are using the "m5.large" instance type on AWS, which has 8GB of memory. It is possible that under certain circumstances browsers use a significant portion of that. Considering the Python runner itself also needs a couple of GB (for the manifest), running into OOM is plausible. @jugglinmike also recently reported on IRC that the tasks might run out of memory during manifest generation (especially if somehow the code decides to regenerate the manifest from scratch). I don't quite understand why that would happen; manifest generation doesn't seem to need more than 8GB. Regardless, some logging would be very helpful for figuring out the root cause (#14290).
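
(As a rough illustration of the kind of logging meant here, a sketch using psutil follows; it is not the actual change in #14290, and `update_manifest` is a hypothetical placeholder.)

```python
# Sketch of memory logging around a suspected-OOM step (illustrative only;
# not the change that landed in #14290). Requires the psutil package.
import logging
import psutil

logger = logging.getLogger("memory")

def log_memory(label):
    # Record how much system memory is available at this point in the task,
    # so an OOM shows up clearly in the live log.
    mem = psutil.virtual_memory()
    logger.info("%s: %.1f GB available of %.1f GB total",
                label, mem.available / 1e9, mem.total / 1e9)

# Hypothetical usage around the manifest step:
# log_memory("before manifest update")
# update_manifest()
# log_memory("after manifest update")
```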

As for the breakage of the cryptography PyPI package noticed by @foolip: that was just an unfortunate, unrelated upstream breakage: https://github.com/pyca/cryptography/blob/master/CHANGELOG.rst#241---2018-11-11 (it was fixed within a day). We could consider freezing all dependencies if this happens often. WDYT, @jgraham?

@Hexcles (Member) commented Nov 28, 2018

Hmm, according to the log, we are using "m3.xlarge" which has 12GB of memory. https://taskcluster-artifacts.net/Fif-hWo-SH6FK3tR07dIcA/0/public/logs/live_backing.log

@Hexcles closed this as completed Nov 28, 2018

@Hexcles (Member) commented Nov 28, 2018

Sorry, clicked the wrong button...

@Hexcles reopened this Nov 28, 2018

@Hexcles (Member) commented Nov 29, 2018

I misunderstood Taskcluster's resource model. Apparently each worker type can assign a capacity N to a certain EC2 instance type, which allows N of those workloads to execute on the same instance.

I talked to @imbstack. It turns out we had "c3.xlarge" (8GB memory) with capacity = 2, which means on average (more on this below) each container gets 4GB. That's a real risk of OOM and could well explain why generating the manifest from scratch might run out of memory. Regarding "on average": Taskcluster doesn't limit each container's memory usage, so every container sees the total memory on the host and races the others for it. This also makes monitoring (e.g. #14290) less useful.

After talking to @imbstack, we adjusted the capacity setting to 1, so each container is guaranteed a minimum of 8GB of memory and there is no interference between containers. Hopefully this will help.
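
(The arithmetic behind the change, spelled out; the numbers are the ones quoted above.)

```python
# Illustrative arithmetic only, using the figures from the comment above.
instance_memory_gb = 8  # c3.xlarge, as quoted above

# Before: capacity = 2 -> two wpt containers share one instance, so each
# gets ~4 GB on average (and, with no per-container limit, they also race
# each other for the full 8 GB).
print(instance_memory_gb / 2.0)  # 4.0

# After: capacity = 1 -> each container has the whole 8 GB to itself.
print(instance_memory_gb / 1.0)  # 8.0
```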

@lukebjerring (Contributor)

Ping @Hexcles, any updates on this?

@Hexcles (Member) commented Dec 3, 2018

Let me actually close this issue. I'm pretty confident that at least OOM shouldn't happen anymore. Though, again, because of the original lack of logging, we don't know for sure whether the failures were caused by OOM. If OOM were to happen again, the logging I added should help us investigate.

Please feel free to reopen if anyone suspects this happens again.

P.S. Monitoring Taskcluster itself is tracked in #14210.
