WebDriver community stability jobs are frequently timing out #31499

Open
whimboo opened this issue Nov 4, 2021 · 5 comments

Comments

@whimboo
Contributor

whimboo commented Nov 4, 2021

This is a problem that we frequently hit when changes affect a large portion of the WebDriver tests, e.g. when changing fixtures or helpers. The stability jobs triggered by the sink job will then most likely fail because of the 120-minute timeout.

It would be great to get this problem investigated, because it's strange that the jobs take that long to run. When we run wdspec jobs in our own CI for Firefox, each of the 3 chunks takes approximately 20 minutes, which adds up to 60 minutes. In headless mode we even have only 2 chunks, which take around 15 minutes each.

Looking at the log of such a stability job for Firefox, I noticed the following:

  1. We do not run in headless mode. It might be good to change that to reduce the duration of the full job. If it cannot be done for Chrome, maybe we could change it for Firefox?

  2. For Firefox, a lot of timeouts are visible for tests related to test_no_top_browsing_context. These are failures that we do not see in our own CI, and they cause delays of 30s or 3 minutes for each instance of this test across the different WebDriver commands, depending on whether the timeout=long meta tag is present. Given the number of these tests, this adds up to around 22 minutes of extra time.
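
To make the arithmetic behind the second point concrete, here is a minimal sketch. The two timeout values are the ones stated above; the function name and the test counts in the usage line are hypothetical and are not taken from the log. They are chosen only to show how a handful of timing-out tests can reach roughly 22 minutes.

```python
# Timeout values as described above: 30 s by default, 3 minutes when a
# test carries the timeout=long meta tag.
DEFAULT_TIMEOUT_S = 30
LONG_TIMEOUT_S = 180


def timeout_overhead_minutes(default_count: int, long_count: int) -> float:
    """Return the minutes lost if every counted test runs into its timeout."""
    total_seconds = default_count * DEFAULT_TIMEOUT_S + long_count * LONG_TIMEOUT_S
    return total_seconds / 60


# Hypothetical counts, NOT read from the actual job log.
print(timeout_overhead_minutes(default_count=20, long_count=4))  # 22.0
```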

For now I would like to get started with the above two issues. CC'ing @jgraham, @juliandescottes, @foolip for their ideas and feedback.

@whimboo
Contributor Author

whimboo commented Nov 4, 2021

I have PR #31499 open to run some jobs to investigate the issues.

@juliandescottes
Contributor

> I have PR #31499 open to run some jobs to investigate the issues.

I guess this should be PR #31501

@jgraham
Contributor

jgraham commented Nov 15, 2021

Fixing the slow-running tests here is a good idea for general perf/resource usage reasons. But the stability jobs run each test 10x in total, so with a 120-minute timeout, if we can't run all the tests once in under 12 minutes we aren't going to see the jobs complete. In practice, the fact that changes to conftest.py end up running all webdriver tests means that the jobs are always going to time out. There are two plausible solutions here:

  1. Select fewer tests to run, e.g. if a fixture is changed, try to work out which tests actually use that fixture. I think pytest --setup-plan might help here.
  2. Time-limit stability checking rather than blindly continuing once we're certain to time out; cf. "When many tests are affected, CI stability jobs will time out" (#7660).

In practice I think we need to do both to fix this issue.
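
The second option could look roughly like the sketch below: compute the per-pass budget from the 120-minute timeout and the 10 runs mentioned in this issue, and stop repeating once another pass is not expected to fit. The names `per_pass_budget_s` and `run_time_limited` are hypothetical; this is not how wptrunner's stability checking is currently implemented.

```python
import time
from typing import Callable

JOB_TIMEOUT_S = 120 * 60  # the 120-minute job timeout mentioned in this issue
REPEATS = 10              # each test is run 10x in total


def per_pass_budget_s(job_timeout_s: float = JOB_TIMEOUT_S, repeats: int = REPEATS) -> float:
    """Time available for a single pass over all selected tests."""
    return job_timeout_s / repeats


def run_time_limited(
    run_pass: Callable[[], None],
    repeats: int,
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run up to `repeats` passes and return how many completed.

    Stops early when the longest pass seen so far would no longer fit
    into the remaining budget, instead of running into the hard timeout.
    """
    start = clock()
    completed = 0
    longest = 0.0
    while completed < repeats:
        elapsed = clock() - start
        if elapsed + longest > budget_s:
            break
        pass_start = clock()
        run_pass()
        longest = max(longest, clock() - pass_start)
        completed += 1
    return completed


print(per_pass_budget_s() / 60)  # 12.0 minutes per pass
```

A caller would then report the result as "stable over N of 10 runs" rather than failing the whole job on a timeout; how to weigh a partial result is a separate design question.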

@whimboo
Contributor Author

whimboo commented Jan 10, 2022

> 2. For Firefox, a lot of timeouts are visible for tests related to `test_no_top_browsing_context`. These are failures that we do not see in our own CI, and they cause delays of 30s or 3 minutes for each instance of this test across the different WebDriver commands, [depending on whether the `timeout=long` meta tag is present](https://searchfox.org/mozilla-central/rev/88cd13997fb0747cdcd78638fc762ff2d75e1fc5/testing/web-platform/tests/tools/wptrunner/wptrunner/wpttest.py#642-648). Given the number of these tests, this adds up to around 22 minutes of extra time.

This works fine now with a recent Firefox Nightly build. The calls take under 1s now:

https://community-tc.services.mozilla.com/tasks/JpQbO5NURVmeN4LWVO96RQ/runs/0/logs/public/logs/live.log#L1264842-1264845

> 1. We do not run in headless mode. It might be good to change that to reduce the duration of the full job. If it cannot be done for Chrome, maybe we could change it for Firefox?

@jgraham, I would propose that we run the stability jobs in headless mode if possible. Should I check whether that works fine across all browsers, or is it maybe already used for other wpt CI jobs?

@jgraham
Contributor

jgraham commented Jan 11, 2022

I think for all CI jobs we're currently not using headless. I'm happy to use headless for Firefox stability jobs in particular, if you think that improves performance (for non-stability jobs correctness is more important, for stability jobs the tradeoff is more delicate).

Currently we hardcode `--no-headless` in https://github.com/web-platform-tests/wpt/blob/master/tools/ci/taskcluster-run.py#L81 but we could move it into the task definitions at https://github.com/web-platform-tests/wpt/blob/master/tools/ci/tc/tasks/test.yml (although that might be tricky), or we could just update the logic to not pass in `--no-headless` in the firefox + `--verify` case.
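
The last variant could be sketched as below. The helper name `headless_args` is hypothetical and does not exist in taskcluster-run.py; the sketch only shows the condition, and whether omitting `--no-headless` is enough to get a headless Firefox depends on the runner's default, which is not checked here.

```python
from typing import List


def headless_args(product: str, wpt_args: List[str]) -> List[str]:
    """Return the headless-related flags to append for one CI job.

    Hypothetical helper: keeps --no-headless everywhere except for
    Firefox stability (--verify) runs, where the flag is left out.
    """
    is_stability_run = "--verify" in wpt_args
    if product == "firefox" and is_stability_run:
        return []
    return ["--no-headless"]


print(headless_args("firefox", ["--verify"]))  # []
print(headless_args("chrome", ["--verify"]))   # ['--no-headless']
```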
