Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing GitHub Actions font-test was marked as successful #17396

Closed
Snuffleupagus opened this issue Dec 9, 2023 · 16 comments · Fixed by #17477 or #17555
Closed

Failing GitHub Actions font-test was marked as successful #17396

Snuffleupagus opened this issue Dec 9, 2023 · 16 comments · Fixed by #17477 or #17555
Assignees
Labels

Comments

@Snuffleupagus
Copy link
Collaborator

In a recent PR the font-tests failed completely on Windows, however the GitHub Actions workflow didn't actually catch the failure which means that the test was marked as successful: https://github.com/mozilla/pdf.js/actions/runs/7150415038/job/19473662790?pr=17395#step:8:20

That seems quite bad, since it means that we could very easily overlook a failing test. (Running gulp fonttest locally on Windows works fine though, so I'm not sure how to debug this.)

/cc @timvandermeij Any ideas here?

@timvandermeij
Copy link
Contributor

Yikes, that's not great indeed. I'll have a look at that.

@timvandermeij timvandermeij self-assigned this Dec 9, 2023
@Snuffleupagus Snuffleupagus changed the title Failing GiHub Actions font-test was marked as successful Failing GitHub Actions font-test was marked as successful Dec 9, 2023
@timvandermeij
Copy link
Contributor

timvandermeij commented Dec 10, 2023

I have managed to reproduce the issue by putting back the code from #17172 (comment) (it talks about an invalid browsing context in BiDi which is what we also see here) and a hothacked node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js file where the exception on line 123 is raised unconditionally.

Putting back the page closing lines is required to trigger the #onContextDestroyed function and that in turn raises the exception. However, the Puppeteer exception is not bubbling up to our test runner process, which causes it to exit with status 0 and indicates success. Somewhere in Puppeteer it seems that the exception is logged and then swallowed, or it's occurring in the browser context and not bubbling up further. In any case, this looks like a regression from enabling BiDi and sadly this doesn't seem possible to fix on our end given that the exception in Puppeteer doesn't even reach our code.

@whimboo @OrKoN Does this perhaps ring a bell for you, also given the comment at #17172 (comment) which was needed for BiDi? The exception in Puppeteer is triggered as soon as await page.close(); is called, but it's only logged and doesn't cause this promise to reject (putting a try/catch around this does nothing). This feels like a bug in Puppeteer to me, also given the apparent behavior change that required the removal of this loop, but you probably know much more about how the Puppeteer internals handle this than I do.

@Snuffleupagus
Copy link
Collaborator Author

As a temporary solution, depending on how difficult/time-consuming a proper fix is, could we do something similar to the old botio font-test and check that the success message is being logged and manually throw an exception?
https://github.com/mozilla/botio-files-pdfjs/pull/42/files#diff-aa6bfe19954403f5a8eac1c2259476609786b1c9b05bfaa81a360a87430a7710L26-L33

@timvandermeij
Copy link
Contributor

timvandermeij commented Dec 10, 2023

Yes, I think that could work as a temporary solution indeed. Let's await a response to the message above, and if it's indeed an upstream issue that we can't fix here we can decide to:

  1. Temporarily revert from BiDi to CDP which didn't have this issue, and go back to BiDi once the upstream fix has landed.
  2. Temporarily introduce the workaround, and remove it later once the upstream fix has landed.
  3. Wait for a new upstream release if the fix is trivial and can be made fairly quickly.

@whimboo
Copy link
Contributor

whimboo commented Dec 11, 2023

This is indeed strange. So when a browsing context gets destroyed in the browser it's basically gone after the browsingContext.contextDestroyed event has been received. So something might have already removed the given browsing context from this.#browsingContexts in Connection.ts. The only method removing entries is unregisterBrowsingContexts, which is called from dispose in BrowsingContext.ts.

@sadym-chromium could you maybe help while @OrKoN is away?

@sadym-chromium
Copy link

@whimboo could you please provide a link to the test in pdfjs which is failing?

@whimboo
Copy link
Contributor

whimboo commented Dec 11, 2023

I don't know which exact test fails here but see the initial comment with a link to the CI log. All the font tests can be found at https://github.com/mozilla/pdf.js/tree/master/test/font.

@timvandermeij and @Snuffleupagus could you may give more details about the code that actually triggers this exception?

@Snuffleupagus
Copy link
Collaborator Author

This breaks when run, on Windows, as part of this GitHub Actions workflow: https://github.com/mozilla/pdf.js/blob/master/.github/workflows/font_tests.yml

@timvandermeij
Copy link
Contributor

timvandermeij commented Dec 11, 2023

I have been working on creating a reduced test case that would demonstrate the issue in isolation, and while doing this I found that I had been chasing a red herring in my previous analysis where I thought the Puppeteer exception didn't bubble up to our test runner. It turns out it actually does, and that is an issue in our test runner that has been there since basically "forever" (I could trace it back to a commit from 7 years ago). For some reason we never noticed this before, but the following code is problematic here:

https://github.com/mozilla/pdf.js/blob/master/gulpfile.mjs#L693-L695

The issue that we take the exit code from the spawned Node.js process and then... do nothing with it. This pattern exists in more places, so that's something we can fix on our end, and we should also audit the other similar call sites.

However, if we do that now the builds will actually fail and likely block merging PRs, so we still need to do something about the Puppeteer exception being raised in the first place, so one of the options from #17396 (comment). Strangely this only happens on Windows in our CI, and I also can't reproduce it on Linux at all. However, it does happen consistently in the CI; the most recent run is https://github.com/mozilla/pdf.js/actions/runs/7173542922/job/19533117343 and for completeness I'm listing the full traceback here:

Started 7 tests for firefox with seed 89123.
file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:123
            throw new Error(`BrowsingContext ${contextId} does not exist.`);
                  ^

Error: BrowsingContext undefined does not exist.
    at BidiConnection.getBrowsingContext (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:123:19)
    at #onContextDestroyed (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Browser.js:161:42)
    at file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/third_party/mitt/mitt.js:1:242
    at Array.map (<anonymous>)
    at Object.emit (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/third_party/mitt/mitt.js:1:226)
    at BidiConnection.emit (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:85:23)
    at BidiConnection.onMessage (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:93:26)
    at WebSocket.<anonymous> (file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:46:32)
    at callListener (D:\a\pdf.js\pdf.js\node_modules\ws\lib\event-target.js:290:14)
    at WebSocket.onMessage (D:\a\pdf.js\pdf.js\node_modules\ws\lib\event-target.js:209:9)

Does the traceback perhaps provide any hints as to why the browsing context in Puppeteer could be undefined on Windows only? Would it perhaps make sense to check for that in Puppeteer?

I'm not sure how useful this is, but in an attempt to help with reproducing this I have tried to construct a reduced test case of how we use Puppeteer (note that we no longer have the page closing loop because that's indeed unnecessary, but we had it before and was seen to be problematic in #17172 (comment) so it might help with triggering the bug), so perhaps this can be used for reproducing/debugging on Windows:

import puppeteer from "puppeteer";

async function reproduce() {
  const browser = await puppeteer.launch({
    product: "firefox",
    protocol: "webDriverBiDi",
    headless: false,
  });
  const page = await browser.newPage();
  for (const page of await browser.pages()) {
    await page.close();
  }
  await browser.close();
}

await reproduce();

Save this as repro.js and run it with node repro.js. On Linux this works, but I can only hope this fails on Windows. If this does work, it could be that we're perhaps hitting a Windows-only BiDi bug triggered by Jasmine or our test runner code because the browser does seem to start and load the Jasmine test runner since we do get to

sendInfo(`Started ${total} tests for ${browser} with seed ${seed}.`);

in the traceback. However, after that it seems to crash because no actual Jasmine tests are run.

I hope this all helps to debug the issue, and if I can somehow provide more information, feel free to let me know.

@whimboo
Copy link
Contributor

whimboo commented Dec 11, 2023

Thank you for the update Tim! To get even more details of what might go wrong I would suggest the following:

  1. For the Puppeteer options add the dumpio field and set it to true.
  2. Especially for Firefox set the preference remote.log.level="Trace" which will enable trace logs
  3. Upload the changes to the CI and trigger the relevant tests

I could then have a look at the logs tomorrow to see what might go wrong on our side.

But you could also additionally add the DEBUG="puppeteer:* environment variable (maybe as a separate log) so that we can see which puppeteer commands and events are actually in use.

@timvandermeij
Copy link
Contributor

timvandermeij commented Dec 17, 2023

@whimboo Sorry for the delay here. I have made #17432 to apply the debugging suggestions you mentioned (removed all other workflows and limited the font tests workflow to just Windows to make sure the logs are as usable as possible).

You can find the build here but since I don't know how long the retention is I have also downloaded the full raw logs and attach them here as logs.txt. Fortunately on Windows the run crashes pretty quickly after Puppeteer has started, but it's still quite a bit of logs that I hope will be helpful for debugging the issue.

The final logs when the exception happens are:

1702845086027	RemoteAgent	DEBUG	WebDriverBiDiConnection 0cc49e5b-65db-42dd-a862-46623e93bec5 <- {"type":"event","method":"browsingContext.contextDestroyed","params":{"url":"chrome://gfxsanity/content/sanitytest.html","children":null,"parent":null}}
file:///D:/a/pdf.js/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:123
            throw new Error(`BrowsingContext ${contextId} does not exist.`);

This looks interesting to me; I'm not sure what "chrome://gfxsanity/content/sanitytest.html" is exactly, but maybe that's the anomaly that causes this only on Windows somehow since I don't recognize this URL as something that we use or trigger ourselves.

@whimboo
Copy link
Contributor

whimboo commented Dec 18, 2023

Oh wow! Yes, this is indeed a problem with the sanity test window then! This window should not be used at all and we have it disabled for geckodriver and Marionette but forgot about it in Puppeteer. So temporarily you could set the preference media.sanity-test.disabled to true. Mind checking if that (alone) works? If yes I'll go ahead and provide a patch for Puppeteer.

@whimboo
Copy link
Contributor

whimboo commented Dec 20, 2023

I've created a PR for Puppeteer so that it should be fixed with the next release.

@timvandermeij
Copy link
Contributor

@whimboo I have created #17457 and can confirm that setting media.sanity-test.disabled to true fixes the issue on Windows. You can find the logs here: https://github.com/mozilla/pdf.js/actions/runs/7309787007/job/19917756293?pr=17457. Thank you for making the patch upstream; we'll await a new release then and update the test code on our side as well to respect non-zero exit codes.

@whimboo
Copy link
Contributor

whimboo commented Jan 4, 2024

A new puppeteer release is out which contains my fix:
https://github.com/puppeteer/puppeteer/releases/tag/puppeteer-core-v21.7.0

So you could get this issue fixed by upgrading Puppeteer. CC @calixteman.

@Snuffleupagus Snuffleupagus linked a pull request Jan 4, 2024 that will close this issue
@timvandermeij timvandermeij reopened this Jan 6, 2024
@timvandermeij
Copy link
Contributor

I'm reopening this because even though the upstream issue is fixed now we still need to update our Gulpfile to not ignore non-zero exit codes, to prevent this from going by unnoticed again in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants