Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket reconnect from ping timeout does not work correctly #3652

Open
johnjbarton opened this issue Feb 9, 2021 · 14 comments · May be fixed by #3653
Open

socket reconnect from ping timeout does not work correctly #3652

johnjbarton opened this issue Feb 9, 2021 · 14 comments · May be fixed by #3653
Assignees

Comments

@johnjbarton
Copy link
Contributor

This is a long standing issue I've tried to fix in the past.

A user in my company hit a case that failed 30% of the time or so and he ran down the reason:

  • the client takes a long hunk of time (parsing or running a test)
  • the server hits the pingTimeout, disconnects, waits for reconnect
  • the client sends results << These results are lost
  • the client gets pingTimeout, tries reconnect (unsure)
  • server sees reconnect as a new socket and connects.
  • server never sees results, unpredictable result.

As far as I can tell, any data that the client sends after the server thinks the socket is disconnected is lost. The socket.io layer has no replay and neither does karma.

The problem is fundamental. Either we implement replay or we work to avoid disconnect but fail if we hit it.

Replay is not super hard since we know that tests are finite in duration.

But I think that, for the most part, disconnect is self-inflicted. The client is parsing a huge chunk of JS or it is off doing a long test: it fails to respond to a ping. The browser is there and the runtime is live, but we are starving the ping reply. If we increase the pingTimeout we can avoid the disconnect.

The other ways that the pingTimeout can be hit is browser/tab crash or network partition. With a large pingTimeout we wait a long time to discover the tab crash.

I can't come up with a scenario where automatic socket reconnect makes sense for us. Any time we lose connection, we lose data and the test is invalid. We could say that, eg logging up until the test starts is "optional" so we could reconnect if we lost connection before start. But if we lose connection before start the thing we need to figure out why is exactly those logs.

I think we should flush the entire reconnect logic, all of the timeouts other than pingTimeout, and retry logic in favor error messages directing users to look into the issue and then increase the pingTimeout.

@johnjbarton
Copy link
Contributor Author

We have logic to relaunch browsers (and thus retry the test) if the socket disconnects during a test.

Is this worth having? It's not fundamentally wrong: if the test passes its fine. It seems marginally wrong in that the most common case would be a flaky test that is close to the pingTimeout (too large JS or too long test). The retry might pass if the first one fails, but it means we don't address the underlying problem (pingTimeout too small).

Can the network could be legitimately flaky in some cases these days?

@johnjbarton
Copy link
Contributor Author

There is some replay code:
https://github.com/karma-runner/karma/blob/master/lib/server.js#L293
This supports the case where the browser connects then registers and in between it sends some other events.
I can't see any way this can matter. Until the execute event is sent from the server to the client, none of those event can happen. And if they did which ones did we miss before the connect event?

johnjbarton added a commit to johnjbarton/karma that referenced this issue Feb 10, 2021
This feature has been trouble and there are cases it cannot work.
Any socket disconnect during a test can lose data. The server will disconnect
if the client does not reply. Combined it makes any long running or large
test error prone.  Better to bail and restart the test.
Fixes karma-runner#3652
@johnjbarton johnjbarton linked a pull request Feb 10, 2021 that will close this issue
@trevorgithub
Copy link

A co-worker and myself as new users of karma ran into this issue (Disconnected reconnect failed before timeout of 2000ms (ping timeout)) many times during testing. We pulled John's PR and after some pretty exhaustive testing, I can say it has 100% resolved our problems. Thank you! Looking forward to this getting merged.

@trevorgithub
Copy link

@johnjbarton, any update on this? The reason my team recently adopted Karma is so that we could leverage the integration with Istanbul to generate coverage reports for our suite of Jasmine tests. When using Karma with headless Chrome via Puppeteer, the current Karma code fails far more often than not at the very end of the test run (with the ping timeout error). As a result, the Istanbul tests are not written to disk. For such a use case, the pull request completely resolves the problem (based on our testing) and is essential to generating coverage reports.

@johnjbarton
Copy link
Contributor Author

I just want to be sure that all karma 6.0 issues are resolved before we go ahead. This reconnect code is only triggered when things go bad so I don't trust our test coverage. I appreciate your efforts and I will try to pick up the pace.

johnjbarton added a commit to johnjbarton/karma that referenced this issue Apr 15, 2021
This feature has been trouble and there are cases it cannot work.
Any socket disconnect during a test can lose data. The server will disconnect
if the client does not reply. Combined it makes any long running or large
test error prone.  Better to bail and restart the test.
Fixes karma-runner#3652
johnjbarton added a commit to johnjbarton/karma that referenced this issue Apr 20, 2021
This feature has been trouble and there are cases it cannot work.
Any socket disconnect during a test can lose data. The server will disconnect
if the client does not reply. Combined it makes any long running or large
test error prone.  Better to bail and restart the test.
Fixes karma-runner#3652
@trevorgithub
Copy link

@johnjbarton, just wanted to check in on this one. I appreciate it's only one issue of many the team is focused on, so I understand it can take some time.

I also wanted to update you that today I pulled the disconnect branch from your forked repo, and tested again. Unfortunately, the original problem has resurfaced. When I originally tested the fix you made in February, it was rock solid. Today when looking at the git log on that disconnect branch, I can see your fix at the top of the disconnect branch, but a number of other commits with a more recent date beneath it the git log. I guess maybe you rebased with the base branch, and perhaps something was incompatible with your changes. Perhaps this is something you already know, but I wanted to bring it to your attention in case it's news. Thanks!

@gjyalpha
Copy link

gjyalpha commented May 7, 2021

Hi @trevorgithub , John has left the team and probably won't be active on Karma project. @jginsburgn and I (@gjyalpha ) will keep working on the Karma open source project. Well, due to other priorities, I'm not sure when we will get another chance to look on this yet. Will keep this thread posted if we have any update. Thanks!

@trevorgithub
Copy link

@gjyalpha , thank you very much for the prompt feedback, it is much appreciated. I'm saddened to hear that work may be lost (or at least postponed) but I am very much grateful for the information.

@devoto13
Copy link
Collaborator

@trevorgithub can you please share a log from the failing run using the latest PR build with the --log-level debug flag? The code in the PR looks correct, so I wonder if we can get some insights into why it didn't work for you?

@devoto13
Copy link
Collaborator

Actually, please wait a bit. I see that the last build in PR failed, I'll fix it in the coming days and then ask you to re-test. Sorry for the noise!

@johnjbarton
Copy link
Contributor Author

johnjbarton commented May 12, 2021 via email

@jginsburgn
Copy link
Member

Thanks so much for taking a look @johnjbarton. Enjoy your retirement!

@devoto13: Do you want me to take a look at this?

@devoto13
Copy link
Collaborator

@jginsburgn Yes, once I have some time to spare :)

jginsburgn pushed a commit that referenced this issue Oct 28, 2021
This feature has been trouble and there are cases it cannot work.
Any socket disconnect during a test can lose data. The server will disconnect
if the client does not reply. Combined it makes any long running or large
test error prone.  Better to bail and restart the test.
Fixes #3652
@jr01
Copy link

jr01 commented Dec 11, 2021

Could this issue be why sometimes onBrowserStart is not fired, breaking reporters? I made a comment about junit and progress reporter here: karma-runner/karma-junit-reporter#188 (comment)

Perhaps the following occurs:

  • Karma client emits start and sets startEmitted=true
  • socket gets closed
  • start message is not delivered to the Karma server and lost
  • The socket reconnects and client now emits register with isSocketReconnect=true
  • Server browser.reconnect is executed and does not trigger execute...
  • Client will never send start because startEmitted === true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants