Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More clarity around waiting for an active worker to finish #916

Closed
mfalken opened this issue Jun 17, 2016 · 30 comments
Closed

More clarity around waiting for an active worker to finish #916

mfalken opened this issue Jun 17, 2016 · 30 comments
Labels
Milestone

Comments

@mfalken
Copy link
Member

mfalken commented Jun 17, 2016

I'm working on a Chrome bug where we're immediately terminating the incumbent worker during activation instead of waiting for it to finish its "in-progress requests" as the spec says. Since update occurs soon after page load, with skipWaiting() we run into this issue in the wild.

I'm wondering though, what exactly is a "request". Is this referring to fetch event only or is it also intended to mean functional events?

Elsewhere the spec also mentions "in-flight requests" which are "triggered by" a worker, which seems a different thing that "in-progress requests" being handled by a worker.

The spec also doesn't really mention what to do with new incoming requests: do they go to the incumbent worker or the soon-to-be activating worker?

My current plan is to:

  • Wait for all events already dispatched to the incumbent worker to finish (subject to our usual timeout)
  • All new events are put on hold until the soon-to-be activating worker is activated

How does Firefox deal with this issue?

@wanderview
Copy link
Member

How does Firefox deal with this issue?

We have emulated chrome behavior here and don't wait currently. :-)

Is this a real problem in practice? It sounds like a lot of add complexity for what is probably fairly rare problem.

@wanderview
Copy link
Member

It looks like if someone is holding on to the ServiceWorker object in js we will mark it redundant, but not immediately kill it.

@NekR
Copy link

NekR commented Jun 17, 2016

All new events are put on hold until the soon-to-be activating worker is activated

By "soon-to-be activating" you mean while e.waitUntil(...) is waiting, right? What if it fails?

@wanderview
Copy link
Member

No, this is referring to service workers in the waiting state that are being promoted to the active state. So things like FetchEvent have to be held in a queue until the old worker exits per the current spec. I'm still not sure this is a good idea.

@mfalken
Copy link
Member Author

mfalken commented Jun 17, 2016

Is this a real problem in practice? It sounds like a lot of add complexity for what is probably fairly rare problem.

I think we're seeing this issue on Inbox and it's causing broken page loads. The page starts loading, there's sometimes one second of network quiescence that triggers our update on navigation, then skipWaiting() results in termination of the worker with unfinished fetch events.

I see Firefox uses DOMContentLoaded rather than the network quiescence heuristic to trigger update, maybe that would be safer.

@wanderview
Copy link
Member

It seems like waiting for events to clear is essentially creating a new state.

Can we instead just restructure things such that:

  1. Normal criteria for going active is no controlled documents and no events being processed.
  2. Calling skipWaiting bypasses the controlled document requirement, but we still defer going active until all outstanding events are done.

This lets us use the existing wait mechanism without inventing a new one.

Only real downside is we don't fire new events on soon-to-be-active worker, but I don't think that should really break anything. It's just like getting update slightly later.

Thoughts?

@mfalken
Copy link
Member Author

mfalken commented Jun 18, 2016

I think I like that. You're right, the pseudo-state was the annoying part to implement around.

The one worry implementation-wise is to ensure closing the last document doesn't end up with stalled fetch events: they should either terminate or finish normally, so that activation can proceed in a timely manner.

Also, a constant stream of events could delay activation indefinitely, even with skipWaiting (e.g., lots of XHR or postMessage), but we could have a fixed timeout to escape that.

@jakearchibald @jungkees WDYT?

@wanderview
Copy link
Member

Some ideas:

Implementations could dynamically lower their "kill a long waitUntil" timer if skipWaiting has been called. This would help deal with stalled fetches.

We could also cancel the fetch group associated with the service worker going redundant. I think this is weakly spec'd right now, but it's basically the same mechanism that cancels network when a pages are closed.

Aborting the SW's fetch group should ensure all outstanding fetches resolve quickly.

Of course, we may want both since other code can hold a waitUntil open.

@mfalken
Copy link
Member Author

mfalken commented Jun 23, 2016

Yep, I think the implementations can handle avoiding stalled fetches.

What do spec people think about wanderview's idea:

Normal criteria for going active is no controlled documents and no events being processed.
Calling skipWaiting bypasses the controlled document requirement, but we still defer going active until all outstanding events are done.

This sounds reasonable to me and I'm itching to fix chromium bug 616331.

@jakearchibald
Copy link
Contributor

Yeah, I like @wanderview's idea. Wait until all waitUntils & respondWiths in the incumbent worker are done.

We can drop our grace period in this case too, to reduce the chance of overlapping events keeping things open.

If we see problems with this we can look at queueing events made after skipWaiting so they run in the new worker when it activates.

@mfalken
Copy link
Member Author

mfalken commented Jun 24, 2016

OK here's roughly what I plan to start implementing to fix the Chrome bug:

  • The waiting worker gets promoted to active once there are no controllees and the incumbent worker has no in-flight events (fetch/sync/push etc).
  • When a page closes, we expect its related network requests to get cancelled, so outstanding fetch events should not block activation when there are no controllees.
  • skipWaiting bypasses the no controllees requirement, but we still wait until there are no in-flight events (optionally we might decrease the event timeout).
  • As usual, incoming fetch/sync/push events always go to the active worker.

@wanderview
Copy link
Member

Our gecko bug for this is here:

https://bugzilla.mozilla.org/show_bug.cgi?id=1170543

Incidentally, I think we should count "in-flight event" from the moment the task is queued to perform the dispatch. So we should maintain a counter of events on the registration or something. The counter is then decremented when the waitUntil/respondWith promises settle or immediately after the event if neither has been called.

@mfalken
Copy link
Member Author

mfalken commented Jun 27, 2016

Yes I think that's what Chromium will be doing.

@jakearchibald
Copy link
Contributor

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts. However, I'm inclined to continue counting fetchEvent.waitUntil, since those actions may still be useful.

@wanderview
Copy link
Member

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts. However, I'm inclined to continue counting fetchEvent.waitUntil, since those actions may still be useful.

Yea, I agree we should handle FetchEvent just like other events. Even when the last controlled page closes we will still have to handle push/bg-sync events that are queued up. Seems safest to just flush fetch events naturally.

@delapuente
Copy link

delapuente commented Jun 29, 2016

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts.

What happen with side effects happening inside the respondWith() promise like cache or IndexedDB alterations. They could be interrupted if it does not count.

@mfalken
Copy link
Member Author

mfalken commented Jun 30, 2016

Yes, seems safest for wait for fetch events to settle normally. Just have to ensure there is no implementation bug preventing it from being settled after the page is closed.

@wanderview
Copy link
Member

Yes, seems safest for wait for fetch events to settle normally. Just have to ensure there is no implementation bug preventing it from being settled after the page is closed.

Are you worried about the window being closed while the FetchEvent is still in the event queue or after the SW has started processing it? Or maybe being closed between start of FetchEvent processing and queueing the event?

I was trying to think about how we would exercise the case in a WPT test.

@mfalken
Copy link
Member Author

mfalken commented Jun 30, 2016

Both actually. Chrome has a bug where if a request gets killed the corresponding fetch event remains "in-flight" until timeout (I described it more in https://bugs.chromium.org/p/chromium/issues/detail?id=616331). But this is an implementation bug, not sure if a WPT would be suitable.

@wanderview
Copy link
Member

We could make a WPT test for the "SW has started processing the FetchEvent while the owning window was closed" case, but probably not for the others.

@mfalken
Copy link
Member Author

mfalken commented Jul 20, 2016

Can we add milestone v1 to this? Chrome already implemented it.

@wanderview wanderview added this to the Version 1 milestone Jul 20, 2016
@wanderview
Copy link
Member

@mattto When do you expect this change to ship in chrome? I'd like to try to get the fix into FF around the same time frame. Thanks!

@mfalken
Copy link
Member Author

mfalken commented Jul 20, 2016

It should be in Chrome 53, expected to reach stable channel in early September.

MXEBot pushed a commit to mirror/chromium that referenced this issue Jul 22, 2016
This patch makes changes as per discussion at
w3c/ServiceWorker#916

Namely,
* The waiting worker gets promoted to active once there are no
controllees and the incumbent worker has no in-flight events
(fetch/sync/push/message etc).
* skipWaiting bypasses the no controllees requirement, but we
still wait until there are no in-flight events.
* As usual, incoming fetch/sync/push events always go to the
active worker.

BUG=616331

Review-Url: https://codereview.chromium.org/2119143002
Cr-Commit-Position: refs/heads/master@{#405628}
(cherry picked from commit 6c95b0f)

Review URL: https://codereview.chromium.org/2164113003 .

Cr-Commit-Position: refs/branch-heads/2785@{#258}
Cr-Branched-From: 6862397-refs/heads/master@{#403382}
@jakearchibald
Copy link
Contributor

Pre F2F notes: This seems agreed. Let's make sure that cancelling a request means the related fetch event's reapondWith promise no longer extends this event.

@jakearchibald
Copy link
Contributor

F2F:

  • Wait on all extension promises, then promote to active

@wanderview
Copy link
Member

It should be in Chrome 53, expected to reach stable channel in early September.

FYI, this has been fixed in gecko and should ship with FF 49 around September 13.

Also, the WPT test blink wrote is here and will be uplifted to the WPT repo on our next sync:

https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/service-workers/service-worker/activation.https.html

@wanderview
Copy link
Member

I think there are a couple other places we need to check for "idle" state or no extension promises:

  1. Anywhere we link to "using the service worker" should probably also check idle state.

https://slightlyoff.github.io/ServiceWorker/spec/service_worker/#dfn-use

  1. When a client is unloaded we clear the registration if the uninstalling flag is set. We should really check for idle state there first.

We then also need to add a hook for when the active worker goes idle to clear the registration if the uninstalling flag is set and there are no controlled clients.

I discovered this because I have a bug from a user showing that we unexpectedly abort a fetch event when they race an unregister() against a window.location.reload(). Its basically the same issue, just triggered from the unregister() instead of an update.

@wanderview
Copy link
Member

Another nuance to clarify. If there is a waiting worker and the uninstalling flag is set when the active worker goes idle, what happens first?

  1. The waiting worker is promoted to active. Or...
  2. The registration is cleared and the waiting worker is terminated.

I think giving the unregister priority here makes sense since the waiting worker is by definition not actively handling events yet.

@wanderview
Copy link
Member

Also, should we delay activation if the previous .active worker is still in the activating state? For example, wait for its activate event to finish processing. It seems like the safest thing to do would be to wait.

jungkees added a commit that referenced this issue Feb 3, 2017
This adds concepts and algorithms to track events (all service worker
events: lifecycle events, functional events, extendable message events)
that have pending extension promises so they can be used to check if the
registration's waiting worker can be promoted.

This also changes Install algorithm and related steps to match to the
implementations behavior. That is, instead of waiting for the promotion
condition in Install algorithm, newly introduced Try Activate algorithm
just tries to activate depending on the condition and return. Try
Activate is called when:
 - A service worker is installed.
 - The last client controlled by the existing active worker is unloaded.
 - skipWaiting() is called.
 - The extend lifetime promises for the existing active worker settle.

Related issue: #916.
jungkees added a commit that referenced this issue Feb 16, 2017
This adds concepts and algorithms to track events (all service worker
events: lifecycle events, functional events, extendable message events)
that have pending extension promises so they can be used to check if the
registration's waiting worker can be promoted.

This also changes Install algorithm and related steps to match to the
implementations behavior. That is, instead of waiting for the promotion
condition in Install algorithm, newly introduced Try Activate algorithm
just tries to activate depending on the condition and return. Try
Activate is called when:
 - A service worker is installed.
 - The last client controlled by the existing active worker is unloaded.
 - skipWaiting() is called.
 - The extend lifetime promises for the existing active worker settle.

Related issue: #916.
@jungkees
Copy link
Collaborator

Status update: f1a1120 addressed the resolution in #916 (comment) by introducing Try Activate.

There are still two remaining issues from @wanderview's follow-up comments. I plan to do:

  1. Make sure to wait until workers are idle before clearing registration: More clarity around waiting for an active worker to finish #916 (comment) and More clarity around waiting for an active worker to finish #916 (comment).
  • There are three call sites to Clear Registration: Handle Client Unload, extension promises' fulfillment handlers, Handle User Agent Shutdown. I think we can introduce Try Clear Registration that invokes Clear Registration only when (registration's not used && relevant workers are idle) && unregister flag is set (Handle User Agent Shutdown shouldn't need this condition.)
  • When both unregister and Activate need to be checked, unregister has a priority over Activate.
  1. Make sure to not loose any events scheduled to dispatch on the active worker in the activating state: More clarity around waiting for an active worker to finish #916 (comment).
  • Agree to delay the activation of the new one. When Handle Fetch is waiting to dispatch an event already, that means the client is already controlled by the existing active worker in activating state. So, delaying the activation of the new one seems to make the behavior consistent.

jungkees added a commit that referenced this issue Feb 23, 2017
This is a follow-up change, to [1], which addresses the issues posed in
[2] and the comments afterwards.

This change makes the callers to the Clear Registration algorithm to
check if no client is using the registration and the registration's
associated service workers have no pending extended events before
invoking the algorithm. Unregister, Handle Service Worker Client Unload,
and the microtasks queued by the extension promises' settlement handlers
invoke Try Clear Registration.

Other error handling steps on registrations that have no associated
worker and Handle User Agent Shutdown invoke Clear Registration directly
as they do.

[1] f1a1120
[2] #916 (comment)

Summary of the issues: #916 (comment).

Fixes #916.
jungkees added a commit that referenced this issue Feb 25, 2017
This is a follow-up change, to [1], which addresses the issues posed in
[2] and the comments afterwards.

This change makes the callers to the Clear Registration algorithm to
check if no client is using the registration and the registration's
associated service workers have no pending extended events before
invoking the algorithm. Unregister, Handle Service Worker Client Unload,
and the microtasks queued by the extension promises' settlement handlers
invoke Try Clear Registration.

Other error handling steps on registrations that have no associated
worker and Handle User Agent Shutdown invoke Clear Registration directly
as they do.

[1] f1a1120
[2] #916 (comment)

Summary of the issues: #916 (comment).

Fixes #916.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants