More clarity around waiting for an active worker to finish #916

mfalken · 2016-06-17T08:46:54Z

I'm working on a Chrome bug where we're immediately terminating the incumbent worker during activation instead of waiting for it to finish its "in-progress requests" as the spec says. Since update occurs soon after page load, with skipWaiting() we run into this issue in the wild.

I'm wondering though, what exactly is a "request". Is this referring to fetch event only or is it also intended to mean functional events?

Elsewhere the spec also mentions "in-flight requests" which are "triggered by" a worker, which seems a different thing that "in-progress requests" being handled by a worker.

The spec also doesn't really mention what to do with new incoming requests: do they go to the incumbent worker or the soon-to-be activating worker?

My current plan is to:

Wait for all events already dispatched to the incumbent worker to finish (subject to our usual timeout)
All new events are put on hold until the soon-to-be activating worker is activated

How does Firefox deal with this issue?

wanderview · 2016-06-17T10:12:00Z

How does Firefox deal with this issue?

We have emulated chrome behavior here and don't wait currently. :-)

Is this a real problem in practice? It sounds like a lot of add complexity for what is probably fairly rare problem.

wanderview · 2016-06-17T10:16:13Z

It looks like if someone is holding on to the ServiceWorker object in js we will mark it redundant, but not immediately kill it.

NekR · 2016-06-17T10:20:20Z

All new events are put on hold until the soon-to-be activating worker is activated

By "soon-to-be activating" you mean while e.waitUntil(...) is waiting, right? What if it fails?

wanderview · 2016-06-17T10:34:26Z

No, this is referring to service workers in the waiting state that are being promoted to the active state. So things like FetchEvent have to be held in a queue until the old worker exits per the current spec. I'm still not sure this is a good idea.

mfalken · 2016-06-17T17:53:43Z

Is this a real problem in practice? It sounds like a lot of add complexity for what is probably fairly rare problem.

I think we're seeing this issue on Inbox and it's causing broken page loads. The page starts loading, there's sometimes one second of network quiescence that triggers our update on navigation, then skipWaiting() results in termination of the worker with unfinished fetch events.

I see Firefox uses DOMContentLoaded rather than the network quiescence heuristic to trigger update, maybe that would be safer.

wanderview · 2016-06-17T19:02:00Z

It seems like waiting for events to clear is essentially creating a new state.

Can we instead just restructure things such that:

Normal criteria for going active is no controlled documents and no events being processed.
Calling skipWaiting bypasses the controlled document requirement, but we still defer going active until all outstanding events are done.

This lets us use the existing wait mechanism without inventing a new one.

Only real downside is we don't fire new events on soon-to-be-active worker, but I don't think that should really break anything. It's just like getting update slightly later.

Thoughts?

mfalken · 2016-06-18T12:15:42Z

I think I like that. You're right, the pseudo-state was the annoying part to implement around.

The one worry implementation-wise is to ensure closing the last document doesn't end up with stalled fetch events: they should either terminate or finish normally, so that activation can proceed in a timely manner.

Also, a constant stream of events could delay activation indefinitely, even with skipWaiting (e.g., lots of XHR or postMessage), but we could have a fixed timeout to escape that.

@jakearchibald @jungkees WDYT?

wanderview · 2016-06-18T12:31:17Z

Some ideas:

Implementations could dynamically lower their "kill a long waitUntil" timer if skipWaiting has been called. This would help deal with stalled fetches.

We could also cancel the fetch group associated with the service worker going redundant. I think this is weakly spec'd right now, but it's basically the same mechanism that cancels network when a pages are closed.

Aborting the SW's fetch group should ensure all outstanding fetches resolve quickly.

Of course, we may want both since other code can hold a waitUntil open.

mfalken · 2016-06-23T09:23:12Z

Yep, I think the implementations can handle avoiding stalled fetches.

What do spec people think about wanderview's idea:

Normal criteria for going active is no controlled documents and no events being processed.
Calling skipWaiting bypasses the controlled document requirement, but we still defer going active until all outstanding events are done.

This sounds reasonable to me and I'm itching to fix chromium bug 616331.

jakearchibald · 2016-06-24T09:18:37Z

Yeah, I like @wanderview's idea. Wait until all waitUntils & respondWiths in the incumbent worker are done.

We can drop our grace period in this case too, to reduce the chance of overlapping events keeping things open.

If we see problems with this we can look at queueing events made after skipWaiting so they run in the new worker when it activates.

mfalken · 2016-06-24T09:58:45Z

OK here's roughly what I plan to start implementing to fix the Chrome bug:

The waiting worker gets promoted to active once there are no controllees and the incumbent worker has no in-flight events (fetch/sync/push etc).
When a page closes, we expect its related network requests to get cancelled, so outstanding fetch events should not block activation when there are no controllees.
skipWaiting bypasses the no controllees requirement, but we still wait until there are no in-flight events (optionally we might decrease the event timeout).
As usual, incoming fetch/sync/push events always go to the active worker.

wanderview · 2016-06-24T18:38:50Z

Our gecko bug for this is here:

https://bugzilla.mozilla.org/show_bug.cgi?id=1170543

Incidentally, I think we should count "in-flight event" from the moment the task is queued to perform the dispatch. So we should maintain a counter of events on the registration or something. The counter is then decremented when the waitUntil/respondWith promises settle or immediately after the event if neither has been called.

mfalken · 2016-06-27T05:45:07Z

Yes I think that's what Chromium will be doing.

jakearchibald · 2016-06-27T10:59:59Z

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts. However, I'm inclined to continue counting fetchEvent.waitUntil, since those actions may still be useful.

wanderview · 2016-06-27T14:19:37Z

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts. However, I'm inclined to continue counting fetchEvent.waitUntil, since those actions may still be useful.

Yea, I agree we should handle FetchEvent just like other events. Even when the last controlled page closes we will still have to handle push/bg-sync events that are queued up. Seems safest to just flush fetch events naturally.

delapuente · 2016-06-29T11:43:03Z

If the browser's need for a request is dropped (page closes or user hits X on a navigation), then the respondWith promise for the related fetch no longer counts.

What happen with side effects happening inside the respondWith() promise like cache or IndexedDB alterations. They could be interrupted if it does not count.

mfalken · 2016-06-30T06:52:31Z

Yes, seems safest for wait for fetch events to settle normally. Just have to ensure there is no implementation bug preventing it from being settled after the page is closed.

wanderview · 2016-06-30T14:00:30Z

Yes, seems safest for wait for fetch events to settle normally. Just have to ensure there is no implementation bug preventing it from being settled after the page is closed.

Are you worried about the window being closed while the FetchEvent is still in the event queue or after the SW has started processing it? Or maybe being closed between start of FetchEvent processing and queueing the event?

I was trying to think about how we would exercise the case in a WPT test.

mfalken · 2016-06-30T14:09:56Z

Both actually. Chrome has a bug where if a request gets killed the corresponding fetch event remains "in-flight" until timeout (I described it more in https://bugs.chromium.org/p/chromium/issues/detail?id=616331). But this is an implementation bug, not sure if a WPT would be suitable.

wanderview · 2016-06-30T14:17:14Z

We could make a WPT test for the "SW has started processing the FetchEvent while the owning window was closed" case, but probably not for the others.

mfalken · 2016-07-20T13:30:51Z

Can we add milestone v1 to this? Chrome already implemented it.

wanderview · 2016-07-20T14:06:31Z

@mattto When do you expect this change to ship in chrome? I'd like to try to get the fix into FF around the same time frame. Thanks!

mfalken · 2016-07-20T14:38:28Z

It should be in Chrome 53, expected to reach stable channel in early September.

This patch makes changes as per discussion at w3c/ServiceWorker#916 Namely, * The waiting worker gets promoted to active once there are no controllees and the incumbent worker has no in-flight events (fetch/sync/push/message etc). * skipWaiting bypasses the no controllees requirement, but we still wait until there are no in-flight events. * As usual, incoming fetch/sync/push events always go to the active worker. BUG=616331 Review-Url: https://codereview.chromium.org/2119143002 Cr-Commit-Position: refs/heads/master@{#405628} (cherry picked from commit 6c95b0f) Review URL: https://codereview.chromium.org/2164113003 . Cr-Commit-Position: refs/branch-heads/2785@{#258} Cr-Branched-From: 6862397-refs/heads/master@{#403382}

jakearchibald · 2016-07-27T15:15:28Z

Pre F2F notes: This seems agreed. Let's make sure that cancelling a request means the related fetch event's reapondWith promise no longer extends this event.

jakearchibald · 2016-07-28T20:51:01Z

F2F:

Wait on all extension promises, then promote to active

wanderview · 2016-08-01T13:43:06Z

It should be in Chrome 53, expected to reach stable channel in early September.

FYI, this has been fixed in gecko and should ship with FF 49 around September 13.

Also, the WPT test blink wrote is here and will be uplifted to the WPT repo on our next sync:

https://dxr.mozilla.org/mozilla-central/source/testing/web-platform/tests/service-workers/service-worker/activation.https.html

wanderview · 2016-08-03T15:27:04Z

I think there are a couple other places we need to check for "idle" state or no extension promises:

Anywhere we link to "using the service worker" should probably also check idle state.

https://slightlyoff.github.io/ServiceWorker/spec/service_worker/#dfn-use

When a client is unloaded we clear the registration if the uninstalling flag is set. We should really check for idle state there first.

We then also need to add a hook for when the active worker goes idle to clear the registration if the uninstalling flag is set and there are no controlled clients.

I discovered this because I have a bug from a user showing that we unexpectedly abort a fetch event when they race an unregister() against a window.location.reload(). Its basically the same issue, just triggered from the unregister() instead of an update.

wanderview · 2016-08-03T15:37:19Z

Another nuance to clarify. If there is a waiting worker and the uninstalling flag is set when the active worker goes idle, what happens first?

The waiting worker is promoted to active. Or...
The registration is cleared and the waiting worker is terminated.

I think giving the unregister priority here makes sense since the waiting worker is by definition not actively handling events yet.

wanderview · 2016-08-04T20:59:40Z

Also, should we delay activation if the previous .active worker is still in the activating state? For example, wait for its activate event to finish processing. It seems like the safest thing to do would be to wait.

This adds concepts and algorithms to track events (all service worker events: lifecycle events, functional events, extendable message events) that have pending extension promises so they can be used to check if the registration's waiting worker can be promoted. This also changes Install algorithm and related steps to match to the implementations behavior. That is, instead of waiting for the promotion condition in Install algorithm, newly introduced Try Activate algorithm just tries to activate depending on the condition and return. Try Activate is called when: - A service worker is installed. - The last client controlled by the existing active worker is unloaded. - skipWaiting() is called. - The extend lifetime promises for the existing active worker settle. Related issue: #916.

jungkees · 2017-02-21T08:17:34Z

Status update: f1a1120 addressed the resolution in #916 (comment) by introducing Try Activate.

There are still two remaining issues from @wanderview's follow-up comments. I plan to do:

Make sure to wait until workers are idle before clearing registration: More clarity around waiting for an active worker to finish #916 (comment) and More clarity around waiting for an active worker to finish #916 (comment).

There are three call sites to Clear Registration: Handle Client Unload, extension promises' fulfillment handlers, Handle User Agent Shutdown. I think we can introduce Try Clear Registration that invokes Clear Registration only when (registration's not used && relevant workers are idle) && unregister flag is set (Handle User Agent Shutdown shouldn't need this condition.)
When both unregister and Activate need to be checked, unregister has a priority over Activate.

Make sure to not loose any events scheduled to dispatch on the active worker in the activating state: More clarity around waiting for an active worker to finish #916 (comment).

Agree to delay the activation of the new one. When Handle Fetch is waiting to dispatch an event already, that means the client is already controlled by the existing active worker in activating state. So, delaying the activation of the new one seems to make the behavior consistent.

This is a follow-up change, to [1], which addresses the issues posed in [2] and the comments afterwards. This change makes the callers to the Clear Registration algorithm to check if no client is using the registration and the registration's associated service workers have no pending extended events before invoking the algorithm. Unregister, Handle Service Worker Client Unload, and the microtasks queued by the extension promises' settlement handlers invoke Try Clear Registration. Other error handling steps on registrations that have no associated worker and Handle User Agent Shutdown invoke Clear Registration directly as they do. [1] f1a1120 [2] #916 (comment) Summary of the issues: #916 (comment). Fixes #916.

wanderview added this to the Version 1 milestone Jul 20, 2016

jakearchibald mentioned this issue Jul 27, 2016

Create F2F agenda - 28-29 July 2016 #932

Closed

jakearchibald added the decided label Jul 28, 2016

jungkees mentioned this issue Feb 3, 2017

Improve Activate with Try Activate #1065

Merged

jungkees mentioned this issue Feb 23, 2017

Introduce Try Clear Registration; Delay activation #1079

Merged

jungkees closed this as completed in #1079 Feb 25, 2017

wanderview mentioned this issue Mar 30, 2017

[service-workers] Avoid race condition in test web-platform-tests/wpt#4879

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More clarity around waiting for an active worker to finish #916

More clarity around waiting for an active worker to finish #916

mfalken commented Jun 17, 2016

wanderview commented Jun 17, 2016

wanderview commented Jun 17, 2016

NekR commented Jun 17, 2016

wanderview commented Jun 17, 2016

mfalken commented Jun 17, 2016

wanderview commented Jun 17, 2016

mfalken commented Jun 18, 2016

wanderview commented Jun 18, 2016

mfalken commented Jun 23, 2016

jakearchibald commented Jun 24, 2016

mfalken commented Jun 24, 2016

wanderview commented Jun 24, 2016

mfalken commented Jun 27, 2016

jakearchibald commented Jun 27, 2016

wanderview commented Jun 27, 2016

delapuente commented Jun 29, 2016 •

edited

Loading

mfalken commented Jun 30, 2016

wanderview commented Jun 30, 2016

mfalken commented Jun 30, 2016

wanderview commented Jun 30, 2016

mfalken commented Jul 20, 2016

wanderview commented Jul 20, 2016

mfalken commented Jul 20, 2016

jakearchibald commented Jul 27, 2016

jakearchibald commented Jul 28, 2016

wanderview commented Aug 1, 2016

wanderview commented Aug 3, 2016

wanderview commented Aug 3, 2016

wanderview commented Aug 4, 2016

jungkees commented Feb 21, 2017

More clarity around waiting for an active worker to finish #916

More clarity around waiting for an active worker to finish #916

Comments

mfalken commented Jun 17, 2016

wanderview commented Jun 17, 2016

wanderview commented Jun 17, 2016

NekR commented Jun 17, 2016

wanderview commented Jun 17, 2016

mfalken commented Jun 17, 2016

wanderview commented Jun 17, 2016

mfalken commented Jun 18, 2016

wanderview commented Jun 18, 2016

mfalken commented Jun 23, 2016

jakearchibald commented Jun 24, 2016

mfalken commented Jun 24, 2016

wanderview commented Jun 24, 2016

mfalken commented Jun 27, 2016

jakearchibald commented Jun 27, 2016

wanderview commented Jun 27, 2016

delapuente commented Jun 29, 2016 • edited Loading

mfalken commented Jun 30, 2016

wanderview commented Jun 30, 2016

mfalken commented Jun 30, 2016

wanderview commented Jun 30, 2016

mfalken commented Jul 20, 2016

wanderview commented Jul 20, 2016

mfalken commented Jul 20, 2016

jakearchibald commented Jul 27, 2016

jakearchibald commented Jul 28, 2016

wanderview commented Aug 1, 2016

wanderview commented Aug 3, 2016

wanderview commented Aug 3, 2016

wanderview commented Aug 4, 2016

jungkees commented Feb 21, 2017

delapuente commented Jun 29, 2016 •

edited

Loading