Push component should automatically recover from the server losing its uaid #3695

rfk · 2020-11-05T02:37:52Z

As noted in #3314 (comment) and subsequent comments, if the autopush server receives a 404 or 410 error when trying to delivery a push via one of the mobile bridges, it will completely drop the registration record for that uaid. AFAICT any attempts by the client to operate on that uaid will produce an error.

I don't think the push component currently handles this case well. In fact I don't see any codepaths that would case us to detect that self.uaid has become invalid and is being rejected by the server.

This seems like something that could be handled in verify_connection by:

Detecting the "uaid does not exist" error.
Registering a new uaid.
Reporting all existing subscriptions as expired.

We should also add some handling of this case in the update method, which clients may call when in this state in an attempt to repair this push subscriptions.

┆Issue is synchronized with this Jira Bug
┆Epic: FxA Ecosystem (backlog)
┆Fix Versions: Release 92
┆Sprint End Date: 2021-07-09

The text was updated successfully, but these errors were encountered:

rfk · 2021-06-25T02:27:04Z

The docs and structure of the push component and overdue for a bit of an overhaul, we sadly haven't really kept up very good maintenance of it since it was initially written by the push team. However, that's an undertaking far beyond the scope of this bug! (But if you happen to find opportunities to clean up the docs or clarify things as you go here, that's also valuable).

I've added some high-level context-setting in #4303 which I hope will help a bit.

Specifically for this bug, we want to deal with the unhandled error being reported from Fenix users here and here:


W/App     (29404): mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : \"{\\\"code\\\": 410, \\\"errno\\\": 103, \\\"error\\\": \\\"\\\", \\\"more_info\\\": \\\"http://autopush.readthedocs.io/en/latest/http.html#error-codes\\\", \\\"message\\\": \\\"Request did not validate UAID not found\\\"}\"")
W/App     (29404):   at mozilla.appservices.push.RustError.intoException(RustError.kt:13)
W/App     (29404):   at mozilla.appservices.push.PushManager.subscribe(PushManager.kt:13)
W/App     (29404):   at mozilla.components.feature.push.RustPushConnection.subscribe(Connection.kt:4)
W/App     (29404):   at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invokeSuspend(AutoPushFeature.kt:5)
W/App     (29404):   at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invoke(Unknown Source:10)
W/App     (29404):   at mozilla.components.feature.push.ext.CoroutineScopeKt$launchAndTry$2.invokeSuspend(CoroutineScope.kt:5)

This error occurs when the push server (for whatever reason) throws away our existing registration, forgetting everything about the uaid that we initially created when registering our first subscription here.

Unfortunately the management of uaid is a little bit implicit in the control flow at the moment - it gets created when we register our first subscription and then we assume it will stick around forever. It's also part of the primary key of each subscription record stored in the db. Absent a significant refactoring, I think the only tractable way to recover from the server losing our uaid is to let the app catch it, trigger a call to verifyConnection and do a full state reset and cross-check.

We need the app to be in control of the process here, because we're going to be re-creating a bunch of subscription records, and the app needs to be able to route the updated data to the relevant consumers elsewhere in the app (such as the FxA client, or service workers). This is what the verifyConnection API is designed for - it returns a list of the subscriptions that have changed so that the app can route the new info to the right place.

So, at the top level of a fix here, the app needs to be able to catch this error and call verifyConnection in response. It may be enough to just catch the error as-is, but we could also consider making a dedicated error class like VerificationNeeded to help with this. @jonalmeida thoughts?

Next level down, we need to implement some recovery logic in the underlying Rust verify_connection call. I'll admit I don't fully understand the verification logic there, but maybe something like this will work:

In the underlying Connection::verify_connection call, catch this "UAID not found" error and return false, signalling to the next level up that there is something wrong with the server state.
In the higher-level PushManager::verify_connection when we see that false being returned, drop the the current uaid and its associated push records from our local database, essentially making our local state match the server state by making them both empty.
Continue returning the list of existing subscriptions to calling code like this. Something will need to re-create them, and the process of re-creating these subscriptions will cause us to implicitly create a new uaid value and put the system back in a working state.

Where things get fuzzy for me is in step (3) here. It's not clear to me what the API contract actually is in practice. The Rust code is just returning the list of existing subscriptions, so I guess it's expecting the calling code to re-create each one. But the example Kotlin code here suggests that the returned list contains newly-created subscription info and all the app has to do is route those new details to appropriate places. The wrapper code in a-c certainly seem worded in as way that suggests the subscriptions have already been re-created. But then when it gets turned into an AutoPushSubscriptionChanged notification, that doesn't contain enough info about the subscription for consumers to actually update themselves, so for example the FxA client needs to explicitly call subscribe which will re-create the subscription if one doesn't already exist for that channel.

So...I think we maybe it's fine for PushManager::verify_connection to just delete the old subscriptions from its db and let consumers re-create the ones they need, but I'm not 100% sure. @jonalmeida is the best person to confirm or deny this hypothesis, and we should probably update docs and names throughout the call path here to clarify expectations.

jonalmeida · 2021-06-25T20:58:58Z

Will have to come back to this issue after PTO, but I wanted to add one note:

(But if you happen to find opportunities to clean up the docs or clarify things as you go here, that's also valuable).

You might find this meta bug I had filed also valuable #2763 of mostly nits during the initial integration of the push component into AC/Fenix/Fire TV.

rfk · 2021-06-26T01:37:53Z

Thanks @jonalmeida! Sorry for a lack of context on my previous comment - I'm going to be PTO for the week after the wellness week, but we're hoping that @tarikeshaq will be able to spend some time on this bug during that week, so I as leaving as much async context as I could. This can all definitely wait until after PTO.

jonalmeida · 2021-07-08T02:30:45Z

So, at the top level of a fix here, the app needs to be able to catch this error and call verifyConnection in response. It may be enough to just catch the error as-is, but we could also consider making a dedicated error class like VerificationNeeded to help with this. @jonalmeida thoughts?

Yes, this sounds fine to me. If it makes it easier, our current logic/contract is to call verify_connection once a day, at best. That might avoid the need to even throw an error that needs to be caught and acted upon. Either way works, and I'm happy to break old contracts to form better ones where needed.

The wrapper code in a-c certainly seem worded in as way that suggests the subscriptions have already been re-created. But then when it gets turned into an AutoPushSubscriptionChanged notification, that doesn't contain enough info about the subscription for consumers to actually update themselves, so for example the FxA client needs to explicitly call subscribe which will re-create the subscription if one doesn't already exist for that channel.

So...I think we maybe it's fine for PushManager::verify_connection to just delete the old subscriptions from its db and let consumers re-create the ones they need, but I'm not 100% sure. @jonalmeida is the best person to confirm or deny this hypothesis, and we should probably update docs and names throughout the call path here to clarify expectations.

In the KDoc, "updated" may not have been the best choice of words in hind sight. My understanding that, verify_connection gives us a list of subscriptions that are alreay trashed.

Then, as you correctly followed in the linked code, we want to use that list to inform our consumers (FxA and service workers) that the subscription they relied on no longer exists, so it's up to them to decide if they want to renew that subscription or not. In FxA's case, we handle that bit of logic in FxaPushSupportFeature. Service workers handle it based on the web app's own logic when they receive this notification that we pass to GeckoView which delivers it to the service worker.

I just remembered that we have a nice little paragraph in the README specifically about the current behaviour and understanding of verify_connection so we don't have to keep that mental model all the time. 😄

tarikeshaq · 2021-07-08T18:06:16Z

Thanks a ton @jonalmeida!!

I'll probably add a bit more error handling to make sure we can catch the error, but after an initial look I THINK I might have a good theory on why we are ending up with this error in the first place, @jrconlin could possibly verify my theory here 😄

When we verify_connection on the rust component, and we find a mismatch, we end up unsubscribing from all channels over here..
Then, in unsubscribe we end up sending a network request to the AutoPush server telling it to wipe all records associated with our UAID (and telling it that the UAID is no longer valid), ref, the Unregister UAID docs, which eventually deletes the UAID from the server's database. in the component, the network URL is formated here, note how when the channel_id is None we trigger the Unregister UAID request
However, in-memory, our ConnectHttp object still has a uaid that is Some. So, when a future subscribe is triggered, we'll automatically try to attach that uaid to the request, this happens over here.
Now, the server was told earlier (in the unsubscribe) that the UAID is invalid, then we trigger a subscribe with that UAID, so naturally, the server complains that UAID not found

A super easy fix is I'll try is to simply set the uaid of the ConnectHttp to None when we unsubscribe from all the channels.

But all that is beside the point that we should be recovering from problems like this automatically - I'll jot down some more thoughts specifically on that and possibly open a draft PR to look at

This was referenced Nov 5, 2020

Confirm whether uaids are being dropped on 404/410 from bridged servers. mozilla-services/autopush#1444

Open

Investigate verify_connection incorrectly returning empty list #3314

Closed

rfk added the 📣 Push label Nov 5, 2020

rfk changed the title ~~Push component should automatically recover from the server losings it uaid~~ Push component should automatically recover from the server losing its uaid Nov 5, 2020

This was referenced Dec 15, 2020

Revert changes to unsubscribe when observing subscriptionExpired mozilla-mobile/android-components#9222

Closed

Sent tabs not received by Fenix until Sync Now button clicked mozilla-mobile/fenix#16721

Closed

jdragojevic added epic-fxa-ecosystem-sync-2123 and removed 📣 Push labels Jun 17, 2021

rfk mentioned this issue Jun 25, 2021

Add some context-setting docs for the push component. #4303

Merged

data-sync-user assigned tarikeshaq Jul 6, 2021

tarikeshaq mentioned this issue Jul 8, 2021

Eliminate outdated UAID's from push component #4325

Merged

4 tasks

tarikeshaq mentioned this issue Jul 20, 2021

Adds improvements to push component error handling #4347

Merged

7 tasks

mergify bot closed this as completed in #4347 Aug 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Push component should automatically recover from the server losing its uaid #3695

Push component should automatically recover from the server losing its uaid #3695

rfk commented Nov 5, 2020 •

edited by data-sync-user

Loading

rfk commented Jun 25, 2021

jonalmeida commented Jun 25, 2021

rfk commented Jun 26, 2021

jonalmeida commented Jul 8, 2021 •

edited

Loading

tarikeshaq commented Jul 8, 2021

Push component should automatically recover from the server losing its uaid #3695

Push component should automatically recover from the server losing its uaid #3695

Comments

rfk commented Nov 5, 2020 • edited by data-sync-user Loading

rfk commented Jun 25, 2021

jonalmeida commented Jun 25, 2021

rfk commented Jun 26, 2021

jonalmeida commented Jul 8, 2021 • edited Loading

tarikeshaq commented Jul 8, 2021

rfk commented Nov 5, 2020 •

edited by data-sync-user

Loading

jonalmeida commented Jul 8, 2021 •

edited

Loading