Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push component should automatically recover from the server losing its uaid #3695

Closed
rfk opened this issue Nov 5, 2020 · 5 comments · Fixed by #4347
Closed

Push component should automatically recover from the server losing its uaid #3695

rfk opened this issue Nov 5, 2020 · 5 comments · Fixed by #4347
Assignees

Comments

@rfk
Copy link
Contributor

rfk commented Nov 5, 2020

As noted in #3314 (comment) and subsequent comments, if the autopush server receives a 404 or 410 error when trying to delivery a push via one of the mobile bridges, it will completely drop the registration record for that uaid. AFAICT any attempts by the client to operate on that uaid will produce an error.

I don't think the push component currently handles this case well. In fact I don't see any codepaths that would case us to detect that self.uaid has become invalid and is being rejected by the server.

This seems like something that could be handled in verify_connection by:

  • Detecting the "uaid does not exist" error.
  • Registering a new uaid.
  • Reporting all existing subscriptions as expired.

We should also add some handling of this case in the update method, which clients may call when in this state in an attempt to repair this push subscriptions.

┆Issue is synchronized with this Jira Bug
┆Epic: FxA Ecosystem (backlog)
┆Fix Versions: Release 92
┆Sprint End Date: 2021-07-09

@rfk
Copy link
Contributor Author

rfk commented Jun 25, 2021

The docs and structure of the push component and overdue for a bit of an overhaul, we sadly haven't really kept up very good maintenance of it since it was initially written by the push team. However, that's an undertaking far beyond the scope of this bug! (But if you happen to find opportunities to clean up the docs or clarify things as you go here, that's also valuable).

I've added some high-level context-setting in #4303 which I hope will help a bit.

Specifically for this bug, we want to deal with the unhandled error being reported from Fenix users here and here:


W/App     (29404): mozilla.appservices.push.CommunicationError: Error(Communication Error: "Unhandled client error 410 : \"{\\\"code\\\": 410, \\\"errno\\\": 103, \\\"error\\\": \\\"\\\", \\\"more_info\\\": \\\"http://autopush.readthedocs.io/en/latest/http.html#error-codes\\\", \\\"message\\\": \\\"Request did not validate UAID not found\\\"}\"")
W/App     (29404):   at mozilla.appservices.push.RustError.intoException(RustError.kt:13)
W/App     (29404):   at mozilla.appservices.push.PushManager.subscribe(PushManager.kt:13)
W/App     (29404):   at mozilla.components.feature.push.RustPushConnection.subscribe(Connection.kt:4)
W/App     (29404):   at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invokeSuspend(AutoPushFeature.kt:5)
W/App     (29404):   at mozilla.components.feature.push.AutoPushFeature$subscribe$$inlined$ifInitialized$lambda$2.invoke(Unknown Source:10)
W/App     (29404):   at mozilla.components.feature.push.ext.CoroutineScopeKt$launchAndTry$2.invokeSuspend(CoroutineScope.kt:5)

This error occurs when the push server (for whatever reason) throws away our existing registration, forgetting everything about the uaid that we initially created when registering our first subscription here.

Unfortunately the management of uaid is a little bit implicit in the control flow at the moment - it gets created when we register our first subscription and then we assume it will stick around forever. It's also part of the primary key of each subscription record stored in the db. Absent a significant refactoring, I think the only tractable way to recover from the server losing our uaid is to let the app catch it, trigger a call to verifyConnection and do a full state reset and cross-check.

We need the app to be in control of the process here, because we're going to be re-creating a bunch of subscription records, and the app needs to be able to route the updated data to the relevant consumers elsewhere in the app (such as the FxA client, or service workers). This is what the verifyConnection API is designed for - it returns a list of the subscriptions that have changed so that the app can route the new info to the right place.

So, at the top level of a fix here, the app needs to be able to catch this error and call verifyConnection in response. It may be enough to just catch the error as-is, but we could also consider making a dedicated error class like VerificationNeeded to help with this. @jonalmeida thoughts?

Next level down, we need to implement some recovery logic in the underlying Rust verify_connection call. I'll admit I don't fully understand the verification logic there, but maybe something like this will work:

  1. In the underlying Connection::verify_connection call, catch this "UAID not found" error and return false, signalling to the next level up that there is something wrong with the server state.
  2. In the higher-level PushManager::verify_connection when we see that false being returned, drop the the current uaid and its associated push records from our local database, essentially making our local state match the server state by making them both empty.
  3. Continue returning the list of existing subscriptions to calling code like this. Something will need to re-create them, and the process of re-creating these subscriptions will cause us to implicitly create a new uaid value and put the system back in a working state.

Where things get fuzzy for me is in step (3) here. It's not clear to me what the API contract actually is in practice. The Rust code is just returning the list of existing subscriptions, so I guess it's expecting the calling code to re-create each one. But the example Kotlin code here suggests that the returned list contains newly-created subscription info and all the app has to do is route those new details to appropriate places. The wrapper code in a-c certainly seem worded in as way that suggests the subscriptions have already been re-created. But then when it gets turned into an AutoPushSubscriptionChanged notification, that doesn't contain enough info about the subscription for consumers to actually update themselves, so for example the FxA client needs to explicitly call subscribe which will re-create the subscription if one doesn't already exist for that channel.

So...I think we maybe it's fine for PushManager::verify_connection to just delete the old subscriptions from its db and let consumers re-create the ones they need, but I'm not 100% sure. @jonalmeida is the best person to confirm or deny this hypothesis, and we should probably update docs and names throughout the call path here to clarify expectations.

@jonalmeida
Copy link
Collaborator

Will have to come back to this issue after PTO, but I wanted to add one note:

(But if you happen to find opportunities to clean up the docs or clarify things as you go here, that's also valuable).

You might find this meta bug I had filed also valuable #2763 of mostly nits during the initial integration of the push component into AC/Fenix/Fire TV.

@rfk
Copy link
Contributor Author

rfk commented Jun 26, 2021

Thanks @jonalmeida! Sorry for a lack of context on my previous comment - I'm going to be PTO for the week after the wellness week, but we're hoping that @tarikeshaq will be able to spend some time on this bug during that week, so I as leaving as much async context as I could. This can all definitely wait until after PTO.

@jonalmeida
Copy link
Collaborator

jonalmeida commented Jul 8, 2021

So, at the top level of a fix here, the app needs to be able to catch this error and call verifyConnection in response. It may be enough to just catch the error as-is, but we could also consider making a dedicated error class like VerificationNeeded to help with this. @jonalmeida thoughts?

Yes, this sounds fine to me. If it makes it easier, our current logic/contract is to call verify_connection once a day, at best. That might avoid the need to even throw an error that needs to be caught and acted upon. Either way works, and I'm happy to break old contracts to form better ones where needed.

The wrapper code in a-c certainly seem worded in as way that suggests the subscriptions have already been re-created. But then when it gets turned into an AutoPushSubscriptionChanged notification, that doesn't contain enough info about the subscription for consumers to actually update themselves, so for example the FxA client needs to explicitly call subscribe which will re-create the subscription if one doesn't already exist for that channel.

So...I think we maybe it's fine for PushManager::verify_connection to just delete the old subscriptions from its db and let consumers re-create the ones they need, but I'm not 100% sure. @jonalmeida is the best person to confirm or deny this hypothesis, and we should probably update docs and names throughout the call path here to clarify expectations.

In the KDoc, "updated" may not have been the best choice of words in hind sight. My understanding that, verify_connection gives us a list of subscriptions that are alreay trashed.

Then, as you correctly followed in the linked code, we want to use that list to inform our consumers (FxA and service workers) that the subscription they relied on no longer exists, so it's up to them to decide if they want to renew that subscription or not. In FxA's case, we handle that bit of logic in FxaPushSupportFeature. Service workers handle it based on the web app's own logic when they receive this notification that we pass to GeckoView which delivers it to the service worker.

I just remembered that we have a nice little paragraph in the README specifically about the current behaviour and understanding of verify_connection so we don't have to keep that mental model all the time. 😄

@tarikeshaq
Copy link
Contributor

Thanks a ton @jonalmeida!!

I'll probably add a bit more error handling to make sure we can catch the error, but after an initial look I THINK I might have a good theory on why we are ending up with this error in the first place, @jrconlin could possibly verify my theory here 😄

  • When we verify_connection on the rust component, and we find a mismatch, we end up unsubscribing from all channels over here..
  • Then, in unsubscribe we end up sending a network request to the AutoPush server telling it to wipe all records associated with our UAID (and telling it that the UAID is no longer valid), ref, the Unregister UAID docs, which eventually deletes the UAID from the server's database. in the component, the network URL is formated here, note how when the channel_id is None we trigger the Unregister UAID request
  • However, in-memory, our ConnectHttp object still has a uaid that is Some. So, when a future subscribe is triggered, we'll automatically try to attach that uaid to the request, this happens over here.
  • Now, the server was told earlier (in the unsubscribe) that the UAID is invalid, then we trigger a subscribe with that UAID, so naturally, the server complains that UAID not found

A super easy fix is I'll try is to simply set the uaid of the ConnectHttp to None when we unsubscribe from all the channels.

But all that is beside the point that we should be recovering from problems like this automatically - I'll jot down some more thoughts specifically on that and possibly open a draft PR to look at

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants