Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate verify_connection incorrectly returning empty list #3314

Closed
jonalmeida opened this issue Jul 2, 2020 · 28 comments
Closed

Investigate verify_connection incorrectly returning empty list #3314

jonalmeida opened this issue Jul 2, 2020 · 28 comments
Assignees

Comments

@jonalmeida
Copy link
Collaborator

jonalmeida commented Jul 2, 2020

Starting from #3127 (comment), we were trying to find the various causes of push messages not being delivered.

In the investigation ticket in android components, we found some cases (that are now fixed) where we were not calling PushManager.verify_connection at the expected interval. However, we still were seeing an empty list of subscriptions returned back to us when we should have been seeing a list of the expired subscriptions.

With mozilla-services/autopush#1397 resolved, we are now left to investigate the code path in the native Rust push manager.

┆Issue is synchronized with this Jira Task
┆Story Points: 5
┆Epic: FxA Ecosystem (backlog)
┆Sprint: SYNC - end 2020-11-06

@eoger
Copy link
Contributor

eoger commented Jul 9, 2020

@jrconlin I've spent some time this morning auditing the set of Rust functions involved in the verify_connection Kotlin call and found nothing out of the ordinary. Is it possible the Autopush server is returning expired subscriptions, making the Rust code think its local subscriptions are up-to-date?

@data-sync-user data-sync-user assigned eoger and unassigned eoger Jul 9, 2020
@jrconlin
Copy link
Contributor

@eoger The server returns the list of known subscriptions for a given user. These are subscriptions where the UA has not performed an unsubscribe action. Subscriptions do not have an expiration date and should continue to be valid until either the UA sends an unsubscribe or the bridge specifies that the associated bridge token is no longer valid.

In terms of the server code, the verify function fetches the channel list from the server, which reduces to calling the all_channels() function.

The all_channels test is here.

@rfk
Copy link
Contributor

rfk commented Oct 19, 2020

Subscriptions do not have an expiration date and should continue to be valid until either the UA sends an unsubscribe
or the bridge specifies that the associated bridge token is no longer valid.

I wonder if the problem here is that the FxA server comes to believe a subscription has expired, while the push server itself does not.

The logic on the FxA server is:

  • Try to send the push by making a POST request to the provided push subscription URL
  • If this fails with a 404 Not Found or 410 Gone, mark the provided push subscription URL as expired, and refuse to send any more push notifications until the client provides a new one.

Are there circumstances under which the push server might return a 404 or 410, while keeping the underlying subscription alive in its database and returning it in the query above?

@rfk
Copy link
Contributor

rfk commented Oct 19, 2020

Alternately, re-reading mozilla-services/autopush#1397, I wonder if the client is registering a new FCM token and causing its existing subscriptions to "recover" and be useable again, rather than generating a new subscription that it then needs to re-register with the FxA server.

I feel like my mental model of the system here is not quite right, but I'm not sure how to debug it. @jonalmeida could you please help us re-invigorate this bug by summarizing the sequence of events you're expecting to see and steps you're expecting to perform in the client code along this recovery path?

@jonalmeida
Copy link
Collaborator Author

@jonalmeida could you please help us re-invigorate this bug by summarizing the sequence of events you're expecting to see and steps you're expecting to perform in the client code along this recovery path?

Sure! Our expected flow is below:

  1. The app (AC/Fenix) observes the Account Manager and the Device Constellation attached to it for changes (DeviceConstellationObserver).
  2. When we receive an update from the app startup, polling, or the manual 'Sync Now' button clicked, we check the currentDevice in the ConstellationState to see if the subscriptionExpired flag is set.
  3. If it is and if we haven't tried to fix this issue 500+ times in the last 24 hours, we tell the push service (FCM) to delete our app's push token.
  4. This is expected to trigged the push service to provide us a new push token that we submit to the native push layer.
  5. The next time the app is launched, we try to verify that our subscriptions are active, which is expected to fail because we have set a new token.
  6. If this is true, we notify our subscribers (FxA, GeckoEngine) with the onSubscriptionChanged call that tells us to get a new subscription for FxA.

We have verified that at point 4, we are getting a new push token after telling FCM to delete it.

I wonder if the problem here is that the FxA server comes to believe a subscription has expired, while the push server itself does not.

I think you're on to something here! In #2490, we noticed that the native push layer does not communicate with the AutoPush servers in update until we call subscribe to get a new UAID. So my suspicion is that we are seeing the same thing here as well.

In the Android Component, we create a fake subscription on a fresh install to ensure we received the UAID to avoid that particular bug, but maybe we need to always do this hack when we receive a new token to notify the push servers that our token has now changed, and therefore allowing for the condition in point 6 to be true.

I'm going to investigate if that works today!

@jonalmeida
Copy link
Collaborator Author

jonalmeida commented Oct 19, 2020

..maybe we need to always do this hack when we receive a new token to notify the push servers that our token has now changed, and therefore allowing for the condition in point 6 to be true.

I'm going to investigate if that works today!

Unfortunately, this did not work. In my test, I waited for receiving a new push token (similar to point 4 from my comment above), created a fake subscription right after, and then performed a verify_connection call. Unfortunately, we are still getting back no results.

@jrconlin
Copy link
Contributor

Thanks. I feel like there is definitely a server side bug in this too. I've filed mozilla-services/autopush#1442 to track it.

@rfk
Copy link
Contributor

rfk commented Oct 20, 2020

Thank you! Let me see if I can break some of your steps down into some substeps involving the server and the appservices client lib:

  1. (No comment, seems to be working as expected)
  2. (No comment, seems to be working as expected)
  3. (No comment, seems to be working as expected)
  4. This is expected to trigger the push service to provide us a new push token that we submit to the native push layer.
    1. When it receives the new FCM token, the autopush server stores it but does not change any existing registered channels (per the tests added in Ensure that update-token does not delete or alter existing subscriptions for UAID mozilla-services/autopush#1397)
  5. The next time the app is launched, we try to verify that our subscriptions are active, which is expected to fail because we have set a new token.
    1. The app calls verify_connection in the appservices push client, asking for a list of subscriptions which have changed.
    2. The appservices push client reads the list of channels from local storage.
    3. The appservices push client asks the autopush server for the list of all current channels for this uaid.
      1. The autopush server responds with the same list of channels as it had before, since it deliberately didn't change any of the subscriptions in step (4) above.
    4. The appservices push client diffs the list of current channels from the app against the list of current channels from the server, and tells the app which subscriptions have changed.
      1. The channels all match, so it returns an empty list.
  6. If this is true, we notify our subscribers (FxA, GeckoEngine) with the onSubscriptionChanged call that tells us to get a new subscription for FxA.
    1. It's not true; the call to verify_connection says that nothing has changed, so this step doesn't happen and we don't fix up the FxA push subscription.

@jrconlin does that sound right to you from the server's perspective? I'm guessing a little bit at the intent and effects of mozilla-services/autopush#1397.

(If this is accurate, I have some thoughts on how to untangle it, but I don't want to proceed further based on any misunderstandings)

(Edited to clarify that the appservices push component is the thing that list the local channels and provides the list of which have changed)

@data-sync-user data-sync-user assigned rfk and unassigned jdragojevic Oct 21, 2020
@data-sync-user
Copy link
Collaborator

➤ Ryan Kelly commented:

I'm taking this issue and pulling it into the next sprint because I think we have an reinvigorated interest and leads on this.

@jrconlin
Copy link
Contributor

So, some additional musings from pair debugging with @jonalmeida yesterday:

There's a potential for a race condition that I'm not quite sure how to handle. Although it's supposed to be very rare, FCM can change the FCM token at any time. It is up to the client to report this to the server, and that's what the push_update call is for.

When that happens, it's not known if values for the old token are accepted. It could be that there is a race condition between when that new token is provisioned, when the server has been updated to use that token, and when outbound messages are being sent out. If a message is sent out to FCM with an older token, and it's returned with a 404/410, the server presumes that token to no longer be active and all information for that user is dropped (since that user is no longer reachable).

Holding messages for accounts that are being actively rejected in hopes that they might be come back soon would add a fair bit of complexity and some cost to the server. It may be worthwhile for the client to presume that when a new FCM token has been provisioned, and that this is not a new install, it might be worth it for the client to check if there are any pending sync actions, and just presume that there are push messages that have been dropped.

I'm not saying that this is the exact reason for the bug we're seeing, but it is a possible cause.

@jrconlin
Copy link
Contributor

Following up to @rfk's note:

Yes mozilla-services/autopush#1397 was a test to confirm that a token update does not change the value of any existing channels, endpoints or other data that may already be registered. The only value that's changed is the FCM_token passed to FCM and used to identify the recipient process by FCM. This means that all existing endpoints should continue to work. The one caution would be the (hopefully, very low chance) race condition I outlined above, where the server attempts to send a message to the client in the period between the client getting a new FCM token and the server registering it, which would lead to the user data being dropped as invalid.

Perhaps a "pause/resume" function might be useful in that case, but it would be a significant change for the push server.

@jrconlin
Copy link
Contributor

Since there doesn't appear to be any direct work required by the Autopush service at this time, going to de-prioritize the associated bugs for now.

@rfk
Copy link
Contributor

rfk commented Oct 27, 2020

If a message is sent out to FCM with an older token, and it's returned with a 404/410, the server presumes that
token to no longer be active and all information for that user is dropped (since that user is no longer reachable).

@jrconlin could you please link me to the code in the server that handles this case? I had a poke around and I think I was able to find the places in router/fcm_v1.py and router/fcm.py that handle 404/410 from the server, but I wasn't able to follow the logic through to where it drops the user data. I want to make sure I understand all the intended steps here as much as I can.

Are there any circumstances under which the autopush server might report a 404/410 when sending a notification, but not purge user data from the db?

Although it's supposed to be very rare, FCM can change the FCM token at any time. It is up to the client to report
this to the server, and that's what the push_update call is for.

@jonalmeida, a similar question to you for the client-side piece of this - could you please link me to the code that receives this notification and tunnels it through to push_update in the appservices code? If I understand correctly this is the onNewToken method of AbstractFirebasePushService but I wasn't quite sure how that gets hooked up to actually listen for change events. Is it all done automagically via the android manifest as suggested here?

I really want to be able to get my head around all the moving parts here end-to-end.

@rfk
Copy link
Contributor

rfk commented Oct 27, 2020

It may be worthwhile for the client to presume that when a new FCM token has been provisioned,
and that this is not a new install, it might be worth it for the client to check if there are any pending
sync actions, and just presume that there are push messages that have been dropped.

Going one step further, we could have the client assume that it needs to update all of its push subscriptions after getting a new token, rather than the current thing where it tries to cleverly update only those that have changed.

@rfk
Copy link
Contributor

rfk commented Oct 27, 2020

Are there any circumstances under which the autopush server might report a 404/410 when sending a notification,
but not purge user data from the db?

To clarify why I'm asking this: the "autopush server reports a 404/410 when sending a notification" part is definitely happening and happening with some regularity, because this is what causes the FxA server to set the subscriptionExpired flag on the client's device record. The FxA server also emits a specific metrics event in this case.

So I want to confirm our shared understanding that the "all information for that user is dropped" part is also happening with similar regularity.

@rfk
Copy link
Contributor

rfk commented Oct 27, 2020

@jrconlin could you please link me to the code in the server that handles this case?
I had a poke around and I think I was able to find the places in router/fcm_v1.py and router/fcm.py
that handle 404/410 from the server, but I wasn't able to follow the logic through to where it drops the user data

Aha, I think I found it here:

@jonalmeida
Copy link
Collaborator Author

@jonalmeida, a similar question to you for the client-side piece of this - could you please link me to the code that receives this notification and tunnels it through to push_update in the appservices code? If I understand correctly this is the onNewToken method of AbstractFirebasePushService but I wasn't quite sure how that gets hooked up to actually listen for change events. Is it all done automagically via the android manifest as suggested here?

@rfk the AbstractFirebasePushService is a thin wrapper around the Firebase SDK so that we can use different push services. The AbstractFirebasePushService.onNewToken feeds the message to AutoPushFeature.onNewToken.

In that method we save the token to the component's disk cache and also to the native component here. You will notice there is a bit of logic in there for two things:

  1. To wrap the native push manager's class initialization vs "updateToken" call.
  2. To avoid a known crashing bug if the native layer is in a bad state.

You can see the use case for the first point in the AutoPushFeature, where we initialize the native layer with cached token on class initialization.

@rfk
Copy link
Contributor

rfk commented Oct 27, 2020

(Sorry for the stream of consciousness here but it really helps me think to do it out loud).

Thinking through what's supposed to happen when there's some problem with the FCM token, IIUC it's something like this:

  1. The FxA server tries to send a push notification to the device.
  2. Autopush forwards the notification to FCM.
  3. FCM returns a 404 or 410 error response.
  4. Autopush drops the user by deleting that uaid from the "router" table.
  5. Autopush returns the 404 or 410 back to FxA.
  6. FxA marks the device record as subscriptionExpired and logs some metrics about the push failure.
  7. Some time later, the Fenix app discovers that it has subscriptionExpired in its device record, regenerates its FCM token, and calls update to pass it to the server.
  8. The appservices push client library calls the autopush /registration/{uaid} endpoint using the existing uaid value in order to set the new token.
  9. Autopush re-creates the entry for that uaid in the "router" table with the new token.
  10. Fenix queries Autopush for the list of active channels.
  11. Autopush says that there are no active channels, because we dropped all the user data in step (4).
  12. Fenix re-creates channels for its existing subscribers and fires the onSubscriptionChanged handler for each.

Does this match your expectations @jrconlin @jonalmeida?

Note that step (8) resurrects the uaid that was deleted in step (4), and I'm wondering if this resurrection behavior may have unintended consequences. For example, IIUC the list of active channels for a uaid is stored in the "message" table. Might resurrecting the uaid in the "router" table risk resurrecting its previously-active channels and invalidating the assumption at step (11)?

@jonalmeida
Copy link
Collaborator Author

At 12:

Fenix re-creates channels for its existing subscribers and ...

Probably a bit pedantic, but we do not re-create the subscription in the push component layer, instead this is done in the application layer - for Android Fxa, this is the FxaPushSupportFeature that subscribes immediately when notified.

The push layer only notifies the channel subscribers in order to follow the same web content expectation as in the docs.

The FxaPushSupportFeature connects to this notification in the AutoPushFeature. JR and I have verified that the subscriptionChanges.isNotEmpty() check is what fails to get into the code-block.

Everything else looks correct to me. 🙂

@jrconlin
Copy link
Contributor

Yeah, there are a lot of moving parts at play here. (Honestly, if I had to do it again, I'd absolutely only use the native push system as a "wake up" and then ask the client to create a websocket connection back to our servers whenever possible. There would still be a few issues around how iOS would have to handle things, but we could treat that as the one off rather than the rule.

@jonalmeida
Copy link
Collaborator Author

jonalmeida commented Nov 3, 2020

While testing my changes for mozilla-mobile/android-components#8846 (comment) I encountered a strange result that may be related to this issue as well. Below is my expected flow:

  1. Sign into Sync - verify a push subscription is created.
  2. Send a tab (verify it's received via push)
  3. Force FCM token registration renewal (verify a new token is received and updated to a-s layer)
  4. Execute verify_connection to assert that the push subscription from step 1 was invalidated.
  5. Send a tab (verify NO tab is received)

What I see instead is:

  1. Sign into Sync - verify a push subscription is created.
2020-11-02 22:23:02.256 org.mozilla.fenix.debug I/AccountObserver: Created a new subscription: AutoPushSubscription(scope=fxa_push_scope_d394d7b229b4433aa0aa9022a8a88244, endpoint=https://updates.push.services.mozilla.com/wpush/v1/gAAAAABfoM0XMubi5wgho5jJv1bTSGKshyybqJZsR1u5w_ehh4JHCv8MrtUT9ga2zVGpxbztiWyPv0Qoj9AForEo8vUjIkJIs-q0ykKZap0LZZU9y7lke_go7EKLDl0ipoUCStb6dD4i, publicKey=BGxp4DGqTP51Q6XdFjMu3Q0uB0k3S9gGrovy4c6bOWsz8gRhwKesozYbMPh9sHGuDAbDFOr3Qu9QqrgKVDADzIM, authKey=7X4MdmDUBoALHhbXgboASg, appServerKey=null)
2020-11-02 22:23:02.326 org.mozilla.fenix.debug D/fxaclient_ffi: fxa_set_push_subscription
  1. Send a tab (verify it's received via push)
2020-11-02 22:23:51.034 org.mozilla.fenix.debug I/AutoPushFeature: New push message decrypted.
2020-11-02 22:23:51.034 org.mozilla.fenix.debug I/[INFO][Leanplum]: [com.leanplum.LeanplumPushFirebaseMessagingService::onMessageReceived::74]: Received: {con=aes128gcm, body=<<encrypted body>>, chid=e35d4eb45b1f3c98926af593dd72ff39}
2020-11-02 22:23:51.035 org.mozilla.fenix.debug D/push_ffi: push_dispatch_info_for_chid
2020-11-02 22:23:51.037 org.mozilla.fenix.debug D/push_ffi: push_decrypt
  1. Force FCM token registration renewal (verify a new token is received and updated to a-s layer)
2020-11-02 22:24:23.608 org.mozilla.fenix.debug I/AbstractFirebasePushService: Got new Firebase token: f3R_ZkADzlI:APA91bF8VD4vFTOpnN1ov4mOI49wYHklNRrTBctYpkXBRiTAXRzmDcbfu8TCLUKVJxsb9vVV5Ck8e3ojIPxBybmRYk-VtIxvvejPh0NoywXo2xN5oj9WONDHAGqeIQ0M1wzuNplljb_s
2020-11-02 22:24:23.609 org.mozilla.fenix.debug I/AutoPushFeature: Received a new registration token from push service.
2020-11-02 22:24:23.611 org.mozilla.fenix.debug D/push_ffi: push_update
  1. Execute verify_connection to assert that the push subscription from step 1 was invalidated.
2020-11-02 22:25:33.621 org.mozilla.fenix.debug I/AutoPushFeature: Trying to check validity of push subscriptions.
2020-11-02 22:25:33.621 org.mozilla.fenix.debug I/AutoPushFeature: Checking now..
2020-11-02 22:25:33.622 org.mozilla.fenix.debug D/push_ffi: push_verify_connection
2020-11-02 22:25:34.578 org.mozilla.fenix.debug I/AutoPushFeature: No change to subscriptions. Doing nothing.
  1. Send a tab (verify NO tab is received ⚠️)
2020-11-02 22:26:03.674 org.mozilla.fenix.debug I/AutoPushFeature: New push message decrypted.
2020-11-02 22:26:03.675 org.mozilla.fenix.debug I/[INFO][Leanplum]: [com.leanplum.LeanplumPushFirebaseMessagingService::onMessageReceived::74]: Received: {con=aes128gcm, body=<<encrypted body>>, chid=e35d4eb45b1f3c98926af593dd72ff39}
2020-11-02 22:26:03.675 org.mozilla.fenix.debug D/push_ffi: push_dispatch_info_for_chid
2020-11-02 22:26:03.677 org.mozilla.fenix.debug D/push_ffi: push_decrypt

By step 5, the push subscription on the Autopush server should have invalidated the previous endpoint so the second send tab would have failed to send, thus flipping the subscriptionExpired flag as expected in my comment above: #3314 (comment)

This seems to be another side-effect from verify_connection called after update_token aside from returning an empty list.

As a result, I'm not sure now how to reproduce a realistic broken state between Autopush and FxA.

@rfk
Copy link
Contributor

rfk commented Nov 4, 2020

Force FCM token registration renewal (verify a new token is received and updated to a-s layer)

@jonalmeida how does one do this in practice? I'll try to spend some time today setting myself up to reproduce this issue based on the links you shared with me previously.

@rfk
Copy link
Contributor

rfk commented Nov 4, 2020

  1. Force FCM token registration renewal (verify a new token is received and updated to a-s layer)
  2. Execute verify_connection to assert that the push subscription from step 1 was invalidated.

My understanding from @jrconlin's comments (e.g. #3314 (comment)) is that 4. is not accurate. Having the client register a new FCM token doesn't invalidate any existing subscriptions, and indeed the server deliberately tries to keep existing subscriptions alive when changing the FCM token.

@jonalmeida
Copy link
Collaborator Author

Force FCM token registration renewal (verify a new token is received and updated to a-s layer)

@jonalmeida how does one do this in practice? I'll try to spend some time today setting myself up to reproduce this issue based on the links you shared with me previously.

This is done in AutoPushFeature:

override fun renewRegistration() {
    logger.warn("Forcing registration renewal by deleting our (cached) token.")

    // Remove the cached token we have.
    deleteToken(context)

    // Tell the service to delete the token as well, which will trigger a new token to be
    // retrieved the next time it hits the server.
    service.deleteToken()

    // Starts the service if needed to trigger a new registration.
    service.start(context)
}

deleteToken(context) removes the token that we have cached in the a-c layer. This cache is used for initializing the rust client on a cold boot.

service.deleteToken() tells the push service (FCM) to delete the token in their client cache/server. This converts to the Firebase API call here: FirebaseInstanceId.getInstance().deleteInstanceId()

service.start(context) is called for safety in case somewhere in the app code, we have called service.stop which would not allow us to receive a new FCM push token. It is safe to call this method several times.

Deleting the FCM token instantly triggers the SDK to retrieve a new token from the FCM servers and trigger the onNewToken method which then updates the rust client.

My understanding from jrconlin's comments (e.g. #3314 (comment)) is that 4. is not accurate. Having the client register a new FCM token doesn't invalidate any existing subscriptions, and indeed the server deliberately tries to keep existing subscriptions alive when changing the FCM token.

Ah my mistake, in that case ignore my last comment! Although, that makes it harder to verify if we get a new subscription endpoint from Autopush when FxA sets the subscriptionExpired flag (this is a separate problem for another issue).

@rfk
Copy link
Contributor

rfk commented Nov 4, 2020

I spent a few quality hours with this bug today, but I think I'm now even more confused.

My sequence of events from #3314 (comment) can't play out quite as I imagined, specifically these steps:

  1. The appservices push client library calls the autopush /registration/{uaid} endpoint using the existing uaid value in order to set the new token.
  2. Autopush re-creates the entry for that uaid in the "router" table with the new token.

When the appservices push client tries to update the token for a user that has been dropped, its request will fail with an auth error because the push server has thrown away the record for the uaid. So step (9) doesn't happen...but then, I don't really know how the client is supposed to recover from this state at all. I don't see any codepaths by which the appservices push client can change its uaid if that record has been dropped by the server in the manner suggested by #3314 (comment).

@rfk
Copy link
Contributor

rfk commented Nov 5, 2020

Thanks for the long and detailed discussion in this bug folks! However, I think it has served its purpose and is growing beyond the point where we can keep all the threads straight. I've filed several follow-up bugs based on the discussion here:

The bug as titled here has been completed; verify_connection is indeed returning an empty list, but as far as we can tell it is correct for it to do so, since the autopush server still believes the subscriptions to be valid. Whatever the problem is that prevents recovering from an invalid FxA device subscription, it's happening earlier in the process than the verify_connection call. We can continue investigating it upstream (e.g. mozilla-services/autopush#1444) or we can try to work around it on the client (something like #3696) but there's nothing further to do in this bug.

@rfk rfk closed this as completed Nov 5, 2020
@jonalmeida
Copy link
Collaborator Author

jonalmeida commented Nov 6, 2020

verify_connection is indeed returning an empty list, but as far as we can tell it is correct for it to do so, since the autopush server still believes the subscriptions to be valid.

@jrconlin we saw that when we made the autopush server tell us that a subscription was dropped, verify connection still came up with an empty list, right? It was a few weeks ago and my memory doesn't hold. I realized now that we didn't keep a log of that test.

@rfk
Copy link
Contributor

rfk commented Nov 6, 2020

when we made the autopush server tell us that a subscription was dropped,

Oh, interesting, I did not pick up on this in the thread. How did you make it do this? I can try to repro locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants