ios: App lifecycle & network condition handling #128
Posting the findings from the analysis I did locally.
Environment
iPhone X, iOS 12.3.1, Verizon / Google Fi
Experiment
Created a modified version of the Lyft passenger app which included the On app launch, started up Envoy with a configuration which re-routed all traffic from Under the hood, the Lyft networking stack called into a single
Findings
Initially, I started the app and allowed network traffic to flow for a while, observing successes coming back in HTTP responses as expected. Next, I started toggling the phone from wifi to cellular. This resulted in nearly no change in responses coming back from the server, seemingly having no effect on Envoy Mobile. Doing this numerous times yielded the same result:
This seemed a little strange to me, since we've previously observed issues with the cellular radio in gRPC that caused network requests to hang when the phone switched between wifi/cellular. The cause is that the gRPC core library uses BSD sockets to communicate over the internet (rather than the native iOS networking stack). We currently work around this socket issue within the Lyft app and upstream in gRPC by starting a new gRPC channel when we observe changes to the network such as toggling between wifi/cellular or 3G/LTE. My theory is that we don't see this issue with Envoy Mobile because it's listening to traffic that's sent through
To test this theory, I ran requests through both URLSession/Envoy and gRPC, and disabled the gRPC channel restarts to better understand how calling BSD sockets directly (without
Additionally, backgrounding/foregrounding the app seems to have no effect on requests/responses sent through Envoy Mobile via
Conclusions
|
Another theory that was discussed offline is that Envoy's 5s DNS refresh could be "fixing" the HTTP issues that gRPC encounters using BSD sockets. To see if this was the case, I applied the |
Resolves #128. Signed-off-by: Michael Rebello <[email protected]>
Docs: #224 |
@mattklein123 @junr03 following up on the Slack discussion here: I ran Envoy Mobile again alongside gRPC Swift - this time with the log level set to Let me know what you think of those logs. Key things to look for:
I did notice quite a few logs that look potentially interesting that happen after the switch from WiFi:
|
I haven't looked through the logs yet, but my suspicion is the difference here is that the gRPC client is using blocking IO with BSD sockets whereas Envoy is using non-blocking IO and kqueue. My guess is that the async nature of the IO is causing the kernel to handle it differently. If it's true that the gRPC client is using blocking IO, this is probably what I would discuss w/ the Apple folks. |
From looking at the logs, initial impressions:
I think this might be a red herring. Every time Envoy reads from the raw buffer, it reports a successful read followed by "resource temporarily unavailable". I have not read the codepath closely yet, but my main point is that this does not happen only when the network transition happens. It seems that there were no Envoy requests in flight when the network transition happened. However, it looks, via connection ID (C), like connections do get recycled across network transitions. I also looked for socket close events, and the three events that I saw happened after both network transitions had already occurred. So that leaves me with no good clues about how Envoy is handling the network transition events smoothly, especially since it seems to be using the same connections across the transition. |
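For readers unfamiliar with the "successful read followed by resource temporarily unavailable" pattern mentioned above: that is the usual shape of a non-blocking read loop, which keeps reading until the kernel reports EAGAIN/EWOULDBLOCK. The following is only a minimal, platform-neutral sketch in Python (not Envoy's actual code path); `drain` and `demo` are hypothetical names, and a local `socketpair` stands in for a real network connection.

```python
import socket


def drain(sock, chunk_size=4096):
    """Read from a non-blocking socket until no more data is available.

    Returns (data, reason), where reason is "would_block" when the kernel
    reported EAGAIN/EWOULDBLOCK, or "closed" when the peer closed the socket.
    """
    chunks = []
    while True:
        try:
            data = sock.recv(chunk_size)
        except BlockingIOError:
            # EAGAIN / EWOULDBLOCK ("resource temporarily unavailable"):
            # not an error, just nothing left to read right now.
            return b"".join(chunks), "would_block"
        if not data:
            # A zero-length read means the peer closed the connection.
            return b"".join(chunks), "closed"
        chunks.append(data)


def demo():
    # Hypothetical stand-in for a network connection: a local socket pair.
    writer, reader = socket.socketpair()
    reader.setblocking(False)
    writer.sendall(b"hello")
    result = drain(reader)
    writer.close()
    reader.close()
    return result
```

In this sketch every successful read is followed by one "would block" result, regardless of any network transition, which is consistent with the observation that the log line is probably a red herring.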
@Reflejo and I paired and dug into this a bit more, and we believe we have a good understanding of what's going on now.
Experiment
Built the Envoy Mobile library using the following flags to allow us to build to a device with debugging symbols:
Ran an example app on an iPhone X that:
In the active scheme of the app's Environment Variables, set We then slowly made network requests, one at a time, through the app.
Findings
Whenever we switched from WiFi to cellular (or vice versa), the next 1 request would consistently fail with code When we executed the next request, it would complete successfully. If we did this more quickly and sent several requests in rapid succession, the first would still fail and the subsequent requests would complete normally. Setting the
Analysis
The current setup looks like this:
The experiment above indicates that when the working connection changes to inactive (i.e., disabling WiFi and forcing the phone to switch to cellular), the sockets aren't notified of the change. This is a commonly understood issue with BSD sockets on iOS, and is why Apple strongly advises against using them. Executing another network request over
Essentially,
Additionally, putting the phone in airplane mode results in all requests failing immediately (instead of waiting for the specified timeout) because iOS is aware that it has no connectivity.
Next steps
It's very clear that Envoy will need to have good support for these kinds of network transitions before it goes to production. I propose the following next steps as a way to proceed:
Thoughts @mattklein123 @junr03 @goaway? |
Thanks for doing this investigation @rebello95 and @Reflejo! I think observing network changes is necessary no matter what (1.), since the CF interfaces (2.) will get us a socket on the current "right" or preferred interface, but won't automatically signal the switch (IIRC). Network.framework (3.) lets us specify interfaces explicitly, which is nice because we don't have to infer them, but I don't know if there's any additional baked-in reachability signaling. Ultimately, internally what I think we'll do is classify and maintain connections (while viable) on both interfaces and have strategies (e.g. reachability signals or observed health) for favoring them. (This is already similar to Envoy's connection pool model and I think we can likely adapt it to support this.) |
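To make the "maintain connections on both interfaces and favor them by reachability or observed health" idea above a bit more concrete, here is a rough sketch in Python. It is not based on any existing Envoy or iOS API: `InterfaceState`, `pick_interface`, the success-ratio health metric, and the Wi-Fi tie-break are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterfaceState:
    """Hypothetical per-interface bookkeeping (not an Envoy type)."""
    name: str
    reachable: bool = True
    successes: int = 0
    failures: int = 0

    def health(self) -> float:
        total = self.successes + self.failures
        # With no observations yet, assume the interface is healthy.
        return 1.0 if total == 0 else self.successes / total


def pick_interface(states, preferred="wifi"):
    """Pick the interface to favor for the next request.

    Unreachable interfaces are excluded; among the rest, the highest
    observed health wins and ties go to the preferred interface.
    Returns None when nothing is reachable (e.g. airplane mode).
    """
    candidates = [s for s in states if s.reachable]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: (s.health(), s.name == preferred))
    return best.name
```

A real strategy would likely weigh recency, latency, and cost (cellular data) as well; the point here is only that reachability acts as a hard filter while observed health acts as a ranking.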
Awesome work @Reflejo @rebello95! This all makes sense to me now. Can we put together a short design doc on iOS transport socket options? Agreed we can look at stopgap hacks for alpha, including possibly even just tearing down and recreating Envoy on interface changes. I would also recommend looking at what cronet does as a data point. Whatever we do we will need to figure out the right way to bridge the iOS platform APIs with Envoy's kqueue based event loop. |
I actually think we should prioritize this conversation ASAP. I don't see any easy way to support high(er) level APIs on iOS with Envoy, so I agree we should put together a design doc on the options. I think the short-term solution is to observe reachability and close the socket, and to test thoroughly. The longer term is what we should think about, and it's something I'm a bit concerned about. Ideally we should use Since they are moving to user-space, the abstractions we have in Envoy right now don't map very well to it. We need to think through it. |
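The short-term approach described above (observe reachability, close the socket, let the next request reconnect) can be sketched as follows. This is a platform-neutral Python simulation under stated assumptions, not Envoy Mobile code: `FakeConnection`, `ReconnectingClient`, and `on_network_change` are hypothetical names, and on iOS the change signal would come from a platform reachability API that is not shown here.

```python
class FakeConnection:
    """Hypothetical stand-in for a socket bound to one network interface."""

    def __init__(self, interface):
        self.interface = interface
        self.open = True

    def close(self):
        self.open = False


class ReconnectingClient:
    """Drops its connection whenever the active interface changes."""

    def __init__(self, initial_interface):
        self.interface = initial_interface
        self.connection = None
        self.connections_opened = 0

    def on_network_change(self, new_interface):
        # Called by a (hypothetical) reachability observer.
        if new_interface == self.interface:
            return  # No actual transition; keep the existing connection.
        self.interface = new_interface
        if self.connection is not None:
            # Proactively close the stale socket instead of letting the
            # next request fail or hang on it.
            self.connection.close()
            self.connection = None

    def send(self, payload):
        # Lazily (re)connect on the currently active interface.
        if self.connection is None:
            self.connection = FakeConnection(self.interface)
            self.connections_opened += 1
        return (self.connection.interface, payload)
```

With this shape, the first request after a WiFi/cellular switch goes out on a fresh connection rather than failing on the old one, which is the behavior the experiment above found missing.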
Those are great callouts, and I definitely think this is something that we need to (and are) prioritizing. @goaway and @junr03 will be heading up the work to implement the socket layer on iOS, which will be tracked in #13. After discussing in planning today, we're going to have a doc with basic proposed options by EOW, and will start implementation in early August when @junr03 gets back from OOO. |
Updating based on #128 (comment). Signed-off-by: Michael Rebello <[email protected]>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
@rebello95 going to close this in favor of keeping #13 open. |
Resolves envoyproxy/envoy-mobile#128. Signed-off-by: Michael Rebello <[email protected]> Signed-off-by: JP Simard <[email protected]>
Updating based on envoyproxy/envoy-mobile#128 (comment). Signed-off-by: Michael Rebello <[email protected]> Signed-off-by: JP Simard <[email protected]>
React to changes by restarting Envoy core - shutting down connections, etc. as necessary
Google doc reference here.
Android issue: #129