ios: App lifecycle & network condition handling #128

Closed
rebello95 opened this issue Jun 17, 2019 · 13 comments · Fixed by #224

@rebello95
Contributor

rebello95 commented Jun 17, 2019

  • Envoy Mobile will need to be aware of certain application lifecycle events (such as backgrounding/foregrounding, termination, etc.) in order to appropriately shut down connections, restart, or tell the OS that the app has requests in flight while it’s backgrounded.
  • These should be handled entirely by the native Envoy Mobile layers, and will likely perform the following tasks:
      • Upon instantiation/startup, begin observing application-level lifecycle events
      • React to changes by restarting Envoy core - shutting down connections, etc. as necessary
      • Ensure that the Envoy core is able to make requests when the app is backgrounded

Google doc reference here.
Android issue: #129

@rebello95 rebello95 added the platform/ios Issues related to iOS label Jun 17, 2019
@rebello95 rebello95 added this to the v0.2 "Primo" milestone Jun 17, 2019
@rebello95 rebello95 self-assigned this Jun 17, 2019
@rebello95 rebello95 changed the title from "ios: App lifecycle handling" to "ios: App lifecycle & network condition handling" Jun 17, 2019
@rebello95
Contributor Author

rebello95 commented Jun 28, 2019

Posting the findings from the analysis I did locally.

Environment

iPhone X, iOS 12.3.1, Verizon / Google Fi

Experiment

Created a modified version of the Lyft passenger app which included the Envoy.framework artifact.

On app launch, started up Envoy with a configuration which re-routed all traffic from api.lyft.com through Envoy using http://localhost:9001.

Under the hood, the Lyft networking stack called into a single URLSession instance for this traffic, which was then sent through Envoy.

Findings

Initially, I started the app and allowed network traffic to flow for a while, observing successes coming back in HTTP responses as expected.

Next, I started toggling the phone from wifi to cellular. This resulted in nearly no change in responses coming back from the server, seemingly having no effect on Envoy Mobile. Doing this numerous times yielded the same result:

 1. **HTTP		[OK] #130 IngestLocationsResponseDTO
 2. **HTTP		[OK] #129 ReadDropoffTimesResponseDTO
 3. **HTTP		[OK] #131 PingResponseDTO
 4. **Network	changed wifiToCellular
 5. **HTTP		[OK] #135 PingResponseDTO
 6. **HTTP		[OK] #136 OffersDTO
 7. **HTTP		[OK] #138 IngestLocationsResponseDTO
 8. **HTTP		[OK] #137 ReadDropoffTimesResponseDTO
 9. **HTTP		[OK] #139 PingResponseDTO
10. **HTTP		[OK] #140 OffersDTO
11. **HTTP		[OK] #141 ReadDropoffTimesResponseDTO
12. **Network	changed cellularToWifi
13. **HTTP		[OK] #142 IngestLocationsResponseDTO
14. **HTTP		[OK] #143 PingResponseDTO
15. **HTTP		[OK] #134 IngestLocationsResponseDTO
16. **HTTP		[OK] #144 PingResponseDTO
17. **Network	changed wifiToCellular
18. **HTTP		[OK] #147 IngestLocationsResponseDTO
19. **HTTP		[OK] #146 ReadDropoffTimesResponseDTO
20. **HTTP		[OK] #148 PingResponseDTO
21. **HTTP		[OK] #149 OffersDTO
22. **HTTP		[OK] #150 ReadTripsResponseDTO
23. **HTTP		[OK] #152 IngestLocationsResponseDTO
24. **HTTP		[OK] #153 PingResponseDTO
25. **HTTP		[OK] #154 OffersDTO
26. **HTTP		[OK] #155 ReadDropoffTimesResponseDTO

This seemed a little strange to me, since we've previously observed issues with the cellular radio in gRPC: network requests would hang when the phone switched between wifi and cellular, because the gRPC core library uses BSD sockets to communicate over the internet (rather than the native iOS networking stack).

We currently work around this socket issue within the Lyft app and upstream in gRPC by starting a new gRPC channel when we observe changes to the network such as toggling between wifi/cellular or 3G/LTE. My theory is that we don't see this issue with Envoy Mobile because it's listening to traffic that's sent through URLSession, rather than replacing URLSession entirely.

To test this theory, I ran requests through both URLSession/Envoy and gRPC, and disabled the gRPC channel restarts to better understand how calling BSD sockets directly (without URLSession) affects the traffic. The results I saw indicate that my theory appears to be correct:

  • gRPC requests hang (no responses are received after switching from wifi to cellular), while requests made through URLSession/Envoy continue to succeed normally
  • Below, you can see responses received normally from both sources until line 26 when the phone is toggled to cellular
  • At this point, responses to pings over gRPC stall
  • HTTP responses continue coming back in the correct order they were sent (as indicated by the # values)
  • When the phone switches back from cellular to wifi on line 40, the pings from gRPC come back over the wire, and HTTP traffic remains uninterrupted
 1. **GRPC		pinging upstream (expecting .noop back)
 2. **GRPC		received control: noop
 3. **GRPC		pinging upstream (expecting .noop back)
 4. **GRPC		received control: noop
 5. **HTTP		[OK] #136 OffersDTO
 6. **HTTP		[OK] #137 IngestLocationsResponseDTO
 7. **HTTP		[OK] #139 PingResponseDTO
 8. **GRPC		pinging upstream (expecting .noop back)
 9. **GRPC		received control: noop
10. **HTTP		[OK] #138 ReadDropoffTimesResponseDTO
11. **GRPC		pinging upstream (expecting .noop back)
12. **GRPC		received control: noop
13. **GRPC		pinging upstream (expecting .noop back)
14. **GRPC		received control: noop
15. **GRPC		pinging upstream (expecting .noop back)
16. **GRPC		received control: noop
17. **GRPC		pinging upstream (expecting .noop back)
18. **HTTP		[OK] #140 OffersDTO
19. **GRPC		received control: noop
20. **HTTP		[OK] #141 IngestLocationsResponseDTO
21. **HTTP		[OK] #143 PingResponseDTO
22. **GRPC		pinging upstream (expecting .noop back)
23. **GRPC		received control: noop
24. **HTTP		[OK] #142 ReadDropoffTimesResponseDTO
25. **GRPC		pinging upstream (expecting .noop back)
26. **Network	changed wifiToCellular
27. **GRPC		pinging upstream (expecting .noop back)
28. **GRPC		pinging upstream (expecting .noop back)
29. **GRPC		pinging upstream (expecting .noop back)
30. **GRPC		pinging upstream (expecting .noop back)
31. **GRPC		pinging upstream (expecting .noop back)
32. **HTTP		[OK] #147 ReadDropoffTimesResponseDTO
33. **GRPC		pinging upstream (expecting .noop back)
34. **GRPC		pinging upstream (expecting .noop back)
35. **GRPC		pinging upstream (expecting .noop back)
36. **HTTP		[OK] #148 OffersDTO
37. **HTTP		[OK] #149 PingResponseDTO
38. **HTTP		[OK] #150 IngestLocationsResponseDTO
39. **HTTP		[OK] #146 IngestLocationsResponseDTO
40. **Network	changed cellularToWifi
41. **GRPC		pinging upstream (expecting .noop back)
42. **GRPC		received control: noop
43. **GRPC		received control: noop
44. **GRPC		received control: noop
45. **GRPC		received control: noop
46. **GRPC		received control: noop
47. **GRPC		received control: noop
48. **GRPC		received control: noop
49. **GRPC		received control: noop
50. **GRPC		received control: noop
51. **GRPC		received control: noop
52. **HTTP		[OK] #145 PingResponseDTO
53. **HTTP		[OK] #151 ReadDropoffTimesResponseDTO
54. **GRPC		pinging upstream (expecting .noop back)
55. **GRPC		received control: noop
56. **HTTP		[OK] #152 PingResponseDTO
57. **GRPC		pinging upstream (expecting .noop back)
58. **GRPC		received control: noop
59. **GRPC		pinging upstream (expecting .noop back)
60. **GRPC		received control: noop
61. **GRPC		pinging upstream (expecting .noop back)
62. **HTTP		[OK] #153 OffersDTO
63. **GRPC		received control: noop
64. **HTTP		[OK] #154 ReadTripsResponseDTO
65. **HTTP		[OK] #155 IngestLocationsResponseDTO
66. **GRPC		pinging upstream (expecting .noop back)
67. **GRPC		received control: noop
68. **GRPC		pinging upstream (expecting .noop back)
69. **GRPC		received control: noop
70. **HTTP		[OK] #156 ReadDropoffTimesResponseDTO
71. **GRPC		pinging upstream (expecting .noop back)
72. **HTTP		[OK] #157 PingResponseDTO
73. **GRPC		received control: noop
74. **GRPC		pinging upstream (expecting .noop back)
75. **GRPC		received control: noop
76. **GRPC		pinging upstream (expecting .noop back)
77. **GRPC		received control: noop
78. **HTTP		[OK] #158 OffersDTO
79. **HTTP		[OK] #160 IngestLocationsResponseDTO
80. **GRPC		pinging upstream (expecting .noop back)
81. **GRPC		received control: noop
82. **HTTP		[OK] #159 ReadDropoffTimesResponseDTO
83. **GRPC		pinging upstream (expecting .noop back)
84. **GRPC		received control: noop
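
The stall described above can also be checked mechanically by counting unanswered gRPC pings at each network change. A minimal Python sketch, assuming the log format shown in this comment (a `**GRPC`, `**HTTP`, or `**Network` tag followed by a message). `count_outstanding_pings` and `SAMPLE` are hypothetical names, and the sample is an abbreviated stand-in rather than the full log:

```python
def count_outstanding_pings(lines):
    """Return (network event, unanswered gRPC pings) for each network change."""
    outstanding = 0   # pings sent that have not been answered yet
    snapshots = []
    for line in lines:
        if "**GRPC" in line and "pinging upstream" in line:
            outstanding += 1
        elif "**GRPC" in line and "received control: noop" in line:
            outstanding = max(0, outstanding - 1)
        elif "**Network" in line:
            # The last token is the transition name, e.g. wifiToCellular.
            snapshots.append((line.split()[-1], outstanding))
    return snapshots


# Abbreviated sample in the same format as the log above (not the full log).
SAMPLE = """\
**GRPC    pinging upstream (expecting .noop back)
**GRPC    received control: noop
**GRPC    pinging upstream (expecting .noop back)
**Network changed wifiToCellular
**GRPC    pinging upstream (expecting .noop back)
**GRPC    pinging upstream (expecting .noop back)
**HTTP    [OK] #147 ReadDropoffTimesResponseDTO
**Network changed cellularToWifi
"""

snapshots = count_outstanding_pings(SAMPLE.splitlines())
```

A count that keeps growing between a `wifiToCellular` and the following `cellularToWifi` entry is the signature of the hang: pings are being sent but nothing is coming back.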

Additionally, backgrounding/foregrounding the app seems to have no effect on requests/responses sent through Envoy Mobile via URLSession.

Conclusions

  • The current configuration of sending traffic over URLSession and having Envoy proxy it seems to yield no issues on iOS when switching between wifi/cellular or background/foreground
  • This is likely because URLSession is doing something smart under the hood to handle these changes
  • I would recommend re-running these tests when we switch to calling Envoy Mobile directly as a library (rather than running on top of calls to URLSession)
  • Depending on the outcome of transport: native sockets on iOS/Android #13, we probably won't have issues as long as we use Apple-approved network solutions for the transport on iOS (such as CFNetwork/Network.framework/etc.)

@rebello95
Contributor Author

rebello95 commented Jul 2, 2019

Another theory that was discussed offline is that Envoy's 5s DNS refresh could be "fixing" the HTTP issues that gRPC encounters using BSD sockets.

To see if this was the case, I applied the config.yaml set from #218 to my local build (essentially slowing down DNS refreshes to 60s), and was able to see requests going through the same way as before. Thus, it still looks like URLSession (rather than DNS refreshing) is responsible for the fact that Envoy can make requests when switching between wifi/cellular.

rebello95 added a commit that referenced this issue Jul 3, 2019
Resolves #128.

Signed-off-by: Michael Rebello <[email protected]>
@rebello95
Contributor Author

Docs: #224

rebello95 added a commit that referenced this issue Jul 4, 2019
Resolves #128.

Signed-off-by: Michael Rebello <[email protected]>
@rebello95
Contributor Author

@mattklein123 @junr03 following up on the Slack discussion here:

I ran Envoy Mobile again alongside gRPC Swift - this time with the log level set to trace. The output is available in this gist. The hope was to identify if Envoy is doing something that would cause it to work properly when switching between WiFi/cellular even though gRPC (which uses BSD sockets) breaks.

Let me know what you think of those logs. Key things to look for:

  • **NETWORK is logged when the device switches between WiFi/cellular
  • **STREAM is a gRPC stream that is maintained throughout the lifetime of the app (never manually restarted)
  • **UNARY is a gRPC unary request sent over the same channel as the streaming requests
  • **HTTP are HTTP unary requests sent over URLSession/Envoy

I did notice quite a few logs that look potentially interesting that happen after the switch from WiFi:

  • socket event: 1
  • read error: Resource temporarily unavailable
  • new stream

@mattklein123
Member

I haven't looked through the logs yet, but my suspicion is the difference here is that the gRPC client is using blocking IO with BSD sockets whereas Envoy is using non-blocking IO and kqueue. My guess is that the async nature of the IO is causing the kernel to handle it differently. If it's true that the gRPC client is using blocking IO, this is probably what I would discuss w/ the Apple folks.
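
To make the distinction concrete, here is a small Python sketch (not gRPC or Envoy code) of the two IO styles. A blocking read on a silent socket parks the caller until a timeout, while a non-blocking socket polled through a readiness API hands control back to the event loop. Python's `selectors.DefaultSelector` picks kqueue on macOS/BSD and epoll on Linux; the local socket pair and the 0.1s timeouts are illustrative assumptions.

```python
import selectors
import socket


def blocking_read(sock, timeout=0.1):
    """Blocking style: the caller waits inside recv() until data or a timeout."""
    sock.settimeout(timeout)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None  # nothing arrived; the caller was parked for `timeout` seconds


def event_loop_read(sock, timeout=0.1):
    """Non-blocking style: ask the kernel which sockets are ready, then read them."""
    sock.setblocking(False)
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        ready = sel.select(timeout)  # empty list when no socket is readable
        return [key.fileobj.recv(1024) for key, _ in ready]


a, b = socket.socketpair()
silent = blocking_read(a)      # the peer has sent nothing yet
idle = event_loop_read(a)      # still nothing, but the loop is free to do other work
b.sendall(b"noop")
woken = event_loop_read(a)     # a readiness event fires once data is queued
a.close()
b.close()
```

This only illustrates the two styles; it does not show how the kernel treats either one across an interface change, which is the open question here.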

@junr03
Member

junr03 commented Jul 11, 2019

From looking at the logs, initial impressions:

I did notice quite a few logs that look potentially interesting that happen after the switch from WiFi

I think this might be a red herring. Every time Envoy reads from the raw buffer it reports a successful read followed by resource temporarily unavailable. I have not read the codepath closely yet. But my main point is that that is not happening only when the network transition happens.
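
One likely reason for this pattern, offered as an assumption rather than something confirmed from the Envoy codepath: a non-blocking socket is usually read in a loop until the kernel reports EAGAIN, whose message on Linux and macOS is "Resource temporarily unavailable". A minimal Python sketch of such a drain loop, using a local socket pair as a stand-in for the connection:

```python
import errno
import os
import socket


def drain(sock):
    """Read until the kernel has nothing more; return (data, errno that ended the loop)."""
    chunks = []
    while True:
        try:
            chunk = sock.recv(4)          # tiny buffer to force several reads
        except BlockingIOError as err:    # EAGAIN/EWOULDBLOCK: the buffer is empty
            return b"".join(chunks), err.errno
        if not chunk:                     # b"" means the peer closed the socket
            return b"".join(chunks), 0
        chunks.append(chunk)


a, b = socket.socketpair()
a.setblocking(False)
b.sendall(b"response")
data, code = drain(a)
is_eagain = code in (errno.EAGAIN, errno.EWOULDBLOCK)
message = os.strerror(code)               # human-readable text for that errno
a.close()
b.close()
```

Under that assumption, every successful read would be followed by this "error", whether or not a network transition happened, which matches the logs.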

It seems that there were no Envoy requests in flight when the network transition happened. However, judging by the connection ID (C), it looks like connections do get recycled across network transitions.

I also looked for socket close events and the three events that I saw happened after both network transitions had already occurred.

So that leaves me with no good clues about how Envoy is handling the network transition events smoothly. Especially since it seems to be using the same connections across the transition.

@rebello95
Contributor Author

@Reflejo and I paired and dug into this a bit more, and we believe we have a good understanding of what's going on now.

Experiment

Built the Envoy Mobile library using the following flags to allow us to build to a device with debugging symbols:

--ios_multi_cpus=arm64 --copt=-ggdb3

Ran an example app on an iPhone X that:

  • Made single requests to a simple Python server from which we could see when the client connected, disconnected, and made requests
  • Routed these requests through Envoy using the socket implementation of Envoy Mobile via URLSession (not calling directly into the Envoy Mobile library)

In the Environment Variables section of the app's active scheme, set CFNETWORK_DIAGNOSTICS=3 to enable more verbose CFNetwork logs. Additionally, set Envoy's log level to trace.

We then slowly made network requests, one at a time, through the app.

Findings

Whenever we switched from WiFi to cellular (or vice versa), the next request would consistently fail with code -1001 kCFURLErrorTimedOut. At the same time, we'd see the connection terminate on the server, and we'd see logs from both Envoy and CFNetwork indicating that a new connection was established.

When we executed the next request, it would complete successfully.

If we did this more quickly and sent several requests in rapid succession, the first would still fail and the subsequent requests would complete normally.

Setting the URLSessionConfiguration's httpMaximumConnectionsPerHost to 1 (preventing concurrent connections) and sending several requests in rapid succession resulted in all of them failing after the timeoutIntervalForRequest specified on the URLSessionConfiguration. This is the same behavior seen with libraries like gRPC which use BSD sockets.

Analysis

The current setup looks like this:

[URLSession] --> [Socket] --> [Envoy Mobile] --> [Socket] --> [Internet] 

The experiment above indicates that when the working connection changes to inactive (i.e., disabling WiFi and forcing the phone to switch to cellular), the sockets aren't notified of the change. This is a commonly understood issue with BSD sockets on iOS, and is why Apple strongly advises against using them.

Executing another network request over URLSession results in the following, which is why URLSession previously appeared to "fix" the issue:

  • iOS realizes that the connection is dead and terminates its socket connection with Envoy, then re-establishes it
  • When the connection with Envoy is terminated, Envoy in turn terminates its socket connection with the outside Internet
  • When iOS reconnects to Envoy, Envoy also reconnects and selects the first available connection (cellular in this case)
  • Future requests succeed because they're sent over the new/valid connection
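
The chain of events above can be reproduced in miniature. The following Python sketch is a toy model, not Envoy's implementation: `relay` stands in for the local proxy, `upstream` for the remote server, and the client reconnecting between requests stands in for URLSession re-establishing its socket. All names are hypothetical, and the loops serve exactly two connections to keep the demo finite.

```python
import socket
import threading


def upstream(srv, events):
    """Stand-in for the remote server: echo bytes, record connects/disconnects."""
    for _ in range(2):
        conn, _ = srv.accept()
        events.append("upstream connect")
        while data := conn.recv(1024):
            conn.sendall(data)
        events.append("upstream disconnect")
        conn.close()


def relay(srv, upstream_addr):
    """Stand-in for the local proxy: one upstream connection per downstream one."""
    for _ in range(2):
        down, _ = srv.accept()
        up = socket.create_connection(upstream_addr)
        while data := down.recv(1024):    # b"" means the client hung up
            up.sendall(data)
            down.sendall(up.recv(1024))
        up.close()                        # downstream closed, so drop upstream too
        down.close()


def run_demo():
    events = []
    up_srv = socket.create_server(("127.0.0.1", 0))     # port 0: pick a free port
    relay_srv = socket.create_server(("127.0.0.1", 0))
    t_up = threading.Thread(target=upstream, args=(up_srv, events))
    t_relay = threading.Thread(target=relay, args=(relay_srv, up_srv.getsockname()))
    t_up.start()
    t_relay.start()
    replies = []
    for payload in (b"first", b"second"):
        # Leaving the `with` block closes the client socket, like a torn-down connection.
        with socket.create_connection(relay_srv.getsockname()) as client:
            client.sendall(payload)
            replies.append(client.recv(1024))
    t_relay.join()
    t_up.join()
    up_srv.close()
    relay_srv.close()
    return replies, events


replies, events = run_demo()
```

Each time the client drops its connection to the relay, the upstream server sees a disconnect followed by a fresh connect, which is the same teardown-and-reconnect sequence we observed on the Python server during the experiment.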

Essentially, when a request fails, URLSession disconnects from Envoy and reconnects to it, which forces Envoy to reconnect/switch to a valid connection. This means:

  • When we switch to calling Envoy as a library (instead of over a socket), this will break completely as we originally expected it to because nothing will be forcing Envoy to reconnect to a valid connection
  • Restricting concurrent connections makes this problem immediately apparent even in today's setup because the only existing connection becomes invalid

Additionally, putting the phone in airplane mode results in all requests failing immediately (instead of waiting for the specified timeout) because iOS is aware that it has no connectivity.

Next steps

It's very clear that Envoy will need to have good support for these kinds of network transitions before it goes to production. I propose the following next steps as a way to proceed:

  1. (Optional) For internal/alpha testing, add a stop-gap by observing network changes and forcing Envoy to reconnect. This is what's done in SwiftGRPC, gRPC Core, and is essentially what's happening today under the hood via URLSession. If we take this approach, we may want to consider building in functionality for listening in on SCNetworkReachabilityCreateWithAddressPair
  2. Proceed with implementing a socket abstraction for iOS (transport: native sockets on iOS/Android #13) which uses CFNetwork under the hood (potentially using something like CFSocketGetNative). This will allow us to support older (pre-iOS 12) devices and unblock production
  3. As an immediate next step, implement the same socket abstraction using Network.framework so that we can be future-facing since this is the newly accepted best practice for this kind of networking on iOS
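
The stop-gap in step 1 (observe network changes, force a reconnect) can be sketched independently of any platform API. In this Python sketch, `ReconnectingChannel`, `FakeConnection`, and `on_network_change` are hypothetical names; on iOS the callback would be driven by a reachability observer, which is not modeled here.

```python
class ReconnectingChannel:
    """Holds one connection and replaces it whenever the network changes."""

    def __init__(self, connect):
        self._connect = connect        # factory returning a fresh connection
        self._conn = connect()
        self.generation = 1            # how many connections have been opened

    def on_network_change(self, event):
        """Called by a (hypothetical) reachability observer, e.g. 'wifiToCellular'."""
        self._conn.close()             # the old socket may sit on a dead interface
        self._conn = self._connect()
        self.generation += 1

    def send(self, payload):
        return self._conn.send(payload)


class FakeConnection:
    """Test double standing in for a real socket-backed connection."""

    opened = 0

    def __init__(self):
        FakeConnection.opened += 1
        self.id = FakeConnection.opened
        self.closed = False

    def send(self, payload):
        if self.closed:
            raise ConnectionError("connection closed")
        return (self.id, payload)

    def close(self):
        self.closed = True


channel = ReconnectingChannel(FakeConnection)
before = channel.send("ping")                  # served by the first connection
channel.on_network_change("wifiToCellular")
after = channel.send("ping")                   # served by a fresh connection
```

The trade-off of this approach is that any request in flight on the old connection is dropped at the moment of the switch, which is acceptable for alpha testing but is part of why steps 2 and 3 are still needed.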

Thoughts @mattklein123 @junr03 @goaway?

@goaway
Contributor

goaway commented Jul 30, 2019

Thanks for doing this investigation @rebello95 and @Reflejo!

I think observing network changes is necessary no matter what (1.), since the CF interfaces (2.) will get us a socket on the current "right" or preferred interface, but won't automatically signal the switch (IIRC). Network.framework (3.) lets us specify interfaces explicitly, which is nice because we don't have to infer them, but I don't know if there's any additional baked-in reachability signaling.

Ultimately, internally what I think we'll do is classify and maintain connections (while viable) on both interfaces and have strategies (e.g. reachability signals or observed health) for favoring them. (This is already similar to Envoy's connection pool model and I think we can likely adapt it to support this.)
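
As a rough illustration of that idea (a sketch only; `InterfaceState`, `pick_interface`, and the failure threshold are hypothetical and not part of Envoy's connection pool), a selection strategy combining reachability signals with observed health could look like this in Python:

```python
from dataclasses import dataclass


@dataclass
class InterfaceState:
    name: str                # e.g. "wifi" or "cellular"
    reachable: bool = True   # last reachability signal for this interface
    failures: int = 0        # consecutive failed requests observed on it


def pick_interface(states, max_failures=3):
    """Prefer reachable interfaces with the fewest recent failures."""
    viable = [s for s in states if s.reachable and s.failures < max_failures]
    if not viable:
        return None          # nothing usable; the caller should fail fast
    # min() keeps the first of equally healthy interfaces, so list order is the tiebreak.
    return min(viable, key=lambda s: s.failures).name


wifi = InterfaceState("wifi")
cellular = InterfaceState("cellular")
initial = pick_interface([wifi, cellular])
wifi.reachable = False       # reachability signal: wifi went away
fallback = pick_interface([wifi, cellular])
```

Listing wifi first makes it the default when both interfaces are equally healthy; that ordering is a design choice of this sketch.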

@mattklein123
Member

Awesome work @Reflejo @rebello95! This all makes sense to me now.

Can we put together a short design doc on iOS transport socket options? Agreed we can look at stopgap hacks for alpha, including possibly even just tearing down and recreating Envoy on interface changes. I would also recommend looking at what cronet does as a data point. Whatever we do we will need to figure out the right way to bridge the iOS platform APIs with Envoy's kqueue based event loop.

@Reflejo
Contributor

Reflejo commented Jul 31, 2019

I actually think we should prioritize this conversation ASAP. I don't see any easy way to support high(er) level APIs on iOS with envoy, so I agree we should put a design doc on the options.

I think the short-term solution is to observe reachability and close the socket, and to test thoroughly that libevent is propagating it through kevents, without changing any Envoy syscalls. As Mike mentioned, neither CFSocket nor BSD sockets support scoped routing on iOS, and so the hand-off won't happen. It's worth noting that the cellular radio is also not turned on by either of them, and VPNs won't work. This is probably OK for the short term. Another option here is to investigate CFStream.

The longer term is what we should think about, and it's something I'm a bit concerned about. Ideally we should use Network.framework, which is what Apple is investing in (reachability signaling is baked into it AFAIK), but this is going to be very challenging since we can't just pass an fd and observe on it.

Since they are moving to user-space, the abstractions we have on envoy right now don't map very well to it. We need to think through it.

@rebello95
Contributor Author

Those are great callouts, and I definitely think this is something that we need to prioritize (and are prioritizing). @goaway and @junr03 will be heading up the work to implement the socket layer on iOS, which will be tracked in #13. After discussing in planning today, we're going to have a doc with basic proposed options by EOW, and will start implementation in early August when @junr03 gets back from OOO.

rebello95 added a commit that referenced this issue Aug 2, 2019
Updating based on #128 (comment).

Signed-off-by: Michael Rebello <[email protected]>
rebello95 added a commit that referenced this issue Aug 2, 2019
Updating based on #128 (comment).

Signed-off-by: Michael Rebello <[email protected]>
@stale

stale bot commented Aug 31, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 31, 2019
@junr03
Member

junr03 commented Sep 6, 2019

@rebello95 going to close this in favor of keeping #13 open.

@junr03 junr03 closed this as completed Sep 6, 2019
@junr03 junr03 removed the stale label Sep 6, 2019
jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 28, 2022
Updating based on envoyproxy/envoy-mobile#128 (comment).

Signed-off-by: Michael Rebello <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit to envoyproxy/envoy that referenced this issue Nov 29, 2022
Updating based on envoyproxy/envoy-mobile#128 (comment).

Signed-off-by: Michael Rebello <[email protected]>
Signed-off-by: JP Simard <[email protected]>