-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race Condition on opening a Stream over a DataChannel #2443
Comments
Upon reviewing the pEvent code it more detail and thinking about it a little more. I believe this issue can only occur with events emitted from native code, such as the DataChannel event the WebRTC Transport library is trying to await. Otherwise, the pEvent code is effectively synchronous, so this would not happen with an event emitted from JavaScript code. |
Just to follow up on this, we added a small patch to address this, it was happening fairly regularly and we haven't seen this problem any more after applying the patch. Basically, we shortened the timeout and we just check it again if pEvent times out. This second check covers the case that pEvent misses the open event from the DataChannel. I think a better solution long-term here might be to 'poll' the state periodically in addition to listening for the event.
|
Would an using 'Promise.all' across pWaitFor and pEvent be an acceptable solution? If so, we would be happy to submit a PR. |
Sorry for the delay in getting to this, it does look like very strange behaviour. Looking through the But you're seeing some behaviour that's hard to explain, so maybe something is causing the current tick to end. I'd like to dig into this further - do you have a way to replicate the problem? |
It is a race condition that happens only if the open is not too fast or too slow. Probably in a window of about 4 to 8 milliseconds based on my testing. We aren't doing anything special, we are using Circuit-Relay and the WebRTCTransport is managing opening the connections. The only variable that impacts the ocurrence is the latency of the 'open' event. However, because the handling of creating the DataChannel is all internal to libp2p, I don't see an easy way to inject a delay to trigger the race without either modifying the WebRTCTransport implementation or doing it at the network level. We have some tests that use 'tc' on Linux to inject packet latency and packet loss between two nodes, but they require a fair amount of setup code building virtual interfaces and bridges to allow controlling the network behavior. |
4-8ms is an eternity for synchronous code so there's definitely something weird happening. I've tried to create a repro from your description here https://github.com/achingbrain/libp2p-webrtc-ping I'm having problems replicating with Chrome*, can you please take a look at tell me what I need to change or add to be able to replicate the issue? It shouldn't be necessary to modify libp2p internals to replicate the issue - we should be able to do it with external changes only. Since we might want to modify libp2p to fix the issue, we can't introduce new code there that creates the bug, because to "fix" it we'd just delete the code we just added 😅. * = For Firefox I came across this which is not great, but we'll see what the devs say. |
I agree, code should not need to be added to reproduce the issue. However, since the condition only manifests within a small range of open event latency, ensuring the condition will manifest is not possible without being able to control the latency of the open event. When you run on two nodes on localhost the latency will be so low that the datachannel is in the open state by the first time the first check happens, so pEvent will never be called. You'll have to run the listener and the sender on two different machines in the same LAN. What we found was that certain network and browser combinantions would never see this issue, for example, connecting over the Internet or a VPN. Other configurations would see this issue a non-trivial percentage of the time. We found that the problem seemed to occur more frequently when one of the peers was running Chrome on MacOS. Seems like the open latency was slightly higher with MacOS. A more deterministic option would be to create two 'veth' interfaces, one for the listener and one for the sender, and use 'tc' to add latency to the network path. I appreciate your interest in this. It is not an easy one I think. I would be happy to discuss this real-time if it is helpful. You can reach me directly at domain ryzyliant dot com, user justin dot mcateer. |
I've tried running https://github.com/achingbrain/libp2p-webrtc-ping on two MacOS machines on a LAN with both running Chrome but I still can't replicate the issue. If the problem is that the If anything we'd want to reduce latency to expose the issue since the time between the check and the adding of the event listener will be significantly less than 1ms as the code is all synchronous. That the code is synchronous means that it's not possible to interrupt it, so assuming it all happens in the current tick, there's no way the event can fire before the listener is added, since the callback from the native code that will cause the event to fire will be invoked when the callback queue is processed, which happens after the current tick has ended (e.g. the call stack is empty) and adding the listener will increase the calls in the call stack, since it's done synchronously. If the native callback is executed in a way that interrupts the JS call stack, the bug is in the browser since that's not how JavaScript works as pointed out. I need more assistance in replicating this issue as it doesn't seem possible as described. |
Thanks again for taking the time to dig into this. We will work on an automated method of reproducing this issue, possibly based on the test repo you put together. We will try to get to this as quickly as possible, but it may take a couple of weeks. |
Awesome, thanks - I'll check back here in a couple of weeks. |
@justin0mcateer did you manage to create a reproduction for this issue? |
Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days. |
I am looking into whether the latency capabilities of the Chrome DevTools can be used to control the latency of the connection setup. There was some work done recently in Chrome to enable this. It isnt ckear whether that is in the released versions yet. |
Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days. |
This issue was closed because it is missing author input. |
Version:
1.2.4
Platform:
Linux master 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux
Windows 11
MacOS
Google Chrome 122.0.6261.94
Brave Browser 122.1.63.165
Subsystem:
WebRTC Transport
Severity:
High
Description:
DataChannels open but Streams timeout and fail.
The main symptom is that we end up with 'one way' Streams, where the incoming Stream was successfully opened, but the outgoing Stream aborts.
Steps to reproduce the error:
The timing window for the race condition is extremely small ~4ms, so the error only occurs when the time required to open a DataChannel is within a very small window. We typically see it when trying to establish connections in a LAN environment, but then the timing varies slightly based on the operating system and browser. Changing browsers may fix the problem, enabling a VPN will almost always fix the problem (due to added latency). The problem seems to occur with much greater frequency on Chrome on MacOS.
We only began seeing this problem when upgrading from 0.46 to 1.1 (now 1.2). The problem seems to have been introduced in the refactoring done here:
@achingbrain
#2237
What we are seeing is that there is a very small delay between checking for the current state of the DataChannel and pEvent actually registering the event handler to capture the 'open' event. Such that if the timing of the DataChannel open event is just right, pEvent never receives the event and throws on timing out.
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L228-L230
Catching the Timeout error and checking again, proceeding if the DataChannel is in fact open works around the issue.
Debugging log points have been added in this log output which illustrate the issue:
If pEvent caught the 'open' event this log message should be printed, but it is not:
https://github.com/libp2p/js-libp2p/blob/main/packages/transport-webrtc/src/stream.ts#L231
It seems that due to the way pEvent is implemented, it leaves a tiny window of time between when the pEvent function is called and when 'addEventHandler' is actually called. A custom function that periodically polls, in addition to capturing the event may be required, since the timing is so tight in this case.
The text was updated successfully, but these errors were encountered: