Fix | Check Connected property of sockets during parallel connect #2877

Open. David-Engel wants to merge 2 commits into main.
Conversation

@David-Engel (Contributor) commented Sep 20, 2024

Address the failure in the DNN scenario described in #2400.

In the MSF parallel connect path, socket connections can fail quickly (as opposed to timing out), so the first socket connect task to finish becomes the "success" task that closes all other sockets. In a VNN, there is no listener at the other end of the inactive IP addresses, so those socket connects take a while to time out and the problem does not surface. In a DNN, however, the gateway responds with an RST on the socket for inactive IP addresses, and that RST may arrive before the connection completes on the active IP address. As a result, a "success" socket connect task (success meaning only that ConnectAsync finished) can disconnect all other attempted sockets.

The change here is to define success as the ConnectAsync task having finished and the socket.Connected property being true.

If this addresses the DNN issue, this simple fix would be a good candidate for backport. There are other code changes that could be made to eliminate tasks on this sync path. I have some drafted up, but the changes would require much more testing.
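To illustrate the change described above, here is a minimal, hypothetical sketch of a parallel connect loop that only declares a winner when its ConnectAsync task ran to completion and Socket.Connected is true. This is illustrative only, not the actual Microsoft.Data.SqlClient code; the ParallelConnectAsync helper name is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical sketch (not the MDS implementation) of a parallel connect
// that treats an attempt as successful only when the socket is connected.
static async Task<Socket> ParallelConnectAsync(IPAddress[] addresses, int port)
{
    var pending = new Dictionary<Task, Socket>();
    foreach (IPAddress address in addresses)
    {
        var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        pending.Add(socket.ConnectAsync(address, port), socket);
    }

    while (pending.Count > 0)
    {
        Task finished = await Task.WhenAny(pending.Keys);
        Socket candidate = pending[finished];
        pending.Remove(finished);

        // Old behavior: any completed ConnectAsync task "won" the race.
        // Fixed behavior: the task must have run to completion AND the
        // socket must report Connected before the other attempts are closed.
        if (finished.Status == TaskStatus.RanToCompletion && candidate.Connected)
        {
            foreach (Socket loser in pending.Values)
            {
                loser.Close();
            }
            return candidate;
        }

        candidate.Close(); // reset or failed attempt; keep waiting on the rest
    }

    return null; // every attempt failed
}
```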

@JRahnama (Contributor) commented

I can ask the client to test with this set of changes. Hopefully they do not need it to be signed.

@DavoudEshtehari added the 🐛 Bug! label Sep 20, 2024
@DavoudEshtehari added this to the 6.0-preview2 milestone Sep 20, 2024

codecov bot commented Sep 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.05%. Comparing base (cb11317) to head (0059548).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2877      +/-   ##
==========================================
- Coverage   72.06%   72.05%   -0.01%     
==========================================
  Files         299      299              
  Lines       61457    61457              
==========================================
- Hits        44290    44285       -5     
- Misses      17167    17172       +5     
Flag Coverage Δ
addons 92.90% <ø> (ø)
netcore 76.00% <100.00%> (+<0.01%) ⬆️
netfx 70.24% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@cheenamalhotra (Member) commented

@JRahnama can you confirm if this has been tested by the client? Thanks.

@JRahnama (Contributor) commented

> @JRahnama can you confirm if this has been tested by the client? Thanks.

I have provided the package and am waiting for the client's response.

@cheenamalhotra removed this from the 6.0-preview2 milestone Oct 5, 2024
@cheenamalhotra removed the 🐛 Bug! label Oct 5, 2024
@David-Engel (Contributor, Author) commented

I don't think this fix will actually work. When testing #2915 with a mock server that sends an RST after connect, MDS moves on before the RST even registers on the socket. The RST might never register on the socket at all without using socket.Poll (or another read/write). Even adding socket.Poll in the parallel logic only gets me ~30 ms of grace between connect and RST before MDS passes the connected socket on up the stack. I'm very hesitant to add a wait here just to watch for the endpoint to send a delayed RST.
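For reference, a hedged sketch of the kind of socket.Poll check mentioned above (a hypothetical helper, not part of this PR): polling for readability on an apparently connected socket can reveal a FIN/RST that has already arrived, but as noted, it only buys a short grace window against a delayed RST.

```csharp
using System.Net.Sockets;

// Hypothetical helper illustrating the Socket.Poll idea discussed above.
// Poll(SelectRead) returns true when data is available or when the connection
// has been closed/reset; readable with zero bytes available indicates FIN/RST.
static bool StillConnected(Socket socket, int pollMicroseconds)
{
    bool readable = socket.Poll(pollMicroseconds, SelectMode.SelectRead);
    if (readable && socket.Available == 0)
    {
        return false; // peer closed or reset the connection
    }
    return socket.Connected;
}
```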
