close quic streams on Dispose #54798

Merged
merged 7 commits into dotnet:main from quicDispose on Jul 7, 2021

Conversation

wfurt
Member

@wfurt wfurt commented Jun 27, 2021

This change has multiple parts:

Primarily, this changes Dispose behavior to flush all pending data via graceful shutdown (contributes to #756).
This can block if the connection is up but the receiver stops reading and we exceed the send window.
We may solve this via a shutdown API that takes a CancellationToken.
This behavior is essential for the failing stream conformance tests.

On the reading side we could wait for a graceful shutdown or reset, but that can also take a long time and could lead to DoS. I looked at NetworkStream, where we call shutdown for both directions, so I decided to abort the read side. Ideally the application (like H/3) knows when to stop reading/writing so the peer can gracefully close the stream before Dispose, i.e. Dispose is just a safety measure to avoid an internal reset with code 0.
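
To illustrate the asymmetry, here is a rough C# sketch; StreamDisposeSketch and its members are illustrative stand-ins, not the actual MsQuicStream internals. The write side gets a graceful shutdown while the read side is aborted.

using System.Threading;

// Hedged sketch only: illustrative names, not the real MsQuicStream internals.
internal sealed class StreamDisposeSketch
{
    private int _disposed;
    private bool _canRead = true;
    private bool _canWrite = true;

    public void Dispose()
    {
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
        {
            return; // run the shutdown logic only once
        }

        if (_canWrite)
        {
            StartGracefulWriteShutdown(); // flush buffered data, then send FIN for the write side
        }

        if (_canRead)
        {
            AbortRead(0xffffffff); // do not wait for the peer; abort the read side instead
        }
    }

    private void StartGracefulWriteShutdown() { _canWrite = false; /* call into the native shutdown here */ }

    private void AbortRead(long errorCode) { _canRead = false; /* call into the native abort here */ }
}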

With this in place we cannot safely clean up all the handles in the Dispose method. To deal with this I added a simple atomic state so the cleanup can alternatively be deferred to HandleEventShutdownComplete(). There is probably an opportunity to simplify the logic further, but I wanted more testing and experimentation, and I wanted to get the change above out first.
The tricky part is that the state can change during Dispose when calling shutdown/reset, so the simple state lets the cleanup happen in either place.
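
As a minimal sketch of that two-sided cleanup, using the same ShutdownDone values that show up in the diffs later in this PR (CleanupHandles is an illustrative stand-in for the real native cleanup): whichever of Dispose and the shutdown-complete event runs second performs the release.

using System.Threading;

// Minimal sketch: 0 = neither side has run, 1 = Dispose ran first, 2 = the final event ran first.
internal sealed class ShutdownStateSketch
{
    private int _shutdownDone;

    public void Dispose()
    {
        // If the final event already ran (value was 2), Dispose is last and releases the handles.
        if (Interlocked.Exchange(ref _shutdownDone, 1) == 2)
        {
            CleanupHandles();
        }
        // Otherwise cleanup is deferred to HandleEventShutdownComplete.
    }

    public void HandleEventShutdownComplete()
    {
        // If Dispose already ran (value was 1), this event is last and releases the handles.
        if (Interlocked.Exchange(ref _shutdownDone, 2) == 1)
        {
            CleanupHandles();
        }
    }

    private static void CleanupHandles()
    {
        // Release the native stream handle, GC handles, etc.
    }
}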

To make debugging easier, I set the safe handle to IntPtr.Zero during release, so if it is used twice or afterwards I get a nice SafeHandle exception instead of possibly corrupting memory and getting a native crash.
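
A minimal sketch of that pattern with a hypothetical SafeHandle subclass (SampleQuicHandle is not the actual MsQuic handle type): once the handle field is zeroed, later use of the handle fails fast in managed code (IsInvalid / SafeHandle checks) instead of touching freed native memory.

using System;
using System.Runtime.InteropServices;

// Hedged sketch: hypothetical handle type, not the actual MsQuic safe handle.
internal sealed class SampleQuicHandle : SafeHandle
{
    public SampleQuicHandle() : base(IntPtr.Zero, ownsHandle: true) { }

    public override bool IsInvalid => handle == IntPtr.Zero;

    protected override bool ReleaseHandle()
    {
        // NativeClose(handle); // close the native object first (omitted here)
        SetHandle(IntPtr.Zero);  // forget the pointer so any later use trips the SafeHandle checks
        return true;
    }
}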

The stream conformance tests expect stream.CanRead and CanWrite to reflect the current state of the stream, so they are no longer readonly.
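
A hedged sketch of the shape this implies (illustrative names, not the exact PR code): the properties are computed from mutable flags instead of readonly fields.

// Hedged sketch, not the actual MsQuicStream code.
internal sealed class StreamStateSketch
{
    private volatile bool _canRead = true;
    private volatile bool _canWrite = true;

    public bool CanRead => _canRead;
    public bool CanWrite => _canWrite;

    public void OnReadAborted() => _canRead = false;   // read direction gone
    public void OnWriteShutdown() => _canWrite = false; // write direction gone
}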

With this change the WaitForAvailable*Stream tests started to fail. It turned out they made the incorrect assumption that the stream limit increases when the stream is disposed locally. Big thanks to @nibanks, who helped me debug the fact that the stream needs to be released on the other side as well to get predictable behavior. I suspect some H/3 tests may have a similar issue, but I did not dig deeper.

With this, all the conformance tests pass except for parallel IO.
I'm still running the tests in a loop and looking for odd failures.
This may need more adjustments, but I wanted to get it out for feedback.

@wfurt wfurt requested review from stephentoub and a team June 27, 2021 22:00
@wfurt wfurt self-assigned this Jun 27, 2021
@ghost

ghost commented Jun 27, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author: wfurt
Assignees: wfurt
Labels: area-System.Net.Quic
Milestone: -

Member

@ManickaP ManickaP left a comment

I'm not fully on board with the _canRead/_canWrite change, nor with ShutdownDone. To me, it's another state to check at every step. I'd rather combine those into their respective state fields.

I also ran the tests; H/3 still intermittently crashes, and some cookie tests were failing for me. Those are being handled by Geoff AFAIK.

@@ -70,7 +70,7 @@ internal sealed class State
         SingleWriter = true,
     });

-    public void RemoveStream(MsQuicStream stream)
+    public void RemoveStream(MsQuicStream? stream)
Member

Why are we even passing in the stream? Doesn't seem to be used at all.

Member Author

I was going back and forth on this one. When debugging it is very nice to have the stream here so one can emit traces from a single location. The same basically applies to AddStream.

Contributor

I'm trying to understand why we need this at all... It seems like we are deferring Dispose of the connection until all streams in use on it are completed. Is that correct? Why do we want to do this? Can we just get rid of this tracking completely?

Member Author

Yes, that is the intent. We saw crashes without it, but it can perhaps be done in a different way; if so, I would do it in a separate PR. The reason this was touched is that HandleEventShutdownComplete does not have access to the stream itself, so it would pass null.
I'm OK with removing the parameter completely since it is not used in the product, or perhaps using it in Debug builds only, for example to dump the stream info if we are on the path to an assert.

Contributor

I agree, we should consider that for a separate PR.

Mostly I want to make sure we are making an intentional decision about behavior here.

Member Author

When chatting with @nibanks, he said we would not get HandleEventShutdownComplete with pending streams. So if we delay cleanup in Dispose and we can rely on that, we could perhaps remove the logic. It would be nice to have stress tests up at that time so we can verify the impact.

@@ -650,6 +650,14 @@ private async Task FlushAcceptQueue()
     _state.AcceptQueue.Writer.TryComplete();
     await foreach (MsQuicStream item in _state.AcceptQueue.Reader.ReadAllAsync().ConfigureAwait(false))
     {
+        if (item.CanRead)
+        {
+            item.AbortRead(0xffffffff);
Member

This hard-coded error code is probably not something we want here, is it? All the error codes should come from the user app, since none are officially defined by the QUIC protocol.

Contributor

Agreed. If necessary we should hard code the appropriate H3 error code here, for now. But ideally this would not be H3 specific.

Member Author

Do you have a suggestion for a value and name, @geoffkizer? I agree this should preferably not be H3 specific, so I picked MaxValue instead.
A well-behaved app would probably shut down the listener and accept and dispose all the pending streams.
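
One possible shape, sketched with a made-up constant name (not from the PR), just to avoid repeating the bare literal:

// Hypothetical constant name; the value matches the 0xffffffff literal used in the diff above.
private const long DefaultStreamAbortErrorCode = 0xffffffff;

// ...then in FlushAcceptQueue():
item.AbortRead(DefaultStreamAbortErrorCode);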

@@ -252,7 +252,7 @@ private static uint HandleEventShutdownComplete(State state, ref ConnectionEvent
     state.ShutdownTcs.SetResult(MsQuicStatusCodes.Success);

     // Stop accepting new streams.
-    state.AcceptQueue.Writer.Complete();
+    state.AcceptQueue.Writer.TryComplete();
Contributor

It seems like there are a bunch of places now where we either call Writer.Complete or Writer.TryComplete. Are all of these necessary? I would assume we only have to do this in a limited number of places, e.g. HandleEventShutdownInitiatedByTransport and HandleEventShutdownInitiatedByPeer, and maybe one other for when we initiate shutdown ourselves.

I always get nervous when we do stuff like call TryComplete instead of Complete, because it seems like we don't have a clean handling of exactly when the Complete operation needs to happen. I'd prefer to see the TryComplete calls changed to either Complete or to assert that it's already completed.

Member Author

In this case I was getting an exception when the connection was already disposed, which triggered a flush of events and ShutdownComplete. I could wrap it with if (_disposed == 0), but changing it to Try looks better to me, since this only means we will not put more items into the queue.
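
For reference, the semantic difference under discussion, shown as a small standalone System.Threading.Channels example rather than the PR code:

using System.Threading.Channels;

var channel = Channel.CreateUnbounded<int>();

channel.Writer.Complete();                     // first completion succeeds
bool completed = channel.Writer.TryComplete(); // returns false instead of throwing
// channel.Writer.Complete();                  // a second Complete() would throw InvalidOperationException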

Member Author

I looked at the cases, @geoffkizer, and it is not that we don't know whether the Complete operation should happen; we just don't know whether it already happened. For example, the shutdown may be initiated by the peer, or somebody may Dispose the connection, but then we get this final event and it will throw if we try to complete the writer again. Unless I missed something, I did not find an API on the Writer to tell whether it was already completed.
One option I can think of is making it nullable and setting it to null after completion, or we can add yet another state. Let me know what you want me to do.

Contributor

I think it's fine for this PR, but we should revisit the logic here moving forward.

}

// We already got final event.
releaseHandles = Interlocked.Exchange(ref _state.ShutdownDone, 1) == 2;
Contributor

If we already got a final event, then do we actually need to call shutdown/abort below?

Member Author

The way it currently works is that ShutdownComplete will not release everything if Dispose was not called; see the note below. We could change that, but then for example Shutdown and/or Reset could throw if the handle is gone.

@@ -824,6 +899,18 @@ private static uint HandleEventShutdownComplete(State state, ref StreamEvent evt
         state.ShutdownCompletionSource.SetResult();
     }

+    // Dispose was called before complete event.
+    bool releaseHandles = Interlocked.Exchange(ref state.ShutdownDone, 2) == 1;
Contributor

Isn't it the case that we always get this ShutdownComplete event? So then can't we always release the handles here?

Member Author

I was thinking about that. However, in some cases ShutdownComplete can arrive without us asking for it, e.g. before Dispose is called. With that, an attempt to shutdown/reset would throw, so we would need careful synchronization before (and perhaps during) every operation.

Contributor

It seems like the handle is only needed for event callbacks, right? I don't think other code should be accessing it. So I don't think there's any real synchronization needed here, is there?

Member Author

I'm not sure I understand the comment. The cleanup does more than just the handle, and we also pass the handle into functions like Shutdown().

@geoffkizer
Contributor

Related: #55044

@wfurt
Member Author

wfurt commented Jul 2, 2021

I addressed most of the feedback and left some questions/comments. It would be great if you could take another look, @geoffkizer @ManickaP.

@ManickaP
Member

ManickaP commented Jul 2, 2021

I did a run of the current code and got neither HTTP nor QUIC errors. 🥳

Member

@ManickaP ManickaP left a comment

There are still some unanswered questions.

@ManickaP
Member

ManickaP commented Jul 6, 2021

@wfurt please let me know when you want me to re-review. I see some open discussion with @geoffkizer, so I'm not sure whether you plan to make more changes here.

@wfurt
Member Author

wfurt commented Jul 7, 2021

contributes to #49157

@wfurt wfurt merged commit 0d1e9ca into dotnet:main Jul 7, 2021
@wfurt wfurt deleted the quicDispose branch July 7, 2021 19:25
@karelz karelz added this to the 6.0.0 milestone Jul 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2021