[QUIC] Intermittent hangs in CI #55642

Closed

ManickaP opened this issue Jul 14, 2021 · 7 comments · Fixed by #56346

ManickaP (Member) commented Jul 14, 2021

Sporadic hangs of System.Net.Quic.Functional.Tests.

Since yesterday we've been seeing that the work item sometimes fails with "timeout", meaning a test hang and a subsequent kill by the Helix agent.
The likely culprit is #55505: first, the time when the failures started correlates with the time of its merge; second, the changes come from #52929, which has historically seen multiple test hangs.

Unique tests picked from the console logs:

[Long Running Test] 'System.Net.Quic.Tests.MsQuicQuicStreamConformanceTests.ReadWrite_Success', Elapsed: 00:14:04
[Long Running Test] 'System.Net.Quic.Tests.QuicStreamTests_MsQuicProvider.ReadWrite_Random_Success', Elapsed: 00:14:00
[Long Running Test] 'System.Net.Quic.Tests.QuicStreamTests_MsQuicProvider.LargeDataSentAndReceived', Elapsed: 00:14:04
[Long Running Test] 'System.Net.Quic.Tests.MsQuicQuicStreamConformanceTests.ReadWrite_MessagesSmallerThanReadBuffer_Success', Elapsed: 00:14:05
[Long Running Test] 'System.Net.Quic.Tests.MsQuicTests.ByteMixingOrNativeAVE_MinimalFailingTest', Elapsed: 00:14:33

Kusto queries to get the data

Failing work items (9 records):

WorkItems
| where FriendlyName == "System.Net.Quic.Functional.Tests" and Status != "Pass" and ExitCode == -3
| join kind=inner Jobs on JobId
| where DockerTag == "mcr.microsoft.com/dotnet-buildtools/prereqs:fedora-32-helix-20210710031120-870c408"
| order by Queued desc

Success for comparison (465 total, 135 since the first failure):

WorkItems
| where FriendlyName == "System.Net.Quic.Functional.Tests" and Status == "Pass"
| join kind=inner Jobs on JobId
| where DockerTag == "mcr.microsoft.com/dotnet-buildtools/prereqs:fedora-32-helix-20210710031120-870c408"
| order by Queued desc

cc: @CarnaViire @karelz

Build     Kind        Start Time
1236570   PR 55594    2021-07-14
ManickaP added this to the 6.0.0 milestone Jul 14, 2021
dotnet-issue-labeler bot added the untriaged label Jul 14, 2021
ghost commented Jul 14, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


Author: ManickaP
Assignees: -
Labels: area-System.Net.Quic
Milestone: 6.0.0

ManickaP removed the untriaged label Jul 14, 2021
ManickaP (Member, Author) commented

If we decide to punt this, we should be sure this is not a product bug.

wfurt (Member) commented Jul 15, 2021

I can reproduce this locally. It definitely looks like a regression.

geoffkizer (Contributor) commented

We should not punt this.

wfurt (Member) commented Jul 16, 2021

We always seem to hang in a synchronous Read or Write, generally while in the ConnectionClose state.
I'll continue tomorrow to see why we are not continuing/failing as we should.

wfurt (Member) commented Jul 20, 2021

It seems like we have a problem with completing tasks.
The underlying task completion is set as finished, and that should wake the chained WriteAsync we call from Write via

WriteAsync(buffer.ToArray()).AsTask().GetAwaiter().GetResult();

The continuation hangs on the thread pool without ever being completed.
At this point #53471 is the main suspect for the regression @kouvel.
A full dump is posted at https://microsoft-my.sharepoint.com/:u:/p/toweinfu/Ed_HLuTVJFJIvA1g5ZhgT8YBBV-h7GXMOWzr73pNosiA7g?e=O93P6L and it would be great if you could take a look at why the thread pool is not running the tasks.
I looked at it with @stephentoub and it seems like the old code should work, but it does not. The changes from #55985 should not make a difference, but they do. I don't know if a contributing factor is the fact that my VM is a 2-core system. I'd be happy to test it with more (or fewer) cores.
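
For context, here is a minimal sketch (not the actual QuicStream code) of why this sync-over-async pattern is sensitive to the thread pool: the blocking GetResult() waits for the task to complete, but the continuation that completes the task itself needs a thread-pool thread to run on, so a wedged pool leaves the blocking call waiting forever.

    using System;
    using System.Threading.Tasks;

    class SyncOverAsyncSketch
    {
        // Hypothetical stand-in for QuicStream.WriteAsync: the work finishes
        // asynchronously and the continuation after the await is queued to
        // the thread pool.
        static async ValueTask WriteAsync(byte[] buffer)
        {
            await Task.Delay(100).ConfigureAwait(false);
            // Everything after this await runs on a thread-pool thread. If
            // the pool stops dispatching work items, this continuation never
            // runs and the task never transitions to completed.
        }

        // The pattern from the hang: synchronous Write implemented by
        // blocking on the async path, as in the line quoted above.
        static void Write(byte[] buffer)
        {
            WriteAsync(buffer).AsTask().GetAwaiter().GetResult();
        }

        static void Main()
        {
            Write(new byte[16]); // blocks here if the pool is wedged
            Console.WriteLine("write completed");
        }
    }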

Also note that other tasks get stuck too at this point. I added

  using CancellationTokenSource cts = new CancellationTokenSource();
  cts.CancelAfter(_readTimeout);

and the cancellation would not expire either at this point.
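
That observation is consistent with a wedged thread pool: CancelAfter arms a timer, and timer callbacks are dispatched on thread-pool threads, so the cancellation itself cannot fire while the pool is stuck. A minimal sketch of the timeout pattern, with a hypothetical s_readTimeout standing in for _readTimeout:

    using System;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    static class ReadTimeoutSketch
    {
        // Hypothetical timeout standing in for the _readTimeout field above.
        private static readonly TimeSpan s_readTimeout = TimeSpan.FromSeconds(30);

        public static async Task<int> ReadWithTimeoutAsync(Stream stream, Memory<byte> buffer)
        {
            using CancellationTokenSource cts = new CancellationTokenSource();
            // CancelAfter schedules a System.Threading.Timer; the callback
            // that signals the token runs on a thread-pool thread, so a
            // starved pool delays or blocks the cancellation entirely.
            cts.CancelAfter(s_readTimeout);
            return await stream.ReadAsync(buffer, cts.Token);
        }
    }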

kouvel added a commit to kouvel/runtime that referenced this issue Jul 27, 2021
- In dotnet#53471 the thread count goal was moved out of `ThreadCounts`; it turns out there are a few subtle races that it was avoiding. There are other ways to fix it, but I've added the goal back into `ThreadCounts` for now.
- Reverted PR dotnet#55985, which worked around the issue in the CI

Fixes dotnet#55642
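
A rough illustration of the idea behind the fix, not the actual ThreadCounts implementation: when the goal lives outside the atomically updated counts, a reader can observe the counts from one moment and the goal from another; packing every field into a single value updated with one compare-exchange makes reads and updates internally consistent.

    using System.Threading;

    // Sketch: thread counts and the goal packed into one atomic value.
    // Assumed layout for illustration:
    //   bits  0..15  threads processing work
    //   bits 16..31  existing threads
    //   bits 32..47  thread count goal
    struct PackedCounts
    {
        private long _data;

        public static PackedCounts VolatileRead(ref PackedCounts counts) =>
            new PackedCounts { _data = Volatile.Read(ref counts._data) };

        public ushort NumProcessingWork
        {
            get => (ushort)_data;
            set => _data = (_data & ~0xFFFFL) | value;
        }

        public ushort NumExistingThreads
        {
            get => (ushort)(_data >> 16);
            set => _data = (_data & ~(0xFFFFL << 16)) | ((long)value << 16);
        }

        public ushort Goal
        {
            get => (ushort)(_data >> 32);
            set => _data = (_data & ~(0xFFFFL << 32)) | ((long)value << 32);
        }

        // Publish newCounts only if nothing (including the goal) changed
        // since oldCounts was read; a concurrent goal update forces the
        // caller to re-read, so counts and goal are never observed torn.
        public static bool TryUpdate(ref PackedCounts location,
                                     PackedCounts newCounts, PackedCounts oldCounts) =>
            Interlocked.CompareExchange(ref location._data,
                                        newCounts._data, oldCounts._data) == oldCounts._data;
    }

Typical usage reads the packed value once, computes the change, and loops on TryUpdate until it succeeds, a guarantee that separately stored fields cannot provide.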
ghost commented Jul 27, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Author: ManickaP
Assignees: wfurt
Labels: area-System.Threading, bug
Milestone: 6.0.0

kouvel assigned kouvel and unassigned wfurt Jul 27, 2021
ghost added the in-pr label Jul 27, 2021
kouvel added a commit that referenced this issue Jul 27, 2021
* Fix thread pool hang

- In #53471 the thread count goal was moved out of `ThreadCounts`; it turns out there are a few subtle races that it was avoiding. There are other ways to fix it, but I've added the goal back into `ThreadCounts` for now.
- Reverted PR #55985, which worked around the issue in the CI

Fixes #55642

* Revert "mitigation for quic tests hangs (#55985)"

This reverts commit 0a5e93b.
ghost removed the in-pr label Jul 27, 2021
ghost locked as resolved and limited conversation to collaborators Aug 26, 2021