Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[x86] SequenceEqual returns false for equal spans #65292

Closed
VincentBu opened this issue Feb 14, 2022 · 66 comments · Fixed by #65490
Closed

[x86] SequenceEqual returns false for equal spans #65292

VincentBu opened this issue Feb 14, 2022 · 66 comments · Fixed by #65490

Comments

@VincentBu
Copy link
Contributor

VincentBu commented Feb 14, 2022

Run: runtime-libraries-coreclr outerloop 20220213.4

Failed test:

net7.0-windows-Release-x86-CoreCLR_release-Windows.Amd64.Server2022.Open

- System.IO.Tests.UnseekableDeviceFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: AsyncMemory, writeSize: 10485760, startWithFlush: True)
- System.IO.Tests.AnonymousPipeFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: SyncArray, writeSize: 10485760, startWithFlush: False)

Error message:

Showing first 10 differences
Total number of differences: 0 out of 10485760


Stack trace
   at System.AssertExtensions.SequenceEqual[T](ReadOnlySpan`1 expected, ReadOnlySpan`1 actual) in /_/src/libraries/Common/tests/TestUtilities/System/AssertExtensions.cs:line 473
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1765
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success_Large(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1719
--- End of stack trace from previous location ---
Build Kind Start Time
1612202 Rolling 2022-14-02
1613425 Rolling 2022-15-02
1616198 Rolling 2022-16-02
1616876 Rolling 2022-16-02
1617545 Rolling 2022-17-02
1618267 Rolling 2022-17-02
@VincentBu VincentBu added arch-x86 area-System.IO os-windows blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs labels Feb 14, 2022
@ghost
Copy link

ghost commented Feb 14, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Run: runtime-libraries-coreclr outerloop 20220213.4

Failed test:

net7.0-windows-Release-x86-CoreCLR_release-Windows.Amd64.Server2022.Open

- System.IO.Tests.UnseekableDeviceFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: AsyncMemory, writeSize: 10485760, startWithFlush: True)
- System.IO.Tests.AnonymousPipeFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: SyncArray, writeSize: 10485760, startWithFlush: False)

Error message:

Showing first 10 differences
Total number of differences: 0 out of 10485760


Stack trace
   at System.AssertExtensions.SequenceEqual[T](ReadOnlySpan`1 expected, ReadOnlySpan`1 actual) in /_/src/libraries/Common/tests/TestUtilities/System/AssertExtensions.cs:line 473
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1765
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success_Large(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1719
--- End of stack trace from previous location ---
Author: VincentBu
Assignees: -
Labels:

arch-x86, area-System.IO, os-windows, blocking-outerloop

Milestone: -

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 14, 2022
@adamsitnik adamsitnik self-assigned this Feb 14, 2022
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Feb 14, 2022
@adamsitnik
Copy link
Member

To my surprise I am able to reproduce it locally:

.\build.cmd -c Release -subset clr+libs+libs.tests -arch x86
.\dotnet.cmd build .\src\libraries\System.IO.FileSystem\tests\System.IO.FileSystem.Tests.csproj /t:Test /p:Configuration=Release /p:TargetArchitecture=x86 /p:Outerloop=true

@adamsitnik adamsitnik changed the title Test failure System.IO.Tests.UnseekableDeviceFileStreamConnectedConformanceTests.ReadWrite_Success_Large [x86] SequenceEqual returns false for equal spans Feb 14, 2022
@ghost
Copy link

ghost commented Feb 14, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Run: runtime-libraries-coreclr outerloop 20220213.4

Failed test:

net7.0-windows-Release-x86-CoreCLR_release-Windows.Amd64.Server2022.Open

- System.IO.Tests.UnseekableDeviceFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: AsyncMemory, writeSize: 10485760, startWithFlush: True)
- System.IO.Tests.AnonymousPipeFileStreamConnectedConformanceTests.ReadWrite_Success_Large(mode: SyncArray, writeSize: 10485760, startWithFlush: False)

Error message:

Showing first 10 differences
Total number of differences: 0 out of 10485760


Stack trace
   at System.AssertExtensions.SequenceEqual[T](ReadOnlySpan`1 expected, ReadOnlySpan`1 actual) in /_/src/libraries/Common/tests/TestUtilities/System/AssertExtensions.cs:line 473
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1765
   at System.IO.Tests.ConnectedStreamConformanceTests.ReadWrite_Success_Large(ReadWriteMode mode, Int32 writeSize, Boolean startWithFlush) in /_/src/libraries/Common/tests/StreamConformanceTests/System/IO/StreamConformanceTests.cs:line 1719
--- End of stack trace from previous location ---
Author: VincentBu
Assignees: adamsitnik
Labels:

arch-x86, area-System.Memory, os-windows, blocking-outerloop

Milestone: -

@adamsitnik adamsitnik removed their assignment Feb 14, 2022
@adamsitnik
Copy link
Member

Total number of differences: 0 out of 10485760

It seems to be a bug in System.Memory or codegen, as SequenceEquals returns true, but the helper fails to find any differences:

if (!expected.SequenceEqual(actual))
{
if (expected.Length != actual.Length)
{
throw new XunitException($"Expected: Span of length {expected.Length}{Environment.NewLine}Actual: Span of length {actual.Length}");
}
else
{
const int MaxDiffsToShow = 10; // arbitrary; enough to be useful, hopefully, but still manageable
int diffCount = 0;
string message = $"Showing first {MaxDiffsToShow} differences{Environment.NewLine}";
for (int i = 0; i < expected.Length; i++)
{
if (!expected[i].Equals(actual[i]))
{
diffCount++;
// Add up to 10 differences to the exception message
if (diffCount <= MaxDiffsToShow)
{
message += $" Position {i}: Expected: {expected[i]}, Actual: {actual[i]}{Environment.NewLine}";
}
}
}
message += $"Total number of differences: {diffCount} out of {expected.Length}";

I was able to reproduce it using the following test:

public class Repro
{
    [Fact]
    public void TryToMimic()
    {
        while (true)
        {
            byte[] writerBytes = RandomNumberGenerator.GetBytes(10 * 1024 * 1024);
            var readerBytes = new byte[writerBytes.Length];

            writerBytes.CopyTo(readerBytes, 0);

            Assert.True(writerBytes.SequenceEqual(readerBytes));
        }
    }
}

@EgorBo is there any chance you could take a look?

@EgorBo EgorBo self-assigned this Feb 14, 2022
@EgorBo
Copy link
Member

EgorBo commented Feb 14, 2022

Thanks for the minimal repro, looking at it now

@EgorBo
Copy link
Member

EgorBo commented Feb 15, 2022

I can't reproduce it locally 🙁 the

.\build.cmd -c Release -subset clr+libs+libs.tests -arch x86
.\dotnet.cmd build .\src\libraries\System.IO.FileSystem\tests\System.IO.FileSystem.Tests.csproj /t:Test /p:Configuration=Release /p:TargetArchitecture=x86 /p:Outerloop=true

and the repro specifically finish just fine for me (win10-x64, core i7 8700K)

@adamsitnik
Copy link
Member

I can't reproduce it locally 🙁 the

I can't always reproduce it when running all System.IO.FileSystem tests, but if I run the small repro for a minute or two it fails. Have you tried that?

@EgorBo
Copy link
Member

EgorBo commented Feb 15, 2022

I can't reproduce it locally 🙁 the

I can't always reproduce it when running all System.IO.FileSystem tests, but if I run the small repro for a minute or two it fails. Have you tried that?

ah, let me try to run it a bit longer then 👍

@jkotas
Copy link
Member

jkotas commented Feb 15, 2022

@jkotas jkotas added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Feb 15, 2022
@adamsitnik
Copy link
Member

It looks like a Windows x86 specific GC hole to me.

@jkotas could you please explain what makes you think it's a GC hole? I am just curious what symptoms GC holes have in general.

I was expecting it to be a miss-aligned read since it does not reproduce always and there is no hang/AV.

@jkotas
Copy link
Member

jkotas commented Feb 15, 2022

https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/clr-code-guide.md#211-how-gc-holes-are-created explains what GC hole is.

The typical manifestations of GC holes are intermittent crashes and data corruptions.

@adamsitnik
Copy link
Member

@jkotas thanks for the pointer!

A GC hole occurs when code inside the CLR creates a reference to a GC object, neglects to tell the GC about that reference, performs some operation that directly or indirectly triggers a GC, then tries to use the original reference. At this point, the reference points to garbage memory and the CLR will either read out a wrong value or corrupt whatever that reference is pointing to.

Is my understanding correct that switching from managed to native memory should "solve" the GC hole issue for my small repro ?

Because I am still able to repro it using the following code:

const int size = 10 * 1024 * 1024;

while (true)
{
    void* writerBytesPointer = NativeMemory.Alloc(size);
    void* readerBytesPointer = NativeMemory.Alloc(size);

    try
    {
        Span<byte> writerBytes = new Span<byte>(writerBytesPointer, size);
        Span<byte> readerBytes = new Span<byte>(readerBytesPointer, size);

        RandomNumberGenerator.Fill(writerBytes);
        writerBytes.CopyTo(readerBytes);

        Assert.True(writerBytes.SequenceEqual(readerBytes));
    }
    finally
    {
        NativeMemory.Free(readerBytesPointer);
        NativeMemory.Free(writerBytesPointer);
    }
}

@EgorBo
Copy link
Member

EgorBo commented Feb 15, 2022

Is my understanding correct that switching from managed to native memory should "solve" the GC hole issue for my small repro ?

GC holes can be anywhere even in a simple C#-only code where JIT doesn't report gcinfo correctly so when GC interupts it doesn't update all regs correctly.

Are you on the most recent Main btw?

@adamsitnik
Copy link
Member

I am on:

Author:			Jakob Botsch Nielsen <[email protected]>
Date:			1 day ago (2/14/2022 11:08:08 AM)
Committer:		GitHub <[email protected]>
Commit hash:	e23382c32bf03d3fea5a2f1157a99148a51ca3f4
Child:			Commit index
Parent:			6e05d78d

Keep information about CSE defs on assignment nodes (#64896)

Reuse the gtCSEnum field on assignments to signal "this is a performed
CSE def". gtCanSwapOrder uses the CSE def information to reason about
evaluation order, so without keeping this information we could risk
substitution a CSE def, followed by reordering operands, and then
finally substituting a CSE use.

Fix #64883
Contained in branches:
main

@adamsitnik
Copy link
Member

If I run just this unit test, I am not able to repro.

Should I get a memory dump?

@EgorBo
Copy link
Member

EgorBo commented Feb 15, 2022

If I run just this unit test, I am not able to repro.

Should I get a memory dump?

wouldn't hurt I'd guess.

I've been running your repro in the FileSystem.IO test suite for 7 minutes already and still no signs of the bug

@adamsitnik
Copy link
Member

@adamsitnik Could you please run the simple repo from #47016 (comment) on your machine to see whether it hits the problem?

@jkotas it does:

PS D:\projects\runtime\artifacts\bin\testhost\net7.0-windows-Release-x86\shared\Microsoft.NETCore.App\7.0.0> .\corerun.exe C:\Users\adsitnik\source\repos\SequenceEqualRepro\bin\Release\net6.0\SequenceEqualRepro.dll
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at SequenceEqualRepro.Program.<Main>g__Work|1_0(Object _) in C:\Users\adsitnik\source\repos\SequenceEqualRepro\Program.cs:line 23
   at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute() in D:\projects\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPoolWorkQueue.cs:line 965
   at System.Threading.ThreadPoolWorkQueue.Dispatch() in D:\projects\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ThreadPoolWorkQueue.cs:line 715
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() in D:\projects\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\PortableThreadPool.WorkerThread.cs:line 77
   at System.Threading.Thread.StartCallback() in D:\projects\runtime\src\coreclr\System.Private.CoreLib\src\System\Threading\Thread.CoreCLR.cs:line 105

@davidwrighton
Copy link
Member

@VSadov I'm pretty sure the only path where AVX registers need careful handling is the restore path used in suspension. Fortunately, the AVX registers are not considered saved state for the purposes of function calls, so EH logic has to tolerate them being trashed anyways.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 17, 2022
@VSadov
Copy link
Member

VSadov commented Feb 17, 2022

@davidwrighton Right. extended registers are volatile and thus in a case of software exception that goes via RaiseException will not be preserved anyways. I am less sure what is expected in hardware exception case.

In the fix (#65292) I am no longer using ReplaceExceptionContextRecord to restore suspension context - to separate suspension scenario from exception handling in general. Also because CopyContext is Windows - only API.

ReplaceExceptionContextRecord seems to be specifically trying to handle extended registers. It could be just for suspension/resume case though. I'd defer to you and @janvorli on this.
Perhaps support for extended could be removed there once suspension does not call it.

Ideally, we will switch x86 to the same plan as x64, but it is a bit longer story and I think we need a quick fix for now.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 18, 2022
@filipnavara
Copy link
Member

Should this be reopened to track the need for servicing backport?

@VSadov VSadov reopened this Feb 18, 2022
@ericstj
Copy link
Member

ericstj commented Feb 24, 2022

Should this also be ported to Preview2 (if there is time?)

@jeffschwMSFT
Copy link
Member

Preview 2 is basically complete. We will snap for Preview 3 in the coming weeks. In the meantime we can provide privates if a workaround is needed.

@danmoseley danmoseley removed blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs labels Feb 28, 2022
@danmoseley
Copy link
Member

@MaximLipnin @VSadov I assume this is not blocking anything any more as it is fixed in main and this issue is to track backports?

@jeffschwMSFT
Copy link
Member

Once we have a full fix, we will likely be taking the fix down level to all supported versions.

@VSadov
Copy link
Member

VSadov commented May 4, 2022

the last tracked piece - porting to 3.1 has been completed. Closing.

@VSadov VSadov closed this as completed May 4, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jun 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.