Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: fix thread sanitizer job #3303

Merged
merged 13 commits into from
Nov 3, 2023
Merged

Conversation

armcknight
Copy link
Member

@armcknight armcknight commented Sep 25, 2023

fixes #3200 and #3069

Fixed the races that were detected locally. Here are the outputs from each:
data race in ANR tracker.txt
data race in SentryNetworkTracker.isCaptureFailedRequestsEnabled.txt
data race in SentrySpan.isFinished.txt
data race in SentryThreadInspectorTests.txt
data race in TestThread (1).txt
data race in TestThread (2).txt
data race in binaryImageAdded
data race in SentryFileManager currentFileCounter

TODO

  • i just discovered the suppressions file listing some skipped tests. Let's work on trying to fix those in a subsequent PR.

#skip-changelog

@armcknight armcknight changed the title DONTMERGE: remove all but thread sanitizer job to test in ci ci: fix thread sanitizer job Sep 25, 2023
@armcknight armcknight linked an issue Sep 25, 2023 that may be closed by this pull request
@armcknight
Copy link
Member Author

Current status: "unexpected exit, crash, or test timeout" when run under TSAN in the following test cases

  • SentryNSDataTrackerTests.testWriteAtomically_CheckTransaction_FilterOut_nonProcessFrames
  • SentryANRTrackerTests.testAppSuspended_NoANR
  • SentryANRTrackerTests.testMultipleListeners
  • -[SentryCrashBinaryImageCacheTests testAddBinaryImageInParallel]
  • SentryFileManagerTests.testDefaultMaxEnvelopesConcurrent

The message appears after a TSAN data race warning in each of them, along with this message (not sure if it's a red herring): xctest(8596,0x10ef62280) malloc: nano zone abandoned due to inability to reserve vm space.

@armcknight
Copy link
Member Author

armcknight commented Sep 26, 2023

Right now I'm thinking we can merge this, and after merging main into #3290 et al, see if any of those failures resolve, then we can skip the remaining failures and open subsequent issues to investigate/fix them. What do you think @brustolin ?

@armcknight
Copy link
Member Author

I changed #3290 to merge into this branch and reran all the unit tests against it, so let's see if any ANRTracker-based tsan warnings pop up there.

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #3303 (2199efa) into main (5f8ee7a) will increase coverage by 0.071%.
The diff coverage is 90.000%.

Impacted file tree graph

@@              Coverage Diff              @@
##              main     #3303       +/-   ##
=============================================
+ Coverage   89.276%   89.347%   +0.071%     
=============================================
  Files          500       500               
  Lines        54739     54862      +123     
  Branches     19644     19682       +38     
=============================================
+ Hits         48869     49018      +149     
+ Misses        5002      4972       -30     
- Partials       868       872        +4     
Files Coverage Δ
Sources/Sentry/SentryANRTracker.m 100.000% <100.000%> (ø)
Sources/Sentry/SentryFileManager.m 93.697% <100.000%> (ø)
Sources/Sentry/SentryFramesTracker.m 98.484% <100.000%> (+0.023%) ⬆️
Sources/Sentry/SentryLog.m 100.000% <100.000%> (ø)
Sources/Sentry/SentryNetworkTracker.m 92.953% <100.000%> (+0.512%) ⬆️
Sources/Sentry/SentrySDK.m 89.787% <100.000%> (+0.132%) ⬆️
Sources/Sentry/SentrySpan.m 95.705% <100.000%> (+0.135%) ⬆️
Sources/Sentry/SentryStacktraceBuilder.m 75.384% <100.000%> (+0.384%) ⬆️
Sources/Sentry/SentryThreadWrapper.m 100.000% <100.000%> (ø)
...entryCrash/Recording/SentryCrashBinaryImageCache.c 98.275% <100.000%> (+2.700%) ⬆️
... and 11 more

... and 14 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f8ee7a...2199efa. Read the comment docs.

@philipphofmann
Copy link
Member

The Thread Sanitizer CI job is still failing. @armcknight, what's the state of the PR?

@github-actions
Copy link

github-actions bot commented Oct 18, 2023

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1238.44 ms 1257.30 ms 18.86 ms
Size 22.85 KiB 414.06 KiB 391.21 KiB

Baseline results on branch: main

Startup times

Revision Plain With Sentry Diff
a176fc4 1250.29 ms 1257.88 ms 7.59 ms
257c2a9 1231.45 ms 1252.12 ms 20.67 ms
6943de0 1237.67 ms 1247.12 ms 9.45 ms
7419285 1209.53 ms 1244.72 ms 35.19 ms
fd6a31c 1204.73 ms 1222.34 ms 17.61 ms
98cca71 1199.08 ms 1227.36 ms 28.28 ms
9cc7e7c 1228.90 ms 1237.96 ms 9.06 ms
1db04d8 1250.20 ms 1258.12 ms 7.92 ms
afd1a08 1207.78 ms 1223.44 ms 15.66 ms
d8eb419 1221.91 ms 1253.62 ms 31.71 ms

App size

Revision Plain With Sentry Diff
a176fc4 22.84 KiB 403.24 KiB 380.39 KiB
257c2a9 20.76 KiB 401.36 KiB 380.60 KiB
6943de0 20.76 KiB 393.33 KiB 372.57 KiB
7419285 20.76 KiB 432.99 KiB 412.22 KiB
fd6a31c 20.76 KiB 436.50 KiB 415.74 KiB
98cca71 22.85 KiB 411.14 KiB 388.29 KiB
9cc7e7c 22.84 KiB 403.13 KiB 380.29 KiB
1db04d8 20.76 KiB 435.50 KiB 414.74 KiB
afd1a08 22.84 KiB 402.57 KiB 379.72 KiB
d8eb419 22.85 KiB 408.87 KiB 386.02 KiB

Previous results on branch: armcknight/ci/3200-fix-thread-sanitizer

Startup times

Revision Plain With Sentry Diff
346c775 1242.31 ms 1256.77 ms 14.46 ms
c58eb2f 1235.42 ms 1256.14 ms 20.72 ms
802d0eb 1250.78 ms 1257.14 ms 6.36 ms
b503810 1196.48 ms 1226.68 ms 30.20 ms
c773f09 1222.34 ms 1252.36 ms 30.02 ms
90da6b1 1230.67 ms 1240.33 ms 9.65 ms
c43eb20 1228.26 ms 1251.85 ms 23.59 ms
de0b274 1226.80 ms 1250.34 ms 23.54 ms
ce5be3f 1233.71 ms 1264.20 ms 30.49 ms
f2bc3f1 1192.73 ms 1224.49 ms 31.76 ms

App size

Revision Plain With Sentry Diff
346c775 22.85 KiB 413.08 KiB 390.23 KiB
c58eb2f 22.85 KiB 412.62 KiB 389.77 KiB
802d0eb 22.85 KiB 411.50 KiB 388.65 KiB
b503810 22.85 KiB 413.00 KiB 390.15 KiB
c773f09 22.85 KiB 411.50 KiB 388.65 KiB
90da6b1 22.85 KiB 412.07 KiB 389.22 KiB
c43eb20 22.85 KiB 412.07 KiB 389.22 KiB
de0b274 22.85 KiB 412.58 KiB 389.74 KiB
ce5be3f 22.85 KiB 411.80 KiB 388.95 KiB
f2bc3f1 22.85 KiB 411.49 KiB 388.64 KiB

@armcknight
Copy link
Member Author

As of the latest run in CI, it doesn't produce any TSAN warnings, but is again taking long enough to time out the job :/

Looking at the logs, I still see the following failures:

  • SentryTests.SentryFramesTrackerTests testFrameRateChange
  • SentryTests.SentryNetworkTrackerIntegrationTests testCustomURLProtocol_BlocksAllRequests
  • SentryTests.SentryNetworkTrackerIntegrationTests testGetCaptureFailedRequestsEnabled
  • SentryTests.SentryNetworkTrackerIntegrationTests testGetRequest_CompareSentryTraceHeader
  • SentryTests.SentryNetworkTrackerIntegrationTests testGetRequest_SpanCreatedAndBaggageHeaderAdded

@armcknight
Copy link
Member Author

I've got it narrowed down to one failing test, testStartOnTheMainThread. Added more logging to see what's up.

Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR could be split up into several PRs. Many changes make sense even if we don't enable the thread sanitizer in CI again. You don't have to do this as the PR is still manageable, but if it gets bigger, I would rather split it up.

@@ -60,6 +60,12 @@ + (instancetype)sharedInstance

+ (void)reset
{
#if !TARGET_OS_WATCH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could solve some flaky reachability tests. IMO, this should be in an extra PR, so we can merge it right away. I don't want to steal your credit, so I leave it up to you if you want to open a PR or not. Otherwise, I can do it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's exactly what it was for. In general I did plan to split up some of the fixes in here... I've just been trying to plow through anything that could possibly be causing problems in the TSAN tests for now. Easier to test the combinations together, then split them apart for discussion/historical purposes afterwards!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #3356

Comment on lines 142 to 144
// We accept the tradeoff that the SDK might not be fully initialized directly after
// initializing it on a background thread because scheduling the init synchronously on the main
// thread could lead to deadlocks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: I think that comment should stay directly above [SentryThreadWrapper onMainThread:^{

let filteredFrames = await self.firstFrame()
waitForAsyncToRun.fulfill()
XCTAssertGreaterThanOrEqual(filteredFrames, 3, "The Stacktrace must include the async callers.")
}
wait(for: [waitForAsyncToRun], timeout: 1)

wait(for: [waitForAsyncToRun], timeout: 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Increasing this to 10 looks a bit scary cause it can slow down our test suite. What made you increase this value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at the test logs, I saw it taking 5-6 seconds between the log statements that define the boundary completing the async task. no idea why though 🙁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can either special case it for tsan only, or skip these completely, but that seems unwise given what they're supposed to test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe tsan doesn't work well with backtrace_async. I would rather special case them for tsan.

@armcknight
Copy link
Member Author

So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first.

If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ?

@armcknight armcknight force-pushed the armcknight/ci/3200-fix-thread-sanitizer branch from 372aa8d to d99c31e Compare October 20, 2023 23:48
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first.

I rather not enable tsan if it's flaky cause we won't trust it and ignore it most likely.

If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ?

I don't think making them non-failable is a good idea cause I fear that we are going to ignore them if they are flaky. It could be a first step though, so your efforts here are not lost, if you think getting them stable is hard to achieve.

scripts/tests-with-thread-sanitizer.sh Show resolved Hide resolved
@armcknight
Copy link
Member Author

I don't think making them non-failable is a good idea cause I fear that we are going to ignore them if they are flaky. It could be a first step though, so your efforts here are not lost, if you think getting them stable is hard to achieve.

It's either that or we decide to throw them out, which seems unwise. It was finding actual issues. And, as I noted above, someone has been skipping tests anyways, as there is a suppression file with some test cases listed to skip.

It's a matter of how much effort we're willing to put into getting them stable and keeping them there.

I think it'll have better signal to noise ratio if it visibly comments on PRs that there is a warning on a given line of code, vs just a failing CI job that requires us to navigate through GitHub's clunky UI for checks and rerun it over and over again. Because what I'm seeing now are mostly flaky test failures, not failures specifically due to TSAN warnings like the text files I attached to this PR description.

I think that's better in any case because even if it fails, it's still a pain to track down what the TSAN failure actually was in said clunky UI, it's like 7-8 clicks. It'd be much better to have it bubble up in a comment where it's immediately visible.

Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, let's go with GH PR comments and see how it goes 👍. CI is complaining a bit. @armcknight, I think it's best to merge this PR after making TSAN non-failable, and add the GH PR comments in an extra PR.

scripts/tests-with-thread-sanitizer.sh Show resolved Hide resolved

// observed the async task taking a long time to finish if TSAN is attached
var timeout = 1
if threadSanitizerIsPresent() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, when CI is green and TSAN is non-failable.

@armcknight
Copy link
Member Author

Issue to track work for reporting TSAN warnings to PRs: #3358

@armcknight
Copy link
Member Author

One of the UI tests started failing with no obviously related code changes. Still investigating.

@philipphofmann
Copy link
Member

philipphofmann commented Nov 3, 2023

@armcknight, that could be because of #3372. Merging main back into your branch could solve the problem.

@armcknight armcknight merged commit 8a0317e into main Nov 3, 2023
71 checks passed
@armcknight armcknight deleted the armcknight/ci/3200-fix-thread-sanitizer branch November 3, 2023 17:37
philipphofmann added a commit that referenced this pull request May 3, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. #3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
philipphofmann added a commit that referenced this pull request May 3, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. #3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
philipphofmann added a commit that referenced this pull request May 3, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. #3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
philipphofmann added a commit that referenced this pull request May 6, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. #3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
dKasabwala pushed a commit to dKasabwala/sentry-cocoa that referenced this pull request May 6, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. getsentry#3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
threema-matteo pushed a commit to threema-ch/sentry-cocoa that referenced this pull request May 21, 2024
Every log message needs to acquire a lock to evaluate whether the logger
should log it. getsentry#3303 added this logic to fix a data race the thread
sanitizer job found. As the SDK only calls the configure method when
starting, we don't need to put a synchronized keyword around the code
that evaluates the log level. This PR replaces the synchronized keyword
by ignoring the thread sanitizer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Thread Sanitizer
2 participants