ci: fix thread sanitizer job #3303

armcknight · 2023-09-25T19:57:57Z

Fixed the races that were detected locally. Here are the outputs from each:
data race in ANR tracker.txt
data race in SentryNetworkTracker.isCaptureFailedRequestsEnabled.txt
data race in SentrySpan.isFinished.txt
data race in SentryThreadInspectorTests.txt
data race in TestThread (1).txt
data race in TestThread (2).txt

TODO

i just discovered the suppressions file listing some skipped tests. Let's work on trying to fix those in a subsequent PR.

#skip-changelog

armcknight · 2023-09-26T05:54:47Z

Current status: "unexpected exit, crash, or test timeout" when run under TSAN in the following test cases

SentryNSDataTrackerTests.testWriteAtomically_CheckTransaction_FilterOut_nonProcessFrames
SentryANRTrackerTests.testAppSuspended_NoANR
SentryANRTrackerTests.testMultipleListeners
-[SentryCrashBinaryImageCacheTests testAddBinaryImageInParallel]
SentryFileManagerTests.testDefaultMaxEnvelopesConcurrent

The message appears after a TSAN data race warning in each of them, along with this message (not sure if it's a red herring): xctest(8596,0x10ef62280) malloc: nano zone abandoned due to inability to reserve vm space.

armcknight · 2023-09-26T23:03:54Z

Right now I'm thinking we can merge this, and after merging main into #3290 et al, see if any of those failures resolve, then we can skip the remaining failures and open subsequent issues to investigate/fix them. What do you think @brustolin ?

armcknight · 2023-09-27T20:43:01Z

I changed #3290 to merge into this branch and reran all the unit tests against it, so let's see if any ANRTracker-based tsan warnings pop up there.

codecov · 2023-10-11T22:42:14Z

Codecov Report

Merging #3303 (2199efa) into main (5f8ee7a) will increase coverage by 0.071%.
The diff coverage is 90.000%.

@@              Coverage Diff              @@
##              main     #3303       +/-   ##
=============================================
+ Coverage   89.276%   89.347%   +0.071%     
=============================================
  Files          500       500               
  Lines        54739     54862      +123     
  Branches     19644     19682       +38     
=============================================
+ Hits         48869     49018      +149     
+ Misses        5002      4972       -30     
- Partials       868       872        +4

Files	Coverage Δ
Sources/Sentry/SentryANRTracker.m	`100.000% <100.000%> (ø)`
Sources/Sentry/SentryFileManager.m	`93.697% <100.000%> (ø)`
Sources/Sentry/SentryFramesTracker.m	`98.484% <100.000%> (+0.023%)`	⬆️
Sources/Sentry/SentryLog.m	`100.000% <100.000%> (ø)`
Sources/Sentry/SentryNetworkTracker.m	`92.953% <100.000%> (+0.512%)`	⬆️
Sources/Sentry/SentrySDK.m	`89.787% <100.000%> (+0.132%)`	⬆️
Sources/Sentry/SentrySpan.m	`95.705% <100.000%> (+0.135%)`	⬆️
Sources/Sentry/SentryStacktraceBuilder.m	`75.384% <100.000%> (+0.384%)`	⬆️
Sources/Sentry/SentryThreadWrapper.m	`100.000% <100.000%> (ø)`
...entryCrash/Recording/SentryCrashBinaryImageCache.c	`98.275% <100.000%> (+2.700%)`	⬆️
... and 11 more

... and 14 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f8ee7a...2199efa. Read the comment docs.

philipphofmann · 2023-10-12T13:53:15Z

The Thread Sanitizer CI job is still failing. @armcknight, what's the state of the PR?

github-actions · 2023-10-18T00:06:35Z

Performance metrics 🚀

	Plain	With Sentry	Diff
Startup time	1238.44 ms	1257.30 ms	18.86 ms
Size	22.85 KiB	414.06 KiB	391.21 KiB

Baseline results on branch: main

Startup times

Revision	Plain	With Sentry	Diff
`a176fc4`	1250.29 ms	1257.88 ms	7.59 ms
`257c2a9`	1231.45 ms	1252.12 ms	20.67 ms
`6943de0`	1237.67 ms	1247.12 ms	9.45 ms
`7419285`	1209.53 ms	1244.72 ms	35.19 ms
`fd6a31c`	1204.73 ms	1222.34 ms	17.61 ms
`98cca71`	1199.08 ms	1227.36 ms	28.28 ms
`9cc7e7c`	1228.90 ms	1237.96 ms	9.06 ms
`1db04d8`	1250.20 ms	1258.12 ms	7.92 ms
`afd1a08`	1207.78 ms	1223.44 ms	15.66 ms
`d8eb419`	1221.91 ms	1253.62 ms	31.71 ms

App size

Revision	Plain	With Sentry	Diff
`a176fc4`	22.84 KiB	403.24 KiB	380.39 KiB
`257c2a9`	20.76 KiB	401.36 KiB	380.60 KiB
`6943de0`	20.76 KiB	393.33 KiB	372.57 KiB
`7419285`	20.76 KiB	432.99 KiB	412.22 KiB
`fd6a31c`	20.76 KiB	436.50 KiB	415.74 KiB
`98cca71`	22.85 KiB	411.14 KiB	388.29 KiB
`9cc7e7c`	22.84 KiB	403.13 KiB	380.29 KiB
`1db04d8`	20.76 KiB	435.50 KiB	414.74 KiB
`afd1a08`	22.84 KiB	402.57 KiB	379.72 KiB
`d8eb419`	22.85 KiB	408.87 KiB	386.02 KiB

Previous results on branch: armcknight/ci/3200-fix-thread-sanitizer

Startup times

Revision	Plain	With Sentry	Diff
`346c775`	1242.31 ms	1256.77 ms	14.46 ms
`c58eb2f`	1235.42 ms	1256.14 ms	20.72 ms
`802d0eb`	1250.78 ms	1257.14 ms	6.36 ms
`b503810`	1196.48 ms	1226.68 ms	30.20 ms
`c773f09`	1222.34 ms	1252.36 ms	30.02 ms
`90da6b1`	1230.67 ms	1240.33 ms	9.65 ms
`c43eb20`	1228.26 ms	1251.85 ms	23.59 ms
`de0b274`	1226.80 ms	1250.34 ms	23.54 ms
`ce5be3f`	1233.71 ms	1264.20 ms	30.49 ms
`f2bc3f1`	1192.73 ms	1224.49 ms	31.76 ms

App size

Revision	Plain	With Sentry	Diff
`346c775`	22.85 KiB	413.08 KiB	390.23 KiB
`c58eb2f`	22.85 KiB	412.62 KiB	389.77 KiB
`802d0eb`	22.85 KiB	411.50 KiB	388.65 KiB
`b503810`	22.85 KiB	413.00 KiB	390.15 KiB
`c773f09`	22.85 KiB	411.50 KiB	388.65 KiB
`90da6b1`	22.85 KiB	412.07 KiB	389.22 KiB
`c43eb20`	22.85 KiB	412.07 KiB	389.22 KiB
`de0b274`	22.85 KiB	412.58 KiB	389.74 KiB
`ce5be3f`	22.85 KiB	411.80 KiB	388.95 KiB
`f2bc3f1`	22.85 KiB	411.49 KiB	388.64 KiB

armcknight · 2023-10-18T17:18:06Z

As of the latest run in CI, it doesn't produce any TSAN warnings, but is again taking long enough to time out the job :/

Looking at the logs, I still see the following failures:

SentryTests.SentryFramesTrackerTests testFrameRateChange
SentryTests.SentryNetworkTrackerIntegrationTests testCustomURLProtocol_BlocksAllRequests
SentryTests.SentryNetworkTrackerIntegrationTests testGetCaptureFailedRequestsEnabled
SentryTests.SentryNetworkTrackerIntegrationTests testGetRequest_CompareSentryTraceHeader
SentryTests.SentryNetworkTrackerIntegrationTests testGetRequest_SpanCreatedAndBaggageHeaderAdded

armcknight · 2023-10-18T21:24:20Z

I've got it narrowed down to one failing test, testStartOnTheMainThread. Added more logging to see what's up.

philipphofmann

I think this PR could be split up into several PRs. Many changes make sense even if we don't enable the thread sanitizer in CI again. You don't have to do this as the PR is still manageable, but if it gets bigger, I would rather split it up.

philipphofmann · 2023-10-20T13:36:09Z

Sources/Sentry/SentryDependencyContainer.m

@@ -60,6 +60,12 @@ + (instancetype)sharedInstance

 + (void)reset
 {
+#if !TARGET_OS_WATCH


I think this could solve some flaky reachability tests. IMO, this should be in an extra PR, so we can merge it right away. I don't want to steal your credit, so I leave it up to you if you want to open a PR or not. Otherwise, I can do it.

Yep, that's exactly what it was for. In general I did plan to split up some of the fixes in here... I've just been trying to plow through anything that could possibly be causing problems in the TSAN tests for now. Easier to test the combinations together, then split them apart for discussion/historical purposes afterwards!

philipphofmann · 2023-10-20T13:38:08Z

Sources/Sentry/SentrySDK.m

    // We accept the tradeoff that the SDK might not be fully initialized directly after
    // initializing it on a background thread because scheduling the init synchronously on the main
    // thread could lead to deadlocks.


m: I think that comment should stay directly above [SentryThreadWrapper onMainThread:^{

philipphofmann · 2023-10-20T13:41:26Z

Tests/SentryTests/SentryCrash/SentryStacktraceBuilderTests.swift

            let filteredFrames = await self.firstFrame()
            waitForAsyncToRun.fulfill()
            XCTAssertGreaterThanOrEqual(filteredFrames, 3, "The Stacktrace must include the async callers.")
        }
-        wait(for: [waitForAsyncToRun], timeout: 1)
+
+        wait(for: [waitForAsyncToRun], timeout: 10)


m: Increasing this to 10 looks a bit scary cause it can slow down our test suite. What made you increase this value?

looking at the test logs, I saw it taking 5-6 seconds between the log statements that define the boundary completing the async task. no idea why though 🙁

we can either special case it for tsan only, or skip these completely, but that seems unwise given what they're supposed to test.

Maybe tsan doesn't work well with backtrace_async. I would rather special case them for tsan.

armcknight · 2023-10-20T18:48:08Z

So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first.

If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ?

…ly create it

philipphofmann

So, this finally got a green result on the tsan job, but I'm sure it's going to be really flaky moving forward. I think we could probably merge these changes, although I'm going to separate out some unrelated changes first.

I rather not enable tsan if it's flaky cause we won't trust it and ignore it most likely.

If it continues to cause problems, we should consider making TSAN jobs non-failable,, and instead only report TSAN warnings and test failures as comments/annotations on PRs instead. WDYT @philipphofmann ?

I don't think making them non-failable is a good idea cause I fear that we are going to ignore them if they are flaky. It could be a first step though, so your efforts here are not lost, if you think getting them stable is hard to achieve.

scripts/tests-with-thread-sanitizer.sh

armcknight · 2023-10-23T19:09:03Z

I don't think making them non-failable is a good idea cause I fear that we are going to ignore them if they are flaky. It could be a first step though, so your efforts here are not lost, if you think getting them stable is hard to achieve.

It's either that or we decide to throw them out, which seems unwise. It was finding actual issues. And, as I noted above, someone has been skipping tests anyways, as there is a suppression file with some test cases listed to skip.

It's a matter of how much effort we're willing to put into getting them stable and keeping them there.

I think it'll have better signal to noise ratio if it visibly comments on PRs that there is a warning on a given line of code, vs just a failing CI job that requires us to navigate through GitHub's clunky UI for checks and rerun it over and over again. Because what I'm seeing now are mostly flaky test failures, not failures specifically due to TSAN warnings like the text files I attached to this PR description.

I think that's better in any case because even if it fails, it's still a pain to track down what the TSAN failure actually was in said clunky UI, it's like 7-8 clicks. It'd be much better to have it bubble up in a comment where it's immediately visible.

…x-thread-sanitizer

philipphofmann

Then, let's go with GH PR comments and see how it goes 👍. CI is complaining a bit. @armcknight, I think it's best to merge this PR after making TSAN non-failable, and add the GH PR comments in an extra PR.

scripts/tests-with-thread-sanitizer.sh

philipphofmann · 2023-10-24T07:05:19Z

Tests/SentryTests/SentryCrash/SentryStacktraceBuilderTests.swift

+
+        // observed the async task taking a long time to finish if TSAN is attached
+        var timeout = 1
+        if threadSanitizerIsPresent() {


philipphofmann

LGTM, when CI is green and TSAN is non-failable.

armcknight · 2023-10-24T18:38:35Z

Issue to track work for reporting TSAN warnings to PRs: #3358

…x-thread-sanitizer

armcknight · 2023-11-02T20:26:05Z

One of the UI tests started failing with no obviously related code changes. Still investigating.

philipphofmann · 2023-11-03T09:33:52Z

@armcknight, that could be because of #3372. Merging main back into your branch could solve the problem.

Every log message needs to acquire a lock to evaluate whether the logger should log it. #3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.

Every log message needs to acquire a lock to evaluate whether the logger should log it. getsentry#3303 added this logic to fix a data race the thread sanitizer job found. As the SDK only calls the configure method when starting, we don't need to put a synchronized keyword around the code that evaluates the log level. This PR replaces the synchronized keyword by ignoring the thread sanitizer.

armcknight requested review from philipphofmann and brustolin as code owners September 25, 2023 19:57

armcknight changed the title ~~DONTMERGE: remove all but thread sanitizer job to test in ci~~ ci: fix thread sanitizer job Sep 25, 2023

armcknight linked an issue Sep 25, 2023 that may be closed by this pull request

Fix Thread Sanitizer #3200

Closed

armcknight mentioned this pull request Sep 25, 2023

fix: App hang with race condition for tick counter #3290

Merged

7 tasks

philipphofmann reviewed Oct 20, 2023

View reviewed changes

fix: reenable TSAN and fix warnings

d99c31e

armcknight force-pushed the armcknight/ci/3200-fix-thread-sanitizer branch from 372aa8d to d99c31e Compare October 20, 2023 23:48

armcknight added 3 commits October 20, 2023 15:49

fix changelog

8ba3fc8

tweak test code

624dd47

log frame tracker instances; its showing up in tests that dont actual…

a155788

…ly create it

philipphofmann reviewed Oct 23, 2023

View reviewed changes

scripts/tests-with-thread-sanitizer.sh Show resolved Hide resolved

armcknight added 2 commits October 23, 2023 15:08

Merge remote-tracking branch 'origin/main' into armcknight/ci/3200-fi…

5a0ca24

…x-thread-sanitizer

special case timeouts for TSAN

c2b0dd4

philipphofmann reviewed Oct 24, 2023

View reviewed changes

philipphofmann approved these changes Oct 24, 2023

View reviewed changes

armcknight added 2 commits October 24, 2023 10:34

fix test build issue

dc9931f

make thread sanitizer nonfailable in CI

8e946de

armcknight mentioned this pull request Oct 24, 2023

Report thread sanitizer warnings as PR comments #3358

Open

fix watch/tv builds

2c55669

This was referenced Oct 27, 2023

ref: use xcode-test.sh to run sanitizers #3318

Closed

Data races in custom instrumentation (SentrySpan/SentryTracer) and Sentry initialization #3069

Closed

armcknight added 3 commits October 30, 2023 11:59

never allow the tsan ci job to fail

8f0901f

try different method to drop frames

9bb4d5a

Merge remote-tracking branch 'origin/main' into armcknight/ci/3200-fi…

52e055f

…x-thread-sanitizer

Merge branch 'main' into armcknight/ci/3200-fix-thread-sanitizer

2199efa

armcknight merged commit 8a0317e into main Nov 3, 2023
71 checks passed

armcknight deleted the armcknight/ci/3200-fix-thread-sanitizer branch November 3, 2023 17:37

armcknight mentioned this pull request Nov 3, 2023

meta: changelog tsan fixes #3377

Merged

philipphofmann mentioned this pull request May 3, 2024

impr: Remove not needed log for logging #3934

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: fix thread sanitizer job #3303

ci: fix thread sanitizer job #3303

armcknight commented Sep 25, 2023 •

edited

Loading

armcknight commented Sep 26, 2023

armcknight commented Sep 26, 2023 •

edited

Loading

armcknight commented Sep 27, 2023

codecov bot commented Oct 11, 2023 •

edited

Loading

philipphofmann commented Oct 12, 2023

github-actions bot commented Oct 18, 2023 •

edited

Loading

Baseline results on branch: main

Startup times

App size

Previous results on branch: armcknight/ci/3200-fix-thread-sanitizer

Startup times

App size

armcknight commented Oct 18, 2023

armcknight commented Oct 18, 2023

philipphofmann left a comment

philipphofmann Oct 20, 2023

armcknight Oct 20, 2023

armcknight Oct 21, 2023

philipphofmann Oct 20, 2023

philipphofmann Oct 20, 2023

armcknight Oct 20, 2023

armcknight Oct 20, 2023

philipphofmann Oct 23, 2023

armcknight commented Oct 20, 2023

philipphofmann left a comment

armcknight commented Oct 23, 2023

philipphofmann left a comment •

edited

Loading

philipphofmann Oct 24, 2023

philipphofmann left a comment •

edited

Loading

armcknight commented Oct 24, 2023

armcknight commented Nov 2, 2023

philipphofmann commented Nov 3, 2023 •

edited

Loading

ci: fix thread sanitizer job #3303

ci: fix thread sanitizer job #3303

Conversation

armcknight commented Sep 25, 2023 • edited Loading

TODO

armcknight commented Sep 26, 2023

armcknight commented Sep 26, 2023 • edited Loading

armcknight commented Sep 27, 2023

codecov bot commented Oct 11, 2023 • edited Loading

Codecov Report

philipphofmann commented Oct 12, 2023

github-actions bot commented Oct 18, 2023 • edited Loading

Performance metrics 🚀

Baseline results on branch: main

Startup times

App size

Previous results on branch: armcknight/ci/3200-fix-thread-sanitizer

Startup times

App size

armcknight commented Oct 18, 2023

armcknight commented Oct 18, 2023

philipphofmann left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

armcknight commented Oct 20, 2023

philipphofmann left a comment

Choose a reason for hiding this comment

armcknight commented Oct 23, 2023

philipphofmann left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

philipphofmann left a comment • edited Loading

Choose a reason for hiding this comment

armcknight commented Oct 24, 2023

armcknight commented Nov 2, 2023

philipphofmann commented Nov 3, 2023 • edited Loading

armcknight commented Sep 25, 2023 •

edited

Loading

armcknight commented Sep 26, 2023 •

edited

Loading

codecov bot commented Oct 11, 2023 •

edited

Loading

github-actions bot commented Oct 18, 2023 •

edited

Loading

philipphofmann left a comment •

edited

Loading

philipphofmann left a comment •

edited

Loading

philipphofmann commented Nov 3, 2023 •

edited

Loading