Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: synchronize and copy global profiling data structures #3018

Merged
merged 21 commits into from
May 15, 2023

Conversation

armcknight
Copy link
Member

@armcknight armcknight commented May 9, 2023

📜 Description

We've had several customer reports of crashes involving an enumeration happening in json validation, which loops through a globally-modifiable data structure we should be copying before placing in the profiling payload.

Here's the representative stack trace reported by one customer:

image

This stack trace also appears in the report uploaded by a customer in #2993, although it was not the blamed frame/thread, but it's possible that Crashlytics just isn't analyzing the crash correctly and this is also the root cause of that crash.

💡 Motivation and Context

💚 How did you test it?

📝 Checklist

You have to check all boxes before merging:

  • I reviewed the submitted code.
  • I added tests to verify the changes.
  • No new PII added or SDK only sends newly added PII if sendDefaultPII is enabled.
    - [ ] I updated the docs if needed.
    - [ ] Review from the native team if needed.
  • No breaking change or entry added to the changelog.
  • No breaking change for hybrid SDKs or communicated to hybrid SDKs.

🔮 Next steps

@github-actions
Copy link

github-actions bot commented May 9, 2023

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1200.39 ms 1225.22 ms 24.83 ms
Size 20.76 KiB 434.88 KiB 414.12 KiB

Baseline results on branch: main

Startup times

Revision Plain With Sentry Diff
ddc9b9a 1201.71 ms 1226.70 ms 24.99 ms
03d9eb5 1227.35 ms 1231.24 ms 3.90 ms
0001a09 1223.90 ms 1249.56 ms 25.66 ms
2cf1b84 6265.25 ms 6277.98 ms 12.73 ms
9d56232 1192.09 ms 1228.86 ms 36.77 ms
dc0fe58 1198.56 ms 1220.98 ms 22.42 ms
4d68229 1233.50 ms 1262.92 ms 29.42 ms
8f397a7 1230.10 ms 1253.88 ms 23.77 ms
7192d9e 1221.86 ms 1248.28 ms 26.42 ms
b2f82fa 1236.94 ms 1262.86 ms 25.92 ms

App size

Revision Plain With Sentry Diff
ddc9b9a 20.76 KiB 420.40 KiB 399.65 KiB
03d9eb5 20.76 KiB 436.33 KiB 415.57 KiB
0001a09 20.76 KiB 433.19 KiB 412.43 KiB
2cf1b84 20.76 KiB 431.91 KiB 411.15 KiB
9d56232 20.76 KiB 425.80 KiB 405.04 KiB
dc0fe58 20.76 KiB 436.50 KiB 415.74 KiB
4d68229 20.76 KiB 432.34 KiB 411.58 KiB
8f397a7 20.76 KiB 420.55 KiB 399.79 KiB
7192d9e 20.76 KiB 431.71 KiB 410.95 KiB
b2f82fa 20.76 KiB 419.62 KiB 398.86 KiB

Previous results on branch: armcknight/fix/crashes

Startup times

Revision Plain With Sentry Diff
a753c1a 1208.27 ms 1225.14 ms 16.87 ms
619be64 1207.74 ms 1227.04 ms 19.30 ms
699940a 1205.20 ms 1229.72 ms 24.52 ms
2ff3f84 1250.53 ms 1263.92 ms 13.39 ms
6bc2bbe 1238.92 ms 1262.94 ms 24.02 ms
1e2582a 1233.71 ms 1248.94 ms 15.23 ms
13d5b25 1241.92 ms 1257.10 ms 15.18 ms
c8e8975 1236.02 ms 1249.16 ms 13.14 ms

App size

Revision Plain With Sentry Diff
a753c1a 20.76 KiB 437.02 KiB 416.26 KiB
619be64 20.76 KiB 437.02 KiB 416.26 KiB
699940a 20.76 KiB 434.80 KiB 414.04 KiB
2ff3f84 20.76 KiB 437.02 KiB 416.26 KiB
6bc2bbe 20.76 KiB 437.16 KiB 416.40 KiB
1e2582a 20.76 KiB 437.02 KiB 416.26 KiB
13d5b25 20.76 KiB 437.02 KiB 416.26 KiB
c8e8975 20.76 KiB 434.80 KiB 414.04 KiB

@codecov
Copy link

codecov bot commented May 10, 2023

Codecov Report

Merging #3018 (7254787) into main (b7bd719) will decrease coverage by 0.164%.
The diff coverage is 94.346%.

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##              main     #3018       +/-   ##
=============================================
- Coverage   88.790%   88.627%   -0.164%     
=============================================
  Files          492       492               
  Lines        52893     53013      +120     
  Branches     18962     18951       -11     
=============================================
+ Hits         46964     46984       +20     
- Misses        4966      5109      +143     
+ Partials       963       920       -43     
Impacted Files Coverage Δ
Sources/Sentry/SentryTransactionContext.mm 100.000% <ø> (ø)
Sources/Sentry/SentryProfileTimeseries.mm 50.000% <81.250%> (-4.099%) ⬇️
Sources/Sentry/SentryProfiler.mm 84.873% <91.558%> (+1.430%) ⬆️
Tests/SentryProfilerTests/SentryProfilerTests.mm 98.880% <100.000%> (+0.767%) ⬆️

... and 51 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b7bd719...7254787. Read the comment docs.

@armcknight armcknight changed the title fix: copy global data structures in a synchronized context fix: copy global profiling data structures in a synchronized context May 10, 2023
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to have a look a the tests. That's my first pass.

CHANGELOG.md Outdated Show resolved Hide resolved
Sources/Sentry/SentryProfileTimeseries.mm Show resolved Hide resolved
+ (NSDictionary<NSString *, id> *)serializeForTransaction:(SentryTransaction *)transaction
profileID:(SentryId *)profileID
{
std::lock_guard<std::mutex> l(_gProfilerLock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: If I'm not mistaken, this lock will be active until we finish serializing the profiling data below. I am unsure if that could cause some slowdowns in the code. I think we could unlock after copying the profiling data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually new logic, it just didn't diff well. It is the same behavior as is found here in the "removed" side of the diff, that came out of the refactor: https://github.com/getsentry/sentry-cocoa/pull/3018/files#diff-9d7d24f33a2554811f36d8236cdd5b9b2c7c2786b12be31e4a05a3934193441cL356

You might be on to something here, but I'm hesitant to change how this lock works in this PR. It forces exclusivity between querying whether the profiler is running, starting and stopping it (not only normal stops, but also due to elapsed timeout timer and app going to background), and querying for profiler info like is happening in this function.

Sources/Sentry/SentryProfiler.mm Show resolved Hide resolved
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Almost LGTM.

philipphofmann and others added 6 commits May 10, 2023 15:36
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 2.3.2 to 2.3.3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@f3feb00...29b1f65)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Copy link
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side LGTM. I leave this up to Indragie, as he requested changes.

@@ -155,7 +143,30 @@
}

{
std::lock_guard<std::mutex> l(_gSamplesArrayLock);
// critical section: here is where all mutable data structures are updated, which must be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stacks is still being mutated outside of the critical section on line 154

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, for some reason I thought I wouldn't need to put the modification of stacks or frames here, but now I can't remember why, because we do put the modification for samples here and it's essentially the same structure. I'll add them.

@armcknight armcknight changed the title fix: copy global profiling data structures in a synchronized context fix: synchronize and copy global profiling data structures May 15, 2023
@armcknight armcknight merged commit f938d24 into main May 15, 2023
@armcknight armcknight deleted the armcknight/fix/crashes branch May 15, 2023 20:43
@github-actions
Copy link

Fails
🚫 Please consider adding a changelog entry for the next release.

Instructions and example for changelog

Please add an entry to CHANGELOG.md to the "Unreleased" section. Make sure the entry includes this PR's number.

Example:

## Unreleased

- synchronize and copy global profiling data structures ([#3018](https://github.com/getsentry/sentry-cocoa/pull/3018))

If none of the above apply, you can opt out of this check by adding #skip-changelog to the PR description.

Generated by 🚫 dangerJS against 7254787

indragiek added a commit that referenced this pull request May 16, 2023
In #3018, there was a change to add locking where `processBacktrace` mutated data structures that could be accessed concurrently. It attempted to separate the code that did not access the shared data structures, and moved all mutating code to a section at the end of the function. However, it missed a few reads from those data structures that also needed to be synchronized (e.g. `frames.count` and `stacks.count`). Generally, separating the critical section from the rest of the logic in that function is not worth the pain because the taking the lock for the entire duration of the function is not expensive.

This refactor adds a new type, `SentryProfilingState`, that centralizes all logic for reading/writing the shared data structures. This has the additional benefit that you no longer need a `processBacktrace` function that takes many parameters where the lifetime of those parameters is unclear (this will be helpful for fixing the memory leaks in a subsequent PR).

---------

Co-authored-by: Sentry Github Bot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants