Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose StressLog via CDAC and port StressLogAnalyzer to managed code #104999

Open
wants to merge 39 commits into
base: main
Choose a base branch
from

Conversation

jkoritzinsky
Copy link
Member

Current state in main

StressLog is an in-memory logging apparatus in CoreCLR and NativeAOT. This log is an in-memory circular log that stores messages as "offsets to format string in memory and array of arguments as size_t values".

This format has been generally stable but has evolved slowly over time. Unlike most types in CoreCLR, the StressLog was never DACized. Furthermore, it was never abstracted on the SOS-DAC boundary. As a result, SOS has intimate knowledge of the memory layout and format of the StressLog structures. The StressLog types are one of the reasons the SOSBreakingChangeVersion exists (and are responsible for 2 of the 4 version increases done over the years).

SOS supports the stresslog types by having copies of the stresslog header in its codebase and implementing some of the methods only declared when STRESS_LOG_READONLY is defined. The !dumplog and !histinit and other GC history commands utilize the stress log.

Additionally, dotnet/runtime has a tool called StressLogAnalyzer that supports reading memory-mapped stress logs (which don't require dumps). This code-base has a modified copy of the stress-log dumping/reading code from dotnet/diagnostics (including some dead code and some shims to make things compile) to support dumping and analyzing GC behavior based on stress logs.

After this PR

With this PR, a StressLog contract is defined in the cDAC. This contract provides 3 versions to enable SOS to use cDAC for all stress log dumping, even for older target runtimes. These versions differ by message format and the lookup process for the format string from the provided offset. Each of these messages corresponds to a different SOS Breaking Change Version. For versions that are supported by memory-mapped stress logs, each version corresponds to a version of the memory-mapped stress log. As there are few versions and they only differ by pointer-size, the contract descriptors for these cases can be hard-coded in SOS if a descriptor is not available.

CoreCLR (and NativeAOT) will implement version 2 of the contract, which is the newest version in this PR.

The StressLog contract exposes the information needed to dump the header of the stress log in SOS, get the logs for all threads, and get all messages from a given thread's log.

The contract does not include an implementation of formatting a stress log message into a string, but it does include a conceptual description of how to do so.

StressLogAnalyzer has been re-written into C# and uses the cDAC reader to read the StressLog. The following behavior differences are noted:

  • Specifying -tid without a thread while also specifying a thread will not change thread ids to hex. When filtering threads, specify --hex to get hex thread ids.
  • Specifying -f without a format string while also specifying a format string will not output format strings. When specifying format strings, pass -pf to print format strings as well. 

Follow-ups

This PR does not present a mechanism to push a managed cDAC package to the internal feed for SOS to consume. This would be necessary to move SOS to use cDAC to implement StressLog dumping in managed code, or any of the other commands currently implemented in the native portion of SOS.

Once a package exists that SOS can consume, the code to read the StressLog in C++ can be removed from CoreCLR as well. I don't want to remove it before we can remove the copy in SOS.

After this PR is merged, we can update the memory-mapped stresslog format to store a pointer to the contract descriptor in the runtime image. That will ensure that any future changes to the stresslog format won't require versioning the memory-mapped stresslog format version, only updating the contract.

Start on the data parsing and fill out the contract more

Implement stress message payload parsing and thread stress log iteration

Implement everything except for printing the string output

Implement output string formatting for stress log messages

Move all StressMsg logic into the contract and out of the data type.

Split the stress log contract into different versions and refactor out the StressLogHeader to be entirely a StressLogAnalyzer concern (not used in the cDAC implementation at all)

Expose the module table directly to avoid going through the StressLog object to read the module table (which isn't present in the memory-mapped log)

Give a nicer error message for an easy-to-fall-in special case in the data descriptor scraping.

Fix a few datadescriptor problems
Move stress message formatting out of the contract.
…more cases (including the alignment scenarios).
Copy link
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

…r debug return logging (make it a const instead of a function)
@Maoni0
Copy link
Member

Maoni0 commented Jul 22, 2024

please make sure that our dprintf stresslog scenario works, ie, turning on the stress log and analyzing it with the stresslog analyzer. if you have questions, @cshung can (continue to) help.

@jkoritzinsky
Copy link
Member Author

@Maoni0 I've validated locally (and pushed the necessary fixes to do so) that all of the various options that were supported on the old stress log analyzer (dprintf level filtering, facility ignore, value searching, gc time searching, time searching, etc) are all functioning the same as they were with the old stress log.

I've also added a --single-run flag that exits after outputting instead of prompting for new input (which I used in debugging and perf testing).

@Maoni0
Copy link
Member

Maoni0 commented Jul 23, 2024

nice, thank you @jkoritzinsky! @cshung, could you please also do some validation on your end?

@jkoritzinsky
Copy link
Member Author

Can I get a review pass on this?

@cshung have you validated all of the scenarios you want here before merging?

@cshung
Copy link
Member

cshung commented Sep 16, 2024

Can I get a review pass on this?

@cshung have you validated all of the scenarios you want here before merging?

I haven't. I plan to work on it today after doing a code review.
But you don't need to block on me on the initial merge, if I have found anything we can always fix them later.

@cshung
Copy link
Member

cshung commented Sep 16, 2024

@jkoritzinsky WIth the latest commit, it looks like the code no longer compiles if #define TRACE_GC is uncommented in gc.h?

Looks like something is wrong with respect to handling float.

@jkoritzinsky
Copy link
Member Author

@jkoritzinsky WIth the latest commit, it looks like the code no longer compiles if #define TRACE_GC is uncommented in gc.h?

Looks like something is wrong with respect to handling float.

That failure is unrelated to this PR (it exists in main). It's due to the fact that float and double were being handled incorrectly and float can't actually be distinguished from double (which was being handled incorrectly as well) in a printf format string.

It looks like a number of failures have crept in under TRACE_GC. I'll put out another PR to try to fix them (and maybe add a CI leg to validate the build in the future).

@jkoritzinsky
Copy link
Member Author

@Maoni0 @cshung I've opened #108089 for the fixes for the TRACE_GC build as they're in main and not due to this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants