Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cdac] Add data stream library #99442

Closed
wants to merge 5 commits into from
Closed

Conversation

elinor-fung
Copy link
Member

@elinor-fung elinor-fung commented Mar 8, 2024

This adds a data stream library that can be used as a channel between the runtime and diagnostic tools.

  • src/native/datastream
    • object library that can be directly linked in
    • shared library exporting the data stream APIs
  • coreclr
    • link in the data stream library
    • add debug_stream helpers for interfacing with the data stream
    • initialize the streams on startup (does not do any registration of types yet)

Contributes to #99298

Fixes #99299, fixes #99300

Copy link
Contributor

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

**Q2** Why are the contents of a stream immutable?

**A2** Having streams that are mutable means the reader _must_ always re-read the full stream to validate for updates. If the contents of a stream are instead immutable _and_ in reverse chronological order (LIFO), then entries for "deleted" or "invalidated" data are possible, which enables readers to consume deltas and reduce cross-process inspection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least some of these data streams need to included in the minidump payload. How are we going to decide which data streams need to be included in the minidump payload?

Also, we need to keep an eye on the overhead that the data streams add to the raw minidump size so that it does not get out of control.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to keep an eye on the overhead that the data streams add to the raw minidump size so that it does not get out of control.

Maybe we should have an explicit design goal around this, like that the data streams won't grow the minidump size by more than X (X can be either relative - like 5% and/or absolute like 5kB).

Comment on lines +118 to +121
//
// Reader APIs
//

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reader APIs have some type confusion between the host and the debuggee target.

  • data_stream_context_t:streams is a data_stream_t* which is a host pointer-sized value. That seems wrong. It should be a target pointer (so in particular it might be 32-bit BE).

  • memory_reader_t:read_ptr is a function like ds_bool read_remote(memory_reader_t *self, intptr_t remote_addr, size_t* count, void **dest). Here remote_addr is not intptr_t it's an opaque target pointer

If we only ever plan to run the reader on a 64-bit arch, this is probably good enough (except we should be careful when we read data_stream_context_t from the remote end that we don't read more than count bytes)

I think we should use typedef uint64_t ds_foreignptr_t and store remote addresses as opaque 64-bit blobs.

We should probably use a different data_stream_remote_context_t struct for all the reader APIs (with a ds_foreignptr_t for the streams field) and a the remote_addr param of memory_reader_t:read_ptr should also be a ds_foreignptr_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's more 64-bit host / 32-bit target issues: read_in_block_data has size_t remote_block_read = sizeof(data_block_t); which is using the host architecture's sizeof for the data block (which has fields of host-architecture pointer types) to read in target data.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up @elinor-fung! I'd like to iterate on this design a bit before we commit to it. More comments inline.


After type enumeration is complete, instance streams can be enumerated and interpreted. The size contained within the type description allows the reader to read in the entire type and then use field offsets to poke into that memory. The reading in of the entire data type helps with efficiency and acts as a versioning resiliance mechanism. Adding new fields to a type, without changing the version, need not represent a breaking change.

### Design FAQs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These streams appear to be a redesign of a NativeAOT contract that was already created several years ago. I'm not tied to the NativeAOT design but I do think we should either state why the new design elements are preferable to what already exists, or we should better align with the existing design.

At first glance this is what the comparison looks like to me:

DotNetRuntimeDebugHeader and data_stream_context_t
NativeAOT has DotNetRuntimeDebugHeader (RDH), this design has data_stream_context_t. The purpose and fields are similar but with minor variations in layout that make them incompatible. data_context_t has both a size and a version, RDH has version only which implies the size. RDH has an explicit minor version allowing for back-compatible changes, I'm guessing data_stream_context size could also be used as a similar mechanism by increasing the size while keeping the version the same. NativeAOT doesn't define stream as a general purpose linked list of blocks, but rather it defines a DebugTypeEntry array and a GlobalValueEntry array that will hold similar data as streams[0] and streams[2] respectively. data_stream_context uses an extra indirection for the streams array but it appears to have a statically known size so I'm not sure the extra indirection is needed?

streams
In NativeAOT the type and global value information are represented as standard C arrays of a fixed type (either DebugTypeEntry or GlobalValueEntry). There is one contiguous memory allocation for each, they are null terminated to indicate the size to the reader, and all elements are fixed size so there are no offsets to the next stream element.

type and field offset information
NativeAOT is using a sequence of DebugTypeEntry structs, one per field (perhaps the struct would have been more aptly named DebugFieldEntry). Type and fields are referenced by name. There is no equivalent of the per-type ordinal or versioning information in this design.

global value information
NativeAOT uses GlobalValueEntry which references each global by name and associates it with an address. In this design name is replaced by an int16 type. I'm not sure if this type is intended to be a free-form ordinal or it specifically references the same type ordinals used in the type stream. If it is a reference to the type stream that suggests that multiple globals of the same type can't be distinguished?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativeAOT uses GlobalValueEntry which references each global by name and associates it with an address. In this design name is replaced by an int16 type. I'm not sure if this type is intended to be a free-form ordinal or it specifically references the same type ordinals used in the type stream. If it is a reference to the type stream that suggests that multiple globals of the same type can't be distinguished?

I think type is a bad name. A better name is "role". In this design, the fields of structs and globals have a role which is the int16_t that indexes an entry in the role stream. Each role has a version. And the role together with the version imply something about the physical layout of the entity that has that role. So yes: if there are two globals in the runtime that have the same underlying physical representation, they will have two separate roles in the typestream. (Likewise, if a single struct has two fields with the same physical representation, they will have two separate roles)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining 👍 , it sounds like that is a standard ordinal similar to what would be found in a dll global export table which makes sense to me. I wouldn't be confused if you called it 'role' though 'ordinal' is the term I'm most familiar with for that concept.

At the broader scale IMO some aspects of the new design are better, some aspects of the old design are better, and some aspects seem different + equally effective. I'm not sure how you all feel about it but right now I'd lean towards something like:

  • Keep the DotNetRuntimeDebugHeader, likely with a bump to major version 2
  • don't use streams to encode global type/instance/value lists, instead use flat arrays
  • Replace DebugTypeEntry and GlobalValueEntry with structs that reference by ordinal to make the data more compact
  • I'm skeptical about the per-type version numbers, it feels unnecessarily fine-grained. We should probably have a discussion about how we plan to handle versioning generally.

Copy link
Member

@jkotas jkotas Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have a discussion about how we plan to handle versioning generally.

+100

We have been using the data contract idea for native AOT for past two years, so we can get some idea about how the contract versions. I have created a diff of the contract between .NET 7 and .NET 9 P1: https://gist.github.com/jkotas/301687d631c373cfd5fe4f9d18cc801f/revisions .

The observation is that the field offsets alone change rarely. If they change, it is for the "dumping ground" datastructures like Module. The changes in algorithms (that are often accompanies with changes in the set and purpose of the fields) are more common.

Our design has been oriented too much around types and offsets. I think the design needs to be oriented more about the data contracts required to achieve tasks. Here are some examples:

  • CodeMap - contract that describes how to go from IP to (start IP, offset and internal runtime method handle)
  • MethodDescToken - contract that describes how to go from internal runtime method handle to (Metadata blob, token)
  • ECMAMetadata - contract that describes how to interpret metadata blob (this is one is easy - it is written down in ECMA spec)
  • GCHeap - contract that describes how to enumerate all objects on GCHeap
  • GCHeapInternal - extension of GCHeap contract that describes internal GC data structures that can be used to validate GC Heap self-consistency (!VerifyHeap command in SOS)

Each of these contracts should have a revision that describes the algorithm. The contract can be parametrized by global variable addresses, field offsets, bit masks, etc. Field offsets or bit masks are only needed for cases that are likely to change. They can be implied by the contract revision most of the time. It is a trade-off between flexibility and compactness. We will learn over time where to strike the right balance.

The set of contracts supported by the runtime instance, including the contract revision and the contract parameters should be attached to DotNetRuntimeDebugHeader. I do not have strong opinion about the encoding as long as it is reasonably compact and cheap to payload into a minidump.


**Q1** Why are streams allowed to grow?

**A1** Consider the case where a data structure in the target process has a specific use case but the reader has either stricter or looser requirements. An example would be a thread pool used in the target process. This structure would ideally only be concerned with current threads in the target process, exited threads having been removed. However, the reader process likely has a need for knowing when a thread instance has exited to update its own internal state. A possible solution is to fully query the thread pool data structure each time. However, if instead entries for created and deleted threads are recorded in a stream, the reader only needs to know the delta as opposed to querying the thread pool each time. The logic follows for any data structure that contains objects with transient lifetimes.
Copy link
Member

@noahfalk noahfalk Mar 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should flip the question around a bit. Lets assume by definition that streams are data structures that support growth. Do we expect a global list of types and a global list of instances are scenarios that need growth? Both NativeAOT and CoreCLR have similar lists, neither of them currently support growth, and it has never been an issue. From my POV trying to use streams for that purpose seems to be introducing extra complexity in the data contract where a flat array would suffice.

Where growth appears to enter the picture is the assumption that we should use the same list both to encode global singletons (examples) and also dynamically created objects of interest (such as all Threads). I don't believe we are getting any benefit by combining them though.

However, if instead entries for created and deleted threads are recorded in a stream, the reader only needs to know the delta as opposed to querying the thread pool each time.

I don't think we should be maintaining a list of every deleted thread because the potential size of that list is unbounded. I'd advocate that memory usage in this feature must remain close to what it currently is regardless of app behavior. If an app wants to create and destroy threads continuously (threadpool with variable load may do just that) then that app needs to keep working.

It seems like there is an underlying assumption in the design that streams are a generally good data structure that will be used repeatedly for all sorts of different data and I don't agree with that premise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there is an underlying assumption in the design that streams are a generally good data structure that we will be used repeatedly for all sorts of different data and I don't agree with that premise.

The streams idea may be a useful concept for certain contracts (in particular contracts oriented around live-debugging). I agree with Noah that the streams idea does not seem to be applicable for the scenarios that we are going after first (#99298).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be maintaining a list of every deleted thread because the potential size of that list is unbounded.

We are not intending to do this. It was an example (perhaps poor) of the potential flexibility. I'd agree we currently don't have a specific initial use case for that flexibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not seem to be applicable for the scenarios that we are going after first (#99298).

I think this is a slight confusion and we keep focusing on this, which I don't fully understand. Note that the data_stream library itself was written as a contract that is not concerned with the contents. It can be dynamic or static - one initializes with a series of sizes (dynamic) the other could accept a large block of memory (static, but not implemented). The decision here to abstract that away in a library was precisely to avoid conflating dynamic vs static concerns. If we had avoided the data stream abstraction, it likely would force us down assumptions about the static nature, this design avoids that mistake with the abstraction. @elinor-fung mentions this above as well, it is an implementation detail that has costs we may not want to accept right now, however in the current form we have more degrees of freedom with the abstraction than without.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand how the streams fit into the minidump scenarios.

If we go with the streams idea, what are going to be the streams saved into the minidump?

Copy link
Member

@noahfalk noahfalk Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Elinor] I'd agree we currently don't have a specific initial use case for that flexibility.

Any objections to waiting until we do have such a use case before introducing it?

[Aaron] I think this is a slight confusion and we keep focusing on this, which I don't fully understand... The decision here to abstract that away in a library was precisely to avoid conflating dynamic vs static concerns.

The reason I focus on it is because abstracting it behind a library API doesn't resolve all the concerns:

  • One of the major outputs of this feature is a specification for a data format. That format spec is intended to allow anyone to create their own parser for this data. The existence of parsing code in one library doesn't preclude developers from needing to work directly with the underlying data.
  • The choice of data structures directly impacts performance characteristics such as application size on disk, application size in memory, dump size on disk, and data access time.
  • A growing stream vs. a static stream have different rules for when the data needs to be re-queried. This means callers will still care which one they are dealing with regardless of the code that does a point-in-time enumeration.


The .NET runtime data stream is mechanism for the runtime to encode information about itself in a way that is accessible to diagnostic tools. This enables de-coupling of the tooling (for example, the DAC) from the details of a specific version of the runtime.

Data Streams consist of three concepts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Data Streams consist of three concepts.
Data Streams consist of four concepts.

@lambdageek
Copy link
Member

I don't think we're ready to merge this in its current form at this time, give the discussions around #99936 - if nothing else this version of a data contract stream doesn't capture the idea of algorithm contracts.

@lambdageek lambdageek closed this Mar 22, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 23, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[cdac] Implement basic data stream reader and writer in C [cdac] Publish data stream spec
5 participants