[cdac] Data Descriptor Spec #100253

Conversation
As a first pass, though, this seems more verbose and complicated than previous prototypes.
struct BinaryBlobDataDescriptor
{
    struct Directory {
        uint32_t TypesStart;
When would these Start values not be 0?
So this comes from the requirement that we would like to be able to read binary descriptors directly out of object files without understanding the object file format.
That means that if the object file is produced by a C compiler, we want to be able to ignore any padding or alignment that the C compiler adds. Because we have fixed-size arrays of structs as part of BinaryBlobDataDescriptor, I think the C compiler may add padding between where one array ends and the next one begins. So to make the blob self-describing, we need to capture the offsets of the fields of BinaryBlobDataDescriptor as part of the blob itself.
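A minimal sketch of that idea, with hypothetical field names and array sizes (this is not the layout proposed in the PR): the directory records where each fixed-size array actually begins, so compiler-inserted padding is captured in the data itself.

```cpp
// Hedged sketch only: hypothetical names and sizes, not the actual BinaryBlobDataDescriptor.
#include <stdint.h>
#include <stddef.h>

struct TypeSpec   { uint32_t NameIdx; uint32_t FieldsStart; uint32_t FieldsCount; };
struct GlobalSpec { uint32_t NameIdx; uint32_t TypeIdx; uint64_t Value; };

struct BinaryBlobDataDescriptor {
    struct Directory {
        uint32_t TypesStart;    // offset of Types[] from the start of the blob
        uint32_t GlobalsStart;  // offset of Globals[] from the start of the blob
    } Dir;
    TypeSpec   Types[16];       // fixed-size arrays; the compiler may insert padding between them
    GlobalSpec Globals[16];
};

// The emitter records the real offsets, so a reader that only sees the raw bytes can locate
// each array without reproducing the compiler's padding/alignment decisions:
//   blob.Dir.TypesStart   = offsetof(BinaryBlobDataDescriptor, Types);
//   blob.Dir.GlobalsStart = offsetof(BinaryBlobDataDescriptor, Globals);
```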
Does that mean this field is the relative offset from the start of the BinaryBlobDescriptor to the start of the Types array and we'd never expect a reader to know that BinaryBlobDescriptor has a Types field? If I followed you correctly I'd be tempted to omit all of those array fields from the struct definition or add a comment saying those fields are implementation details that a reader shouldn't rely upon.
I'd be tempted to omit all of those array fields from the struct definition or add a comment saying those fields are implementation details that a reader shouldn't rely upon.
+1. We do not need to spec the intermediate artifacts used in the process of building the final descriptor. They can be just our internal implementation detail.
@jkotas ok, I'm gonna take one more pass at simplifying things:
- only document the JSON physical format (but with a second flavor for the in-proc version that has an aux array for the global pointers)
- keep the "object file blob" stuff as an implementation detail for tooling that will create the in-proc JSON data descriptor
- drop the requirement to be able to scrape the on-disk representation out of the spec

The build tooling (I'm thinking of it as `cdac-build-tool`) will work like this:
- the runtime build generates one or more object files with blobs containing offsets and struct sizes
- `cdac-build-tool` scrapes the object files and a baseline and writes out a C file that contains a JSON data descriptor and an aux array of globals (sketched below)
- we compile the C file and link it into the runtime
At diagnostic time:
- DAC tools find the data descriptor in target process memory (not on disk)
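For concreteness, here is a hedged sketch of the kind of constant data such a generated file might contain (all identifiers are hypothetical, not actual `cdac-build-tool` output; the JSON content follows the examples quoted later in this thread): a JSON data descriptor string plus an aux array holding addresses of globals that are only known at link time.

```cpp
// Hedged sketch of a generated descriptor file; names and JSON content are illustrative.
#include <stddef.h>

extern struct ThreadStore* g_pThreadStore;   // hypothetical runtime global

// Aux array of global pointers; the JSON refers to entries by index via {"indirect": N}.
const void* const g_cdacAuxValues[] = {
    &g_pThreadStore,                         // index 0
};

// The in-proc data descriptor is just a constant string in the data segment.
const char g_cdacDataDescriptor[] =
    "{"
    " \"types\": [ { \"name\": \"Thread\","
    "               \"fields\": [ { \"name\": \"ThreadId\", \"type\": \"uint32\", \"offset\": 32 } ] } ],"
    " \"globals\": [ { \"name\": \"s_pThreadStore\", \"value\": { \"indirect\": 0 } } ]"
    "}";

// Byte count of the descriptor, not counting the trailing nul.
const size_t g_cdacDataDescriptorSize = sizeof(g_cdacDataDescriptor) - 1;
```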
## Example

An example C header describing some data types is given in [sample.data.h](./sample.data.h). And
Any thoughts on what this might look like for NativeAOT data structures defined in C#?
Or before that, have we identified any NativeAOT data structures we'd want to expose from managed code? So far I believe every type described in NativeAOT's DebugHeader has always been defined in native code.
In the resulting .o file it will look the same. It can be produced by other means - it doesn't need to be produced by C macros. (In fact, even for CoreCLR it doesn't need to be produced by macros. It can be created with C++ `constexpr` and templates.) The preprocessor example is just meant to demonstrate that this can be constructed using C for the most restricted runtimes.
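As a hedged illustration of that alternative (hypothetical type and field names, not CoreCLR's actual tooling), compile-time offset tables can be built without the preprocessor:

```cpp
// Hedged sketch: computing descriptor contents with constexpr instead of macros.
#include <stddef.h>
#include <stdint.h>

struct Thread { uint64_t m_ThreadId; uint32_t m_State; Thread* m_pNext; };  // hypothetical

struct FieldEntry { const char* Name; uint32_t Offset; };

// offsetof on a standard-layout type is an integral constant expression, so this whole
// table is emitted as constant data in the object file.
constexpr FieldEntry ThreadFields[] = {
    { "ThreadId",    offsetof(Thread, m_ThreadId) },
    { "ThreadState", offsetof(Thread, m_State)    },
    { "Next",        offsetof(Thread, m_pNext)    },
};

constexpr uint32_t ThreadSize = sizeof(Thread);
```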
it doesn't need to be produced by C macros
I wasn't worried that it would be constrained to macros :) I just have no understanding of what alternative method we would use to generate it instead and I'm searching for info that helps me fill in that gap.
I just have no understanding of what alternative method we would use to generate it instead
I have no idea. Perhaps a new phase for ILCompiler, or possibly some source generator that runs over NativeAOT's runtime source code and produces a C# blittable struct definition that follows the format we expect.
Yes, I would expect a variant of `cdac-build-tool` to be part of the native AOT ILCompiler in the fullness of time.
## Global value descriptors

Each global value descriptor consists of:
* a name
My guess is that ordinals would be a bit simpler as well as being more compact, but if it fits in the size targets I'm fine with names.
I wouldn't be surprised if a dense ordinal-based encoding without any baseline values also fits within the perf/size goals. If that meant we didn't need any custom build tooling, JSON format, or baseline offset tables, that seems like a nice simplification. The part that remains an unknown for me on that route is how we deal with C# globals/types in NativeAOT, assuming that we would have some. So far all the DebugHeader work for NativeAOT has never needed to encode info for a managed type or global, and I don't know what capabilities we have to work with in the NativeAOT toolchain.
It sounds like my suggestion is not the direction you are choosing to go. If you want to discuss it more I'm happy to, but I also don't want to be a distraction when you've decided on another approach already. I believe the current approach with JSON, baselines, and custom build tools can work even if it feels more complex than needed to me. Feel free to mark this resolved.
it's a build tooling implementation detail
Assuming we're converging toward consensus on this PR, the next thing I'd like to tee up is a spec for the overall contract descriptor #100365
* `"type": "type name"` the name of a primitive type or another type defined in the logical descriptor | ||
* optional `"offset": int | "unknown"` the offset of the field or "unknown". If omitted, same as "unknown". | ||
|
||
Note that the logical descriptor does not contain "unknown" offsets: it is expected that the binary |
Maybe call out specifically that this also means only the baseline data descriptor should have 'unknown' offsets and the in-memory one should not.
"types": [ | ||
{ | ||
"name": "Thread", | ||
"fields": [ |
We may want to use a more compact style for key/value pairs, for the in-memory descriptor at least; the JSON parser can support both less verbose and more verbose variants:
"types": [
    "Thread" : { "ThreadId" : 32, "ThreadState" : 0, "Next" : 128 },
    ...
I know the right way to serialize key/value pairs in JSON is one of the FAQs...
Added a "compact" JSON variant to the spec
  }
],
"globals": [
  { "name": "s_pThreadStore", "value": { "indirect": 0 } }
Should `indirect` be 1 in this example? The value at index 0 below is `FEATURE_EH_FUNCLETS`.
No. The value of FEATURE_EH_FUNCLETS in this example is stored in the baseline, and the in-memory version happened to match and doesn't override it. (Even if it did override it, we would probably store it as a literal value in the JSON since it is known at build time and won't vary at execution time. There is no need to store constant values in the aux array.) I'll clarify the example.
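A hedged reader-side sketch of that rule (the helper types and names are hypothetical, not part of the spec): the in-memory descriptor overrides the baseline when present, and only indirect values are looked up in the aux array.

```cpp
// Hedged sketch of how a reader might resolve a global, given both descriptors.
#include <stdint.h>
#include <optional>

struct GlobalValue { bool IsIndirect; uint64_t Value; };   // parsed from the JSON

uint64_t ResolveGlobal(const std::optional<GlobalValue>& inMemory,  // in-memory descriptor entry, if any
                       const GlobalValue& baseline,                 // baseline descriptor entry
                       const uint64_t* auxArray)                    // aux array read from the target
{
    // FEATURE_EH_FUNCLETS in the example above: no in-memory override, so the
    // baseline's literal value is used as-is.
    const GlobalValue& v = inMemory ? *inMemory : baseline;

    // s_pThreadStore in the example: {"indirect": 0} means "the value is auxArray[0]".
    // Only in-memory descriptors may use indirect values.
    return v.IsIndirect ? auxArray[v.Value] : v.Value;
}
```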
If the value is given as `{"indirect": int}` then the value is stored in an auxiliary array that is
part of the data contract descriptor. Only in-memory data descriptors may have indirect values; baseline data descriptors may not have indirect values.

Rationale: This allows tooling to generate the in-memory data descriptor as a single constant
I was under the impression that the JSON format was for documentation purposes only and that it would never appear in the memory of the target process. What is the scenario where we expect the JSON format to be embedded in memory?
I was under the impression that the JSON format was for documentation purposes only
The well-known baseline JSON data will be embedded into the CDAC reader tool. It's not just for documentation.
What is the scenario where we expect the JSON format to be embedded in memory?
If the in-proc data descriptor is going to be a data-segment constant that is generated by `cdac-build-tool`, there didn't seem to be any reason why it should be a binary blob. JSON is easy to read with a variety of off-the-shelf parsers.
Also:
- The size of the compact JSON format is not that much different from the size of the originally proposed binary format, and there are standardized ways to make JSON smaller if we need to (compression, binary JSON, ...)
- Designing our own binary format for structured data is a rathole
I just misunderstood the text. I thought it was saying the JSON would be inside the target process being debugged. If this JSON is included in the reader then I have no concern with that at all. I consider it an implementation choice of any diagnostic tool, including ours, how it wants to utilize the documentation. Sorry for the confusion!
I thought it was saying the JSON would be inside the target process being debugged.
The text is saying that the (compact) JSON will be inside the runtime binary in the target process.
Sorry, it looks like I had it right the first time. I see there was an edit earlier today that eliminated the binary format entirely and just uses the JSON as the in-memory format.
This feels like it is getting progressively further away from the approach I would have favored, but I think you can make it work, and it feels like my opinion on this is in the minority. My impression is that defining flat in-memory structs with no baseline at all would be easier to generate (no build tooling needed or baseline to manage), easier to consume (no JSON parser needed), and fits in our size goals. As an example, the GC already does this, as have past successful prototypes. I know I suggested having a baseline at an earlier point, but pretty quickly it felt like the amount of complexity it added to the design made it feel not worth the potential size savings.
In any case, if your vision is different and you think JSON, baselines, and custom build tools are the best approach, go for it. I see no reason it won't work even if it isn't the approach I would have picked.
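For reference, a hedged sketch of the flat-struct alternative described in this comment (illustrative names only; this is not something the PR specifies):

```cpp
// Hedged sketch of a flat, ordinal-style descriptor with no baseline: the runtime exports
// one constant struct and the reader indexes the fields it knows about by position.
#include <stdint.h>

struct CDacFlatOffsets {
    uint32_t Size;                 // sizeof(CDacFlatOffsets); lets readers detect newer/older layouts
    uint32_t Thread_ThreadId;      // offsetof(Thread, m_ThreadId)
    uint32_t Thread_ThreadState;   // offsetof(Thread, m_State)
    uint32_t Thread_Next;          // offsetof(Thread, m_pNext)
    // ... one entry per exposed offset or global; new entries are only appended
};
```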
it felt like the amount of complexity it added to the design made it feel not worth the potential size savings.
@lambdageek and I chatted about this too. I'm hoping we can focus more on actually understanding what the entire description will be, and then, after we know for sure, we can apply the baseline logic or just accept that composition wasn't needed because the size isn't as bad as assumed.
no baseline at all would be easier to generate (no build tooling needed or baseline to manage)
I'm hoping we can focus more on actually understanding what the entire description will be
I think the design having the space for baselining makes sense, but I would like to start with embedding everything in the target process and see where that is with `PrintException` and how much we think it will grow. And then actually use/implement the baselining if deemed necessary.
Part of my thinking that we wouldn't need a baseline is that I'm expecting a simple ordinal-based binary encoding to be between 10-40% of the size of a JSON encoding for the same set of data. For example, you might see that a JSON encoding of the entire runtime descriptor with no baseline is 20KB in size, which we decide is too big. Then we either save size by sticking with JSON and using a baseline, or we save size by using some simple structs with binary data that is substantially more efficient.
| s_pThreadStore | pointer | 0x0100ffe0 |

The `FEATURE_EH_FUNCLETS` global's value comes from the baseline - not the in-memory data
descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the
descriptor. By contrast, `FEATUER_COMINTEROP` comes from the in-memory data descriptor - with the
descriptor. By contrast, `FEATURE_COMINTEROP` comes from the in-memory data descriptor - with the
Unknown offsets are not supported in the compact format.

Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
Rationale: the compact format is expected to be used for the in-memory data descriptor. In the
Rationale: the compact format is expected ot be used for the in-memory data descriptor. In the
common case the field type is known from the baseline descriptor. As a result, a field descriptor
like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not
present in the baseline, then `"field_name": [12, "uint16"]` may be used.
"may be used" or "must be used"?
## Version <version_number>
#### Baseline data descriptor identifier
This should be moved to `data_descriptor.md` where it talks about the baseline descriptor. This higher-level doc should only have a brief mention.
Insert description (if possible) about what is interesting about this particular version of the contract
#### Global Values
Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint)
Global values which can be of types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, pointer, nint, nuint)
Global values can be either primitive integer constants or pointers.
The detailed explanation is in the linked doc
## Version <version_number>, DEFAULT
#### Data Structure Layout
Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, nint, nuint, pointer) or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
Each data structure layout has a name for the type, followed by a list of fields. These fields can be of primitive integer types, pointers or of another named data structure type. Each field descriptor provides the offset of the field, the name of the field, and the type of the field.
LGTM modulo feedback. Thank you!
Rationale: the compact format is expected to be used for the in-memory data descriptor. In the
common case the field type is known from the baseline descriptor. As a result, a field descriptor
like `"field_name": 36` is the minimum necessary information to be conveyed. If the field is not
present in the baseline, then `"field_name": [12, "uint16"]` must be used.
Thinking about it some more:
I thought that the types are only meant for documentation purposes, but I do not see it actually mentioned anywhere. Do we expect the types to be used in the reader?
If the types were for documentation purposes only, it does not make sense to mention them in the compact form.
I think this is just a choice of what we want to happen if a field ever changes its type. One option is that the type is implicit in the name, so the name must change either in the source or via some override the tooling supports. The other option is that we include the type explicitly so that the reader can match on a (type, name) tuple.
My vote would be for requiring a field rename if it changes type.
Building on #100253, describe an in-memory representation of the toplevel contract descriptor, comprised of:
* some target architecture properties
* a data descriptor
* a collection of compatible contracts

Contributes to #99298
Fixes #99299

---
* [cdac] Physical contract descriptor spec
* Add "contracts" to the data descriptor
* one runtime per module - if there are multiple hosted runtimes, diagnostic tooling should look in each loaded module to discover the contract descriptor
* Apply suggestions from code review
* Review feedback
  - put the aux data and descriptor sizes closer to the pointers
  - Don't include trailing nul in `descriptor_size`. Clarify it is counting bytes and that `descriptor` is in UTF-8
  - Simplify `DotNetRuntimeContractDescriptor` naming discussion

---------

Co-authored-by: Elinor Fung <[email protected]>
Contributes to #100162 which is part of #99298
Follow-up to #99936 that removes "type layout" and "global value" contracts and instead replaces them with a "data descriptor" blob.
Conceptually a particular target runtime provides a pair of a logical data descriptor together with a set of algorithmic contract versions. The logical data descriptor is just a single blob that defines all the globals and type layouts relevant to the set of algorithmic contract versions.
A logical data descriptor is realized by merging several physical data descriptors in a prescribed order.
The physical data descriptors provide some subset of the type layouts or global values.
The physical data descriptors come in two flavors: baseline data descriptors known to the reader, and in-memory data descriptors embedded in the target runtime.
Each physical descriptor may refer to a baseline and represents a delta applied on top of the baseline.
The data contract model works on top of a flattened logical data descriptor.
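A hedged sketch of that flattening step (hypothetical types, with values reduced to integers for brevity): physical descriptors are applied in order, with later ones overriding earlier ones.

```cpp
// Hedged sketch: flattening physical descriptors (baseline first, then the in-memory
// delta) into the single logical descriptor the contracts operate on.
#include <stdint.h>
#include <map>
#include <string>

using Globals = std::map<std::string, uint64_t>;

Globals Flatten(const Globals& baseline, const Globals& inMemoryDelta)
{
    Globals logical = baseline;              // start from the baseline's values
    for (const auto& [name, value] : inMemoryDelta)
        logical[name] = value;               // the delta overrides or adds entries
    return logical;
}
```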