[type-layout] Document bit validity of structs and padding #1630

joshlf · 2024-09-25T00:35:53Z

This intentionally omits discussions of enums and unions. Enum bit validity is well-defined, but it is difficult to articulate concisely. Union bit validity is still not nailed down.

chorman0773

Left a few comments. The rules arround structs specifically are good though.

chorman0773 · 2024-10-10T17:15:08Z

src/type-layout.md

+Note that the total set of possible bit patterns for any given byte has 257
+elements - 256 initialized bit patterns and one "uninitialized" bit pattern.


This does not include provenance-carrying bytes, which are complicated.

I'd either briefly mention them, or omit this paragraph entirely.

chorman0773 · 2024-10-10T17:16:15Z

src/type-layout.md

+Each Rust type has "bit validity", which is the set of bit patterns which may
+appear in a value of that type. It is [undefined behavior] to produce a value


This may be better in its own section about bit validity in general, rather than introducing the concept in its entirety in a section specifically about composite types.

chorman0773 · 2024-10-10T17:17:40Z

src/type-layout.md

+Padding bytes have no bit validity requirement: it is always well-defined
+to write any byte value - including an uninitialized byte - to any byte of
+padding.


Could be useful to mention that padding bytes will get overwritten whenever a struct value is moved. I don't remember if we mentioned it elsewhere in the chapter though - if it's already there, I don't think it needs to be duplicated here.

Looks like there are no existing docs for typed copies (cc @RalfJung). Perhaps we could add a section documenting them?

IMO we ultimately shouldn't document bit validity at all. Bit validity is a derived concept. What actually matters is

decoding a low-level representation into a high-level value

and inverting that: encoding the value back into a low-level representation

"bit validity" is just the precondition for decoding to be well-defined. If we define decoding, that implicitly defines bit validity as well. But the decode/encode pair also explains other things like padding (and sometimes provenance) being lost on a typed copy.

IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.

I don't see any reason why explicitly documenting encoding and decoding separately is a problem. I'd just want to make sure we can still derive the stronger notion of bit validity that I described from whatever guarantees are made regarding encoding and decoding.

@chorman0773

It does not permit that, no, as we can relax validity invariants.

Does that hold for bit validities which are explicitly documented? E.g., would we consider unsafe code which assumes that a bool can only be 0 or 1 [1] to be unsound?

[1] https://doc.rust-lang.org/reference/types/boolean.html:

The value false has the bit pattern 0x00 and the value true has the bit pattern 0x01. It is undefined behavior for an object with the boolean type to have any other bit pattern.

IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.

That's not a good idea. Whenever unsafe code permits inputs to not satisfy the safety invariant, that code should document what it actually requires. We can and will change validity rules to allow more values, which removes UB and is hence not considered a breaking change. If someone defines an unsafe function with precondition "anything goes, even if it breaks the safety invariant, as long as it satisfies the validity invariant", that unsafe code can become unsound through such a spec change.

I don't see any reason why explicitly documenting encoding and decoding separately is a problem.

It's fully redundant, and therefore makes it harder to form a mental model of this part of unsafe Rust. It makes it seem like one has to learn the encode/code part and the bit validity rules, when that is not actually necessary.

E.g., would we consider unsafe code which assumes that a bool can only be 0 or 1 [1] to be unsound?

That's a question about safety invariants, which are a type system concern, not about validity invariant (which are an opsem concern). It is easier to consider this question with &str since there, validity != safety: it is documented that this is valid UTF-8, so unsafe code may rely on that. But this has nothing to do with validity, and it is in fact not immediate UB to violate the UTF-8 property.

I don't see any reason why explicitly documenting encoding and decoding separately is a problem.

It's fully redundant, and therefore makes it harder to form a mental model of this part of unsafe Rust. It makes it seem like one has to learn the encode/code part and the bit validity rules, when that is not actually necessary.

Sorry, to clarify, I was agreeing with you here 🙂 Namely, that (for our purposes) just defining encoding/decoding and not mentioning bit validity is fine.

IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described.

That's not a good idea. Whenever unsafe code permits inputs to not satisfy the safety invariant, that code should document what it actually requires. We can and will change validity rules to allow more values, which removes UB and is hence not considered a breaking change. If someone defines an unsafe function with precondition "anything goes, even if it breaks the safety invariant, as long as it satisfies the validity invariant", that unsafe code can become unsound through such a spec change.

E.g., would we consider unsafe code which assumes that a bool can only be 0 or 1 [1] to be unsound?

That's a question about safety invariants, which are a type system concern, not about validity invariant (which are an opsem concern). It is easier to consider this question with &str since there, validity != safety: it is documented that this is valid UTF-8, so unsafe code may rely on that. But this has nothing to do with validity, and it is in fact not immediate UB to violate the UTF-8 property.

Okay so IIUC, code which assumes (for soundness) that bools are 0x00 or 0x01 is sound, but only in virtue of the fact that bools have a safety invariant of only being 0 or 1? I think that's fine for our purposes.

I do have a more general concern, though.

It seems to me that, as it currently stands, we don't clearly specify in documentation which things are intended to document validity and which are intended to document safety, and furthermore, we don't clearly specify which are merely preconditions (which we could relax in the future) and which are postconditions (which would be a breaking change to relax).

As an example, consider the Reference page on bools:

The value false has the bit pattern 0x00 and the value true has the bit pattern 0x01. It is undefined behavior for an object with the boolean type to have any other bit pattern.

Clearly this is documenting a precondition for soundness, but is it also documenting a safety invariant? Saying "the value false has the bit pattern 0x00 and the value true has the bit pattern 0x01" seems to me to be a promise, but maybe it wasn't intended that way.

The stdlib docs are even more nebulous:

If you cast a bool into an integer, true will be 1 and false will be 0.

Is there some nomenclature convention I'm missing here, or is it just that we haven't gotten around to making our language consistent? Not that it's specifically your responsibility to own this or anything, but I think it's worth stating for the record: the ambiguity of the language - and especially ambiguity which allows preconditions to be misinterpreted as postconditions or invariants - makes it hard for external authors to be confident that they're writing unsafe code correctly. Writing unsafe code is my full-time job and I still keep running into situations where I thought I understood what was being promised only to learn that I was misreading the text.

Okay so IIUC, code which assumes (for soundness) that bools are 0x00 or 0x01 is sound, but only in virtue of the fact that bools have a safety invariant of only being 0 or 1? I think that's fine for our purposes.

Yes.

It seems to me that, as it currently stands, we don't clearly specify in documentation which things are intended to document validity and which are intended to document safety,

Indeed. So far we haven't stably committed to this dichotomy of invariants, and more importantly, we haven't stably committed to how these two invariants are called. Personally I regret calling them "validity invariant" and "safety invariant"; I now think that "language invariant" and "library invariant" are better terms. They go along nicely with the terms "language UB" and "library UB" that we have also been using already (but also not documented stably yet -- mostly because nobody had the time to write that text, I think).

I opened rust-lang/unsafe-code-guidelines#539 to hopefully make progress on this question.

chorman0773 · 2024-10-11T02:03:04Z

It does not permit that, no, as we can relax validity invariants.

…

On Thu, Oct 10, 2024, 21:20 Joshua Liebow-Feeser ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In src/type-layout.md <#1630 (comment)>: > +Padding bytes have no bit validity requirement: it is always well-defined +to write any byte value - including an uninitialized byte - to any byte of +padding. IMO bit validity (at least as I've used it) is a tad stronger than that: it also permits unsafe code to assume that it won't see certain values produced by safe Rust. That implies that it also constrains what low-level representations can be produced by the "encoding" operation you described. I don't see any reason why explicitly documenting encoding and decoding separately is a problem. I'd just want to make sure we can still derive the stronger notion of bit validity that I described from whatever guarantees are made regarding encoding and decoding. — Reply to this email directly, view it on GitHub <#1630 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABGLD22HV3WP7ABLDQLOSETZ24RVTAVCNFSM6AAAAABOZMPY6WVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGNRRG4ZTIOJZGM> . You are receiving this because you commented.Message ID: ***@***.***>

traviscross · 2024-10-22T21:31:12Z

@rustbot author

[type-layout] Document bit validity of structs and padding

5133f9e

This intentionally omits discussions of enums and unions. Enum bit validity is well-defined, but it is difficult to articulate concisely. Union bit validity is still not nailed down.

chorman0773 requested changes Oct 10, 2024

View reviewed changes

rustbot added the S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. label Oct 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[type-layout] Document bit validity of structs and padding #1630

[type-layout] Document bit validity of structs and padding #1630

joshlf commented Sep 25, 2024

chorman0773 left a comment

chorman0773 Oct 10, 2024

chorman0773 Oct 10, 2024

chorman0773 Oct 10, 2024

joshlf Oct 10, 2024 •

edited

Loading

RalfJung Oct 10, 2024 •

edited

Loading

joshlf Oct 11, 2024

joshlf Oct 11, 2024

RalfJung Oct 11, 2024 •

edited

Loading

RalfJung Oct 11, 2024

joshlf Oct 18, 2024

RalfJung Oct 18, 2024

chorman0773 commented Oct 11, 2024 via email

traviscross commented Oct 22, 2024

		Note that the total set of possible bit patterns for any given byte has 257
		elements - 256 initialized bit patterns and one "uninitialized" bit pattern.

		Each Rust type has "bit validity", which is the set of bit patterns which may
		appear in a value of that type. It is [undefined behavior] to produce a value

[type-layout] Document bit validity of structs and padding #1630

Are you sure you want to change the base?

[type-layout] Document bit validity of structs and padding #1630

Conversation

joshlf commented Sep 25, 2024

chorman0773 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joshlf Oct 10, 2024 • edited Loading

Choose a reason for hiding this comment

RalfJung Oct 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RalfJung Oct 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chorman0773 commented Oct 11, 2024 via email

traviscross commented Oct 22, 2024

joshlf Oct 10, 2024 •

edited

Loading

RalfJung Oct 10, 2024 •

edited

Loading

RalfJung Oct 11, 2024 •

edited

Loading