-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What are the values of a union type? (in particular, what is the validity invariant of a union) #438
Comments
As I was trying to figure out how to evaluate the possibilities, I started thinking that |
So... basically, let's imagine what could be done if we didn't have to support code like this? I first thought "not much". While validity might be defined as "any of the variants is valid", it becomes harder to define what happens on a typed copy. Consider a union of But there's actually a way to achieve that: we could say that a value of union type is a non-empty set of values of possible variants. So in Another option would be to angelically pick which variant to do the copy at, but I'd rather avoid angelic non-determinism if possible. That definition should give us maximal niche power. However, we also probably don't want it for Also, implementing this in Miri will be quite terrible... |
Could we have different |
I think one goal is for |
What optimizations can we get from For For niches, doing layout optimizations on unions does sound like it would have some plausible value in general, but I expect a significant fraction of union use either doesn't have niches or the union is implementing tag packing itself in which case the user doesn't mind if the union never exposes a niche. Is there more value to exposing niches than permitting enum layout optimizations? |
Is it? I don't think so. It's not my goal, anyway. ;)
When and how do you want safe code to be able to read union fields?
Hm, good question. That would depend on how smart LLVM is about these attributes, but of course we can always look to make it smarter. The problem @scottmcm ran into that made him ask for union |
XD I've mentioned this to you before Ralf. I thought that's what you were already talking about when you said "there is some desire to make unions have validity invariants". Many people want a future rust to support safe union field access when the fields of the union satisfy some sort of specific case that the compiler can automatically check by itself. Whatever is decided for unions now should hopefully allow for that to happen some day. Being intentionally vague, and specifically expecting the reader to interpret the following words in the most positive light, the assumption is usually something like: "if the safe-transmute project completes, and safe-transmute could convert all field types to each other, then that union should obviously be fine to use from safe code."; So something like |
I tried to explore this a bit, but I don't know how successful I was. https://llvm.godbolt.org/z/4K9q9d3Yq If I put a typed load in a function, define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 %0) {
start:
%x = alloca i32, align 4
store i32 %0, ptr %x, align 4
%_2 = load i32, ptr %x, align 4, !noundef !1
ret i32 %_2
} That just gets completely optimized out today, as far as I can tell define i32 @_ZN7example3foo17h02b60a93fb12ffb0E(i32 returned %0) {
ret i32 %0
} Though Alive says it would be legal for it to mark the return type as So I don't know if the typed load would be sufficient in practice to get |
@Lokathor IIRC my stance in that discussion was that that isn't even a validity invariant question, it is primarily a safety invariant question. Or at least, it is certainly the case that safety invariants are sufficient to achieve what you are asking for, since it's all about "how can we soundly expose operation X to safe code". So to make this a validity invariants requires stronger motivation.
I was mostly thinking of @scottmcm, and of people asking for niches in unions. @scottmcm that's not quite how the code would look, would it? If this is a getter Was there a specific optimization you were looking for in the iterator code that went missing, or was it more a general concern without concrete examples? |
Am I understanding correctly that all of the proposals on the table satisfy the following property? Given a E.g.: #[repr(C)]
union Foo {
a: u8,
b: i8,
}
// It is guaranteed that `foo` contains no uninitialized bytes.
let foo: Foo = get_foo();
#[repr(u8)]
enum Bar {
A(u8),
B(u8, u8),
}
#[repr(C)]
union Bar {
a: [u8; 3],
b: Bar,
}
// `bar` might have uninitialized bytes if it was initialized as `b: Bar::A(...)`
let bar: Bar = get_bar(); The reason that this guarantee is useful is that it supports an oft-requested feature from zerocopy, which is to expose the underlying bytes of a union value. For example, one such use case is to permit safe field access. Another is to permit a kernel emulator to write a union into an untyped shared memory buffer (which requires viewing that union as a If my understanding is correct, would folks be open to at least codifying this minimal guarantee in the Reference so that we can rely on it in EDIT: cc @kupiakos, whose code needs this property |
No that is definitely not the case. The proposal currently implemented in Miri and MiniRust does not satisfy this property: we say that *all* byte patterns are valid for all unions. (And furthermore, these bytes are exactly represented in the union value, except for bytes that are padding in *all* union variants -- those bytes are discarded in the value, i.e. they constitute padding for the union itself.)
I am also not convinced this is a guarantee we want. This is related to safe union field access which I think should involve some sort of opt-in.
|
Also, you can make a union have a safety invariant of being initialized/storing one of its variants. I make a declarative macro in my emulator that does this and adds both safe access and |
Quoting from the original comment here - is this still a goal? I understood the various designs that you outline in the first few comments in this thread as attempting to satisfy this goal. IIUC, this is articulating the same goal that I have. Maybe I'm misunderstanding what this example was meant to convey? |
Personally, I very much want a way to accomplish the goal of being able to store a conceptual-union of things without pessimizing to "yolo, might be anything including uninitialized". Being able to have something that's either That said, if it helps opsem I think it'd also be ok to have that be something other than an ordinary |
I think that it would be very useful to have unions that limit how much uninit they are without having to go all the way to having an active field. |
For those of you who have a similar use case: I filed #533, which would permit us to at least treat the desired initialization property as a safety property if not a soundness property (in other words, safe code could not violate it, although |
Yeah, agreed. And I also view "require init" as being a statement in the same category as "requires non-null". So one option here is to let people annotate an attribute and the union that makes such validity requirements explicit, rather than implicitly inferring them from the field types. But of course that would not work for generic code. Not sure if that s a concern that comes up here? |
When loading data at union type from one place and storing it in another -- what exactly is being required about the bytes, and what is preserved? In other words, what is the representation relation of a union type, and what is even the set of values that can be generated by a union-typed load?
Originally I intended the answer to be "the values are just lists of raw bytes as large as the union". IOW, a union has no validity requirements whatsoever, and all data inside it is perfectly preserved. However, reality disagrees. With the C ABI, some of the bytes in a union value do not always get preserved -- there can be padding. Also there is some desire to make unions have validity invariants, so that e.g. a
union { usize, *const T }
isnoundef
.On the other hand, it is certainly not the case that the value of a union is "a value of one of its fields". Apart from the problem about writing
decode
for such a specification, it can be violated in safe code.Note that for the purpose of this, we do assume that the LLVM type
iN
will perfectly preserve N bytes of memory. LLVM code generated from Rust can never put poison in memory (as that would not be preserved on a per-byte level, the entireiN
gets "infected" if any byte is poison), which is currently accurate (Rust code has UB for any situation where the generated LLVM code might produce poison). LLVMiN
values carrying provenance is incoherent but mostly works in practice and LLVM does not offer us a better option. We need a "byte" type to properly express our intent to LLVM; until that exist, we make do with what we have. If you want to discuss LLVM issues, please open a new thread -- this one here is solely about the Rust AM semantics.Possible designs
MiniRust decides that a value of union type is a list of "chunks" where the bytes are perfectly preserved, but there can be gaps between those chunks where data is lost of a copy. The extent of these chunks can be computed from a Rust union type layout. That is sufficient to model the reality of union padding. The effective validity invariant here is still "any list of bytes allowed".
The
noundef
property could be achieved if we refine this "chunk" idea a bit further: every byte in a union is either "discarded", "preserved", or "initialized". Bytes which are padding in all variants get "discarded"; bytes which are padding only in some variants get "preserved"; bytes which are "in the data part" for all variants get "initialized". A value of union type is a list of bytes the size of the union with the constraint that "discarded" bytes must beUninit
and "initialized" bytes must not beUninit
. The validity invariant requires that all "initialized" bytes must be initialized -- and that's it; "discarded" bytes get flushed toUninit
on adecode
. Then the Rust typeunion { usize, *const T }
would translate to a union type where all bytes are "initialized", and we could emitnoundef
attributes for such a type.(I don't want multiple distinct AM values to be equivalent, hence the 'value of union type' predicate demands that all "uninit" bytes are truly
Uninit
, but of course the validity invariant allows those bytes to be anything. Note that the validity invariant is a predicate over "list of bytes", whereas "what are the values of this type" is a predicate overValue
. The chunks idea achieves this by just not having the padding bytes in theValue
; we could do the same here but that would be formally more awkward to express I think. Maybe I should just link to a MiniRust patch here, that might be more clear than trying to say this in English.^^)We could in principle go further and tro to achieve something along the lines of "if a byte is non-null in each variant, it is also non-null in the union", but that becomes increasingly complicated -- in particular it requires the "generate union type description from Rust type" logic to encode more and more knowledge about validity invariants, when the entire intent of this "value representation" business was that the validity invariant is implicitly encoded by the representation relation.
What other designs should we consider? And are they worth the complexity?
The text was updated successfully, but these errors were encountered: