Meaning of Undefined and Justification for UB #253
UB is the same in Rust as it is in C++.
A compiler implementation could specify what happens for some subset of programs which have UB according to the Rust language. However, this is out of scope when it comes to specifying Rust itself, and it does not mean that the program itself becomes valid Rust.
Even from the beginning, the C/C++ standards left things undefined specifically to allow compilers to translate code to more efficient machine code. For example, this is why There's no reason to leave something as UB unless it allows for some optimization, because UB has a significant cost. Instead, if no optimizations can be enabled, it would be better to specify the behaviour, or define a range of reasonable behaviours, but leave the exact choice up to the implementation. |
Thank you for moving this discussion to a separate thread!
There is a definition of UB in our glossary. This definition coincides with how modern C/C++ compilers interpret UB in their respective languages. (I should add though that the UCG glossary represents UCG consensus, not Rust-wide consensus.) There is also an excellent blog post by Raph Levien that goes a bit into the history of UB. According to that post, UB in C/C++ used to be more about "we do not want to restrict what hardware does" than about enabling optimizations, but this meaning has shifted over time. In my opinion, UB is a terrible word for how the term is used today; I think something like "language contract" or so is much clearer, but I'm afraid we are probably stuck with it. The concept itself however is great: it is a way for the programmer to convey information to the compiler that the compiler would have no way to infer itself. However, problems arise when the programmer does not realize what information they are conveying. This happens a lot in C/C++ (when a programmer writes |
Historically, UB might not have started as being primarily for optimizations, but over the last few decades that is certainly the case. To give one example, strict aliasing is UB in C and that UB has only one purpose: more optimizations. (Specifically, the story I was told is that C compilers needed to be able to compete with Fortran compilers. Fortran has very strong aliasing guarantees, and the only way they saw to make C competitive is to also have some aliasing guarantees in C.) In Rust, without the historical baggage of C/C++, we use UB only for optimizations. There are better ways to handle platform and implementation differences, as @Diggsey mentioned. For example, we have little-endian and big-endian platforms, and this is handled by having an explicit parameter in the Rust Abstract Machine defining endianness. So it is not UB to do something byte-level with multi-byte integer types, but results differ per platform. Such differences should obviously be kept to a minimum to make code maximally portable, but there can be good reasons to introduce them. Likewise, integer overflows are defined to either raise a panic or produce 2's-complement overflowing results (and in practice this is controlled by compiler flags). In such cases it is important to precisely specify what all the possible cases are, so that programmers can make their code correct with respect to all Rust implementations. This is what sets such platform/implementation differences apart from UB.
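The two kinds of defined-but-variable behaviour described above can be sketched in a few lines. This is only an illustration using standard-library methods (`to_ne_bytes`, `to_le_bytes`, `to_be_bytes`, `checked_add`, `wrapping_add`); the function names `native_bytes_match_target` and `add_all_ways` are made up for this example.

```rust
/// Byte-level access to a multi-byte integer is defined, but the result
/// depends on the target's endianness (a parameter of the abstract machine).
fn native_bytes_match_target(x: u32) -> bool {
    let native = x.to_ne_bytes();
    if cfg!(target_endian = "little") {
        native == x.to_le_bytes()
    } else {
        native == x.to_be_bytes()
    }
}

/// Overflow has a fixed, documented set of outcomes. The explicit methods
/// select one of them regardless of compiler flags.
fn add_all_ways(a: i32, b: i32) -> (Option<i32>, i32) {
    (a.checked_add(b), a.wrapping_add(b))
}

fn main() {
    println!("{}", native_bytes_match_target(0x1234_5678));
    println!("{:?}", add_all_ways(i32::MAX, 1));
}
```

Code that needs one specific outcome can ask for it explicitly instead of depending on build flags, which is exactly what a fully enumerated set of behaviours makes possible.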
To add to this, the purpose of the UCG (unsafe-code-guidelines WG) is to specify Rust, not to specify a particular Rust implementation. Basically, the long-term goal of the UCG is to produce something akin to (but better than ;) the C/C++ spec. As far as the spec and therefore the UCG is concerned, programs with UB are just wrong, period. This is the same as in C/C++: the spec does not discuss any such implementation-specific guarantees. Some members of the lang team have also expressed a preference in the past of not making any extra promises in rustc [the implementation] for things that are UB in Rust [the language]. They want to avoid fragmenting the language into dialects that only work with some implementations. Worse, since there is only one implementation currently, there is a huge risk of any such implementation-specific promise becoming a de-facto guarantee that the entire ecosystem relies on. Therefore, as far as the rust-lang organization is concerned, programs with UB are beyond salvaging. They are not subject to stability guarantees (or any guarantees really) and they need to be fixed. Implementations could assign meaning to UB programs, but rustc [the implementation] does not. In fact it would be healthier for the ecosystem if alternative implementations (once they exist) do not do so, either, since any such guarantee is an ecosystem split -- programs that run fine in one implementation do not run fine in another. Effectively, if an implementation makes such a promise, then it implements a different language, with a different Abstract Machine. That's why I talked about "dialects". In practice, rustc [the implementation] will do what it can to help programmers even if their programs have UB, if it does not compromise UB-free programs. Usually the goal here is to make the programmer aware of that problem so that they can fix their code. 
Sometimes we even temporarily take back changes that "break" UB programs until UB-free ways to do things are possible; this happened around unwinding for |
Does rust have a category corresponding to C/C++'s "implementation defined" then? It sounds like we would want to avoid it, and as long as "rust = rustc" it's a bit difficult to distinguish implementation defined from plain old defined behavior. |
Not yet, mostly for the reason you mentioned. I think such questions will come up way later in the process. The IMO more interesting other "kind of behavior" to talk about is unspecified behavior, which is closely related. There was an attempt to define it that failed to reach consensus. (That PR should likely be closed and a new one started.) The only real difference between "unspecified" and "implementation-defined" is that for the latter, implementations need to document which choice they make -- so once we have nailed down what "unspecified behavior" means, we have pretty much also covered "implementation-defined"; we just need to decide on a case-by-case basis if implementations ought to document a choice (and guarantee that choice for future versions of the implementation) or not. |
Well, aside from permitting implementations with behaviour that diverges from possible specifications. This was a primary reason why signed integer overflow was undefined in C/C++, because there were too many possible behaviours, depending on the machine architecture, and the signed integer format (which was unspecified until C++20, and I believe C2x also has the same thing). Trapping is strictly not a part of the C or C++ standard, so whenever a potential behaviour is to trap, the behaviour is necessarily undefined. I would still say it's a good idea to define the term somewhere, so as to avoid issues with interpretation.
I do agree. However, it is always a good idea to acknowledge a particular implementation may promise to respond to particular UB in a particular way, and that many implementations may all agree on this meaning (returning to my type-punning unions example), so it is possible to exploit that known extension (one of my personal rules of UB says that Known Compiler Extensions are fine). From the response I got, it seems like it's illegal to "define" undefined behaviour in Rust, even though it may even be necessary to implement the specification itself (in several places in my implementation of libcore, I transmute between
It may be reasonable to include such a note, or some acknowledgement that a valid implementation may assign actual meaning to the behaviour. In general, yes, you shouldn't invoke undefined behaviour, but sometimes (especially when writing libraries to support the specification/standard) it can become unavoidable.
This is one of the reasons I am working on lccc, so that rustc's implementation does not become the de-facto standard.
Sometimes yes, there are some instances when this becomes necessary. For example, low level code may need further guarantees the language does not provide, and that is why such extensions exist. Many of lccc's extensions are inherited from the fact it's designed to be a compiler for Rust, C, and C++ and successfully operate code that would work with gcc or clang (in particular, support of libstdc++ and libc++ is a goal because of binary compatibility), so many of the rules implemented for rust are lessened where C or C++ has weaker requirements and vice-versa.
As I mentioned above, sometimes it is infeasible to do so, as it requires adding additional requirements to the spec that may end up incredibly vague (see the signed overflow example). If you talk about "trapping", how do you define that to cover all possible ways traps can occur and be handled? Further, strict-aliasing exists because the meaning of values (and pointers, see the fact that |
I don't see why undocumented things would ever be a good idea. Unstable things might be, and really I think most of rustc is currently in that category: all the nightly features are clearly not UB but also not among the (very few!) actually stable and defined behaviors that are in the UCG document.
I agree with Ralf that this was a huge misstep on the part of the C/C++ committees. This really should have been implementation defined behavior (or platform-specific behavior), not undefined behavior. Making things like signed overflow UB makes things much more hazardous for the programmer, and when you couple it with the newly re-imagined UB as license for optimization you have a recipe for disaster.
My interpretation is that it is allowed for compilers to extend the language, but it is discouraged, because we would much rather incorporate those extensions into the language itself or come up with some suitable alternative that doesn't require creating language dialects. In particular, if you do fewer optimizations than rustc, or are doing something that matches better with C semantics, and as a result can (and are willing to) make more guarantees about behavior that would normally be undefined, I don't think that would be a problem. But programmers won't really be able to rely on it unless they write only for your compiler.
This is not UB, this is dependence on unspecified behavior. All types in rust have a layout, and if you write code for the layout that actually occurs then that is not UB. So as long as you are willing to live with the lack of stability, you can depend on the layout of repr(Rust) things, as long as you don't guess the type layouts incorrectly (possibly because you are the standard library and thus have control over such things).
I for one am glad you are doing so. It is easy to get into a mindset that is aligned to the single implementation, and accidentally equate rustc behaviors to Rust behaviors, and I hope that a re-implementation will shake things up.
I think implementation defined behavior or platform-specific behavior handles this well; on a particular platform or with implementation context, you can say more about what exactly a "trap" entails, for example, and the main spec doesn't have to touch it.
I don't think the meaning can change arbitrarily, at least in Rust. Also AFAIK
One additional desideratum that rust has for its UB is that it should be dynamically checkable, using Miri. I'm not totally sold on this being an iron rule, but it is definitely a major improvement on the C/C++ situation where there are mines around every corner and no way to know that you have stepped on one until it is far too late. So that is an additional reason why we might not want to throw everything into the UB-bucket, if it involves a property that is not (easily) decidable. |
I don't know how correct it is, however,
My question is how you define trapping. As I mentioned, trapping is outside the bounds of the C++ Standard, so it wouldn't be an acceptable choice for unspecified behaviour. On ARM, signed integer overflow causes a hardware trap, so if we simply exclude trapping behaviour, that either requires extra code to support ARM, or ARM is no longer a valid target.
It would come down to being defined, since a trap would explicitly interact with observable behaviour and whether or not such observable behaviour occurs. The requirement would have to be extraordinarily vague, which is worse than the current status quo; at least we know that overflow is something not to touch.
C and C++ acknowledge implementations where that is the case. Right now, Rust is impossible (or at least, unreasonably difficult) to implement on such an implementation. This isn't necessarily an issue in practice, as many of them are theoretical implementations, or things done "for fun" (see the JVM implementation I mentioned).
My question then becomes: would you rather something be specified as UB, or something be so vaguely specified, because the actual specification is unreasonable or impossible, that it's possible to derive a valid interpretation where it is, in fact, UB (which means the point itself is UB, because compilers love the "best" possible interpretation, as we have established)? Going to the trapping example, if it's left to the platform to decide what "trapping" is, what if the decision is that a result that traps is UB? How would you define "trapping" to cover implementations that do trap but handle traps in particular ways, or may not even have the ability to handle traps, or where whether a trap can be handled depends on arbitrary state, etc., such that there isn't a valid interpretation where the result is UB, or effectively UB? I did bring this up in #84 (though I will concede it was off-topic there), where the layout rules of enums are unspecified, but with niche-optimizations, it was possible to devise an interpretation where unspecified behaviour (relying on the unspecified layout of repr(Rust) enums) could be elevated to undefined behaviour. Specifying something as unspecified behaviour that can result in undefined behaviour is the same as calling it undefined, except now it's hidden under interpreting the specification with a language lawyer hat, which is less fun for regular programmers I'm sure. |
UB is by definition the vaguest possible specification. Unless there is a reason for the UB to exist, then a less vague specification is always better IMO, even if it is still quite vague. |
Well, rust doesn't have
I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You can perhaps set flags so that overflow traps (with attendant definition of what this entails on the machine state), or wraps, or is UB. Of course, if you are writing portable code, then as long as any implementation defines it as UB you the programmer can't trust it to be anything more than that, but you could use
Right now, writing code for repr(Rust) is very hazardous, for exactly this reason. It's not literally UB if you get it right, but it may as well be for the programmer because very little about it is stably guaranteed. Instead, there are things like the layout API that allow you to access this information in a more portable way, and ideally this would be good enough that there is no reason to make risky guesses about the layout because the safe and stable alternatives are in place.
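As a rough sketch of what "ask instead of guess" looks like, the snippet below queries the layout of a `repr(Rust)` struct through `std::alloc::Layout` rather than assuming a field order or size. The `Mixed` type and the `mixed_layout` helper are hypothetical names invented for this illustration.

```rust
use std::alloc::Layout;

// Hypothetical type for illustration. Under the default repr(Rust), its
// field order and padding are unspecified and may change between compilers.
#[allow(dead_code)]
struct Mixed {
    a: u8,
    b: u32,
    c: u16,
}

/// Ask the compiler for the layout instead of hard-coding an assumption.
fn mixed_layout() -> Layout {
    Layout::new::<Mixed>()
}

fn main() {
    let layout = mixed_layout();
    println!("size = {}, align = {}", layout.size(), layout.align());
}
```

The exact numbers printed are deliberately not stated here, since they are precisely the part that is not stably guaranteed. Only properties such as "size is a multiple of alignment" can be relied on.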
I disagree. UB (or the "language contract" as Ralf says) is a contract between the programmer and the compiler. A vague definition helps neither party, and may in fact lead to a miscommunication, which is a failure of the spec. A clear UB is at least informative for the programmer (and may additionally simplify their mental model, so it's not necessarily a negative), and it enables more optimizations for the compiler (and a simpler model is also good for the compiler writer to avoid bugs). |
Without having time right now to respond to all the points:
If the Rust standard library ever invokes UB, that is a critical bug -- please report it if you find such a case. It is certainly avoidable to do so, and it is a huge problem if we do so for all the usual reasons that UB is bad. (There are some known cases of this, but we do consider those bugs that we want to resolve, and efforts/discussions are underway to achieve that.) I think this approach is necessary for building a reliable foundation of the ecosystem. (We could of course do things that are UB in Rust [the language] but not UB in rustc [the compiler]. For the reasons mentioned above, there are no such things.) It is true that some of our docs are or have been imprecise about the distinction between UB and unspecified behavior, and also sometimes about the distinction between library-level UB and language-level UB. I am trying to fix such cases as I see them.
I think what @Diggsey meant is not that we should be vague about something being UB or not, but that saying "X is UB" is vague about what happens when X occurs. More vague than any other thing we could say. |
The JVM implementation relies on typed memory, and strict-aliasing to avoid having to emulate memory.
Inherently, the careful use of UB is inevitable in a standard library, but as mentioned, the fact it's the standard library means it can if it wants, it just needs to get the compiler to do what it needs. Hopefully, it is impossible to fully implement the standard library in the language itself, sometimes this means the use of compiler intrinsics, sometimes this means , sometimes this means the use of things strictly specified as UB.
This is the UB I am referring to here. UB in the language, but which an extension of the particular compiler permits.
At least a specification of UB is not vague about the fact that it is UB, which is what I was referring to. It's worse if, rather than outright saying "don't do X, X is UB", you have "the behaviour of X is unspecified" and constrain it in the vaguest possible way, where a valid interpretation of the constraints allows X to have UB, because that means that X is UB; it just doesn't outright say it. This is worse, as I say, because it's harder to realise that it is UB. I would add that the signed integer overflow UB actually has had real performance benefits to actual code in the field. From a cppcon talk, which I could probably look up if people wanted it, there was some rather hot code that was using
I'm sure I've made my position on this clear, but for completeness, I really hate the distinction, because it makes it easier to reason about UB (which is rule number 1 in my rules for UB: "Do not reason about UB"). The biggest footgun in C++ is not when people don't know about some arbitrary piece of UB, it's when people think they are smarter than the compiler, and try to justify a particular kind of UB (I would know, I've tried this before. It didn't end well, hence my rules of UB). |
I'm not convinced by that. Certainly for some things it was more about portability, but I think optimizations have been core from the beginning. My go-to example: One of the very first things that people wanted compilers to do was register allocation for local variables. Without that optimization things would have to be loaded and stored to the stack all over the place, which would be terrible for runtime performance. But doing that requires making certain things undefined behaviour -- |
But couldn't this be handled the same way rust does layout optimization? That is, if you are lucky and guess the compiler's playbook then you can safely update b this way but if you miss and hit the wrong thing then it's UB. (And if |
A lot of things can happen if you're lucky and guess. Specifically, the outcome of UB might be what you expect. It's always possible that the UB doesn't come back to bite you when using a particular compiler, on a particular set of flags, on a particular ... and so on. But what exactly happens is always up in the air, which is why as a user of the language/compiler you need to avoid UB if you want reliable compilations. But in that particular example with |
In early C compilers? Yeah, it probably could be handled that way. You might already realize this, but it couldn't be handled that way in modern compilers without needlessly sacrificing optimization potential. As a simple example:

    if b >= 0 {
        do_something_with(&mut a);
        if b < 0 {
            do_something_else();
        }
    }

Assuming |
It didn't have to be; it could have been implementation-defined. For example, while the C standard makes most kinds of overflow either undefined or well-defined, there is one exception. If you cast from a larger integer type to a smaller one and the value can't fit into the smaller type, the standard says: "either the result is implementation-defined or an implementation-defined signal is raised." (C11 6.3.1.3.3) This gives the implementation an extraordinary amount of flexibility, while not going all the way to "undefined behavior".
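For contrast, Rust pins this particular case down completely: an `as` cast between integer types truncates, and `TryFrom` reports the failure instead. A minimal sketch, with `narrow` as a made-up helper name:

```rust
use std::convert::TryFrom;

/// Returns both the truncating result and the checked result of
/// narrowing an i32 to an i8.
fn narrow(x: i32) -> (i8, Option<i8>) {
    (x as i8, i8::try_from(x).ok())
}

fn main() {
    println!("{:?}", narrow(300)); // value does not fit in i8
    println!("{:?}", narrow(-5)); // value fits
}
```

Neither form leaves room for an implementation choice, which is one end of the spectrum between "fully defined" and "undefined" discussed here.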
It does not. |
Yeah, that's a brilliant idea. It works fairly well in debug, so what can go wrong. Sarcasm aside, the only time it's OK to use UB is if you are in a situation with a particular compiler, or a particular set of compilers, and you know that the compiler assigns a particular meaning to the particular undefined behaviour, either because you are very closely tied to the compiler (standard library or compiler support library) or you have a documented extension (again see my union example). "Guessing" what the compiler does falls under reasoning about UB.
I will concede that one is likely for that UB, but a decent amount of UB in C and C++ has justification further than that.
A signal wouldn't be the same as a trap, trapping doesn't need to result in a signal.
Huh, I thought it was one of the examples of where signed integer overflow was trapped at a hardware level (I do know such processors exist). |
I am always in favor of defining terms. :) As mentioned before, UB is defined in our glossary; if you have suggestions for improving that definition, please let us know!
As noted above, we do not want to encourage implementations to actually do that. Also I strongly disagree with it being unavoidable. In Rust, we are avoiding relying on "UB in the spec but the compiler guarantees a specific behavior" (modulo bugs), so we have constructive evidence that it is possible to build a language that way. And this is the way I (and I think I am not alone in the UCG and the lang team to think so) would prefer other Rust implementations to go as well. Certainly I see no reason that we should explicitly cater to another approach. (We shouldn't explicitly forbid it, either, but nobody has been suggesting that.) I do not think it is the role of a spec to point out that one could derive other, adjusted specifications for it. That is just obviously true for every document. These derived specifications are separate dialects of Rust. The purpose of the UCG is to spec out the "main language", not to figure out the design space for dialects. At least, I personally have little interest in developing such dialects, and I think the UCG has enough on its plate without that additional mandate. And finally, discussion of such dialects should, even when it occurs, be kept strictly separate from the "main language". We should not mix up what ends up in a Rust spec and what ends up in the spec of some derived language dialects that some hypothetical future implementations might choose to implement instead.
Again I disagree that this is necessary. So far the approach of the lang team and UCG has always been to instead work with the people writing that low-level code, figure out their needs, and see how we can accommodate them without creating language dialects. I firmly believe that this is the better strategy, and I see no reason to think that it would not work. Both sides (language designers and low-level programmers) gain a lot when we can avoid splitting off a "low-level dialect" of Rust.
There were versions of the C spec before strict aliasing. So no, that is not the reason. C could just specify that when the types of stores and loads do not match, the bit-pattern is interpreted at the other type. C provides ways to do this, e.g. through Literally the only reason C has strict aliasing rules is to enable more optimizations. If they removed strict aliasing from the spec, there wouldn't be any gaps or open questions created by that. (There'd be a lot of open questions removed actually.^^) This is also demonstrated by the fact that all major compilers have a flag like
There could be all sorts of things you can say without saying it is UB -- things like aborting program execution, or signal handlers (which the standard does talk about). A program that traps will not arbitrarily jump into otherwise dead regions of code, which UB might well do; I don't think it would be too hard to come up with a reasonable list of possible behaviors here.
You keep saying that, but it is just not true.^^ Rust proves otherwise (modulo bugs).
FWIW, in Rust this particular example does not carry over -- the reason signed integer overflow UB helped here is that people use
(This was about unspecified vs implementation-defined behavior.)
I don't think that would be a good idea. At that point programmers have to basically treat this as UB. So this is effectively equivalent to saying it is UB but some implementations making stronger guarantees about it, which is not a good idea for all the reasons mentioned before. In fact, if you take it as a given that implementations may guarantee specific behavior for UB, then if you allow implementation-defined behavior to be "implemented as UB" you just made it equivalent to UB. So no, I strongly disagree, "UB" should not be on the list of things an implementation may choose from for unspecified or implementation-defined behavior. I should add that I think there are examples of UB that are not motivated by optimizations but by "there's literally nothing better we can say". For example, taking a random integer, casting that to a function pointer, and calling the function. There is no reasonable way, in the Abstract Machine, to bound the behavior of such a program. But those cases are by far in the minority for UB, both in C/C++ and in Rust. It would be trivial to precisely describe Rust, C, and C++, and to have a Miri-like checker for them, if this was the only kind of UB that we had. |
I agree that for programmers aiming for full portability to any conforming implementation, this may as well be UB. However it differs from UB in that you can use it selectively if you happen to know more about the particular implementation, e.g. using
I take your point. Although it makes me wonder what the role of |
The Abstract Machine has parameters, for things like pointer size and endianness. Also, I'd say UB is a dynamic, run-time concept, and as such always refers to a program "after |
I agree with all of the above. My question is what if there is an operation which is UB for some configurations and not others (for example, calling a CPU intrinsic). Does the abstract machine need to know about all these configurations, in order to specify them? I was hoping that this could be classed under "implementation-defined behavior" or "platform-dependent behavior" so that the abstract language doesn't need to contain the union of all quirks from platforms it has ever compiled to. |
Nitpick: No there weren’t. Strict aliasing was already in the first standardized version of C, C89, though most people didn’t know about it until GCC started enforcing it in 2001. Edit: But it is true that it exists solely to enable compiler optimizations. |
For the more informative UCG, I think it's good as it is. When Rust does get around to writing a proper specification, something more akin to what C and C++ have, i.e. something like:
It's not necessarily encouraging implementations to do that, in giving an example of what can happen. I also equally mention that the construct can be ignored/evaluated as-is, potentially interfering with other well-defined constructs. C and C++ both say in a note that a valid response to UB is to assign it some arbitrary meaning. It equally means: if you really want to use this construct, you shouldn't, but seek out your compiler's documentation first, as it may say you can.
In lccc, we inherit some things that aren't UB from C and/or C++, usually because I don't care enough about the particular optimizations to add further tags saying when certain things are UB and when they are well-defined (conversely there are some things in C and C++ that are well-defined under lccc because Rust says they are and I don't want to duplicate across them). This isn't horribly new: gcc has a bunch of extensions to both C and C++ that exist in one primarily because the other allows the same (gcc lets you type-pun with unions in C++, and clang does as well primarily because gcc does).

    #[repr(C)]
    struct StandardLayout {
        num: i32,
        other_field: f32,
    }

    fn do_thing(v: &mut StandardLayout) {
        let x = &mut v.num;
        // Cast the pointer to the first field back to a pointer to the whole
        // struct, then project to a different field.
        let y = unsafe { &mut (*(x as *mut i32 as *mut StandardLayout)).other_field };
    }

In SB, that is UB, because you have exceeded the provenance of x. In lccc, however, it's well-defined because of pointer-interconvertibility and reachability rules. Specifically, you can reach
Fair, I will concede that point. However, standard libraries do sometimes use UB in the language proper, either because they have to, or to be efficient/clever. libc++ and libstdc++ are definite examples (I can't remember exactly where, but I remember seeing some in). As mentioned, the Rust standard library implementation for lccc will make use of pointer-interconvertibility for manual layout optimizations. More to the point, standard libraries are in a privileged position where they can make things not UB because they want to do something that is. Same with compiler-support libraries, which are less likely to be able to avoid UB, which is why they exist. Neither libunwind nor libgcc_s is particularly well-defined when it comes to unwinding (I can't really think of a way to implement stack unwinding at all absent some undefined behaviour, aside from using pure assembly, certainly not for Itanium). This is why I consider standard and compiler support libraries some of the only exceptions to my otherwise absolute rules of UB, including good ideas such as "Do not reason about UB" and "Do not rationalize UB".
Indeed, though the exact case was it was using
It likely is in Rust, though it may depend (especially if Rust introduces implementation reserved identifiers, which would be nice, since I want to have a synthetic crate in lccc filled with implementation details). In general, it's not. Examples of this include the prohibition in C++ against instantiating standard library templates with incomplete types (yes, the C++ compiler is allowed to format your hard drive when translating the program, which is arguably hilarious). As I say, UB is literal, and it doesn't particularly matter when the UB happens. I also use it in my API as an "escape hatch" from the conformance clause (which states a conforming implementation must issue a diagnostic for ill-formed programs, but I want my implementation details, and C++ isn't that brilliant when it comes to that). |
Adding to this
I agree, this is a poor idea
There is a term for this: conditionally-supported behaviour. It requires implementations to document when they do not support the behaviour. There are some cases where it can actually be useful. For example, I would like to look into making volatile access to address 0 conditionally-supported with implementation-defined results, as it can be an asset to embedded devs. Unspecified behaviour certainly should not permit undefined behaviour. I do agree that most programmers should treat conditionally-supported behaviour as something not to touch (whether being unsupported means UB or a compile error), and that it doesn't give much benefit over just being UB and letting the compiler decide whether it wants to give you whatever behaviour. |
You have given a bunch of horror-story examples of terrible uses of UB in C/C++, and I don't find them particularly compelling for adoption in Rust. UB at translation time is just really obvious compiler-developer-pandering. We already try very hard to be able to find all uses of UB at runtime, so if there were compile time UB I would expect nothing less than to have a "dynamic checker" for that too; but dynamic checking at compile time is just compile time checking so it just ends up as part of the compiler's workings, and so it's not UB after all. I don't think that the stock C/C++ wording
is very good either, because it does not at all elucidate the way in which UB is used, as a dynamical concept of "stuck state" in the abstract machine. In fact, I would be happy with just such a description:
|
To head off Ralf's exasperated comment: This is fine and your prerogative as the designer of lccc, but not the business of the UCG.
I think it is a deliberate choice of Rust to not attempt to accommodate older architectures that differ considerably from modern hardware. We've all seen that C suffers greatly from the baggage that it carries from that era, and no one wants to keep carrying that forward if the processors are no longer in use.
To emulate Rust in the JVM, you almost certainly have to emulate memory. You might be able to do various kinds of program analysis to hoist values out of memory but that's all subject to the as-if rule, and the R-AM works on flat, untyped memory. (Personally, I think that C++'s casting mechanisms are far too complicated. Rust has a simple and intuitive model of memory, even if it makes it harder to concretely represent the memory in other ways.)
Why don't you just reserve a crate on crates.io? The standard library and rustc are all stuffed in the
This is more interesting. That conformance clause doesn't currently exist in Rust AFAIK, and it does seem odd to me that we should require that you give a diagnostic for use of lccc extensions of Rust. But this is probably best suited for its own issue. |
I stand corrected; thanks for pointing that out.
I honestly don't think "Behavior for which this specification imposes no limitations" is very informative, given how often it misleads people, and it is more useful to talk explicitly about the Abstract Machine and that the implementation expects the programmer to uphold its side of the contract. That is, in my opinion, a better framing and phrasing of UB than what C and C++ do. We can clarify that as a consequence, there are no limitations to the behavior of a program that violates said contract. In fact I think we already say that:
But if you think it is helpful to explicitly say "no limitations" and not just "garbage", that is fine for me, too. But anyway that is a separate bikeshed.^^
I would say that this is a case of C/C++ encouraging implementations to assign some arbitrary meaning, which I think we should not do for Rust. But this is getting extremely subjective and we clearly have different positions here, so I doubt we will resolve the dispute by repeating our positions. ;) We'll probably have to agree to disagree, and when it comes to wording the final standard, there'll be more people involved and we can see what they think.
I obviously cannot stop you from doing whatever you want with your own project. I think I stated my point for why providing such guarantees to Rust code on some implementations risks an ecosystem split. On the other hand, having a unified semantics with C/C++ does require some very different trade-offs. What I do not understand is how you think this should affect the UCG. Doing better than C/C++ is explicitly one of my goals, so I'd be quite opposed to any attempt to unify UB with those languages.
Yes, this definitely sometimes happens, it's just something we'd like to avoid in Rust proper. Again I cannot tell you how to build your own compiler, so if you think this is a good strategy, I will respectfully disagree and we can go our separate ways. ;) Looks like you are set on defining a Rust dialect that makes some extra guarantees. I am not terribly happy about that but respect your decision. Again I am not sure how this should impact UCG work -- as long as we don't want to define any behavior that you need to be UB, you should be good, right? This first came up around validity of references, but given that you must support C-style pointers that point to garbage (I don't think it is UB in C to have a
I'd phrase this more carefully... the compiler has to have a uniform notion of UB across all code (otherwise things like inlining are broken). So what standard libraries can do is exploit the knowledge that something is not really UB in the actual language implemented by this compiler, even though the language spec documents it as UB. This is very similar to exploiting knowledge about unspecified implementation details. Code outside the standard library could in principle do the same, but then it would be tied to only work with a particular version of the compiler. IOW, the privilege of the standard library comes solely from being compatible with exactly one version of the compiler, and being able to rely on undocumented aspects of the compiler because it is maintained by the same people. In contrast, user code has to be compatible with a wide range of compiler versions.
At least in C, it is legal to do union-based type punning under some circumstances. So the answer is "the same as that". In C++, the answer is "the same as a reinterpret_cast from int to float".
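For comparison on the Rust side, here is a minimal sketch of the same int-to-float reinterpretation. It assumes the documented Rust rule that a union field read reinterprets the stored bytes as the field's type, which is fine when those bytes are valid for that type. The names `IntOrFloat` and `pun_via_union` are hypothetical and exist only for this illustration; `f32::from_bits` is the safe standard library equivalent.

```rust
// Hypothetical union, used only for illustration.
#[repr(C)]
union IntOrFloat {
    bits: u32,
    value: f32,
}

/// Reinterprets the bits of a `u32` as an `f32` through a union read.
fn pun_via_union(bits: u32) -> f32 {
    let u = IntOrFloat { bits };
    // SAFETY: both fields are 4 bytes wide and every bit pattern is a
    // valid `f32`, so reading `value` after writing `bits` is defined.
    unsafe { u.value }
}

fn main() {
    let bits = 0x3f80_0000; // IEEE 754 single-precision encoding of 1.0
    println!("{}", pun_via_union(bits));
    // The safe standard library function performs the same reinterpretation.
    println!("{}", f32::from_bits(bits));
}
```

The union read needs `unsafe` because the compiler cannot check that the stored bytes are valid for the field being read; that check is the programmer's obligation.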
I think that's just C++ being silly.^^ There is also some UB in the C preprocessor if I recall. But that is, on a formal/technical level, very different from the kind of UB that is used for optimizations, so they should really not use the same term.
Sorry for that. I tried to tone it down, but clearly not enough. Maybe I should take a break from this thread; I have stated my case. |
I moved and generalized that particular lang item, into an unstable (but not quite, the impl on Box is stable to use via operators) DerefMove trait. I believe discussions are already in place to make that part of rust. |
Indeed, and the comment also indicates that this is considered a bug in the standard library, precisely because not even the standard library may cause UB. In rustc, libstd is not privileged wrt UB, and any place that acts differently is a bug. The "privileged knowledge" part here explains why this is not a P-high bug that needs fixing immediately (it argues for why this bug is currently unlikely to negatively impact users), but it does not make this any less of a bug. This is very different from saying "it is okay for libstd to do something like this". It is not okay, and this particular bug is on track to be fixed by this RFC. Once that RFC is implemented, this FIXME will finally disappear. I have been waiting for that for a long time. :) |
As far as I can tell, it was done intentionally, perhaps to satisfy a requirement that is impossible or grossly inefficient otherwise. Even if it is considered a bug, it may be a necessary one. I have not looked at the RFC, but given that the choice is to change the language, not the implementation, I stand by what I said. Standard libraries will frequently do things that require explicit compiler support to be efficient, not necessarily because it would be impossible otherwise (though as mentioned, a reason for something being in the standard library is that it's impossible to implement in the language itself). For example, while not UB, clang and gcc defined an intrinsic to implement |
Yes, because at the time there was no better way (the original code predates even clarifying the validity invariant). But the fact that there is a "FIXME" indicates quite clearly that this is considered a hack, not a proper solution.
The RFC is in fact a libs-only change. |
They do, and Rust uses lang items for that purpose. However:
|
That and intrinsics, indeed. Though this still raises the question of what the difference is between an unstable lang item/intrinsic/language feature and undefined behaviour explicitly provided by an extension. As mentioned,
That's fair on the requirement side (from a cursory look through the documentation, it looks like |
I am not sure why you cannot accept the fact that the rustc devs and the people specifying Rust consider it an antipattern to explicitly define any part of UB via any means.^^ Adding an unstable intrinsic is a tiny extension, whereas saying that integers may be undefined is a global change that fundamentally alters the Abstract Machine of the language that rustc actually implements (and that rustc optimizations have to be checked against). |
For something like reachability of pointers, I would consider that similarly small. Particularly, an intrinsic can be created to get a pointer to access an enclosing repr(C) structure from a pointer to its first element (in fact, As another example, |
For such intrinsic to have the semantics you want, you would need to modify the abstract machine a lot more than you seem to think. For example adding the
If the implementation can assume that UB doesn't happen, it is free to do anything it wants for things that would be UB, including inserting traps, as those things "wouldn't happen" anyway. |
In fact, Miri is an excellent example for how an implementation can provide explicit guarantees for what happens on UB (raise an error) while still conforming with the formal spec which says "the implementation may assume that UB does not occur". Miri is just rather defensive about that assumption and double-checks instead of just trusting the programmer. The wording for UB given in the UCG glossary totally allows for this possibility. In my view "the implementation is not limited in how it chooses to evaluate a program that has undefined behaviour" is a consequence of "the implementation may assume UB does not happen", not vice versa. My proposed wording is a logical implication ("if the program does not exhibit UB, then the implementation will realize program behavior"), and as usual with implications, they impose no restrictions on what happens when the antecedent is false -- in this concrete case, this means imposing no restrictions on what happens when the program has UB. But I think viewing UB as a proof obligation explains much better why it is so useful, and how programmers can work with it (by checking the proof obligations everywhere). This works particularly well in Rust, where we expect each |
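As a small-scale illustration of "double-checking instead of trusting", here is a hedged sketch of an unsafe function that states its proof obligation and, in builds with debug assertions enabled, verifies it before relying on it. The function name `get_defensive` is hypothetical; `debug_assert!` and `slice::get_unchecked` are standard library items. This is not how Miri works internally; it only shows that checking an obligation is compatible with a contract that lets the implementation assume it.

```rust
/// Returns the element at `index` without a bounds check in release builds.
///
/// # Safety
///
/// The caller must guarantee `index < slice.len()`; otherwise the behaviour
/// is undefined.
unsafe fn get_defensive(slice: &[u32], index: usize) -> u32 {
    // A defensive implementation may double-check the obligation instead of
    // trusting it. This check exists only when debug assertions are enabled.
    debug_assert!(index < slice.len(), "caller violated the safety contract");
    // SAFETY: the caller promised that `index` is in bounds.
    unsafe { *slice.get_unchecked(index) }
}

fn main() {
    let data = [10, 20, 30];
    // SAFETY: 1 < data.len(), so the proof obligation is discharged here.
    let value = unsafe { get_defensive(&data, 1) };
    println!("{value}");
}
```

A caller that upholds the contract observes the same result whether or not the check is compiled in, which is exactly what the implication-style wording permits.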
Fair. However, in the presence of such an intrinsic, making pointer casts semantically have the same effect is not a fundamental change. It's a choice in how to implement the latter. The intrinsic mentioned is not a question of how it affects the abstract machine of rust, it exists and it is impossible for it to not exist (because of how name resolution of builtins is defined). The builtin exists to satisfy a requirement of the C++ abstract machine,
Didn't the rust abstract machine have that model before rustc added that intrinsic? Also I presume we can ignore in this argument any and all intrinsics for which the existence is implied by the standard library itself. The standard library itself is part of the abstract machine, after all. The fact
... including to assign a particular, well-defined meaning to it? My point isn't that it can break some optimization performed by the compiler, but that such an optimization, in the case of lccc, would be valid anyways.
Similarly fair. It can be viewed either way, and the former has definitely come to imply the latter. The fact UB never happens is one of the best truths a compiler writer has at their disposal. However, just as math can be built upon by removing restrictions, so too can programming languages. It's still a Rust implementation because it fits within the behaviour prescribed by the Rust abstract machine (provided that behaviour can be worked out). |
The intrinsic has existed ever since rust-lang/rust@c35b2bd. Before that, it directly used a function written in LLVM IR, which you could also consider a kind of intrinsic.
Almost all intrinsics are implied by the standard library itself.
No, it is not. The rust abstract machine defines how MIR is executed. The standard library merely exposes some parts of the rust abstract machine that can't directly be accessed in a stable way using the user facing language that lowers to MIR. Saying that the standard library is part of the rust abstract machine is like saying that all existing unstable code that compiles with the current version of rustc is part of the abstract machine as it can access intrinsics. It is like saying that all C/C++ code is part of the C/C++ abstract machine because it can call intrinsics. |
This is also an expectation at least in how I document anything. However, if I accept a valid raw pointer, I don't say that I can assume it does, I write something like
or
(with the last part optional). This is sufficient to express that the result is undefined behaviour, according to the library in question, if ptr does not satisfy either. |
The C++ Standard library is part of the C++ Abstract machine. It is defined as part of the document that says "The semantic descriptions in this document define a parameterized nondeterministic abstract machine." ([intro.abstract] clause 1). fn drop<T>(x: T){} It is
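How fn drop<T>(mut x: T){
    // `drop_in_place` takes a `*mut T`, so a mutable borrow is required.
    unsafe{ core::ptr::drop_in_place(&mut x); }
    // Skip the automatic drop so the value is not dropped twice.
    core::mem::forget(x)
} This would be a stupid, but perfectly valid implementation for |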
let mut u = MaybeUninit::<T>::uninit();
// SAFETY: `u.as_mut_ptr()` points to allocated memory.
unsafe {
u.as_mut_ptr().write_bytes(0u8, 1);
}
u
where
In my opinion the fact that intrinsics are an implementation detail used to implement certain standard library functions doesn't mean that the rust abstract machine includes the standard library. The rust abstract machine is in my opinion solely implemented by the rust compiler. IMO it includes parts that are stable defining how stable code works, and it includes unstable parts used to implement the standard library. The standard library simply depends on certain unstable parts of the rust abstract machine the same way it depends on certain unstable language features like specialization. These unstable parts can differ from compiler version to compiler version or even from compiler to compiler.
The specific LLVM version is not a part of the rust abstract machine, not even an unstable part, as the same rustc version can be compiled against a wide variety of LLVM versions. This means that it is not ok for the standard library to depend on UB that just so happens to not cause a miscompilation on a specific LLVM version. Thanks to rustc_codegen_cranelift (cg_clif) (author here) even the existence of LLVM itself is not part of the rust abstract machine. Not even an unstable part.
The only part of the standard library that could be considered part of the rust abstract machine is stdarch (core::arch). This contains a lot of platform intrinsics for SIMD that directly use LLVM intrinsics. Every other bit of the standard library is completely agnostic to the codegen backend. Because of this combined with the fact that all functions in stdarch are marked as
Besides, not giving the standard library a privileged position makes it easier to understand how it works and makes it safer to just copy snippets from it into your own code. |
Exactly, defined. It isn't necessarily implemented in terms of (in fact, in lccc, it's the latter, The abstract machine is the sum of the behaviour specified by the specification. If the specification includes the standard library, then the standard library is part of that abstract machine. The standard library isn't part of a program, it's something that exists because of the specification, and has its semantics defined by the specification, so its semantics fall definitively under part of the abstract machine, even if those semantics can be perfectly replicated in user-written code. Absent that part of the specification, it wouldn't be a violation of the as-if clause to not provide the standard library. The implementation of the standard library or compiler has absolutely no bearing on the abstract machine. Saying that the compiler defines the abstract machine is a great way to reduce the possibility of a competing implementation, something I am very much against. The compiler should have an argumentative position, in saying what can and cannot be done, but should not have the position of defining the behaviour. Intrinsics and lang items would fall under a specification like "The implementation may provide any number of unspecified unstable features, with unspecified semantics when enabled by a crate-level attribute declaring the feature. If the implementation does not support the feature or enabling features, the program is ill-formed."
Inherently, it has privilege, because it can access private implementation details such as intrinsics and lang items. I certainly wouldn't want to just abuse extensions without saying anything, maybe I'd include something that mentions the extension and its non-portability, but people already cannot simply copy just anything from the standard library, because it may be feature gated. For example, code that uses the pointer-interconvertibility rule would have this
// SAFETY: This is sound because lccc permits casts between pointers to *pointer-interconvertible* objects,
// And we know we have a borrow-locked field of `WideType` *pointer-interconvertible* with *narrow.
// This is an extension and not portable to other implementations.
// stdlib can do this because it expects only to be running on lccc.
// Note: See C++ Standard [expr.static.cast], clause 13, as well as xlang ir specification, [expr.convert.strong] and [expr.derive.reachability] for details on the validity of this cast.
let wide = unsafe{&mut *narrow as *mut WideType}; |
In a regular crate, you may also have one function defined to be equivalent to another. If it is or not is simply an implementation detail, not a part of the rust abstract machine.
The standard library is not an intrinsic part of the Rust language. You can very well use Rust without it, albeit only on nightly rustc versions. In fact I know of at least one past alternative standard library called lrs-lang. While it uses internal compiler interfaces, cg_llvm, cg_clif or miri wouldn't have to be changed to make it run. It would only need to be updated for the latest version of all unstable interfaces it uses.
What I say is that the abstract machine is kind of split in two parts. A stable part that all rust code can use and an unstable part that is used to implement the standard library. Both parts need to have well defined semantics, but the semantics of the unstable part may change between rustc versions. lccc may decide to have a different unstable part and would thus need to change the standard library. That doesn't mean that the standard library is allowed to circumvent the rust abstract machine. It can only use well defined interfaces like intrinsics.
Yes it has extra privileges as it can use unstable interfaces. It just shouldn't do things that normal code isn't allowed without doing so using unstable interfaces. All data structure implementations don't need intrinsics to be implemented, they can just use stable functions. This means that copy paste should just work. (after removing stability attributes) If those data structures were to use the knowledge about the specific codegen backend to for example violate the aliasing rules in such a way that the codegen backend doesn't miscompile it, then copy paste will cause problems for users later down the line. As such it shouldn't violate those rules, even if it technically can. This is what I mean with that it is not ok for the standard library to depend on UB. It is completely fine to use implementation defined intrinsics, but don't ever cause UB. |
If the crate defines as part of its API that the functions are equivalent, then that definition is not an implementation detail. It's not part of the Rust abstract machine either, but it's part of the public API specification for that crate, just as the standard library is and should be a part of the Rust specification. Of course, the specification does not bind a particular implementation, and it is up to the particular implementation whether to write one in terms of the other, or both in terms of the same, or as completely independent implementations. And if one is in terms of the other, it's also up to the particular implementation which is done.
By saying this, it can be deduced that the existence of
An implementation can introduce undefined behaviour to a program, provided it still behaves as if it did not. |
Then it is not UB.
It is only UB to use
Unstable features require an explicit opt-in and only work on nightly. Using undefined behaviour that has been assigned a meaning by a specific implementation, however, doesn't require an opt-in. If you copy-paste code that uses an unstable feature, it will simply not compile when the unstable feature is not available. If you copy-paste code that uses its knowledge of a meaning assigned to certain UB, then it will still compile on an implementation that assigns no such meaning. Instead it may or may not behave in strange ways at runtime. This is much, much worse than not compiling. |
Yes, yes it is. The compiler doesn't decide what behaviour is undefined, it only gets to decide what to do about it (though it has effectively unlimited choice in that). A crucial point of [intro.abstract], clause 1 is that the implementation does not have to implement or even emulate the abstract machine, only that it must emulate its observable behaviour. This is the as-if rule, not the first sentence, which says the abstract machine exists.
If the particular operation caused, for example, signed overflow in C, then the programmer could not write the same optimization by hand, even though LLVM performed it, because the resulting transformed program behaved as-if it was evaluated strictly, wrt. its observable behaviour.
Features can change meaning without changing names, or even changing syntax. This may be because a feature evolved, or because two features were written independently (which is why lccc qualifies its feature names not explicitly taken from rustc, so as to reduce the chance of someone else writing the same feature, unless they are implementing the same from lccc). If you used the #[lang="drop_in_place"]
pub unsafe fn drop_in_place<T: ?Sized>(ptr: *mut T){} has an entirely different meaning on lccc, because on lccc, the lang item simply designates the function itself, it doesn't result in special behaviour. A possibly reasonable thing could be to warn on these casts, similarly to gcc's -pedantic. So |
LLVM has a special flag to forbid signed overflow. If the optimization would cause signed overflow, then it has to remove this flag. That clang doesn't expose a way to remove this flag, doesn't mean that removing this flag in an optimization is UB.
True. In that case the feature is technically still available, it just has different behaviour. What I am mainly concerned about is stable code that doesn't have access to the unstable features.
The cast itself is completely valid. It is just when you dereference it that there is a problem. This dereference could happen arbitrarily far away from the cast. A lint for this without false-positives would need to use a theorem prover. If you also want to avoid false-negatives, you will have to solve the halting problem. |
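A minimal sketch of that distinction in standard Rust, under stated assumptions: the `Wide` type and the `round_trip` function are hypothetical and exist only for illustration. The casts compile in safe code; only the final dereference needs `unsafe`, and it is sound here only because the pointer has been cast back to the type of the object it actually points to.

```rust
// Hypothetical two-field type, used only for illustration.
#[allow(dead_code)]
#[repr(C)]
struct Wide {
    first: u32,
    second: u32,
}

/// Casts a `*mut u32` to `*mut Wide` and back, then reads through the
/// round-tripped pointer. No `Wide` is ever read or written.
fn round_trip(mut narrow: u32) -> u32 {
    let narrow_ptr: *mut u32 = &mut narrow;
    // Safe code: the cast by itself cannot cause UB.
    let wide_ptr = narrow_ptr as *mut Wide;
    // Dereferencing `wide_ptr` here would access 8 bytes of a 4-byte
    // object, so this sketch deliberately does not do that.
    let back = wide_ptr as *mut u32;
    // SAFETY: `back` has the same address and provenance as `narrow_ptr`,
    // which points to a live, initialized `u32`.
    unsafe { *back }
}

fn main() {
    println!("{}", round_trip(7));
}
```

Because `wide_ptr` could be stored or passed anywhere before being dereferenced, a lint at the cast site cannot know whether a later dereference will be valid.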
The particular example would be if you dereference the pointer resulting from this cast. Also, dereferencing a miscast pointer can already happen very far away. miri could be run to detect that, as it does not have the same behaviour (however, it's run on the rustc libstd, as miri would not accept a significant portion of the standard library used by lccc, primarily because there is no one-to-one correspondence between MIR and XIR). I didn't say it's necessarily a perfect idea, but I do agree it's better than code just devolving from it. I also rarely copy stuff from standard libraries, because I know they do some things that you probably shouldn't ever attempt in well-behaved code. I would presume the big obvious SAFETY comment that says "This is an extension and not portable" would be sufficient to inform people of this fact, and the warning from copying the code verbatim would reinforce this. |
I haven’t read the discussion, but I personally prefer Ada’s term for UB: “erroneous execution”. |
I think it should be a valid option for an implementation to say that an instance of implementation defined behavior is in fact undefined behavior. Overflow seems like a good candidate for that. You can perhaps set flags so that overflow traps (with attendant definition of what this entails on the machine state), or wraps, or is UB. Of course, if you are writing portable code, then as long as any implementation defines it as UB you the programmer can't trust it to be anything more than that, but you could use #[cfg] flags and such to do the right thing on multiple platforms or implementations. |
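For reference, stable Rust already lets the programmer pick an overflow policy per operation through integer methods, which is a different mechanism from the per-implementation flag proposed above. The sketch below shows the checked, wrapping, and saturating variants; the helper name `add_with_policies` is hypothetical, while `checked_add`, `wrapping_add`, and `saturating_add` are standard library methods.

```rust
/// Returns the results of adding under three different overflow policies:
/// (checked, wrapping, saturating).
fn add_with_policies(a: i32, b: i32) -> (Option<i32>, i32, i32) {
    (a.checked_add(b), a.wrapping_add(b), a.saturating_add(b))
}

fn main() {
    // `i32::MAX + 1` overflows, so each policy gives a different answer.
    let (checked, wrapped, saturated) = add_with_policies(i32::MAX, 1);
    println!("{checked:?} {wrapped} {saturated}");
}
```

None of these methods has undefined behaviour on overflow; each one documents a specific result, so code using them is portable across implementations.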
Technically, unbounded implementation-defined behaviour can be equivalent to documented UB. However, in almost all cases, implementation-defined behaviour (likewise unspecified behaviour) is constrained, either explicitly or implicitly. For example, C has "the size and alignment requirement of the type int are implementation-defined", but the answer to the question "what is sizeof(int)" cannot be "undefined behaviour", it has to be an actual value (generally 2 or 4, depending on the platform).
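The same point can be observed from Rust, as a rough sketch: the size and alignment of the platform's C `int` vary by target, but querying them always yields concrete numbers. The helper name `c_int_layout` is hypothetical; `std::os::raw::c_int`, `size_of`, and `align_of` are standard library items.

```rust
use std::mem::{align_of, size_of};
use std::os::raw::c_int;

/// Returns (size, alignment) in bytes of the platform's C `int`.
fn c_int_layout() -> (usize, usize) {
    (size_of::<c_int>(), align_of::<c_int>())
}

fn main() {
    let (size, align) = c_int_layout();
    // The values differ between targets, but they are always definite.
    println!("size = {size}, align = {align}");
}
```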
|
But what is the difference between "undefined behavior" and "implementation-defined behavior which can be undefined behavior in fact" then? In both cases behavior can be undefined, and undefined behavior can be replaced with any other behavior, so these two terms seem equivalent. Additionally, it means that all operations with implementation-defined semantics have to be unsafe, including integer arithmetic operators, which is a massive breaking change. |
I'm closing this ahead of triage. IMO, the question it poses is answered, and it's also a bit garbage in the comments (sorry about that). |
From various responses, I am confused about the meaning of Undefined Behaviour in Rust. Coming from a C++ background, and having done extensive personal research on undefined behaviour, I understand the term to be literal: behaviour which is not defined. In C++ and C it is explicitly specified as "Behaviour for which this international standard poses no limitations". In a number of specifications I have written, I have adopted similar wording. As far as I can tell, Rust does not explicitly define the term, so I assumed it has the same meaning (and it seems to have that same meaning). In particular, this definition permits an implementation which assigns some meaning to undefined behaviour while still conforming to the standard/specification (as an example, see clang and gcc with union type-punning in C++). However, a comment on #84 leads me to believe this would not be valid in Rust. If so, would it be reasonable to provide an explicit definition for the term, and is there a particular reason why a restricted interpretation of the term is beneficial to Rust?
One point: I've noticed that UB has to be justified by the optimizations it enables. I would add that undefined behaviour was never intended to be a key to optimizations; it just happens that its definition, together with the conformance clause of the mentioned standards, permits optimizations that assume UB doesn't occur. Rather, the original intent, at least from what I can determine, was to provide an escape hatch for portions of the standard that either cannot be specified or that the standard does not want to specify, because some reasonable implementation would not be able to provide a particular behaviour. If this is in fact the case in UCG, would it be reasonable to extend this justification to include reasonable implementations, not just optimizations, that are enabled as a result of the undefined behaviour?