What are the values of a union type? (in particular, what is the validity invariant of a union) (original) (raw)

When loading data at union type from one place and storing it in another -- what exactly is being required about the bytes, and what is preserved? In other words, what is the representation relation of a union type, and what is even the set of values that can be generated by a union-typed load?

Originally I intended the answer to be "the values are just lists of raw bytes as large as the union". IOW, a union has no validity requirements whatsoever, and all data inside it is perfectly preserved. However, reality disagrees. With the C ABI, some of the bytes in a union value do not always get preserved -- there can be padding. Also there is some desire to make unions have validity invariants, so that e.g. a union { usize, *const T } is noundef.

On the other hand, it is certainly not the case that the value of a union is "a value of one of its fields". Apart from the problem about writing decode for such a specification, it can be violated in safe code.

Note that for the purpose of this, we do assume that the LLVM type iN will perfectly preserve N bytes of memory. LLVM code generated from Rust can never put poison in memory (as that would not be preserved on a per-byte level, the entire iN gets "infected" if any byte is poison), which is currently accurate (Rust code has UB for any situation where the generated LLVM code might produce poison). LLVM iN values carrying provenance is incoherent but mostly works in practice and LLVM does not offer us a better option. We need a "byte" type to properly express our intent to LLVM; until that exist, we make do with what we have. If you want to discuss LLVM issues, please open a new thread -- this one here is solely about the Rust AM semantics.

Possible designs

MiniRust decides that a value of union type is a list of "chunks" where the bytes are perfectly preserved, but there can be gaps between those chunks where data is lost of a copy. The extent of these chunks can be computed from a Rust union type layout. That is sufficient to model the reality of union padding. The effective validity invariant here is still "any list of bytes allowed".

The noundef property could be achieved if we refine this "chunk" idea a bit further: every byte in a union is either "discarded", "preserved", or "initialized". Bytes which are padding in all variants get "discarded"; bytes which are padding only in some variants get "preserved"; bytes which are "in the data part" for all variants get "initialized". A value of union type is a list of bytes the size of the union with the constraint that "discarded" bytes must be Uninit and "initialized" bytes must not be Uninit. The validity invariant requires that all "initialized" bytes must be initialized -- and that's it; "discarded" bytes get flushed to Uninit on a decode. Then the Rust type union { usize, *const T } would translate to a union type where all bytes are "initialized", and we could emit noundef attributes for such a type.
(I don't want multiple distinct AM values to be equivalent, hence the 'value of union type' predicate demands that all "uninit" bytes are truly Uninit, but of course the validity invariant allows those bytes to be anything. Note that the validity invariant is a predicate over "list of bytes", whereas "what are the values of this type" is a predicate over Value. The chunks idea achieves this by just not having the padding bytes in the Value; we could do the same here but that would be formally more awkward to express I think. Maybe I should just link to a MiniRust patch here, that might be more clear than trying to say this in English.^^)

We could in principle go further and tro to achieve something along the lines of "if a byte is non-null in each variant, it is also non-null in the union", but that becomes increasingly complicated -- in particular it requires the "generate union type description from Rust type" logic to encode more and more knowledge about validity invariants, when the entire intent of this "value representation" business was that the validity invariant is implicitly encoded by the representation relation.

What other designs should we consider? And are they worth the complexity?