subdivided integer types (original) (raw)

A piece of code I often find myself wanting is the ability to e.g., have an array of slots, each of which is either occupied (and refers to something) or free (and refers to the next free slot for a free list). Often, both values can be assumed to fit inside a u31 or u63 or whatever.

In current Rust, there are 4 ways to do this:

Just use the unsigned type and rely on other ways to know how to interpret it (e.g., it's free if you got to it through the free list or whatever), but this is unreliable and bug-prone and means, e.g., indices into the array of slots can't be validated as not referring to a free slot on accident
Use a i32 or i64 and rely on sign to distinguish them, e.g., free slots are stored as their bitwise negation. This works fine and is reasonably ergonomic (since e.g., you can just do sign comparison to check things instead of more complicated bittwiddling) but fails to encode the semantics into the type system. This also doesn't work if you have more than 2 variants (e.g., 4 variants distinguished by the first 2 bits and each fitting in a u<N-2>).
enum Slot { Occupied(u32), Free(u32) } has by far the best semantics and ergonomics but pads the integer to twice its size for no reason.
enum Slot { Occupied([u8; std::mem::size_of::<u32>() - 1]), Free([u8; std::mem::size_of::<u32>() - 1]) } has good semantics and size but is extremely unwieldy and pointless restricts the range to u<N-8>.

What I suggest is range-valued enums of some form:

#[repr(u32)]
enum Slot {
    ..(1 << 31) => Occupied,
    (1 << 31).. => Free,
}

which allow you to label ranges of an integer's value as enum variants. You could allow u32 as Slot in cases where the ranges cover the enter range of possible values, but simultaneously allow matching on the enum (with some choice for the syntax of how you extract the underlying value). There is precedent of a sort for this with allowing integer ranges in pattern matching.

One potential consideration would be allowing an "offset" parameter that occurs when you extract the value, where writing (1 << 31).. => #[offset(-(1 << 31))] Free would effectively make it a 1-bit tag union of 31 bit integers. Unclear exactly how you would define the semantics for this though, so perhaps better avoided.

It seems reasonable to allow arbitrary patterns to be used:

#[repr(u32)]
enum Foo {
    0 => Zero,
    1..50 | 100..200 => Good,
    _ => Bad,
}

Allowing non-exhaustive patterns could also work to generalize the current unit enums. In that case though, you would not allow casting the base type into the enum and would need to write parsing code as you currently do. This would also allow current unit enums to just define a catch-all name for all bad values to allow casting to it.

Further, though dubiously useful, this could be a syntax for subdividing any base type into enums according to valid patterns, so you could also do

#repr(str)
enum Command {
    "push" => Push,
    "pull" => Pull,
    "add" | "plus" => Add,
    _ => Unknown,
}

pitaj May 4, 2025, 6:15am 2

There is an RFC for arbitrary bitwidth integers:

If that were to be implemented, I think it would make sense for the following enum to be the same size as a u32:

enum Foo {
    A(u<31>),
    B(u<31>),
}

2e71828 May 4, 2025, 6:52am 3

My current approach for this sort of problem is to define an interconversion between the ergonomic type and a packed representation that's used for storage/transport:

pub trait Pack {
    type Packed: Copy;
    fn pack(self)->Self::Packed;
    fn unpack(packed: Self::Packed)->Self;
}

#[derive(Copy,Clone)]
pub enum Slot { Occupied(u32), Free(u32) }

const PAYLOAD_MASK:u32 = 0x7fff_ffff;
const DISCRIMINANT_POS:u8 = 1 + PAYLOAD_MASK.ilog2() as u8;

impl Pack for Slot {
    type Packed = u32;
    fn pack(self)->u32 {
        match self {
            Slot::Occupied(val) => {
                assert_eq!(val & !PAYLOAD_MASK, 0);
                val
            }
            Slot::Free(idx) => {
                assert_eq!(idx & !PAYLOAD_MASK, 0);
                idx & (1 << DISCRIMINANT_POS)
            } 
        }
    }
    
    fn unpack(packed: u32)->Self {
        let payload = packed & PAYLOAD_MASK;
        match packed >> DISCRIMINANT_POS {
            0 => Slot::Occupied(payload),
            1 => Slot::Free(payload),
            _ => unreachable!()
        }
    }
}

yaksher May 4, 2025, 2:52pm 4

While I do still feel that the ability to subdivide types based on patterns would be neat, you're right that this would address the listed use case. I don't immediately have an example of something which would strongly justify having this beyond that, even if it feels "pretty" and nice to have to me.

yaksher May 4, 2025, 2:54pm 5

Yeah, this works and is probably the "best" way to do it in current Rust, but it's also extremely obnoxious to do.

It can't, because that lets you get &muts to the fields, and writing the 4-byte u<31> writes all 4 bytes, and thus would lose the extra bit that says what the discriminant is.

Doing this needs the oft-discussed "move-only fields" idea, something like

enum Foo {
    A(move u<31>),
    B(move u<31>),
}

that blocks usings references to the fields so that the compiler can inject code whenever the field is read or written to do the appropriate masking and setting.

Vorpal May 4, 2025, 9:44pm 7

Is there some central thread or RFC to read up on this and what the status of it is? It sounds really useful to me, I would very much like to be able to squeeze more bits into the limited L1 cache.

pitaj May 8, 2025, 2:33am 8

Good point. Pattern types could work though, as long as you have non-overlap:

enum Foo {
    A(i32 is 0..),
    B(i32 is ..0),
}

I've done a lot of digging around and can't find the original thread, only numerous references to it.

it's a very simple idea, (taking the address of the move field is always an error), so i think someone should just go ahead and write an RFC for it.

yaksher May 8, 2025, 3:48am 10

You could potentially even allow overlap and resolve them like a match statement (in fact, thinking about this, my suggestion behaves like a named and packaged match statement in general, which is what inspired the dummy syntax I provided).

Requiring disjoint patterns may be clearer (in particular, field ordering for both enums and structs is currently irrelevant (I think?) and changing that would be bad), but on the other hand, the ability to have a wildcard variant could be convenient.

Don't even need to wait for pattern types. Just needs someone to do all the work in the compiler to handle it -- you can express the same kind of thing with field-less enums already, but the layout algorithms (and, more importantly, the discriminant codegen) can't handle that yet.

The problem is that today discriminants are point-encoded except for at most one niched thing, whereas if let Foo::A(_) = and if let Foo::B(_) = would both need range checks, which the layout types cannot yet express, and thus which the codegen backends cannot handle.