Add an "ascii character" type to reduce `unsafe` needs in conversions
Proposal
Problem statement
For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on `unsafe` blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without needing spurious UTF-8 validation checks.
After all, to quote `std::ascii`:

> However, at times it makes more sense to only consider the ASCII character set for a specific operation.
Motivation, use-cases
I was reminded about this by this comment in rust-lang/rust#105076:
```rust
pub fn as_str(&self) -> &str {
    // SAFETY: self.data[self.alive] is all ASCII characters.
    unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
}
```
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems where the output is fundamentally only ASCII. But the code is doing a `String::from_utf8(outdata).unwrap()` at the end because the other options aren't great.
One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a `Vec<u8>` generates substantially less complicated code than pushing one to a `String` (https://rust.godbolt.org/z/xMYxj5WYr), since a `0..=255` USV might still take 2 bytes in UTF-8. That problem could be worked around with a `BString` instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a `&str` or `String`.
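To make the status-quo shape concrete, here's a minimal sketch of that pattern (the `encode` name and the per-byte digit computation are toy stand-ins, not real base85): every byte pushed is ASCII by construction, yet the final conversion still pays for a full UTF-8 validation or reaches for `unsafe`.

```rust
// Toy encoder sketch: every pushed byte is ASCII by construction,
// but String::from_utf8 still has to re-check the whole buffer.
fn encode(input: &[u8]) -> String {
    let mut out = Vec::with_capacity(input.len());
    for &b in input {
        // stand-in for a real base85 digit computation; b'!' + (b % 85) is at most 117
        out.push(b'!' + (b % 85));
    }
    String::from_utf8(out).unwrap() // spurious check (or an unsafe from_utf8_unchecked)
}
```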
There should be a `core` type for an individual ASCII character so that, having proven to LLVM at the individual-character level that things are in range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.
[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like `.filter_map(AsciiChar::new)`.
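For example, a hypothetical sketch against the proposed type and its `new` constructor (none of which exists yet) could look like:

```rust
// Hypothetical: keep just the ASCII characters of a string,
// using the proposed core::ascii::Char and its `new(char) -> Option<Self>`.
fn ascii_only(s: &str) -> Vec<ascii::Char> {
    s.chars().filter_map(ascii::Char::new).collect()
}
```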
[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 pointed out that on embedded targets the "is ASCII" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size. cc @kupiakos
Solution sketches
In `core::ascii`,
```rust
/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the ASCII subset.
///
/// AKA the character codes from ANSI X3.4-1977, ISO 646-1973,
/// or NIST FIPS 1-2.
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127); // sketch syntax: a u8 restricted to 0..=127, not valid Rust today

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
```
In `alloc::string`:
```rust
impl From<Vec<ascii::Char>> for String { … }
```
^ this `From` being the main idea of the whole thing
Safe code can `Char::new(…).unwrap()` since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq), or they can do it in `const`s, then use the non-reallocating infallible `From`s later if they need `String`s or `&str`s.
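Putting that together, the earlier toy encoder could be written without `unsafe` and without a trailing validation pass; a hypothetical sketch against the proposed API:

```rust
// Hypothetical: the same toy encoder, but ASCII-ness is proven per byte,
// so the final String conversion is infallible and non-reallocating.
fn encode(input: &[u8]) -> String {
    let mut out: Vec<ascii::Char> = Vec::with_capacity(input.len());
    for &b in input {
        // b'!' + (b % 85) is at most 117, so from_u8 always succeeds and
        // LLVM can optimize the unwrap away when it can see the value is in range.
        out.push(ascii::Char::from_u8(b'!' + (b % 85)).unwrap());
    }
    String::from(out) // the proposed From<Vec<ascii::Char>> for String
}
```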
Other possibilities
I wouldn't put any of these in an initial PR, but as related things:
- This could be a 128-variant `enum` with `repr(u8)`. That would allow `as`-casting it, for better or worse.
- There could be associated constants (or variants) named `ACK` and `DEL` and such.
- Lots of `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof
  - And potentially `AsRef<[u8]>`s too
- Additional methods like `String::push_ascii`
- More implementations like `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- Checked conversions (using the well-known ASCII fast paths) from `&str` (or `&[u8]`) back to `&[ascii::Char]`
- The base85 example would really like a `[u8; N] -> [ascii::Char; N]` operation it can use in a `const` so it can have something like ``const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();``. Long-term that's called `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though (see the sketch after this list).
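As a rough sketch of that "open-code it" workaround, here is what building such a table in a `const` could look like today. The `AsciiCharDemo` newtype and `to_ascii_table` helper below are stand-ins for the proposed `ascii::Char` and its `from_u8`, just to keep the example self-contained and compilable on current Rust:

```rust
// Stand-in for the proposed core::ascii::Char, so the sketch compiles today.
#[derive(Copy, Clone)]
#[repr(transparent)]
struct AsciiCharDemo(u8);

impl AsciiCharDemo {
    const fn from_u8(x: u8) -> Option<Self> {
        if x <= 127 { Some(Self(x)) } else { None }
    }
}

// Open-coded const conversion: what `array::map` + unwrap would express once
// const closures work. Fails at compile time if any byte is non-ASCII.
const fn to_ascii_table<const N: usize>(bytes: &[u8; N]) -> [AsciiCharDemo; N] {
    let mut out = [AsciiCharDemo(0); N];
    let mut i = 0;
    while i < N {
        out[i] = match AsciiCharDemo::from_u8(bytes[i]) {
            Some(c) => c,
            None => panic!("non-ASCII byte in table"),
        };
        i += 1;
    }
    out
}

const BYTE_TO_CHAR85: [AsciiCharDemo; 85] = to_ascii_table(
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~",
);
```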
And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.