Add an "ascii character" type to reduce `unsafe` needs in conversions
Proposal
Problem statement
For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on `unsafe` blocks because the safe equivalents have material performance impact.

We should offer a nice way that people can write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without needing spurious UTF-8 validation checks.
After all, to quote `std::ascii`:

> However, at times it makes more sense to only consider the ASCII character set for a specific operation.
Motivation, use-cases
I was reminded about this by this comment in rust-lang/rust#105076:
```rust
pub fn as_str(&self) -> &str {
    // SAFETY: self.data[self.alive] is all ASCII characters.
    unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
}
```
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems where the output is fundamentally only ASCII. But the code is doing a `String::from_utf8(outdata).unwrap()` at the end because the other options aren't great.
One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a `Vec<u8>` generates substantially less complicated code than pushing one to a `String` (https://rust.godbolt.org/z/xMYxj5WYr), since a `0..=255` USV might still take 2 bytes in UTF-8. That problem could be worked around with a `BString` instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a `&str` or `String`.
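To make the status-quo shape concrete, here's a minimal sketch of that pattern (the `encode` name and the per-byte digit computation are toy stand-ins, not real base85): every byte pushed is ASCII by construction, yet the final conversion still pays for a full UTF-8 validation or reaches for `unsafe`.

```rust
// Toy encoder sketch: every pushed byte is ASCII by construction,
// but String::from_utf8 still has to re-check the whole buffer.
fn encode(input: &[u8]) -> String {
    let mut out = Vec::with_capacity(input.len());
    for &b in input {
        // stand-in for a real base85 digit computation; b'!' + (b % 85) is at most 117
        out.push(b'!' + (b % 85));
    }
    String::from_utf8(out).unwrap() // spurious check (or an unsafe from_utf8_unchecked)
}
```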
There should be a `core` type for an individual ASCII character so that, having proven to LLVM at the individual-character level that things are in range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.
[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like `.filter_map(AsciiChar::new)`.
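For example, a hypothetical sketch against the proposed type and its `new` constructor (none of which exists yet) could look like:

```rust
// Hypothetical: keep just the ASCII characters of a string,
// using the proposed core::ascii::Char and its `new(char) -> Option<Self>`.
fn ascii_only(s: &str) -> Vec<ascii::Char> {
    s.chars().filter_map(ascii::Char::new).collect()
}
```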
[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 pointed out that on embedded targets the "is ASCII" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size. cc @kupiakos
Solution sketches
In `core::ascii`,
```rust
/// One of the 128 Unicode characters from U+0000 through U+007F, often known as
/// the ASCII subset.
///
/// AKA the character codes from ANSI X3.4-1977, ISO 646-1973,
/// or NIST FIPS 1-2.
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127); // sketch syntax: a u8 restricted to 0..=127, not valid Rust today

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
```
In `alloc::string`:
```rust
impl From<Vec<ascii::Char>> for String { … }
```
^ this `From` being the main idea of the whole thing
Safe code can `Char::new(…).unwrap()` since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq), or they can do it in `const`s, then use the non-reallocating infallible `From`s later if they need `String`s or `&str`s.
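Putting that together, the earlier toy encoder could be written without `unsafe` and without a trailing validation pass; a hypothetical sketch against the proposed API:

```rust
// Hypothetical: the same toy encoder, but ASCII-ness is proven per byte,
// so the final String conversion is infallible and non-reallocating.
fn encode(input: &[u8]) -> String {
    let mut out: Vec<ascii::Char> = Vec::with_capacity(input.len());
    for &b in input {
        // b'!' + (b % 85) is at most 117, so from_u8 always succeeds and
        // LLVM can optimize the unwrap away when it can see the value is in range.
        out.push(ascii::Char::from_u8(b'!' + (b % 85)).unwrap());
    }
    String::from(out) // the proposed From<Vec<ascii::Char>> for String
}
```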
Other possibilities
I wouldn't put any of these in an initial PR, but as related things:
- This could be a 128-variant `enum` with `repr(u8)`. That would allow `as`-casting it, for better or worse.
- There could be associated constants (or variants) named `ACK` and `DEL` and such.
- Lots of `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof
  - And potentially `AsRef<[u8]>`s too
- Additional methods like `String::push_ascii`
- More implementations like `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- Checked conversions (using the well-known ASCII fast paths) from `&str` (or `&[u8]`) back to `&[ascii::Char]`
- The base85 example would really like a `[u8; N] -> [ascii::Char; N]` operation it can use in a `const` so it can have something like ``const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();``. Long-term that's called `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though (see the sketch after this list).
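As a rough sketch of that "open-code it" workaround, here is what building such a table in a `const` could look like today. The `AsciiCharDemo` newtype and `to_ascii_table` helper below are stand-ins for the proposed `ascii::Char` and its `from_u8`, just to keep the example self-contained and compilable on current Rust:

```rust
// Stand-in for the proposed core::ascii::Char, so the sketch compiles today.
#[derive(Copy, Clone)]
#[repr(transparent)]
struct AsciiCharDemo(u8);

impl AsciiCharDemo {
    const fn from_u8(x: u8) -> Option<Self> {
        if x <= 127 { Some(Self(x)) } else { None }
    }
}

// Open-coded const conversion: what `array::map` + unwrap would express once
// const closures work. Fails at compile time if any byte is non-ASCII.
const fn to_ascii_table<const N: usize>(bytes: &[u8; N]) -> [AsciiCharDemo; N] {
    let mut out = [AsciiCharDemo(0); N];
    let mut i = 0;
    while i < N {
        out[i] = match AsciiCharDemo::from_u8(bytes[i]) {
            Some(c) => c,
            None => panic!("non-ASCII byte in table"),
        };
        i += 1;
    }
    out
}

const BYTE_TO_CHAR85: [AsciiCharDemo; 85] = to_ascii_table(
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~",
);
```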
And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.