Add an "ascii character" type to reduce `unsafe` needs in conversions · Issue #179 · rust-lang/libs-team
Proposal
Problem statement
For individual bytes, the backend can plausibly be expected to optimize things when it knows that a `char` came from an ASCII-ranged value. However, for compound values, there's no realistic way that codegen backends can track enough information to optimize out UTF-8 validity checks. That leads to lots of "look, it's ASCII" comments on `unsafe` blocks, because the safe equivalents have material performance impact.

We should offer a nice way for people to write such "the problem fundamentally only produces ASCII `String`s" code without needing `unsafe` and without spurious UTF-8 validation checks.
After all, to quote `std::ascii`:

> However, at times it makes more sense to only consider the ASCII character set for a specific operation.
Motivation, use-cases
I was reminded about this by this comment in rust-lang/rust#105076:
```rust
pub fn as_str(&self) -> &str {
    // SAFETY: self.data[self.alive] is all ASCII characters.
    unsafe { crate::str::from_utf8_unchecked(self.as_bytes()) }
}
```
But I've been thinking about this since this Reddit thread: https://www.reddit.com/r/rust/comments/yaft60/zerocost_iterator_abstractionsnot_so_zerocost/. "base85" encoding is an exemplar of problems that fundamentally only produce ASCII. But the code is doing a `String::from_utf8(outdata).unwrap()` at the end because the other options aren't great.

One might say "oh, just build a `String` as you go" instead, but that doesn't work as well as you'd hope. Pushing a byte onto a `Vec<u8>` generates substantially less complicated code than pushing one onto a `String` (https://rust.godbolt.org/z/xMYxj5WYr), since a `0..=255` USV might still take 2 bytes in UTF-8. That problem could be worked around with a `BString` instead, going outside the standard library, but that's not a fix for the whole thing because then there's still a check needed later to get back to a `&str` or `String`.
There should be a `core` type for an individual ASCII character so that, having proven to LLVM at the individual-character level that things are in-range (which it can optimize well, and does in other similar existing cases today), the library can offer safe O(1) conversions taking advantage of that type-level information.

[Edit 2023-02-16] This conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/.E2.9C.94.20Iterate.20ASCII-only.20.26str/near/328343781 made me think about this too -- having a type gives clear ways to get "just the ascii characters from a string" using something like `.filter_map(AsciiChar::new)`.
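That filter pattern can be sketched today with a stand-in newtype; `AsciiChar`, its `new`, and `ascii_only` below are all hypothetical names for illustration, not the proposed API surface:

```rust
/// Hypothetical stand-in for the proposed `ascii::Char`:
/// a newtype whose invariant is "the byte is in 0..=127".
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(transparent)]
pub struct AsciiChar(u8);

impl AsciiChar {
    /// Checked constructor from a `char`, as in the solution sketch.
    pub const fn new(c: char) -> Option<Self> {
        if (c as u32) < 128 { Some(AsciiChar(c as u8)) } else { None }
    }

    pub const fn to_u8(self) -> u8 {
        self.0
    }
}

/// Keep only the ASCII characters of a string, with the ASCII-ness
/// now carried in the element type rather than in a comment.
pub fn ascii_only(s: &str) -> Vec<AsciiChar> {
    s.chars().filter_map(AsciiChar::new).collect()
}
```

The point is that the invariant lives in the type, so later conversions back to `&str` or `String` can be safe and validation-free.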
[Edit 2023-04-27] Another conversation on zulip https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/core.3A.3Astr.3A.3Afrom_ascii/near/353452589 about how on embedded the "is ascii" check is much simpler than the "is UTF-8" check, and being able to use that where appropriate can save a bunch of binary size on embedded. cc @kupiakos
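To illustrate why the embedded case cares: the ASCII check is one comparison per byte, while UTF-8 validation is a multi-state decoder. A minimal sketch of what a checked `&[u8]` → `&str` fast path could look like (the function name here is an assumption, not a committed API):

```rust
/// Hypothetical sketch: convert bytes to `&str` using only the cheap
/// per-byte ASCII range check instead of full UTF-8 validation.
pub fn str_from_ascii(bytes: &[u8]) -> Option<&str> {
    // One comparison per byte -- far less code than a UTF-8 decoder.
    if bytes.iter().all(|&b| b < 0x80) {
        // SAFETY: every byte is < 0x80, so each is a valid
        // one-byte UTF-8 sequence.
        Some(unsafe { core::str::from_utf8_unchecked(bytes) })
    } else {
        None
    }
}
```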
Solution sketches
In `core::ascii`:
```rust
/// One of the 128 Unicode characters from U+0000 through U+007F,
/// often known as the ASCII subset.
///
/// AKA the character codes from ANSI X3.4-1977, ISO 646-1973,
/// or NIST FIPS 1-2.
///
/// # Layout
///
/// This type is guaranteed to have a size and alignment of 1 byte.
#[derive(Copy, Clone, Eq, PartialEq, Ord, PartialOrd, Hash)]
#[repr(transparent)]
struct Char(u8 is 0..=127);

impl Debug for Char { … }
impl Display for Char { … }

impl Char {
    const fn new(c: char) -> Option<Self> { … }
    const fn from_u8(x: u8) -> Option<Self> { … }
    const unsafe fn from_u8_unchecked(x: u8) -> Self { … }
}

impl From<Char> for char { … }
impl From<&[Char]> for &str { … }
```
In `alloc::string`:

```rust
impl From<Vec<ascii::Char>> for String { … }
```

^ this `From` being the main idea of the whole thing.
Safe code can `Char::new(…).unwrap()`, since LLVM easily optimizes that for known values (https://rust.godbolt.org/z/haabhb6aq), or do it in `const`s, then use the non-reallocating infallible `From`s later if they need `String`s or `&str`s.
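A sketch of how that infallible, non-reallocating `Vec` → `String` conversion could work, assuming a `repr(transparent)` wrapper; `AsciiChar` and `string_from_ascii` are hypothetical names, not the proposed impl itself:

```rust
use core::mem::ManuallyDrop;

/// Hypothetical stand-in for the proposed `ascii::Char`.
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(transparent)]
pub struct AsciiChar(u8);

impl AsciiChar {
    pub const fn from_u8(x: u8) -> Option<Self> {
        if x < 128 { Some(AsciiChar(x)) } else { None }
    }
}

/// Reinterpret the Vec's buffer in place: no copy, no validation pass.
pub fn string_from_ascii(v: Vec<AsciiChar>) -> String {
    let mut v = ManuallyDrop::new(v);
    let (ptr, len, cap) = (v.as_mut_ptr().cast::<u8>(), v.len(), v.capacity());
    // SAFETY: `AsciiChar` is `repr(transparent)` over `u8` and every value
    // is < 128, so the buffer has identical layout and is valid UTF-8.
    unsafe { String::from_raw_parts(ptr, len, cap) }
}
```

This is exactly the work the `unsafe` block does today in user code; putting it behind a type means the proof obligation is discharged once, in the library.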
Other possibilities
I wouldn't put any of these in an initial PR, but as related things:

- This could be a 128-variant enum with `repr(u8)`. That would allow `as` casting it, for better or worse.
- There could be associated constants (or variants) named `ACK` and `DEL` and such.
- Lots of `AsRef<str>`s are possible, like on `ascii::Char` itself or arrays/vectors thereof
  - And potentially `AsRef<[u8]>`s too
- Additional methods like `String::push_ascii`
- More implementations like `String: Extend<ascii::Char>` or `String: FromIterator<ascii::Char>`
- Checked conversions (using the well-known ASCII fast paths) from `&str` (or `&[u8]`) back to `&[ascii::Char]`
- The base85 example would really like a `[u8; N] -> [ascii::Char; N]` operation it can use in a `const` so it can have something like ``const BYTE_TO_CHAR85: [ascii::Char; 85] = something(b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~").unwrap();``. Long-term that's called `array::map` and unwrapping each value, but without const closures that doesn't work yet -- for now it could open-code it, though.
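The open-coded version is straightforward today with a `while` loop in a `const fn`; again, `AsciiChar` and `to_ascii_table` are hypothetical stand-ins, not the proposed names:

```rust
/// Hypothetical stand-in for the proposed `ascii::Char`.
#[derive(Copy, Clone, Debug, PartialEq)]
#[repr(transparent)]
pub struct AsciiChar(u8);

impl AsciiChar {
    pub const fn to_u8(self) -> u8 {
        self.0
    }
}

/// Convert a byte array to an ASCII-char array, checked at compile
/// time: a non-ASCII byte fails the build rather than a runtime unwrap.
pub const fn to_ascii_table<const N: usize>(bytes: &[u8; N]) -> [AsciiChar; N] {
    let mut out = [AsciiChar(0); N];
    let mut i = 0;
    while i < N {
        assert!(bytes[i] < 128, "input must be ASCII");
        out[i] = AsciiChar(bytes[i]);
        i += 1;
    }
    out
}

pub const BYTE_TO_CHAR85: [AsciiChar; 85] = to_ascii_table(
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~",
);
```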
And of course there's the endless bikeshed on what to call the type in the first place. Would it be worth making it something like `ascii::AsciiChar`, despite the stuttering name, to avoid `Char` vs `char` confusion?
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.