Update/relax str/String utf8 safety docs by zachs18 · Pull Request #134598 · rust-lang/rust

Relax UTF-8 requirements of String as well as some str/String-related functions from "must be UTF-8 or UB" to "invalid UTF-8 is not immediate UB but is unsafe" to match the validity requirement for str.

cc @joshtriplett #71033 (comment)

I wonder if we might be able, in the future, to carefully exclude a few of str's functions from the "library UB" requirements.


Current state:

str currently documents that its UTF-8 requirement is not a language-level requirement (i.e. having a non-UTF-8 str is not immediately UB, but it can cause UB later because other code may assume that strs are valid UTF-8). However, std::str::from_utf8_unchecked currently still documents a stronger requirement.
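For context, here is a minimal sketch of a usage pattern that is sound under both the stricter and the relaxed wording: validate the bytes once with the checked constructor, then use the unchecked one. The helper name `checked_then_unchecked` is hypothetical and not part of this PR.

```rust
use std::str;

/// Hypothetical helper (not from the PR): validate once, then view the
/// same bytes as a `str` without a second check.
fn checked_then_unchecked(bytes: &[u8]) -> Option<&str> {
    // `from_utf8` performs the validation and returns `Err` on invalid input.
    if str::from_utf8(bytes).is_err() {
        return None;
    }
    // SAFETY: the bytes were validated as UTF-8 just above, so the
    // resulting `str` upholds the invariant that other code relies on.
    Some(unsafe { str::from_utf8_unchecked(bytes) })
}

fn main() {
    assert_eq!(checked_then_unchecked(b"hello"), Some("hello"));
    // 0xFF never appears in well-formed UTF-8, so this input is rejected.
    assert_eq!(checked_then_unchecked(&[0x66, 0x6F, 0xFF]), None);
}
```

The relaxed wording does not make it acceptable to skip the validation step; it only changes where the undefined behavior would arise if the invariant were broken.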

String does not currently have such docs, though the docs of some of its associated functions imply it (see footnote 1).


(first commit) This PR relaxes the UTF-8 requirements of std::str::from_utf8_unchecked and str::as_bytes_mut to match that of str itself. It also updates the safety comments and implementation of std::str::from_utf8_unchecked and std::str::from_utf8_unchecked_mut.
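As an illustration of why the distinction matters for str::as_bytes_mut, the sketch below lets the bytes pass through a momentarily invalid state and restores validity before the str is used again. The helper `overwrite_prefix` is hypothetical and only meant to show the pattern; it is not code from this PR.

```rust
/// Hypothetical helper (not from the PR): overwrite the first two bytes
/// of `s` with ASCII "ab". Returns `false` and leaves `s` untouched when
/// that cannot be done without leaving invalid UTF-8 behind.
fn overwrite_prefix(s: &mut str) -> bool {
    // Byte offset 2 must be a char boundary so that the tail stays valid.
    if s.len() < 2 || !s.is_char_boundary(2) {
        return false;
    }
    // SAFETY: both bytes are replaced with ASCII before the borrow ends,
    // and offset 2 is a char boundary, so the contents are valid UTF-8
    // again by the time `s` is next used as a `str`.
    unsafe {
        let bytes = s.as_bytes_mut();
        // If `s` began with a two-byte char such as 'é' (0xC3 0xA9), the
        // buffer is [0x61, 0xA9, ..] after this write: not valid UTF-8.
        bytes[0] = b'a';
        // After the second write the prefix is plain ASCII again.
        bytes[1] = b'b';
    }
    true
}

fn main() {
    let mut owned = String::from("été");
    assert!(overwrite_prefix(owned.as_mut_str()));
    assert_eq!(owned, "abté");
}
```

Nothing reads the buffer as a str while it is in the intermediate state, which is the property the safety comment relies on.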

(second commit) This PR also adds an "Invariant" section to String's docs to match str, which documents the following:

Rust libraries may assume that Strings are always valid UTF-8, just like strs.

Constructing a non-UTF-8 String is not immediate undefined behavior, but any function called on a String may assume that it is valid UTF-8, which means that a non-UTF-8 String can lead to undefined behavior down the road.
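To make the invariant concrete, here is a small sketch of how code can turn arbitrary bytes into a String without ever producing one that breaks it, using only the safe constructors. The helper name `bytes_to_string` and the choice to fall back to lossy conversion are assumptions made for illustration.

```rust
/// Hypothetical helper (not from the PR): convert arbitrary bytes into a
/// `String` that always upholds the UTF-8 invariant.
fn bytes_to_string(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        // Valid UTF-8: the buffer is reused as-is.
        Ok(s) => s,
        // Invalid UTF-8: the error still exposes the original bytes, so
        // we fall back to replacing bad sequences with U+FFFD.
        Err(e) => String::from_utf8_lossy(e.as_bytes()).into_owned(),
    }
}

fn main() {
    assert_eq!(bytes_to_string(b"ok".to_vec()), "ok");
    // The lone 0xFF byte is replaced by one U+FFFD replacement character.
    assert_eq!(bytes_to_string(vec![0x68, 0x69, 0xFF]), "hi\u{FFFD}");
}
```

Because every String built this way is valid UTF-8, any function later called on it may rely on the invariant without risk.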

(third commit) This PR also adds lists of "exceptions" to the "Invariant" sections of str and String: functions that are safe to call on a str/String containing invalid UTF-8. The specific list is not set in stone; I mostly chose functions that don't do any string manipulation. It could probably also be formatted better, maybe as a list of "categories" with the functions separated by commas rather than one per line. If this section is added, then removing a (stable) function from it would be a breaking change, and adding a new (stable) function to it would be a new stable API guarantee. Alternatively, this guarantee could be added as a short note in each function's docs (with a link to the "Invariant" section) to prevent the list from becoming inaccurate (e.g. "This function is safe to call on a str/String containing [invalid UTF-8](link to invariant section)").

As this relaxes stable API requirements, I think this needs a T-libs-api FCP?

@rustbot label T-libs-api

r? @joshtriplett (or other T-libs-api)

Footnotes

  1. String::from_utf8_unchecked's docs say "This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, it may cause memory unsafety issues with future users of the String, as the rest of the standard library assumes that Strings are valid UTF-8." Note that it mentions only future users, implying that calling String::from_utf8_unchecked with invalid UTF-8 alone is not immediate UB.