ACP: A substring API for OsStr · Issue #306 · rust-lang/libs-team (original) (raw)

Proposal

Problem statement

OsStr and OsString provide access to their (unspecified) internal byte encoding using {as,into}_encoded_bytes() and from_encoded_bytes_unchecked() methods. It's possible to convert an OS string to bytes, and to convert bytes to an OS string.

However, from_encoded_bytes_unchecked() is unsafe, and there is no universal way to validate the safety invariants. Some common string operations (splitting, trimming, replacing) are impossible to implement without unsafe code.

New APIs should ideally discourage relying on any details of the internal encoding (which may be unstable and not meant for interchange).

Motivating examples or use cases

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

lexopt (my own crate) currently uses the platform-specific APIs, but I'd like to move to the encoded_bytes API eventually. unsafe is holding me back since I have working code already and I think some of my users would consider it a regression.

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

I propose a method to take a substring of an OsStr, based on offsets into the result of as_encoded_bytes(). On Unix any slice (within bounds) would be valid. On Windows this method would panic if the string is not cut on UTF-8 boundaries, exactly like the requirements of from_encoded_bytes_unchecked:

Due to the encoding being self-synchronizing, the bytes from OsStr::as_encoded_bytes can be split either immediately before or immediately after any valid non-empty UTF-8 substring.

Note that this is stricter than the actual internal encoding of OS strings on Windows.

Proposed signature:

impl OsStr { /// Takes a substring based on a range that corresponds to the return value of /// [OsStr::as_encoded_bytes]. /// /// The range's start and end must lie on valid OsStr boundaries. /// /// On Unix any boundaries are valid, as OS strings may contain arbitrary bytes. /// /// On other platforms such as Windows the internal encoding is currently /// unspecified, and a valid OsStr boundary is one of: /// - The start of the string /// - The end of the string /// - Immediately before a valid non-empty UTF-8 substring /// - Immediately after a valid non-empty UTF-8 substring /// /// # Panics /// /// Panics if range does not lie on valid OsStr boundaries or if it /// exceeds the end of the string. /// /// # Example /// /// /// use std::ffi::OsStr; /// /// let os_str = OsStr::new("foo=bar"); /// let bytes = os_str.as_encoded_bytes(); /// if let Some(index) = bytes.iter().position(|b| *b == b'=') { /// let key = os_str.slice_encoded_bytes(..index); /// let value = os_str.slice_encoded_bytes(index + 1..); /// assert_eq!(key, "foo"); /// assert_eq!(value, "bar"); /// } /// fn slice_encoded_bytes<R: RangeBounds>(&self, range: R) -> &Self; }

Examples

A proof of concept is implemented here: os_str_slice.rs

With an example port of lexopt to this API: blyxxyz/lexopt@8077851

It should be trivial to port clap's transmuting operations to this API, since they all take substrings of an OsStr.

A string replace function can be implemented using OsStr::slice_encoded_bytes() and OsString::push():

fn replace(source: &OsStr, from: &str, to: &OsStr) -> OsString { assert!(!from.is_empty());

let from = from.as_bytes();
let mut result = OsString::new();
let bytes = source.as_encoded_bytes();
let mut prev = 0;
let mut index = 0;
while index < bytes.len() {
    if bytes[index..].starts_with(from) {
        result.push(source.slice_encoded_bytes(prev..index));
        result.push(to);
        index += from.len();
        prev = index;
    } else {
        index += 1;
    }
}
result.push(source.slice_encoded_bytes(prev..));
result

}

Notice that it does require the needle to be non-empty UTF-8.

Guarding the internal encoding

This solution has two attractive properties:

Any API that converts from bytes to an OS string would not have these advantages.

Behavior on niche platforms

The current behavior of OS strings is only documented prominently for Unix and for Windows.

All other platforms reuse the OS string internals of either of these platforms. Almost all of the Unix-alikes expose OsStrExt/OsStringExt, which means they specify the internal encoding to be arbitrary bytes.

Only wasm uses Unix OS strings without the extension traits. I couldn't find the history of this, but it's probably intentional. JavaScript uses potentially ill-formed UTF-16, like Windows, but WebAssembly can be used in other environments as well, so there is no obvious single set of semantics. (There is some loosely related discussion at rustwasm/wasm-bindgen#1348.)

In order to keep the slicing invariants encapsulated it might be necessary to create a third OS string implementation to be used by wasm. Since this platform can currently only legally construct OS strings from UTF-8 strings the implementation could be backed by str/String, with the side benefit of free conversion back into UTF-8 strings (currently UTF-8 validation is performed). The implementation of Unix OS strings is simple, and this implementation would be even simpler.

Alternatives

The method can be implemented using existing APIs, see the proof of concept above. But the fact that it's a minimal safe API that can be used to implement higher-level opinionated operations makes it a natural fit for the standard library.

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution: