Add starts_with and ends_with to OsStr by Stebalien · Pull Request #26499 · rust-lang/rust (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation18 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

(also inline all non-generic public APIs)

Note: this commit doesn't use the Pattern API because it's &str specific.
Also Note: this commit doesn't introduce OsStr::contains because I'm lazy.

I'd be happy to write an RFC if you want to explore other possibilities like
expanding the Pattern API (although I really don't think we should).

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

So this code seems good to me, but i'd prefer to have someone from @rust-lang/libs land it. I'm going to switch to r? @alexcrichton

Can this hold off on adding #[inline] for now? Right now it tends to incur a little too much compile-time overhead for not enough runtime win. For example I've never seen OsStr manipulation at the top of any profile.

Other than that I think that this is fine, the only question being about how we want to pursue these methods into the future. Do we want to duplicate the API surface area of strings onto OS strings immediately, or incrementally? What we have today is kinda the "bare minimum" to get by with the plan to "expand if necessary". I'm somewhat against adding apis incrementally over time as it's bound to just be surprising that OsStr doesn't implement a method or two, so it may be worth taking time to plan out what the final API of OS strings might look like in the long run.

That being said I do think it's fine for these to enter in an unstable fashion, for now. If it turns out that everyone really wants these two methods then this is as good a way as any to test the waters.

bluss

/// Returns true if the `other` is a prefix of the `OsStr`.
#[unstable(feature = "os_str_compare", reason = "recently added")]
pub fn starts_with<S: AsRef>(&self, other: S) -> bool {
self.bytes().starts_with(other.as_ref().bytes())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for plain bytes and UTF-8, but what about other encodings? If OsStr is WTF-8, does that allow false positives (say for example that a single byte encoded sequence for a codepoint exists that's a prefix of a multibyte sequence. UTF-8 doesn't have this).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, ok that was simple.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, WTF-8 preserves the nice properties of UTF-8 like self-synchronization.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for people from the future: No this is not fine because if self contains a non-BMP code point (4-byte sequence), and other contains just a high-surrogate (3-byte sequence), this implementation will always return false, contradicting the result when given two WTF-16 sequences.

@alexcrichton I started with starts_with/ends_with because I wanted to know if a filename started with '.' and realized that the only reliable way to do this was to use to_os_string_lossy().starts_with('.').

Without distinguishing between utf8 and wtf8, I can implement the following:

starts_with/ends_with (done)
contains (done)
is_empty/len (done)
lines/lines_any (in-progress)

Everything else depends on whether we're dealing with WTF-8 or UTF-8. I can write up an RFC if you want.

@Stebalien Since you mention UTF-8 and WTF-8, on linux there is no encoding for paths in general, they are just arbitrary byte strings.

@bluss good point. That rules out methods like split replace, and trim (methods that look at individual characters) because they will behave differently on unix/windows. For example, "OsStr::new("Ünіcôdé").split(OsStr::new("")) act like bytes() on unix and chars() on windows.

find and the slicing methods should still be doable however.

I'm going to close this for now and write an RFC.

@Stebalien we actually discussed this today in some libs triage, and we ended up reaching the same conclusion! After a quick overview of the string/osstr apis, we concluded that the "end goal" for what OsStr exposes may end up just being the Pattern methods on strings. Most other methods didn't seem to apply, and the pattern ones all generally seemed to fit nicely (including starts_with and ends_with).

There may be a possibility of generalizing the Pattern trait to work for OS strings as well (cc @Kimundi, the author of the trait), but we can also start out conservatively and assume that a "pattern" is "something that can be an OsStr" to start off with.

Hm, right now I see two possible ways to generalize the pattern API to Osstrings:

Make the slice type a generic paramter of the trait, and add seperate impls for Pattern<'a, str> and Pattern<'a OsStr> (and maybe Pattern<'a, [T]> as well). That should be relatively easy to do, but makes the API of the actual methods more complicated to read (where P: Pattern<'a> -> where P: Pattern<'a, str>). That said, it doesn't look that terrible either. Would be nice if we could do type Pattern<'a> = GeneralPattern<'a, str> I guess.
Change the Pattern API to work with any kind somehow searchable slice, and have wrapper around that that exposes it for str, OsStr and maybe plain [T]. Might reduce code duplication somewhat, but might also just be a more complicated variations of option 1. Also, its unclear if we could have specialized search algorithms as easily with this.

The current algorithm for &str in &str search supports any "ordered alphabet" so I assume it can be adapted to any [T] where T: Ord or at least any integer types. I think we want to make sure we can still specialize the [u8] case since we want to have all possibilities open to further optimizations (memchr, memcmp and even pcmpestri).

Yeah I'm not sure if it's actually possible to use the same Pattern trait (or if we even want to), but it's certainly a possibility to explore!

I want to call attention to a point that @Kimundi made in passing: a generalization of the Pattern API should, ideally, apply to [T] as well. (That would in particular bring Vec<u8> largely up to parity with String as a first-class string type.)

Labels

I-needs-decision

Issue: In need of a decision.

Relevant to the library API team, which will review and decide on the PR/issue.