[idea] add some binary data methods

As we know, String (and the str type) has many string methods, such as find, split, etc., but they are not available for binary data (Vec<u8>, &[u8]). I hope to add some binary processing functions that would let me write code like this:

let s: Vec<u8> = ...;
let p = s.find(b"abc");

etc.
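For comparison, here is a minimal sketch of how this kind of search can be written on stable Rust today, using only `slice::windows` and `Iterator::position`. The helper name `find_bytes` is hypothetical and is not the API proposed here or in any ACP; treating an empty needle as a match at index 0 is an assumption that mirrors `str::find`.

```rust
/// Hypothetical helper (not a std API): returns the index of the first
/// occurrence of `needle` in `haystack`, or `None` if it does not occur.
fn find_bytes(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // Assumption: mirror `str::find`, where the empty pattern matches at 0.
    // This also avoids calling `windows(0)`, which panics.
    if needle.is_empty() {
        return Some(0);
    }
    // Slide a window of `needle.len()` bytes across the haystack and
    // report the position of the first window equal to the needle.
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

With this in scope, `find_bytes(&s, b"abc")` works for a `Vec<u8>` named `s`, since `&Vec<u8>` coerces to `&[u8]`. It is a naive O(n·m) scan, so it only illustrates the desired behaviour, not a tuned implementation.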

I agree. Could you make a Pre-RFC for this?

rne April 23, 2025, 1:35pm 3

Something like this maybe?

Maybe in the Vec impl itself?

toc April 23, 2025, 4:00pm 5

The problem with this proposal is that Vec/slice could easily become bloated if they included extra functions for every paradigm that involves a notion of sequentially stored things. Such functionality might be better grouped under an entirely different abstraction (still based on a Vec/slice under the covers), some kind of bstr. It might even be reasonable to pull that abstraction into std at some point.

epage April 23, 2025, 4:56pm 6

FYI, API additions like this go through an "API Change Proposal" (ACP) process, not an RFC.

And in fact there is an ACP for this that was accepted. You can see the tracking issue at Tracking Issue for `byte_search` · Issue #134149 · rust-lang/rust · GitHub

Similarly, a BStr and BString ACP was accepted.

kornel April 23, 2025, 9:04pm 7

I've needed this a few times for [u8]. I don't recall needing this for other item types.

There are classic fast substring search algorithms, so it seems to me like something that belongs to the standard library.

Algorithmically, it should be the same or simpler than UTF-8 search, so it shouldn't cause any code bloat, at least if it's limited to BStr or [u8] and not generic T: Eq.
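To illustrate why a byte-only search can stay simple, here is a sketch of one classic algorithm, Boyer-Moore-Horspool, specialised to `[u8]`. Because every element is a single byte, the skip table is a fixed 256-entry array and no UTF-8 boundary handling is needed. The function name `horspool_find` is hypothetical, and this is not claimed to be what std or the accepted ACP implements.

```rust
/// Boyer-Moore-Horspool search over raw bytes.
/// Hypothetical helper for illustration; not a std API.
fn horspool_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let m = needle.len();
    if m == 0 {
        // Assumption: the empty needle matches at index 0, as in `str::find`.
        return Some(0);
    }
    if m > haystack.len() {
        return None;
    }

    // shift[b] is how far the window may advance when its last byte is `b`.
    // Bytes that do not occur in the needle (excluding its final position)
    // allow a jump of the full needle length.
    let mut shift = [m; 256];
    for (i, &b) in needle[..m - 1].iter().enumerate() {
        shift[b as usize] = m - 1 - i;
    }

    let mut pos = 0;
    while pos + m <= haystack.len() {
        if &haystack[pos..pos + m] == needle {
            return Some(pos);
        }
        // Advance according to the byte aligned with the needle's last position.
        pos += shift[haystack[pos + m - 1] as usize];
    }
    None
}
```

The table is indexed directly by byte value, which is the part that would not carry over to a generic `T: Eq` version without a hash map or a fallback to naive comparison.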

I've used substring search for parsing binary data that isn't supposed to be a string at all. AFAIK BStr is meant to be "binary clean" and will not require the data to be text-like, but the proposed text makes it sound very string-like. Hopefully the docs can be clarified that it's just a byte slice, and doesn't have to be text.

Vorpal April 23, 2025, 9:22pm 8

While it would be good to clarify this, I don't see how you could end up with any other interpretation when working through the consequences: if bstr doesn't have to be UTF-8 and has no other limitations imposed (such as C's NUL termination), it has to accept arbitrary data. So it has to be binary clean.

Could that be made more obvious so you wouldn't have to work through the logic? Yes.

However, it is theoretically possible that it could use algorithms optimised for "mostly UTF-8", but I don't know of any (except for printing, where you could plausibly optimise for "mostly don't have to escape the data").

Maybe we can create a new lib trait for this, like CollectionMut?

Then it will conflict with a lot of APIs in the standard library.
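For reference, the trait idea can already be tried outside std as an extension trait. The sketch below is only an illustration: the trait name `ByteSearchExt` and the methods `find_bytes` and `contains_bytes` are hypothetical, chosen to be distinct from existing slice and str method names so that a collision is less likely.

```rust
/// Hypothetical extension trait; the name and methods are illustrative only.
trait ByteSearchExt {
    /// Index of the first occurrence of `needle`, if any.
    fn find_bytes(&self, needle: &[u8]) -> Option<usize>;

    /// Whether `needle` occurs anywhere in `self`.
    fn contains_bytes(&self, needle: &[u8]) -> bool {
        self.find_bytes(needle).is_some()
    }
}

// Implementing for `[u8]` also makes the methods callable on `Vec<u8>`
// and on byte arrays through the usual auto-deref and unsizing rules.
impl ByteSearchExt for [u8] {
    fn find_bytes(&self, needle: &[u8]) -> Option<usize> {
        // Assumption: the empty needle matches at index 0, as in `str::find`.
        if needle.is_empty() {
            return Some(0);
        }
        // Naive scan; `windows` panics on size 0, hence the check above.
        self.windows(needle.len()).position(|w| w == needle)
    }
}
```

One thing to keep in mind with this approach: inherent methods take priority over trait methods during method resolution, so if std later gained an inherent method with the same name on slices, calls would silently resolve to the std version instead of the trait's.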