Provide {to,from}_{ne,le,be}_bytes functions on integers by tbu- · Pull Request #51919 · rust-lang/rust


Conversation (49) · Commits (2) · Checks (0) · Files changed

Conversation


tbu-

If one doesn't view integers as containers of bytes, converting them to
bytes necessarily requires specifying an encoding.

I think Rust is a language that wants to be explicit. The to_bytes
function is basically the opposite of that – it converts an integer into
the native byte representation, but there's no mention (in the function
name) of it being very much platform dependent. Therefore, I think it
would be better to replace that method with three: the explicit
to_ne_bytes ("native endian"), which does the same thing, and
to_{le,be}_bytes, which return the little- and big-endian encodings,
respectively.
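For reference, the proposed methods can be exercised like this (they later stabilized under these names, with to_ne_bytes as the native-endian variant); a minimal sketch:

```rust
fn main() {
    let n: u32 = 0x12_34_56_78;

    // Explicit encodings: the same value, three byte orders.
    assert_eq!(n.to_be_bytes(), [0x12, 0x34, 0x56, 0x78]);
    assert_eq!(n.to_le_bytes(), [0x78, 0x56, 0x34, 0x12]);

    // Native endianness matches one of the two, depending on the platform.
    if cfg!(target_endian = "little") {
        assert_eq!(n.to_ne_bytes(), n.to_le_bytes());
    } else {
        assert_eq!(n.to_ne_bytes(), n.to_be_bytes());
    }

    // Round trip through the explicit decoders.
    assert_eq!(u32::from_be_bytes(n.to_be_bytes()), n);
    assert_eq!(u32::from_le_bytes(n.to_le_bytes()), n);
}
```

Each call site now names its byte order, so the choice of encoding is visible in the code.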


@rust-highfive

r? @bluss

(rust_highfive has picked a reviewer for you, use r? to override)

@tbu-

I think these methods also make more sense than the to_be and to_le functions from a type-system point of view: it doesn't make sense to call to_be on an integer repeatedly, but you can, just as it doesn't make sense to call encode() on a string repeatedly in Python 2, but you can do so there (it raises an error for non-ASCII input, but still).
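The repeated-call point can be demonstrated directly; a small sketch (the double swap compiles fine and silently cancels out):

```rust
fn main() {
    let n: u32 = 0x12_34_56_78;

    // `to_be` returns the same type, so nothing stops you from calling it
    // twice; the two byte swaps cancel out without any warning.
    let twice = n.to_be().to_be();
    assert_eq!(twice, n);

    // A single call may or may not change the value, depending on the
    // target's endianness — exactly the silent platform dependence at issue.
    let once = n.to_be();
    let _ = once;
}
```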

@tbu-

@kennytm added the T-libs-api label (relevant to the library API team, which will review and decide on the PR/issue) on Jun 30, 2018

kennytm

///
/// [`to_native_bytes`]: #method.to_native_bytes
#[rustc_deprecated(since = "1.29.0", reason = "method doesn't communicate
intent, use `to_native_bytes`, `to_le_bytes` or `to_be_bytes` instead")]
#[stable(feature = "int_to_from_bytes", since = "1.29.0")]


If this PR is accepted, we should rather keep to_bytes()/from_bytes() unstable or just remove them, instead of stabilizing and immediately deprecating them.

@alexreg

Very fair point! Three different method classes makes sense.

@felix91gr

Hi! I don't know if I'm supposed to add my 2 cents here, but since I can, I will.

I've been using the to_bytes kind of transformation without regard for endianness, because of what my program is doing:

I'm using this to make a generalized N-D Perlin (and in the future, Simplex) noise library. Since I'm using the hash's bytes as the seed for the pseudorandom generator, I didn't care about endianness when I implemented it.

But now that I think of it, specifying the endianness would allow me to have the exact same noise patterns regardless of the platform in which the library is running. So I think, even in my case, this is a better feature than the recently merged to_bytes :)

Thanks for your hard work 🙇‍♂️
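A sketch of this idea, assuming a hypothetical hash value: with to_le_bytes the derived seed is identical on every platform, so the noise patterns are reproducible.

```rust
fn main() {
    // A hypothetical hash value used as a PRNG seed source.
    let hash: u64 = 0x0123_4567_89AB_CDEF;

    // Platform-independent: always little-endian byte order, so the
    // resulting seed bytes are the same on every target.
    let seed = hash.to_le_bytes();
    assert_eq!(seed, [0xEF, 0xCD, 0xAB, 0x89, 0x67, 0x45, 0x23, 0x01]);

    // Feeding `seed` to a pseudorandom generator now yields the same
    // noise pattern regardless of the host's endianness.
}
```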

@tbu-

Python implements a similar API on integers:

 |  from_bytes(...) from builtins.type
 |      int.from_bytes(bytes, byteorder, *, signed=False) -> int
 |      
 |      Return the integer represented by the given array of bytes.
 |      
 |      The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
 |      
 |      The byteorder argument determines the byte order used to represent the
 |      integer.  If byteorder is 'big', the most significant byte is at the
 |      beginning of the byte array.  If byteorder is 'little', the most
 |      significant byte is at the end of the byte array.  To request the native
 |      byte order of the host system, use `sys.byteorder' as the byte order value.
 |      
 |      The signed keyword-only argument indicates whether two's complement is
 |      used to represent the integer.
 |  
 |  to_bytes(...)
 |      int.to_bytes(length, byteorder, *, signed=False) -> bytes
 |      
 |      Return an array of bytes representing an integer.
 |      
 |      The integer is represented using length bytes.  An OverflowError is
 |      raised if the integer is not representable with the given number of
 |      bytes.
 |      
 |      The byteorder argument determines the byte order used to represent the
 |      integer.  If byteorder is 'big', the most significant byte is at the
 |      beginning of the byte array.  If byteorder is 'little', the most
 |      significant byte is at the end of the byte array.  To request the native
 |      byte order of the host system, use `sys.byteorder' as the byte order value.
 |      
 |      The signed keyword-only argument determines whether two's complement is
 |      used to represent the integer.  If signed is False and a negative integer
 |      is given, an OverflowError is raised.

(Note that the byteorder argument is mandatory.)

@TimNN added the S-waiting-on-team label (status: awaiting decision from the relevant subteam; see the T- label) and removed S-waiting-on-review on Jul 10, 2018

@alexcrichton

This was discussed briefly at @rust-lang/libs triage yesterday, but the conclusion was that we're going to leave comments individually here if we feel so inclined.

I personally would not want to merge this PR. I think that leveraging the already-stable to_le and to_be methods to work in combination with to_bytes and from_bytes (already implemented) is the best way to work with these methods. Given that to_le and to_be are already stable I wouldn't feel we need to stabilize to_le_bytes or to_be_bytes. The question then to me is whether we need to include "native" in the name. I feel that the fact that these are native-endianness is implied by the lack of other information in the method name as otherwise we'd specify big or little.
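The composed style described here can be compared against the dedicated methods; a sketch using the stable to_ne_bytes as a stand-in for the to_bytes method discussed in this PR:

```rust
fn main() {
    let n: u32 = 0x12_34_56_78;

    // The composed style: convert the value's byte order first, then take
    // the native-endian bytes of the result...
    let composed_be = n.to_be().to_ne_bytes();
    let composed_le = n.to_le().to_ne_bytes();

    // ...which produces the same bytes as the dedicated methods from this
    // PR, on both little- and big-endian targets.
    assert_eq!(composed_be, n.to_be_bytes());
    assert_eq!(composed_le, n.to_le_bytes());
}
```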

@alexcrichton

@rfcbot fcp close

Let's see what others think though!

@rfcbot

Team member @alexcrichton has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@tbu-

The question then to me is whether we need to include "native" in the name. I feel that the fact that these are native-endianness is implied by the lack of other information in the method name as otherwise we'd specify big or little.

As you can see in this thread, the implied native isn't clear for everyone. In fact, I think one should actively think about whether one wants a big-endian, little-endian or native byte representation for the integer. Having to_le and to_be on the integers does not achieve this, as you can just omit calls to them and never notice. If you instead have the choice between to_le_bytes, to_be_bytes, to_native_bytes, it is very clear that you have to make a choice, and it's immediately documented, too. If you see to_bytes in the wild, you wouldn't really know whether the author intended to make the code platform-specific or just forgot to add to_be or to_le.

@SimonSapin

This alternative was discussed in #49792 and I mildly preferred the current design in order to avoid the multiplication of API surface. This PR proposes replacing two methods with six.

I don’t really mind changing, especially since to_bytes and from_bytes were only stabilized in this cycle (1.29) so we can still remove them. However the motivation based only on principle feels a bit weak.

@tbu-, do you feel that this change would prevent bugs? What do you think of the "intermediate" option of renaming the existing methods but not adding the be/le ones? (Keep documenting them as to be combined with to_le and to_be.)

@tbu-

@tbu-, do you feel that this change would prevent bugs?

Yes, this is the intention behind this change. As you can see above, we already caught one with the PR alone. :)

What do you think of the "intermediate" option of renaming the existing methods but not adding the be/le ones?

I think that would be worse than what we currently have, because to_be().to_native_bytes() looks wrong: it combines a big-endian function with a function operating on native-endian bytes.

I think the optimization based on number of functions is misguided.

In C, zero-terminated "strings" share the same type whether they're encoded as UTF-8 or just treated as a bag of bytes. Rust makes the distinction because it is useful for preventing errors: it makes sense to make the user cast between bytes and UTF-8 strings, even if this results in an enormous duplication of methods between byte slices and strings.

As shown with the Python example, other languages consider the endianness of the integers part of the encoding process, too.

@alexcrichton

I think I'm personally just not sold on this. I see the cost of six functions, some of which duplicate the functionality of existing functions, as not worth it. Dealing with endianness is inherently tricky and I don't feel like making all the methods longer and more wordy will really reduce the trickiness, but rather just make it less ergonomic to call when you already know which one you'd like.

@tbu-

I see the cost of six functions, some of which duplicate the functionality of existing functions, as not worth it.

I understand this part.

I don't feel like making all the methods longer and more wordy will really reduce the trickiness, but rather just make it less ergonomic to call when you already know which one you'd like.

I don't really understand this part. It's

to_be().to_bytes()              → to_be_bytes()            [reduction by  5 chars]
to_le().to_bytes()              → to_le_bytes()            [reduction by  5 chars]
to_bytes()                      → to_native_bytes()        [increase  by  7 chars]
i32::from_be(i32::from_bytes()) → i32::from_be_bytes()     [reduction by 10 chars]
i32::from_le(i32::from_bytes()) → i32::from_le_bytes()     [reduction by 10 chars]
i32::from_bytes()               → i32::from_native_bytes() [increase  by  7 chars]

To me, it looks like it decreases the ergonomics of doing the platform-specific thing slightly but also increases the ergonomics of doing the platform-independent thing slightly, and the important part: It brings them to the same level. Currently, doing the platform-independent thing is slightly less ergonomic.

Rust wants to encourage good code. If we bring the ergonomics of the methods into play, then I think an even playing field between platform-independent and platform-dependent methods is in line with Rust's objectives.

@SimonSapin

I feel that "native" in a name doesn’t really work on its own. It only sort of does in "native endian", and even then only by opposition to little/big endian.

What do you all think of transmute_to_bytes and transmute_from_bytes, to express that this is a type-level change without a memory representation change?

Also the more I think about it the more I’m becoming partial to having explicit {to,from}_{le,be}_bytes conversions. The {to,from}_{le,be} methods have the merit of already being stable, but they’re kinda weird in taking and returning the same type. In fact the implementation of to_be and from_be for example is identical! I understand that they’re intended to be thought about differently, and perhaps their use is self-documenting what direction of conversion is intended, but it feels like the type system would be better suited to this (even though adding a whole series of dedicated types like struct BigEndianU32(u32); is probably not worth it).
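A minimal sketch of the dedicated-types idea; BigEndianU32 and its methods are hypothetical, not standard-library items:

```rust
// Hypothetical newtype whose invariant is that it stores the
// big-endian bit pattern of a u32.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BigEndianU32(u32);

impl BigEndianU32 {
    fn from_native(n: u32) -> Self {
        BigEndianU32(n.to_be())
    }
    fn to_native(self) -> u32 {
        u32::from_be(self.0)
    }
}

fn main() {
    let n: u32 = 0x12_34_56_78;
    let be = BigEndianU32::from_native(n);
    // The type makes the conversion direction explicit: you can't
    // accidentally "convert" twice, because the types no longer match.
    assert_eq!(be.to_native(), n);
}
```

This encodes the direction of conversion in the type system at the cost of a whole family of wrapper types, which is the trade-off weighed above.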

@alexcrichton

@tbu- to me ergonomics isn't literally how many characters you type but also how easy it is to remember idioms, and remembering that there are endianness conversions on integers plus byte <-> int conversions makes it easy to remember how to compose them. With all of the methods, it can be confusing to know which one was the one implemented, and there's now a choice to be made about which to use in various circumstances.

@SimonSapin I think we could try to find a better name, yeah, but I'd reserve transmute for unsafe-like operations and would prefer to not have it here on a safe method.

@marshrayms

I am today starting on my first Rust program. I need to read a binary file format. When reading and writing files or network data, the endianness is an inherent part of the format's or protocol's definition.

It seems far more complicated to me to first read an incorrect intermediate value using host endianness and then adjust it to the correct value with bit manipulations. Byte-swapping is a completely unnecessary complication when we can just read it correctly in the first place.

I would really appreciate it if conversions of the form 'u32::from_le_bytes([u8; 4])' were made available.
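A sketch of this use case with from_le_bytes; the field layout and values are invented for illustration:

```rust
fn main() {
    // Pretend these 8 bytes were read from a binary file: a little-endian
    // u32 magic number followed by a little-endian u32 length field.
    let buf: [u8; 8] = [0x4C, 0x45, 0x46, 0x00, 0x10, 0x00, 0x00, 0x00];

    // Decode each field directly in the format's declared byte order —
    // no incorrect host-endian intermediate value, no byte swapping.
    let magic = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let length = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]);

    assert_eq!(magic, 0x0046_454C);
    assert_eq!(length, 16);
}
```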

@tbu-

The old issue has already been through FCP; a new issue was opened for the new API.

@tbu-

@SimonSapin

@bors

📌 Commit 0ddfae5 has been approved by SimonSapin

@bors added the S-waiting-on-bors label (status: waiting on bors to run and complete tests; bors will change the label on completion) and removed S-waiting-on-review on Aug 4, 2018

kennytm added a commit to kennytm/rust that referenced this pull request

Aug 4, 2018

@kennytm

…Sapin

Provide {to,from}_{ne,le,be}_bytes functions on integers

If one doesn't view integers as containers of bytes, converting them to bytes necessarily requires specifying an encoding.

I think Rust is a language that wants to be explicit. The to_bytes function is basically the opposite of that – it converts an integer into the native byte representation, but there's no mention (in the function name) of it being very much platform dependent. Therefore, I think it would be better to replace that method by three methods, the explicit to_ne_bytes ("native endian") which does the same thing and to_{le,be}_bytes which return the little- resp. big-endian encoding.


bors added a commit that referenced this pull request

Aug 4, 2018

@bors

Rollup of 14 pull requests

Successful merges:

@bors

@bors

@bors added the S-waiting-on-review label (status: awaiting review from the assignee but also interested parties) and removed S-waiting-on-bors on Aug 4, 2018

SimonSapin added a commit to SimonSapin/rust that referenced this pull request

Aug 14, 2018

@SimonSapin

frewsxcv added a commit to frewsxcv/rust that referenced this pull request

Aug 17, 2018

@frewsxcv

@gamozolabs

I would really like to see this be put into a trait rather than functions.

As a trait, this could be implemented by a built-in derive, which would make it work for structures composed of the primitive types this is currently being implemented on.

This would allow simple creation of packed structures (or C structures, by ignoring padding) in Rust, like an IP header or a binary file header, and then automatic conversion to and from the raw bytes on the wire or in a file.

I have a library that does this now and I use it in many of my projects. I think the ability to handle reading and writing packed structures safely from byte streams is critical enough to a systems language to justify putting it into the core language rather than an external library, especially given that it would add almost no code to the Rust base (it would be a proc macro) and would have no implications for programs unless they decide to use the feature.
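A hand-written sketch of what such a trait might look like; the trait name, the Header type, and the impl are all hypothetical (a real version would be a built-in derive rather than manual impls):

```rust
// Hypothetical trait: serialize a structure field by field into
// little-endian bytes. None of these names exist in the standard library.
trait ToLeBytes {
    fn to_le_bytes_vec(&self) -> Vec<u8>;
}

// An invented packed-style header with two primitive fields.
struct Header {
    magic: u32,
    version: u16,
}

impl ToLeBytes for Header {
    fn to_le_bytes_vec(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&self.magic.to_le_bytes());
        out.extend_from_slice(&self.version.to_le_bytes());
        out
    }
}

fn main() {
    let h = Header { magic: 0xDEAD_BEEF, version: 3 };
    // 4 bytes of magic (little-endian) followed by 2 bytes of version.
    assert_eq!(h.to_le_bytes_vec(), vec![0xEF, 0xBE, 0xAD, 0xDE, 0x03, 0x00]);
}
```

A derive would generate impls like this mechanically from the struct's field order.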

@BurntSushi

@gamozolabs I think it would be appropriate to publish that library on crates.io, garner adoption and prove out the API. At that point, someone could consider writing an RFC, but I think it's probably immature to jump to that right now.

cswinter pushed a commit to cswinter/LocustDB that referenced this pull request

Sep 18, 2018

@zhzy0077 @cswinter

@jonhoo

@BurntSushi

@jonhoo That's a pretty broad question! Nothing immediately pops out at me. byteorder should continue to work fine?

@jonhoo

Hehe, that's true. I guess I was more wondering whether byteorder will transition to use these methods instead of having its own implementation?

@BurntSushi

Hehe, that's true. I guess I was more wondering whether byteorder will transition to use these methods instead of having its own implementation?

Oh! Hah. That didn't cross my mind at all because byteorder is one of the most conservative crates I maintain, and I won't increase its MSRV for something like this.

@tbu- tbu- mentioned this pull request

Dec 10, 2024

Labels: S-waiting-on-review, T-libs-api