
[Python-Dev] bytes / unicode

Toshio Kuratomi a.badger at gmail.com
Tue Jun 22 07:50:40 CEST 2010


On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote:

> Toshio Kuratomi writes:
>
>  > One comment here -- you can also have uri's that aren't decodable into
>  > their true textual meaning using a single encoding.
>  >
>  > Apache will happily serve out uris that have utf-8, shift-jis, and
>  > euc-jp components inside of their path but the textual representation
>  > that was intended will be garbled (or be represented by escaped byte
>  > sequences). For that matter, apache will serve requests that have no
>  > true textual representation as it is working on the byte level rather
>  > than the character level.
>
> Sure. I've never seen that combination, but I have seen Shift JIS and
> KOI8-R in the same path. But in that case, just using 'latin-1' as the
> encoding allows you to use the (unicode) string operations internally,
> and then spew your mess out into the world for someone else to clean
> up, just as using bytes would.

This is true. I'm giving this as a real-world counterexample to the assertion that URIs are "text". In fact, I think you're confusing things a little by asserting that the RFC says that URIs are text. I'll address that two sections down.

>  > So a complete solution really should allow the programmer to pass
>  > in uris as bytes when the programmer knows that they need it.

> Other than passing bytes into a constructor, I would argue if a
> complete solution requires, eg, an interface that allows
> urljoin(base, subdir) where the types of base and subdir are not
> required to match, then it doesn't belong in the stdlib. For stdlib
> usage, that's premature optimization IMO.

I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and urljoin(u_base, u_subdir) => unicode be acceptable though? (I think, given other options, I'd rather see two separate functions, though. It seems more discoverable and less prone to taking bad input some of the time to have two functions that clearly only take one type of data apiece.)
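The two-function shape suggested above could be sketched as thin wrappers over the existing str-based urljoin. The names urljoin_str and urljoin_bytes are hypothetical, not stdlib names; the bytes variant leans on the fact that a well-formed, percent-encoded URI is pure ASCII, so round-tripping through an ASCII decode is lossless:

```python
from urllib.parse import urljoin

def urljoin_str(base, url):
    """Join str URIs only; reject bytes up front."""
    if not (isinstance(base, str) and isinstance(url, str)):
        raise TypeError("urljoin_str() requires str arguments")
    return urljoin(base, url)

def urljoin_bytes(base, url):
    """Join bytes URIs only.

    A well-formed URI is ASCII-only after percent encoding, so we can
    decode, join with the textual implementation, and re-encode.
    """
    if not (isinstance(base, bytes) and isinstance(url, bytes)):
        raise TypeError("urljoin_bytes() requires bytes arguments")
    return urljoin(base.decode("ascii"), url.decode("ascii")).encode("ascii")

print(urljoin_str("http://host/a/", "b"))       # http://host/a/b
print(urljoin_bytes(b"http://host/a/", b"b"))   # b'http://host/a/b'
```

Each function accepts exactly one type and fails loudly on the other, which is the discoverability property argued for above.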

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different levels here. Here's the section 2 that you quoted earlier, now in its entirety::

    2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters. When a URI appears in a protocol element, the character encoding is defined by that protocol; without such a definition, a URI is assumed to be in the same character encoding as the surrounding text.

The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI is a sequence of characters, we must invert that relation in order to understand the URI syntax. Therefore, the integer values used by the ABNF must be mapped back to their corresponding characters via US-ASCII in order to complete the syntax rules.

A URI is composed from a limited set of characters consisting of digits, letters, and a few graphic symbols. A reserved subset of those characters may be used to delimit syntax components within a URI while the remaining characters, including both the unreserved set and those reserved characters not acting as delimiters, define each component's identifying data.

So here's some data that matches those terms up with actual steps in the process::

We start off with some arbitrary data that defines a resource. This is
not necessarily text. It's the data from the first sentence:

    data = b"\xff\xf0\xef\xe0"

We encode that into text and combine it with the scheme and host to form
a complete uri. This is the "URI characters" mentioned in section #2.
It's also the "sequence of characters" mentioned in 1.1, as it is not
until this point that we actually have a URI:

    uri = b"http://host/" + percentencoded(data)

Note1: percentencoded() needs to take any bytes or characters outside of
the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_" /
"~") and percent encode them. The URI can only consist of characters
from this set and the reserved character set (2.2).

Note2: in this simplistic example, we're only dealing with one piece of
data. With multiple pieces, we'd need to combine them with separators,
for instance like this:

    uri = (b'http://host/' + percentencoded(data1) + b'/'
           + percentencoded(data2))

Note3: at this point, the uri could be stored as unicode or bytes in
python3. It doesn't matter. It will be a subset of ASCII in either
case.

Then we take this and encode it for presentation inside of a data file.
If we're saving in any encoding that has ASCII as a subset and we had
bytes returned from the previous step, all we need to do is save to a
file. If we had unicode from the previous step, we need to transform to
the encoding we're using and output it:

    u_uri.encode('utf8')
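The percentencoded() used above is hypothetical, but a minimal sketch is possible with urllib.parse.quote, which accepts bytes and already treats the section 2.3 unreserved set as safe:

```python
from urllib.parse import quote

def percentencoded(data):
    """Percent-encode arbitrary bytes into the unreserved URI
    character set of RFC 3986 section 2.3 (sketch, not stdlib)."""
    # safe='' escapes the reserved characters too (quote() would
    # otherwise leave '/' alone); unreserved characters pass through
    # unchanged.
    return quote(data, safe='').encode('ascii')

data = b"\xff\xf0\xef\xe0"
uri = b"http://host/" + percentencoded(data)
print(uri)  # b'http://host/%FF%F0%EF%E0'
```

quote() returns str; encoding the result back to ASCII keeps the whole expression in bytes, matching the byte-oriented example above.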

With all this in mind... URIs are text according to the RFC if you want to deal with URIs that are percent encoded. In other words, things like this::

    http://host/%ff%f0%ef%e0

If you want to deal with things like this::

    http://host/café

Then you are going one step further; back to the original data that was encoded in the RFC. At that point you are no longer dealing with the sequence of characters talked about in the RFC. You are dealing with data which may or may not be text.

As Robert Collins says, this is bytes by definition which I pretty much agree with. It's very very convenient to work with this data as text most of the time but the RFC does not mandate that it is text so operating on it as bytes is perfectly reasonable.
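The two levels are easy to demonstrate with urllib.parse.unquote_to_bytes: undoing the percent encoding always yields bytes, and whether those bytes are text is a separate question (treating the first example's data as utf-8 is an assumption for illustration):

```python
from urllib.parse import unquote_to_bytes

# Both inputs are "text" at the URI level; only one is text underneath.
textual = unquote_to_bytes("caf%C3%A9")      # underlying data is utf-8 text
binary = unquote_to_bytes("%ff%f0%ef%e0")    # underlying data is not text

print(textual.decode("utf-8"))  # café
try:
    binary.decode("utf-8")
except UnicodeDecodeError:
    print("not decodable as text")
```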

> It's not just a matter of manipulating the URIs themselves, where
> working directly on bytes will work just as well and with the same
> string operations (as long as everything is bytes). It's also a
> question of API complexity (eg, Barry's bugaboo of proliferation of
> encoding= parameters) and of debugging (if URIs are internally str,
> then they will display sanely in tracebacks and the interpreter).

The proliferation of encoding= parameters is, I agree, ugly. Although, if I'm thinking correctly, that only matters when you want to allow mixing bytes and unicode, correct? One of these cases:

For debugging, I'm either not understanding or you're wrong. If I'm given an arbitrary sequence of bytes how do I sanely store them as str internally? If I transform them using an encoding that anticipates the full range of bytes I may be able to display some representation of them but it's not necessarily the sanest method of display (for instance, if I know that path element 1 is always going to be a utf8 encoded string and path element 2 is always shift-jis encoded, and path element 3 is binary data, I could construct a much saner display method than treating the whole thing as latin1).
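That saner display could be sketched like this; the URI and the per-element encodings (utf-8, shift-jis, raw binary) are hypothetical, standing in for knowledge the application has but a blanket latin-1 decode throws away:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical URI whose path mixes encodings: element 1 is utf-8,
# element 2 is shift-jis, element 3 is arbitrary binary data.
uri = b"http://host/caf%C3%A9/%93%FA%96%7B/%FF%F0"

path = uri.split(b"/", 3)[3]                  # everything after the host
raw = [unquote_to_bytes(p) for p in path.split(b"/")]

display = [
    raw[0].decode("utf-8"),                   # café
    raw[1].decode("shift-jis"),               # 日本
    repr(raw[2]),                             # binary: show an escaped repr
]
print(" / ".join(display))
```

Each element gets decoded with the encoding it actually uses, and the truly binary element is displayed as an escaped repr instead of pretending it is text.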

> The cases where URIs can't be sanely treated as text are garbage
> input, and the stdlib should not try to provide a solution. Just
> passing in bytes and getting out bytes is GIGO. Trying to do "some"
> error-checking is going to be insufficient much of the time and
> overly strict most of the rest of the time. The programmer in the
> trenches is going to need to decide what to allow and what not; I
> don't think there are general answers because we know that allowing
> random URLs on the web leads to various kinds of problems. Some
> sites will need to address some of them.

What is your basis for asserting that URIs that can't be sanely treated as text are garbage? It's definitely not in the RFC.

> Note also that the "complete solution" argument cuts both ways. Eg,
> a "complete" solution should implement UTS 39 "confusables
> detection"[1] and IDNA[2]. Good luck doing that with bytes!

Note that IDNA and confusables detection operate on a different portion of the uri than the one where bytes are needed. Those operate on the domain name (looks like it's called the authority in the RFC) whereas bytes are useful for the path, query, and fragment portions.
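That split is visible in the stdlib itself: Python ships an idna codec (implementing the older IDNA 2003, RFC 3490) that works only on the host portion, which is text by construction, while confusables detection (UTS 39) is not in the stdlib at all. A small sketch, with a hypothetical host name:

```python
# IDNA applies to the authority (host) component only; the host name
# here is a made-up example.
host = "café.example"
encoded = host.encode("idna")    # stdlib codec, IDNA 2003 semantics
print(encoded)                   # b'xn--caf-dma.example'
print(encoded.decode("idna"))    # round-trips back to café.example
```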

Note: I'm not sure precisely what Philip is looking to do, but the little I've read sounds like it's contrary to the design principles of the python3 unicode handling redesign. I'm stating my reading of the RFC not to defend the use case Philip has, but because I think that the outlook that non-text uris (before being percent encoded) are violations of the RFC is wrong and will lead to interoperability problems/warts (since you could turn them into latin1 and from there into bytes and from there into the proper values) if allowed to predominate the thinking.

-Toshio


