[Python-Dev] bytes / unicode

R. David Murray rdmurray at bitdance.com
Mon Jun 28 01:31:21 CEST 2010


I've been watching this discussion with intense interest, but have been so lagged in following the thread that I haven't replied. I got caught up today....

On Sun, 27 Jun 2010 15:53:59 +1000, Nick Coghlan <ncoghlan at gmail.com> wrote:

The difference is that we have three classes of algorithm here:
- those that work only on octet sequences
- those that work only on character sequences
- those that can work on either

Python 2 lumped all 3 classes of algorithm together through the multi-purpose 8-bit str type. The unicode type provided some scope to separate out the second category, but the divisions were rather blurry.

Python 3 forces the first two to be separated by using either octets (bytes/bytearray) or characters (str). There are a very small number of APIs where it is appropriate to be polymorphic, but this is currently difficult due to the need to supply literals of the appropriate type for the objects being operated on.

This isn't ever going to happen automagically due to the need to explicitly provide two literals (one for octet sequences, one for character sequences).

In email6 I'm currently handling this by putting the algorithm on a base class and the literals on 'Bytes...' and 'String...' subclasses as class variables. Slightly ugly, but it works.
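To make the pattern concrete, here is a minimal sketch of the base-class-plus-literals approach. It is not the actual email6 code: the class names, the `split_header` method, and the `COLON`/`WHITESPACE` class variables are all hypothetical, chosen only to illustrate the idea.

```python
class _HeaderSplitter:
    """Shared algorithm; subclasses supply literals of the matching type.

    Hypothetical illustration only, not taken from email6.
    """

    # Placeholders that the concrete subclasses override.
    COLON = None
    WHITESPACE = None

    def split_header(self, line):
        # The method body never spells out a bytes or str literal itself,
        # so the same code runs on either type of input.
        name, _, value = line.partition(self.COLON)
        return name.strip(self.WHITESPACE), value.strip(self.WHITESPACE)


class BytesHeaderSplitter(_HeaderSplitter):
    # Literals for octet sequences.
    COLON = b":"
    WHITESPACE = b" \t\r\n"


class StringHeaderSplitter(_HeaderSplitter):
    # Literals for character sequences.
    COLON = ":"
    WHITESPACE = " \t\r\n"
```

With these definitions, `BytesHeaderSplitter().split_header(b"Subject: hello\r\n")` returns a pair of bytes objects and `StringHeaderSplitter().split_header("Subject: hello\r\n")` returns a pair of str objects, with the algorithm written only once.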

The current design also speaks to an earlier point someone made about the fact that we are really dealing with more complex, and domain-specific, data, not simply "byte strings". A "BytesMessage" contains lots of structured encoding information as well as the possibility of 'garbage' bytes. A StringMessage contains text and data decoded into objects (e.g. an image object), possibly with some PEP 383 surrogates included (haven't quite figured that part out yet). So, a BytesMessage object isn't just a byte string, it's a load of structured data that requires the associated algorithms to convert into meaningful text and objects. Going the other way, the decisions made about character encodings need to be encoded into the structured bytes representation that could ultimately go out on the wire.
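For readers unfamiliar with the PEP 383 surrogates mentioned above, the following sketch shows the general mechanism using the standard `surrogateescape` error handler. The sample header bytes are made up, and how email6 would actually apply this is left open here, as it is in the text.

```python
# A made-up header containing one non-ASCII ('garbage') byte, 0xE9.
raw = b"Subject: caf\xe9"

# Decoding with the PEP 383 error handler maps each undecodable byte
# 0xNN to the lone surrogate U+DCNN instead of raising an exception.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the round trip is lossless.
round_tripped = text.encode("ascii", errors="surrogateescape")
```

After this runs, the last character of `text` is the surrogate `'\udce9'`, and `round_tripped` is equal to `raw`.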

I suspect that the same thing needs to be done for URIs/IRIs, and html/MIME and the corresponding text and objects. It is my hope that the email6 work will lay a firm foundation for the latter, but URI/IRI is a whole different protocol that I'm glad I don't have to deal with :)

The virtues of a separate polystr type are that:

Having such a poly_str type would probably make my life easier.

I also would like to just vent a little frustration at having to use single-character-slice notation when I want to index a character in a string in my algorithms....
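The frustration comes from a difference in Python 3 indexing: indexing a str yields a length-1 str, but indexing a bytes object yields an int, so code that must work on both types has to slice instead. A small illustration (the variable names are arbitrary):

```python
text = "abc"
data = b"abc"

first_text = text[0]      # a length-1 str
first_int = data[0]       # an int, not a length-1 bytes object
first_bytes = data[0:1]   # slicing keeps the bytes type

# The slice form behaves the same way for both types, which is why
# polymorphic code ends up using it everywhere.
first_text_sliced = text[0:1]
```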

-- R. David Murray www.bitdance.com
