[Python-Dev] bytes type discussion

"Martin v. Löwis" martin at v.loewis.de
Wed Feb 15 09:14:37 CET 2006


Greg Ewing wrote:

If the protocol has been sensibly designed, that shouldn't happen, since everything up to the coding marker should be ascii (or some other protocol-defined initial coding).

XML, for one protocol, requires you to start over. The initial sequence could be UTF-16, or it could be EBCDIC. You read a few bytes (up to four), and then you know which of these it is. Then you start over and, if it looks like an ASCII superset, read further to find out the real encoding. Normally you then start over once more, although switching encodings at that point could also work.
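The two passes described above might look roughly like the following sketch. The function names and the regular expression are illustrative assumptions, not anything from the XML specification or the standard library; UTF-32 and other less common cases are deliberately left out, and "cp037" is just one EBCDIC code page picked for the example.

```python
import codecs
import re

# Hypothetical helper: classify the first (up to four) bytes of a document.
def sniff_family(head: bytes) -> str:
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"            # the BOM tells the codec the byte order
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"         # "<?" in UTF-16 without a BOM
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"\x4c\x6f\xa7\x94"):
        return "cp037"             # "<?xm" in EBCDIC (assumed code page)
    return "ascii-superset"        # need to read the declaration to know more

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
DECL = re.compile(
    rb'''^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)

def detect_xml_encoding(data: bytes) -> str:
    # First pass: look only at the first four bytes.
    family = sniff_family(data[:4])
    if family != "ascii-superset":
        return family
    # Second pass: start over from byte 0 and read the declaration as ASCII.
    match = DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                 # XML's default when nothing is declared

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
text = doc.decode(detect_xml_encoding(doc))   # third pass: decode from the start
```

Note that every pass starts again from the original bytes, which is the "start over" behaviour rather than a mid-stream switch.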

For protocols that are not sensibly designed (or if you're just trying to guess) what you suggest may be needed. But it would be good to have a nicer way of going about it for when the protocol is sensible.

There might already be buffered decoded strings (i.e. beyond the point to which you have read), so you would need to unbuffer these and reinterpret them. To support that, you really need to buffer both the original bytes and the decoded ones, since the encoding might not round-trip.
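To make the round-trip problem concrete, here is a minimal sketch. The class RedecodableBuffer and its switch() method are hypothetical names invented for this example; it keeps the whole input in memory and is not how any existing stream class works.

```python
class RedecodableBuffer:
    """Hypothetical buffer holding both the original bytes and decoded text."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw = bytes(raw)      # original bytes, never modified
        self.encoding = encoding
        # Decoded read-ahead under the provisional encoding.
        self.text = self.raw.decode(encoding, errors="replace")

    def switch(self, encoding: str, byte_offset: int = 0) -> str:
        # Discard the decoded read-ahead and rebuild it from the raw bytes.
        self.encoding = encoding
        self.text = self.raw[byte_offset:].decode(encoding, errors="replace")
        return self.text

raw = b'<?xml version="1.0" encoding="utf-8"?><p>caf\xc3\xa9</p>'
buf = RedecodableBuffer(raw, "ascii")

# The provisional decoding is lossy: re-encoding it does not give back
# the original bytes, so the decoded text alone cannot be reinterpreted.
roundtrip_ok = buf.text.encode("ascii", errors="replace") == raw

# With the raw bytes still at hand, switching encodings is straightforward.
fixed = buf.switch("utf-8")
```

Here roundtrip_ok comes out False, because the two non-ASCII bytes were turned into replacement characters during the first decode; only the retained raw bytes allow the later UTF-8 decoding to recover the text.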

Regards, Martin


