[Python-Dev] Bytes path support

Marko Rauhamaa marko at pacujo.net
Sat Aug 23 10:21:57 CEST 2014


"Stephen J. Turnbull" <stephen at xemacs.org>:

> Just read as bytes and decode piecewise in one way or another. For Oleg's HTML case, there's a well-understood structure that can be used to determine retry points.

HTML and XML are interesting examples since their encoding is initially unknown:

                  ^
                  +--- Now I know it is UTF-8
                                  ^
                                  +--- Now I know it was UTF-16
                                       all along!

Then we have:

   HTTP/1.1 200 OK
   Content-Type: text/html; charset=ISO-8859-1

See how deep you have to parse the TCP stream before you realize the content encoding is UTF-16.
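
As a rough illustration of that point, here is a minimal Python sketch that scans a raw byte stream for the three places an encoding might be declared and reports them in the order they appear. The function name, the regular expressions, and the SAMPLE response are all made up for this example, and the sketch assumes the declarations themselves are written in ASCII-compatible bytes; it is not a real HTTP or HTML parser.

```python
import re

# Hypothetical, simplified patterns for three places an encoding may be declared.
# Real HTTP/XML/HTML parsing has many more rules than these regexes capture.
_DECLARATIONS = [
    ("http-header", re.compile(rb"Content-Type:[^\r\n]*?charset=([\w.-]+)", re.I)),
    ("xml-declaration", re.compile(rb"<\?xml[^>]*?encoding=[\"']([\w.-]+)[\"']", re.I)),
    ("html-meta", re.compile(rb"<meta[^>]*?charset=[\"']?([\w.-]+)", re.I)),
]


def find_declared_encodings(raw: bytes):
    """Return (byte_offset, source, encoding) tuples, sorted by byte offset."""
    found = []
    for source, pattern in _DECLARATIONS:
        for match in pattern.finditer(raw):
            found.append((match.start(1), source, match.group(1).decode("ascii")))
    return sorted(found)


# Made-up response in which every layer claims a different encoding.
SAMPLE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"\r\n"
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">'
    b"</head></html>"
)

if __name__ == "__main__":
    for offset, source, encoding in find_declared_encodings(SAMPLE):
        print(f"byte {offset:4d}: {source} says {encoding}")
```

Each later declaration sits further into the stream than the one before it, so a reader that decodes eagerly on the first hint would have to go back and retry once it reaches the next one.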

Marko


