[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Chris Angelico rosuav at gmail.com
Wed Sep 17 03:14:15 CEST 2014


On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:

> Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string. (It's a string that can't be encoded unless you use the surrogateescape error handler, but it is still a string from Python's POV, which is the point of the error handler).
>
> Or, to put it another way, your implication was that there were no string operations that could be usefully applied to a string containing smuggled bytes, but that is not the case. (I may well have read an implication that was not there; if so I apologize and you can ignore the rest of this :)

Ahh, I see where we are getting confused. What I said was that you can't treat the string as a pure Unicode string. Parts of it are Unicode text, parts of it aren't.
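To make that concrete, here is a minimal sketch of what a string with smuggled bytes looks like. The sample bytes and variable names are made up for illustration; only the surrogateescape error handler itself comes from the discussion above.

```python
# Hypothetical input: Latin-1 encoded bytes that are not valid UTF-8.
raw = b"caf\xe9 ok"

# surrogateescape smuggles each undecodable byte as a lone surrogate
# in the range U+DC80..U+DCFF instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# It is a str, and ordinary string operations work on it.
words = text.split()
is_str = isinstance(text, str)

# The smuggled parts are not Unicode text, so a plain encode fails...
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# ...but the same error handler restores the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

# The non-text parts can be picked out by their code point range.
smuggled = [ch for ch in text if "\udc80" <= ch <= "\udcff"]
```

So the object is a string as far as Python is concerned, but one code point in it stands for a raw byte rather than a character.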

> Basically, we are pretending that each smuggled byte is a single character for string parsing purposes...but they don't match any of our parsing constants. They are all "any character" matches in the regexes and what have you.

This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue. I'm fairly sure you're never going to find an encoding in which one unknown byte represents two characters, but there are cases where it takes more than one byte to make up a character (or the bytes are just shift codes or something). Does that ever throw off your regexes? It wouldn't be an issue for a .* between two character markers, but if you ever say .{5} then it might match incorrectly.
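Here is a rough sketch of the kind of mismatch I mean. It assumes the unknown bytes happen to be UTF-8 and the parser decodes them as ASCII with surrogateescape; the angle-bracket markers and the patterns are hypothetical, not taken from any real parser.

```python
import re

# Three characters (U+65E5 U+672C U+8A9E) between two ASCII markers;
# in UTF-8 that is nine bytes between the markers.
raw = "<\u65e5\u672c\u8a9e>".encode("utf-8")

proper = raw.decode("utf-8")                               # 5 code points
smuggled = raw.decode("ascii", errors="surrogateescape")   # 11 code points

# An open-ended match between the markers works either way.
star_proper = re.fullmatch(r"<(.*)>", proper) is not None
star_smuggled = re.fullmatch(r"<(.*)>", smuggled) is not None

# A counted match sees one "character" per smuggled byte, so the
# count that is right for the real text is wrong for the smuggled form.
three_proper = re.fullmatch(r"<(.{3})>", proper) is not None
three_smuggled = re.fullmatch(r"<(.{3})>", smuggled) is not None
nine_smuggled = re.fullmatch(r"<(.{9})>", smuggled) is not None
```

The .* pattern matches both forms, while .{3} matches only the properly decoded string; the smuggled form needs .{9} because it is being counted in bytes.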

I think we're in agreement here, just using different words. :)

ChrisA


