[Python-3000] Unicode and OS strings
Stephen J. Turnbull stephen at xemacs.org
Fri Sep 14 06:52:45 CEST 2007
Greg Ewing writes:
> Stephen J. Turnbull wrote:
> > What should happen internally is that all undecodable characters (which PUA characters are by definition for standard codecs) are mapped to unused codepoints in the PUA, chosen by Python.
> You mean chosen dynamically?
Yes.
> What happens if these PUA characters get encoded some other way,
You can't win that, because Unicode is the only encoding that attempts to guarantee even the possibility of round-tripping. The only thing you can win is if it's the same character set (which might be used by multiple encodings), and then we record the character set and the code point. That's the best we can do in theory.
The main problem with this scheme that I know of is that if you have a Python string that contains such a code point, you'll need to somehow include the information about the original encoding when pickling and the like.
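To make the scheme concrete, here is a rough sketch of what I have in mind. Everything in it is hypothetical: PuaMapper and its methods are names made up for illustration, not an existing or proposed Python API. It assumes undecodable input is handled one byte at a time, that only the BMP Private Use Area (U+E000 to U+F8FF) is used, and that the input contains no genuine PUA characters of its own (those would have to be remapped too, which the sketch does not do). It records the character set and the original byte for each code point it hands out, and it ships that table along with the string when pickling, since the code points are chosen dynamically and mean nothing to another process.

```python
import codecs
import pickle

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area in the BMP


class PuaMapper:
    """Hypothetical helper, not an existing Python API."""

    def __init__(self):
        self._next = PUA_FIRST
        self._to_pua = {}    # (charset, raw byte) -> PUA character
        self._from_pua = {}  # PUA character -> (charset, raw byte)
        self._charset = None
        # Each mapper registers its own decode error handler under a unique name.
        self._handler = "pua-mapper-%d" % id(self)
        codecs.register_error(self._handler, self._on_error)

    def _allocate(self, charset, raw):
        # Hand out the next unused PUA code point, reusing earlier choices.
        key = (charset, raw)
        if key not in self._to_pua:
            if self._next > PUA_LAST:
                raise RuntimeError("no unused PUA code points left")
            ch = chr(self._next)
            self._next += 1
            self._to_pua[key] = ch
            self._from_pua[ch] = key
        return self._to_pua[key]

    def _on_error(self, exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        chars = "".join(self._allocate(self._charset, bytes([b])) for b in bad)
        return chars, exc.end  # resume decoding after the bad bytes

    def decode(self, data, charset):
        self._charset = codecs.lookup(charset).name  # normalized codec name
        return data.decode(charset, errors=self._handler)

    def encode(self, text, charset):
        charset = codecs.lookup(charset).name
        out = bytearray()
        for ch in text:
            origin = self._from_pua.get(ch)
            if origin is None:
                out += ch.encode(charset)
            elif origin[0] == charset:
                out += origin[1]  # same character set: restore the raw byte
            else:
                raise ValueError("%r came from %s, not %s"
                                 % (ch, origin[0], charset))
        return bytes(out)

    def dumps(self, text):
        # Pickle the string together with the origin of each PUA character in it.
        table = {ch: self._from_pua[ch] for ch in set(text) if ch in self._from_pua}
        return pickle.dumps((text, table))

    def loads(self, payload):
        # Re-allocate code points locally; they may differ from the sender's.
        text, table = pickle.loads(payload)
        remap = {ord(ch): self._allocate(cs, raw) for ch, (cs, raw) in table.items()}
        return text.translate(remap)


mapper = PuaMapper()
raw = b"caf\xe9.txt"
name = mapper.decode(raw, "ascii")
assert mapper.encode(name, "ascii") == raw

other = PuaMapper()  # stands in for a different process
restored = other.loads(mapper.dumps(name))
assert other.encode(restored, "ascii") == raw
```

Note that encoding to a different character set is refused rather than guessed at, which is the "you can't win that" case above. The sketch also keeps the current charset as instance state during decode, so it is not thread-safe as written.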