[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Sun Sep 16 09:13:29 CEST 2007


"Martin v. Löwis" writes:

What I'm suggesting is to provide a way for processes to record and communicate that information without needing to provide a "source encoding" slot for strings, and which is able to handle strings containing unrecognized (including corrupt) characters from multiple source encodings.

Can you please (re-)state how that way would precisely work? I could not find that in the archives.

The basic idea is to allocate code points in private space as needed.

All points in private space would be initially "owned" by the Python process.

When a codec encounters something it can't handle, whether it's a valid character in a legacy encoding, a private use character in a UTF, or an invalid sequence of code units, it throws an exception specifying the offending character or code unit and the current coded character set. The handler either finds that (charset, code point) pair already in the table, or assigns a fresh private use character, enters it in the table keyed by the charset-codepoint pair, and records the inverse assignment in an inverse mapping table.

It may be that no charset can be assigned to the code point, in which case None would be recorded as the charset, and instead of mapping characters, the invalid code units would be mapped individually.
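Here is a minimal sketch of the decode side in terms of Python's existing codecs.register_error machinery. The handler name, the module-level tables, and the use of the codec name reported by the exception as the charset key are illustrative choices of mine, not part of the proposal:

    import codecs
    import itertools

    PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP private use area
    _next_point = itertools.count(PUA_START)
    _table = {}      # (charset, code unit) -> private use character
    _inverse = {}    # private use character -> (charset, code unit)

    def _allocate(charset, unit):
        """Return the private use character standing in for (charset, unit)."""
        key = (charset, unit)
        if key not in _table:
            point = next(_next_point)
            if point > PUA_END:
                raise RuntimeError("private use area exhausted")
            _table[key] = chr(point)
            _inverse[chr(point)] = key
        return _table[key]

    def _preserve_decode(exc):
        # Map each offending code unit to a private use character.  Using
        # exc.encoding as the charset is a simplification; the None-charset
        # case for uninterpretable bytes would need the codec's cooperation.
        replacement = "".join(_allocate(exc.encoding, unit)
                              for unit in exc.object[exc.start:exc.end])
        return replacement, exc.end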

On output, if the codec can output in the recorded character set, it does so; otherwise it throws an unencodable-character exception.
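Continuing that sketch, the encode side consults the inverse table and either emits the recorded unit or re-raises; a single handler name dispatches on the exception type, as register_error requires:

    def _preserve_encode(exc):
        out = bytearray()
        for char in exc.object[exc.start:exc.end]:
            charset, unit = _inverse.get(char, (None, None))
            if unit is not None and charset == exc.encoding:
                out.append(unit)   # target codec matches the recorded charset
            else:
                raise exc          # no record, or the recorded charset differs
        return bytes(out), exc.end

    def _preserve(exc):
        if isinstance(exc, UnicodeDecodeError):
            return _preserve_decode(exc)
        if isinstance(exc, UnicodeEncodeError):
            return _preserve_encode(exc)
        raise exc

    codecs.register_error("preserve", _preserve)

    # b"abc\xff".decode("ascii", "preserve") round-trips back through
    # .encode("ascii", "preserve"); encoding the same string to latin-1
    # raises UnicodeEncodeError because the recorded charset differs.

Note that exc.encoding is only a crude stand-in for a real charset slot: many codecs (the charmap-based ones, for instance) report only a generic name, which is exactly the limitation listed under Disadvantages below.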

This definitely requires that the Unicode codecs be modified to do the right thing if they encounter private use characters in the input stream or output stream.

Other codecs don't need to be modified, although ISO 2022-based codecs (at least) would benefit from being modified. Some codecs (like the ISO-8859 codecs) have an implicit charset (ASCII code points can't be errors for them, so only the GR half, the bytes with the high bit set, matters), and can use codec-specific handlers that know what the implicit charset is. (AIUI this would require that the handler-specifying protocol be changed from an enumeration of the available handlers to the ability to actually specify one.) The rest can use the None charset, so that code units will be preserved.
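One way to approximate that per-codec knowledge with today's string-named handler protocol is a factory that registers a charset-specific variant of the handler; again just a sketch reusing the table from the earlier code:

    def register_preserve_for(charset):
        """Register a handler that hard-wires the implicit charset, e.g. for
        an ISO-8859 codec where only the GR half can produce errors."""
        name = "preserve-" + charset
        def handler(exc):
            if isinstance(exc, UnicodeDecodeError):
                replacement = "".join(_allocate(charset, unit)
                                      for unit in exc.object[exc.start:exc.end])
                return replacement, exc.end
            raise exc
        codecs.register_error(name, handler)
        return name

    # 0xFF is unassigned in ISO 8859-7:
    # b"\xff".decode("iso8859-7", register_preserve_for("iso8859-7"))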

Applications which wish to pass strings across process boundaries will have to pass the table too. If they don't, then in general they can't use this family of exception handlers.
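Passing the table could be as simple as serializing it alongside the strings; a sketch (a real implementation would also have to detect collisions with points the receiving process has already allocated):

    import json

    def export_table():
        return json.dumps([[charset, unit, ord(char)]
                           for (charset, unit), char in _table.items()])

    def import_table(payload):
        for charset, unit, point in json.loads(payload):
            _table[(charset, unit)] = chr(point)
            _inverse[chr(point)] = (charset, unit)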

To handle cases like Marcin's private encoding, and in general to allow efficient IPC for processes that know they're going to get certain private use characters in I/O, there should be an API to preallocate specific code points. (Theoretically, dynamically allocated private code points could be reallocated, but that would require translating all existing strings, and I can't believe that would ever be worth it.)
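A preallocation API might look like the following; note that the dynamic allocator would then have to skip reserved points, which the simple counter in the sketch above does not do:

    def preallocate(charset, unit, point):
        """Reserve a specific private use code point for (charset, unit)."""
        if not PUA_START <= point <= PUA_END:
            raise ValueError("not a BMP private use code point")
        char = chr(point)
        if char in _inverse and _inverse[char] != (charset, unit):
            raise ValueError("code point %#x already allocated" % point)
        _table[(charset, unit)] = char
        _inverse[char] = (charset, unit)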

What happens if a string "escapes" without the table?

  1. The application uses the preallocation API. Then the characters it understands are handled normally, and dynamically allocated private use characters are errors anyway. I don't see how this makes things worse.

  2. The application doesn't use the preallocation API, but does know about some private use characters. Then it will get confused by the dynamic allocation, as Greg and Marcin point out, and users should be advised not to use the new handler.

  3. The application doesn't know about any private use characters. Then dynamically allocated characters are exceptions anyway. I don't see how this makes things worse.

Advantages:

  1. Almost all the "interesting" information about the original encoded source is preserved, including under string operations like slicing and concatenation with strings from other sources. (I can quantify "almost all" more precisely if necessary.)

  2. 100% Unicode conformance in the sense that if the internal representation escapes, it's valid Unicode.

  3. Efficient internal representation in the sense that applications need not worry about invalid Unicode when doing string operations.

  4. In 16-bit environments, up to 6400 non-BMP characters can be mapped into the BMP private use area using the same algorithm (see the sketch after this list), achieving a "string is character array" representation at the expense of slight overhead in I/O and one extra table reference in each character property lookup. As Marcin points out, given that not all composable characters have one-character NFC representations, we can't guarantee that the user's notion of string length will equal the number of characters in the string, but in practice I think that will almost invariably work out. And if we're doing normalization, the codec overhead becomes less important.
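Here is a conceptual sketch of that folding, reusing the allocator above with a made-up "unicode" charset key; it assumes iteration yields whole code points, so it illustrates the idea rather than a narrow-build implementation:

    def fold_to_bmp(text):
        # Replace each non-BMP character with a BMP private use character.
        return "".join(_allocate("unicode", ord(c)) if ord(c) > 0xFFFF else c
                       for c in text)

    def unfold_from_bmp(text):
        # Restore the original characters on the way back out.
        return "".join(chr(_inverse[c][1])
                       if _inverse.get(c, ("", 0))[0] == "unicode" else c
                       for c in text)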

Disadvantages:

  1. Unicode codecs will need to be modified, since they need to throw exceptions on private use characters.

  2. Other codecs will need to be modified to take advantage of this handler, since AFAIK currently none of the available handlers can use charset information, so I can't imagine the codecs provide it.

  3. More overhead in exception-handling than James Knight's or Marcin Kowalczyk's proposals.

  4. Applications that know about some private use characters will need to be modified to preallocate those characters before they can take advantage of this handler.

In general, I don't think that the overhead should be weighed very heavily against this proposal. Exception handlers impose a fair amount of overhead anyway, AIUI. Furthermore, any application that cares enough to keep track of the original code points will IMO be hungry for any additional information that can help in exception handling. This proposal provides as much as you can get, short of buffering all the input.

HTH,


