[Python-Dev] transform() and untransform() methods, and the codec registry

Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 9 21:29:35 CET 2010

On Thu, Dec 9, 2010 at 2:17 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Thu, 9 Dec 2010 13:55:08 -0500
> Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:
..
>> This is actually very misleading because
>>
>> >>> 'abc'.transform('rot13')
>> 'nop'
>>
>> works even though 'abc' is not "an object with the buffer interface".
>
> Agreed. It was already pointed out in the parent thread.
> I would say my opinion on keeping transform()/untransform() is +0 if
> these error messages (and preferably the exception type as well) get
> improved. Otherwise we'd better revert them and add a more polished
> version in 3.3.

Error messages are only one of the problems. User confusion over which codec supports which types is another. Why, for example, does rot13 work on str and not on bytes? It only affects the ASCII range, so isn't it natural to expect b'abc'.transform('rot13') to work? Well, presumably this is so because Caesar did not know about bytes and his "cypher" was about character shuffling. In this case, shouldn't it also shuffle other code points assigned to Latin letters? Given how "useful" rot13 is in practice, I feel that it was only added to justify adding str.transform().
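The str-versus-bytes asymmetry can be reproduced without the proposed methods. The sketch below uses codecs.encode(), which reaches the same codec registry in released Python 3 (assumed here to be 3.4 or later); it is a stand-in for, not an implementation of, str.transform(). The variable names are illustrative only.

```python
import codecs

# codecs.encode() stands in for the proposed str.transform() here;
# both go through the same codec registry lookup.
text_result = codecs.encode('abc', 'rot13')  # str in, str out

# The same codec rejects bytes input, although rot13 only touches ASCII.
try:
    codecs.encode(b'abc', 'rot13')
    bytes_accepted = True
except TypeError:
    bytes_accepted = False

# Latin letters outside the ASCII range pass through unshuffled.
accented = codecs.encode('caf\u00e9', 'rot13')

print(text_result)     # nop
print(bytes_accepted)  # False
print(accented)        # 'pns' followed by the unchanged U+00E9
```

The last line shows the inconsistency raised above: 'c', 'a' and 'f' are rotated, while the accented Latin letter is left alone.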

There are other problems raised on the issue and not addressed in the tracker discussion. For example, both Victor and I expressed concern about having builtin methods that do imports behind the scenes. Granted, this issue already exists with the encode/decode methods, but these are usable without providing an explicit encoding and in this form do not have side effects.
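The hidden import can be observed by comparing sys.modules before and after a codec lookup. This is a rough sketch: modules_imported_by_lookup is a hypothetical helper written for this illustration, and codecs.lookup() is used as the visible part of what a method such as str.transform() would have to do internally.

```python
import codecs
import sys

def modules_imported_by_lookup(codec_name):
    """Return the names of modules that appear in sys.modules
    as a result of looking up codec_name (hypothetical helper)."""
    before = set(sys.modules)
    codecs.lookup(codec_name)  # resolves the name via the registry
    return sorted(set(sys.modules) - before)

# May be an empty list if the codec module was already imported
# earlier in the same process.
newly_imported = modules_imported_by_lookup('rot13')

# After the lookup, the codec's implementation module is loaded.
rot13_loaded = 'encodings.rot_13' in sys.modules

print(newly_imported)
print(rot13_loaded)
```

In a fresh interpreter the list would be expected to include the codec's module, which is the side effect the paragraph above objects to.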

Another problem is that with str.transform(), users are encouraged to write programs in which data stored in strings is not always interpreted as Unicode. For example, when I see an 'n' in a string variable, it may actually mean 'a' because it has been scrambled by rot13. Again, rot13 is not a realistic example, but as users are encouraged to create their own string-to-string codecs, we may soon find ourselves in the same mess as we have with 2.x programs trying to support multiple locales.

As far as realistic examples go, Unicode transformations such as case folding, normalization or decimal to ASCII translation have not been considered in the str.transform() design. The str.transform/str.untransform pair may or may not be a good solution for these cases. One obvious issue is that these transformations are often not invertible.
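A small sketch of why an untransform() counterpart is problematic for these cases: each transformation below maps distinct inputs to the same output, so no inverse can recover the original. It assumes Python 3.3 or later (for str.casefold()), and decimal_to_ascii is a hypothetical helper written for this illustration, not an existing API.

```python
import unicodedata

# Case folding: two different spellings collapse to one string.
folded_a = 'Stra\u00dfe'.casefold()  # German sharp s
folded_b = 'STRASSE'.casefold()

# Normalization: the single-character "fi" ligature becomes two letters.
normalized = unicodedata.normalize('NFKC', '\ufb01')

def decimal_to_ascii(s):
    """Replace every Unicode decimal digit with its ASCII equivalent;
    other characters are kept unchanged (hypothetical helper)."""
    return ''.join(str(unicodedata.decimal(ch, ch)) for ch in s)

arabic_indic = decimal_to_ascii('\u0661\u0662\u0663')
plain_ascii = decimal_to_ascii('123')

print(folded_a == folded_b)         # True
print(normalized)                   # fi
print(arabic_indic == plain_ascii)  # True
```

Given only 'strasse', 'fi' or '123', there is no way to tell which of the original spellings produced it.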

I admit I have more questions than answers at this point, but a design that adds the same two methods to three builtin types with very different usage patterns (str, bytes and bytearray) does not seem to be well thought out.

The str type already has 40+ methods, many of which are not well suited to handle the complexities inherent in Unicode. Rather than rushing in two more methods that will prove to be about as useful as str.swapcase(), let's consider actual use cases and come up with a design that will properly address them.
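As one illustration of an existing method that copes poorly with Unicode, str.swapcase() is not its own inverse once full case mappings are involved. The sketch assumes Python 3.3 or later, where the sharp s uppercases to two characters.

```python
original = '\u00df'        # LATIN SMALL LETTER SHARP S
once = original.swapcase()
twice = once.swapcase()
round_trips = (twice == original)

print(once)         # SS
print(twice)        # ss
print(round_trips)  # False
```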
