Issue 1433: marshal roundtripping for unicode

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45774

classification

Title: marshal roundtripping for unicode
Type: Stage:
Components: Unicode Versions: Python 2.5

process

Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Carl.Friedrich.Bolz, gvanrossum, lemburg, loewis
Priority: normal Keywords:

Created on 2007-11-13 10:53 by Carl.Friedrich.Bolz, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg57444 - Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) Date: 2007-11-13 10:53
Marshal does not round-trip unicode surrogate pairs on wide unicode builds: marshal.loads(marshal.dumps(u"\ud800\udc00")) == u'\U00010000'. This is very annoying, because the size of unicode constants differs between the first run of a module and subsequent runs (the later runs use the pyc file).
msg57462 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2007-11-13 18:28
I think this is unavoidable. Depending on whether you happen to be using a narrow or wide unicode build of Python, \Uxxxxxxxx may be turned into a pair of surrogates anyway. It's not just marshal that's not roundtripping; the utf-8 codec has the same issue (and so does the utf-16 codec I presume). You will have to code around it. I think that the alternative would be more painful in other circumstances.
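The codec effect described above can be sketched in modern Python 3, where all builds use the same string representation and the details differ from Python 2.5. This is only an illustration under that assumption, not a reproduction of the original report: it pushes a surrogate pair through the utf-16 codec (using the `surrogatepass` error handler so the encoder accepts the surrogates) and compares the lengths before and after.

```python
# Python 3 sketch; behaviour on a Python 2.5 wide build is described in the
# thread, not reproduced here.
pair = "\ud800\udc00"  # a high surrogate followed by a low surrogate

# "surrogatepass" lets the encoder write the surrogates as UTF-16 code units.
raw = pair.encode("utf-16-le", "surrogatepass")

# A strict decoder sees a valid surrogate pair and yields one code point.
decoded = raw.decode("utf-16-le")

print(len(pair), len(decoded))
print(decoded == "\U00010000")
```

The string that comes back denotes the same code point but is no longer equal to the string that went in, which is the kind of length change the report complains about.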
msg57469 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2007-11-13 19:29
As Guido says: this is by design. The Unicode type doesn't really support storage of surrogates; so don't use it for that.
msg57571 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2007-11-15 22:59
I think you have a wrong understanding of round-tripping. In Unicode it is really irrelevant if you're using a UCS2 surrogate pair or a UCS4 representation to describe a code point. The length of the Unicode representation may change, but the meaning won't, so you don't lose any information.
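The equivalence between a surrogate pair and a single code point follows from the standard UTF-16 arithmetic. A minimal sketch, with hypothetical helper names `combine_surrogates` and `split_code_point` that are not part of any Python API:

```python
def combine_surrogates(high, low):
    """Map a (high, low) surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Map a supplementary code point to its (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine_surrogates(0xD800, 0xDC00)))
print([hex(unit) for unit in split_code_point(0x10000)])
```

With the pair from the report, U+D800 U+DC00 maps to U+10000 and back, so both forms carry the same information even though their lengths differ.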
History
Date User Action Args
2022-04-11 14:56:28 admin set github: 45774
2007-11-15 22:59:20 lemburg set nosy: + lemburg; messages: +
2007-11-13 19:29:27 loewis set status: open -> closed; nosy: + loewis; resolution: wont fix; messages: +
2007-11-13 18:28:40 gvanrossum set nosy: + gvanrossum; messages: +
2007-11-13 10:53:09 Carl.Friedrich.Bolz create