Issue 1433: marshal roundtripping for unicode
This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45774
classification
Title: | marshal roundtripping for unicode | | |
---|---|---|---|
Type: | | Stage: | |
Components: | Unicode | Versions: | Python 2.5 |
process
Status: | closed | Resolution: | wont fix |
---|---|---|---|
Dependencies: | | Superseder: | |
Assigned To: | | Nosy List: | Carl.Friedrich.Bolz, gvanrossum, lemburg, loewis |
Priority: | normal | Keywords: | |
Created on 2007-11-13 10:53 by Carl.Friedrich.Bolz, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (4) | ||
---|---|---|
msg57444 - (view) | Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) * | Date: 2007-11-13 10:53 |
Marshal does not round-trip unicode surrogate pairs on wide unicode builds: marshal.loads(marshal.dumps(u"\ud800\udc00")) == u'\U00010000'. This is very annoying, because the size of unicode constants differs between the first run of a module and subsequent runs (because the later runs use the pyc file). | ||
msg57462 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2007-11-13 18:28 |
I think this is unavoidable. Depending on whether you happen to be using a narrow or wide unicode build of Python, \Uxxxxxxxx may be turned into a pair of surrogates anyway. It's not just marshal that's not roundtripping; the utf-8 codec has the same issue (and so does the utf-16 codec I presume). You will have to code around it. I think that the alternative would be more painful in other circumstances. | ||
msg57469 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2007-11-13 19:29 |
As Guido says: this is by design. The Unicode type doesn't really support storage of surrogates; so don't use it for that. | ||
msg57571 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2007-11-15 22:59 |
I think you have a wrong understanding of round-tripping. In Unicode it is really irrelevant if you're using a UCS2 surrogate pair or a UCS4 representation to describe a code point. The length of the Unicode representation may change, but the meaning won't, so you don't lose any information. |
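The equivalence described in the last message can be made concrete with a little arithmetic. The following is only a minimal sketch, written for Python 3 rather than the Python 2.5 the issue was filed against; the helper names `combine_surrogates` and `split_code_point` are hypothetical and not part of `marshal` or any standard module. It shows that the surrogate pair from the report and the single code point U+10000 denote the same character, and that the utf-16 codec applies the same mapping.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Return the (high, low) surrogate pair for a non-BMP code point.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The pair from the original report maps to U+10000 and back again.
pair = (0xD800, 0xDC00)
code_point = combine_surrogates(*pair)
assert code_point == 0x10000
assert split_code_point(code_point) == pair

# The utf-16 codec performs the same mapping in both directions.
assert chr(code_point).encode("utf-16-be") == b"\xd8\x00\xdc\x00"
assert b"\xd8\x00\xdc\x00".decode("utf-16-be") == "\U00010000"
```

The length of the representation changes (two code units versus one), but the code point recovered is identical, which is the sense in which no information is lost.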
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:28 | admin | set | github: 45774 |
2007-11-15 22:59:20 | lemburg | set | nosy: + lemburg; messages: + |
2007-11-13 19:29:27 | loewis | set | status: open -> closed; nosy: + loewis; resolution: wont fix; messages: + |
2007-11-13 18:28:40 | gvanrossum | set | nosy: + gvanrossum; messages: + |
2007-11-13 10:53:09 | Carl.Friedrich.Bolz | create |