Issue 4329: base64 does not properly handle unicode strings

Created on 2008-11-15 11:11 by mbecker, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg75911	Author: Michael Becker (mbecker)	Date: 2008-11-15 11:11
See below. A unicode string causes an exception. Explicitly converting it to a regular string works around the issue. I only noticed this because my input string changed to unicode after updating Python to 2.6 and Django to 1.0.

>>> import base64
>>> a=u'aHR0cDovL3NvdXJjZWZvcmdlLm5ldC90cmFja2VyMi8_ZnVuYz1kZXRhaWwmYWlkPTIyNTg5MzUmZ3JvdXBfaWQ9MTI2OTQmYXRpZD0xMTI2OTQ='
>>> b=base64.urlsafe_b64decode(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/base64.py", line 112, in urlsafe_b64decode
    return b64decode(s, '-_')
  File "/usr/local/lib/python2.6/base64.py", line 71, in b64decode
    s = _translate(s, {altchars[0]: '+', altchars[1]: '/'})
  File "/usr/local/lib/python2.6/base64.py", line 36, in _translate
    return s.translate(''.join(translation))
TypeError: character mapping must return integer, None or unicode
>>> b=base64.urlsafe_b64decode(str(a))
>>> b
'http://sourceforge.net/tracker2/?func=detail&aid=2258935&group_id=12694&atid=112694'
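For readers on Python 3, the workaround above can be sketched as a small helper that normalizes text input to ASCII bytes before decoding. This is only an illustration: the name `urlsafe_decode_text` is hypothetical and not part of the `base64` module, and the code targets Python 3, not the 2.6 interpreter used in the report.

```python
import base64


def urlsafe_decode_text(value):
    """Decode URL-safe base64 given as text or bytes; returns bytes.

    Hypothetical helper for illustration, not part of the base64 module.
    """
    if isinstance(value, str):
        # Valid base64 text is pure ASCII, so a strict ASCII encode is
        # lossless and raises UnicodeEncodeError on anything else.
        value = value.encode("ascii")
    return base64.urlsafe_b64decode(value)


token = (
    "aHR0cDovL3NvdXJjZWZvcmdlLm5ldC90cmFja2VyMi8_ZnVuYz1kZXRhaWwmYWlkPTIy"
    "NTg5MzUmZ3JvdXBfaWQ9MTI2OTQmYXRpZD0xMTI2OTQ="
)
url = urlsafe_decode_text(token).decode("ascii")
print(url)
```

Doing the text-to-bytes conversion explicitly keeps the choice of encoding visible at the call site instead of relying on an implicit default.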
msg76218	Author: STINNER Victor (vstinner) *	Date: 2008-11-21 23:07
It's not a bug. base64 is a codec that encodes bytes, not characters. You have to encode your unicode string to bytes using a charset.

Example (utf-8):

>>> from base64 import b64encode, b64decode
>>> b64encode(u'a\xe9'.encode("utf-8"))
'YcOp'
>>> unicode(b64decode('YcOp'), "utf-8")
u'a\xe9'
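The same round trip can be sketched for Python 3, where `str` is always unicode and `b64encode` returns bytes. The helper names `encode_text` and `decode_text` are hypothetical and exist only for this illustration; the point is that the charset is an explicit choice and that different charsets yield different base64 output.

```python
import base64


def encode_text(text, charset="utf-8"):
    """Encode text to bytes with the given charset, then to base64 text."""
    raw = text.encode(charset)
    # b64encode returns ASCII-only bytes, so decoding as ASCII is safe.
    return base64.b64encode(raw).decode("ascii")


def decode_text(encoded, charset="utf-8"):
    """Reverse of encode_text: base64 text to bytes, then bytes to text."""
    return base64.b64decode(encoded).decode(charset)


encoded = encode_text("a\xe9")
print(encoded)
print(decode_text(encoded) == "a\xe9")
# The same text under a different charset produces different base64.
print(encode_text("a\xe9", "latin-1"))
```

Both sides of an exchange have to agree on the charset, since base64 itself carries no information about how the bytes should be interpreted as text.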
msg76223	Author: Terry J. Reedy (terry.reedy) *	Date: 2008-11-21 23:30
"This module provides data encoding and decoding as specified in RFC 3548. This standard defines the Base16, Base32, and Base64 algorithms for encoding and decoding arbitrary binary strings into text strings that can be safely sent by email, used as parts of URLs, or included as part of an HTTP POST request."

In other words, arbitrary 8-bit byte strings <=> 'safe' byte strings.

You have to encode unicode to bytes first, as you did. Str works because you only have ascii chars and str uses the ascii encoder by default. The bytes() constructor has no default and 'ascii' must be supplied.

The error message is correct even if backwards. Unicode.translate requires a unicode mapping, whereas b64decode supplies a bytes mapping because it requires bytes. 3.0 added an earlier type check, so the same code gives

TypeError: expected bytes, not str

I believe there was an explicit decision to leave low-level wire-protocol byte functions as bytes/bytearray only. The 3.0 manual needs updating in this respect, but I will start another issue for that.
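The type checks described above can be observed in a current Python 3 interpreter with a small probe. This is a sketch only: `error_name` is a hypothetical helper, and the exact error messages differ between Python versions (the 3.0 wording quoted in the message is not what later releases print), so the code reports exception types and not message text.

```python
import base64


def error_name(func, *args):
    """Call func(*args); return the exception class name, or None on success."""
    try:
        func(*args)
    except Exception as exc:
        return type(exc).__name__
    return None


# b64encode accepts bytes but rejects str outright.
print(error_name(base64.b64encode, b"abc"))
print(error_name(base64.b64encode, "abc"))

# bytes() has no default encoding for str input; one must be supplied.
print(error_name(bytes, "abc"))
print(bytes("abc", "ascii"))

# The ascii codec only works while every character is ASCII.
print(error_name("a\xe9".encode, "ascii"))
```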
msg76441	Author: Michael Becker (mbecker)	Date: 2008-11-25 23:31
Terry,

Thanks for your response. My main concern was that the behavior changed when updating from 2.5 to 2.6. The new behavior was not intuitive. Also, I thought 2.6 was supposed to be backward compatible. Based on this issue, I would assume that is not true when strings are passed to any method that converts them to bytes. Maybe this was documented somewhere in the 2.6 documentation and I simply missed it. Should I have run the 2to3 converter on my 2.5 code prior to updating to 2.6?

Please let me know the new issue number so I can track the progress.

Thanks!
msg76469	Author: Terry J. Reedy (terry.reedy) *	Date: 2008-11-26 15:20
2.6 is, as far as I know, intended to be backwards compatible except where it fixes bugs. Upgrading to 2.6 does (should) not change strings (type str) to unicode. Only importing the appropriate __future__ feature or upgrading to 3.0 will do that. I have no idea what Django does.

The 3 lines of code you posted give exactly the same traceback in my copy of 2.5 as the one you posted.
msg76472	Author: Michael Becker (mbecker)	Date: 2008-11-26 16:50
Terry,

I had a feeling Django had something to do with this. For reference, in my Django code, I did not explicitly declare the string as a unicode string. Django must be importing unicode_literals from __future__ as you suggested. I'll have a closer look there.

Just out of curiosity, would the 2to3 tool have resolved this issue come 3.0? Would it have changed the type to bytes? Or would this issue need to be caught in unit tests?

Thanks!

History
Date	User	Action	Args
2022-04-11 14:56:41	admin	set	github: 48579
2008-11-26 16:50:23	mbecker	set	messages: +
2008-11-26 15:20:58	terry.reedy	set	messages: +
2008-11-25 23:31:40	mbecker	set	messages: +
2008-11-21 23:31:27	terry.reedy	set	resolution: fixed -> not a bug
2008-11-21 23:30:46	terry.reedy	set	nosy: + terry.reedy; messages: +
2008-11-21 23:07:03	vstinner	set	status: open -> closed; resolution: fixed; messages: +; nosy: + vstinner
2008-11-15 11:11:24	mbecker	create