
In testing some existing code with the 2.7 alpha release, I've run into:

  TypeError: Unicode-objects must be encoded before hashing

when the existing code tries to pass unicode objects to hashlib.sha1 and hashlib.md5. This is, I believe, due to changes made for issue 3745:

http://bugs.python.org/issue3745

The issue states the need to reject unencoded strings based on the fact that one backend implementation (openssl) refused to accept them while another (_sha256) assumed a utf-8 encoding. The thing is, I cannot observe any such difference using Python 2.5 or 2.6. Instead of what is shown in the ticket (which was done on Python 3, I believe), this is what I see when I adjust the demo test to use Python 2 syntax for "unencoded strings":

Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import _hashlib

>>> _hashlib.openssl_sha256(u"\xff")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)

>>> import _sha256
>>> _sha256.sha256(u'\xff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)

>>>

(The sample is from Windows because that's the only place I can get import _sha256 to work. The Ubuntu Linux system I tried behaves the same way as above for the _hashlib version, but it doesn't appear to have _sha256 as an option.)

So from what I can see, the behavior wasn't inconsistent from backend to backend in Python 2. Rather, it fell in line with what I'm familiar with: if you pass unicode to some code that only wants bytes, the unicode object gets encoded to a bytestring using the system default encoding. That causes no problems if the data can in fact always be encoded using that encoding, and produces the error above if it can't. Changing these functions to now require the caller to do the encoding explicitly ahead of time strikes me as introducing an inconsistency. Plus it introduces a backwards incompatibility in Python 2.7. Is this really necessary?
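
For concreteness, here is a rough sketch of the kind of wrapper existing code would need if the change stays. The helper name sha1_hex and its encoding parameter are made up for illustration and are not part of hashlib. The "ascii" default is my assumption, chosen to mirror the usual Python 2 default encoding:

```python
import hashlib


def sha1_hex(data, encoding="ascii"):
    """Hash bytes as-is; encode text explicitly first (hypothetical helper)."""
    if not isinstance(data, bytes):
        # "ascii" mirrors the usual Python 2 default encoding (an assumption);
        # pass "utf-8" or another codec if the data may be non-ASCII.
        data = data.encode(encoding)
    return hashlib.sha1(data).hexdigest()


# ASCII-only text hashes the same as the equivalent byte string.
assert sha1_hex(u"abc") == sha1_hex(b"abc")

# Non-ASCII text fails under the ASCII default, just as in the sessions above.
try:
    sha1_hex(u"\xff")
except UnicodeEncodeError:
    pass

# It works once the caller picks an encoding explicitly.
digest = sha1_hex(u"\xff", encoding="utf-8")
```

With the ASCII default this should match the implicit behavior I see in 2.5 and 2.6, but every caller that passes unicode would have to be found and routed through something like it.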

Karen