Issue 13618: bytes.decode() UnicodeEncodeError on Apple iOS (>16-bit) characters
I've searched high and low to find a way to make Python accept Apple's iOS characters, but it looks like Python does not correctly support characters wider than 16 bits. If you look at the leading byte of each group, it's \xf0, indicating a 4-byte UTF-8 sequence, which in turn means a code point above U+FFFF. I've tried all three values of the "errors" argument to decode (ignore, replace, and strict) and still get this error each time:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 140: character maps to <undefined>
So I have no way to proceed short of rolling my own corrected Unicode decoder. My assumption is that Python should convert a character regardless of whether it's found in the internal lookup database, or at a minimum there should be a way to signal Python to do so.
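The lead-byte observation can be checked directly. The following is a minimal sketch, not part of the original report: `utf8_sequence_length` is a hypothetical helper written for illustration, and the input is just the first three emoji groups from the sample. It assumes Python 3.3 or later, where each code point above U+FFFF is a single string element.

```python
# The first three 4-byte groups from the sample bytes string.
sample = b'\xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81'


def utf8_sequence_length(lead_byte: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte.

    Hypothetical helper for illustration; it inspects only the
    high bits of the lead byte and does not validate the sequence.
    """
    if lead_byte < 0x80:          # 0xxxxxxx: single-byte (ASCII)
        return 1
    if lead_byte >> 5 == 0b110:   # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:  # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110: # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


lead_length = utf8_sequence_length(sample[0])

# Decoding itself involves no console and no 'charmap' codec.
text = sample.decode('utf-8')
code_points = [hex(ord(ch)) for ch in text]
```

Here `lead_length` is 4 and every entry in `code_points` is above 0xffff, so the bytes are well-formed UTF-8 for characters outside the 16-bit range.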
Below is a sample bytes string that will reproduce the problem:
b'\n \n \n average-user-rating\n \n \n 1\n \n \n text\n \n \n \xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81\xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81 if you haven't checked this out yet please do. download APP TRAILERS and go to videos use promo code FREE4U and enjoy free apps courtesy of apple MERRY CHRISTMAS \xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81\xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81\n \n \n title\n \n \n 4. IF YOU LOVE FREE STUFF (v1.5)\n \n \n type\n \n \n review\n \n \n user-name\n \n \n Freenesss on Dec 16, 2011\n \n \n \n \n average-user-rating\n \n \n 0.8\n \n \n text\n \n \n This application is very cool .. I hope only be added to the dictionary other languages \xe2\x80\x8b\xe2\x80\x8b..\n \n \n title\n \n \n 8. the dictionary (v1.5)\n \n \n type\n \n \n review\n \n \n user-name\n \n \n Rnaa on Dec 16, 2011\n \n \n \n \n average-user-rating\n \n \n 1\n \n \n text\n \n \n Hey I'm 13 trying to b discovered plz check my 1st video out on you tube its called speak now cover by Bekka burton thnx and I luv luv luv this app\n \n \n title\n \n \n 9. Love this app+check me out on you tube (v1.5)\n \n \n type\n \n \n review\n \n \n user-name\n \n \n Lol\xee\x84\x86 on Dec 16, 2011\n \n \n'
(Obviously, stripped down to not-well-formed XML, but for conversion purposes that's irrelevant.)
I feel foolish now: the error came from calling print() on the result in the same statement as the decode. It was print(), encoding the string for the console, that raised the exception, not bytes.decode(). Sorry about that; next time I'll be more careful to separate every step before reporting.
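Separating the steps makes the source of the error visible. This is a sketch under stated assumptions rather than the reporter's actual code: `cp1252` stands in for whatever limited encoding the console was using, and the `backslashreplace` step is one possible workaround, not something proposed in the report.

```python
raw = b'\xf0\x9f\x8e\x84\xf0\x9f\x8e\x85\xf0\x9f\x8e\x81 MERRY CHRISTMAS'

# Step 1: decode on its own. This succeeds; no exception is raised here.
text = raw.decode('utf-8')

# Step 2: encode for the output device, which is what print() does
# implicitly. 'cp1252' is an assumed stand-in for the console encoding.
console_encoding = 'cp1252'
try:
    text.encode(console_encoding)
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc  # same class of error as in the report

# Step 3 (assumed workaround): escape what the console cannot represent,
# so the result is safe to print on a console using that encoding.
safe_bytes = text.encode(console_encoding, errors='backslashreplace')
safe_text = safe_bytes.decode(console_encoding)
print(safe_text)
```

The "errors" argument passed to decode() has no effect on this failure, because the exception is raised in the later encoding step.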