msg147936 - (view) |
Author: kxroberto (kxroberto) |
Date: 2011-11-19 11:35 |
"unicode" does not seem to be an official encoding name alias. Yet it is quite frequent on the web, and obviously means UTF-8 (search for '"text/html; charset=unicode"' in Google). Chrome and IE display it as UTF-8; Mozilla treats it as ASCII, which mixes up the characters. Should it be added to aliases.py?

--- ./aliases.py
+++ ./aliases.py
@@ -511,6 +511,7 @@
 'utf8' : 'utf_8',
 'utf8_ucs2' : 'utf_8',
 'utf8_ucs4' : 'utf_8',
+ 'unicode' : 'utf_8',
 # uu_codec codec
 'uu' : 'uu_codec', |
|
|
msg147937 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-11-19 11:49 |
Sorry, but it's not obvious that Unicode means UTF-8. |
|
|
msg147938 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2011-11-19 12:03 |
Definitely; this will just serve to create more confusion for beginners over what a Unicode string is: unicodestring.encode('unicode') <- WTF? |
|
|
msg147969 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2011-11-19 20:28 |
Joining the chorus: people who need it in their application will have to add it themselves (monkeypatching the aliases dictionary as appropriate). |
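The monkeypatching mentioned above can be sketched roughly as follows. This is a minimal sketch, not an endorsed recipe: `add_codec_alias` is a hypothetical helper name, its key normalization is simplified, and treating "unicode" as UTF-8 is the application's own assumption. It relies on the standard `encodings.aliases.aliases` dictionary, which the codec search function consults.

```python
import codecs
import encodings.aliases


def add_codec_alias(alias: str, target: str) -> None:
    """Map `alias` to an existing codec module name such as 'utf_8'.

    Hypothetical helper; the normalization below is simplified
    (lowercase, '-' and ' ' become '_').
    """
    key = alias.lower().replace("-", "_").replace(" ", "_")
    encodings.aliases.aliases[key] = target


# Do this once at startup, before the name is first used: the encodings
# package caches failed lookups, which would hide an alias added later.
add_codec_alias("unicode", "utf_8")

info = codecs.lookup("unicode")
encoded = "naïve".encode("unicode")
print(info.name, encoded)
```

The patch is process-wide, so every library in the same interpreter will also start accepting the name.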
|
|
msg148309 - (view) |
Author: kxroberto (kxroberto) |
Date: 2011-11-25 08:22 |
I wonder where this originated: who invented the frequent charset=unicode?

But: "Sorry, but it's not obvious that Unicode means UTF-8."
When I first came across it on the web, I guessed it was UTF-8 without looking. It even sounds colloquially reasonable ;-) And it's right in 99.999% of cases. (UTF-16 is less frequent than this non-canonical "unicode".)

"Definitely; this will just serve to create more confusion for beginners over what a Unicode string is: unicodestring.encode('unicode') <- WTF?"
I guess no Python tutorial writer or encoding menu writer would pose that example. That string comes in on technical paths: web, MIME etc. aliases.py contains many other names which are not canonical. frequency > convenience > alias

"Joining the chorus: people who need it in their application will have to add it themselves (monkeypatching the aliases dictionary as appropriate)."
Those people would first need to be aware of the option: be all-seeing, or all wait for the first bug reports ...

Reverse question: what would be the minus of having this alias? |
|
|
msg148312 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-11-25 11:43 |
Python is not a language written for the web; it's a generic language for programming anything! If you have a problem parsing an HTML page, the special case should be added to the HTML parser, not to the language. Do you have the encoding issue with a parser included in Python (html.parser.*)? If you have the issue with a third-party parser, you have to report the bug there. |
|
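As an illustration of handling the special case in the parsing layer rather than in the codec registry, here is a minimal sketch. The names `CHARSET_OVERRIDES`, `resolve_charset` and `decode_body` are hypothetical, and mapping "unicode" to UTF-8 is a guess that the application chooses to accept, not something the label guarantees.

```python
import codecs

# Hypothetical application-level overrides for labels seen in the wild.
# Treating "unicode" as UTF-8 is an assumption made by this application.
CHARSET_OVERRIDES = {"unicode": "utf-8"}


def resolve_charset(label: str, default: str = "utf-8") -> str:
    """Return the name of a codec Python knows for a declared charset label."""
    name = label.strip().strip("\"'").lower()
    name = CHARSET_OVERRIDES.get(name, name)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown label: fall back instead of failing the whole parse.
        return default


def decode_body(body: bytes, label: str) -> str:
    """Decode a payload using the declared label, tolerating bad bytes."""
    # errors="replace" substitutes U+FFFD if the guess turns out to be wrong.
    return body.decode(resolve_charset(label), errors="replace")
```

Keeping the override table local means the guess only affects input that passes through this code path, and the global alias table stays untouched.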
|
msg148353 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2011-11-25 19:38 |
The mapping "unicode" -> "utf-8" is simply not defined unambiguously, in addition to being factually wrong. For example, when Microsoft talks about Unicode they mean UTF-16. |
|
|
msg148354 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-11-25 19:46 |
> For example, when Microsoft talks about Unicode they mean UTF-16.
Sorry, but UTF-16 is ambiguous: do you mean UTF-16-LE or UTF-16-BE? ;-) |
|
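The byte-order point can be seen directly with Python's built-in codecs. A small sketch using the standard `utf-16-le`, `utf-16-be` and `utf-16` codec names; the variable names are only for illustration.

```python
import codecs

text = "A"

# The same character serializes to different byte sequences
# depending on the byte order.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# Plain "utf-16" prepends a byte order mark (BOM) when encoding and
# uses the BOM to pick the byte order when decoding.
with_bom = text.encode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")

print(little, big, with_bom, from_le, from_be)
```

Without a BOM or an explicit `-le`/`-be` suffix, a decoder has to guess the byte order.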
|
msg148362 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2011-11-25 21:09 |
> Reverse question: what would be the minus of having this alias?
Please accept that this issue is closed. |
|
|