msg54203 - (view) |
Author: Peter Jacobi (peter_jacobi) |
Date: 2004-08-02 09:48 |
As the missing ISO 8859 codecs (11: Thai, 16: Romanian) can be automatically generated from the Unicode mapping files (via gencodec.py), I'd like to ask for their inclusion in the next version. |
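For illustration, a minimal sketch of how a mapping file in this format can be turned into a decoding table. This is not gencodec.py itself: `parse_mapping`, `decode` and the embedded `SAMPLE` lines are assumptions made for the example, written in the tab-separated "byte, code point, # name" layout the unicode.org mapping files use.

```python
def parse_mapping(text):
    """Build a 256-entry table: byte value -> Unicode character, or None if unmapped."""
    table = [None] * 256
    for line in text.splitlines():
        # Everything after '#' is a comment (usually the character name).
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a code point: leave it undefined
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def decode(data, table):
    """Decode bytes with the table, substituting U+FFFD for unmapped bytes."""
    return "".join(table[b] if table[b] is not None else "\ufffd" for b in data)


# Hypothetical excerpt in the style of 8859-11.TXT (not the full file).
SAMPLE = """\
# sample mapping lines
0x41\t0x0041\t#\tLATIN CAPITAL LETTER A
0xA0\t0x00A0\t#\tNO-BREAK SPACE
0xA1\t0x0E01\t#\tTHAI CHARACTER KO KAI
"""

table = parse_mapping(SAMPLE)
print(repr(decode(b"A\xa1", table)))
```

A generated codec module essentially ships such a table as a literal, so the mapping file is only needed at generation time.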
|
|
msg54204 - (view) |
Author: Peter Jacobi (peter_jacobi) |
Date: 2004-08-02 10:16 |
In a thread on news://comp.lang.python I was asked by Martin v. Löwis to provide evidence on the correctness of the ISO 8859-11 Unicode mapping file, as found on ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT (due to the disclaimer boilerplate in these files). So far I can provide these three points:

a) ISO 8859-n vs ISO-8859-n
If the information at http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1 is correct, Python 8859-n codecs do implement the ISO standard charsets ISO 8859-n in the specialized IANA forms ISO-8859-n (and in agreement with the Unicode mapping files). So any difficult C0/C1 wording in the original ISO standard can be disregarded.

b) libiconv ISO 8859-11
The implementation by Bruno Haible in libiconv does agree with the Unicode mapping file: http://cvs.sourceforge.net/viewcvs.py/libiconv/libiconv/lib/

c) IBM ICU4C
The implementation in ICU4C does agree with the Unicode mapping file: http://oss.software.ibm.com/cvs/icu/charset/data/ucm/ |
|
|
msg54205 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2004-08-02 10:30 |
Marc-Andre, should we add these? |
|
|
msg54206 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-02 11:14 |
Martin, I think it's a good idea to add the codecs for completeness. We should probably also review the mapping files posted on the unicode.org site every now and then and update the codecs in Python accordingly. Sticking to the Unicode Consortium's view of things is a good way to ensure compatibility with other applications, IMO. |
|
|
msg54207 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-03 14:34 |
Peter, could you attach the generated codecs to this report? Thanks. |
|
|
msg54208 - (view) |
Author: Peter Jacobi (peter_jacobi) |
Date: 2004-08-03 22:58 |
Attached is the output of gencodec.py for ISO-8859-11 and ISO-8859-16 and, for reference, the original mapping files. Peter |
|
|
msg54209 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-05 11:15 |
Thank you. Please also provide suitable aliases (I couldn't find any on the IANA site), then I'll add them to Python 2.4. |
|
|
msg54210 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2004-08-05 11:41 |
The unfortunate problem is that ISO-8859-11 is not an IANA-registered character set. For ISO-8859-16, http://www.iana.org/assignments/character-sets lists:

Name: ISO-8859-16
MIBenum: 112
Source: ISO
Alias: iso-ir-226
Alias: ISO_8859-16:2001
Alias: ISO_8859-16
Alias: latin10
Alias: l10

I believe ISO-8859-11 does not have any aliases. Some people may claim TIS-620 is an alias, but it is not (as it does not contain \xa0). |
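Both points can be checked against the codecs bundled with a current CPython. The following is only a sketch: it assumes the interpreter ships `iso8859_11`, `iso8859_16` and `tis_620` with the usual alias table, and `can_decode` is a helper invented for the example.

```python
import codecs


def can_decode(data, encoding):
    """Return True if every byte in data is defined in the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The IANA aliases quoted above for ISO-8859-16.
iana_aliases = ["iso-ir-226", "ISO_8859-16:2001", "ISO_8859-16", "latin10", "l10"]
resolved = {alias: codecs.lookup(alias).name for alias in iana_aliases}
for alias, name in resolved.items():
    print(alias, "->", name)

# 0xA0 (NO-BREAK SPACE) is the byte that separates ISO-8859-11 from TIS-620.
for encoding in ("iso8859_11", "tis_620"):
    print(encoding, "decodes 0xA0:", can_decode(b"\xa0", encoding))
```

If the aliases are registered as listed, all five names should resolve to the same codec, and only the ISO codec should accept the 0xA0 byte.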
|
|
msg54211 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-05 12:15 |
I found these references for iso-8859-11:

iso_8859-11:1992 (try searching for this in Google :-)
http://mnogosearch.kn.vutbr.cz/Download/snapshot/mnogosearch32/src/uconv-alias.c
windows-874
http://www.memecode.com/site/ver.php?id=94
thai windows-874 tis-620
iso-8859-11:2001
http://de.wikipedia.org/wiki/ISO_8859-11

The last URL suggests that iso-8859-11 is the same as tis-620, but only the "basis" for windows-874. It also quotes the year 2001 as the last revision of the mapping, which corresponds to the header of the Unicode mapping file.

I think it's safe to add the alias for tis-620 even though the iso mapping has one more character. According to Google that encoding name is much more popular than the iso one. |
|
|
msg54212 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-05 12:33 |
Never mind. I'll also add a proper tis_620.py codec. |
|
|
msg54213 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-08-05 12:44 |
Checking in Misc/NEWS;
/cvsroot/python/python/dist/src/Misc/NEWS,v <-- NEWS
new revision: 1.1073; previous revision: 1.1072
done
Checking in Lib/encodings/aliases.py;
/cvsroot/python/python/dist/src/Lib/encodings/aliases.py,v <-- aliases.py
new revision: 1.27; previous revision: 1.26
done
RCS file: /cvsroot/python/python/dist/src/Lib/encodings/iso8859_11.py,v
done
Checking in Lib/encodings/iso8859_11.py;
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_11.py,v <-- iso8859_11.py
initial revision: 1.1
done
RCS file: /cvsroot/python/python/dist/src/Lib/encodings/iso8859_16.py,v
done
Checking in Lib/encodings/iso8859_16.py;
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_16.py,v <-- iso8859_16.py
initial revision: 1.1
done
RCS file: /cvsroot/python/python/dist/src/Lib/encodings/tis_620.py,v
done
Checking in Lib/encodings/tis_620.py;
/cvsroot/python/python/dist/src/Lib/encodings/tis_620.py,v <-- tis_620.py
initial revision: 1.1
done |
|
|
msg54214 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2004-08-05 13:02 |
Code page 874 differs from the 8859 one in the definition of \x80..\x9f. http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP874.TXT says

0x80 0x20AC #EURO SIGN
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201C #LEFT DOUBLE QUOTATION MARK
0x94 0x201D #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH

I assume the Thai version of Windows is likely to generate "windows-874". Debian offers the th_TH locale, with TIS-620, and a th_TH.UTF-8 locale (i.e. no ISO-8859-11 one).

If ISO 8859-11 is understood as published by ISO (i.e. no control characters at all), then CP 874 is a strict extension (adding C0, plus the characters above).

Google gives these frequencies:

tis-620 16,200
windows-874 7,290
iso-8859-11 5,880 |
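The difference between the two tables can also be listed mechanically. A minimal sketch, assuming the `cp874` and `iso8859_11` codecs that ship with CPython follow the mapping files quoted above; `decode_or_none` and `differences` are helper names made up for the example.

```python
def decode_or_none(byte_value, encoding):
    """Decode a single byte, returning None where the codec leaves it undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def differences(enc_a, enc_b):
    """Map each byte value that the two codecs treat differently to both results."""
    diffs = {}
    for byte_value in range(256):
        a = decode_or_none(byte_value, enc_a)
        b = decode_or_none(byte_value, enc_b)
        if a != b:
            diffs[byte_value] = (a, b)
    return diffs


diffs = differences("cp874", "iso8859_11")
for byte_value, (in_cp874, in_iso) in sorted(diffs.items()):
    # repr() keeps control characters and None readable in the listing.
    print(f"0x{byte_value:02X}  cp874={in_cp874!r}  iso8859_11={in_iso!r}")
```

Bytes that one codec leaves undefined show up as `None`, so the listing separates "mapped differently" from "not mapped at all".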
|
|
msg55192 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2007-08-23 19:31 |
Not sure why this is still open. The patches were checked in a long time ago. |
|
|