Issue 24339: iso6937 encoding missing
Created on 2015-05-31 13:20 by John Helour, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (32)
Author: John Helour (John Helour) *
Date: 2015-05-31 13:20
Please add an encoding for the iso6937 charset. Many set-top boxes (DVB-T/S) and related devices use it for displaying EPG, videotext, etc.
I wrote the encoding/decoding conversion codec some years ago (please look at the attached file).
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2015-05-31 14:31
A new encoding can only be added in a new Python release (3.6).
Author: John Helour (John Helour) *
Date: 2015-06-01 11:20
I've rewritten the iso6937 codec for Python 3.
Could someone check it please?
Author: Julien Palard (mdk) *
Date: 2016-11-13 22:11
Hi John, thanks for your contribution,
Looks like your implementation is missing some codepoints, like "\t":
>>> print("\t".encode(encoding='iso6937'))
[...]
UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)
Probably due to the "range(0x20, "…, why 0x20?
You have problems decoding multibyte sequences because you have no else: … result += chr(c[0]) in this case. So typically decoding \xc2\x20 will raise a KeyError, as \x20 is not in your decoding table.
Also, please make your contribution conform to PEP8: you're missing spaces after commas and you're sometimes indenting with 8 spaces instead of 4.
I implemented a simple checker based on glibc localedata. It clearly shows your decoding problems step by step, and should be easy to extend to check your encoding function too; see the attachment. It uses the ISO6937 charmap typically found in the Debian locales package or via 'apt-get source glibc'.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2016-11-14 08:53
Another comment about coding style: please use \uXXXX hex code representations for the decoding map. The stdlib source code is normally kept ASCII compatible and, for codecs, the Unicode code point numbers make it easier to check the mappings for correctness.
Thanks.
PS: You will also have to sign a contributor agreement: https://www.python.org/psf/contrib/
Author: Marc-Andre Lemburg (lemburg) *
Date: 2016-11-14 11:54
Just as reference, here's the wikipedia page for the encoding:
https://en.wikipedia.org/wiki/ISO/IEC_6937
and this is the ISO document (as preview):
http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf
(from the German wikipedia page).
Author: STINNER Victor (vstinner) *
Date: 2016-11-14 12:03
iso6937.py:
from utf-8 to iso6937
def iso6937_encode(input,errors,encoding_map):
Wait, is this code for Python 3? Decoding from UTF-8 and encoding to ISO-6937 in the same function seems strange to me.
I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.
Since the encoding is non-trivial (multibyte), if we decide to add it, I suggest requiring unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.
--
In general, I would prefer not to embed too many codecs in Python; maintaining these codecs has a small cost.
My rule is rather to only add encodings used (in practice) as locale encodings.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-14 12:27
I think the encoder can just use codecs.charmap_encode(). It seems the decoder could be simpler too.
Would be nice to generate the ISO 6937 encoding file from external data (e.g. from glibc localedata) like they are generated for other encodings. Take Tools/unicode/ files as a pattern.
Tests are required.
A number of lists of encodings should be updated: Doc/library/codecs.rst, Lib/encodings/aliases.py, Lib/locale.py, Lib/test/test_unicode.py, Lib/test/test_codecs.py, Lib/test/test_xml_etree.py.
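For readers unfamiliar with the charmap helpers, here is a minimal sketch of calling codecs.charmap_encode() with a dict mapping. The three-entry table and the names toy_encoding_map and toy_encode are hypothetical and chosen only for illustration; this is not the ISO 6937 table from the attached patch.

```python
import codecs

# Hypothetical toy table: Unicode code point -> byte value.
# It is NOT the ISO 6937 table, just enough entries to show the call.
toy_encoding_map = {0x09: 0x09, 0x41: 0x41, 0xA3: 0xA3}


def toy_encode(text, errors="strict"):
    # charmap_encode returns a (bytes, characters_consumed) tuple.
    return codecs.charmap_encode(text, errors, toy_encoding_map)


encoded, consumed = toy_encode("A\t\u00a3")
print(encoded, consumed)

# Characters missing from the mapping go through the regular error
# handler machinery; with "strict" that means UnicodeEncodeError.
try:
    toy_encode("B")
except UnicodeEncodeError as exc:
    print("unmapped character:", exc.reason)
```

With a table like this the codec module itself needs very little logic, which is why the stdlib charmap codecs are cheap to maintain.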
Author: STINNER Victor (vstinner) *
Date: 2016-11-14 13:05
@Serhiy: Do you think that the encoding is popular enough to pay the price of its maintenance?
It's already possible to register manually a new encoding in an application. It was even already possible in Python 2.7 (and older).
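As a rough sketch of what manual registration looks like, the example below registers a search function with codecs.register(). The codec name x_demo_codec and the helper functions are hypothetical, and the encode/decode bodies simply delegate to ASCII as a placeholder where a real ISO 6937 implementation would go.

```python
import codecs


def _demo_encode(text, errors="strict"):
    # Placeholder logic: a real codec would apply its own tables here.
    return text.encode("ascii", errors), len(text)


def _demo_decode(data, errors="strict"):
    data = bytes(data)
    return data.decode("ascii", errors), len(data)


def _demo_search(name):
    # The codec registry calls this with a normalized (lowercase) name.
    if name == "x_demo_codec":
        return codecs.CodecInfo(
            name="x_demo_codec", encode=_demo_encode, decode=_demo_decode
        )
    return None


codecs.register(_demo_search)

print("abc".encode("x_demo_codec"))
print(b"abc".decode("x_demo_codec"))
```

Once registered, the name works with str.encode(), bytes.decode() and codecs.lookup() for the lifetime of the process.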
Author: Julien Palard (mdk) *
Date: 2016-11-14 13:08
@Serhiy @haypo: Popular enough or not, it may start as a lib on pypi, we'll see its usage from here.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2016-11-14 13:11
On 14.11.2016 13:03, STINNER Victor wrote:
STINNER Victor added the comment:
iso6937.py:
from utf-8 to iso6937
def iso6937_encode(input,errors,encoding_map):
Wait, is this code for Python 3? Decoding from UTF-8 and encoding to ISO-6937 in the same function seems strange to me.
The patch shows the file as UTF-8. In reality, it is decoding from Unicode strings.
I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.
Since the encoding is non-trivial (multibyte), if we decide to add it, I suggest requiring unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.
+1
In general, I would prefer not to embed too many codecs in Python; maintaining these codecs has a small cost.
My rule is rather to only add encodings used (in practice) as locale encodings.
This encoding is used in EPG data of various DVB television formats. As such it is in active use (even though it is very old).
I think "active use" is a better approach than restricting ourselves to only locale encodings, since the latter are slowly converging towards UTF-8 :-)
BTW: Once a charmap style codec is written, there is little change, so the maintenance is minimal. Codecs which include more active logic such as this one are different, of course, and therefore may potentially add more maintenance burden.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-14 15:02
My rule is rather to only add encodings used (in practice) as locale encodings.
Just for reference: , , , .
@Serhiy: Do you think that the encoding is popular enough to pay the price of its maintenance?
Yes, it seems to me that the encoding is still in use. I found questions about decoding from ISO 6937 and implementations for different programming languages.
Author: STINNER Victor (vstinner) *
Date: 2016-11-14 15:39
Ok. I'm now waiting for a simpler patch reusing existing charmap functions to see the complexity of the codec ;-)
Author: John Helour (John Helour) *
Date: 2016-11-25 22:22
Now PEP8 compliant, missing codepoints added, UTF-8 literals rewritten as \uXXXX escapes.
Author: John Helour (John Helour) *
Date: 2016-11-25 22:35
@mdk
Big thanks for the checker.
Looks like your implementation is missing some codepoints, like "\t":
print("\t".encode(encoding='iso6937'))
[...] UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)
The '\t' character is undefined in the iso6937 table, like all chars within the range 0x00 - 0x1F. I don't know how to handle such input for conversion.
Author: Julien Palard (mdk) *
Date: 2016-11-26 13:21
According to https://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf:
NOTE: The shaded positions 00/00 to 01/15 and 07/15 to 09/15 are outside the scope of this International Standard.
So it's clear to me that they are not undefined, they are just described elsewhere.
According to https://en.wikipedia.org/wiki/ISO/IEC_6937:
ISO/IEC 6937:2001, [...] is a multibyte extension of ASCII
Also, the glibc charmap for ISO_6937 defines them:
$ head -n 20 localedata/charmaps/ISO_6937
ISO_6937
%
/
% version: 1.0
% source: ECMA registry and ISO/IEC 6937:1992

% alias ISO-IR-156
% alias ISO_6937:1992
% alias ISO6937
CHARMAP
/x00 NULL (NUL)
/x01 START OF HEADING (SOH)
/x02 START OF TEXT (STX)
/x03 END OF TEXT (ETX)
/x04 END OF TRANSMISSION (EOT)
/x05 ENQUIRY (ENQ)
/x06 ACKNOWLEDGE (ACK)
/x07 BELL (BEL)
/x08 BACKSPACE (BS)
/x09 CHARACTER TABULATION (HT)
Finally, if we're not implementing this range, this means we have no way to encode a new line, which looks highly strange to me, newline being a commonly used character.
But I found no line in the whole ISO/IEC6937 about its ASCII inheritance, I may have just missed it.
Author: John Helour (John Helour) *
Date: 2016-11-26 15:41
If I take the ISO_6937 file as a template for encoding table then increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.
Author: John Helour (John Helour) *
Date: 2016-11-26 15:43
If I take the ISO_6937 file as a template for encoding table then increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2016-11-28 12:40
The codec code has a few (performance) issues:
- nonspacing_diacritical_marks should be a set for fast lookup
- ord(c) in range(0x00, 0xA0) should be rewritten using < and >=
- result += bytes([ord(c)]) has quadratic timing (it copies the whole bytes string for every single operation); better use a bytearray and convert this to bytes in one final step
- the error messages should include more useful information about the cause and location of the error, instead of just UnicodeError("Unacceptable unicode character") and raise KeyError
Please also check whether it's not possible to reuse the charmap codec functions we have. Thanks.
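The points above can be sketched roughly as follows. This is not the code from the attached iso6937.py: the function names are hypothetical, the lead-byte set 0xC1..0xCF is an assumption made for illustration, and the identity mapping for code points below 0xA0 follows the range proposed earlier in this thread.

```python
# Hypothetical set of non-spacing diacritic lead bytes (assumed 0xC1..0xCF),
# kept in a frozenset for fast membership tests.
NONSPACING_LEAD_BYTES = frozenset(range(0xC1, 0xD0))


def is_nonspacing(byte):
    return byte in NONSPACING_LEAD_BYTES


def encode_low_range(text):
    """Encode code points below 0xA0 as single bytes (identity mapping)."""
    out = bytearray()  # grows in place instead of copying on every +=
    for pos, ch in enumerate(text):
        code = ord(ch)
        if 0x00 <= code < 0xA0:  # plain comparisons, no range() object
            out.append(code)
        else:
            # Report the encoding name, the position and a reason.
            raise UnicodeEncodeError(
                "iso6937", text, pos, pos + 1,
                "character outside the range handled by this sketch",
            )
    return bytes(out)  # one conversion at the very end


print(encode_low_range("a\tb\n"))
```

Raising UnicodeEncodeError with start and end positions lets callers and error handlers see exactly which character failed, unlike a bare KeyError.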
Author: John Helour (John Helour) *
Date: 2016-11-29 21:32
Please also check whether it's not possible to reuse the charmap codec functions we have.
I've found nothing useful; maybe you (as the author) can find something really useful which can improve code readability or increase the performance.
Please look at the newest codec version, particularly on line:
tmp += bytearray(encoding_map[c], 'latin1', 'ignore')
It is about extended ascii inheritance. Is it reliable and fast enough?
Author: John Helour (John Helour) *
Date: 2016-11-30 14:46
Please ignore my previous question about: tmp += bytearray(encoding_map[c], 'latin1', 'ignore')
The latest version doesn't need such encoding ...
Author: John Helour (John Helour) *
Date: 2016-12-04 13:17
Performance issue resolved, more info on error added.
I've checked encoding and decoding on two UTF-8 ~3 MiB txt files. Except for the initial BOM mark (may I ignore it?), everything seems to work OK.
Author: Julien Palard (mdk) *
Date: 2016-12-04 16:49
LGTM. For me it's time to release it as a package on PyPI to check the adoption rate and see if it's worth adding it to Python, and maybe close this issue.
Author: Julien Palard (mdk) *
Date: 2017-02-19 15:59
John: You should probably package this as a pip-installable module alongside a git repository, at least to measure the number of interested people and get some feedback / contributions.
Author: Xiang Zhang (xiang.zhang) *
Date: 2017-05-16 03:25
Would you mind converting this patch to a GitHub PR, John?
Author: Julien Palard (mdk) *
Date: 2019-05-06 18:08
For the moment, I'm closing this issue as there's no activity on it; I suspect it may not be that useful.
I may be wrong, so if someone actually needs this, don't hesitate either to put it as a package on PyPI (it should probably go there anyway) or to reopen the issue.
Author: Maarten Derickx (koffie)
Date: 2021-06-23 06:22
Is there any way to contact John Helour? I would still very much like to put this package on GitHub and PyPI, and would like to ask his permission regarding licensing. Or is there some standard open source license under which all code uploaded to https://bugs.python.org/ can automatically be distributed?
https://www.python.org/about/legal/ seems to indicate so, but doesn't mention an explicit license, just the things you can do with it.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2021-06-23 08:08
Maarten, the code posted on bugs is copyrighted by the person who wrote it. We can only accept it for inclusion in Python after the CLA has been signed, since then we are allowed to relicense it.
As a result you can only take John's code and post it elsewhere, if John permits you to do so, since the files don't include a license.
Note: Creating a character map based codec is not hard using gencodec.py from the Tools/unicode/ dir and perhaps some added extra logic.
Author: Maarten Derickx (koffie)
Date: 2021-06-24 17:18
Hi Marc-Andre Lemburg,
Thanks for your reply. I tried using gencodec.py, downloaded from https://github.com/python/cpython/blob/main/Tools/unicode/gencodec.py as you mentioned. However, the code in gencodec.py seems to be in much worse shape than the iso6937.py attached here. The code in gencodec relies on being able to compare integers with tuples. This is caused by the line:
mappings = sorted(map)
hinting that this code has never been run using Python 3.
Providing a decent sort key solves this issue, but after that other issues pop up. For example, there seem to be problems with the handling of 0x-001: items in the mapping that have MISSING_CODE are not handled appropriately, resulting in things like:
0x80: 0x-001
showing up in the generated code.
And then there is the issue that python_mapdef_code has as a side effect that it does 'del map["IDENTITY"]' causing "'IDENTITY' in map" in python_tabledef_code to always evaluate to False even when it should evaluate to True.
The problems above can be observed by just running gencodec.py on https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT .
If gencodec.py was a trustworthy and well maintained piece of code, I would happily use it. However at the moment I don't see it as a valid option since debugging gencodec.py would cost me at least as much time as just writing its output myself instead of generating it. Additionally https://unicode.org/ doesn't seem to provide a mapping file for iso6937.
I do agree that using codecs.charmap_encode and codecs.charmap_decode is a much better solution than the one in iso6937.py. But I don't understand gencodec.py well enough to actually fix it.
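To illustrate the sorting problem described in this message, here is a small sketch with made-up data. The list values and the helper as_tuple are hypothetical and do not reproduce gencodec.py's real data structures; they only show why mixing ints and tuples breaks sorted() on Python 3 and how a sort key avoids it.

```python
# Hypothetical mixed values: an entry is either a single code point (int)
# or a sequence of code points (tuple).
values = [0x41, (0x41, 0x0300), 0x20]

try:
    sorted(values)
except TypeError as exc:
    # Python 3 refuses to order int against tuple.
    print("plain sorted() fails:", exc)


def as_tuple(value):
    """Sort key that wraps bare ints in a 1-tuple so all keys compare."""
    return value if isinstance(value, tuple) else (value,)


ordered = sorted(values, key=as_tuple)
print(ordered)
```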
Author: Maarten Derickx (koffie)
Date: 2021-06-29 13:12
The route via gencodec or more generally via charmap_encode and charmap_decode seems to be one that is not possible without some low level CPython code adjustments. The reason for this is that charmap_encode and charmap_decode only seem to support mappings where a single encoded byte corresponds to multiple unicode points.
However iso6937 is a mixed length encoding, meaning in this specific case that unicode characters sometimes need to be encoded as a single byte and sometimes with two bytes.
For example chr(0x00c0) should be encoded as b'\xc1' + b'A'.
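A minimal sketch of how such a two-byte encoding could be derived, assuming Unicode canonical decomposition (NFD) is an acceptable starting point. The LEAD_BYTE table is a small illustrative subset (only the grave accent value 0xC1 comes from this thread; the others are taken on trust from the published ISO 6937 table), and encode_accented is a hypothetical helper that handles a single character.

```python
import unicodedata

# Illustrative subset: Unicode combining mark -> ISO 6937 non-spacing byte.
LEAD_BYTE = {
    "\u0300": 0xC1,  # grave
    "\u0301": 0xC2,  # acute
    "\u0302": 0xC3,  # circumflex
    "\u0308": 0xC8,  # diaeresis
}


def encode_accented(ch):
    """Encode one character as ASCII or as diacritic byte + base letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) == 1 and ord(decomposed) < 0x80:
        return decomposed.encode("ascii")
    if (len(decomposed) == 2 and ord(decomposed[0]) < 0x80
            and decomposed[1] in LEAD_BYTE):
        base, mark = decomposed
        # ISO 6937 sends the diacritic byte first, then the base letter.
        return bytes([LEAD_BYTE[mark], ord(base)])
    raise UnicodeEncodeError(
        "iso6937", ch, 0, len(ch), "character not covered by this sketch"
    )


print(encode_accented(chr(0x00C0)))
```

A real codec would restrict the accepted combinations to the repertoire defined by the standard instead of accepting any ASCII base letter.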
Author: Marc-Andre Lemburg (lemburg) *
Date: 2021-06-29 14:41
Right, the charmap codec was built with the Unicode Consortium mappings in mind.
However, you may have some luck decoding the two byte chars in ISO 6937 using combining code points in Unicode. With some extra post processing you could also normalize the output into single code points.
If I find time, I may have a look at gencodec.py again and update it to more modern interfaces. I've long given up maintenance of Unicode in Python and only try to help by giving some guidance based on the original implementation design.
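The combining-code-point idea can be sketched as below. This is only an illustration under stated assumptions: COMBINING_MARK is a small subset of the ISO 6937 non-spacing bytes, decode_sketch is a hypothetical name, any ASCII byte is accepted as a base letter, and bytes outside ASCII and that subset are rejected.

```python
import unicodedata

# Illustrative subset: ISO 6937 non-spacing byte -> Unicode combining mark.
COMBINING_MARK = {
    0xC1: "\u0300",  # grave
    0xC2: "\u0301",  # acute
    0xC3: "\u0302",  # circumflex
    0xC8: "\u0308",  # diaeresis
}


def decode_sketch(data):
    data = bytes(data)
    pieces = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            pieces.append(chr(byte))
            i += 1
        elif (byte in COMBINING_MARK and i + 1 < len(data)
                and data[i + 1] < 0x80):
            # ISO 6937 puts the diacritic first; Unicode puts the
            # combining mark after the base character, so swap them.
            pieces.append(chr(data[i + 1]) + COMBINING_MARK[byte])
            i += 2
        else:
            raise UnicodeDecodeError(
                "iso6937", data, i, i + 1, "byte not covered by this sketch"
            )
    # Post-processing step: compose base + mark into single code points.
    return unicodedata.normalize("NFC", "".join(pieces))


print(decode_sketch(b"caf\xc2e"))
```

The NFC step is what turns the intermediate base-plus-combining-mark pairs into precomposed characters where Unicode defines them.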
Author: Maarten Derickx (koffie)
Date: 2021-06-29 15:07
Hi Marc-Andre Lemburg,
Thanks for your responses and guidance. Your pointers to charmap_encode and charmap_decode helped, since they show what the general idea is for dealing with these types of encodings.
In the meantime I have had some success. I wrote some Python code that can create character mappings based on the table in http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf so that we can be sure that there are no human errors in generating the mappings.
I think my further approach is to write pure Python versions of charmap_encode and charmap_decode that can handle the general case of multibyte encodings. This won't be as fast as using the builtins written in C, but at least it gives maintainable and hopefully reusable code.
Maybe later the C implementation can be updated as well.
History
Date | User | Action | Args
2022-04-11 14:58:17 | admin | set | github: 68527
2021-06-29 15:07:49 | koffie | set | messages: +
2021-06-29 14:41:53 | lemburg | set | messages: +
2021-06-29 13:12:35 | koffie | set | messages: +
2021-06-24 17:18:21 | koffie | set | messages: +
2021-06-23 08:08:57 | lemburg | set | messages: +
2021-06-23 06:22:13 | koffie | set | nosy: + koffie; messages: +
2019-05-06 18:08:20 | mdk | set | status: open -> closed; resolution: postponed; messages: +; stage: patch review -> resolved
2017-05-16 03:25:45 | xiang.zhang | set | messages: +; stage: needs patch -> patch review
2017-02-19 15:59:52 | mdk | set | messages: +
2016-12-04 16:49:29 | mdk | set | messages: +
2016-12-04 13:17:35 | John Helour | set | messages: +
2016-12-04 13:15:09 | serhiy.storchaka | set | priority: normal -> low; assignee: serhiy.storchaka
2016-12-04 12:59:05 | John Helour | set | files: + iso6937.py
2016-12-04 12:57:07 | John Helour | set | files: - iso6937.py
2016-12-04 12:56:50 | John Helour | set | files: + check_iso6937.py
2016-12-03 19:36:10 | John Helour | set | files: + iso6937.py
2016-12-03 19:34:46 | John Helour | set | files: - iso6937.py
2016-11-30 14:46:18 | John Helour | set | files: + iso6937.py; messages: +
2016-11-30 14:38:49 | John Helour | set | files: - iso6937.py
2016-11-30 14:24:21 | John Helour | set | files: + iso6937.py
2016-11-30 14:23:19 | John Helour | set | files: - iso6937.py
2016-11-30 14:22:36 | John Helour | set | files: + iso6937.py
2016-11-30 14:19:38 | John Helour | set | files: - iso6937.py
2016-11-29 21:32:36 | John Helour | set | files: + iso6937.py; messages: +
2016-11-28 12:40:36 | lemburg | set | messages: +
2016-11-26 15:43:31 | John Helour | set | files: + iso6937.py; messages: +
2016-11-26 15:41:43 | John Helour | set | messages: +
2016-11-26 15:38:32 | John Helour | set | files: - iso6937.py
2016-11-26 13:21:30 | mdk | set | messages: +
2016-11-25 22:35:21 | John Helour | set | messages: +
2016-11-25 22:22:35 | John Helour | set | files: + iso6937.py; messages: +
2016-11-14 15:39:07 | vstinner | set | messages: +
2016-11-14 15:02:54 | serhiy.storchaka | set | messages: +
2016-11-14 13:11:10 | lemburg | set | messages: +
2016-11-14 13:08:07 | mdk | set | messages: +
2016-11-14 13:05:58 | vstinner | set | messages: +
2016-11-14 12:27:32 | serhiy.storchaka | set | stage: needs patch; messages: +; versions: + Python 3.7, - Python 3.6
2016-11-14 12:03:26 | vstinner | set | nosy: + vstinner; messages: +
2016-11-14 11:54:54 | lemburg | set | messages: +
2016-11-14 08:53:33 | lemburg | set | messages: +
2016-11-14 02:12:46 | xiang.zhang | set | nosy: + xiang.zhang
2016-11-13 22:11:01 | mdk | set | files: + check_iso6937.py; nosy: + mdk; messages: +
2015-06-18 09:26:40 | John Helour | set | files: - iso6937.py
2015-06-05 18:10:54 | John Helour | set | files: + iso6937.py
2015-06-05 18:09:23 | John Helour | set | files: - iso6937.py
2015-06-05 08:45:49 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:33 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:16 | John Helour | set | files: + iso6937.py
2015-06-05 08:36:28 | John Helour | set | files: + iso6937.py
2015-06-01 11:20:10 | John Helour | set | files: + iso6937.py; messages: +
2015-05-31 14:31:35 | serhiy.storchaka | set | nosy: + loewis, serhiy.storchaka, lemburg; messages: +; versions: + Python 3.6, - Python 2.7
2015-05-31 13:20:03 | John Helour | create |