Issue 24339: iso6937 encoding missing

Created on 2015-05-31 13:20 by John Helour, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (32)

msg244538 - (view)

Author: John Helour (John Helour) *

Date: 2015-05-31 13:20

Please add an encoding for the iso6937 charset. Many set-top boxes (DVB-T/S) and related devices use it for displaying EPG, videotext, etc.

I wrote the encoding/decoding conversion codec some years ago (please look at the attached file).

msg244540 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2015-05-31 14:31

A new encoding can only be added in a new Python release (3.6).

msg244576 - (view)

Author: John Helour (John Helour) *

Date: 2015-06-01 11:20

I've rewritten the iso6937 codec for Python 3.

Could someone check it please?

msg280720 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2016-11-13 22:11

Hi John, thanks for your contribution,

Looks like your implementation is missing some codepoints, like "\t":

>>> print("\t".encode(encoding='iso6937'))
[...]
UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)

Probably due to the "range(0x20, "…, why 0x20?

You also have problems decoding multibyte sequences, because there is no else: … result += chr(c[0]) branch in this case. So typically decoding \xc2\x20 will raise a KeyError, as \x20 is not in your decoding table.

Also, please make your contribution conform to PEP 8: you're missing spaces after commas and you sometimes indent with 8 spaces instead of 4.

I implemented a simple checker based on glibc localedata. It clearly shows your decoding problems step by step, and should be easy to extend to check your encoding function too; see the attachment. It uses the ISO6937 charmap typically found in the Debian locales package or via 'apt-get source glibc'.

msg280741 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2016-11-14 08:53

Another comment about coding style: please use \uXXXX hex code representations for the decoding map. The stdlib source code is normally kept ASCII compatible and, for codecs, the Unicode code point numbers make it easier to check the mappings for correctness.

Thanks.

PS: You will also have to sign a contributor agreement: https://www.python.org/psf/contrib/

msg280759 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2016-11-14 11:54

Just as reference, here's the wikipedia page for the encoding:

https://en.wikipedia.org/wiki/ISO/IEC_6937

and this is the ISO document (as preview):

http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf

(from the German wikipedia page).

msg280761 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-14 12:03

iso6937.py:

from utf-8 to iso6937
def iso6937_encode(input,errors,encoding_map):

Wait, is this code for Python 3? Decode from UTF-8 and encode to ISO-6937 in the same function seems strange to me.

I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.

Since the encoding is non-trivial (multibyte), if we decide to add it, I suggest requiring unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.

In general, I would prefer not to embed too many codecs in Python; there is a small cost to maintaining these codecs.

My rule is rather to only add encodings used (in practice) as locale encodings.

msg280765 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-11-14 12:27

I think the encoder can just use codecs.charmap_encode(). It seems the decoder could be simpler too.
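As a rough illustration of the charmap helpers mentioned above, here is a minimal sketch that round-trips a few bytes through codecs.charmap_decode() and codecs.charmap_encode(), the same functions the charmap codecs in Lib/encodings rely on. The three-entry mapping is hypothetical and only for demonstration (0xA3 as the pound sign is an assumption here, not taken from the attached codec).

```python
import codecs

# Hypothetical three-entry table, for illustration only.
# 0xA3 -> POUND SIGN is an assumption, not copied from the attached iso6937.py.
decoding_map = {0x41: "A", 0x42: "B", 0xA3: "\u00a3"}

# Invert the table: Unicode code point -> byte value.
encoding_map = {ord(char): byte for byte, char in decoding_map.items()}

# Both helpers return a (result, number_of_items_consumed) tuple.
text, consumed = codecs.charmap_decode(b"AB\xa3", "strict", decoding_map)
data, length = codecs.charmap_encode(text, "strict", encoding_map)

print(repr(text), consumed)
print(data, length)
```

This only covers single-byte entries; the two-byte sequences of ISO 6937 need extra logic on top.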

It would be nice to generate the ISO 6937 encoding file from external data (e.g. from glibc localedata), as is done for other encodings. Take the files in Tools/unicode/ as a pattern.

Tests are required.

A number of lists of encodings should be updated: Doc/library/codecs.rst, Lib/encodings/aliases.py, Lib/locale.py, Lib/test/test_unicode.py, Lib/test/test_codecs.py, Lib/test/test_xml_etree.py.

msg280770 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-14 13:05

@Serhiy: Do you think that the encoding is popular enough to pay the price of its maintenance?

It's already possible to register manually a new encoding in an application. It was even already possible in Python 2.7 (and older).
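To make the point about manual registration concrete, here is a minimal sketch using codecs.register() and codecs.CodecInfo. The codec name "demo_upper" and its toy behaviour are hypothetical stand-ins; a real ISO 6937 codec would plug its own encode and decode functions into the same structure.

```python
import codecs

CODEC_NAME = "demo_upper"  # hypothetical codec name, for illustration only


def demo_encode(text, errors="strict"):
    # Toy transformation: upper-case the text, then encode it as ASCII.
    return text.upper().encode("ascii", errors), len(text)


def demo_decode(data, errors="strict"):
    raw = bytes(data)
    return raw.decode("ascii", errors), len(raw)


def search(name):
    # Search functions receive the normalized (lower-case) encoding name
    # and return a CodecInfo object, or None if they do not know the name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=demo_encode, decode=demo_decode,
                                name=CODEC_NAME)
    return None


codecs.register(search)

encoded = "epg".encode(CODEC_NAME)
decoded = encoded.decode(CODEC_NAME)
print(encoded, decoded)
```

Once the search function is registered, str.encode() and bytes.decode() find the codec by name like any built-in one.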

msg280771 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2016-11-14 13:08

@Serhiy @haypo: Popular enough or not, it may start as a lib on PyPI; we'll see its usage from there.

msg280773 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2016-11-14 13:11

On 14.11.2016 13:03, STINNER Victor wrote:

> STINNER Victor added the comment:
>
> iso6937.py:
>
> from utf-8 to iso6937
> def iso6937_encode(input,errors,encoding_map):
>
> Wait, is this code for Python 3? Decode from UTF-8 and encode to ISO-6937 in the same function seems strange to me.

The patch shows the file as UTF-8. In reality, it is decoding from Unicode strings.

> I expected that the codec only implements two functions: encode text (unicode) to ISO-6937 (bytes), decode bytes from ISO-6937 to text.
>
> Since the encoding is non-trivial (multibyte), if we decide to add it, I suggest requiring unit tests. I would like to see unit tests on multibyte strings, to check how the error handler is handled.
>
> In general, I would prefer not to embed too many codecs in Python; there is a small cost to maintaining these codecs.
>
> My rule is rather to only add encodings used (in practice) as locale encodings.

This encoding is used in EPG data of various DVB television formats. As such it is in active use (even though it is very old).

I think "active use" is a better criterion than restricting ourselves to only locale encodings, since the latter are slowly converging towards UTF-8 :-)

BTW: Once a charmap style codec is written, there is little change, so the maintenance is minimal. Codecs which include more active logic such as this one are different, of course, and therefore may potentially add more maintenance burden.

msg280779 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-11-14 15:02

> My rule is rather to only add encodings used (in practice) as locale encodings.

Just for reference: , , , .

> @Serhiy: Do you think that the encoding is popular enough to pay the price of its maintenance?

Yes, it seems to me that the encoding is still in use. I found questions about decoding from ISO 6937 and implementations for different programming languages.

msg280783 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-14 15:39

Ok. I'm now waiting for a simpler patch reusing existing charmap functions to see the complexity of the codec ;-)

msg281746 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-25 22:22

PEP8 compliant, added missing codepoints, rewrote UTF-8 literals as \uXXXX.

msg281748 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-25 22:35

@mdk

Big thanks for the checker.

> Looks like your implementation is missing some codepoints, like "\t":
>
> >>> print("\t".encode(encoding='iso6937'))
> [...]
> UnicodeError: encoding with 'iso6937' codec failed (UnicodeError: Unacceptable utf-8 character)

The '\t' character is undefined in the iso6937 table, like all chars within the range 0x00 - 0x1F. I don't know how to handle such input for conversion.

msg281774 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2016-11-26 13:21

According to https://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf:

NOTE: The shaded positions 00/00 to 01/15 and 07/15 to 09/15 are outside the scope of this International Standard.

So it's clear to me that they are not undefined, they are just described elsewhere.

According to https://en.wikipedia.org/wiki/ISO/IEC_6937:

ISO/IEC 6937:2001, [...] is a multibyte extension of ASCII

Also, the glibc charmap for ISO_6937 defines them:

$ head -n 20 localedata/charmaps/ISO_6937
<code_set_name> ISO_6937
<comment_char> %
<escape_char> /
% version: 1.0
% source: ECMA registry and ISO/IEC 6937:1992

% alias ISO-IR-156
% alias ISO_6937:1992
% alias ISO6937
CHARMAP
<U0000> /x00 NULL (NUL)
<U0001> /x01 START OF HEADING (SOH)
<U0002> /x02 START OF TEXT (STX)
<U0003> /x03 END OF TEXT (ETX)
<U0004> /x04 END OF TRANSMISSION (EOT)
<U0005> /x05 ENQUIRY (ENQ)
<U0006> /x06 ACKNOWLEDGE (ACK)
<U0007> /x07 BELL (BEL)
<U0008> /x08 BACKSPACE (BS)
<U0009> /x09 CHARACTER TABULATION (HT)

Finally, if we're not implementing this range, this means we have no way to encode a new line, which looks highly strange to me, newline being a commonly used character.

But I found no line in the whole ISO/IEC6937 about its ASCII inheritance, I may have just missed it.

msg281780 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-26 15:41

If I take the ISO_6937 file as a template for encoding table then increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.

msg281781 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-26 15:43

If I take the ISO_6937 file as a template for encoding table then increasing the range 0x20..0x7f to 0x00..0xA0 is the simplest solution.

msg281869 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2016-11-28 12:40

The codec code has a few (performance) issues:

- nonspacing_diacritical_marks should be a set for fast lookup
- ord(c) in range(0x00, 0xA0) should be rewritten using < and >=
- result += bytes([ord(c)]) has quadratic timing (it copies the whole bytes string for every single operation); better use a bytearray and convert this to bytes in one final step
- the error messages should include more useful information about the cause and location of the error, instead of just UnicodeError("Unacceptable unicode character") and raise KeyError

Please also check whether it's not possible to reuse the charmap codec functions we have. Thanks.
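The review points above can be sketched roughly as follows. This is not the attached codec: the function name encode_low_range, the three-entry NONSPACING_MARKS set, and the error wording are hypothetical, and only the identity-mapped range 0x00..0x9F is handled.

```python
# Hypothetical subset of combining marks, kept in a frozenset for fast
# membership tests.
NONSPACING_MARKS = frozenset("\u0300\u0301\u0302")


def encode_low_range(text):
    """Encode only the identity-mapped range 0x00..0x9F (sketch)."""
    out = bytearray()  # grows in place; no full copy per appended byte
    for pos, char in enumerate(text):
        code = ord(char)
        if 0x00 <= code < 0xA0:  # plain comparisons instead of "in range(...)"
            out.append(code)
        else:
            # Report codec name, offending position and a reason.
            raise UnicodeEncodeError(
                "iso6937", text, pos, pos + 1,
                "character is outside the range handled by this sketch")
    return bytes(out)  # convert to bytes once, in one final step


print(encode_low_range("EPG\ttitle"))
print("\u0301" in NONSPACING_MARKS)
```

Raising UnicodeEncodeError with start and end positions lets callers see where the input failed, rather than getting a bare KeyError.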

msg282048 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-29 21:32

> Please also check whether it's not possible to reuse the charmap codec functions we have

I've found nothing useful; maybe you (as the author) can find something really useful which can improve code readability or increase the performance.

Please look at the newest codec version, particularly at this line:

tmp += bytearray(encoding_map[c], 'latin1', 'ignore')

It is about extended ascii inheritance. Is it reliable and fast enough?

msg282084 - (view)

Author: John Helour (John Helour) *

Date: 2016-11-30 14:46

Please ignore my previous question about: tmp += bytearray(encoding_map[c], 'latin1', 'ignore')

The latest version doesn't need such encoding.

msg282338 - (view)

Author: John Helour (John Helour) *

Date: 2016-12-04 13:17

Performance issue resolved, more info on error added.

I've checked encoding and decoding on two UTF-8 text files of about 3 MiB each. Except for the initial BOM (may I ignore it?), everything seems to work OK.

msg282351 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2016-12-04 16:49

LGTM. For me, it's time to release it as a package on PyPI to check the adoption rate and see if it's worth adding to Python, and maybe close this issue.

msg288144 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2017-02-19 15:59

John: You should probably package this as a pip-installable module alongside a git repository, at least to measure the number of interested people and to get some feedback / contributions.

msg293745 - (view)

Author: Xiang Zhang (xiang.zhang) * (Python committer)

Date: 2017-05-16 03:25

Would you mind converting this patch to a GitHub PR, John?

msg341580 - (view)

Author: Julien Palard (mdk) * (Python committer)

Date: 2019-05-06 18:08

For the moment, I'm closing this issue: as there's no activity on it, I suspect it may not be that useful.

I may be wrong, so if someone actually needs this, don't hesitate either to put it on PyPI as a package (it should probably go there anyway) or to reopen the issue.

msg396381 - (view)

Author: Maarten Derickx (koffie)

Date: 2021-06-23 06:22

Is there any way to contact John Helour? I would still very much like to put this package on github and pypi. And would like to ask him permission for licensing. Or is there some standard open source license under which all code uploaded to https://bugs.python.org/ can automatically be distributed?

https://www.python.org/about/legal/ seems to indicate so, but doesn't mention an explicit license just the things you can do with it.

msg396384 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2021-06-23 08:08

Maarten, the code posted on bugs is copyrighted by the person who wrote it. We can only accept it for inclusion in Python after the CLA has been signed, since then we are allowed to relicense it.

As a result you can only take John's code and post it elsewhere, if John permits you to do so, since the files don't include a license.

Note: Creating a character map based codec is not hard using gencodec.py from the Tools/unicode/ dir and perhaps some added extra logic.

msg396505 - (view)

Author: Maarten Derickx (koffie)

Date: 2021-06-24 17:18

Hi Marc-Andre Lemburg,

Thanks for your reply. I tried using gencodec.py, downloaded from https://github.com/python/cpython/blob/main/Tools/unicode/gencodec.py as you mentioned. However, the code in gencodec.py seems to be in much worse shape than the iso6937.py attached here. The code in gencodec relies on being able to compare integers with tuples. This is caused by the line:

mappings = sorted(map)

hinting that this code has never been run using Python 3.

Providing a decent sort key solves this issue, but after that other issues pop up. For example, there seem to be problems with 0x-001, caused by items in the mapping that have MISSING_CODE not being handled appropriately, resulting in things like:

0x80: 0x-001

showing up in the generated code.
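Both symptoms can be reproduced in isolation. The snippet below is a minimal sketch, not code from gencodec.py: the entries list, the sort_key helper, and the "0x%04x" format are illustrative, and MISSING_CODE = -1 is an assumption about the sentinel value.

```python
MISSING_CODE = -1  # assumption: sentinel value for "no mapping"

# A mapping's keys or values may mix plain ints with tuples of ints.
entries = [0x41, (0x42, 0x43), 0x40]

try:
    sorted(entries)
    mixed_sort_error = None
except TypeError as exc:  # Python 3 refuses to order int against tuple
    mixed_sort_error = exc


def sort_key(value):
    # Wrap bare ints in a 1-tuple so that every key is a tuple.
    return value if isinstance(value, tuple) else (value,)


ordered = sorted(entries, key=sort_key)

# Formatting the sentinel as zero-padded hex gives the odd literal seen above.
formatted_missing = "0x%04x" % MISSING_CODE

print(mixed_sort_error)
print(ordered)
print(formatted_missing)
```

Filtering out or special-casing MISSING_CODE before formatting would avoid emitting the invalid literal.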

And then there is the issue that python_mapdef_code has as a side effect that it does 'del map["IDENTITY"]' causing "'IDENTITY' in map" in python_tabledef_code to always evaluate to False even when it should evaluate to True.

The problems above can be observed by just running gencodec.py on https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT .

If gencodec.py was a trustworthy and well maintained piece of code, I would happily use it. However at the moment I don't see it as a valid option since debugging gencodec.py would cost me at least as much time as just writing its output myself instead of generating it. Additionally https://unicode.org/ doesn't seem to provide a mapping file for iso6937.

I do agree that using codecs.charmap_encode and codecs.charmap_decode is a much better solution than the one in iso6937.py. But I don't understand gencodec.py well enough to actually fix it.

msg396724 - (view)

Author: Maarten Derickx (koffie)

Date: 2021-06-29 13:12

The route via gencodec, or more generally via charmap_encode and charmap_decode, does not seem possible without some low-level CPython code adjustments. The reason for this is that charmap_encode and charmap_decode only seem to support mappings where a single encoded byte corresponds to multiple Unicode code points.

However iso6937 is a mixed length encoding, meaning in this specific case that unicode characters sometimes need to be encoded as a single byte and sometimes with two bytes.

For example chr(0x00c0) should be encoded as b'\xc1' + b'A'.
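A minimal pure-Python sketch of this two-byte case, assuming the accented letter can be split into base letter plus combining mark via Unicode NFD decomposition. The LEAD_BYTE table is a hypothetical two-entry excerpt: the grave accent entry follows the example above, while the acute accent entry is an assumption; the function name encode_char is also made up for illustration.

```python
import unicodedata

# Hypothetical excerpt: combining mark -> ISO 6937 diacritic lead byte.
LEAD_BYTE = {
    "\u0300": 0xC1,  # COMBINING GRAVE ACCENT, as in the example above
    "\u0301": 0xC2,  # COMBINING ACUTE ACCENT (assumption)
}


def encode_char(char):
    """Encode one character as one or two bytes (sketch, not a full codec)."""
    if ord(char) < 0x80:
        return bytes([ord(char)])  # ASCII passes through as a single byte
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) == 2:
        base, mark = decomposed
        if ord(base) < 0x80 and mark in LEAD_BYTE:
            # ISO 6937 puts the diacritic byte before the base letter.
            return bytes([LEAD_BYTE[mark], ord(base)])
    raise UnicodeEncodeError("iso6937", char, 0, 1,
                             "character not covered by this sketch")


print(encode_char(chr(0x00C0)))
print(encode_char("A"))
```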

msg396737 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2021-06-29 14:41

Right, the charmap codec was built with the Unicode Consortium mappings in mind.

However, you may have some luck decoding the two byte chars in ISO 6937 using combining code points in Unicode. With some extra post processing you could also normalize the output into single code points.
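The suggestion above could look roughly like this: decode each two-byte sequence to a base letter followed by a combining code point, then normalize the result to NFC as a post-processing step. The COMBINING table and the function name decode_sketch are hypothetical; only 0xC1 (grave) comes from this thread, and 0xC2 (acute) is an assumption.

```python
import unicodedata

# Hypothetical excerpt: ISO 6937 diacritic lead byte -> combining code point.
COMBINING = {
    0xC1: "\u0300",  # grave accent, matching the b'\xc1' + b'A' example
    0xC2: "\u0301",  # acute accent (assumption)
}


def decode_sketch(data):
    """Decode ASCII plus a few two-byte sequences (sketch, not a full codec)."""
    parts = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            parts.append(chr(byte))
            i += 1
        elif byte in COMBINING:
            if i + 1 >= len(data) or data[i + 1] >= 0x80:
                raise UnicodeDecodeError("iso6937", data, i, i + 1,
                                         "lead byte without an ASCII base letter")
            # Unicode wants the base letter first, then the combining mark.
            parts.append(chr(data[i + 1]) + COMBINING[byte])
            i += 2
        else:
            raise UnicodeDecodeError("iso6937", data, i, i + 1,
                                     "byte not covered by this sketch")
    # Post-processing: fold base + mark pairs into single code points.
    return unicodedata.normalize("NFC", "".join(parts))


print(repr(decode_sketch(b"\xc1A la carte")))
```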

If I find time, I may have a look at gencodec.py again and update it to more modern interfaces. I've long given up maintenance of Unicode in Python and only try to help by giving some guidance based on the original implementation design.

msg396743 - (view)

Author: Maarten Derickx (koffie)

Date: 2021-06-29 15:07

Hi Marc-Andre Lemburg,

Thanks for your responses and guidance. At least your pointers to charmap_encode and charmap_decode helped, since it shows at least what the general idea is on how to deal with these types of encodings.

In the meantime I have had some success. I wrote some Python code that can create character mappings based on the table in http://webstore.iec.ch/preview/info_isoiec6937%7Bed3.0%7Den.pdf so that we can be sure that there are no human errors in generating the mappings.

I think my further approach is to write pure Python versions of charmap_encode and charmap_decode that can handle the general case of multibyte encodings. This won't be as fast as using the builtins written in C, but at least it gives maintainable and hopefully reusable code.

Maybe later the C implementation can be updated as well.

History

Date | User | Action | Args
2022-04-11 14:58:17 | admin | set | github: 68527
2021-06-29 15:07:49 | koffie | set | messages: +
2021-06-29 14:41:53 | lemburg | set | messages: +
2021-06-29 13:12:35 | koffie | set | messages: +
2021-06-24 17:18:21 | koffie | set | messages: +
2021-06-23 08:08:57 | lemburg | set | messages: +
2021-06-23 06:22:13 | koffie | set | nosy: + koffie; messages: +
2019-05-06 18:08:20 | mdk | set | status: open -> closed; resolution: postponed; messages: +; stage: patch review -> resolved
2017-05-16 03:25:45 | xiang.zhang | set | messages: +; stage: needs patch -> patch review
2017-02-19 15:59:52 | mdk | set | messages: +
2016-12-04 16:49:29 | mdk | set | messages: +
2016-12-04 13:17:35 | John Helour | set | messages: +
2016-12-04 13:15:09 | serhiy.storchaka | set | priority: normal -> low; assignee: serhiy.storchaka
2016-12-04 12:59:05 | John Helour | set | files: + iso6937.py
2016-12-04 12:57:07 | John Helour | set | files: - iso6937.py
2016-12-04 12:56:50 | John Helour | set | files: + check_iso6937.py
2016-12-03 19:36:10 | John Helour | set | files: + iso6937.py
2016-12-03 19:34:46 | John Helour | set | files: - iso6937.py
2016-11-30 14:46:18 | John Helour | set | files: + iso6937.py; messages: +
2016-11-30 14:38:49 | John Helour | set | files: - iso6937.py
2016-11-30 14:24:21 | John Helour | set | files: + iso6937.py
2016-11-30 14:23:19 | John Helour | set | files: - iso6937.py
2016-11-30 14:22:36 | John Helour | set | files: + iso6937.py
2016-11-30 14:19:38 | John Helour | set | files: - iso6937.py
2016-11-29 21:32:36 | John Helour | set | files: + iso6937.py; messages: +
2016-11-28 12:40:36 | lemburg | set | messages: +
2016-11-26 15:43:31 | John Helour | set | files: + iso6937.py; messages: +
2016-11-26 15:41:43 | John Helour | set | messages: +
2016-11-26 15:38:32 | John Helour | set | files: - iso6937.py
2016-11-26 13:21:30 | mdk | set | messages: +
2016-11-25 22:35:21 | John Helour | set | messages: +
2016-11-25 22:22:35 | John Helour | set | files: + iso6937.py; messages: +
2016-11-14 15:39:07 | vstinner | set | messages: +
2016-11-14 15:02:54 | serhiy.storchaka | set | messages: +
2016-11-14 13:11:10 | lemburg | set | messages: +
2016-11-14 13:08:07 | mdk | set | messages: +
2016-11-14 13:05:58 | vstinner | set | messages: +
2016-11-14 12:27:32 | serhiy.storchaka | set | stage: needs patch; messages: +; versions: + Python 3.7, - Python 3.6
2016-11-14 12:03:26 | vstinner | set | nosy: + vstinner; messages: +
2016-11-14 11:54:54 | lemburg | set | messages: +
2016-11-14 08:53:33 | lemburg | set | messages: +
2016-11-14 02:12:46 | xiang.zhang | set | nosy: + xiang.zhang
2016-11-13 22:11:01 | mdk | set | files: + check_iso6937.py; nosy: + mdk; messages: +
2015-06-18 09:26:40 | John Helour | set | files: - iso6937.py
2015-06-05 18:10:54 | John Helour | set | files: + iso6937.py
2015-06-05 18:09:23 | John Helour | set | files: - iso6937.py
2015-06-05 08:45:49 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:33 | John Helour | set | files: - iso6937.py
2015-06-05 08:44:16 | John Helour | set | files: + iso6937.py
2015-06-05 08:36:28 | John Helour | set | files: + iso6937.py
2015-06-01 11:20:10 | John Helour | set | files: + iso6937.py; messages: +
2015-05-31 14:31:35 | serhiy.storchaka | set | nosy: + loewis, serhiy.storchaka, lemburg; messages: +; versions: + Python 3.6, - Python 2.7
2015-05-31 13:20:03 | John Helour | create |