Issue 3649: IA5 Encoding should be in the default encodings

Created on 2008-08-22 16:26 by pascal.bach, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name    Uploaded                        Description
ia5.py       pascal.bach, 2008-08-22 16:26   File which implements the Python .encode/.decode methods
Messages (8)
msg71755 - Author: Pascal Bach (pascal.bach) Date: 2008-08-22 16:26
This encoding is used in the GSM standard; it is a 7-bit encoding similar to ASCII. The encoding definition is found in: Short Message Service Centre EMI - UCP Interface 4.6 Specification (p. 79) as well as in: [3GPP 23.038] 3GPP TS 23.038 Alphabets and language-specific information. I think this encoding would be useful for other GSM-specific use cases.
msg71771 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-08-22 19:20
The provided file does not work for "EXTENSION" characters:

>>> import ia5
>>> u"[a]".encode("ia5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ia5.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
TypeError: character mapping must be in range(256)

I doubt this can be achieved with just a charmap. You will have to roll your own incremental stateful decoder. Are you willing to do it?
msg71776 - Author: Pascal Bach (pascal.bach) Date: 2008-08-22 20:49
Well, I have seen the problem. I'm willing to do this to improve Python, but I don't know exactly how to do it. I looked at how utf-8 and utf-7 are done, but I didn't fully understand them. Are they based on C code? Is there an example of how this needs to be done? It would be nice if you could give me some help on where to start.
msg71803 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-08-23 09:06
You could start with utf_8.py, and of course replace the calls to codecs.utf_8_encode and codecs.utf_8_decode.

- your "ia5_encode" follows this interface: http://docs.python.org/dev/library/codecs.html#codecs.Codec.encode
- your "ia5_decode" has the signature:
    def ia5_decode(input, errors='strict', final=False)
  and returns a tuple (output object, length consumed).

See http://docs.python.org/dev/library/codecs.html#codecs.IncrementalDecoder.decode for an explanation of the final parameter; in particular, if the input is a single 0x1B,
- it will return ('', 0) if final is False
- and raise UnicodeDecodeError("unexpected end of data") if final is True
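The decoder behaviour described above can be sketched as follows. This is a minimal sketch written for Python 3, not the code from ia5.py: the names gsm_sketch_decode and IncrementalDecoder, the codec label "gsm-sketch", and the deliberately partial character tables are assumptions made for illustration, and only errors='strict' is handled.

```python
import codecs

ESCAPE = 0x1B  # introduces a two-byte "extension" character

# Deliberately partial tables (an assumption for illustration): ASCII letters,
# digits and space share their ASCII code; only two extension characters
# are listed. A real codec needs the full tables from the specification.
_BASIC = {ord(c): c for c in
          "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "}
_EXTENSION = {0x3C: "[", 0x3E: "]"}


def gsm_sketch_decode(input, errors="strict", final=False):
    """Return (text, bytes consumed); only errors='strict' is implemented."""
    data = bytes(input)
    out = []
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte == ESCAPE:
            if pos + 1 >= len(data):
                if not final:
                    break  # leave the lone escape unconsumed for the next call
                raise UnicodeDecodeError("gsm-sketch", data, pos, pos + 1,
                                         "unexpected end of data")
            ext = data[pos + 1]
            if ext not in _EXTENSION:
                raise UnicodeDecodeError("gsm-sketch", data, pos, pos + 2,
                                         "unknown extension character")
            out.append(_EXTENSION[ext])
            pos += 2
        elif byte in _BASIC:
            out.append(_BASIC[byte])
            pos += 1
        else:
            raise UnicodeDecodeError("gsm-sketch", data, pos, pos + 1,
                                     "byte not in the partial table")
    return "".join(out), pos


class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    # BufferedIncrementalDecoder keeps the unconsumed bytes between calls.
    def _buffer_decode(self, input, errors, final):
        return gsm_sketch_decode(input, errors, final)
```

Feeding the bytes of an extension character in two separate chunks then works, because the lone 0x1B stays in the buffer until the next call to decode().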
msg71845 - Author: Pascal Bach (pascal.bach) Date: 2008-08-24 17:38
I have looked at utf_8.py and I think I know how to implement the incremental de/encoder. But I don't understand the codecs.register() function. Do I have to provide stateless, stateful and streamwriter at the same time? If I implement IncrementalEncoder and IncrementalDecoder can I just give those two to codecs.register()? Thank you for your help.
msg71887 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-24 21:52
I don't think this codec should be named IA-5. IA-5 is specified in ITU-T Rec. T.50 (International Alphabet No. 5), recently renamed to "International Reference Alphabet", and it does *not* specify that the characters 0..31 are printable. Instead, IA5 is identical to ISO 646 (i.e. allowing for national variants), and the International Reference Version of IA5 (e.g. as used in ASN.1 IA5String) is identical to US-ASCII.

If GSM uses a modified version of this, it should receive a separate name. If you were looking at section 2 (Structure of EMI messages), what makes you think that this specification calls the encoding "IA5"? In my copy, it says:

# Alphanumeric characters are encoded as two numeric IA5 characters,
# the higher 3 bits (0..7) first, the lower 4 bits (0..F) thereafter,
# according to the following table.

So it *uses* IA5 to hex-encode the encoding. To achieve that, one would have to write

text.encode("emi-section-2").encode("hex")

[Notice that the "hex" codec already uses IA-5]

In any case, I don't think this is general enough to deserve inclusion in the standard library. The codec system is designed to be flexible enough to support additional codecs outside the core.
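The quoted rule (higher 3 bits first, lower 4 bits thereafter) can be illustrated with a short Python 3 sketch. The function name emi_hex is hypothetical, and the use of upper-case hex digits is an assumption based on the "0..F" wording in the quote, not something verified against the specification.

```python
def emi_hex(codes: bytes) -> str:
    """Write each 7-bit code as two characters: high 3 bits, then low 4 bits.

    Hypothetical helper for illustration; upper-case digits are an assumption.
    """
    digits = "0123456789ABCDEF"
    out = []
    for value in codes:
        if value > 0x7F:
            raise ValueError("not a 7-bit code: %#x" % value)
        out.append(digits[value >> 4])    # higher 3 bits, 0..7
        out.append(digits[value & 0x0F])  # lower 4 bits, 0..F
    return "".join(out)


# The two-byte sequence 0x1B 0x3C becomes four characters.
print(emi_hex(b"\x1b\x3c"))  # prints 1B3C
```

For 7-bit input the result matches what bytes.hex() returns in Python 3, apart from letter case.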
msg71934 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-08-25 15:10
I think what you're after is the encoding used in SMS messages:

http://en.wikipedia.org/wiki/Short_message_service

Here's an old discussion about this codec:

http://mail.python.org/pipermail/python-list/2002-October/167267.html
http://mail.python.org/pipermail/python-list/2002-October/167271.html

Note that nowadays, SMSCs and interface software such as Kannel typically accept UTF-16 data just fine, so the need for such a codec in Python is minimal.

I agree with Martin that the stdlib is not the right place for such a codec. It's easy to write your own codec package and have your application register this package at startup time using codecs.register().
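Registering an application-level codec with codecs.register() might look like the sketch below (Python 3). Everything named here is an assumption for illustration: the codec name "gsmsketch", the helper functions, and the deliberately partial character tables. codecs.CodecInfo requires only the stateless encode and decode functions; the incremental and stream classes are optional arguments and are omitted in this sketch.

```python
import codecs

NAME = "gsmsketch"  # hypothetical codec name, chosen without hyphens or spaces
ESCAPE = 0x1B

# Deliberately partial tables (assumption): characters whose code equals ASCII,
# plus two extension characters that are sent as ESCAPE followed by one byte.
_BASIC = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "
_EXT_ENC = {"[": 0x3C, "]": 0x3E}
_EXT_DEC = {code: ch for ch, code in _EXT_ENC.items()}


def sketch_encode(input, errors="strict"):
    """Stateless encoder; only errors='strict' is implemented."""
    out = bytearray()
    for pos, ch in enumerate(input):
        if ch in _BASIC:
            out.append(ord(ch))
        elif ch in _EXT_ENC:
            out += bytes((ESCAPE, _EXT_ENC[ch]))
        else:
            raise UnicodeEncodeError(NAME, input, pos, pos + 1,
                                     "character not in the partial table")
    return bytes(out), len(input)


def sketch_decode(input, errors="strict"):
    """Stateless decoder; a trailing lone escape is an error here."""
    data = bytes(input)
    out = []
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if byte == ESCAPE:
            code = data[pos + 1] if pos + 1 < len(data) else None
            if code not in _EXT_DEC:
                raise UnicodeDecodeError(NAME, data, pos,
                                         min(pos + 2, len(data)),
                                         "invalid or truncated escape sequence")
            out.append(_EXT_DEC[code])
            pos += 2
        elif chr(byte) in _BASIC:
            out.append(chr(byte))
            pos += 1
        else:
            raise UnicodeDecodeError(NAME, data, pos, pos + 1,
                                     "byte not in the partial table")
    return "".join(out), len(data)


def _search(name):
    # Search functions receive the normalized (lower-case) encoding name
    # and return None for names they do not handle.
    if name == NAME:
        return codecs.CodecInfo(encode=sketch_encode, decode=sketch_decode,
                                name=NAME)
    return None


codecs.register(_search)  # typically done once, at application startup
```

After registration, "[a]".encode("gsmsketch") and the matching bytes.decode("gsmsketch") go through the functions above.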
msg71939 - Author: Pascal Bach (pascal.bach) Date: 2008-08-25 15:31
I currently use the codec in my ucplib already and this is not a problem. I just thought that it might be useful for somebody else, but maybe it is too use-case specific. If this codec is not of general interest, I think this report can be closed.
History
Date User Action Args
2022-04-11 14:56:38 admin set github: 47899
2008-08-25 15:31:47 pascal.bach set messages: +
2008-08-25 15:11:00 lemburg set status: open -> closed; nosy: + lemburg; resolution: rejected; messages: +
2008-08-24 21:52:17 loewis set nosy: + loewis; messages: +
2008-08-24 19:05:14 pitrou set priority: normal; versions: + Python 3.1, Python 2.7, - Python 2.5
2008-08-24 17:38:11 pascal.bach set messages: +
2008-08-23 09:06:23 amaury.forgeotdarc set messages: +
2008-08-22 20:49:30 pascal.bach set messages: +
2008-08-22 19:20:25 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: +
2008-08-22 16:26:46 pascal.bach create