Issue 1448490: Conversion error for latin1 characters with iso-2022-jp-2

There seem to be errors when reading a text file encoded in ISO-2022-JP-2 with the codecs module. In all my test cases, accented latin1 characters (e.g. e acute) are missing from the output string. However, if I convert the file manually using iconv, everything comes out right. Here is a simple script that illustrates the problem:

###########################################

import codecs

import pygtk
import gtk

f = codecs.open( "test.iso-2022-jp-2" , "r" , "iso-2022-jp-2" )
s1 = f.readline().strip()
f.close()

f = open( "test.utf-8" , "r" )
s2 = f.readline().strip()

pack = gtk.VBox()
pack.pack_start( gtk.Label( s1 ) )
pack.pack_start( gtk.Label( s2 ) )

window = gtk.Window( gtk.WINDOW_TOPLEVEL )
window.add( pack )
window.show_all()

def event_destroy( widget , event , data ) :
    gtk.main_quit()
    return 0

window.connect( "delete_event" , lambda w,e,d: False , None )
window.connect( "destroy" , event_destroy , None )

gtk.main()

###########################################

I have attached the file "test.iso-2022-jp-2". To create the UTF-8 version of the file, I used the following shell command:

iconv -f ISO-2022-JP-2 -t UTF-8 test.iso-2022-jp-2 > test.utf-8

When running this script, I would expect a window showing the same label twice. However, the first label is missing the e acute.
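The same comparison can be made without GTK or the attached file. What follows is only a minimal sketch, assuming a Python 3 interpreter. The sample string and the `roundtrip` helper are made up for illustration and are not taken from the attachment. It encodes a string containing e acute with the codec, decodes the bytes again, and checks whether the accent survives.

```python
CODEC = "iso-2022-jp-2"


def roundtrip(text, codec=CODEC):
    """Encode text with the codec, then decode the resulting bytes back."""
    raw = text.encode(codec)
    return raw, raw.decode(codec)


# Hypothetical sample; "\u00e9" is e acute.
sample = "caf\u00e9"
raw, decoded = roundtrip(sample)

print(repr(raw))      # the escape sequences produced by the encoder
print(repr(decoded))  # what the decoder recovers from those bytes
print("accent preserved:", decoded == sample)
```

If the decoder drops the accented character as described in this report, the last line prints False.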

-- Francois