[I18n-sig] Re: [Python-Dev] Unicode debate
Fredrik Lundh <effbot@telia.com>
Tue, 2 May 2000 08:59:03 +0200
- Previous message: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate
- Next message: [I18n-sig] Re: [Python-Dev] Unicode debate
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Neil Hodgson <nhodgson@bigpond.net.au> wrote:
I'm dropping in a bit late in this thread but can the current problem be summarised in an example as "how is 'literal' interpreted here"?

    s = aUnicodeStringFromSomewhere
    DoSomething(s + "")
nope. the whole discussion centers around what happens if you type:
    # example 1
    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere
    DoSomething(s + u)
and
    # example 2
    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere
    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"
in Guido's design, the first example may or may not result in a "UTF-8 decoding error: unexpected code byte" exception. the second example may result in a similar error, print "true", or print "not true", depending on the contents of the 8-bit string.
(under the counter proposal, the first example will never raise an exception, and the second will always print "true")
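to make the difference concrete, here is a rough sketch in present-day Python 3 syntax, where the implicit 8-bit-to-unicode coercion has to be written out by hand. the helper names concat_utf8 and concat_latin1 are made up for illustration, and mapping the counter proposal onto a Latin-1 decode (one byte, one code point) is an assumption of this sketch, not a statement about either implementation:

```python
def concat_utf8(s: bytes, u: str) -> str:
    """Mimic the UTF-8 design: decode the 8-bit string as UTF-8 first."""
    return s.decode("utf-8") + u


def concat_latin1(s: bytes, u: str) -> str:
    """Mimic the counter proposal: every byte maps to exactly one code point."""
    return s.decode("latin-1") + u


u = "abc"
samples = [
    b"xyz",       # plain ASCII: both designs agree
    b"\xc3\xa9",  # valid UTF-8 for U+00E9: two bytes, one character
    b"\xe9",      # a lone high byte: not valid UTF-8
]

for s in samples:
    # example 2 under the UTF-8 design: error, True, or False
    try:
        utf8_outcome = len(u) + len(s) == len(concat_utf8(s, u))
    except UnicodeDecodeError:
        utf8_outcome = "error"
    # example 2 under the byte-per-character design
    latin1_outcome = len(u) + len(s) == len(concat_latin1(s, u))
    print(repr(s), utf8_outcome, latin1_outcome)
```

the three samples cover the three outcomes described above for the UTF-8 design, while the byte-per-character version never fails and always preserves the length.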
...
the string literal issue is a slightly different problem.
The two options being that literal is either assumed to be encoded in Latin-1 or UTF-8. I can see some arguments for both sides.
better make that "two options", not "the two options" ;-)
a more flexible scheme would be to borrow the design from XML (see http://www.w3.org/TR/1998/REC-xml-19980210). for those who haven't looked closer at XML, it basically treats the source file as an encoded unicode character stream, and does all processing on the decoded side.
replace "entity" with "script file" in the following excerpts, and you get close:
section 2.2:
A parsed entity contains text, a sequence of characters,
which may represent markup or character data.
A character is an atomic unit of text as specified by
ISO/IEC 10646.
section 4.3.3:
Each external parsed entity in an XML document may
use a different encoding for its characters. All XML
processors must be able to read entities in either
UTF-8 or UTF-16.
Entities encoded in UTF-16 must begin with the Byte
Order Mark /.../ XML processors must be able to use
this character to differentiate between UTF-8 and
UTF-16 encoded documents.
Parsed entities which are stored in an encoding other
than UTF-8 or UTF-16 must begin with a text declaration
containing an encoding declaration.
(also see appendix F: Autodetection of Character Encodings)
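the rules quoted from section 4.3.3 can be sketched roughly as below, in present-day Python 3. this is a simplified illustration only: the names sniff_encoding and decode_entity are hypothetical, the regular expression is a loose approximation of an encoding declaration, and the many extra cases in appendix F (UCS-4, EBCDIC, a UTF-8 byte order mark, and so on) are ignored.

```python
import codecs
import re

# Loose match for a declaration such as:
#   <?xml version="1.0" encoding="iso-8859-1"?>
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an entity from its first bytes."""
    # A byte order mark signals UTF-16; the "utf-16" codec reads the
    # mark itself to pick the byte order and then drops it.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Anything other than UTF-8/UTF-16 has to declare itself up front.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No mark and no declaration: assume UTF-8.
    return "utf-8"


def decode_entity(data: bytes) -> str:
    """Decode the raw bytes so that all later processing sees characters."""
    return data.decode(sniff_encoding(data))


if __name__ == "__main__":
    raw = b'<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'
    print(sniff_encoding(raw))
    print(decode_entity(raw))
```

the point of the design is that only this first step ever looks at bytes. everything after decode_entity works on decoded characters, whatever encoding the file was stored in.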
I propose that we adopt a similar scheme for Python -- but not in 1.6. the current "dunno, so we just copy the characters" is good enough for now...