[Python-3000] Pre-PEP: Easy Text File Decoding

Paul Prescod paul at prescod.net
Sun Sep 10 05:29:05 CEST 2006


PEP: XXX
Title: Easy Text File Decoding
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Sep-2006
Post-History: 09-Sep-2006
Python-Version: 3.0

Abstract

Python 3000 will use Unicode as the standard string type. This means that text files read from disk will be "decoded" into Unicode code points just as binary files might be decoded into integers and structures. This change brings a few issues to the fore that were previously ignorable.

For example, in Python 2.x, it was possible to open a text file, read the data into a Python string, filter some lines and print the remaining lines to the console without ever considering what "encoding" the text was in. In Python 3000, the programmer will only get access to Python's powerful string manipulation functions after decoding the data to Unicode code points. This means that either the programmer or the Python runtime must select a decoding algorithm (by naming the encoding algorithm that was used to encode the data in the first place).
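As a minimal illustration of the decoding step described above (the sample text and the encodings chosen are assumptions made for this example, not part of the proposal), the same bytes yield different code points depending on which encoding is named:

```python
# Bytes as they might arrive from a file opened in binary mode.
raw = "café\n".encode("utf-8")

# Decoding requires naming the encoding that produced the bytes.
text = raw.decode("utf-8")

# Naming the wrong encoding does not necessarily fail: latin-1 accepts
# every byte value, so it silently produces different code points.
wrong = raw.decode("latin-1")

print(repr(text))
print(repr(wrong))
```

The second result is the kind of silent mis-decoding that the rest of this PEP tries to help the programmer avoid.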

Often the programmer can do so based upon out-of-band knowledge ("this file format is always UCS-2" or "the protocol header says that this data is latin-1"). In other cases, the programmer may be more naive or simply wish to avoid thinking about it and would rather defer the issue to Python.

This document presents a proposal for algorithms and APIs that Python can use to simplify the programmer's life.

Issues outside the scope of this PEP

Any programmer who wishes to take direct control of the encoding selection may of course ignore the features described in this PEP and choose an encoding explicitly. The PEP is not intended to constrain them in any way.

Bytes received through means other than the file system are not addressed by this PEP. For example, the PEP does not address data directly read from a socket or returned from marshal functions.

Rationale

The simplest possible use case for Python text processing involves a user maintaining some form of simple database (e.g. an address book) as a text file and processing it with Python. Unfortunately, this use case is not as simple as it should be because of the variety of encodings in the universe. For example, the file might be UTF-8, ISO-8859-1 or ISO-8859-2.

Professional programmers making widely distributed programs probably have no alternative but to deal with this variability head-on. But programmers working with data that originates and resides primarily on their own computer might wish to avoid dealing with it. They would like Python to just "try to do the right thing" with respect to the file. They would like to think about encodings if and only if Python failed to guess appropriately.

Proposal

The function to open a text file will tentatively be called textfile(), though the function name is not an integral part of this PEP. The function takes three arguments: the filename, the mode ("r", "w", "r+", etc.) and the type.

The type could be a true encoding or one of a small set of additional symbolic values. The two main symbolic values are:

end?). This sample will likely be on the order of thousands of bytes.

Other symbolic values might allow the programmer to suggest specific encoding detection algorithms like XML [#XML-encoding-detection], HTML [#HTML-encoding-detection] and the "coding:" comment convention. These would be specified in separate PEPs.
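A rough sketch of how such a function might behave is shown below. Everything in it is an assumption made for illustration: the parameter name `type_`, the 4096-byte sample size, the error type, and in particular the guessing strategy (byte-order mark first, then strict UTF-8), which the PEP does not specify. It is built on the existing `open()` rather than on any real `textfile()` API.

```python
import codecs
import os
import tempfile

SAMPLE_SIZE = 4096  # assumption: a sample "on the order of thousands of bytes"

# Byte-order marks recognised by this illustrative guesser (UTF-32 is ignored).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(sample):
    """Hypothetical guesser: look for a BOM, then try strict UTF-8."""
    for bom, name in _BOMS:
        if sample.startswith(bom):
            return name
    try:
        # An incremental decoder tolerates a multi-byte sequence that is
        # cut off at the end of the sample.
        codecs.getincrementaldecoder("utf-8")("strict").decode(sample)
    except UnicodeDecodeError:
        raise ValueError("could not guess the encoding from the sample") from None
    return "utf-8"


def textfile(filename, mode, type_):
    """Hypothetical textfile(): type_ is an encoding name or "guess"."""
    if type_ == "guess":
        if "r" not in mode:
            raise ValueError('"guess" needs an existing file to sample')
        with open(filename, "rb") as f:
            sample = f.read(SAMPLE_SIZE)
        type_ = guess_encoding(sample)
    return open(filename, mode, encoding=type_)


# Usage: write a UTF-8 file, then let the sketch guess its encoding.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "addresses.txt")
    with open(path, "wb") as f:
        f.write("Zoë, 12 Rue de l'Église\n".encode("utf-8"))
    with textfile(path, "r", "guess") as f:
        guessed = f.encoding
        content = f.read()
```

A true encoding name passed as the third argument goes straight through to the underlying open call, so explicit control is never taken away from the programmer.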

The Site Decoding Hook

The "sys" module could have a function called "setdefaultfileencoding". The encoding specified could be a true encoding name or one of the encoding detection scheme names (e.g. "guess" or "XML").

In addition, it should be possible to register new encoding detection schemes using a method like "sys.registerencodingdetector". This function would take two arguments, a string and a callable. The callable would accept a byte stream argument and return a text stream. The contract for these detection scheme implementations must allow them to peek ahead some bytes to use the content as a hint to the encoding.

Alternatives and Open Issues

  1. Guido proposes that the function be called merely "open". His proposal is that the binary open should be the alternative and should be invoked explicitly with a "b" mode switch. The PEP author feels, first, that changing the behaviour of an existing function is more confusing and disruptive than creating another. Backporting a change to the "open" function would be difficult, and it would therefore be unnecessarily difficult to create file-manipulating libraries that work on both Python 2.x and 3.x.

Second, the author feels that "open" is an unnecessarily cryptic name rooted only in Unix/C history. For a programmer coming from (for example) Javascript, open() would tend to imply "open window". The PEP author believes that factory functions should say what they are creating.

  2. There is substantial disagreement on the behaviour of the function when there is no encoding argument passed and no site override (i.e. the out-of-box default). Current proposals include ASCII (on the basis that it is a nearly universal subset of popular encodings), UTF-8 (on the basis that it is the dominant global standard encompassing all of Unicode), a locale-derived encoding (on the basis that this is what a naive user will generate in a text editor) or the guessing algorithm (on the basis that it is by definition designed to guess right more often than any more specific encoding name).

The PEP author strongly advocates a strict encoding like ASCII, UTF-8 or no default at all (in which case the lack of an encoding would raise an exception). A default like iso-8859-1 (even inferred from the environment) will result in encodings like UTF-8, UCS-2 and even binary files being "interpreted" as gibberish strings. This could result in document or database corruption. An encoding with a "guess" default will encourage the widespread creation of very unreliable code.

The current proposal is to have no out-of-box default until some point in the future when a small set of auto-detectable encodings is globally dominant. UTF-8 has gradually been gaining popularity through W3C and other standards, so it is possible that five years from now it will be the "no-brainer" default. Until we can guess with substantial confidence, the absence of both an encoding declaration and a site override should result in a raised exception.
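The argument for a strict default can be illustrated with a short sketch. The sample string, the helper names `decodes_cleanly` and `resolve_encoding`, and the choice of `ValueError` are assumptions made for this example; the PEP does not say which exception would be raised.

```python
# UTF-8 bytes for a string containing non-ASCII characters.
data = "Łódź".encode("utf-8")

# A permissive single-byte codec accepts any byte sequence, so decoding
# with the wrong encoding goes unnoticed and produces gibberish.
garbled = data.decode("iso-8859-1")


def decodes_cleanly(raw, encoding):
    """Return True if raw decodes under a strict codec, else False."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def resolve_encoding(explicit=None, site_default=None):
    """Hypothetical rule: explicit argument, then site override, else fail."""
    if explicit is not None:
        return explicit
    if site_default is not None:
        return site_default
    raise ValueError("no encoding given and no site default configured")


print(repr(garbled))
print(decodes_cleanly(data, "ascii"))
print(decodes_cleanly("hi".encode("utf-16"), "utf-8"))
```

Strict codecs such as ASCII and UTF-8 reject data they were not meant for, which turns potential silent corruption into an error the programmer sees immediately.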

References

.. [#XML-encoding-detection] XML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing

.. [#HTML-encoding-detection] HTML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing

Copyright

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:


