[Python-Dev] [python] Re: New lines, carriage returns, and Windows (original) (raw)

Michael Foord fuzzyman at voidspace.org.uk
Sat Sep 29 17:30:26 CEST 2007


Guido van Rossum wrote:

[snip..] Python does have its own I/O model. There are binary files and text files. For binary files, you write bytes and the semantic model is that of an array of bytes; byte indices are seek positions.

For text files, the contents is considered to be Unicode, encoded as bytes in a binary file. So text file always has an underlying binary file. Two translations take place, both of which have defaults varying by platform. One translation is encoding Unicode text into bytes upon output, and decoding bytes to Unicode text upon input. This can use any encoding supported by the encodings package. The other translation deals with line endings. Upon input, any of \r\n, \r, or \n is translated to a single \n by default (this is nhe "universal newlines" algorithm from Python 2.x). This can be tweaked or disabled. Upon output, \n is translated into a platform specific string chosen from \r\n, \r, or \n. This can also be disabled or overridden. Note that \r, when written, is never treated specially; if you want special processing for \r on output, you can write your own translation layer. So the question is, that when a string containing '\r\n' is written to a file in text mode on a Windows platform, should it be written with the encoded representation of '\r\n' or '\r\r\n'?

Purity would dictate the latter and practicality the former (IMO)...

However, that would mean that round tripping a string would change it ('\r\n' would be written as '\r\n' and then read as '\n') - on the other hand (particularly given that we are treating the data as text and not a binary blob) I don't see how writing '\r\r\n' would ever actually be useful in text.

+1 on just writing '\r\n' from me.

Michael Foord http://www.manning.com/foord

That's all. There is nothing unimplementable or confusing in these specifications.

Python doesn't care about record I/O on legacy OSes; it does care about variability found in practice between popular OSes. Note that \r, \n and friends in Python 3000 are either ASCII (in bytes literals) or Unicode (in text literals). Again, no support for legacy systems that don't use ASCII or a superset. Legacy OSes are called that for a reason.



More information about the Python-Dev mailing list