Opening a UTF-8 encoded file with Unix newlines ("\n") on Win32:

    codecs.open("whatever.txt","r","utf-8").read()

replaces the newlines ("\n") with CR+LF ("\r\n").

The docs specifically say that:

"Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing."

And yet, opening the file with an explicit binary mode resolves the situation:

    codecs.open("whatever.txt","rb","utf-8").read()

This reads the file with the original newlines unmodified.

The implementation of codecs.open and the documentation are out of sync.

Ryan McGuire wrote:
>
> New submission from Ryan McGuire <python.org@enigmacurry.com>:
>
> Opening a UTF-8 encoded file with unix newlines ("\n") on Win32:
>
> codecs.open("whatever.txt","r","utf-8").read()
>
> replaces the newlines ("\n") with CR+LF ("\r\n").
>
> The docs specifically say that :
>
> "Files are always opened in binary mode, even if no binary mode was
> specified. This is done to avoid data loss due to encodings using 8-bit
> values. This means that no automatic conversion of '\n' is done on
> reading and writing."
>
> And yet, opening the file with an explicit binary mode resolves the
> situation:
>
> codecs.open("whatever.txt","rb","utf-8").read()
>
> This reads the file with the original newlines unmodified.
>
> The implementation of codecs.open and the documentation are out of sync.

The implementation looks like this:

    if encoding is not None and \
       'b' not in mode:
        # Force opening of the file in binary mode
        mode = mode + 'b'

in both Python 2 and 3, so I'm not sure what could be causing this.
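
For anyone who wants to check the behaviour locally, here is a minimal sketch that mirrors the mode check quoted above and reads a file back with both "r" and "rb". It is not the codecs module itself: `force_binary_mode`, `read_decoded` and `roundtrip_newlines` are hypothetical helper names made up for this illustration, and the sample file content is an assumption. If the mode is forced to binary as the quoted check suggests, both reads should return the "\n" newlines untouched.

```python
import os
import tempfile


def force_binary_mode(mode, encoding):
    """Mirror the quoted check: append 'b' when an encoding is given."""
    if encoding is not None and 'b' not in mode:
        # Force opening of the file in binary mode
        mode = mode + 'b'
    return mode


def read_decoded(path, mode="r", encoding="utf-8"):
    """Open with the forced mode, then decode the raw bytes by hand."""
    with open(path, force_binary_mode(mode, encoding)) as f:
        data = f.read()
    # With an encoding the file was opened in binary mode, so data is bytes.
    return data.decode(encoding) if encoding is not None else data


def roundtrip_newlines():
    """Write Unix newlines as raw bytes, read back with 'r' and 'rb'."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"line one\nline two\n")
        return read_decoded(path, "r"), read_decoded(path, "rb")
    finally:
        os.remove(path)


if __name__ == "__main__":
    text_r, text_rb = roundtrip_newlines()
    print(repr(text_r))
    print(repr(text_rb))
```

If the two printed values differ on Win32, the file is being opened in text mode somewhere before this check takes effect, which would be worth reporting together with the exact Python version.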