Issue 20409: .readline() returned garble text

I'm using Windows 8. I created file 'weird1.txt' (attached) from an Excel worksheet using "save as Unicode Text (*.txt)". And this happened when I used Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:19:30) [MSC v.1600 64 bit (AMD64)] on win32:

handle = open('weird1.txt'); handle.readline()
'ÿþ>\x00P\x006\x004\x00;\x00Y\x00A\x00L\x000\x000\x001\x00C\x00;\x00T\x00F\x00C\x003\x00;\x00 \x00S\x00G\x00D\x00I\x00D\x00:\x00S\x000\x000\x000\x000\x000\x000\x000\x000\x001\x00,\x00 \x00C\x00h\x00r\x00 \x00I\x00 \x00f\x00r\x00o\x00m\x00 \x001\x005\x001\x000\x000\x006\x00-\x001\x004\x007\x005\x009\x004\x00,\x001\x005\x001\x001\x006\x006\x00-\x001\x005\x001\x000\x009\x007\x00,\x00 \x00r\x00e\x00v\x00e\x00r\x00s\x00e\x00 \x00c\x00o\x00m\x00p\x00l\x00e\x00m\x00e\x00n\x00t\x00,\x00 \x00V\x00e\x00r\x00i\x00f\x00i\x00e\x00d\x00 \x00O\x00R\x00F\x00,\x00 \x00"\x00L\x00a\x00r\x00g\x00e\x00s\x00t\x00 \x00o\x00f\x00 \x00s\x00i\x00x\x00 \x00s\x00u\x00b\x00u\x00n\x00i\x00t\x00s\x00 \x00o\x00f\x00 \x00t\x00h\x00e\x00 \x00R\x00N\x00A\x00 \x00p\x00o\x00l\x00y\x00m\x00e\x00r\x00a\x00s\x00e\x00 \x00I\x00I\x00I\x00 \x00t\x00r\x00a\x00n\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00 \x00i\x00n\x00i\x00t\x00i\x00a\x00t\x00i\x00o\x00n\x00 \x00f\x00a\x00c\x00t\x00o\x00r\x00 \x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00 \x00(\x00T\x00F\x00I\x00I\x00I\x00C\x00)\x00;\x00 \x00p\x00a\x00r\x00t\x00 \x00o\x00f\x00 \x00t\x00h\x00e\x00 \x00T\x00a\x00u\x00B\x00 \x00d\x00o\x00m\x00a\x00i\x00n\x00 \x00o\x00f\x00 \x00T\x00F\x00I\x00I\x00I\x00C\x00 \x00t\x00h\x00a\x00t\x00 \x00b\x00i\x00n\x00d\x00s\x00 \x00D\x00N\x00A\x00 \x00a\x00t\x00 \x00t\x00h\x00e\x00 \x00B\x00o\x00x\x00B\x00 \x00p\x00r\x00o\x00m\x00o\x00t\x00e\x00r\x00 \x00s\x00i\x00t\x00e\x00s\x00 \x00o\x00f\x00 \x00t\x00R\x00N\x00A\x00 \x00a\x00n\x00d\x00 \x00s\x00i\x00m\x00i\x00l\x00a\x00r\x00 \x00g\x00e\x00n\x00e\x00s\x00;\x00 \x00c\x00o\x00o\x00p\x00e\x00\n'

Then I opened 'weird1.txt' in Notepad++ 6.5.2, created file 'weird2.txt' by copying the whole content of 'weird1.txt' into a new file and saved it in Notepad++ 6.5.2 (I wanted to attach 'weird2.txt' but only one attachment is allowed), and this happened:

handle = open('weird2.txt'); handle.readline()
'>P64;YAL001C;TFC3; SGDID:S000000001, Chr I from 151006-147594,151166-151097, reverse complement, Verified ORF, "Largest of six subunits of the RNA polymerase III transcription initiation factor complex (TFIIIC); part of the TauB domain of TFIIIC that binds DNA at the BoxB promoter sites of tRNA and similar genes; coope\n'

I can't see any difference between the contents of 'weird1.txt' and 'weird2.txt' using Notepad++ or the Windows Notepad. Maybe some experts could tell me what's going on here?

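The files use different encodings. In the first case, the first two bytes ('ÿþ', which don't appear in the second example) are the byte order mark (BOM): 0xFF 0xFE is the BOM of little-endian UTF-16, which also explains all the \x00 bytes. The second file is presumably UTF-8 with no BOM, though since the text shown is plain ASCII it would read the same in most single-byte encodings. Notepad++ detects both automatically. For Python, you have to tell it to look for the BOM by specifying the appropriate codec in the open call, for example open('weird1.txt', encoding='utf-16'). This is because Python's philosophy is not to guess at the encoding of files. It does have a default encoding, but that is the locale's preferred encoding (locale.getpreferredencoding(False)); on Windows this is typically a legacy code page such as cp1252 rather than UTF-8, which is why the BOM was decoded as 'ÿþ'.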
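
To make that concrete, here is a minimal sketch that reproduces both results without the attachment. It assumes the export is little-endian UTF-16 with a BOM, which is what the 'ÿþ' prefix suggests. The sample text is just the first few characters of the line quoted above, the temporary file name is made up for illustration, and cp1252 stands in for whatever the default code page was on the reporter's machine.

```python
import codecs
import os
import tempfile

# Short stand-in for the first line of 'weird1.txt' (not the real attachment).
text = '>P64;YAL001C;TFC3; SGDID:S000000001, Chr I\n'

# Write the sample as little-endian UTF-16 preceded by a BOM (bytes FF FE).
path = os.path.join(tempfile.mkdtemp(), 'weird1_sample.txt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + text.encode('utf-16-le'))

# Looking at the raw bytes shows what a text editor hides.
with open(path, 'rb') as f:
    first_two = f.read(2)

# Decoding with a single-byte code page (assumed here: cp1252) gives the
# garbled result: the BOM becomes two characters, and every second byte
# shows up as '\x00'.
with open(path, encoding='cp1252') as f:
    garbled = f.readline()

# Naming the right codec makes Python consume the BOM and decode properly.
with open(path, encoding='utf-16') as f:
    clean = f.readline()

print(first_two)      # the two BOM bytes
print(repr(garbled))  # starts with the decoded BOM followed by '>\x00P\x00'
print(repr(clean))    # the original text
```

The 'utf-16' codec reads the BOM to pick the byte order and then drops it, so this works without knowing the endianness in advance.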

Questions like this are better directed to the python-list mailing list, by the way.