| msg30548 - (view) |
Author: Clodoaldo Pinto Neto (cpn) |
Date: 2006-11-15 14:19 |
| When comparing two files that should be equal, the last line differs. The first file is a bzip2-compressed file read with bz2.BZ2File(). The second is the same file, uncompressed, read with open(). The compressed file, file.txt.bz2, is uncompressed with:

$ bunzip2 -k file.txt.bz2

To compare them I use this script:

###############################
import bz2

f1 = bz2.BZ2File(r'file.txt.bz2', 'r')
f2 = open(r'file.txt', 'r')
lines = 0
while True:
    line1 = f1.readline()
    line2 = f2.readline()
    if line1 == '':
        break
    lines += 1
    if line1 != line2:
        print 'line number:', lines
        print repr(line1)
        print repr(line2)
f1.close()
f2.close()
##############################

Output:

$ python bzp.py
line number: 588317
'\x07'
''

The offending attached file is 5.5 MB. Sorry, I could not reproduce this problem with a smaller file. Tested on Fedora Core 5 with Python 2.4.3. |
|
|
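For readers on a current interpreter, the comparison in the report can be sketched roughly as follows. This is a Python 3 adaptation, not the reporter's script: the function name `first_mismatch` is hypothetical, and `bz2.open` is assumed to be available (Python 3.3 or later). Both files are read in binary mode so that newline translation cannot hide or cause a difference.

```python
import bz2
from itertools import zip_longest


def first_mismatch(bz2_path, plain_path):
    """Return (line_number, bz2_line, plain_line) for the first differing
    line, or None when both files yield identical lines.

    A file that runs out of lines early contributes b"" for the missing
    lines, mirroring what readline() returns at end of file.
    """
    with bz2.open(bz2_path, "rb") as f1, open(plain_path, "rb") as f2:
        pairs = zip_longest(f1, f2, fillvalue=b"")
        for number, (line1, line2) in enumerate(pairs, 1):
            if line1 != line2:
                return number, line1, line2
    return None
```

Called as `first_mismatch("file.txt.bz2", "file.txt")`, it should return None for a correctly working bz2 module; a spurious trailing byte like the one reported here would show up as a tuple naming the final line.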
| msg30549 - (view) |
Author: Clodoaldo Pinto Neto (cpn) |
Date: 2006-11-15 14:28 |
| I can't upload the bz2 sample file. So it is here: http://fahstats.com/img/file.txt.bz2 |
|
|
| msg30550 - (view) |
Author: Clodoaldo Pinto Neto (cpn) |
Date: 2006-11-15 14:35 |
| Confirmed on Windows with Python 2.4 and 2.5: http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/3010fd664d78010f/4166d429b25c9ed4?rnum=1&_done=%2Fgroup%2Fcomp.lang.python%2Fbrowse_frm%2Fthread%2F3010fd664d78010f%2F4166d429b25c9ed4%3Ftvc%3D1%26#doc_7770aa47861db452 |
|
|
| msg30551 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2006-11-15 17:30 |
| With your file, I can reproduce this on Linux with Python 2.5. Which compressor was the file compressed with? I unpacked it with bunzip2 without problems, then recompressed it with bzip2. The result was a slightly smaller file (by 51 bytes), and that file did not trigger the bug. |
|
|
| msg30552 - (view) |
Author: Clodoaldo Pinto Neto (cpn) |
Date: 2006-11-15 17:46 |
| I received this file already compressed. I don't know which compressor was used. There is no error if I test the compressed file with: $ bzip2 -t file.txt.bz2 |
|
|
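An integrity check in the spirit of `bzip2 -t` can also be approximated from Python. The sketch below is only an approximation under stated assumptions: `bz2_is_intact` is a hypothetical helper name, it relies on Python 3's `bz2.open`, and it treats any `OSError` or `EOFError` raised while decompressing as a sign of a damaged or truncated file.

```python
import bz2


def bz2_is_intact(path, chunk_size=64 * 1024):
    """Decompress the whole file and report whether that succeeded.

    The decompressed data is discarded chunk by chunk, so memory use stays
    bounded even for large files.
    """
    try:
        with bz2.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid data stream; EOFError: file ended too early.
        return False
    return True
```

A file that passes this check, as the sample file passes `bzip2 -t`, suggests the data itself is valid and any mismatch comes from how it is being read.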
| msg55363 - (view) |
Author: Sean Reifschneider (jafo) *  |
Date: 2007-08-28 10:26 |
| There are some bugs in the bz2 module. The problem boils down to the following code. Notice how c is written to the buffer *BEFORE* the check to see whether there was a read error:

    do {
        BZ2_bzRead(&bzerror, f->fp, &c, 1);
        f->pos++;
        *buf++ = c;
    } while (bzerror == BZ_OK && c != '\n' && buf != end);

This could be fixed by putting an "if (bzerror != BZ_OK) break;" after the BZ2_bzRead() call. However, I also noticed that the universal newline section of the code reads a character, increments f->pos, and only *THEN* checks whether buf == end, throwing away the character if so.

I changed the code so that the read loop is unified between universal newlines and regular newlines. This carries a small performance penalty, since the newline mode is checked for each character read. However, we are already making a library call (BZ2_bzRead) for every character, so one additional comparison and jump to merge duplicated code for maintenance reasons is probably a good plan, especially since this bug existed in only one of the two duplicated parts of the code.

Please let me know if this looks good to commit. |
|
|
| msg55469 - (view) |
Author: Sean Reifschneider (jafo) *  |
Date: 2007-08-30 09:39 |
| I found some problems in the previous version. This one passes the tests, and after spending time reviewing the code I think it is correct. Part of the problem is that only bzerror was being checked, not the number of bytes read. When bzerror is not BZ_OK, the code assumes that a byte was still read, but in some cases the error is returned when no bytes were read. This code passes the tests and also correctly handles the bz2 file that is the subject of this bug. |
|
|
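The failure mode described above can be modelled in a few lines of Python. This is a simulation for illustration only, not the committed C patch: `FakeBzStream`, `readline_buggy`, and `readline_fixed` are hypothetical names, the three-value return stands in for the status and byte count reported by `BZ2_bzRead`, and the default stale byte `b"\x07"` is an arbitrary stand-in for whatever happened to be in the C variable `c`.

```python
BZ_OK = 0
BZ_STREAM_END = 4  # numeric values as defined in bzlib.h


class FakeBzStream:
    """Hypothetical stand-in for reading one byte via BZ2_bzRead."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read_byte(self):
        """Return (status, bytes_read, byte)."""
        if self._pos < len(self._data):
            byte = self._data[self._pos:self._pos + 1]
            self._pos += 1
            return BZ_OK, 1, byte
        # End of stream reported on its own call, with no byte delivered.
        return BZ_STREAM_END, 0, b""


def readline_buggy(stream, stale=b"\x07"):
    """Store the byte before looking at the status, as the old loop did."""
    c = stale  # models leftover contents of the C variable `c`
    out = bytearray()
    while True:
        status, nread, byte = stream.read_byte()
        if nread:
            c = byte
        out += c  # appended even when nothing was read
        if status != BZ_OK or c == b"\n":
            break
    return bytes(out)


def readline_fixed(stream):
    """Check the byte count as well as the status before storing."""
    out = bytearray()
    while True:
        status, nread, byte = stream.read_byte()
        if nread == 0:
            break  # nothing was read, so there is nothing to store
        out += byte
        if status != BZ_OK or byte == b"\n":
            break
    return bytes(out)
```

Reading past the last line of `FakeBzStream(b"a\nb\n")` gives `b"\x07"` from `readline_buggy` and `b""` from `readline_fixed`, which matches the symptom in the original report. Checking only the status would not be enough on its own, because a byte delivered together with a non-OK status still has to be kept.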
| msg55950 - (view) |
Author: Sean Reifschneider (jafo) *  |
Date: 2007-09-17 05:48 |
| I have committed this into trunk and the 2.5 maintenance branch. It passes all tests and the resulting build passes the submitter-provided test. |
|
|