Issue 1597011: Reading with bz2.BZ2File() returns one garbage character

Created on 2006-11-15 14:19 by cpn, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name, Uploaded, Description
bzp.py, cpn, 2006-11-15 14:21, python script to reproduce the bug
python-trunk-bz2.patch, jafo, 2007-08-28 10:26
python-trunk-bz2-v2.patch, jafo, 2007-08-30 09:39
Messages (8)
msg30548 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:19
When comparing two files which should be equal, the last line is different:

The first file is a bzip2-compressed file and is read with bz2.BZ2File().
The second file is the same file uncompressed and read with open().

The first file, named file.txt.bz2, is uncompressed with:

$ bunzip2 -k file.txt.bz2

To compare I use this script:

###############################
import bz2

f1 = bz2.BZ2File(r'file.txt.bz2', 'r')
f2 = open(r'file.txt', 'r')
lines = 0
while True:
    line1 = f1.readline()
    line2 = f2.readline()
    if line1 == '':
        break
    lines += 1
    if line1 != line2:
        print 'line number:', lines
        print repr(line1)
        print repr(line2)
f1.close()
f2.close()
##############################

Output:

$ python bzp.py
line number: 588317
'\x07'
''

The offending attached file is 5.5 MB. Sorry, I could not reproduce this problem with a smaller file.

Tested in Fedora Core 5 and Python 2.4.3.
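For readers on a current interpreter, the same comparison can be sketched in Python 3 syntax. This is only an illustration, not part of the original report: the function name `first_mismatch` is hypothetical, and it assumes Python 3.3 or later, where `bz2.open()` is available. Both files are read in binary mode so that text decoding cannot hide a stray byte.

```python
import bz2
from itertools import zip_longest


def first_mismatch(compressed_path, plain_path):
    """Return (line_number, bz2_line, plain_line) for the first differing
    line, or None if both files yield identical lines.

    Hypothetical helper for illustration; not taken from the bug report.
    """
    with bz2.open(compressed_path, "rb") as f1, open(plain_path, "rb") as f2:
        # zip_longest pads the shorter file with b"", so an extra trailing
        # line in either file is reported as a mismatch.
        pairs = zip_longest(f1, f2, fillvalue=b"")
        for number, (line1, line2) in enumerate(pairs, start=1):
            if line1 != line2:
                return number, line1, line2
    return None
```

Calling `first_mismatch("file.txt.bz2", "file.txt")` on the files from this report would return `None` on an interpreter without the bug, or the line number and the two differing lines otherwise.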
msg30549 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:28
I can't upload the bz2 sample file. So it is here: http://fahstats.com/img/file.txt.bz2
msg30550 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:35
Confirmed in Windows Python 2.4 and 2.5 http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/3010fd664d78010f/4166d429b25c9ed4?rnum=1&_done=%2Fgroup%2Fcomp.lang.python%2Fbrowse_frm%2Fthread%2F3010fd664d78010f%2F4166d429b25c9ed4%3Ftvc%3D1%26#doc_7770aa47861db452
msg30551 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-11-15 17:30
With your file, I can reproduce that on Linux, Python 2.5. Which compressor did you compress your file with? I unpacked it with bunzip2 without problems, then recompressed it with bzip2, which resulted in a slightly smaller file (by 51 bytes), which then didn't trigger the bug.
msg30552 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 17:46
I received this file already compressed. I don't know which compressor was used. There is no error if I test the compressed file with:

$ bzip2 -t file.txt.bz2
msg55363 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-08-28 10:26
There are some bugs in the bz2 module. The problem boils down to the following code. Notice how c is stored into the buffer *BEFORE* the check to see if there was a read error:

    do {
        BZ2_bzRead(&bzerror, f->fp, &c, 1);
        f->pos++;
        *buf++ = c;
    } while (bzerror == BZ_OK && c != '\n' && buf != end);

This could be fixed by putting an "if (bzerror != BZ_OK) break;" after the BZ2_bzRead() call. However, I also noticed that the universal newline section of the code reads a character, increments f->pos, and *THEN* checks if buf == end, and if so throws away the character.

I changed the code around so that the read loop is unified between universal newlines and regular newlines. I guess this is a small performance penalty, since it's checking the newline mode for each character read. However, we're already doing a system call for every character, so one additional comparison and jump to merge duplicate code for maintenance reasons is probably a good plan, especially since the reason for this bug only existed in one of the two duplicated parts of the code.

Please let me know if this looks good to commit.
msg55469 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-08-30 09:39
Found some problems in the previous version. This one passes the tests, and I've also spent time reviewing the code and I think it is correct. Part of the problem is that only bzerror was being checked, not the number of bytes read. When bzerror is not BZ_OK, the code expects that a byte was still read and returned, but in some cases an error is returned when no bytes were read. This version passes the tests and also correctly handles the bz2 file that is the subject of this bug.
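The failure mode described in msg55363 and msg55469 can be modeled outside C. The sketch below is a Python simulation under stated assumptions, not the bz2 module's code and not the committed patch: `fake_bz_read` is a hypothetical stand-in for a one-byte BZ2_bzRead() call that reports end of stream without producing a byte, the status constants are illustrative, the `stale` value stands in for whatever the C variable `c` happened to contain, and the `buf != end` bound is omitted for brevity.

```python
BZ_OK = 0          # illustrative status values for this model
BZ_STREAM_END = 4
NEWLINE = 0x0A


def fake_bz_read(stream):
    """Hypothetical stand-in for a one-byte read: returns (status, byte).

    At end of stream it reports an error status and produces no byte.
    """
    try:
        return BZ_OK, next(stream)
    except StopIteration:
        return BZ_STREAM_END, None


def readline_buggy(stream, stale=0x07):
    """Stores c before checking the status, like the quoted loop."""
    buf = bytearray()
    c = stale  # models leftover contents of the C variable
    while True:
        status, byte = fake_bz_read(stream)
        if byte is not None:
            c = byte
        buf.append(c)  # stored even when nothing was read
        if status != BZ_OK or c == NEWLINE:
            break
    return bytes(buf)


def readline_fixed(stream):
    """Checks that a byte was actually read before storing it."""
    buf = bytearray()
    while True:
        status, byte = fake_bz_read(stream)
        if byte is None:  # nothing read: do not touch the buffer
            break
        buf.append(byte)
        if status != BZ_OK or byte == NEWLINE:
            break
    return bytes(buf)
```

With a stream built from `iter(b"ab\n")`, both functions return the full line on the first call. On the second call the buggy model returns the single stale byte, which mirrors the lone '\x07' seen in the original report, while the fixed model returns an empty result.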
msg55950 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 05:48
I have committed this into trunk and the 2.5 maintenance branch. It passes all tests and the resulting build passes the submitter-provided test.
History
Date User Action Args
2022-04-11 14:56:21 admin set github: 44233
2007-09-17 06:48:23 jafo set resolution: fixed
2007-09-17 05:48:14 jafo set status: open -> closed; messages: +
2007-08-30 09:39:57 jafo set files: + python-trunk-bz2-v2.patch; messages: +
2007-08-28 10:26:52 jafo set assignee: jafo
2007-08-28 10:26:08 jafo set files: + python-trunk-bz2.patch; nosy: + jafo; messages: +
2006-11-15 14:19:09 cpn create