msg214195 - (view) |
Author: Tommy Carstensen (Tommy.Carstensen) |
Date: 2014-03-20 09:39 |
This is my first post on bugs.python.org; I hope I abide by the rules. It was suggested to me on stackoverflow.com that I request an enhancement to the fileinput module here:

http://stackoverflow.com/questions/22510123/reading-individual-bytes-of-multiple-binary-files-using-the-python-module-filein

I can read the first byte of a binary file like this:

    with open(my_binary_file,'rb') as f:
        f.read(1)

But when I run this code:

    import fileinput
    with fileinput.FileInput(my_binary_file,'rb') as f:
        f.read(1)

then I get this error:

    AttributeError: 'FileInput' object has no attribute 'read'

I would like to propose an enhancement to fileinput which makes it possible to read binary files byte by byte. I posted this solution to my problem:

    def process_binary_files(list_of_binary_files):
        for file in list_of_binary_files:
            with open(file,'rb') as f:
                yield f.read(1)
        return

    list_of_binary_files = ['f1', 'f2']
    generate_byte = process_binary_files(list_of_binary_files)
    byte = next(generate_byte) |
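For reference, a minimal sketch of what FileInput does offer in binary mode. The helper name first_line_bytes is hypothetical and not part of the report. The sketch passes mode by keyword, because the second positional parameter of FileInput is inplace rather than mode. FileInput exposes readline() and iteration, but no read().

```python
import fileinput


def first_line_bytes(path):
    """Return the first line of `path` as bytes (hypothetical helper)."""
    # mode is given by keyword: the second positional argument is `inplace`.
    with fileinput.FileInput(files=[path], mode="rb") as f:
        return f.readline()


# FileInput is line-oriented: it has readline() but no read().
print(hasattr(fileinput.FileInput, "read"))  # prints False
```

In binary mode the returned line is a bytes object that still includes its trailing newline, if the file has one.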
|
|
msg214739 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2014-03-24 21:44 |
fileinput's semantics are heavily tied to lines, not bytes. And processing binary files byte by byte is rather inefficient; can you explain why this feature would be of general utility such that it would be worth including it in the standard library?

It's not hard to just get a byte at a time using existing parts:

    def bytefileinput():
        return (bytes((b,)) for line in fileinput.input() for b in line)

There are ways to do similar things without using fileinput at all. But it really depends on your use case.

Giving fileinput a read() method isn't a bad idea, assuming some reasonable behavior is defined for the various line-oriented methods, but making it iterate binary mode input byte by byte would be a breaking change of limited utility in my view. |
|
|
msg214741 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2014-03-24 21:48 |
That example should have included mode="rb" when using fileinput.input(); oops. Pretend I didn't forget it. |
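With that correction applied, the example might look like the sketch below. The files parameter is an assumption added here so the function does not depend on sys.argv or stdin; the original took no arguments.

```python
import fileinput


def bytefileinput(files):
    """Yield every byte of the given files as a length-1 bytes object."""
    # `files` is a hypothetical parameter; the original read from sys.argv/stdin.
    # mode="rb" makes fileinput yield bytes lines. Iterating a bytes object
    # yields ints, so each int is wrapped back into a one-byte bytes object.
    return (bytes((b,))
            for line in fileinput.input(files=files, mode="rb")
            for b in line)
```

Note that fileinput.input() keeps module-level state, so a second call while an earlier one is still partly consumed raises RuntimeError. Each whole line is also held in memory while its bytes are yielded.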
|
|
msg214752 - (view) |
Author: Tommy Carstensen (Tommy.Carstensen) |
Date: 2014-03-24 22:32 |
I read the fileinput code and realized how heavily tied it is to line input. Will reading individual bytes as suggested not be very memory intensive if each line is billions of characters?

    def bytefileinput():
        return (bytes((b,)) for line in fileinput.input() for b in line)

I posted my workaround on stackoverflow (see link earlier in thread), which does not make use of the fileinput module at all.

After having read through the fileinput code, I agree that the module should only support reading lines and that this enhancement request should be closed. |
|
|
msg214758 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2014-03-24 23:18 |
On memory: Yeah, it could be if the file didn't include any newline characters. The same problem could apply if a text input file relied on word wrap in an editor and included very few or no newlines itself.

There are non-fileinput ways of doing this, like I said; if you want consistent performance, you'd probably use one of them. For example, using the two-arg form of iter:

    from functools import partial

    def bytefileinput(files):
        for file in files:
            with open(filename, "rb") as f:
                yield from iter(partial(f.read, 1), b'')

Still kind of slow, but predictable on memory usage and not too complex. |
|
|
msg214759 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2014-03-24 23:18 |
And of course, missed another typo. open's first arg should be file, not filename. |
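With that fix, the two-arg iter version would read roughly as follows. This is a sketch of the corrected example, not a tested recommendation.

```python
from functools import partial


def bytefileinput(files):
    """Yield every byte of each file in `files` as a length-1 bytes object."""
    for file in files:
        with open(file, "rb") as f:
            # iter(callable, sentinel) calls f.read(1) repeatedly and stops
            # when it returns b'' (end of file), so only one byte is held
            # by the generator at a time.
            yield from iter(partial(f.read, 1), b"")
```

Because each file is opened inside the generator, files are opened lazily, one at a time, as the consumer advances.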
|
|