Issue 45150: Add a file_digest() function in hashlib (original) (raw)

Created on 2021-09-09 10:03 by tarek, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL	Status	Linked	Edit
PR 28252	open	python-dev,2021-09-09 10:04
PR 31928	open	python-dev,2022-03-16 02:11
PR 31930	merged	christian.heimes,2022-03-16 11:49
PR 32046	merged	christian.heimes,2022-03-22 13:06

Messages (14)
msg401457 - (view)	Author: Tarek Ziadé (tarek) *	Date: 2021-09-09 10:03
I am proposing the addition of a very simple helper to return the hash of a file.
msg401461 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2021-09-09 10:48
Hey Tarek, long time no see! * the _sha256 module is optional, can be disabled and is not available in some distributions. * I also don't like to use sha256 has default. It's slow, even slower than sha512. Any default makes it also harder to upgrade to a better, more secure default in the future. * like hmac.new() a file_digest() should accept PEP 452-compatible arguments and hash name as digstmod argument, not just a callable. * a filename argument prevents users from passing in file-like objects like BytesIO. * 4096 bytes chunk size is very conservative. The call overhead for read() and update() may dominate the performance of the function. * The hex argument feels weird. In a perfect world, the hash and hmac objects should get an "update_file" method. The OpenSSL-based hashes could even release the GIL and utilize OpenSSL's BIO layer to avoid any Python overhead.
msg401462 - (view)	Author: Tarek Ziadé (tarek) *	Date: 2021-09-09 10:56
Hey Christian, I hope things are well for you! Thanks for all the precious feedback, I'll rework the patch accordingly
msg415138 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-14 13:56
Tarek, Are you still working on this? Would you like me to take over? Aur
msg415139 - (view)	Author: Tarek Ziadé (tarek) *	Date: 2022-03-14 14:00
@Aur, go for it, I started to implement it and got lost into the details for each backend..
msg415141 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-14 14:13
OK, I'll give it a go.
msg415320 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-16 02:14
PR contains a draft implementation, would appreciate some review before I implement the same interface on all builtin hashes as well as OpenSSL hashes.
msg415321 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-16 02:21
The rationale behind `from_raw_file()` and the special treatment of non-buffered IO is that there is no `read_buffer()` API or other clean way to say "I want to read just what's currently in the buffer so that from now on I could read directly from the file descriptor without harm". If you want to read from a buffered file object, sure, just call `from_file()`. If you want to ensure you'll get full performance benefits, call `from_raw_file()`. If you pass an eligible file object to `from_file()` you'll get the benefits anyway, because why not.
msg415324 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-16 02:36
Forgot an important warning: this is the first time I write C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of noob errors. I'm especially not confident that it's OK to not do any special handling of signals. Can read() return 0 if it was interrupted by a signal? This will stop the hash calculation midway and behave as if it succeeded. Sounds suspiciously like something we don't want. Also, I probably should support signals because such a long operation is something the user definitely might want to interrupt? May I have some guidance please? Would it be enough to copy the code from fileutils.c _Py_Read() and addi an outer loop so we can do many reads with the GIL released and still call PyErr_CheckSignals when needed with the GIL taken?
msg415326 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-16 02:58
Added an attempt to handle signals. I don't think it's working, because when I press Ctrl+C while hashing a long file, it only raises KeyboardInterrupt after waiting the amount of time it usually takes the C code to return, but maybe that's not a good test?
msg415336 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2022-03-16 11:49
Before we continue hacking on an implementation, let's discuss some API design. - Multiple functions or multiple dispatch (name, fileobj) are confusing to users. Let's keep it simple and only implement one function that operates on file objects. - The function should work with most binary file-like including open(..., "rb"), BytesIO, SocketIO. mmap.mmap() is not file-like enough. Anon mapping doesn't provide a fileno and the mmap object has no readinto(). - The function should accept either digest name, digest constructor, or a callable that returns a digest object. The latter makes it possible to reuse file_digest() with MAC constructs like HMAC. - Don't do any I/O in C unless you are prepared to enter a world of pain and suffering. It's hard to get it right across platforms. For example your C code does not work for SocketIO on Windows. - If we decide to implement an accelerator in C, then we don't have to bother with our fallback copies like _sha256module.c. They are slow and only used when OpenSSL is not available.
msg415356 - (view)	Author: Aur Saraf (Aur.Saraf) *	Date: 2022-03-16 18:51
I don't think HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point, the cost of supporting it is very low. I/O in C is a world of pain in general. In the specific case of `io.RawIOBase` objects (non-buffered binary files) to my understanding it's not _that_ terrible (am I right? Does my I/O code work as-is?). To my understanding, providing a fast path just for this case that calculates the hash without taking the GIL for every chunk would be very nice to have for many use cases. Now, we could just be happy with `file_digest()` having an `if` for `isinstance(io.RawIOBase)` that chooses a fast code path silently. But since non-buffered binary files are so hard to tell apart from other types of file-like objects, as a user of this code I would like to have a way to say "I want the fast path, please raise if I accidentally passed the wrong things and got the regular path". We could have `file_digest('sha256', open(path, 'rb', buffered=0), ensure_fast_io=True)`, but I think for this use case `raw_file_digest('sha256', open(path, 'rb', buffered=0))` is cleaner. In all other cases you just call `file_digest()`, probably get the Python I/O and not the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing. For the same reason I think the fast path should only support hash names and not constructors/functions/etc', which would complicate it because new-object-can-be-accessed-without-GIL wouldn't necessarily apply. Does this make sense?
msg415754 - (view)	Author: miss-islington (miss-islington)	Date: 2022-03-22 09:37
New changeset 4f97d64c831c94660ceb01f34d51fa236ad968b0 by Christian Heimes in branch 'main': bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930) https://github.com/python/cpython/commit/4f97d64c831c94660ceb01f34d51fa236ad968b0
msg415793 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2022-03-22 15:41
New changeset e03db6d5be7cf2e6b7b55284985c404de98a9420 by Christian Heimes in branch 'main': bpo-45150: Fix testing under FIPS mode (GH-32046) https://github.com/python/cpython/commit/e03db6d5be7cf2e6b7b55284985c404de98a9420

History
Date	User	Action	Args
2022-04-11 14:59:49	admin	set	github: 89313
2022-03-22 15:41:04	christian.heimes	set	messages: +
2022-03-22 13:06:13	christian.heimes	set	pull_requests: + <pull%5Frequest30136>
2022-03-22 09:37:24	miss-islington	set	nosy: + miss-islingtonmessages: +
2022-03-16 18:51:44	Aur.Saraf	set	messages: +
2022-03-16 11:49:42	christian.heimes	set	pull_requests: + <pull%5Frequest30021>
2022-03-16 11:49:27	christian.heimes	set	assignee: tarek -> christian.heimesmessages: +
2022-03-16 02:58:43	Aur.Saraf	set	messages: +
2022-03-16 02:36:04	Aur.Saraf	set	messages: +
2022-03-16 02:21:23	Aur.Saraf	set	messages: +
2022-03-16 02:14:05	Aur.Saraf	set	messages: +
2022-03-16 02:11:47	python-dev	set	pull_requests: + <pull%5Frequest30019>
2022-03-14 14:13:50	Aur.Saraf	set	messages: +
2022-03-14 14:00:05	tarek	set	messages: +
2022-03-14 13:56:59	Aur.Saraf	set	nosy: + Aur.Sarafmessages: +
2021-09-09 10:56:27	tarek	set	messages: +
2021-09-09 10:48:47	christian.heimes	set	messages: +
2021-09-09 10:16:36	serhiy.storchaka	set	nosy: + gregory.p.smith, christian.heimestype: enhancementversions: + Python 3.11
2021-09-09 10:04:49	python-dev	set	keywords: + patchnosy: + python-devpull_requests: + <pull%5Frequest26673>stage: patch review
2021-09-09 10:03:17	tarek	create