Re: [PATCH] md5: accepts a new --threads option



From: Pádraig Brady
Subject: Re: [PATCH] md5: accepts a new --threads option
Date: Tue, 20 Oct 2009 11:11:00 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Pádraig Brady wrote:
> Giuseppe Scrivano wrote:
>> Hello,
>>
>> inspired by the attempt to make `sort' multi-threaded, I added threads
>> support to md5sum and the sha* programs family. It has effect only when
>> multiple files are specified.
>>
>> Any comment?
>
> How does it compare to:
>
> filesperprocess=10
> cpus=4
> find files | xargs -n$filesperprocess -P$cpus md5sum
>
> I would expect it to be a bit better, as filesperprocess could be very
> large, thus having less overhead in starting processes. Though is the
> benefit worth the extra implementation complexity and a new, less
> general interface for users?

Expanding a bit on why I don't think this should be added...

You don't gain much by splitting the work per file, as the UNIX toolkit is already well equipped to process multiple files in parallel with:

find files | xargs -n$files_per_process -P$processes md5sum
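
For file names containing whitespace, the null-terminated variants are safer; a minimal sketch, with the batch size and process count picked arbitrarily:

find files -type f -print0 | xargs -0 -n100 -P4 md5sum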

That is a more general solution and works for any command or collection of commands (a script). More generally still, the work could be split across multiple machines (where the processing cost outweighs the transmission cost), using ssh or whatever:

find files | dxargs¹ ...
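
Lacking dxargs, a rough sketch of the two-machine split (this assumes otherhost sees the same files at the same paths, e.g. over a shared file system, and the half-and-half split is naive):

find files > list
half=$(( $(wc -l < list) / 2 ))
head -n "$half" list | xargs md5sum > sums.local &
tail -n +"$((half + 1))" list | ssh otherhost xargs md5sum > sums.remote
wait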

Also, one often wants to split the work per data source rather than per CPU, and so would need a variant of the above rather than a self-contained threaded solution. Consider the case where you have files on separate disks (separate sets of heads). You wouldn't want multiple threads or processes fighting over one disk head, so you would do something like:

find /disk1 | xargs md5sum & find /disk2 | xargs md5sum
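
Or, giving each disk's job its own output file, so the two can simply be waited on and combined; a minimal sketch:

find /disk1 -type f | xargs md5sum > disk1.md5 &
find /disk2 -type f | xargs md5sum > disk2.md5 &
wait
cat disk1.md5 disk2.md5 > all.md5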

Note that if we're piping or redirecting the output of the above then we must be careful to line buffer the output from md5sum so that lines from the two jobs aren't interspersed. Hmm, I wonder whether we should line buffer the output from *sum by default. In the meantime one can check for correct output by varying the -o parameter in the following (the sed prints any line lacking a 32-character digest, i.e. any line the two writers mangled):

( find /etc | xargs ./stdbuf -oL md5sum & find /etc | xargs ./stdbuf -oL md5sum ) 2>/dev/null | sed -n '/[^ ]\{32\}/!p'

Now it's a different story if the data within a file could be processed in parallel, i.e. if the digest algorithms themselves could be parallelized. The higher the processing cost compared to the I/O cost, the bigger the benefit would be. Doing a very quick check of these costs on my laptop...

$ timeout -sINT 10 dd bs=32K if=/dev/sda of=/dev/null
347570176 bytes (348 MB) copied, 10.004 s, 34.7 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero | ./md5sum
1816690688 bytes (1.8 GB) copied, 10.0002 s, 182 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero | ./cat >/dev/null
9205088256 bytes (9.2 GB) copied, 10.0514 s, 916 MB/s

$ timeout -sINT 10 dd bs=32K if=/dev/zero of=/dev/null
48931995648 bytes (49 GB) copied, 10.0314 s, 4.9 GB/s

So md5sum here (182 MB/s) is already about 5 times faster than this disk (34.7 MB/s); hashing disk-resident data is I/O bound, and parallelizing the digest would only pay off for data from faster (or cached) sources.
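
Even without parallelized digest algorithms one could hash fixed-size chunks of a single file concurrently, though the result is a set of per-chunk digests rather than the file's own digest, and per the earlier point it only pays off when the chunks aren't all coming off a single contended disk. A hypothetical sketch (the file name and chunk size are made up; output order is nondeterministic):

file=big.iso    # hypothetical input
mb=256          # chunk size in MiB
size=$(stat -c %s "$file")
chunks=$(( (size + mb*1024*1024 - 1) / (mb*1024*1024) ))
for i in $(seq 0 $((chunks - 1))); do
  # read and hash each chunk in its own background pipeline
  dd if="$file" bs=1M skip=$((i * mb)) count=$mb 2>/dev/null |
    md5sum | sed "s/-$/chunk$i/" &
done
wait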

Note there is some low-hanging fruit in speeding up md5sum et al. They seem to use stdio needlessly, thus introducing extra data copying. Also there is an improved sha1 implementation floating around that's 25% more efficient.

cheers,
Pádraig.

¹ http://www.semicomplete.com/blog/geekery/distributed-xargs.html