bug#16004: Multicore Core-utils


From: Pádraig Brady
Subject: bug#16004: Multicore Core-utils
Date: Fri, 29 Nov 2013 23:05:54 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 11/29/2013 10:18 PM, CDR wrote:

> Dear friends
>
> In case this email is read by Richard M. Stallman and David MacKenzie. I need a multi-core version of "comm" and "join". The current version only uses one core and it takes hours to process two files, with 4 columns and 510 million lines. I need to process those files every night.
>
> I wonder if any plan exists to jump to multicore. If not, is there a volunteer that can do the job, for a reasonable fee? I am one-man company but I guess we all need a parallel-processing-capable core-utils.

Note that comm and join need sorted input, and sort(1) is already multicore aware. Since sorting implicitly needs to handle all the input before generating output, it makes sense for sort(1) to handle that itself. Also the sorting operation itself is relatively expensive compared to the corresponding I/O involved, which further justifies the multicore knowledge within sort(1).
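As a rough illustration of that sorting step, here is a minimal sketch that assumes GNU sort. The sample data, file names, thread count and buffer size are made up for the example and would need tuning for real inputs.

```shell
# Sketch: sort two small sample inputs with GNU sort using several threads,
# then join them. All names and sizes here are illustrative assumptions.
workdir=$(mktemp -d)
printf 'b 2\na 1\nc 3\n' > "$workdir/left.txt"
printf 'c z\na x\nb y\n' > "$workdir/right.txt"

# --parallel sets the number of sort threads; -S sets the in-memory buffer size.
sort --parallel=4 -S 64M -k1,1 "$workdir/left.txt"  > "$workdir/left.sorted"
sort --parallel=4 -S 64M -k1,1 "$workdir/right.txt" > "$workdir/right.sorted"

# Both inputs are now ordered on field 1, which is what join expects.
join "$workdir/left.sorted" "$workdir/right.sorted" > "$workdir/joined.txt"
cat "$workdir/joined.txt"
```

The sort and the join must run under the same locale settings, otherwise join may complain that its input is not in sorted order.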

So if you're dealing with an already sorted file, the I/O for that file could often be the bottleneck. For example, if your data file that "takes hours to process" was on a mechanical hard disk, then processing with a single thread/process is probably best; otherwise multiple ones would just keep the disk head seeking and slow things down. The increasing prevalence of SSDs changes the game here though, so that separate accesses to the same file could very well be a win.

BTW you haven't said whether you're I/O or CPU bound. I presume you're CPU bound given you're mentioning multicore, which is a little surprising given the relatively inexpensive operations done within comm(1) and join(1). It's worth mentioning locales here, because if you don't need the relatively expensive locale matching rules, you can disable those before a run by setting:

  export LC_ALL=C

If that changes things so that you're I/O bound again, then you might consider putting each file on a separate device, to gain from parallel I/O operations.
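A small sketch of what such a run might look like, using made-up sample files. The inputs have to be sorted in C (byte) order as well, so the sketch checks that with sort -c before calling comm.

```shell
# Sketch: run comm under the C locale on two tiny sample files.
# The file names and contents are illustrative assumptions.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/a.txt"
printf 'banana\ncherry\ndate\n'  > "$workdir/b.txt"

# Apply the C locale to this shell and to every command it starts.
export LC_ALL=C

# sort -c exits non-zero if a file is not in order for the current locale.
sort -c "$workdir/a.txt" && sort -c "$workdir/b.txt"

# -12 suppresses columns 1 and 2, leaving only the lines common to both files.
common=$(comm -12 "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$common"
```

Prefixing the comm or join command with time, once with and once without LC_ALL=C, is a simple way to see whether the locale rules account for much of the CPU cost on your data.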

So if you're still CPU bound, a more general technique to consider is splitting up the file to be processed by separate processes. Now this is more suited to tools that don't depend on the relative order of particular lines, which unfortunately comm(1) and join(1) do, but perhaps there is some way you could split your data into more files when generating it, which could then be fed to separate join(1) processes.
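One possible way to do such a split, sketched under some loud assumptions: the join key is a non-negative integer in field 1, both inputs are already sorted in C order, and two partitions are enough. The helper name split_by_key and all file names are hypothetical. Rows with the same key always land in the same partition, and each partition keeps the original relative order, so every partition pair can be joined independently.

```shell
# Sketch: partition two sorted inputs by key, join the partitions in parallel.
# Assumes integer keys in field 1; names and sample data are hypothetical.
export LC_ALL=C
workdir=$(mktemp -d)
nparts=2
printf '1 a\n10 b\n2 c\n3 d\n' > "$workdir/left.sorted"
printf '1 w\n2 x\n3 y\n4 z\n'  > "$workdir/right.sorted"

# split_by_key INPUT PREFIX: write each line to PREFIX.<key mod nparts>.
split_by_key() {
  i=0
  while [ "$i" -lt "$nparts" ]; do
    : > "$2.$i"            # pre-create so an empty partition still exists
    i=$((i + 1))
  done
  awk -v n="$nparts" -v prefix="$2" '{ print > (prefix "." ($1 % n)) }' "$1"
}

split_by_key "$workdir/left.sorted"  "$workdir/left"
split_by_key "$workdir/right.sorted" "$workdir/right"

# One join process per partition pair, all running in the background.
i=0
while [ "$i" -lt "$nparts" ]; do
  join "$workdir/left.$i" "$workdir/right.$i" > "$workdir/out.$i" &
  i=$((i + 1))
done
wait

# Each partial result is sorted, so a merge restores one sorted output.
sort -m -k1,1 "$workdir"/out.* > "$workdir/joined.txt"
cat "$workdir/joined.txt"
```

Whether this helps depends on the data: the splitting pass itself is single-threaded and adds I/O, so it is more attractive when the partitions can be produced at the time the data is generated.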

thanks, Pádraig.