Re: [coreutils] added ability in sort to skip n number of lines for each file




From: Pádraig Brady
Subject: Re: [coreutils] added ability in sort to skip n number of lines for each file
Date: Mon, 22 Nov 2010 17:28:45 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 18/11/10 16:36, Jim Hester wrote:

A common problem when sorting files stems from the file containing one or more header lines, which should not be sorted. As of now, the common solution to this problem is to remove the header lines manually, or to output only the non-header lines with tail, awk, or some other program and pipe the results to sort.

Thanks for the patch!

This was likely not deemed a problem when sort was only single-threaded, as the printing and the pipe were likely still faster than the sort itself. However, with multi-threaded sort this results in the operation bottlenecking while it waits for more data from the pipe.

I'm not following the argument above. One can always print the header synchronously? I.e., the head below is guaranteed to run before the sort:

  printf "z_header\nb\na\n" > file
  (head -n1 file; sort <(tail -n+2 file) <(tail -n+2 file))

Now the above is awkward and dependent on bash (it needs one process substitution per file), so I think your idea has some merit.
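
For comparison, here is a minimal sketch of the same idea in plain POSIX sh, avoiding process substitution. The function name sort_skip_header and its argument order are hypothetical, chosen only for illustration; they are not part of coreutils or of the proposed patch. It assumes every file carries the same number of header lines, and it prints the header of the first file only.

```shell
# Hypothetical helper: sort_skip_header N FILE...
# Prints the first N lines of the first FILE unsorted, then the
# remaining lines of all FILEs merged and sorted.
sort_skip_header() {
    n=$1
    shift
    # The header is written synchronously, before sort starts.
    head -n "$n" "$1"
    # Strip the first N lines from each file and feed the rest to sort.
    for f in "$@"; do
        tail -n +"$((n + 1))" "$f"
    done | sort
}
```

Called as `sort_skip_header 1 file1 file2`, this still feeds sort through a pipe, so it only addresses the portability concern, not the multi-threaded throughput concern raised in the original mail.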

This common operation would be greatly improved if sort could simply print a user-defined number of lines for each file. I have made a simple patch to implement this feature, which I have attached to this email.

Note that join recently got the --header option (http://lists.gnu.org/archive/html/bug-coreutils/2010-01/msg00284.html), also essentially to exclude starting lines from order comparisons.

cheers, Pádraig.