Re: [coreutils] join feature: auto-format




From: Pádraig Brady
Subject: Re: [coreutils] join feature: auto-format
Date: Fri, 07 Jan 2011 13:03:13 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 06/01/11 12:05, Pádraig Brady wrote:

On 07/10/10 19:25, Pádraig Brady wrote:
> On 07/10/10 18:43, Assaf Gordon wrote:
>> Pádraig Brady wrote, On 10/07/2010 06:22 AM:
>>> On 07/10/10 01:03, Pádraig Brady wrote:
>>>> On 06/10/10 21:41, Assaf Gordon wrote:
>>>>>
>>>>> The "--auto-format" feature simply builds the "-o" format line
>>>>> automatically, based on the number of columns from both input files.
>>>>
>>>> Thanks for persisting with this and presenting a concise example.
>>>> I agree that this is useful and can't think of a simple workaround.
>>>> Perhaps the interface would be better as:
>>>>
>>>> -o {all (default), padded, FORMAT}
>>>>
>>>> where padded is the functionality you're suggesting?
>>>
>>> Thinking more about it, we mightn't need any new options at all.
>>> Currently -e is redundant if -o is not specified.
>>> So how about changing that so that if -e is specified
>>> we operate as above by auto inserting empty fields?
>>> Also I wouldn't base on the number of fields in the first line,
>>> instead auto padding to the biggest number of fields
>>> on the current lines under consideration.
>>
>> My concern is the principle of "least surprise" - if there are existing
>> scripts/programs that specify "-e" without "-o" (doesn't make sense, but
>> still possible) - this change will alter their behavior.
>>
>> Also, implying/forcing 'auto-format' when "-e" is used without "-o" might
>> be a bit confusing.
>
> Well seeing as -e without -o currently does nothing,
> I don't think we need to worry too much about changing that behavior.
> Also to me, specifying -e EMPTY implicitly means I want
> fields missing from one of the files replaced with EMPTY.
>
> Note POSIX is more explicit, and describes our current operation:
>
> -e EMPTY
> Replace empty output fields in the list selected by -o with EMPTY
>
> So changing that would be an extension to POSIX.
> But I still think it makes sense.
> I'll prepare a patch soon, to do as I describe above,
> unless there are objections.

The attached changes join (from what's done on other platforms) so that...

join -e will automatically pad missing fields from one file so that the same number of fields are output from each file. Previously -e was only used for missing fields specified with -o or -j.

With this change join now does:

$ cat file1
a 1 2
b 1
d 1 2

$ cat file2
a 3 4
b 3 4
c 3 4

$ join -a1 -a2 -1 1 -2 1 -e. file1 file2
a 1 2 3 4
b 1 . 3 4
c . . 3 4
d 1 2 . .
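For comparison, the first result above can already be had from an unpatched join by spelling out -o by hand, which is roughly what the originally proposed "--auto-format" would automate. The following is only a sketch under some assumptions: the join field is field 1 in both files, the field count is taken as the widest line of each file, and max_fields is a made-up helper name, not anything in coreutils.

```shell
# Recreate the two sample files from above in a scratch directory.
dir=$(mktemp -d)
printf 'a 1 2\nb 1\nd 1 2\n' > "$dir/file1"
printf 'a 3 4\nb 3 4\nc 3 4\n' > "$dir/file2"

# max_fields is a hypothetical helper: number of fields on the widest line.
max_fields() {
    awk 'NF > m { m = NF } END { print m + 0 }' "$1"
}

n1=$(max_fields "$dir/file1")
n2=$(max_fields "$dir/file2")

# Build "0,1.2,...,1.n1,2.2,...,2.n2": the join field first, then the
# remaining fields of each file (assumes the join field is field 1).
fmt=0
i=2
while [ "$i" -le "$n1" ]; do fmt="$fmt,1.$i"; i=$((i + 1)); done
i=2
while [ "$i" -le "$n2" ]; do fmt="$fmt,2.$i"; i=$((i + 1)); done

# With an explicit -o list, -e fills every missing field with ".".
out=$(join -a1 -a2 -e. -o "$fmt" "$dir/file1" "$dir/file2")
printf '%s\n' "$fmt" "$out"
rm -r "$dir"
```

For these inputs the generated list is 0,1.2,1.3,2.2,2.3. Note this needs an extra pass over each file to count fields, so it is no use on a pipe, which is part of why doing it inside join is attractive.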

$ join -a1 -a2 -1 1 -2 4 -e. file1 file2
. . . . a 3 4
. . . . b 3 4
. . . . c 3 4
a 1 2 . .
b 1 .
d 1 2 . .

$ join -a1 -a2 -1 4 -2 1 -e. file1 file2
. a 1 2 . . .
. b 1 . .
. d 1 2 . . .
a . . 3 4
b . . 3 4
c . . 3 4

$ join -a1 -a2 -1 4 -2 4 -e. file1 file2
. a 1 2 a 3 4
. a 1 2 b 3 4
. a 1 2 c 3 4
. b 1 . a 3 4
. b 1 . b 3 4
. b 1 . c 3 4
. d 1 2 a 3 4
. d 1 2 b 3 4
. d 1 2 c 3 4

While -e without -o was previously a no-op, and so could safely be extended IMHO, this will also change the behavior when both -e and -j are specified. Previously if -j > 1 was specified, and that field was missing, then -e would be used in its place, rather than the empty string. This still does that, but also does the padding. Without the -j issue I'd be 80:20 for just extending -e to auto pad, but given -j I'm 50:50. The alternative is to select this with say '-o padded', but that's less discoverable, and complicates the interface somewhat.

Considering this more, I think it's safer to auto pad only when '-o padded' is specified. I notice the plan9 join man page has an example that uses -e '' to explicitly specify the NUL string as filler, which would have triggered our auto pad if we left it as above.

cheers, Pádraig.