Bit Counting and Similar Instructions (original) (raw)

ACiphers By Ritter Page

Population count and binary logarithm comments.


Contents


Subject: Bit counting and similar instructions Date: 3 Dec 1998 22:43:04 GMT From: Doug Moore dougm@farkas.caam.rice.edu Message-ID: 74745o$qi$1@joe.rice.edu Newsgroups: comp.arch.arithmetic Lines: 13

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Doug Moore (dougm@caam.rice.edu)


Subject: Re: Bit counting and similar instructions Date: 4 Dec 1998 01:27:07 GMT From: Vincent Lefevre Vincent.Lefevre@ens-lyon.fr Message-ID: 747dpb$14b@cri.ens-lyon.fr References: 74745o$qi$1@joe.rice.edu Newsgroups: comp.arch.arithmetic Lines: 19

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

Identify the position of most significant 1 bit in a word.

The ARM10 will have such an instruction (CLZ) [ARMv5T architecture]. See http://www.eet.com/story/OEG19981015S0019

Identify the position of the least significant 1 bit in a word.

This can be done with a few instructions (bitwise logical operations on x and x-1), but you'll have the position as a power of 2 (not the number of 0's after the least significant 1).

-- Vincent Lefevre Vincent.Lefevre@ens-lyon.fr - PhD stud. in Computer Science Web: http://www.ens-lyon.fr/~vlefevre/ - 100% validated HTML - Acorn Risc PC, Yellow Pig 17, Championnat International des Jeux Mathematiques et Logiques, TETRHEX, Faits divers insolites, etc...


Subject: Re: Bit counting and similar instructions Date: 4 Dec 1998 08:33:18 GMT From: michael.williams@arm-sponge.com (Michael Williams) Message-ID: 7486oe$jua@sis.cambridge.arm.com References: 747dpb$14b@cri.ens-lyon.fr Newsgroups: comp.arch.arithmetic Lines: 53

In article 747dpb$14b@cri.ens-lyon.fr, Vincent Lefevre vlefevre+news@ens-lyon.fr wrote:

Identify the position of the least significant 1 bit in a word.

This can be done with a few instructions (bitwise logical operations on x and x-1), but you'll have the position as a power of 2 (not the number of 0's after the least significant 1).

David Seal came up with a neat algoritm for converting such a value to the index, many moons ago. The message ID is 32975@armltd.uucp, but this dates from early 1994, so may be hard to find. I replicate it here:

(Apologies to non-ARM-coders. I hope you get the gist.)

In article 32975@armltd.uucp dseal@armltd.co.uk (David Seal) writes:

I produced a better variant of the same theme last night:

; Operand in R0, register R1 is free, R2 addresses a byte table 'Table'

RSB R1,R0,#0 ;Standard trick to isolate bottom bit in R1, AND R1,R1,R0 ; or produce zero in R1 if R0 = 0.

ORR R1,R1,R1,LSL #4 ;If R1=X with 0 or 1 bits set, R0 = X * &11 ORR R1,R1,R1,LSL #6 ;R0 = X * &451 RSB R1,R1,R1,LSL #16 ;R1 = X * &0450FBAF

LDRB R1,[R2,R1,LSR #26] ;Look up table entry indexed by top 6 bits ; of R1

; Result in R1

If you're interested, Table looks like this:

Table DATA DCB 0x20,0x00,0x01,0x0c DCB 0x02,0x06,0xff,0x0d DCB 0x03,0xff,0x07,0xff DCB 0xff,0xff,0xff,0x0e DCB 0x0a,0x04,0xff,0xff DCB 0x08,0xff,0xff,0x19 DCB 0xff,0xff,0xff,0xff DCB 0xff,0x15,0x1b,0x0f DCB 0x1f,0x0b,0x05,0xff DCB 0xff,0xff,0xff,0xff DCB 0x09,0xff,0xff,0x18 DCB 0xff,0xff,0x14,0x1a DCB 0x1e,0xff,0xff,0xff DCB 0x17,0x17,0xff,0x13 DCB 0x1d,0xff,0x16,0x12 DCB 0x1c,0x11,0x10,0xff

Mike.


Subject: Re: Bit counting and similar instructions Date: Fri, 11 Dec 1998 06:38:35 GMT From: jreiser@teleport.com (John Reiser) Message-ID: 3670ba1d.5981552@news.teleport.com References: 747dpb$14b@cri.ens-lyon.fr Newsgroups: comp.arch.arithmetic Lines: 7

Also remember elementary number theory: log2modp[ (x & -x) % p] where p is a prime larger than the word size and having 2 as a generator of the multiplicative group of units modulo p. log2modp is a pre-computed table of logarithms. For 32 bits, the smallest such prime is p = 37.

jreiser@teleport.com


Subject: Re: Bit counting and similar instructions Date: 11 Dec 1998 09:42:49 GMT From: nmm1@cus.cam.ac.uk (Nick Maclaren) Message-ID: 74qpep$ele$1@pegasus.csx.cam.ac.uk References: 3670ba1d.5981552@news.teleport.com Newsgroups: comp.arch.arithmetic Lines: 21

In article 3670ba1d.5981552@news.teleport.com, John Reiser jreiser@teleport.com wrote:

Also remember elementary number theory: log2modp[ (x & -x) % p] where p is a prime larger than the word size and having 2 as a generator of the multiplicative group of units modulo p. log2modp is a pre-computed table of logarithms. For 32 bits, the smallest such prime is p = 37.

Hmm. Elementary? I shall have to think on't :-)

In any case, that make a lot of assumptions about the representation. "(x & -x)" is not well-defined in general.

Regards, Nick Maclaren, University of Cambridge Computing Service, New Museums Site, Pembroke Street, Cambridge CB2 3QG, England. Email: nmm1@cam.ac.uk Tel.: +44 1223 334761 Fax: +44 1223 334679


Subject: Re: Bit counting and similar instructions Date: Fri, 4 Dec 1998 08:09:04 GMT From: pmontgom@cwi.nl (Peter L. Montgomery) Message-ID: F3FLB4.K45@cwi.nl References: 74745o$qi$1@joe.rice.edu Newsgroups: comp.arch.arithmetic Lines: 45

In article 74745o$qi$1@joe.rice.edu Doug Moore dougm@farkas.caam.rice.edu writes:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Here is a partial list, old and new architectures:

     Population count:

           Cray, CDC Cyber series, Alliant FX/80.
           UltraSPARC, Alpha 21264

           Intel x86 has the parity of a byte

     Most significant 1 bit (equivalent to binary logarithm)

           Cray, Alpha 21264, Motorola 68020, Intel x86
           CDC Cyber series had this if 0 < arg < 2^48

     Least significant 1 bit:

           Alpha 21264, Intel x86
           Available as POPULATION_COUNT((x-1) & ~x) if x != 0

Of course, these are unusual instructions since they would hard to generate from most programming languages.

 The population count, least significant 1 bit, and binary

logarithm should be added to Fortran, C and other programming languages. All three should operate on unsigned operands, with the last two functions being undefined when the argument is zero. A binary logarithm function is preferred to a leading zero count function since (for example) BINARY_LOGARITHM(123) will be 6 on both 32-bit and 64-bit machines, whereas LEADING_ZERO_COUNT(123) will be 25 on 32-bit machines but 57 on 64-bit machines.

    Peter-Lawrence.Montgomery@cwi.nl    San Rafael, California

The bridge to the 21st century is being designed for the private autombile. The bridge to the 22nd century will forbid private automobiles.


Subject: Re: Bit counting and similar instructions Date: 4 Dec 1998 10:13:33 GMT From: nmm1@cus.cam.ac.uk (Nick Maclaren) Message-ID: 748ckd$pkk$1@pegasus.csx.cam.ac.uk References: F3FLB4.K45@cwi.nl Newsgroups: comp.arch.arithmetic Lines: 37

In article F3FLB4.K45@cwi.nl, Peter L. Montgomery pmontgom@cwi.nl wrote:

In article 74745o$qi$1@joe.rice.edu Doug Moore dougm@farkas.caam.rice.edu writes:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

The population count, least significant 1 bit, and binary

logarithm should be added to Fortran, C and other programming languages.

There is absolutely no difficulty in adding these as Fortran intrinsic functions, C library macro/functions etc., and a compiler generating inline code and using hardware support where appropriate. It is precisely how the absolute value is supported.

C9X will not welcome such suggestions, but there might be a chance the next Fortran update.

My experience is that I want the most significant bit 10 times for every time I want one of the others, but I suspect that is because of the sort of algorithms I look at rather than any inherent importance of the operation.

Regards, Nick Maclaren, University of Cambridge Computing Service, New Museums Site, Pembroke Street, Cambridge CB2 3QG, England. Email: nmm1@cam.ac.uk Tel.: +44 1223 334761 Fax: +44 1223 334679


Subject: Re: Bit counting and similar instructions Date: 4 Dec 1998 11:11:26 -0500 From: hrubin@b.stat.purdue.edu (Herman Rubin) Message-ID: 7491je$hmk@b.stat.purdue.edu References: 74745o$qi$1@joe.rice.edu Newsgroups: comp.arch.arithmetic Lines: 29

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Let me add one more to the list; it would be very useful for such purposes as generating non-uniform random numbers from uniform random bit input.

What is wanted is to read a bit stream from a position given by a bit pointer, find the distance to the next 1 in the stream, and update the pointer, with appropriate interrupts if the stream becomes exhausted or there are no 1's remaining bits in the stream.

Even partial tools would be helpful.

This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University. Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 hrubin@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558


Subject: Re: Bit counting and similar instructions Date: Sat, 05 Dec 1998 15🔞58 +0100 From: Terje Mathisen Terje.Mathisen@hda.hydro.com Message-ID: 366940D2.54B0DDE1@hda.hydro.com References: 7491je$hmk@b.stat.purdue.edu Newsgroups: comp.arch.arithmetic Lines: 92

Herman Rubin wrote:

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Let me add one more to the list; it would be very useful for such purposes as generating non-uniform random numbers from uniform random bit input.

What is wanted is to read a bit stream from a position given by a bit pointer, find the distance to the next 1 in the stream, and update the pointer, with appropriate interrupts if the stream becomes exhausted or there are no 1's remaining bits in the stream.

Even partial tools would be helpful.

The partial tools already exists, and should be plenty good enough to do this very quickly.

If the input bit stream is really random, then the average distance to the next 1 bit will be two bits, right?

In 255 of 256 cases, the next bit will be found within the next 8 bits, so I'd use code like this:

unsigned char cache; // Current byte in bit stream unsigned char *buf; // points to the next byte in the bit stream unsigned bitpos; // bit position to start looking at (0..8) unsigned char mask[9] = {255,254,252,248,240,224,192,128,0}; unsigned char firstbit[256]; // returns the position of the first 1 bit

b = cache & mask[bitpos]; // 4 bits left in cache (on average) if (b) { bit = firstbit[b]; count = bit - bitpos; bitpos = bit + 1; return count; }

In about 93% of all cases, the code above will be all that's needed. It uses about 7 operations (worst case, assuming tables in L1 cache), so most modern cpus can inline this code and run it in 3-5 cycles.

When we've reached the end of the current byte, then we'll need to locate the next non-zero byte:

for (count = 8-bitpos; (b = *buf++) == 0; count += 8) ;

The code above will nearly always run just a single (predicted) iteration, so it will cost a couple of cycles, plus the cost of the mis-predicted branch in the initial code above.

cache = b; // Save remainder for next iteration bit = firstbit[b]; bitpos = bit + 1; return bit + count;

To avoid the need for explicit checks for the end of the input buffer, a simple sentinel value (255?) will do fine.

IMHO, it really seems like this "problem" can be solved with portable code, averaging less than 5 cycles per invocation. This is NOT a good candidate for extra hardware support.

Terje

PS. Using the fp hardware and 32-bit input blocks might be even faster, since there will be no need for the 256-byte lookup table:

d = cache & mask[bitpos]; if (d) { double t = d; int64 ll = *(int64 *) & t; return (ll >> 52) - 1023; }

--


Subject: Re: Bit counting and similar instructions Date: Sun, 6 Dec 1998 18:13:51 -0800 From: "Derek Gladding" derek_gladding@altavista.net Message-ID: 74fegu$8ib$1@owl.slip.net References: 366940D2.54B0DDE1@hda.hydro.com Newsgroups: comp.arch.arithmetic Lines: 22

Terje Mathisen wrote in message 366940D2.54B0DDE1@hda.hydro.com...

Herman Rubin wrote:

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

The ARC (http://www.risccores.com) has the most-significant 1 operation as one of the optional instruction set extensions.


Subject: Re: Bit counting and similar instructions Date: Sat, 05 Dec 1998 22:29:25 +0100 From: Terje Mathisen Terje.Mathisen@hda.hydro.com Message-ID: 3669A5B5.E6CB3F5@hda.hydro.com References: 7491je$hmk@b.stat.purdue.edu Newsgroups: comp.arch.arithmetic Lines: 45

Herman Rubin wrote:

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Let me add one more to the list; it would be very useful for such purposes as generating non-uniform random numbers from uniform random bit input.

What is wanted is to read a bit stream from a position given by a bit pointer, find the distance to the next 1 in the stream, and update the pointer, with appropriate interrupts if the stream becomes exhausted or there are no 1's remaining bits in the stream.

Even partial tools would be helpful.

Herman, this application does NOT need any fancy opcodes!

It is quite similar to decoding huffman-encoded (or other variable bit length encoding) data. The fastest way I know to do this is to use one or a couple of lookup tables, which are indexed by the next N bits of the bit stream.

With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table consisting of a count and a set of offsets. Worst case, this will use 9*256 bytes, and it will b faster than any conceivable special opcode which returns just a single entry.

Terje

--


Subject: Re: Bit counting and similar instructions Date: 8 Dec 1998 16:54:01 -0500 From: hrubin@b.stat.purdue.edu (Herman Rubin) Message-ID: 74k75p$juk@b.stat.purdue.edu References: 3669A5B5.E6CB3F5@hda.hydro.com Newsgroups: comp.arch.arithmetic Lines: 57

In article 3669A5B5.E6CB3F5@hda.hydro.com, Terje Mathisen Terje.Mathisen@hda.hydro.com wrote:

Herman Rubin wrote:

In article 74745o$qi$1@joe.rice.edu, Doug Moore dougm@farkas.caam.rice.edu wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word. Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Let me add one more to the list; it would be very useful for such purposes as generating non-uniform random numbers from uniform random bit input.

What is wanted is to read a bit stream from a position given by a bit pointer, find the distance to the next 1 in the stream, and update the pointer, with appropriate interrupts if the stream becomes exhausted or there are no 1's remaining bits in the stream.

Even partial tools would be helpful.

Herman, this application does NOT need any fancy opcodes!

It is quite similar to decoding huffman-encoded (or other variable bit length encoding) data. The fastest way I know to do this is to use one or a couple of lookup tables, which are indexed by the next N bits of the bit stream.

With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table consisting of a count and a set of offsets. Worst case, this will use 9*256 bytes, and it will b faster than any conceivable special opcode which returns just a single entry.

It does not need opcodes, it needs hardware, for the uses envisioned.

It is not a matter of size, but of speed. The use I envision for this is to generate non-uniform random variables, by using small numbers of bits and exact computation to obtain some of the bits of the output random variable, the rest to be filled in by random bits.

There are always competing procedures, less efficient from a theoretical point, but doing a much better job of using standard computer hardware for speed. This is likely to be done millions of times, so the "end effect" problems will arise, and need to be considered for the timing.

This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University. Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 hrubin@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558


Subject: Re: Bit counting and similar instructions Date: Wed, 09 Dec 1998 09:17:12 +0100 From: Terje Mathisen Terje.Mathisen@hda.hydro.com Message-ID: 366E3208.C72F8F4@hda.hydro.com References: 74k75p$juk@b.stat.purdue.edu Newsgroups: comp.arch.arithmetic Lines: 36

Herman Rubin wrote:

In article 3669A5B5.E6CB3F5@hda.hydro.com, Terje Mathisen Terje.Mathisen@hda.hydro.com wrote:

Herman, this application does NOT need any fancy opcodes!

It is quite similar to decoding huffman-encoded (or other variable bit length encoding) data. The fastest way I know to do this is to use one or a couple of lookup tables, which are indexed by the next N bits of the bit stream.

With N=8, each lookup will locate 0 to 8 1 bits, in the form of a table consisting of a count and a set of offsets. Worst case, this will use 9*256 bytes, and it will b faster than any conceivable special opcode which returns just a single entry.

It does not need opcodes, it needs hardware, for the uses envisioned.

It is not a matter of size, but of speed. The use I envision for this is to generate non-uniform random variables, by using small numbers of bits and exact computation to obtain some of the bits of the output random variable, the rest to be filled in by random bits.

OK, I'll bite:

Please post your current C (or other language) algorithm to do this processing, and I'll see if I can find a way to optimize it, preferably to the point where it will be as fast or faster than what you'd get from a single-cycle find_next_1_bit opcode.

Terje

--


Subject: Re: Bit counting and similar instructions Date: Fri, 15 Jan 1999 12:48:52 +0000 From: Suresh Kadiyala suresh@ssofttech.com Message-ID: 369F3934.162F18A5@ssofttech.com References: 74745o$qi$1@joe.rice.edu Newsgroups: comp.arch.arithmetic Lines: 29

Doug Moore wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word

PowerPC has this.

Identify the position of most significant 1 bit in a word. Identify the position of the least significant 1 bit in a word.

I think PPC has the above two.

Report the binary logarithm of an integer.

Of course, these are unusual instructions since they would hard to generate from most programming languages.

Doug Moore (dougm@caam.rice.edu)

Suresh Kadiyala


Subject: Re: Bit counting and similar instructions Date: Mon, 18 Jan 1999 09:36:24 +0100 From: Marc Daumas Marc.Daumas@ENS-Lyon.Fr Message-ID: 36A2F288.BF50D66A@ENS-Lyon.Fr References: 369F3934.162F18A5@ssofttech.com Newsgroups: comp.arch.arithmetic Lines: 22

Suresh Kadiyala wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word

PowerPC has this.

I was told that this operation was regarded as very usefull to break cryptographic codes and thereafter it was ommitted on most architectures so that there willl be no restriction to export.

Truth or legend ?

What instruction is that on the power PC ? I don't know any such instruction on x86.

-- Marc Daumas - Charge de recherches au CNRS (LIP - ENS de Lyon) mailto:Marc.Daumas@ENS-Lyon.Fr - http://www.ens-lyon.fr/~daumas ENS de Lyon - 46, allee d'Italie - 69364 Lyon Cedex 07 - FRANCE Phone: (+33) 4 72 72 83 52 - Fax: (+33) 4 72 72 80 80


Subject: Re: Bit counting and similar instructions Date: 18 Jan 1999 07:51:41 -0500 From: hrubin@b.stat.purdue.edu (Herman Rubin) Message-ID: 77vaot$17h6@b.stat.purdue.edu References: 36A2F288.BF50D66A@ENS-Lyon.Fr Newsgroups: comp.arch.arithmetic Lines: 26

In article 36A2F288.BF50D66A@ENS-Lyon.Fr, Marc Daumas Marc.Daumas@ENS-Lyon.Fr wrote:

Suresh Kadiyala wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word

PowerPC has this.

I was told that this operation was regarded as very usefull to break cryptographic codes and thereafter it was ommitted on most architectures so that there willl be no restriction to export.

One could make this argument about any bit instruction. The cost of doing this is large enough to make some useful algorithms not pay, but not large enough to make much difference in cryptanalysis, where other operations are likely to be done many times.

-- This address is for information only. I do not claim that these views are those of the Statistics Department or of Purdue University. Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 hrubin@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Cache-Post-Path: proxy0.isltd.insignia.com!unknown@christian-mac.isltd.insignia.com


Subject: Re: Bit counting and similar instructions Date: Tue, 19 Jan 1999 11:49:11 +0000 From: christian.bau@isltd.insignia.com (Christian Bau) Message-ID: christian.bau-1901991149110001@christian-mac.isltd.insignia.com References: 36A2F288.BF50D66A@ENS-Lyon.Fr Newsgroups: comp.arch.arithmetic Lines: 37

In article 36A2F288.BF50D66A@ENS-Lyon.Fr, Marc Daumas Marc.Daumas@ENS-Lyon.Fr wrote:

Suresh Kadiyala wrote:

I'd like to know which modern architectures support instructions for any of the following (somewhat related) functions:

Count the number of 1 bits in a word

PowerPC has this.

I was told that this operation was regarded as very usefull to break cryptographic codes and thereafter it was ommitted on most architectures so that there willl be no restriction to export.

Truth or legend ?

What instruction is that on the power PC ? I don't know any such instruction on x86.

There is no single instruction to count bits in a word on the PowerPC, only an instruction to count the number of leading zero bits. Written in C, the following code would be quite efficient on the PowerPC if you do the operation a lot:

static unsigned char lookup_table [2048] = { /* Initialised to the number of bits in i for 0 <= i < 2048 */ }

#define bit_count(n) (lookup_table[((n)>>22) & 0x7ff]
+lookup_table[((n)>>11) & 0x7ff]
+lookup_table[((n)>> 0) & 0x7ff])

Total latency six cycles if a pointer to the lookup table is in a register. I think the Altivec processors (PowerPC 750 + Vector instructions) could be quite fast at it as well. My guess is about 5 cycles for counting 128 bits.


Subject: Re: Bit counting and similar instructions Date: Tue, 19 Jan 1999 13:34:27 +0100 From: Terje Mathisen Terje.Mathisen@hda.hydro.com Message-ID: 36A47BD3.804DF5B0@hda.hydro.com References: christian.bau-1901991149110001@christian-mac.isltd.insignia.com Newsgroups: comp.arch.arithmetic Lines: 67

Christian Bau wrote:

There is no single instruction to count bits in a word on the PowerPC, only an instruction to count the number of leading zero bits. Written in C, the following code would be quite efficient on the PowerPC if you do the operation a lot:

static unsigned char lookup_table [2048] = { /* Initialised to the number of bits in i for 0 <= i < 2048 */ }

#define bit_count(n) (lookup_table[((n)>>22) & 0x7ff]
+lookup_table[((n)>>11) & 0x7ff]
+lookup_table[((n)>> 0) & 0x7ff])

Total latency six cycles if a pointer to the lookup table is in a

A Pentium would be similar:

; assume input value in EAX

mov ebx,eax and eax,7ffh

mov ecx,ebx and ebx,7ffh SHL 11

shr ecx,22 mov eax,lookup_table[eax]

shr ebx,11 mov ecx,lookup_table[ecx]

add eax,ecx mov ebx,lookup_table[ebx]

add eax,ebx

register. I think the Altivec processors (PowerPC 750 + Vector instructions) could be quite fast at it as well. My guess is about 5 cycles for counting 128 bits.

The Altivec might be even faster:

The key would be to use the register-internal 16-element table lookup to convert 16 nibbles into the corresponding bit counts simultaneously.

Doing this operation twice and adding should do the trick.

I don't have an Altivec manual, but the (pseudo-)code should be something like this:

r1 = r0 >> 4; r2 = 0x04030302030202010302020102010100; // 16-element nibble table

r0 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f; r1 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;

r0 = register_table_lookup(r0,r2); r1 = register_table_lookup(r1,r2);

return r0+r1;

Terje

Originator: nmm1@taurus.cus.cam.ac.uk


Subject: Re: Bit counting and similar instructions Date: 19 Jan 1999 12:55:43 GMT From: nmm1@cus.cam.ac.uk (Nick Maclaren) Message-ID: 781vcf$siu$1@pegasus.csx.cam.ac.uk References: 36A47BD3.804DF5B0@hda.hydro.com Newsgroups: comp.arch.arithmetic Lines: 30

In article 36A47BD3.804DF5B0@hda.hydro.com, Terje Mathisen Terje.Mathisen@hda.hydro.com writes: |> Christian Bau wrote: |> > There is no single instruction to count bits in a word on the PowerPC, |> > only an instruction to count the number of leading zero bits. Written in |> > C, the following code would be quite efficient on the PowerPC if you do |> > the operation a lot: ... |> > |> > Total latency six cycles if a pointer to the lookup table is in a |> |> A Pentium would be similar: ... |> |> > .... I think the Altivec processors (PowerPC 750 + Vector |> > instructions) could be quite fast at it as well. My guess is about 5 |> > cycles for counting 128 bits.

Most machines are similar to the above. The benefits of such an instruction would be marginal, at best, given the small number of programs that would benefit.

On the other hand, the cost of implementing such an instruction is usually also marginal ....

Regards, Nick Maclaren, University of Cambridge Computing Service, New Museums Site, Pembroke Street, Cambridge CB2 3QG, England. Email: nmm1@cam.ac.uk Tel.: +44 1223 334761 Fax: +44 1223 334679


Subject: Re: Bit counting and similar instructions Date: Tue, 19 Jan 1999 18:40:34 -0800 From: alexr@I.HATE.SPAM (Alex Rosenberg) Message-ID: alexr-1901991840340001@roseal2.apple.com References: 36A47BD3.804DF5B0@hda.hydro.com Newsgroups: comp.arch.arithmetic Lines: 50

In article 36A47BD3.804DF5B0@hda.hydro.com, Terje Mathisen Terje.Mathisen@hda.hydro.com wrote:

Christian Bau wrote:

register. I think the Altivec processors (PowerPC 750 + Vector instructions) could be quite fast at it as well. My guess is about 5 cycles for counting 128 bits.

The Altivec might be even faster:

The key would be to use the register-internal 16-element table lookup to convert 16 nibbles into the corresponding bit counts simultaneously.

Doing this operation twice and adding should do the trick.

I don't have an Altivec manual, but the (pseudo-)code should be something like this:

Motorola has a manual at: http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf

Apple also has a whole slew of AltiVec info at: http://developer.apple.com/hardware/altivec http://developer.apple.com/tools/mpw-tools/altivec.html

r1 = r0 >> 4; r2 = 0x04030302030202010302020102010100; // 16-element nibble table

r0 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f; r1 &= 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f;

r0 = register_table_lookup(r0,r2); r1 = register_table_lookup(r1,r2);

return r0+r1;

The two masking operations are unnecessary. The permute operation is actually a 5-bit lookup and the 4-bit table can be specified twice to make the fifth bit indifferent. Also, this would only count within each byte. You'd have to add two sum across operations to compute all 128 bits. If you had a large number of bits to count, you could save a few cycles by accumulating a few iterations as bytes before the sum across operations.

Check artcle 6ja3v6$17p4$1@rtpnews.raleigh.ibm.com by Brett Olsson brett@raleigh.ibm.com in your favorite news archive for a 5-bit variant that's substantially similar.

+------------------------------------------------------------+ | Alexander M. Rosenberg mailto:alexr@_spies.com | | Nobody cares what I say. Remove the underscore to mail me. |


Subject: Re: Bit counting and similar instructions Date: 20 Jan 1999 18:44:51 -0800 From: gillies@cs.ubc.ca (Donald Gillies) Message-ID: 7864b3$keh$1@cascade.cs.ubc.ca References: alexr-1901991840340001@roseal2.apple.com Newsgroups: comp.arch.arithmetic Lines: 47

It is not efficient to access memory to count bits on the PowerPC, or on ANY modern RISC processor. The cost of going to memory is roughly 5-10 instruction times. Even going to primary secondary cache is going to hurt your execution time a whole lot. Also, you cannot trust most RISC's to perform efficient byte accesses - many have to extract the byte slowly in the registers, or mask off the upper 24 bits, or whatnot. Moreover, many compilers (green hills) will throw away the base address of a lookup array after just 1 use, forcing a reload of this base address and wasting valuable cycles.

Thus, to count bits on most RISC's, make clever use of arithmetic:

#include "stdio.h"

// (C) Donald W. Gillies, 1992. All rights reserved. You may reuse // this bitcount() function anywhere you please as long as you retain // this Copyright Notice. // #define bitready() register unsigned long tmp #define bitcount(n)
(tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111),
tmp = ((tmp + (tmp >> 3)) & 030707070707),
tmp = (tmp + (tmp >> 6)),
tmp = (tmp + (tmp >> 12) + (tmp >> 24)) & 077)

/* 16 instructions on most risc machines for 32-bit bitcount ! */

main() { bitready();

printf("bitcount(1) = %ld\n", bitcount(1));
printf("bitcount(2) = %ld\n", bitcount(2));
printf("bitcount(3) = %ld\n", bitcount(3));
printf("bitcount(0xff) = %ld\n", bitcount(0xff));
printf("bitcount(0xffffff) = %ld\n", bitcount(0xffffff));

}

% ~/a.out bitcount(1) = 1 bitcount(2) = 1 bitcount(3) = 2 bitcount(0xff) = 8 bitcount(0xffffff) = 24 %


Subject: Re: Bit counting and similar instructions Date: Thu, 21 Jan 1999 09:33:52 +0100 From: Terje Mathisen Terje.Mathisen@hda.hydro.com Message-ID: 36A6E670.ECE65309@hda.hydro.com References: 7864b3$keh$1@cascade.cs.ubc.ca Newsgroups: comp.arch.arithmetic Lines: 111

Donald Gillies wrote:

It is not efficient to access memory to count bits on the PowerPC, or on ANY modern RISC processor. The cost of going to memory is roughly

We were discussing two specific cpus here, both of which will actually run the table-lookup copy quickly, but in general, I agree.

5-10 instruction times. Even going to primary secondary cache is going to hurt your execution time a whole lot. Also, you cannot trust most RISC's to perform efficient byte accesses - many have to extract the byte slowly in the registers, or mask off the upper 24 bits, or whatnot. Moreover, many compilers (green hills) will throw away the base address of a lookup array after just 1 use, forcing a reload of this base address and wasting valuable cycles.

This is a single example of a totally broken compiler, besides if the bitcount is actually time critical, I would implement it in asm anyway.

(I have actually written at leat 5 or 6 different asm versions of this code, see below)

Thus, to count bits on most RISC's, make clever use of arithmetic:

#include "stdio.h"

// (C) Donald W. Gillies, 1992. All rights reserved. You may reuse // this bitcount() function anywhere you please as long as you retain // this Copyright Notice.

AFAIK, this algorithm was known before 1992.

// #define bitready() register unsigned long tmp #define bitcount(n)
(tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111),
tmp = ((tmp + (tmp >> 3)) & 030707070707),
tmp = (tmp + (tmp >> 6)),
tmp = (tmp + (tmp >> 12) + (tmp >> 24)) & 077)

/* 16 instructions on most risc machines for 32-bit bitcount ! */

To really make this version efficient, you need to do the bitcount across an array:

In that case you can speed it up a lot more, by the use of bit-parallel addition:

Three input words can be stored in two vertical words, with a single full adder to merge them.

Similarly seven input words can be converted to three vertical words, and finally you can convert 15 input words to 4 vertical words.

After doing the adds, the actual bitcounting in the resulting words should be handled in parallel, this gets rid of all/most of the stalls.

Robert Harley posted 64-bit Alpha code to do this which ran in (much?) less than a cycle/byte, I have written an MMX asm version which gets similar performance.

Here's an example for a seven-wide counter:

#define FULL_ADD(c1, c0, w0, w1, s1, s2) w1 = s1; c0 = w0; w0 &= w1;
c0 ^= w1; w1 = s2; c1 = c0; c0 ^= w1; c1 &= w1; c1 |= w0

#define MASK55 0x55555555 #define MASK33 0x33333333 #define MASK0F 0x0f0f0f0f

ulong count_bits7(unsigned *src) { unsigned c0, c1, t0, t1, d0, d1, e0, e1, f1, f2;

t0 = src[0];
FULL_ADD(c1, c0, t0, t1, src[1], src[2]); // c1:c0         Up to 4

live vars + src[] FULL_ADD(d1, d0, c0, t1, src[3], src[4]); // d1:d0, c1 Up to 5 live vars + src[] FULL_ADD(e1, e0, d0, t1, src[5], src[6]); // e1:e0, d1, c1 Up to 6 live vars + src[] FULL_ADD(f2, f1, c1, t1, d1, e1); // f2:f1:e0

e0 -= (e0 >> 1) & MASK55;					// 2 bits, 0-2
f1 -= (f1 >> 1) & MASK55;
f2 -= (f2 >> 1) & MASK55;

e0 = (e0 & MASK33) + ((e0 >> 2) & MASK33);	// 4 bits, 0-4
f1 = (f1 & MASK33) + ((f1 >> 2) & MASK33);
f2 = (f2 & MASK33) + ((f2 >> 2) & MASK33);

e0 += e0 >> 4; f1 += f1 >> 4; f2 += f2 >> 4; // 4 bits, 0-8
e0 &= MASK0F; f1 &= MASK0F; f2 &= MASK0F;	// 8 bits, 0-8

e0 += (f1 << 1) + (f2 << 2); 		// 8 bits, 0-8+16+32 = 56
e0 += e0 >> 8;				// 8 bits, 0-112
e0 += e0 >> 16;				// 8 bits, 0-224

return e0 & 255;

}

It is worth noting though that on a plain Pentium, which is very good at byte addressing, and can handle two loads/cycle, the naive byte-wide code ran neck&neck with this much more complicated algorithm!

Terje

--


Terry Ritter, hiscurrent address, and histop page.

Last updated: 1999-02-20