Paul Hsieh's Assembly Lab (original) (raw)

unsigned c, d; /* Divide out low powers of 2 if we can (to decrease a, b) */ d = 1;
while (((a|b) & 1) == 0) {
a >>= 1;
b >>= 1;
d <<= 1;**
**} goto start; do {**
**/* Find largest 2^t*b, less than a */**
**c = b;**
**do { c += c; } while (c <= a); /* a -= 2^t*b */**
**a -= (c >> 1); start:; /* Make sure a > b, and b != 0 */

if (a < b) {**
**if (a == 0) return d * b;**
**c = a; a = b; b = c;**
**}**
**} while (a >= SLIM); /* Return with precalculated small gcds */

return d * smallGcds[a][b];
}
Obviously some cases of (a,b) input need to be checked (a=0, b=0 sends the program into an infinite loop.)
I believe that it might be possible to improve upon this by performing table look ups on several bits of each operand at once which could give prime linear factors which could be used to reduce the operands by several steps in one shot. So I am refraining from performing assembly language analysis until I resolve this.
Update: There have been a number of submissions that have come to my attention. The first is fromPat D:
The second submission is from Ville Miettinen who works (or worked) for hybrid graphics using the cmovCC instructions:

#define SW (sizeof(int)/sizeof(char)) p=s-1;
do {
p++;
if( (((int)p)&(SW-1))==0 ) {
do {
d = *((int *)p);
p += SW;
d = (((d&M1)+M1)|d)&M2;
} while( d==M2 );
p -= SW;
}
} while( *p!=0 );
return p-s;
}
It compiles to substantially the same thing (on a P6 it should have the same performance) but it has a few problems with it. Its not really portable to different sizes of int (the numbers 0x7f7f7f7f and 0x80808080 would have to be truncated or expanded, and casting p from a pointer to a scalar is not guaranteed to work) and its not any easier to understand than the assembly.
Note that although alignment on reads is not necessary on modern x86's it is performed here, as a means of avoiding a memory read failure. This can happen as a result of reading past the end of the string, but under the assumption that your memory allocator aligns to at least DWORD boundaries.
Update!
Ryan Mack has sent in an MMX version of this loop:
; MMX version by Ryan Mack
; Roughly 13 + 3n + BRANCH clocks on a P-II const unsigned __int64 STRINGTBL[8] = {0, 0xff,
0xffff, 0xffffff, 0xffffffff, 0xffffffffff,
0xffffffffffff, 0xffffffffffffff} /* ... */ pxor mm1, mm1
mov ecx, eax
mov edx, eax
and ecx, -8
and eax, 7
movq mm0, [ecx]
por mm0, [STRINGTBL+eax*8]
MAIN:
add ecx, 8
pcmpeqb mm0, mm1
packsswb mm0, mm0
movd eax, mm0
movq mm0, [ecx]
test eax, eax
jz MAIN bsf eax, eax
shr eax, 2
lea ecx, [ecx+eax-8]
sub ecx, edx
The individual loop iterations are not dependent on each other, so on a modern out of order processor this is just instruction issue limited. On a P-II we have 6 ALU microops + 1 load microop, which should translate to a 3 clock per 8 byte throughput. On AMD processors there are 7 riscops which is 3 clocks on an Athlon and 4 clocks on a K6 per 8 byte throughput.
Performance measurements on an Athlon show that (ignoring branch misprediction) for long strings this 64 bit method is upwards of 55% faster than the previous 32 bit method (which in turn is 22% faster than the WATCOM clib implementation.) However, for very short strings, this method is as much as 3 times slower then the previous method (which is about the same same speed as as the WATCOM clib implementation.) But keep in mind that if your strings are shorter, the branch mispredicts (which were not modeled in my tests) will take a comparatively larger percentage of the time. In any event, if you intend to use one of these alternatives, you should performance measure them to see which is better for your application.
Update!
Norbert Juffa has written in to inform me of an old trick which is a significant improvement over the device I discovered:
After a few minutes of research it turns out that the original null byte detection by Alan Mycroft is superior to the formula given on your website as it has shorter dependency chains. On a PIII, strcpy() performance jumped almost 20% from substituting one for the other. Mycroft's macro is
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
I rendered this as:
| mov edx, eax not eax sub edx, 0x01010101 and eax, 0x80808080 and eax, edx | ; cycle 1 ; cycle 2 ; cycle 3 |
| ----------------------------------------------------------------------------- | --------------------------------- |
BTW, one of Mycroft's original posts about this macro can be found at Deja. The post is dated 1987-04-27!
I did a little digging and indeed you can find an old thread discussing this on google groups. So here's an implementation in C:
Compared to other solutions on an Athlon, this is indeed the fastest solution for strings up to about 32 characters (the MMX solution is faster for longer strings). This function greatly outperforms the strlen of most libraries.

test ecx, 32 ; sh >= 32 ?
jz $lshift_done ; nope, done
mov edx, eax ; sll (i,32)
xor eax, eax ;
$lshift_done:
cmp ecx, 64 ; (sh>=64)||(sh<0)**
**sbb ecx, ecx ; (sh>=64)||(sh<0) ? 0 : 0xffffffff**
**and eax, ecx ; (sh>=64)||(sh<0) ? 0 : sll (i,sh)**
**and edx, ecx ;**
**#else /* Athlon, P6+ */**
**test ecx, -64 ; (sh>=64)||(sh<0) ? NZ : ZR**
**ror ecx, 6 ; (64>sh>=32) ? CY : NC(ZF safe)

mov ecx, 0 ; 0
cmovc edx, eax ; i=(64>sh>=32) ? sll(i,32) : i
cmovc eax, ecx ;
cmovnz edx, ecx ; (sh>=64)||(sh<0) ? 0 : sll (i,sh)
cmovnz eax, ecx ;
#endif
ret 8 ; pop two DWORD parms and ret
}
}
shld, cmovcc, and ror are not likely to show up in the output of any known compiler.