Further BigInteger performance improvements

Alan Eliasen eliasen at mindspring.com
Thu Jun 4 22:58:48 UTC 2009


Andrew Haley wrote:

> You give examples of the speedup for very large bignums, but you don't say the size of numbers at which your approach becomes faster than the current code. Of course any asymptotic improvement helps with numbers that are half a million decimal digits long, but where's the crossover?

Andrew,

Good question. Like most Karatsuba and Toom-Cook implementations, the optimal crossover point is vague: it depends on the sizes of both operands, and the performance curves don't diverge quickly near that point. If you look at the updated file (again, at http://futureboy.us/temp/BigInteger.java ) you'll see the crossover points:

/**
 * The threshold value for using Karatsuba multiplication.  If the
 * number of ints in both mag arrays is greater than this number,
 * then Karatsuba multiplication will be used.  This value is found
 * experimentally to work well.
 */
private static final int KARATSUBA_THRESHOLD = 50;

/**
 * The threshold value for using 3-way Toom-Cook multiplication.
 * If the number of ints in both mag arrays is greater than this
 * number, then Toom-Cook multiplication will be used.  This value
 * is found experimentally to work well.
 */
private static final int TOOM_COOK_THRESHOLD = 75;

/**
 * The threshold value for using Karatsuba squaring.  If the number
 * of ints in the number is larger than this value, Karatsuba
 * squaring will be used.  This value is found experimentally to
 * work well.
 */
private static final int KARATSUBA_SQUARE_THRESHOLD = 90;

/**
 * The threshold value for using Toom-Cook squaring.  If the number
 * of ints in the number is larger than this value, Toom-Cook
 * squaring will be used.  This value is found experimentally to
 * work well.
 */
private static final int TOOM_COOK_SQUARE_THRESHOLD = 140;

Thus, the first crossover for Karatsuba multiplication is at 50*32 bits (1600 bits), or about 482 decimal digits; Toom-Cook multiplication kicks in at 75*32 bits (2400 bits), or about 723 decimal digits. (Hint: divide the number of bits by log2(10), about 3.32, to get the approximate number of decimal digits.)
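To make the recursion concrete, here is a minimal Karatsuba sketch over BigInteger. It is illustrative only: the actual patch works directly on the int[] mag arrays, and the 1600-bit cutoff below merely mirrors the threshold above.

import java.math.BigInteger;

public class KaratsubaSketch {
    // Split each operand at half bits and reduce four half-size
    // multiplies to three, recombining with shifts and adds.
    static BigInteger karatsuba(BigInteger x, BigInteger y) {
        int n = Math.max(x.bitLength(), y.bitLength());
        if (n <= 1600) {                    // below threshold: grade school
            return x.multiply(y);
        }
        int half = n / 2;
        BigInteger mask = BigInteger.ONE.shiftLeft(half).subtract(BigInteger.ONE);
        BigInteger xh = x.shiftRight(half), xl = x.and(mask);
        BigInteger yh = y.shiftRight(half), yl = y.and(mask);
        BigInteger p1 = karatsuba(xh, yh);                  // high parts
        BigInteger p2 = karatsuba(xl, yl);                  // low parts
        BigInteger p3 = karatsuba(xh.add(xl), yh.add(yl));  // cross terms
        // x*y = p1*2^(2*half) + (p3 - p1 - p2)*2^half + p2
        return p1.shiftLeft(2 * half)
                 .add(p3.subtract(p1).subtract(p2).shiftLeft(half))
                 .add(p2);
    }
}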

These crossovers are determined experimentally, of course, and the constants are chosen to work well for a variety of number forms and operand combinations; these particular values emerged from a lot of timing runs over many different number sizes and forms. Different thresholds may well work better on other architectures or VM settings. I'm not proposing we tune for each architecture, but rather choose conservative crossover thresholds that work well in general, which I believe these are.

It generally isn't damaging if you're just "in the ballpark" when setting your thresholds: the algorithms will still work fine, and performance will be very similar for most operands. The curves don't diverge quickly around the crossover point, there's a lot of random variation for different operand sizes, and all of the threshold points interact with each other, so this is a multi-dimensional optimization problem.

In fact, the effects of choosing different thresholds are hard to predict. There may be small local minima and maxima in performance whether you choose, say, 45 or 48 or 50 or 52 for the crossover, and operands of different lengths may behave quite differently with different crossovers. You won't determine a good crossover point by asymptotic analysis. You also have to check whether a given combination of crossovers exhibits anomalously bad behavior for certain combinations of operand sizes, and avoid those, which is why this is as much black art as science.
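As an illustration of the kind of experiment involved (this harness is hypothetical, not part of the patch), one can time multiplies across a sweep of operand sizes, then rebuild with different threshold constants and compare the sweeps:

import java.math.BigInteger;
import java.util.Random;

public class CrossoverProbe {
    public static void main(String[] args) {
        Random rnd = new Random(42);          // fixed seed for repeatability
        for (int ints = 20; ints <= 200; ints += 10) {
            BigInteger a = new BigInteger(ints * 32, rnd);
            BigInteger b = new BigInteger(ints * 32, rnd);
            long start = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                a.multiply(b);
            }
            long nanos = System.nanoTime() - start;
            System.out.printf("%3d ints: %8.2f us/multiply%n",
                              ints, nanos / 10000 / 1000.0);
        }
    }
}

A real tuning run would also discard JIT warm-up iterations and repeat the sweep over many number forms, not just uniformly random operands.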

The crossover point also depends on the relative efficiency of your "grade school" multiplication, your Karatsuba multiplication, and your Toom-Cook implementation (and there are many different ways to split and combine numbers in the Toom-Cook method). My Toom-Cook method is the "optimal" algorithm found by Marco Bodrato (see the citations in the source code), who is perhaps the leading researcher into Toom-Cook multiplication. He also analyzed this patch and corrected my original Toom-Cook to a slightly different, more efficient variation, which moved the Toom-Cook threshold downwards somewhat because it was a bit faster.
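For a flavor of the structure, here is a sketch of 3-way Toom-Cook over BigInteger with a Bodrato-style interpolation sequence. The names and the 2400-bit cutoff are illustrative, not the patch's int[]-based code; see Bodrato's papers for the exact sequence used there. Each operand is split into three base-2^k digits, evaluated at five points, multiplied pointwise, and recombined with exact divisions.

import java.math.BigInteger;

public class Toom3Sketch {
    static final BigInteger THREE = BigInteger.valueOf(3);

    static BigInteger toom3(BigInteger x, BigInteger y) {
        int sign = x.signum() * y.signum();
        x = x.abs();
        y = y.abs();
        int n = Math.max(x.bitLength(), y.bitLength());
        if (n <= 2400) {                    // below threshold: fall back
            BigInteger p = x.multiply(y);
            return sign < 0 ? p.negate() : p;
        }
        int k = n / 3 + 1;                  // bits per base-2^k digit
        BigInteger mask = BigInteger.ONE.shiftLeft(k).subtract(BigInteger.ONE);
        BigInteger x0 = x.and(mask), x1 = x.shiftRight(k).and(mask), x2 = x.shiftRight(2 * k);
        BigInteger y0 = y.and(mask), y1 = y.shiftRight(k).and(mask), y2 = y.shiftRight(2 * k);

        // Five multiplies at the points t = 0, 1, -1, -2, infinity.
        BigInteger p0   = toom3(x0, y0);
        BigInteger p1   = toom3(x0.add(x1).add(x2), y0.add(y1).add(y2));
        BigInteger pm1  = toom3(x0.subtract(x1).add(x2), y0.subtract(y1).add(y2));
        BigInteger pm2  = toom3(x0.subtract(x1.shiftLeft(1)).add(x2.shiftLeft(2)),
                                y0.subtract(y1.shiftLeft(1)).add(y2.shiftLeft(2)));
        BigInteger pinf = toom3(x2, y2);

        // Interpolation; every division here is exact.
        BigInteger r3 = pm2.subtract(p1).divide(THREE);
        BigInteger r1 = p1.subtract(pm1).shiftRight(1);
        BigInteger r2 = pm1.subtract(p0);
        r3 = r2.subtract(r3).shiftRight(1).add(pinf.shiftLeft(1));
        r2 = r2.add(r1).subtract(pinf);
        r1 = r1.subtract(r3);

        // Recombine: pinf*2^(4k) + r3*2^(3k) + r2*2^(2k) + r1*2^k + p0
        BigInteger result = pinf.shiftLeft(4 * k).add(r3.shiftLeft(3 * k))
                .add(r2.shiftLeft(2 * k)).add(r1.shiftLeft(k)).add(p0);
        return sign < 0 ? result.negate() : result;
    }
}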

There are also possible future optimizations for "unbalanced" operand sizes, where one operand is significantly larger than the other. There are unbalanced Toom-Cook multiplies that can achieve better performance than the straight 3-way Toom-Cook. And of course higher-order Toom-Cook multiplications are possible for even larger numbers, but the marginal benefit of higher orders diminishes, and the code grows more complex.

Again, these crossovers were found experimentally to work well. They still work well in tests with Xiaobin's improvements to multiplication, but I may run yet another set of exhaustive tuning tests to see if the thresholds can be improved again once my massive regression test finishes, hopefully in the next 24 hours.

Note that my optimizations to the pow() function give vastly better performance at even small bit sizes for many operands, as they factor powers of 2 out of the base and apply them very rapidly as bit-shifts.
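A minimal sketch of that idea (illustrative names only; the patch's actual pow() does considerably more): write the base as 2^k * m with m odd, exponentiate only the odd part, and apply the power-of-two factor with a single shift.

import java.math.BigInteger;

public class PowSketch {
    static BigInteger powWithShift(BigInteger base, int exponent) {
        if (base.signum() == 0) {
            return exponent == 0 ? BigInteger.ONE : BigInteger.ZERO;
        }
        int k = base.getLowestSetBit();          // trailing zero bits of base
        BigInteger oddPart = base.shiftRight(k); // base = oddPart * 2^k
        // Only the (smaller) odd part is exponentiated; the power of two
        // is applied at the end as one cheap shift: base^e = oddPart^e << (k*e).
        return oddPart.pow(exponent).shiftLeft(k * exponent);
    }

    public static void main(String[] args) {
        BigInteger base = BigInteger.valueOf(48);   // 48 = 3 * 2^4
        System.out.println(powWithShift(base, 20).equals(base.pow(20)));  // true
    }
}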

--
  Alan Eliasen
  eliasen at mindspring.com
  http://futureboy.us/


