[Python-Dev] Optionally using GMP to implement long if available

Tim Peters tim.peters at gmail.com
Mon Nov 10 05:11:23 CET 2008


[Tim Peters]

.. Whenever two digits are multiplied, the code intends to cast (at least) one of them to "twodigits" first (if you ever see a spot where this doesn't happen, it's a bug).

[Mark Dickinson]

There are indeed one or two spots that seem to be missing a cast, for example the line "rem -= hi*n" in inplace_divrem1.

Definitely a bug! Alas, it's not surprising -- there are no platforms on which this bug has a visible consequence (because digit is currently type unsigned short, C promotes hi and n to int before multiplying, and on all platforms to date a C int is at least 32 bits, so the multiply is at least 32x32->32 despite the lack of a twodigits cast).
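To spell it out, here's the shape of the loop in question (a simplified from-memory sketch of inplace_divrem1 in longobject.c, not a verbatim copy), with the cast written the way it was intended:

    /* Divide the multi-digit number pin[0..size-1] in place by the
       single digit n; store the quotient digits in pout and return
       the remainder.  digit has 15 value bits; twodigits is a
       32-bit unsigned type. */
    typedef unsigned short digit;
    typedef unsigned int twodigits;
    #define SHIFT 15

    static digit
    inplace_divrem1(digit *pout, digit *pin, int size, digit n)
    {
        twodigits rem = 0;

        pin += size;
        pout += size;
        while (--size >= 0) {
            digit hi;
            rem = (rem << SHIFT) | *--pin;
            *--pout = hi = (digit)(rem / n);
            rem -= (twodigits)hi * n;  /* the cast Mark found missing */
        }
        return (digit)rem;
    }

Without the cast, hi*n is formally a digit-times-digit multiply; it only gives the right answer today because the integer promotions widen both operands to a 32-bit int first. Make digit 30 bits wide and that product no longer fits in 32 bits, so the subtraction silently loses high bits.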

... 3. There's no way to know exactly what compilers will do with this short of staring at generated code.

Yes---I've done the staring for gcc, so I now have precisely one data point, which is that various flavours of gcc on x86/x86_64 are clever enough to turn

(uint64_t)my_uint32 * my_other_uint32 into the appropriate HW instruction.

Nice! Is this a documented feature? "The problem" is that it probably depends on a combination of gcc version and compilation flags, and because /knowing/ whether it works requires staring at generated code, there's probably no sane way to automate detection of when, if, and under what conditions it breaks. "Serious" packages use assembler to avoid all this uncertainty.
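For anyone who wants to poke at their own compiler, the pattern at issue is just this (the function name below is mine, not from the patch):

    #include <stdint.h>

    /* Various gcc versions on x86/x86_64 compile this to a single
       widening 32x32->64 multiply rather than extending both
       operands and computing a full 64x64 product.  The C standard
       only guarantees the result, not the instruction, so this is
       an observation about gcc, not a portable promise. */
    uint64_t
    wide_mul(uint32_t a, uint32_t b)
    {
        return (uint64_t)a * b;
    }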

Unfortunately I don't have easy access to other compilers or platforms right now. :-(

Sorry, neither do I. If you can dream up a way to automate testing for generation of the desired HW instructions, you could post the test program here and I bet a lot of people would run it. Maybe even if you just described what to look for "by eyeball" in the generated code.
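Something like this, perhaps (the flags, file names, and instructions to expect are guesses and will vary by platform):

    gcc -O2 -S wide_mul.c -o wide_mul.s

On 32-bit x86 the telling sign in wide_mul.s is a single mull: that's the widening HW multiply. If instead you see the three-multiply mull/imull-plus-adds sequence, gcc has fallen back to a genuine 64x64 product and the hint was wasted.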

Am still working on the benchmarking, but I'm definitely seeing speedup on gcc/x86---between 10% and 100% depending on the operations.

Sure -- it's not surprising that HW crunching more bits at a time is significantly faster.
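(To put a number on the headroom: schoolbook multiplication is quadratic in the digit count, so a 1024-bit operand that takes 69 digits at 15 bits needs only 35 at 30 bits, and 69**2 / 35**2 is roughly a factor of 3.9 fewer inner-loop multiplies.)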

FYI, in a previous life working in speech recognition, under Microsoft's Visual Studio 6 the only way we found to get at the Pentium's 32x32->64 HW ability efficiently was via inline assembler.

Urk. We definitely don't want to go there. Though I guess this is how packages like gmp and GP/Pari manage.

  1. I have no idea what versions of Microsoft's compiler after MSVC 6 do here; perhaps it's no longer an issue (the Windows Python distro no longer uses MSVC 6); see the sketch after this list.

  2. If this is thought to be worth having, then on very widely used platforms I think a good case /can/ be made for incorporating some assembler.

  3. GMP is "speed at any cost" -- they use assembler even when it's a stupid idea ;-)
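If someone with a newer MSVC wants to check point 1, one concrete thing to try is the __emulu intrinsic (assuming their compiler version has it -- I haven't verified which do) and see whether it produces the single widening MUL:

    #include <intrin.h>  /* MSVC-specific header */

    /* __emulu is documented as the unsigned 32x32->64 multiply
       intrinsic on x86; the function name here is invented. */
    unsigned __int64
    wide_mul_msvc(unsigned int a, unsigned int b)
    {
        return __emulu(a, b);
    }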

.. But maybe it's possible to write portable code (by providing fallbacks) that turns out to be efficient on the majority of mainstream systems?

If "it works" under the gcc and Windows compilers du jour on x86 systems, that probably covers over 90% of Python installations. Good enough -- stop before it gets pointlessly hard ;-)

The extent of the ifdef'ery in the patch is really rather small: one (compound) #ifdef in longintrepr.h for defining digit, twodigits, stwodigits, etc., and a couple more for the places where digits are read and written in marshal.c.
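Schematically, the longintrepr.h piece is shaped like the following (a simplified sketch -- the macro and type names in the real patch may differ, and PYLONG_30BIT_DIGITS here is invented):

    #include <stdint.h>

    #ifdef PYLONG_30BIT_DIGITS     /* invented configure switch */
    typedef uint32_t digit;
    typedef uint64_t twodigits;
    typedef int64_t stwodigits;    /* signed counterpart */
    #define SHIFT 30
    #else                          /* status quo: 15-bit digits */
    typedef unsigned short digit;
    typedef unsigned int twodigits;
    typedef int stwodigits;
    #define SHIFT 15
    #endif
    #define MASK ((digit)(((twodigits)1 << SHIFT) - 1))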

But so far it only works with an unknown subset of gcc versions, right? These things don't get simpler, alas :-(

I agree that very-long-integer optimizations probably don't really belong in Python,

Depends in part on whether Python can attract as many obsessed maintainers and porters for such gonzo algorithms as GMP attracts ;-)

Well, you can count me in. :)

Last I looked (which was at least 3 years ago), GMP's source code was bigger than all of Python's combined. For a start, I'll have the PSF draw up a contract obligating you to lifetime servitude :-)


