Issue 46504: Faster code for trial quotient in x_divrem

x_divrem1() was recently (bpo-46406) changed to generate faster code for division, essentially nudging optimizing compilers into recognizing that modern processors compute the quotient and remainder with a single machine instruction.

The same can be done for x_divrem(), although it's less valuable there because the hardware division generally accounts for a much smaller share of its total runtime.

Still, it does cut a multiply and a subtract out of the loop, and makes the code more obvious (since it brings x_divrem1() and x_divrem() back into sync).
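
To make the difference concrete, here is a minimal sketch of the two ways to get a trial quotient and its remainder. This is not the actual longobject.c code: the typedefs are simplified stand-ins for CPython's digit and twodigits, and the function names trial_mul_sub() and trial_div_mod() are hypothetical, chosen only for illustration. The names vv and wm1 are likewise assumed here to mean the two-digit dividend chunk and the leading divisor digit.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for CPython's types (an assumption for this sketch). */
typedef uint32_t digit;
typedef uint64_t twodigits;

/* Older style: divide once, then recover the remainder with a
   multiply and a subtract. */
static void trial_mul_sub(twodigits vv, digit wm1, digit *q, digit *r)
{
    assert(wm1 != 0);
    *q = (digit)(vv / wm1);
    *r = (digit)(vv - (twodigits)wm1 * *q);
}

/* Newer style: write / and % on the same operands next to each other,
   so an optimizing compiler can take both results from one hardware
   division on processors that produce quotient and remainder together. */
static void trial_div_mod(twodigits vv, digit wm1, digit *q, digit *r)
{
    assert(wm1 != 0);
    *q = (digit)(vv / wm1);
    *r = (digit)(vv % wm1);
}
```

Both functions compute the same values; the second just states the intent directly. Whether the compiler really does fuse the two operations depends on the compiler, the optimization level, and the target.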