CLMUL instruction set

From Wikipedia, the free encyclopedia

Extension to the x86 instruction set

Carry-less Multiplication (CLMUL) is an extension to the x86 instruction set used by microprocessors from Intel and AMD. It was proposed by Intel in March 2008[1] and made available in the Intel Westmere processors announced in early 2010. Mathematically, the instruction implements multiplication of polynomials over the finite field GF(2), where the bitstring $a_{0}a_{1}\ldots a_{63}$ represents the polynomial $a_{0}+a_{1}X+a_{2}X^{2}+\cdots +a_{63}X^{63}$. The CLMUL instruction also allows a more efficient implementation of the closely related multiplication in larger finite fields GF(2^k) than the traditional instruction set does.[2]
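The correspondence between bitstrings and polynomials can be illustrated in software. The following is a minimal portable C sketch of a 64-bit by 64-bit carry-less multiplication, not the hardware instruction itself. The names `clmul128` and `clmul64_soft` are hypothetical, and bit i of each operand is assumed to hold the coefficient of X^i.

```c
#include <stdint.h>

/* 128-bit carry-less product, split into two 64-bit halves
 * (hypothetical type used only for this illustration). */
typedef struct {
    uint64_t lo;  /* coefficients of X^0  .. X^63  */
    uint64_t hi;  /* coefficients of X^64 .. X^126 */
} clmul128;

/* Shift-and-XOR multiplication of two polynomials over GF(2).
 * For every set bit i of b, a * X^i is added to the result;
 * addition over GF(2) is XOR, so no carries propagate. */
static clmul128 clmul64_soft(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)               /* avoid an undefined shift by 64 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

For example, `clmul64_soft(3, 3)` yields `lo = 5`, because (X + 1)(X + 1) = X^2 + 1 over GF(2), whereas ordinary integer multiplication of 3 by 3 gives 9.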

One use of these instructions is to improve the speed of applications doing block cipher encryption in Galois/Counter Mode, which depends on finite field GF(2^k) multiplication. Another application is the fast calculation of CRC values,[3] including those used to implement the LZ77 sliding window DEFLATE algorithm in zlib and pngcrush.[4]
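Multiplication in GF(2^k) consists of a carry-less product followed by reduction modulo an irreducible polynomial. The sketch below illustrates this two-step structure in the small field GF(2^8) with the modulus X^8 + X^4 + X^3 + X + 1 (0x11B, the polynomial used by AES). It is only an illustration: Galois/Counter Mode works in GF(2^128) with its own modulus and bit ordering, and the function names here are hypothetical.

```c
#include <stdint.h>

/* Carry-less product of two 8-bit polynomials; the result has degree <= 14. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((b >> i) & 1)
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* Reduce a polynomial of degree <= 14 modulo X^8 + X^4 + X^3 + X + 1. */
static uint8_t gf256_reduce(uint16_t p)
{
    for (int bit = 14; bit >= 8; bit--)
        if ((p >> bit) & 1)
            p ^= (uint16_t)(0x11Bu << (bit - 8));  /* cancel the top term */
    return (uint8_t)p;
}

/* Field multiplication = carry-less multiplication + modular reduction. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    return gf256_reduce(clmul8(a, b));
}
```

With this modulus, `gf256_mul(0x57, 0x83)` evaluates to `0xC1`: the carry-less product is `0x2B79`, which reduces to `0xC1`.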

ARMv8 also has a version of CLMUL. SPARC calls their version XMULX, for "XOR multiplication".

The instruction computes the 128-bit carry-less product of two 64-bit values. The destination is a 128-bit XMM register. The source may be another XMM register or memory. An immediate operand specifies which halves of the 128-bit operands are multiplied. Mnemonics specifying specific values of the immediate operand are also defined:

Instruction	Opcode	Description
PCLMULQDQ xmmreg,xmmrm,imm	[rmi: 66 0f 3a 44 /r ib]	Perform a carry-less multiplication of two 64-bit polynomials over the finite field GF(2).
PCLMULLQLQDQ xmmreg,xmmrm	[rm: 66 0f 3a 44 /r 00]	Multiply the low halves of the two registers.
PCLMULHQLQDQ xmmreg,xmmrm	[rm: 66 0f 3a 44 /r 01]	Multiply the high half of the destination register by the low half of the source register.
PCLMULLQHQDQ xmmreg,xmmrm	[rm: 66 0f 3a 44 /r 10]	Multiply the low half of the destination register by the high half of the source register.
PCLMULHQHQDQ xmmreg,xmmrm	[rm: 66 0f 3a 44 /r 11]	Multiply the high halves of the two registers.
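The way the immediate operand selects the halves can be modeled in portable C. The sketch below is a software model under stated assumptions, not the instruction itself: `xmm_model`, `clmul64_model`, and `pclmulqdq_model` are hypothetical names, and the model assumes that bit 0 of the immediate selects the half of the destination and bit 4 selects the half of the source, matching the four mnemonics in the table. On compilers that provide it, the instruction is normally reached through the `_mm_clmulepi64_si128` intrinsic declared in `<wmmintrin.h>`.

```c
#include <stdint.h>

/* Software model of a 128-bit XMM register as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} xmm_model;

/* 64x64 -> 128-bit carry-less product (shift-and-XOR). */
static xmm_model clmul64_model(uint64_t a, uint64_t b)
{
    xmm_model r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* Model of PCLMULQDQ dst, src, imm8:
 *   imm8 bit 0: 0 = low half of dst, 1 = high half of dst
 *   imm8 bit 4: 0 = low half of src, 1 = high half of src
 * The returned value models the new contents of the destination. */
static xmm_model pclmulqdq_model(xmm_model dst, xmm_model src, uint8_t imm8)
{
    uint64_t a = (imm8 & 0x01) ? dst.hi : dst.lo;
    uint64_t b = (imm8 & 0x10) ? src.hi : src.lo;
    return clmul64_model(a, b);
}
```

In this model the immediates 0x00, 0x01, 0x10, and 0x11 correspond to PCLMULLQLQDQ, PCLMULHQLQDQ, PCLMULLQHQDQ, and PCLMULHQHQDQ respectively.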

An EVEX-encoded vectorized version (VPCLMULQDQ) is available with AVX-512.

CPUs with CLMUL instruction set

Intel:
- Westmere processor (March 2010)
- Sandy Bridge processor
- Ivy Bridge processor
- Haswell processor
- Broadwell processor (with increased throughput and lower latency[5])
- Skylake (and later) processor
- Goldmont processor
AMD:
- Jaguar-based processors and newer[6]
- Puma-based processors and newer
- "Heavy Equipment" processors
  * Bulldozer-based processors[7]
  * Piledriver-based processors
  * Steamroller-based processors
  * Excavator-based processors and newer
- Zen processors
- Zen+ processors
- Zen2 (and later) processors

The presence of the CLMUL instruction set can be checked by testing one of the CPU feature bits: CPUID leaf 1 (EAX = 1) reports PCLMULQDQ support in bit 1 of ECX.
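A feature-bit test of this kind might look as follows in C. This is a sketch that assumes a GCC- or Clang-compatible toolchain providing the `<cpuid.h>` helper `__get_cpuid`, and it relies on PCLMULQDQ being reported in bit 1 of ECX for CPUID leaf 1. The function name `cpu_has_pclmulqdq` is hypothetical; on targets where the x86 macros are not defined, the sketch simply reports no support.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Returns true if CPUID leaf 1 reports PCLMULQDQ (ECX bit 1). */
static bool cpu_has_pclmulqdq(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;            /* leaf 1 not available */
    return ((ecx >> 1) & 1u) != 0;
#else
    return false;                /* not an x86 build: no PCLMULQDQ */
#endif
}
```

A program would typically call such a check once at start-up and fall back to a software carry-less multiplication when it returns false.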

[1] "Intel Software Network". Intel. Archived from the original on 2008-04-07. Retrieved 2008-04-05.
[2] Shay Gueron; Michael E. Kounavis (2014-04-20). "Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode – Rev 2.02" (PDF). Intel. Archived from the original on 2019-08-06.
[3] "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ" (PDF).
[4] Vlad Krasnov (2015-07-08). "Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code". CloudFlare. Retrieved 2016-09-04.
[5] Johan De Gelas (2016-03-31). "The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads". Anandtech. p. 3. Archived from the original on 2016-03-31.
[6] "Slide detailing improvements of Jaguar over Bobcat". AMD. 2012-08-29. Retrieved 2013-08-03.
[7] Dave Christie (2009-05-06). "Striking a balance". AMD Developer blogs. Archived from the original on 2013-11-09. Retrieved 2011-03-11.