QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine

Massively parallel quantum chromodynamics

IBM Journal of Research and Development, 2008

Quantum chromodynamics (QCD), the theory of the strong nuclear force, can be numerically simulated on massively parallel supercomputers using the method of lattice gauge theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the supercomputer hardware architectures for which LQCD is best suited. We demonstrate these methods on the IBM Blue Gene/L (BG/L) massively parallel supercomputer and argue that the BG/L architecture is very well suited for LQCD studies. This suitability arises from the fact that LQCD is a regular discretization of space into lattice sites, while the BG/L supercomputer is a discretization of space into compute nodes. Both LQCD and the BG/L architecture are constrained by the requirement of short-distance exchanges. This simple relation is technologically important and theoretically intriguing. We demonstrate a computational speedup of LQCD using up to 131,072 CPUs on the largest BG/L supercomputer available in 2007. As the number of CPUs is increased, the speedup scales linearly, with sustained performance of about 20% of the maximum possible hardware speed, corresponding to a maximum of 70.5 sustained teraflops. At these speeds, LQCD on the BG/L supercomputer can produce theoretical results for the next generation of strong-interaction physics.
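The short-distance-exchange correspondence above can be made concrete with a small sketch (the lattice and node-grid sizes below are hypothetical, chosen only for illustration): block-decompose the lattice over a mesh of compute nodes and count the sites that must cross the network each iteration.

```python
from math import prod

# Sketch (hypothetical sizes): why a mesh of nodes suits a regular lattice.
# The 4D lattice is block-decomposed, each node keeps one block, and only
# the block faces cross the network -- the same nearest-neighbour pattern
# that the machine's interconnect provides in hardware.

def decompose(global_lat, node_grid):
    """Local block shape when lattice axis i is split over node_grid[i]
    nodes (axes beyond len(node_grid) stay entirely on-node)."""
    local = list(global_lat)
    for i, n in enumerate(node_grid):
        assert global_lat[i] % n == 0, "lattice axis must divide evenly"
        local[i] = global_lat[i] // n
    return tuple(local)

def surface_fraction(local, n_split):
    """Fraction of a block's sites lying on a face of a split axis:
    the data exchanged with neighbouring nodes every iteration."""
    volume = prod(local)
    interior = prod(d - 2 if i < n_split else d for i, d in enumerate(local))
    return (volume - interior) / volume

local = decompose((32, 32, 32, 64), (4, 4, 8))
print(local, surface_fraction(local, 3))  # -> (8, 8, 4, 64) 0.71875
```

The point of the sketch is that all communication is to nearest neighbours only, so a machine whose network links nodes the same way loses nothing to long-distance traffic.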

Lattice gauge theory on a multi-core processor, Cell/B.E

Procedia Computer Science, 2011

We report our experience implementing a lattice gauge theory code on the Cell Broadband Engine, a new heterogeneous multi-core processor. As a typical operation we take SU(3) matrix multiplication, one of the most important kernels of lattice gauge theories. Taking full advantage of the Cell/B.E., including its SIMD operations and many registers, which allow the arithmetic units to be kept fully busy through loop unrolling, we obtain about 200 GFLOPS with 16 SPEs, corresponding to around 80% of the theoretical peak. To our knowledge, this is the fastest rate reported for this operation on the Cell/B.E. so far. However, when we measure the whole time including the data supply, the speed drops to about 13 GFLOPS. We found that the bandwidth of the data transfer between main memory and the EIB, 25 GB/s, is the bottleneck. In other words, it is possible to run the arithmetic units of the Cell/B.E. at 200 GFLOPS, but the current socket structure of the Cell/B.E. prevents it. We discuss several techniques that partially alleviate the problem by reducing the amount of transferred data.
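The bandwidth bottleneck described above can be retraced with a back-of-the-envelope sketch (in Python/NumPy rather than SPE intrinsics): counting the flops and bytes of a single-precision SU(3) multiply gives its arithmetic intensity, and multiplying by the quoted 25 GB/s gives a memory-bound ceiling of the same order as the observed 13 GFLOPS.

```python
import numpy as np

# Sketch: arithmetic intensity of an SU(3) x SU(3) product, and the
# roofline it implies when 25 GB/s feeds the arithmetic units.

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
C = A @ B  # each of the 9 outputs: 3 complex mults + 2 complex adds

flops = 9 * (3 * 6 + 2 * 2)   # 198 real flops per matrix product
bytes_moved = 3 * 9 * 8       # two inputs + one output; models 8-byte
                              # single-precision complex, as on the SPEs
                              # (NumPy itself defaults to double here)
intensity = flops / bytes_moved
print(round(intensity, 2), round(25e9 * intensity / 1e9, 1))  # -> 0.92 22.9
```

With under one flop per byte, streaming matrices from main memory caps the kernel near 23 GFLOPS regardless of the 200 GFLOPS the arithmetic units can sustain in-core, which is why reducing transferred data is the lever the paper pursues.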

The APE computer: An array processor optimized for lattice gauge theory simulations

Computer Physics Communications, 1987

The APE computer is a high-performance processor designed to provide massive computational power for intrinsically parallel and homogeneous applications. APE is a linear array of processing elements and memory boards that execute in parallel in SIMD mode under the control of a CERN/SLAC 3081/E. Processing elements and memory boards are connected by a 'circular' switchnet. The hardware and software architecture of APE, as well as its implementation, are discussed in this paper. Some physics results obtained in the ...

QCDOC: A 10 Teraflops Computer for Tightly-Coupled Calculations

2004

Numerical simulations of the strong nuclear force, known as quantum chromodynamics or QCD, have proven to be a demanding, forefront problem in high-performance computing. In this report, we describe a new computer, QCDOC (QCD On a Chip), designed for optimal price/performance in the study of QCD. QCDOC uses a six-dimensional, low-latency mesh network to connect processing nodes, each of which includes a single custom ASIC, designed by our collaboration and built by IBM, plus DDR SDRAM. Each node has a peak speed of 1 Gigaflops, and two 12,288-node, 10+ Teraflops machines are to be completed in the fall of 2004. Currently, a 512-node machine is running, delivering efficiencies as high as 45% of peak on the conjugate gradient solvers that dominate our calculations, and a 4096-node machine with a cost of $1.6M is under construction. This should give us a price/performance of less than $1 per sustained Megaflops.
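The conjugate gradient solvers mentioned above can be sketched in a few lines. This minimal version solves a small symmetric positive-definite test system rather than the actual lattice Dirac equation, but the iteration structure that dominates QCDOC's runtime is the same: one operator application, two inner products, and three vector updates per step.

```python
import numpy as np

# Minimal conjugate-gradient solver: the Krylov iteration whose
# matrix-vector product and nearest-neighbour data pattern dominate
# lattice QCD calculations.

def cg(apply_A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - apply_A(x)        # initial residual
    p = r.copy()              # initial search direction
    rr = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rr / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol**2:          # converged: ||r|| < tol
            break
        p = r + (rr_new / rr) * p    # new conjugate direction
        rr = rr_new
    return x

# Small symmetric positive-definite test system (stand-in for the
# Wilson-Dirac normal equations).
n = 50
M = np.random.default_rng(1).standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = np.ones(n)
x = cg(lambda v: A @ v, b)
print(np.linalg.norm(A @ x - b))  # tiny residual
```

In a real LQCD code `apply_A` is the lattice Dirac operator, whose stencil touches only nearest-neighbour sites; that is exactly the access pattern the low-latency mesh network is built to serve.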

FPGA Implementation of a Lattice Quantum Chromodynamics Algorithm Using Logarithmic Arithmetic

2005

In this paper, we discuss the implementation of a lattice Quantum Chromodynamics (QCD) application on a Xilinx Virtex-II FPGA device on an Alpha Data ADM-XRC-II board using Handel-C and logarithmic arithmetic. The specific algorithm implemented is the Wilson Dirac Fermion Vector times Matrix Product operation. QCD is the scientific theory that describes the interactions of various types of sub-atomic particles. Lattice QCD is the use of computer simulations to probe aspects of this theory. The research described in this paper aims to investigate whether FPGAs and logarithmic arithmetic are a viable compute platform for high-performance computing by implementing lattice QCD for this platform. We have achieved competitive performance of at least 936 MFlops per node, executing 14.2 floating-point-equivalent operations per cycle, which is far higher than previous solutions proposed for lattice QCD simulations.
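The logarithmic arithmetic mentioned above can be sketched in software (the FPGA implements the addition correction as a lookup table; the functions below are illustrative, not the paper's actual design): a value is stored as its base-2 logarithm plus a sign, so multiplication becomes cheap addition of exponents, while addition becomes the expensive operation.

```python
import math

# Sketch of a logarithmic number system (LNS): x is represented as
# (sign, log2|x|). Multiply/divide reduce to adding/subtracting the
# logs; addition needs a correction term f(d) = log2(1 + 2**d), which
# hardware implementations serve from a table.

def to_lns(x):
    return (math.copysign(1.0, x), math.log2(abs(x)))

def from_lns(v):
    s, e = v
    return s * 2.0 ** e

def lns_mul(a, b):
    # the cheap operation: multiply signs, add exponents
    return (a[0] * b[0], a[1] + b[1])

def lns_add(a, b):
    # the expensive operation; this sketch handles same-sign values only
    (sa, ea), (sb, eb) = (a, b) if a[1] >= b[1] else (b, a)
    assert sa == sb, "sketch handles same-sign addition only"
    d = eb - ea                               # d <= 0 by construction
    return (sa, ea + math.log2(1.0 + 2.0 ** d))

x, y = to_lns(3.0), to_lns(4.0)
print(round(from_lns(lns_mul(x, y)), 6), round(from_lns(lns_add(x, y)), 6))
# -> 12.0 7.0
```

For a kernel like the Wilson-Dirac product, which is dominated by multiply-accumulate chains, trading cheap multiplies for table-assisted adds is what makes the approach attractive on FPGA fabric.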

QPACE -- a QCD parallel computer based on Cell processors

2009

QPACE is a novel parallel computer that has been developed primarily for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor used in the PlayStation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water-cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.

Investigating how to simulate lattice gauge theories on a quantum computer

PhD thesis, 2023

Quantum computers have the potential to expand the utility of lattice gauge theory to investigate non-perturbative particle physics phenomena that cannot be accessed with standard Monte Carlo methods due to the sign problem. Because qubits encode quantum states directly, quantum computers can represent the Hilbert space far more efficiently than classical computers. This makes the Hamiltonian approach computationally feasible, and that approach is entirely free of the sign problem. What current noisy intermediate-scale quantum (NISQ) hardware can achieve, however, is still under investigation; we therefore study the energy spectrum and the time evolution of an SU(2) theory on two kinds of quantum hardware: the D-Wave quantum annealer and IBM gate-based quantum hardware.
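The storage advantage behind this claim is easy to quantify: a classical simulator must hold all 2^n complex amplitudes of an n-qubit state, whereas n physical qubits hold that state directly. A small sketch of the classical memory cost:

```python
# Sketch: classical cost of storing a full n-qubit state vector,
# assuming 8-byte (single-precision complex) amplitudes.

def classical_state_bytes(n_qubits, bytes_per_amplitude=8):
    """Bytes needed to store all 2**n amplitudes classically."""
    return (2 ** n_qubits) * bytes_per_amplitude

for n in (10, 30, 50):
    print(n, "qubits ->", classical_state_bytes(n) / 2**30, "GiB")
```

At 30 qubits the state vector already needs 8 GiB; at 50 qubits it needs about 8 PiB, which is why Hamiltonian-based simulation of gauge theories is only feasible when the state lives on quantum hardware itself.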

Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD

Computer Physics Communications, 2003

We study the feasibility of a PC-based parallel computer for medium- to large-scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM each. The 32-bit, single-precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine, we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost-effective: only 30% of the hardware cost is spent on communication. According to our benchmark measurements, it results in a communication-time fraction of around 40% for lattices up to 48³ · 96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.

There are obvious advantages of PC-based systems. Single-PC hardware usually has excellent price/performance ratios for both single- and double-precision applications. In most cases the operating system (Linux), compiler (gcc) and other software are free. Another advantage of PC/Linux-based systems is that lattice codes remain portable. Furthermore, due to their price they are available to a broader community working on lattice gauge theory.
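The price/performance arithmetic above can be retraced from the quoted figures (the per-node budget computed at the end is an inference from the claimed $1/Mflops target, not a price stated in the paper):

```python
# Sketch, using the abstract's figures: sustained full-QCD throughput
# after the measured communication share, and the per-node budget
# implied by the $1/Mflops claim (the budget is inferred, not quoted).

nodes = 137
per_node_mflops = 1510        # Wilson, 32-bit, without communication
comm_fraction = 0.40          # measured communication-time share, full QCD
target = 1.0                  # claimed price/performance, $/Mflops

sustained = nodes * per_node_mflops * (1 - comm_fraction)   # Mflops
budget_per_node = target * sustained / nodes                # $/node implied
print(round(sustained / 1e3, 1), "Gflops;", round(budget_per_node), "$/node")
```

The sketch shows why the design works: even after losing 40% of wall-clock time to Gigabit Ethernet, commodity P4 nodes priced around the implied budget keep the whole machine under the $1/Mflops line for Wilson fermions.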