Running a code for lattice quantum chromodynamics efficiently on CRAY T3E systems

Highly Optimized Code for Lattice Quantum Chromodynamics on the CRAY T3E

Advances in Parallel Computing, 1998

In order to compute physical quantities in lattice quantum chromodynamics, huge systems of linear equations have to be solved. The availability of efficient parallel Krylov subspace solvers plays a vital role in the solution of these systems. We present a detailed analysis of the performance of the stabilized biconjugate gradient (BiCGStab) algorithm with symmetric successive over-relaxation (SSOR) preconditioning on a massively parallel CRAY T3E system.
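
As a point of reference for the algorithm named above, the following is a minimal, unpreconditioned BiCGStab sketch in NumPy. The variant analyzed in the paper adds SSOR preconditioning and a parallel data layout, neither of which is shown here; the callable `A` stands in for the Wilson fermion matrix-vector product.

```python
import numpy as np

def bicgstab(A, b, x0=None, tol=1e-8, maxiter=500):
    """Unpreconditioned BiCGStab (van der Vorst); A is any callable y = A(x)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A(x)
    r_hat = r.copy()                       # fixed shadow residual
    rho_old = alpha = omega = 1.0
    v = p = np.zeros_like(b)
    bnorm = np.linalg.norm(b)
    for k in range(maxiter):
        rho = np.vdot(r_hat, r)
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A(p)
        alpha = rho / np.vdot(r_hat, v)
        s = r - alpha * v                  # intermediate residual
        t = A(s)
        omega = np.vdot(t, s) / np.vdot(t, t)
        x += alpha * p + omega * s
        r = s - omega * t
        rho_old = rho
        if np.linalg.norm(r) < tol * bnorm:
            return x, k + 1                # converged
    return x, maxiter
```

With `A = lambda v: M @ v` for an explicit matrix `M`, this reproduces textbook BiCGStab convergence and can serve as a baseline when experimenting with preconditioners.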

Computing and Deflating Eigenvalues While Solving Multiple Right-Hand Side Linear Systems with an Application to Quantum Chromodynamics

SIAM Journal on Scientific Computing, 2010

We present a new algorithm that computes eigenvalues and eigenvectors of a Hermitian positive definite matrix while solving a linear system of equations with Conjugate Gradient (CG). Traditionally, all the CG iteration vectors could be saved and recombined through the eigenvectors of the tridiagonal projection matrix, which is equivalent theoretically to unrestarted Lanczos. Our algorithm capitalizes on the iteration vectors produced by CG to update only a small window of vectors that approximate the eigenvectors. While this window is restarted in a locally optimal way, the CG algorithm for the linear system is unaffected. Yet, in all our experiments, this small window converges to the required eigenvectors at a rate identical to unrestarted Lanczos. After the solution of the linear system, eigenvectors that have not accurately converged can be improved in an incremental fashion by solving additional linear systems. In this case, eigenvectors identified in earlier systems can be used to deflate, and thus accelerate, the convergence of subsequent systems.
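
To make the deflation step concrete, here is a small NumPy sketch of CG started from a Galerkin projection over previously computed eigenvector approximations (the "init-CG" flavor of deflation). The paper's eigCG machinery, which harvests eigenvectors from a restarted window inside CG itself, is substantially more involved and is not reproduced; names and shapes below are illustrative.

```python
import numpy as np

def deflated_cg(A, b, V, tol=1e-8, maxiter=1000):
    """CG for Hermitian positive definite A (ndarray), deflated with the
    columns of V (n-by-k approximate eigenvectors from earlier solves)."""
    # Small projected system: (V^H A V) y = V^H b gives the deflated guess.
    AV = A @ V
    y = np.linalg.solve(V.conj().T @ AV, V.conj().T @ b)
    x = V @ y                              # components along V removed up front
    r = b - A @ x
    p = r.copy()
    rs_old = np.vdot(r, r).real
    bnorm = np.linalg.norm(b)
    for k in range(maxiter):
        Ap = A @ p
        alpha = rs_old / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol * bnorm:
            return x, k + 1
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, maxiter
```

The better V approximates the small eigenvectors of A, the larger the effective smallest eigenvalue of the deflated system, which is the acceleration mechanism the abstract describes for sequences of right-hand sides.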

A parallel SSOR preconditioner for lattice QCD

Computer Physics Communications, 1996

We present a parallelizable SSOR preconditioning scheme for Krylov subspace iterative solvers which proves to be efficient in lattice QCD applications involving Wilson fermions. Our preconditioner is based on a locally lexicographic ordering of the lattice points. In actual hybrid Monte Carlo applications with the biconjugate gradient stabilized method (BiCGStab), we achieve a gain factor of about 2 in the number of iterations compared to conventional odd-even preconditioning. Whether this translates into similar reductions in run time will depend on the parallel computer in use. We discuss implementation issues using the 'Eisenstat trick' and machine-specific advantages of the method for the APE100/Quadrics parallel computer. In a full QCD simulation on a 512-processor Quadrics QH4 we find a gain in CPU time of a factor of 1.7 over odd-even preconditioning for a 24³ × 40 lattice.
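
For orientation, the sketch below applies a generic SSOR preconditioner, M⁻¹ = (D + ωU)⁻¹ D (D + ωL)⁻¹ up to a scalar factor, via one forward and one backward triangular sweep. The paper's contribution, the locally lexicographic ordering, amounts to choosing the row ordering so both sweeps parallelize across sublattices; the Eisenstat trick, which reuses the sweeps to avoid a separate multiplication by A, is likewise not modeled in this dense toy version.

```python
import numpy as np
from scipy.linalg import solve_triangular

def ssor_solve(A, r, omega=1.0):
    """Return z = M^{-1} r for the SSOR preconditioner
    M = (D + omega*L) D^{-1} (D + omega*U)   (up to a scalar factor),
    where A = L + D + U is the strictly-lower / diagonal / strictly-upper
    split. The row ordering of A is exactly what a locally lexicographic
    scheme would choose to expose parallelism in both sweeps."""
    d = np.diag(A)
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    D = np.diag(d)
    y = solve_triangular(D + omega * L, r, lower=True)    # forward sweep
    z = solve_triangular(D + omega * U, d * y, lower=False)  # backward sweep
    return z
```

Plugging `ssor_solve` into a preconditioned Krylov solver in place of the identity shows the iteration-count effect the abstract quantifies for the lattice case.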

Variants of the Block-QMR Method and Applications in Quantum Chromodynamics

1997

Numerical computations in lattice quantum chromodynamics (QCD) require the solution of large sparse systems of linear equations. These systems are so large that they can only be tackled by iterative solution techniques. In certain QCD simulations, a sequence of closely related linear systems has to be solved. In a quenched simulation, all systems have the same coefficient matrix, but differ in their right-hand sides. In the multiboson framework, systems with coefficient matrices that differ only by an additive shift have to be solved. In this paper, we describe variants of the block-QMR method that are tailored to these two classes of multiple linear systems. Numerical results with linear systems from QCD simulations are reported.
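
Block-QMR itself is too long to sketch here, but the property its shifted-system variant exploits is easy to verify numerically: Krylov subspaces are invariant under additive shifts of the matrix, so one basis can serve the whole multiboson family. A small NumPy check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 50, 8, 0.37
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def krylov_basis(M, b, k):
    """Orthonormal basis of span{b, Mb, ..., M^(k-1) b}."""
    K = np.empty((M.shape[0], k))
    v = b.copy()
    for j in range(k):
        K[:, j] = v
        v = M @ v
        v /= np.linalg.norm(v)   # rescaling does not change the span
    return np.linalg.qr(K)[0]

V1 = krylov_basis(A, b, k)
V2 = krylov_basis(A + sigma * np.eye(n), b, k)
# All principal angles between the two subspaces are ~0: the spaces coincide,
# which is why shifted systems can share a single Krylov basis.
print(np.allclose(np.linalg.svd(V1.T @ V2, compute_uv=False), 1.0))  # True
```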

FPGA Implementation of a Lattice Quantum Chromodynamics Algorithm Using Logarithmic Arithmetic

2005

In this paper, we discuss the implementation of a lattice Quantum Chromodynamics (QCD) application on a Xilinx Virtex-II FPGA device on an Alpha Data ADM-XRC-II board using Handel-C and logarithmic arithmetic. The specific algorithm implemented is the Wilson-Dirac fermion vector times matrix product operation. QCD is the scientific theory that describes the interactions of various types of sub-atomic particles. Lattice QCD is the use of computer simulations to study aspects of this theory. The research described in this paper aims to investigate whether FPGAs and logarithmic arithmetic are a viable compute platform for high-performance computing by implementing lattice QCD for this platform. We have achieved competitive performance of at least 936 MFlops per node, executing 14.2 floating-point-equivalent operations per cycle, which is far higher than previous solutions proposed for lattice QCD simulations.
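
The arithmetic idea is compact enough to sketch in software. In a logarithmic number system a value is stored as a sign and a base-2 logarithm, so multiplication reduces to adding exponents, while addition needs the function log2(1 ± 2^d), which the FPGA realizes with lookup tables. The toy Python model below ignores zero handling and exact cancellation; it illustrates the number system only, not the paper's Handel-C design.

```python
import math

def lns(x):
    """Encode x as (sign, log2|x|); x must be nonzero in this toy model."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def to_float(v):
    s, e = v
    return s * 2.0 ** e

def lns_mul(a, b):
    return (a[0] * b[0], a[1] + b[1])            # multiply = add exponents

def lns_add(a, b):
    if a[1] < b[1]:                              # order so |a| >= |b|
        a, b = b, a
    d = b[1] - a[1]                              # d <= 0
    if a[0] == b[0]:
        return (a[0], a[1] + math.log2(1.0 + 2.0 ** d))  # same sign
    return (a[0], a[1] + math.log2(1.0 - 2.0 ** d))      # effective subtraction

x, y = lns(3.5), lns(-1.25)
print(to_float(lns_mul(x, y)))   # -4.375
print(to_float(lns_add(x, y)))   #  2.25
```

Since the Wilson-Dirac kernel is dominated by complex multiply-accumulates, trading hard multiplies for adds plus table lookups is what makes this representation attractive on FPGA fabric.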

Least-squares finite element methods for quantum chromodynamics

SIAM Journal on Scientific Computing, 2008

A significant amount of the computational time in large Monte Carlo simulations of lattice field theory is spent inverting the discrete Dirac operator. Unfortunately, traditional covariant finite difference discretizations of the Dirac operator present serious challenges for standard iterative methods. For interesting physical parameters, the discretized operator is large and ill-conditioned, and has random coefficients. More recently, adaptive algebraic multigrid (AMG) methods have been shown to be effective preconditioners for Wilson's discretization [1], [2] of the Dirac equation. This paper presents an alternate discretization of the 2D Dirac operator of Quantum Electrodynamics (QED) based on least-squares finite elements. The discretization is systematically developed and physical properties of the resulting matrix system are discussed. Finally, numerical experiments are presented that demonstrate the effectiveness of adaptive smoothed aggregation (αSA) multigrid as a preconditioner for the discrete field equations.
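
In matrix terms, the least-squares route replaces the non-Hermitian discrete operator D by its normal equations, whose matrix is Hermitian positive definite and hence amenable to CG and multigrid. A minimal NumPy illustration with a stand-in operator (not an actual Dirac discretization):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Well-conditioned non-Hermitian stand-in for a discretized operator D.
D = np.eye(n) + 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
f = rng.standard_normal(n)

N = D.conj().T @ D                       # normal-equations matrix
u = np.linalg.solve(N, D.conj().T @ f)   # least-squares solution of D u ~ f

print(np.linalg.eigvalsh(N).min() > 0)               # True: Hermitian positive definite
print(np.linalg.norm(D.conj().T @ (D @ u - f)))      # ~0: residual orthogonal to range(D)
```

A least-squares finite element discretization produces a system with this positive definite structure directly, which is what opens the door to the αSA multigrid preconditioning tested in the paper.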

Massively parallel quantum chromodynamics

IBM Journal of Research and Development, 2008

Quantum chromodynamics (QCD), the theory of the strong nuclear force, can be numerically simulated on massively parallel supercomputers using the method of lattice gauge theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures for which LQCD suggests a need. We demonstrate these methods on the IBM Blue Gene/L (BG/L) massively parallel supercomputer and argue that the BG/L architecture is very well suited for LQCD studies. This suitability arises from the fact that LQCD is a regular lattice discretization of space into lattice sites, while the BG/L supercomputer is a discretization of space into compute nodes. Both LQCD and the BG/L architecture are constrained by the requirement of short-distance exchanges. This simple relation is technologically important and theoretically intriguing. We demonstrate a computational speedup of LQCD using up to 131,072 CPUs on the largest BG/L supercomputer available in 2007. As the number of CPUs is increased, the speedup increases linearly, with sustained performance of about 20% of the maximum possible hardware speed. This corresponds to a maximum of 70.5 sustained teraflops. At these speeds, LQCD and the BG/L supercomputer are able to produce theoretical results for the next generation of strong-interaction physics.
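
The locality argument can be quantified: after block-decomposing the lattice, each node exchanges only the faces of its local sub-lattice with its torus neighbors, so the communication load relative to compute scales with the surface-to-volume ratio of the local block. A small sketch with illustrative dimensions (not the paper's configuration):

```python
# Surface-to-volume ratio of a node's local sub-lattice: the fraction of
# sites whose nearest-neighbor stencil reaches off-node, i.e. the relative
# halo-exchange cost per sweep on a nearest-neighbor machine like BG/L.

def surface_to_volume(local_dims):
    volume = 1
    for d in local_dims:
        volume *= d
    surface = sum(2 * volume // d for d in local_dims)  # two faces per direction
    return surface / volume

# E.g. a 32^3 x 64 lattice over an 8x8x8x16 node grid gives 4^4 local blocks.
print(surface_to_volume([4, 4, 4, 4]))   # 2.0: communication dominated
print(surface_to_volume([8, 8, 8, 8]))   # 1.0: doubling the block halves the ratio
```

Keeping this ratio low while exchanging only with physical torus neighbors is the structural match between LQCD and the BG/L network that the abstract highlights.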

QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine

Computing in Science & Engineering, 2008

An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.

Automated Code Generation for Lattice Quantum Chromodynamics and beyond

Journal of Physics: Conference Series, 2014

We present here our ongoing work on a domain-specific language which aims to simplify Monte Carlo simulations and measurements in the domain of Lattice Quantum Chromodynamics. The tool chain, called Qiral, is used to produce high-performance OpenMP C code from LaTeX sources. We discuss conceptual issues and details of implementation and optimization. We also compare the performance of the generated code with well-established simulation software.
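
To give a flavor of such a tool chain, the toy generator below emits an OpenMP C kernel from a tiny symbolic description. It is an illustrative stand-in for the idea only; Qiral's actual pipeline starts from LaTeX sources and applies domain-specific rewriting and optimization, none of which is modeled here.

```python
# Toy code generator: produce an OpenMP C kernel for y[i] += alpha * x[i]
# from symbolic names. Hypothetical helper, not part of Qiral.

def gen_axpy(name, alpha, x, y, n="n"):
    """Emit C source for an axpy-style loop with an OpenMP parallel pragma."""
    return "\n".join([
        f"void {name}(int {n}, double {alpha}, const double *{x}, double *{y}) {{",
        "#pragma omp parallel for",
        f"    for (int i = 0; i < {n}; ++i)",
        f"        {y}[i] += {alpha} * {x}[i];",
        "}",
    ])

print(gen_axpy("axpy", "alpha", "x", "y"))
```

Generating even simple kernels this way makes the point of the approach: the loop structure, pragmas, and data layout decisions live in the generator, where they can be optimized once, rather than being hand-written per expression.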