An Automated Framework for Accelerating Numerical Algorithms on Reconfigurable Platforms Using Algorithmic/Architectural Optimization


2009

Abstract: This paper describes TANOR, an automated framework for designing hardware accelerators for numerical computation on reconfigurable platforms. Applications that apply numerical algorithms to large data sets require high-throughput computation platforms. The focus is on N-body interaction problems, which have a wide range of applications, from astrophysics to molecular dynamics. The TANOR design flow starts with a MATLAB description of a particular interaction function, its parameters, and certain architectural constraints specified through a graphical user interface. TANOR then automatically generates a configuration bitstream for a target FPGA, along with the drivers and control software needed to direct the application from a host PC. Architectural exploration is supported through fully custom fixed-point and floating-point representations, in addition to standard number representations such as single-precision floating point.
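Custom fixed-point formats trade accuracy for hardware area. The minimal Python sketch below shows one way to emulate such a format in software to gauge rounding error before choosing a word length. It assumes a signed two's-complement word with a chosen number of fractional bits and saturation on overflow; the helper names (`quantize`, `dequantize`, `fixed_mul`, `max_abs_error`) are hypothetical and are not TANOR's API.

```python
# Hypothetical helpers for emulating a custom signed fixed-point format in software.


def quantize(x: float, frac_bits: int, total_bits: int) -> int:
    """Round x to the nearest representable value (stored as an integer with
    frac_bits fractional bits), saturating to a total_bits two's-complement word."""
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def dequantize(q: int, frac_bits: int) -> float:
    """Convert a stored fixed-point integer back to a float."""
    return q / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int) -> int:
    """Multiply two fixed-point values. The raw product has 2*frac_bits fractional
    bits, so shift right (arithmetic shift, rounds toward -infinity). No saturation."""
    return (a * b) >> frac_bits


def max_abs_error(values, frac_bits: int, total_bits: int) -> float:
    """Worst-case representation error of the format over a set of sample values."""
    return max(abs(v - dequantize(quantize(v, frac_bits, total_bits), frac_bits))
               for v in values)


if __name__ == "__main__":
    samples = [i / 97 for i in range(-50, 51)]  # arbitrary inputs in about [-0.52, 0.52]
    for fb in (4, 8, 12):
        print(f"frac_bits={fb}: max error {max_abs_error(samples, fb, 16):.2e}")
```

Sweeping `frac_bits` this way gives a quick accuracy-versus-width curve. A real flow would also track the dynamic range of intermediate values, which this sketch only handles by saturating.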

TANOR: A Tool for Accelerating N-Body Simulations on Reconfigurable Platform

2007 International Conference on Field Programmable Logic and Applications, 2007

Algorithm-architecture co-exploration is hindered by the lack of efficient tools. As a consequence, designers are currently able to explore only a limited set of points in the whole design space. A tool that allows fast, automated exploration of algorithmic and architectural tradeoffs is therefore highly desirable. In this paper, we describe TANOR, an automated tool for designing hardware accelerators for the class of N-body interaction problems. The design flow, starting from a high-level (MATLAB) description, configures the entire system automatically. We describe the design of TANOR and demonstrate the effectiveness and adaptability of the tool using three target applications: the gravitational kernel used in astrophysics, the Gaussian kernel common in image processing applications, and a force calculation kernel used in molecular dynamics. Our results demonstrate that TANOR generates hardware accelerators that are competitive with existing custom accelerators.
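The kernels named above share one pattern: each target point accumulates a weighted sum of a pairwise interaction over all source points, so only the interaction function changes between applications. The Python sketch below illustrates that structure with a direct O(NM) evaluator that takes the kernel as a parameter. The kernel forms (a softened 1/r potential with G = 1, and an unnormalized Gaussian) and all function names are illustrative assumptions, not TANOR's generated code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, float, float]
Kernel = Callable[[Vec, Vec], float]


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def direct_sum(targets: Sequence[Vec], sources: Sequence[Vec],
               weights: Sequence[float], kernel: Kernel) -> List[float]:
    """phi_i = sum_j w_j * K(x_i, y_j), evaluated by brute force (O(N*M))."""
    result = []
    for x in targets:
        acc = 0.0
        for y, w in zip(sources, weights):
            acc += w * kernel(x, y)
        result.append(acc)
    return result


def gravitational(eps: float) -> Kernel:
    """Softened Newtonian potential kernel, G = 1 (assumed units).
    With eps > 0 a self-interaction term stays finite."""
    return lambda x, y: -1.0 / math.sqrt(sq_dist(x, y) + eps * eps)


def gaussian(sigma: float) -> Kernel:
    """Unnormalized Gaussian kernel exp(-|x - y|^2 / (2 sigma^2))."""
    return lambda x, y: math.exp(-sq_dist(x, y) / (2.0 * sigma * sigma))


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    masses = [1.0, 2.0, 0.5]
    print(direct_sum(pts, pts, masses, gravitational(eps=0.1)))
    print(direct_sum(pts, pts, masses, gaussian(sigma=1.0)))
```

Passing the kernel as a function keeps the double loop, which is what a hardware pipeline would implement, independent of the physics.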

An End-to-End Tool Flow for FPGA-Accelerated Scientific Computing

IEEE Design & Test of Computers, 2011

As part of their ongoing work with the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC), the authors are developing a complete tool chain for FPGA-based acceleration of scientific computing, from early-stage assessment of applications down to rapid routing. This article provides an overview of this tool chain.

Scientific Application Acceleration with Reconfigurable Functional Units

While scientific applications were once limited by floating-point computation, modern scientific applications use more unstructured formulations. These applications contain a significant percentage of integer computation, which is increasingly a limiting factor in scientific application performance. In real scientific applications used at Sandia National Labs, integer computations constitute on average 37% of the application operations, forming large and complex dataflow graphs. Reconfigurable Functional Units (RFUs) are a particularly attractive accelerator for these graphs because they can potentially accelerate many unique graphs with a small amount of additional hardware. In this study, we analyze application traces of Sandia’s scientific applications and the SPEC-FP benchmark suite. We first select a set of dataflow graphs to accelerate using the RFU, then use execution-based simulation to determine the acceleration potential of the applications when using an RFU. On average, a set of 32 or fewer graphs is sufficient to capture the dataflow behavior of 30% of the integer computation, and more than half of the Sandia applications show an improvement of 5% or more.
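The graph-selection step can be illustrated with a greedy sketch. It assumes, hypothetically, that each candidate dataflow graph has already been matched in the trace and assigned the number of dynamic integer operations it covers, and that candidates do not overlap. Under that assumption, taking the largest counts first maximizes coverage for a fixed budget such as 32 graphs; overlapping candidates would call for a set-cover heuristic instead. The standard Amdahl's-law formula is included to show how the fraction of execution time affected bounds the whole-program gain. Both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def select_graphs(graph_op_counts: Dict[str, int], total_int_ops: int,
                  max_graphs: int = 32) -> Tuple[List[str], float]:
    """Pick up to max_graphs candidate graphs covering the most dynamic integer ops.
    Assumes candidates cover disjoint sets of trace operations, so ranking by
    count is optimal for a fixed budget."""
    ranked = sorted(graph_op_counts.items(), key=lambda kv: kv[1], reverse=True)
    chosen = ranked[:max_graphs]
    covered = sum(count for _, count in chosen)
    return [name for name, _ in chosen], covered / total_int_ops


def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Whole-program speedup when `fraction` of execution time runs
    `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


if __name__ == "__main__":
    # Placeholder counts for illustration only, not data from the study.
    counts = {"g0": 500, "g1": 300, "g2": 120, "g3": 80}
    names, coverage = select_graphs(counts, total_int_ops=2000, max_graphs=2)
    print(names, coverage)
    print(amdahl_speedup(0.1, 2.0))
```

Note that integer-operation coverage is not the same as a fraction of execution time; mapping one to the other requires a timing model such as the execution-based simulation the abstract describes.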

Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU)

Computer Science - Research and Development, 2009

In a previous paper we showed that direct gravitational N-body simulations in astrophysics scale very well on moderately parallel supercomputers (on the order of 10–100 nodes). The best balance between computation and communication is reached if the nodes are accelerated by special-purpose hardware; in this paper we describe the implementation of particle-based astrophysical simulation codes on new types of accelerator hardware (field programmable gate arrays, FPGA, and graphics processing units, GPU). In addition to direct gravitational N-body simulations, we also use the algorithmically similar "smoothed particle hydrodynamics" (SPH) method as a test application; the algorithms are used for astrophysical problems such as the evolution of galactic nuclei with central black holes and gravitational wave generation, and star formation in galaxies and galactic nuclei. We present the code performance on a single node using different kinds of special hardware (traditional GRAPE, FPGA, and GPU) and some implementation aspects (e.g. accuracy). The results show that GPU hardware running real application codes is as fast as GRAPE at an order of magnitude lower price, and that FPGAs are useful for accelerating complex sequences of operations (such as SPH). We discuss future prospects and new cluster computers built with new generations of FPGA and GPU cards.
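The algorithmic similarity between SPH and direct N-body summation shows up in the SPH density estimate, rho_i = sum_j m_j W(|r_i - r_j|, h): the same all-pairs loop, but with a compactly supported smoothing kernel W. The Python sketch below assumes the widely used 3D cubic spline kernel with support radius 2h; the paper may use a different kernel or a neighbor search instead of a brute-force loop, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def cubic_spline_w(r: float, h: float) -> float:
    """3D cubic spline smoothing kernel with support radius 2h."""
    q = r / h
    sigma = 1.0 / (math.pi * h ** 3)  # 3D normalization constant
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0


def sph_density(positions: Sequence[Vec], masses: Sequence[float],
                h: float) -> List[float]:
    """rho_i = sum_j m_j W(|r_i - r_j|, h), including the self term j = i."""
    rho = []
    for xi in positions:
        acc = 0.0
        for xj, mj in zip(positions, masses):
            acc += mj * cubic_spline_w(math.dist(xi, xj), h)
        rho.append(acc)
    return rho


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
    print(sph_density(pts, [1.0, 1.0, 1.0], h=1.0))
```

Compared with the gravitational case, the kernel evaluation involves more operations and branches per pair, which is consistent with the abstract's observation that FPGAs help with complex operation sequences such as SPH.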

Molecular simulations with hardware accelerators: a portable interface definition for FPGA supported acceleration

2007

Growing interest in configurable hardware accelerators has highlighted the need for a portable application programming interface (API) if such accelerators are to achieve widespread adoption. Recent work defining a candidate common generic API for field programmable gate arrays has facilitated the definition of an application-specific API for accelerating molecular dynamics programs. Using the LAMMPS application as a prototype implementation platform, both the general FPGA API and the application-specific molecular dynamics API are presented, with preliminary results confirming that both the general and the functionally specific API are portable across reconfigurable hardware and development environments.
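One way to picture a portable accelerator API is an abstract interface that the application calls, with interchangeable backends behind it. The Python sketch below is hypothetical: the class and method names (`PairForceAccelerator`, `initialize`, `upload_positions`, `run`, `download_forces`) are invented for illustration and are not the API defined in the paper or in LAMMPS. A CPU reference backend stands in for an FPGA implementation.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence, Tuple

Vec = Tuple[float, float, float]


class PairForceAccelerator(ABC):
    """Hypothetical portable interface: application code uses only these calls."""

    @abstractmethod
    def initialize(self, n_atoms: int) -> None: ...

    @abstractmethod
    def upload_positions(self, positions: Sequence[Vec]) -> None: ...

    @abstractmethod
    def run(self) -> None: ...

    @abstractmethod
    def download_forces(self) -> List[List[float]]: ...


class CpuReferenceBackend(PairForceAccelerator):
    """Software fallback. pair_force(dx) returns the force on atom i from atom j,
    given dx = r_i - r_j."""

    def __init__(self, pair_force: Callable[[Vec], Vec]):
        self.pair_force = pair_force
        self.n = 0
        self.positions: List[Vec] = []
        self.forces: List[List[float]] = []

    def initialize(self, n_atoms: int) -> None:
        self.n = n_atoms
        self.forces = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]

    def upload_positions(self, positions: Sequence[Vec]) -> None:
        if len(positions) != self.n:
            raise ValueError("position count does not match initialize()")
        self.positions = list(positions)

    def run(self) -> None:
        self.forces = [[0.0, 0.0, 0.0] for _ in range(self.n)]
        for i in range(self.n):
            for j in range(i + 1, self.n):
                dx = tuple(a - b for a, b in zip(self.positions[i], self.positions[j]))
                f = self.pair_force(dx)
                for k in range(3):
                    self.forces[i][k] += f[k]
                    self.forces[j][k] -= f[k]  # Newton's third law

    def download_forces(self) -> List[List[float]]:
        return [row[:] for row in self.forces]


def compute_forces(accel: PairForceAccelerator,
                   positions: Sequence[Vec]) -> List[List[float]]:
    """Application-side code, written once against the interface."""
    accel.initialize(len(positions))
    accel.upload_positions(positions)
    accel.run()
    return accel.download_forces()


if __name__ == "__main__":
    spring = lambda dx: tuple(-1.0 * d for d in dx)  # toy linear pair force
    print(compute_forces(CpuReferenceBackend(spring), [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
```

Because `compute_forces` never names a vendor, porting to new hardware would mean writing another backend class rather than changing the MD code.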

Creating Customized CGRAs for Scientific Applications

Electronics, 2021

Executing complex scientific applications on Coarse Grain Reconfigurable Arrays (CGRAs) offers improvements in execution time and/or energy consumption compared to optimized software implementations or even fully customized hardware solutions. In this work, we explore the potential of application analysis methods in such customized hardware solutions. We present analysis metrics for various scientific applications and tailor the results for use by MC-Def, a novel Mixed-CGRA Definition Framework targeting a Mixed-CGRA architecture that combines the advantages of CGRAs and FPGAs by using a customized cell array along with a separate LUT array for adaptability. We also present implementation results for the VHDL hardware implementations of our CGRA cell across various scientific applications.
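Application analysis for CGRA customization typically starts from operation statistics of an application's dataflow graphs. As an illustrative sketch only (these are not MC-Def's actual metrics), the Python code below counts operation types and producer-to-consumer operation pairs in a small hypothetical dataflow graph. Frequent pairs, such as a multiply feeding an add, are natural candidates for a fused operation in a customized cell.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical representation: node name -> (operation, list of input node names).
DFG = Dict[str, Tuple[str, List[str]]]


def op_histogram(dfg: DFG) -> Counter:
    """How often each operation type occurs in the graph."""
    return Counter(op for op, _ in dfg.values())


def op_pair_counts(dfg: DFG) -> Counter:
    """How often an operation of one type feeds an operation of another type."""
    pairs: Counter = Counter()
    for op, inputs in dfg.values():
        for src in inputs:
            pairs[(dfg[src][0], op)] += 1
    return pairs


if __name__ == "__main__":
    example: DFG = {
        "x": ("load", []),
        "y": ("load", []),
        "m": ("mul", ["x", "y"]),
        "z": ("load", []),
        "a": ("add", ["m", "z"]),
        "s": ("store", ["a"]),
    }
    print(op_histogram(example))
    print(op_pair_counts(example).most_common(3))
```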

Direct N-Body Application on Low-Power and Energy-Efficient Parallel Architectures

Parallel Computing: Technology Trends

The aim of this work is to quantitatively evaluate the impact of computation on energy consumption on ARM MPSoC platforms, exploiting CPUs, embedded GPUs and FPGAs. One of these platforms is a prototype of an Exascale supercomputer and may represent the future of High Performance Computing systems. Performance and energy measurements are made using a state-of-the-art direct N-body code from the astrophysical domain. We compare the time-to-solution and energy delay product metrics for different software configurations. We have shown that FPGA technologies can be used for application kernel acceleration and are emerging as a promising alternative to “traditional” HPC technologies, which focus on peak performance rather than on power efficiency.
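The energy delay product (EDP) combines the two metrics: with energy E = P_avg · t for average power P_avg and time-to-solution t, EDP = E · t = P_avg · t². The Python sketch below ranks configurations by EDP. The class name is hypothetical, and the numbers in the usage example are placeholders, not measurements from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMeasurement:
    config: str
    time_s: float        # time-to-solution in seconds
    avg_power_w: float   # average power draw during the run, in watts

    @property
    def energy_j(self) -> float:
        return self.avg_power_w * self.time_s

    @property
    def edp(self) -> float:
        # EDP = E * t = P * t^2, in joule-seconds
        return self.energy_j * self.time_s


def rank_by_edp(runs: List[RunMeasurement]) -> List[RunMeasurement]:
    """Lowest (best) energy delay product first."""
    return sorted(runs, key=lambda r: r.edp)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    runs = [RunMeasurement("fast", 2.0, 10.0), RunMeasurement("low-power", 4.0, 3.0)]
    for r in rank_by_edp(runs):
        print(r.config, r.energy_j, r.edp)
```

With these placeholder numbers the slower, lower-power run uses less energy (12 J versus 20 J), yet the faster run has the lower EDP (40 J·s versus 48 J·s), which is why EDP is reported alongside time-to-solution.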

Compiled hardware acceleration of Molecular Dynamics code

2008 International Conference on Field Programmable Logic and Applications, 2008

The objective of Molecular Dynamics (MD) simulations is to determine the shape of a molecule in a given biomolecular environment. These simulations are computationally very demanding: simulating a few milliseconds can take days or months, depending on the number of atoms involved. MD simulations are therefore a prime candidate for FPGA-based code acceleration. We have investigated the possible acceleration of the commonly used MD program NAMD. This code is highly optimized for software-based execution and, as written, does not benefit from FPGA-based acceleration. We have therefore developed a modified version, based on the calculations NAMD performs, that streams a set of data through a highly pipelined circuit on the FPGA. We used the ROCCC compiler toolset to generate the circuit and implemented it on an SGI Altix 4700 fitted with a RASC RC100 blade.
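The streaming organization can be modeled in software as a chain of stages, each consuming the previous stage's output one particle pair at a time, much as a pipelined circuit does. The Python generator sketch below assumes a plain Lennard-Jones pair force with a cutoff, F_i = 24ε/r² [2(σ/r)^12 - (σ/r)^6](r_i - r_j). NAMD's actual force terms (such as electrostatics) and the ROCCC-generated circuit are not reproduced, and all function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Vec = Tuple[float, float, float]


def displacement_stage(pairs: Iterable[Tuple[int, int]],
                       pos: Sequence[Vec]) -> Iterator[tuple]:
    """Stage 1: displacement r_i - r_j and squared distance for each pair."""
    for i, j in pairs:
        dx = tuple(a - b for a, b in zip(pos[i], pos[j]))
        yield i, j, dx, sum(d * d for d in dx)


def cutoff_stage(stream: Iterable[tuple], rc2: float) -> Iterator[tuple]:
    """Stage 2: drop pairs at or beyond the cutoff radius."""
    for item in stream:
        if item[3] < rc2:
            yield item


def lj_stage(stream: Iterable[tuple], eps: float, sigma: float) -> Iterator[tuple]:
    """Stage 3: Lennard-Jones force on atom i from atom j (coincident atoms not handled)."""
    s2 = sigma * sigma
    for i, j, dx, r2 in stream:
        sr6 = (s2 / r2) ** 3
        coef = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2
        yield i, j, tuple(coef * d for d in dx)


def accumulate(stream: Iterable[tuple], n: int) -> List[List[float]]:
    """Stage 4: accumulate per-atom forces, applying Newton's third law."""
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i, j, f in stream:
        for k in range(3):
            forces[i][k] += f[k]
            forces[j][k] -= f[k]
    return forces


def lj_forces(pos: Sequence[Vec], rc: float, eps: float = 1.0,
              sigma: float = 1.0) -> List[List[float]]:
    """Stream all unique pairs through the four stages."""
    n = len(pos)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    stream = cutoff_stage(displacement_stage(pairs, pos), rc * rc)
    return accumulate(lj_stage(stream, eps, sigma), n)


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.3, 0.0)]
    print(lj_forces(atoms, rc=2.5))
```

Because each generator holds only the current pair, data flows through the stages without intermediate arrays, which loosely mirrors how a streaming hardware pipeline processes one input per cycle.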