Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
Related papers
Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform
2022
This paper assesses and reports the experience of eleven application teams working to build, validate, and benchmark several HPC applications on a novel GPU-accelerated Arm testbed. The testbed consists of the latest (at the time of writing) Arm developer kits from NVIDIA, with server-class Arm CPUs and NVIDIA A100 GPUs. The applications and mini-apps are written in multiple languages and parallel programming models, including C, C++, Fortran, CUDA, OpenACC, and OpenMP. Each application relies extensively on the other tools available in the programming environment, including scientific libraries, compilers, and other tooling. Our goal is to evaluate application readiness for the next generation of Arm- and GPU-based HPC systems and to determine the readiness of the tooling for future application developers. On both accounts, the reported case studies demonstrate that the diverse software and tools available for GPU-accelerated Arm systems are ready.
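As an illustration of the kind of portable offload code such a testbed exercises, below is a minimal OpenMP target-offload SAXPY sketch in C++. It is not taken from any of the eleven applications; the compile line assumes NVIDIA's nvc++ compiler, which is available on both Arm and x86 hosts, so the same source and command work on either.

    // Minimal OpenMP target-offload SAXPY (a sketch, not code from the paper).
    // Assumed compile line: nvc++ -mp=gpu -O2 saxpy.cpp -o saxpy
    // (identical on Arm and x86 hosts with an NVIDIA GPU attached)
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        const float a = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        float* xp = x.data();
        float* yp = y.data();

        // Offloads to the GPU when one is present; falls back to the host otherwise.
        #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (int i = 0; i < n; ++i)
            yp[i] = a * xp[i] + yp[i];

        std::printf("y[0] = %f (expected 4.0)\n", yp[0]);
        return 0;
    }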
Proceedings of the HPC Asia 2023 Workshops
A set of benchmarks, including numerical libraries and real-world scientific applications, was run on several modern ARM systems (Amazon Graviton 3/2, Fujitsu A64FX, Ampere Altra, ThunderX2) and compared to x86 systems (Intel and AMD) as well as to hybrid Intel x86/NVIDIA GPU systems. For benchmarking automation, the application kernel module of XDMoD was used. XDMoD is a comprehensive suite for HPC resource utilization and performance monitoring; its application kernel module enables continuous performance monitoring of HPC resources through the regular execution of user applications, and it has been used on the Ookami system (one of the first USA-based Fujitsu ARM A64FX SVE 512 systems). The applications used for this study span a variety of computational paradigms: HPCC (a collection of HPC benchmarks), NWChem (ab initio chemistry), OpenFOAM (a partial differential equation solver), GROMACS (biomolecular simulation), AI Benchmark Alpha (an AI benchmark), and Enzo (adaptive mesh refinement). ARM performance, while generally slower, was nonetheless shown in many cases to be comparable to current x86 counterparts, and it often outperforms previous generations of x86 CPUs. In terms of energy efficiency, which considers both power consumption and execution time, ARM was shown in most cases to be more energy efficient than x86 processors. In cases where GPU performance was tested, the GPU systems showed the fastest speed and the highest energy efficiency.
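The energy-efficiency metric referred to here combines both quantities named in the abstract; in the standard formulation (our notation, not the study's):

    E = \bar{P} \, t_{\mathrm{exec}}, \qquad
    \mathrm{efficiency} = \frac{\mathrm{FLOPs}}{E}
                        = \frac{\mathrm{FLOPs}}{\bar{P} \, t_{\mathrm{exec}}}
    \quad \text{(e.g., MFLOPS/W)}

so a slower processor can still come out ahead on efficiency whenever its average power draw is proportionally lower than its slowdown.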
Evaluating the Arm Ecosystem for High Performance Computing
Proceedings of the Platform for Advanced Scientific Computing Conference, 2019
In recent years, Arm-based processors have arrived on the HPC scene, offering an alternative to the existing status quo, which was largely dominated by x86 processors. In this paper, we evaluate the Arm ecosystem, both the hardware offering and the software stack available to users, by benchmarking a production HPC platform that uses Marvell's ThunderX2 processors. We investigate the performance of complex scientific applications across multiple nodes, and we also assess the maturity of the software stack and its ease of use from a user's perspective. This paper finds that the performance across our benchmarking applications is generally as good as, or better than, that of well-established platforms, and we conclude from our experience that there are no major hurdles that might hinder wider adoption of this ecosystem within the HPC community.
A performance analysis of the first generation of HPC‐optimized Arm processors
Concurrency and Computation: Practice and Experience
In this paper, we present performance results from Isambard, the first production supercomputer to be based on Arm CPUs that have been optimized specifically for HPC. Isambard is the first Cray XC50 "Scout" system, combining Cavium ThunderX2 Arm-based CPUs with Cray's Aries interconnect. The full Isambard system will be delivered in the summer of 2018, when it will contain over 10,000 Arm cores. In this work, we present node-level performance results from eight early-access nodes that were upgraded to B0 beta silicon in March 2018. We present node-level benchmark results comparing ThunderX2 with mainstream CPUs, including Intel Skylake and Broadwell, as well as Xeon Phi. We focus on a range of applications and mini-apps important to the UK national HPC service, ARCHER, as well as to the Isambard project partners and the wider HPC community. We also compare performance across three major software toolchains available for Arm: Cray's CCE, Arm's version of Clang/Flang/LLVM, and GNU.
Virtualizing CUDA Enabled GPGPUs on ARM Clusters
Parallel Processing and Applied Mathematics, 2016
Tiny ARM-based devices are the backbone of Internet of Things technologies; meanwhile, the availability of lightweight, high-performance multicore CPUs has pushed high-performance computing toward hybrid architectures that exploit several levels of parallelism. In this paper we describe how to accelerate inexpensive ARM-based computing nodes with high-end CUDA-enabled GPGPUs hosted on x86_64 machines, using the GVirtuS general-purpose virtualization service. We sketch a vision of hierarchical remote workload distribution among different devices. Preliminary but promising performance evaluation data suggest that the developed technology is suitable for real-world applications. An ARM port of GVirtuS is motivated by several application fields, such as high-performance Internet of Things (HPIoT) and cloud computing; in HPC infrastructures, ARM processors are used as computing nodes, often equipped with a small on-chip GPU or a GPU integrated on the CPU board [5].
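The core mechanism behind this kind of GPU remoting is API interception: a client-side stub with the same shape as a CUDA runtime call serializes its arguments and ships them to a server that owns the real GPU. The C++ sketch below is a much-simplified illustration of that split-driver idea, not GVirtuS's actual wire protocol or API; all names (gv_cudaMalloc, Transport, RpcRequest) are hypothetical.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <functional>

    // Hypothetical request record for one intercepted CUDA runtime call.
    struct RpcRequest {
        char     name[32];   // e.g. "cudaMalloc"
        uint64_t arg;        // serialized scalar argument (here: allocation size)
    };

    // Hypothetical transport: in a real remoting system this would be a TCP
    // socket or shared-memory channel between the ARM client and the GPU server.
    using Transport = std::function<uint64_t(const RpcRequest&)>;

    // Client-side stub: shaped like cudaMalloc, but instead of touching a
    // local GPU it forwards the request and returns a remote device handle.
    uint64_t gv_cudaMalloc(Transport& send, size_t bytes) {
        RpcRequest req{};
        std::strncpy(req.name, "cudaMalloc", sizeof(req.name) - 1);
        req.arg = bytes;
        return send(req);  // remote device pointer, opaque to the client
    }

    int main() {
        // Stand-in "server": a real deployment runs this loop on the
        // GPU-equipped x86_64 node and calls the actual CUDA runtime there.
        uint64_t next_handle = 0x1000;
        Transport loopback = [&](const RpcRequest& r) {
            std::printf("server: %s(%llu bytes)\n", r.name,
                        (unsigned long long)r.arg);
            return next_handle += 0x1000;  // fake device pointer
        };

        uint64_t dptr = gv_cudaMalloc(loopback, 1 << 20);
        std::printf("client: remote device ptr = 0x%llx\n",
                    (unsigned long long)dptr);
        return 0;
    }

In a real deployment the stub library stands in for the CUDA runtime on the GPU-less client (GVirtuS calls these sides the frontend and backend), so unmodified CUDA applications on the ARM node can transparently execute kernels on the remote GPU.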
Tibidabo: Making the case for an ARM-based HPC system
Future Generation Computer Systems, 2014
It is widely accepted that future HPC systems will be limited by their power consumption. Current HPC systems are built from commodity server processors, designed over the years to achieve maximum performance, with energy efficiency being an afterthought. In this paper we advocate a different approach: building HPC systems from low-power embedded and mobile technology parts, designed over time for maximum energy efficiency, which now show promise for competitive performance. We introduce the architecture of Tibidabo, the first large-scale HPC cluster built from ARM multicore chips, and a detailed performance and energy-efficiency evaluation. We present the lessons learned for the design and improvement in energy efficiency of future HPC systems based on such low-power cores. Based on our experience with the prototype, we perform simulations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips would increase the energy efficiency of our cluster by 8.7x, reaching an energy efficiency of 1046 MFLOPS/W.
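As a sanity check on these figures (our arithmetic, not a number reported by the paper): an 8.7x improvement landing at 1046 MFLOPS/W implies the measured prototype achieved roughly

    \frac{1046\ \mathrm{MFLOPS/W}}{8.7} \approx 120\ \mathrm{MFLOPS/W}.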
On the use of remote GPUs and low-power processors for the acceleration of scientific applications
Many current high-performance clusters include one or more GPUs per node in order to dramatically reduce application execution time, but the utilization of these accelerators is usually far below 100%. In this context, remote GPU virtualization can help to reduce acquisition costs as well as overall energy consumption. In this paper, we investigate the potential overhead and bottlenecks of several "heterogeneous" scenarios consisting of client GPU-less nodes running CUDA applications and remote GPU-equipped server nodes providing access to NVIDIA hardware accelerators. The experimental evaluation is performed using three general-purpose multicore processors (Intel Xeon, Intel Atom, and ARM Cortex-A9), two graphics accelerators (NVIDIA GeForce GTX480 and NVIDIA Quadro M1000), and two relevant scientific applications (CUDASW++ and LAMMPS) arising in bioinformatics and molecular dynamics simulations.
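A useful first-order model of the overhead such remote-GPU scenarios add (our simplification, not the paper's analysis):

    t_{\mathrm{remote}} \approx t_{\mathrm{kernel}} + \frac{D}{B} + n L

where D is the data shipped between client and server, B the network bandwidth, L the per-call round-trip latency, and n the number of forwarded API calls. Compute-bound codes with few, large transfers hide the remoting cost best, while the choice of client CPU mainly affects how quickly arguments are serialized and results consumed.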
Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU
Future Generation Computer Systems, 2020
In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by the latest Marvell CPU, the ThunderX2. This is the same CPU that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500, in November 2018. We study everything from microbenchmarks up to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has lower performance on average, mainly due to its smaller vector unit, yet this is somewhat compensated by its wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.
High-Performance Computing on Complex Environments, 2014
HPC platforms are becoming increasingly heterogeneous and hierarchical. The main source of heterogeneity within many individual computing nodes is the use of specialized accelerators, such as GPUs, alongside general-purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As modern HPC clusters become more heterogeneous, owing to the increasing number of different processing devices, a hierarchical approach needs to be taken with respect to memory and communication interconnects in order to reduce complexity. During recent years, many scientific codes have been ported to multicore and GPU architectures; to achieve optimum performance of these applications on hybrid CPU/GPU platforms, software heterogeneity needs to be accounted for. The design and implementation of data-parallel scientific applications for such highly heterogeneous and hierarchical platforms therefore represent a significant scientific and engineering challenge. This chapter presents the state of the art in the solution of this problem, based on functional performance models of computing devices and nodes.
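The essence of functional-performance-model partitioning is to give each device a share of the data proportional to its measured speed. The C++ sketch below illustrates only the proportional-split step, with speeds taken as fixed constants; in the actual approach described by the chapter, speeds are functions of problem size built from benchmarking data, and the split is refined iteratively.

    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Split n work items across devices proportionally to measured speeds.
    // A real functional performance model would evaluate speed_i(n_i)
    // iteratively; here the speeds are fixed constants (a simplification).
    std::vector<long> partition(long n, const std::vector<double>& speed) {
        const double total = std::accumulate(speed.begin(), speed.end(), 0.0);
        std::vector<long> share(speed.size());
        long assigned = 0;
        for (size_t i = 0; i < speed.size(); ++i) {
            share[i] = static_cast<long>(n * speed[i] / total);
            assigned += share[i];
        }
        share.back() += n - assigned;  // give rounding remainder to last device
        return share;
    }

    int main() {
        // Hypothetical node: one CPU socket and two GPUs with relative speeds.
        std::vector<double> speed = {1.0, 6.5, 6.5};
        for (long s : partition(1'000'000, speed))
            std::printf("%ld\n", s);  // ~71428, ~464285, ~464287
        return 0;
    }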
On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework
International Journal of Parallel Programming, 2016
The astonishing development of diverse hardware platforms is twofold: on one side, the challenge of exascale performance for big-data processing and management; on the other, mobile and embedded devices for data collection and human-machine interaction. This has driven a highly hierarchical evolution of programming models. GVirtuS is a general virtualization system, developed in 2009 and first introduced in 2010, that provides a completely transparent layer between GPUs and virtual machines. This paper shows the latest achievements and developments of GVirtuS, which now supports CUDA 6.5, memory management, and scheduling. Thanks to its new and improved remoting capabilities, GVirtuS now enables GPU sharing among physical and virtual machines based on x86 and ARM CPUs, on local workstations, computing clusters, and distributed cloud appliances.