Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform
Related papers
Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
Proceedings of the HPC Asia 2023 Workshops
This paper assesses and reports the experience of ten teams working to port, validate, and benchmark several High Performance Computing applications on a novel GPU-accelerated Arm testbed system. The testbed consists of eight NVIDIA Arm HPC Developer Kit systems, each equipped with a server-class Arm CPU from Ampere Computing and two data center GPUs from NVIDIA Corp. The systems are connected via an InfiniBand interconnect. The selected applications and mini-apps are written in several programming languages and use multiple GPU programming models, such as CUDA, OpenACC, and OpenMP offloading. Application porting requires a robust and easy-to-access programming environment, including a variety of compilers and optimized scientific libraries. The goal of this work is to evaluate platform readiness and assess the effort required from developers to deploy well-established scientific workloads on current and future generation Arm-based GPU-accelerated HPC systems. The reported case studies demonstrate that the current level of maturity and diversity of software and tools is already adequate for large-scale production deployments.
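To make the porting effort concrete, here is a minimal sketch of directive-based offloading, one of the programming models the abstract names. It is a hypothetical SAXPY loop written with OpenMP target offloading, not code from any of the ported applications; the compile command shown is one option among several (e.g. `nvc -mp=gpu` from the NVIDIA HPC SDK).

```c
/* Minimal OpenMP target-offload sketch (hypothetical SAXPY, not from the
 * paper's applications). Compile e.g. with: nvc -mp=gpu saxpy.c */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* map clauses move the arrays to the GPU and copy y back to the host */
    #pragma omp target teams distribute parallel for map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]); /* expect 4.0 */
    return 0;
}
```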
Proceedings of the HPC Asia 2023 Workshops
A set of benchmarks, including numerical libraries and real-world scientific applications, were run on several modern ARM systems (Amazon Graviton 3/2, Fujitsu A64FX, Ampere Altra, ThunderX2) and compared to x86 systems (Intel and AMD) as well as to hybrid Intel x86/NVIDIA GPU systems. For benchmarking automation, the application kernel module of XDMoD was used. XDMoD is a comprehensive suite for HPC resource utilization and performance monitoring. The application kernel module enables continuous performance monitoring of HPC resources through the regular execution of user applications. It has been used on the Ookami system (one of the first USA-based Fujitsu ARM A64FX SVE 512 systems). The applications used for this study span a variety of computational paradigms: HPCC (several HPC benchmarks), NWChem (ab initio chemistry), OpenFOAM (partial differential equation solver), GROMACS (biomolecular simulation), AI Benchmark Alpha (AI benchmark), and Enzo (adaptive mesh refinement). ARM performance, while generally slower, was nonetheless shown in many cases to be comparable to current x86 counterparts and often outperforms previous generations of x86 CPUs. In terms of energy efficiency, which considers both power consumption and execution time, ARM was shown in most cases to be more energy efficient than x86 processors. In cases where GPU performance was tested, the GPU systems showed the fastest speed and the highest energy efficiency.
Evaluating the Arm Ecosystem for High Performance Computing
Proceedings of the Platform for Advanced Scientific Computing Conference, 2019
In recent years, Arm-based processors have arrived on the HPC scene, offering an alternative to the existing status quo, which was largely dominated by x86 processors. In this paper, we evaluate the Arm ecosystem, both the hardware offering and the software stack that is available to users, by benchmarking a production HPC platform that uses Marvell's ThunderX2 processors. We investigate the performance of complex scientific applications across multiple nodes, and we also assess the maturity of the software stack and the ease of use from a user's perspective. This paper finds that the performance across our benchmarking applications is generally as good as, or better than, that of well-established platforms, and we can conclude from our experience that there are no major hurdles that might hinder wider adoption of this ecosystem within the HPC community.
A performance analysis of the first generation of HPC‐optimized Arm processors
Concurrency and Computation: Practice and Experience
In this paper, we present performance results from Isambard, the first production supercomputer to be based on Arm CPUs that have been optimized specifically for HPC. Isambard is the first Cray XC50 "Scout" system, combining Cavium ThunderX2 Arm-based CPUs with Cray's Aries interconnect. The full Isambard system will be delivered in the summer of 2018, when it will contain over 10,000 Arm cores. In this work, we present node-level performance results from eight early-access nodes that were upgraded to B0 beta silicon in March 2018. We present node-level benchmark results comparing ThunderX2 with mainstream CPUs, including Intel Skylake and Broadwell, as well as Xeon Phi. We focus on a range of applications and mini-apps important to the UK national HPC service, ARCHER, as well as to the Isambard project partners and the wider HPC community. We also compare performance across three major software toolchains available for Arm: Cray's CCE, Arm's version of Clang/Flang/LLVM, and GNU.
Tibidabo: Making the case for an ARM-based HPC system
Future Generation Computer Systems, 2014
It is widely accepted that future HPC systems will be limited by their power consumption. Current HPC systems are built from commodity server processors, designed over years to achieve maximum performance, with energy efficiency being an afterthought. In this paper we advocate a different approach: building HPC systems from low-power embedded and mobile technology parts, over time designed for maximum energy efficiency, which now show promise for competitive performance. We introduce the architecture of Tibidabo, the first large-scale HPC cluster built from ARM multicore chips, and a detailed performance and energy efficiency evaluation. We present the lessons learned for the design and improvement in energy efficiency of future HPC systems based on such low-power cores. Based on our experience with the prototype, we perform simulations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips would increase the energy efficiency of our cluster by 8.7x, reaching an energy efficiency of 1046 MFLOPS/W.
ARM HPC Ecosystem and the Reemergence of Vectors
Proceedings of the Computing Frontiers Conference, 2017
ARM's involvement in funded international projects has helped pave the road towards ARM-based supercomputers. ARM and its partners have collaboratively grown an HPC ecosystem with software and hardware solutions that provide choice in a unified software ecosystem. Partners have announced important HPC deployments resulting from collaborations around the globe. One of the key enabling technologies for ARM in HPC is the Scalable Vector Extension, an instruction set extension for vector processing. This paper discusses ARM's journey into HPC, the current state of the ARM HPC ecosystem, the approach to HPC node architecture co-design, and details on the Scalable Vector Extension as a future technology representing the reemergence of vectors.
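As an illustration of what makes SVE attractive for HPC, the sketch below shows a vector-length-agnostic DAXPY written with the SVE ACLE intrinsics; the function name and kernel are ours, not from the paper, but the predication and `svcntd()` idiom are the mechanism the extension provides: the same binary runs unmodified on any hardware vector length from 128 to 2048 bits.

```c
/* Vector-length-agnostic SVE kernel (illustrative, not from the paper).
 * Compile with an SVE-capable toolchain, e.g. gcc -O2 -march=armv8-a+sve */
#include <arm_sve.h>

void daxpy_sve(double a, const double *x, double *y, long n) {
    /* svcntd() reports how many 64-bit lanes the hardware provides,
     * so the loop stride adapts to the implementation's vector length. */
    for (long i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);        /* predicate masks the tail */
        svfloat64_t vx = svld1(pg, x + i);
        svfloat64_t vy = svld1(pg, y + i);
        svst1(pg, y + i, svmla_x(pg, vy, vx, a)); /* y[i] += a * x[i] */
    }
}
```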
High-Performance Computing on Complex Environments, 2014
HPC platforms are becoming increasingly heterogeneous and hierarchical. The main source of heterogeneity within individual computing nodes is the use of specialized accelerators, such as GPUs, alongside general-purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As modern HPC clusters become more heterogeneous, due to the increasing number of different processing devices, a hierarchical approach needs to be taken with respect to memory and communication interconnects to reduce complexity. During recent years, many scientific codes have been ported to multicore and GPU architectures. To achieve optimum performance of these applications on CPU/GPU hybrid platforms, software heterogeneity needs to be accounted for. Therefore, the design and implementation of data-parallel scientific applications for such highly heterogeneous and hierarchical platforms represents a significant scientific and engineering challenge. This chapter presents the state of the art in the solution of this problem based on the functional performance models of computing devices and nodes.
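A minimal sketch of the core idea follows, under a simplifying assumption: a static, speed-proportional split, whereas the chapter's functional performance models additionally treat speed as a function of problem size. Each device receives a share of the data proportional to its benchmarked throughput; all names and the speed table are illustrative.

```c
/* Sketch of performance-model-driven data partitioning: device i's share
 * of N elements is proportional to its measured speed. Illustrative only. */
#include <stdio.h>

/* speed[i]: benchmarked throughput of device i (e.g. Gflop/s) */
void partition(const double *speed, int ndev, long n, long *share) {
    double total = 0.0;
    for (int i = 0; i < ndev; i++) total += speed[i];

    long assigned = 0;
    for (int i = 0; i < ndev; i++) {
        share[i] = (long)(n * speed[i] / total);
        assigned += share[i];
    }
    share[0] += n - assigned; /* give the rounding remainder to device 0 */
}

int main(void) {
    double speed[3] = { 50.0, 400.0, 350.0 };  /* CPU, GPU 0, GPU 1 */
    long share[3];
    partition(speed, 3, 1000000, share);
    for (int i = 0; i < 3; i++)
        printf("device %d gets %ld elements\n", i, share[i]);
    return 0;
}
```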
Virtualizing CUDA Enabled GPGPUs on ARM Clusters
Parallel Processing and Applied Mathematics, 2016
Tiny ARM-based devices are the backbone of Internet of Things technologies; nevertheless, the availability of high-performance, lightweight multicore CPUs has pushed High Performance Computing toward hybrid architectures that leverage several levels of parallelism. In this paper we describe how to accelerate inexpensive ARM-based computing nodes with high-end CUDA-enabled GPGPUs hosted on x86_64 machines using the GVirtuS general-purpose virtualization service. We sketch the vision of a possible hierarchical remote workload distribution among different devices. Preliminary, but promising, performance evaluation data suggests that the developed technology is suitable for real-world applications. An ARM port of GVirtuS is motivated by application fields such as High Performance Internet of Things (HPIoT) and cloud computing. In HPC infrastructures, ARM processors are used as computing nodes, often equipped with a small GPU on chip or integrated on the CPU board [5].
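The sketch below illustrates the general API-remoting idea behind such virtualization services: a front-end stub on the ARM node packs a CUDA-style call and ships it to a back end on the x86_64 host that owns the real GPU. This is emphatically not GVirtuS's actual protocol; the opcode, message layout, and function names are invented for illustration.

```c
/* Invented sketch of API remoting (NOT GVirtuS's real protocol): the ARM
 * front end forwards a cudaMalloc-shaped request to a remote back end. */
#include <stdint.h>
#include <stddef.h>
#include <sys/socket.h>

enum { RPC_CUDA_MALLOC = 1 };               /* hypothetical opcode */

struct rpc_msg { uint32_t op; uint64_t arg; uint64_t result; };

/* Front-end stub: the back end on the GPU-owning host performs the real
 * allocation and returns an opaque device handle, valid only remotely. */
int remote_cudaMalloc(int sock, uint64_t *dev_ptr, size_t size) {
    struct rpc_msg msg = { .op = RPC_CUDA_MALLOC, .arg = size };
    if (send(sock, &msg, sizeof msg, 0) != sizeof msg) return -1;
    if (recv(sock, &msg, sizeof msg, MSG_WAITALL) != sizeof msg) return -1;
    *dev_ptr = msg.result;
    return 0;
}
```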
Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU
Future Generation Computer Systems, 2020
In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This CPU is the same one that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500, in November 2018. We study everything from microbenchmarks up to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has lower performance on average, mainly due to its smaller vector unit, somewhat compensated by its wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.
Scaling scientific applications on clusters of hybrid multicore/GPU nodes
Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11, 2011
Rapid advances in the performance and programmability of graphics accelerators have made GPU computing a compelling solution for a wide variety of application domains. However, the increased complexity that results from architectural heterogeneity and imbalances in hardware resources poses significant programming challenges in harnessing the performance advantages of GPU-accelerated parallel systems. Moreover, the speedup derived from GPUs is often offset by longer communication latencies and inefficient task scheduling. To achieve the best possible performance, a suitable parallel programming model is therefore essential. In this paper, we explore a new hybrid parallel programming model that incorporates GPU acceleration with the Partitioned Global Address Space (PGAS) programming paradigm. As we demonstrate, by combining Unified Parallel C (UPC) and CUDA as a case study, this hybrid model offers programmers both enhanced programmability and powerful heterogeneous execution. Two application benchmarks, namely NAS Parallel Benchmark (NPB) FT and MG, are used to show the effectiveness of our proposed hybrid approach. Experimental results indicate that both implementations achieve significantly better performance due to optimization opportunities offered by the hybrid model, such as the funneled execution mode and fine-grained overlapping of communication and computation.
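As a rough illustration of the fine-grained overlap the abstract mentions, the sketch below pipelines chunked host-device transfers across CUDA streams so that copies of one chunk can overlap with computation on another. This is our reconstruction of the general pattern, not the paper's code: the UPC communication side is elided, the kernel launch is left as a comment (launch syntax requires nvcc), and for true asynchrony the host buffer should be pinned (e.g. allocated with cudaMallocHost).

```c
/* Chunked stream pipeline illustrating communication/computation overlap
 * (our reconstruction, not the paper's code). Uses the CUDA runtime C API;
 * `host` should be pinned memory and `dev` a device allocation of size n. */
#include <cuda_runtime.h>

#define NCHUNKS 4

void pipeline(float *host, float *dev, size_t n) {
    cudaStream_t s[NCHUNKS];
    size_t chunk = n / NCHUNKS;  /* assume n divisible by NCHUNKS */

    for (int i = 0; i < NCHUNKS; i++) cudaStreamCreate(&s[i]);

    for (int i = 0; i < NCHUNKS; i++) {
        size_t off = i * chunk;
        cudaMemcpyAsync(dev + off, host + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        /* the per-chunk kernel would be enqueued here on stream s[i],
         * so its execution overlaps the other chunks' transfers */
        cudaMemcpyAsync(host + off, dev + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < NCHUNKS; i++) {
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```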