An effective 3-D fast fourier transform framework for multi-GPU accelerated distributed-memory systems (original) (raw)
Abstract
This paper introduces an efficient and flexible 3D FFT framework for state-of-the-art multi-GPU distributed-memory systems. In contrast to the traditional pure MPI implementation, the multi-GPU distributed-memory systems can be exploited by employing a hybrid multi-GPU programming model that combines MPI with OpenMP to achieve effective communication. An asynchronous strategy that creates multiple streams and threads to reduce blocking time is adopted to accelerate intra-node communication. Furthermore, we combine our scheme with the GPU-Aware MPI implementation to perform GPU-GPU data transfers without CPU involvement. We also optimize the local FFT and transpose by creating fast parallel kernels to accelerate the total transform. Results show that our framework outperforms the state-of-the-art distributed 3D FFT library, being up to achieve 2× faster in a single node and 1.65× faster using two nodes.
Access this article
Subscribe and save
- Starting from 10 chapters or articles per month
- Access and download chapters and articles from more than 300k books and 2,500 journals
- Cancel anytime View plans
Buy Now
Price excludes VAT (USA)
Tax calculation will be finalised during checkout.
Instant access to the full article PDF.
Similar content being viewed by others
References
- Asaadi H, Khaldi D, Chapman B( 2016) A comparative survey of the hpc and big data paradigms: Analysis and experiments. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 423– 432 . IEEE
- ORNL (Oak Ridge National Laboratory) (2021): Frontier. https://www.olcf.ornl.gov/frontier/. Accessed: 2021-11-01
- Brown WM ( 2011) Gpu acceleration in lammps. In: LAMMPS User’s Workshop and Symposium
- Pronk S, Páll S, Schulz R, Larsson P, Bjelkmar P, Apostolov R, Shirts MR, Smith JC, Kasson PM, Van Der Spoel D, et al ( 2013) Gromacs 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29( 7), 845– 854
- Salomon-Ferrer R, Gotz AW, Poole D, Le Grand S, Walker RC ( 2013) Routine microsecond molecular dynamics simulations with amber on gpus. 2. explicit solvent particle mesh ewald. Journal of chemical theory and computation 9( 9), 3878– 3888
- Lee M, Malaya N, Moser RD ( 2013) Petascale direct numerical simulation of turbulent channel flow on up to 786k cores. In: SC’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1– 11 . IEEE
- Michel J-C, Moulinec H, Suquet P (1999) Effective properties of composite materials with periodic microstructure: a computational approach. Comput Methods Appl Mech Eng 172(1–4):109–143
Article MathSciNet Google Scholar - Jung J, Kobayashi C, Imamura T, Sugita Y (2016) Parallel implementation of 3d fft with volumetric decomposition schemes for efficient molecular dynamics simulations. Comput Phys Commun 200:57–65
Article MathSciNet Google Scholar - Tari V, Lebensohn RA, Pokharel R, Turner TJ, Shade PA, Bernier JV, Rollett AD (2018) Validation of micro-mechanical fft-based simulations using high energy diffraction microscopy on ti-7al. Acta Mater 154:273–283
Article Google Scholar - Almgren AS, Bell JB, Lijewski MJ, Lukić Z, Van Andel E (2013) Nyx: A massively parallel amr code for computational cosmology. Astrophys J 765(1):39
Article Google Scholar - Kowalski K, Bair R, Bauman NP, Boschen JS, Bylaska EJ, Daily J, de Jong WA, Dunning T Jr, Govind N, Harrison RJ et al (2021) From nwchem to nwchemex: Evolving with the computational chemistry landscape. Chem Rev 121(8):4962–4998
Article Google Scholar - NVIDIA: cuFFT. https://docs.nvidia.com/cuda/cufft/index.html
- ROCmSoftwarePlatform (2018) Rocmsoftwareplatform/ROCFFT: Next generation FFT implementation for ROCM . https://github.com/ROCmSoftwarePlatform/rocFFT
- Gholami A, Hill J, Malhotra D, Biros G (2015) Accfft: a library for distributed-memory fft on cpu and gpu architectures. arXiv preprint arXiv:1506.07933
- Takahashi D (2014) Ffte: A fast fourier transform package. http://www.ffte.jp/
- Ayala A, Tomov S, Haidar A, Dongarra J ( 2020) heffte: highly efficient fft for exascale. In: International Conference on Computational Science, pp. 262– 275 . Springer
- Barker B ( 2015) Message passing interface (mpi). In: Workshop: High Performance Computing on Stampede, vol. 262
- Dagum L, Menon R (1998) Openmp: an industry standard API for shared-memory programming. IEEE Comput Sci Eng 5(1):46–55
Article Google Scholar - Frigo M, Johnson SG (2005) The design and implementation of fftw3. Proc IEEE 93(2):216–231
Article Google Scholar - Luszczek PR, Bailey DH, Dongarra JJ, Kepner J, Lucas RF, Rabenseifner R, Takahashi D ( 2006) The hpc challenge (hpcc) benchmark suite. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, vol. 213, pp. 1188455– 1188677
- Wang H, Potluri S, Bureddy D, Rosales C, Panda DK (2013) Gpu-aware mpi on rdma-enabled clusters: Design, implementation and evaluation. IEEE Trans Parallel Distrib Syst 25(10):2595–2605
Article Google Scholar - Schroeder TC ( 2011) Peer-to-peer & unified virtual addressing. In: GPU Technology Conference, NVIDIA
- Potluri S, Wang H, Bureddy D, Singh AK, Rosales C, Panda DK ( 2012) Optimizing mpi communication on multi-gpu systems using cuda inter-process communication. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1848– 1857 IEEE
- ROCmSoftwarePlatform(2018) ROCmSoftwarePlatform/RCCL: ROCM Communication Collectives Library (RCCL) . https://github.com/ROCmSoftwarePlatform/rccl
- Sunitha N, Raju K, Chiplunkar N.N (2017) Performance improvement of cuda applications by reducing cpu-gpu data transfer overhead. In: 2017 international conference on inventive communication and computational technologies (ICICCT), pp 211– 215 . IEEE
- Jodra JL, Gurrutxaga I, Muguerza J (2015) Efficient 3d transpositions in graphics processing units. Int J Parallel Prog 43(5):876–891
Article Google Scholar - Ruetsch G, Micikevicius P (2009) Optimizing matrix transpose in Cuda. Nvidia CUDA SDK Appl Note 18:1
Google Scholar - AMD (2021) AMD INSTINCT\(^{\rm TM}\) MI100 accelerator | data center GPU | AMD . https://www.amd.com/en/products/server-accelerators/instinct-mi100
Acknowledgements
This work was supported in part by the Major Project on the Integration of Industry, Education and Research of Zhongshan under grant 210610173898370.
Author information
Authors and Affiliations
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510000, China
Binbin Zhou & Lu Lu - Modern Industrial Technology Research Institute, South China University of Technology, Zhongshan, 528400, China
Lu Lu
Corresponding author
Correspondence toLu Lu.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Zhou, B., Lu, L. An effective 3-D fast fourier transform framework for multi-GPU accelerated distributed-memory systems.J Supercomput 78, 17055–17073 (2022). https://doi.org/10.1007/s11227-022-04491-7
- Accepted: 30 March 2022
- Published: 13 May 2022
- Version of record: 13 May 2022
- Issue date: October 2022
- DOI: https://doi.org/10.1007/s11227-022-04491-7