Novel accelerated methods for convolution neural network with matrix core
Abstract
The powerful parallel computing capability of GPUs and the recent development of matrix processing units provide new opportunities to improve the performance of convolutional neural networks (CNNs) on GPUs. For the Winograd convolution algorithm, the most widely used and best-performing convolution algorithm in CNNs, several tuning efforts already exist, but they all ignore the matrix operation units and therefore fail to make full use of the GPU's computing resources. This paper introduces a single-precision accelerated solution for CNNs on GPUs. Based on the architectural characteristics, the optimal data layout, grid partitioning, and block partitioning are derived. To accommodate the variety of padding configurations that arise in practice, an efficient dynamic padding scheme is designed, and a pipelined algorithm with operator fusion is implemented on top of the matrix cores. The AMD deep learning acceleration library MIOpen is used as the baseline. Taking several convolutional layers of ResNet50 as experimental input, the evaluation shows that our approach outperforms MIOpen with a speedup of 1.41x on the MI210 and reaches 74% of the single-precision peak performance. Applying this method to the training and inference of ResNet50 yields a speedup of 1.68x.
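The abstract builds on the Winograd minimal filtering algorithm for 3x3 convolutions. As a point of reference, the sketch below shows the standard single-tile F(2x2, 3x3) computation in pure Python, using the usual transform matrices B^T, G, and A^T. This is only an illustration of the underlying algorithm; the paper's GPU kernels, data layouts, dynamic padding, and matrix-core pipeline are not reproduced here, and the function names are ours.

```python
# Single-tile Winograd F(2x2, 3x3): a 4x4 input tile d and a 3x3 filter g
# produce a 2x2 output tile via Y = A^T [ (G g G^T) .* (B^T d B) ] A.

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Standard transform matrices for F(2x2, 3x3)
B_T = [[1,  0, -1,  0],
       [0,  1,  1,  0],
       [0, -1,  1,  0],
       [0,  1,  0, -1]]
G   = [[1.0,  0.0, 0.0],
       [0.5,  0.5, 0.5],
       [0.5, -0.5, 0.5],
       [0.0,  0.0, 1.0]]
A_T = [[1, 1,  1,  0],
       [0, 1, -1, -1]]

def winograd_tile(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile and a 3x3 filter."""
    U = matmul(matmul(G, g), transpose(G))       # transformed filter, 4x4
    V = matmul(matmul(B_T, d), transpose(B_T))   # transformed input, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)]   # elementwise (Hadamard) product
         for i in range(4)]
    return matmul(matmul(A_T, M), transpose(A_T))  # inverse transform, 2x2

def direct_conv(d, g):
    """Reference: direct 3x3 valid convolution (cross-correlation)."""
    return [[sum(d[i + u][j + v] * g[u][v]
                 for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

The transform replaces the 36 multiplications of the direct method with 16 in the elementwise product; in a batched GPU implementation, those elementwise products across tiles and channels are reorganized into the matrix multiplications that matrix cores accelerate.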
Availability of data and materials
Data and materials sharing is not applicable to this article, as no datasets were generated or analyzed during the current study; source data are provided with the paper in Figs. 4, 5, 6, 7, 8, 9 and 10.
Funding
This work was supported by the second batch of cultivation projects of Pazhou Laboratory in 2022, No. PZL2022KF0008, and the Major Key Project of PCL.
Author information
Authors and Affiliations
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, Guangdong, China
Yijie Guo, Lu Lu & Songxiang Zhu - Pazhou Laboratory, Guangzhou, 510005, Guangdong, China
Yijie Guo, Lu Lu & Songxiang Zhu - Pengcheng Laboratory, Shenzhen, 518000, Guangdong, China
Lu Lu
Authors
- Yijie Guo
- Lu Lu
- Songxiang Zhu
Contributions
Yijie Guo and Songxiang Zhu did the research and wrote the main manuscript text under the guidance of Lu Lu. All authors reviewed the manuscript.
Corresponding author
Correspondence to Lu Lu.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethics approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, Y., Lu, L. & Zhu, S. Novel accelerated methods for convolution neural network with matrix core. J Supercomput 79, 19547–19573 (2023). https://doi.org/10.1007/s11227-023-05399-6
- Accepted: 15 May 2023
- Published: 30 May 2023
- Version of record: 30 May 2023
- Issue date: November 2023
- DOI: https://doi.org/10.1007/s11227-023-05399-6