Novel accelerated methods for convolution neural network with matrix core
Abstract
The powerful parallel computing capability of GPUs and the recent development of matrix processing units provide new opportunities to improve the performance of convolutional neural networks (CNNs) on GPUs. For the Winograd convolution algorithm, the most widely used and best-performing convolution algorithm in CNNs, several tuning efforts already exist, but they all ignore the matrix operation units and therefore fail to make full use of the GPU's computing resources. This paper introduces a single-precision accelerated solution for CNNs on GPUs. Based on the architectural characteristics, the optimal data layout, grid partitioning, and block partitioning are derived. To accommodate the variety of padding configurations that arise in practice, an efficient dynamic padding scheme is designed, and a pipelined algorithm with operator fusion is implemented on top of the matrix cores. The AMD deep learning acceleration library MIOpen is used as the baseline. Taking several convolutional layers of ResNet50 as experimental input, the evaluation shows that our approach outperforms MIOpen with a speedup of 1.41x on the MI210 and reaches 74% of the single-precision peak performance. Applying this method to the training and inference of ResNet50 yields a speedup of 1.68x.
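The abstract builds on the Winograd minimal filtering algorithm for 3x3 convolutions. As a point of reference, the sketch below shows the standard single-tile F(2x2, 3x3) computation in pure Python, using the usual transform matrices B^T, G, and A^T. This is only an illustration of the underlying algorithm; the paper's GPU kernels, data layouts, dynamic padding, and matrix-core pipeline are not reproduced here, and the function names are ours.

```python
# Single-tile Winograd F(2x2, 3x3): a 4x4 input tile d and a 3x3 filter g
# produce a 2x2 output tile via Y = A^T [ (G g G^T) .* (B^T d B) ] A.

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Standard transform matrices for F(2x2, 3x3)
B_T = [[1,  0, -1,  0],
       [0,  1,  1,  0],
       [0, -1,  1,  0],
       [0,  1,  0, -1]]
G   = [[1.0,  0.0, 0.0],
       [0.5,  0.5, 0.5],
       [0.5, -0.5, 0.5],
       [0.0,  0.0, 1.0]]
A_T = [[1, 1,  1,  0],
       [0, 1, -1, -1]]

def winograd_tile(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile and a 3x3 filter."""
    U = matmul(matmul(G, g), transpose(G))       # transformed filter, 4x4
    V = matmul(matmul(B_T, d), transpose(B_T))   # transformed input, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)]   # elementwise (Hadamard) product
         for i in range(4)]
    return matmul(matmul(A_T, M), transpose(A_T))  # inverse transform, 2x2

def direct_conv(d, g):
    """Reference: direct 3x3 valid convolution (cross-correlation)."""
    return [[sum(d[i + u][j + v] * g[u][v]
                 for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

The transform replaces the 36 multiplications of the direct method with 16 in the elementwise product; in a batched GPU implementation, those elementwise products across tiles and channels are reorganized into the matrix multiplications that matrix cores accelerate.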
Availability of data and materials
Data and materials sharing is not applicable to this article, as no datasets were generated or analyzed during the current study; source data are provided with the paper in Figs. 4, 5, 6, 7, 8, 9 and 10.
Funding
This work was supported by the second batch of cultivation projects of Pazhou Laboratory in 2022, No. PZL2022KF0008, and the Major Key Project of PCL.
Author information
Authors and Affiliations
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, Guangdong, China
Yijie Guo, Lu Lu & Songxiang Zhu - Pazhou Laboratory, Guangzhou, 510005, Guangdong, China
Yijie Guo, Lu Lu & Songxiang Zhu - Pengcheng Laboratory, Shenzhen, 518000, Guangdong, China
Lu Lu
Authors
- Yijie Guo
- Lu Lu
- Songxiang Zhu
Contributions
Yijie Guo and Songxiang Zhu did the research and wrote the main manuscript text under the guidance of Lu Lu. All authors reviewed the manuscript.
Corresponding author
Correspondence to Lu Lu.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethics approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Guo, Y., Lu, L. & Zhu, S. Novel accelerated methods for convolution neural network with matrix core. J Supercomput 79, 19547–19573 (2023). https://doi.org/10.1007/s11227-023-05399-6
- Accepted: 15 May 2023
- Published: 30 May 2023
- Version of record: 30 May 2023
- Issue date: November 2023
- DOI: https://doi.org/10.1007/s11227-023-05399-6