Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN
Abstract
3D convolutional neural networks (3D CNNs) capture spatial and temporal information from 3D data such as video sequences. However, the convolution and pooling operations inevitably discard information. To improve both the visual explanations and the classification performance of 3D CNNs, we propose two approaches: (i) aggregating the layer-wise discrete gradients, from global to local layers (global–local), of a trained 3DResNext network, and (ii) adding an attention gating network to improve action-recognition accuracy. The proposed approach demonstrates the usefulness of every layer, termed global–local attention, in a 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained for action classification, and gradients are backpropagated with respect to the maximum predicted class. The gradient and activation of every layer are then up-sampled and aggregated to produce a more nuanced attention map that highlights the parts of the input video most relevant to the predicted class. Contour thresholding of this final attention map yields the final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCAM. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention maps. Furthermore, action recognition via attention gating of each layer yields better classification results than the baseline model.
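The layer-wise aggregation described in the abstract can be sketched in NumPy. This is an illustrative sketch only, not the authors' code: the function names, array shapes, nearest-neighbour up-sampling, and the Grad-CAM-style per-layer map (global-average-pooled gradients weighting the activations) are all assumptions about one plausible realization of "up-sample, normalize, and aggregate the gradient and activation of every layer, then threshold the final attention."

```python
import numpy as np

def layer_attention(activation, gradient):
    # Grad-CAM-style map for one layer (assumed form): channel weights
    # from globally averaged gradients, then a weighted sum of activations.
    # activation, gradient: arrays of shape (C, T, H, W)
    weights = gradient.mean(axis=(1, 2, 3))          # (C,)
    cam = np.tensordot(weights, activation, axes=1)  # (T, H, W)
    return np.maximum(cam, 0)                        # keep positive evidence only

def upsample_nn(vol, target):
    # Nearest-neighbour up-sampling of a (T, H, W) map to the target shape
    # (assumes target dims are integer multiples of the source dims).
    f = [t // s for s, t in zip(vol.shape, target)]
    return vol.repeat(f[0], 0).repeat(f[1], 1).repeat(f[2], 2)

def global_local_attention(acts, grads, target):
    # Aggregate normalized per-layer maps, from global (late) to local
    # (early) layers, into one attention volume at the target resolution.
    agg = np.zeros(target)
    for a, g in zip(acts, grads):
        cam = upsample_nn(layer_attention(a, g), target)
        agg += cam / (cam.max() + 1e-8)  # per-layer normalization before summing
    return agg / agg.max()

def localize(attention, thr=0.5):
    # Binary localization mask by thresholding the final attention
    # (the paper uses contour thresholding; a plain cutoff stands in here).
    return attention >= thr
```

In a real pipeline the `acts`/`grads` lists would come from forward/backward hooks on the trained network, one pair per layer, with the backward pass seeded by the maximum predicted class score.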
Acknowledgements
The authors would like to thank KAKENHI Project No. 16K00239 for funding the research.
Author information
Author notes
- Muthu Subash Kavitha and Takio Kurita contributed equally to this work.
Authors and Affiliations
- Novanto Yudistira: Informatics Engineering, Faculty of Computer Science, Brawijaya University, Veteran St. 8, Malang, East Java 65145, Indonesia
- Muthu Subash Kavitha: School of Information and Data Sciences, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki, Japan
- Takio Kurita: Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8521, Japan
Corresponding author
Correspondence to Novanto Yudistira.
Additional information
Communicated by Koichi Kise.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yudistira, N., Kavitha, M.S. & Kurita, T. Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN. Int J Comput Vis 130, 2349–2363 (2022). https://doi.org/10.1007/s11263-022-01649-x
- Received: 22 January 2021
- Accepted: 07 July 2022
- Published: 01 August 2022
- Version of record: 01 August 2022
- Issue date: October 2022