Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN
Abstract
3D convolutional neural networks (3D CNNs) capture spatial and temporal information from 3D data such as video sequences. However, the convolution and pooling operations inevitably discard information. To improve both the visual explanations and the classification performance of 3D CNNs, we propose two approaches: (i) aggregating the layer-wise discrete gradients, from global to local layers (global–local), of a trained 3DResNext network, and (ii) adding an attention gating network to improve action-recognition accuracy. The proposed approach demonstrates the usefulness of every layer, termed global–local attention, in a 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained for action classification, and gradients are backpropagated with respect to the maximum predicted class. The gradient and activation of every layer are then up-sampled and aggregated to produce a more nuanced attention map that highlights the parts of the input video most relevant to the predicted class. Contour thresholding of this final attention map yields the final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCAM. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention maps. Furthermore, action recognition via attention gating of each layer yields better classification results than the baseline model.
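The layer-wise aggregation described in the abstract can be sketched in NumPy. This is an illustrative sketch only, not the authors' code: the function names, array shapes, nearest-neighbour up-sampling, and the Grad-CAM-style per-layer map (global-average-pooled gradients weighting the activations) are all assumptions about one plausible realization of "up-sample, normalize, and aggregate the gradient and activation of every layer, then threshold the final attention."

```python
import numpy as np

def layer_attention(activation, gradient):
    # Grad-CAM-style map for one layer (assumed form): channel weights
    # from globally averaged gradients, then a weighted sum of activations.
    # activation, gradient: arrays of shape (C, T, H, W)
    weights = gradient.mean(axis=(1, 2, 3))          # (C,)
    cam = np.tensordot(weights, activation, axes=1)  # (T, H, W)
    return np.maximum(cam, 0)                        # keep positive evidence only

def upsample_nn(vol, target):
    # Nearest-neighbour up-sampling of a (T, H, W) map to the target shape
    # (assumes target dims are integer multiples of the source dims).
    f = [t // s for s, t in zip(vol.shape, target)]
    return vol.repeat(f[0], 0).repeat(f[1], 1).repeat(f[2], 2)

def global_local_attention(acts, grads, target):
    # Aggregate normalized per-layer maps, from global (late) to local
    # (early) layers, into one attention volume at the target resolution.
    agg = np.zeros(target)
    for a, g in zip(acts, grads):
        cam = upsample_nn(layer_attention(a, g), target)
        agg += cam / (cam.max() + 1e-8)  # per-layer normalization before summing
    return agg / agg.max()

def localize(attention, thr=0.5):
    # Binary localization mask by thresholding the final attention
    # (the paper uses contour thresholding; a plain cutoff stands in here).
    return attention >= thr
```

In a real pipeline the `acts`/`grads` lists would come from forward/backward hooks on the trained network, one pair per layer, with the backward pass seeded by the maximum predicted class score.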
Acknowledgements
The authors would like to thank KAKENHI Project No. 16K00239 for funding the research.
Author information
Author notes
- Muthu Subash Kavitha and Takio Kurita contributed equally to this work.
Authors and Affiliations
- Novanto Yudistira: Informatics Engineering, Faculty of Computer Science, Brawijaya University, Veteran St. 8, Malang, East Java 65145, Indonesia
- Muthu Subash Kavitha: School of Information and Data Sciences, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki, Japan
- Takio Kurita: Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8521, Japan
Corresponding author
Correspondence to Novanto Yudistira.
Additional information
Communicated by Koichi Kise.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yudistira, N., Kavitha, M.S. & Kurita, T. Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN. Int J Comput Vis 130, 2349–2363 (2022). https://doi.org/10.1007/s11263-022-01649-x
- Received: 22 January 2021
- Accepted: 07 July 2022
- Published: 01 August 2022
- Version of record: 01 August 2022
- Issue date: October 2022