Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN

Abstract

A 3D convolutional neural network (3D CNN) captures spatial and temporal information in 3D data such as video sequences. However, the convolution and pooling mechanisms make some information loss seemingly unavoidable. To improve visual explanations and classification in 3D CNNs, we propose two approaches: (i) aggregating layer-wise global-to-local (global–local) discrete gradients from a trained 3DResNext network, and (ii) implementing an attention gating network to improve action recognition accuracy. The proposed approach aims to show the usefulness of every layer, termed global–local attention, in a 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied to action classification, with backpropagation performed with respect to the maximum predicted class. The gradient and activation of every layer are then up-sampled and aggregated to produce more nuanced attention, which highlights the most critical parts of the input video for the predicted class. We apply contour thresholding to the final attention map to obtain the final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCAM. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating of each layer yields better classification results than the baseline model.
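
The up-sample-and-aggregate step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that per-layer activations and gradients (with respect to the predicted class) have already been extracted from a trained network as arrays of shape (channels, time, height, width). The Grad-CAM-style channel weighting, the nearest-neighbour up-sampling, the plain sum used for aggregation, and the fixed threshold with a bounding box standing in for contour thresholding are all assumptions made for illustration; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def layer_attention(activation, gradient):
    """Grad-CAM-style map for one layer: (C, T, H, W) -> (T, H, W)."""
    # Assumed weighting: each channel's gradient averaged over time and space.
    weights = gradient.mean(axis=(1, 2, 3), keepdims=True)
    cam = (weights * activation).sum(axis=0)
    return np.maximum(cam, 0.0)  # keep only positive evidence

def upsample_nearest(volume, out_shape):
    """Nearest-neighbour resize of a (T, H, W) volume to out_shape."""
    idx = [np.arange(n_out) * n_in // n_out
           for n_in, n_out in zip(volume.shape, out_shape)]
    return volume[np.ix_(*idx)]

def normalize(volume, eps=1e-8):
    """Rescale a volume to approximately [0, 1]."""
    lo, hi = volume.min(), volume.max()
    return (volume - lo) / (hi - lo + eps)

def global_local_attention(activations, gradients, out_shape):
    """Sum the up-sampled, normalized attention maps of all layers."""
    total = np.zeros(out_shape)
    for act, grad in zip(activations, gradients):
        cam = layer_attention(act, grad)
        total += normalize(upsample_nearest(cam, out_shape))
    return normalize(total)

def localize(attention, threshold=0.5):
    """Per-frame bounding box (y0, x0, y1, x1), or None if nothing passes."""
    boxes = []
    for frame in attention:
        ys, xs = np.nonzero(frame >= threshold)
        if ys.size == 0:
            boxes.append(None)
        else:
            boxes.append((int(ys.min()), int(xs.min()),
                          int(ys.max()), int(xs.max())))
    return boxes

# Random stand-ins for three layers, from shallow (local) to deep (global).
rng = np.random.default_rng(0)
shapes = [(8, 8, 28, 28), (16, 4, 14, 14), (32, 2, 7, 7)]
activations = [rng.random(s) for s in shapes]
gradients = [rng.standard_normal(s) for s in shapes]
attention = global_local_attention(activations, gradients, (16, 112, 112))
boxes = localize(attention)
```

Summing normalized maps lets shallow layers contribute fine spatial detail while deep layers contribute coarse, class-discriminative regions; other aggregation rules (for example, a product or weighted sum) would fit the same interface.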

Acknowledgements

The authors would like to thank KAKENHI Project No. 16K00239 for funding the research.

Author information

Author notes

  1. Muthu Subash Kavitha and Takio Kurita contributed equally to this work.

Authors and Affiliations

  1. Informatics Engineering, Faculty of Computer Science, Brawijaya University, Veteran st. 8, Malang, East Java, 65145, Indonesia
    Novanto Yudistira
  2. School of Information and Data Sciences, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki, Japan
    Muthu Subash Kavitha
  3. Graduate School of Advanced Science and Engineering, Hiroshima University, Higashi-hiroshima, Hiroshima, 739-8521, Japan
    Takio Kurita

Authors

  1. Novanto Yudistira
  2. Muthu Subash Kavitha
  3. Takio Kurita

Corresponding author

Correspondence to Novanto Yudistira.

Additional information

Communicated by Koichi Kise.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yudistira, N., Kavitha, M.S. & Kurita, T. Weakly-Supervised Action Localization, and Action Recognition Using Global–Local Attention of 3D CNN. Int J Comput Vis 130, 2349–2363 (2022). https://doi.org/10.1007/s11263-022-01649-x
