STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Abstract: We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques, which require multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) that models the mesh sequences directly. The model uses a hierarchical transformer with intra-frame offset attention and inter-frame self-attention. The attention mechanism allows the model to attend freely between any two vertex patches and so learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at this https URL.
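
The abstract contrasts bi-directional and auto-regressive attention across frames. As a rough illustration of that distinction only, the sketch below runs plain scaled dot-product attention over a few per-frame feature vectors, with and without a causal mask. It is not the authors' implementation: the function name `attention`, the `causal` flag, and the toy `frames` data are all assumptions made for this example, and the model's intra-frame offset attention and hierarchical structure are not modeled here.

```python
import math


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of per-frame feature vectors.

    Hypothetical helper for illustration; returns (outputs, weights).
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # With a causal mask, frame i may attend only to frames 0..i
        # (auto-regressive); otherwise it sees every frame (bi-directional).
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        # Numerically stable softmax over the visible frames only.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [0.0] * len(keys)
        for j, e in zip(visible, exps):
            row[j] = e / total
        weights.append(row)
        # Weighted sum of value vectors.
        outputs.append([
            sum(row[j] * values[j][d] for j in range(len(values)))
            for d in range(len(values[0]))
        ])
    return outputs, weights


# Toy data: three frames, each summarized by a 2-D feature vector (assumed).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, bidirectional = attention(frames, frames, frames, causal=False)
_, autoregressive = attention(frames, frames, frames, causal=True)
```

In the bi-directional case every frame receives nonzero weight from every other frame, whereas in the auto-regressive case the weights on later frames are exactly zero, which is the kind of constraint a future-frame prediction task relies on.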

Submission history

From: Xiaoyu Zhu
[v1] Fri, 31 Mar 2023 16:19:27 UTC (1,801 KB)
[v2] Fri, 26 Jul 2024 19:13:29 UTC (1,801 KB)