Pose is all you need: The pose only group activity recognition system (POGARS)
Related papers
Deep neural network model for group activity recognition using contextual relationship
Engineering Science and Technology, an International Journal, 2018
In this paper, we present a contextual relationship-based learning model that uses a deep neural network to recognize the activities performed by a group of people in a video sequence. The proposed model learns context in a bottom-up manner, from individual human actions up to group-level activity, and also learns from scene information. We build a deep convolutional neural network to capture human action-pose features for a given input video sequence. To capture group-level temporal changes, the aggregated action-pose features of persons within the context area are fed to a deep recurrent neural network, which produces a spatio-temporal group descriptor. In addition, we build a scene-level convolutional neural network to extract a scene-level feature, which improves group activity recognition performance. A probabilistic inference model, added as an extra layer of the deep neural network, ensembles the models and provides a unified deep learning framework. Experimental results show the efficiency of the proposed model for group activity recognition on the standard benchmark Collective Activity dataset. We also report results obtained by varying learning parameters and optimizers, and in particular by comparing the recurrent neural network variants long short-term memory and gated recurrent unit, on the benchmark Collective Activity dataset.
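To make the described bottom-up pipeline concrete, here is a minimal PyTorch sketch of the three components: a per-person CNN for action-pose features, an RNN over the aggregated person features for the spatio-temporal group descriptor, and a scene-level CNN whose output is fused before classification. All module shapes, the mean-pooling aggregation, and names such as `BottomUpGroupModel` are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the bottom-up group activity model (assumed shapes).
import torch
import torch.nn as nn

class BottomUpGroupModel(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256, num_classes=5):
        super().__init__()
        # CNN stub producing one action-pose feature per person crop
        self.person_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # RNN over aggregated person features captures group-level temporal flow
        self.group_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Scene-level CNN feature (same stub architecture for brevity)
        self.scene_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.classifier = nn.Linear(hidden_dim + feat_dim, num_classes)

    def forward(self, person_crops, scene_frames):
        # person_crops: (B, T, P, 3, H, W) crops of P persons over T frames
        # scene_frames: (B, T, 3, H, W) full-frame images
        B, T, P = person_crops.shape[:3]
        f = self.person_cnn(person_crops.flatten(0, 2)).view(B, T, P, -1)
        agg = f.mean(dim=2)                     # aggregate persons per frame
        _, (h, _) = self.group_rnn(agg)         # spatio-temporal group descriptor
        scene = self.scene_cnn(scene_frames.flatten(0, 1)).view(B, T, -1).mean(1)
        return self.classifier(torch.cat([h[-1], scene], dim=-1))
```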
Multi-Level Sequence GAN for Group Activity Recognition
Asian Conference on Computer Vision (ACCV), 2018
We propose a novel semi-supervised, Multi-Level Sequential Generative Adversarial Network (MLS-GAN) architecture for group activity recognition. In contrast to previous works which utilise manually annotated individual human action predictions, we allow the model to learn its own internal representations to discover pertinent sub-activities that aid the final group activity recognition task. The generator is fed with person-level and scene-level features that are mapped temporally through LSTM networks. Action-based feature fusion is performed through novel gated fusion units that are able to consider long-term dependencies, exploring the relationships among all individual actions, to learn an intermediate representation or 'action code' for the current group activity. The network achieves its semi-supervised behaviour by performing group action classification together with the adversarial real/fake validation. We perform extensive evaluations on different architectural variants to demonstrate the importance of the proposed architecture. Furthermore, we show that utilising both person-level and scene-level features supports group activity prediction better than using person-level features alone. Our proposed architecture outperforms current state-of-the-art results for sports- and pedestrian-based classification tasks on the Volleyball and Collective Activity datasets, showing its flexibility for effective learning of group activities.
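The abstract does not give the exact form of the gated fusion units, so the following is only a plausible reading: per-person temporal features are gated and merged into a single 'action code'. The gating form, dimensions, and the name `GatedFusionUnit` are assumptions for illustration, not taken from the MLS-GAN implementation.

```python
# A rough sketch of a gated fusion unit over per-person LSTM outputs (assumed form).
import torch
import torch.nn as nn

class GatedFusionUnit(nn.Module):
    """Fuses N per-person temporal features into one 'action code'."""
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Linear(dim, 1)   # scalar gate per person feature
        self.proj = nn.Linear(dim, dim)

    def forward(self, person_feats):
        # person_feats: (B, N, dim) final LSTM states, one per person
        gates = torch.sigmoid(self.gate(person_feats))          # (B, N, 1)
        fused = (gates * torch.tanh(self.proj(person_feats))).sum(dim=1)
        return fused                                            # (B, dim) action code
```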
Detector-Free Weakly Supervised Group Activity Recognition
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a multi-person video. Existing models for this task are often impractical in that they demand ground-truth bounding box labels of actors even at test time or rely on off-the-shelf object detectors. Motivated by this, we propose a novel model for group activity recognition that depends neither on bounding box labels nor on an object detector. Our Transformer-based model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism, and represents a video clip as a set of partial-context embeddings. The embedding vectors are then aggregated to form a single group representation that reflects the entire context of an activity while capturing the temporal evolution of each partial context. Our method achieves outstanding performance on two benchmarks, the Volleyball and NBA datasets, surpassing not only the state of the art trained with the same level of supervision, but also some existing models relying on stronger supervision.
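One way to picture the detector-free idea is a small set of learnable tokens that attend over the clip's backbone features to pick out partial contexts, with the token embeddings then pooled into one group representation. The sketch below follows that reading; the token count, dimensions, and mean-pooling aggregation are assumptions, not the paper's exact design.

```python
# A minimal sketch of learnable partial-context tokens over clip features.
import torch
import torch.nn as nn

class PartialContextEncoder(nn.Module):
    def __init__(self, dim=256, num_tokens=8, num_classes=8):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feat_map):
        # feat_map: (B, T*H*W, dim) flattened backbone features of a clip
        B = feat_map.size(0)
        q = self.tokens.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim) queries
        ctx, _ = self.attn(q, feat_map, feat_map)       # partial-context embeddings
        group = ctx.mean(dim=1)                         # aggregate to one vector
        return self.classifier(group)
```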
Deep Learning Architecture for Group Activity Recognition using Description of Local Motions
2020 International Joint Conference on Neural Networks (IJCNN), 2020
Nowadays, the recognition of group activities is a significant problem, especially in video surveillance. It is increasingly important to have vision architectures that allow timely, automatic recognition of group activities and predictions about them in order to support decision making. This paper proposes a computer vision architecture able to learn and recognise group activities from the movements of the group in the scene. It is based on the Activity Description Vector (ADV), a descriptor able to represent the trajectory information of an image sequence as a collection of the local movements that occur in specific regions of the scene. The proposal evolves this descriptor into generated images that serve as the input to a two-stream convolutional neural network capable of robustly classifying group activities. Hence this proposal, besides using trajectory analysis, which allows a simple high-level understanding of complex group activities, takes advantage of deep learning to provide a robust architecture for multi-class recognition. The architecture has been evaluated and compared to other approaches on BEHAVE and INRIA dataset sequences, obtaining strong performance in the recognition of group activities.
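The abstract describes the ADV only as local movements accumulated per region of the scene, so the sketch below is an ADV-style approximation rather than the exact descriptor: trajectory displacements are binned into a coarse grid with four direction channels, yielding an image-like array a CNN could consume. The grid size and direction bins are illustrative assumptions.

```python
# An illustrative sketch of an ADV-style local-motion descriptor (assumed form).
import numpy as np

def adv_descriptor(tracks, frame_size, grid=(8, 8)):
    """tracks: list of (T, 2) arrays of (x, y) point positions per frame."""
    H, W = frame_size
    gh, gw = grid
    desc = np.zeros((gh, gw, 4), dtype=np.float32)  # up/down/left/right counts
    for tr in tracks:
        d = np.diff(tr, axis=0)                     # per-step displacement
        for (x, y), (dx, dy) in zip(tr[:-1], d):
            r = min(int(y / H * gh), gh - 1)
            c = min(int(x / W * gw), gw - 1)
            if abs(dx) > abs(dy):
                desc[r, c, 2 if dx < 0 else 3] += 1  # left / right
            else:
                desc[r, c, 0 if dy < 0 else 1] += 1  # up / down
    return desc / max(desc.max(), 1.0)               # normalise to [0, 1]
```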
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020
In Human Behaviour Understanding, social interaction is often modeled on the basis of lower-level action recognition. The accuracy of this recognition affects the system's capability to detect higher-level social events, and thus the usefulness of the resulting system. We model team interactions in volleyball and investigate, through simulation of typical error patterns, the required quality (in accuracy and in allowable types of errors) of the underlying action recognition for automated volleyball monitoring. Our proposed approach simulates different patterns of errors, grounded in related work on volleyball action recognition, on top of a manually annotated ground truth to model their different impacts on interaction recognition. Our results show that this provides a means to quantify the effect of different types of classification errors on the overall quality of the system. Our chosen volleyball use case, in the rising field of sport...
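The simulation idea itself is straightforward to sketch: perturb the ground-truth action labels with a chosen error pattern (here expressed as a confusion matrix) and feed the corrupted sequence to the downstream interaction-recognition stage to measure degradation. The volleyball label set and confusion values below are illustrative assumptions, not the paper's.

```python
# A small sketch of simulating classification-error patterns on ground truth.
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["serve", "pass", "set", "spike", "block"]

def corrupt(labels, confusion):
    """labels: list of action indices; confusion: (K, K) row-stochastic matrix."""
    return [rng.choice(len(ACTIONS), p=confusion[y]) for y in labels]

# Example pattern: mostly correct, extra confusion between 'pass' and 'set'
K = len(ACTIONS)
C = np.full((K, K), 0.05)
np.fill_diagonal(C, 0.8)
C[1, 2] += 0.05
C[2, 1] += 0.05
C /= C.sum(axis=1, keepdims=True)      # renormalise rows to probabilities
noisy = corrupt([0, 1, 2, 3, 4], C)    # corrupted action sequence
```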
Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment
We study the problem of simultaneously recognizing complex individual and group activities from spatiotemporal data in games. Recognizing complex player activities is particularly important for understanding game dynamics and user behavior, and has a wide range of applications in game development. To do so, we propose a novel framework: a hierarchical dual-attention RNN-based method that leverages feature and temporal attention mechanisms in a hierarchical setting for effective discovery of activities from interactions among individuals. We argue that certain activities depend on certain features as well as on temporal aspects of the data, which our dual-attention model can leverage for recognition. To the best of our knowledge, this work is the first to address activity recognition using spatiotemporal data in games. In addition, we propose using game data as a rich source of complex group interactions. In this paper, we present two contributions: (...
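The two attention mechanisms named above can be sketched directly: a feature-attention layer reweights input dimensions at each step, and a temporal-attention layer reweights hidden states over time before classification. Sizes, the GRU choice, and the name `DualAttentionRNN` are illustrative assumptions; the paper's hierarchical stacking over individuals and groups is omitted for brevity.

```python
# A compact sketch of a dual-attention (feature + temporal) recurrent classifier.
import torch
import torch.nn as nn

class DualAttentionRNN(nn.Module):
    def __init__(self, in_dim=64, hidden=128, num_classes=10):
        super().__init__()
        self.feat_attn = nn.Linear(in_dim, in_dim)   # per-feature weights
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.time_attn = nn.Linear(hidden, 1)        # per-step weights
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, x):                            # x: (B, T, in_dim)
        x = x * torch.softmax(self.feat_attn(x), dim=-1)  # feature attention
        h, _ = self.rnn(x)                           # (B, T, hidden)
        w = torch.softmax(self.time_attn(h), dim=1)  # temporal attention (B, T, 1)
        return self.out((w * h).sum(dim=1))
```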
Attention-Driven Body Pose Encoding for Human Activity Recognition
2020 25th International Conference on Pattern Recognition (ICPR), 2021
This article proposes a novel attention-based body pose encoding for human activity recognition that presents an enriched, learned representation of body pose. The enriched data complement the 3D body joint position data and improve model performance. In this paper, we propose a novel approach that learns enhanced feature representations from a given sequence of 3D body joints. To achieve this encoding, the approach exploits 1) a spatial stream, which encodes the spatial relationship between various body joints at each time point to learn the spatial structure of the joint distribution, and 2) a temporal stream, which learns the temporal variation of individual body joints over the entire sequence to produce a temporally enhanced representation. These two pose streams are then fused with a multi-head attention mechanism. We also capture contextual information from the RGB video stream using an Inception-ResNet-V2 model combined with multi-head attention and a bidirectional Long Short-Term Memory (LSTM) network. Finally, the RGB video stream is combined with the fused body pose stream to give a novel end-to-end deep model for effective human activity recognition.
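A minimal sketch of the two pose streams and their attention-based fusion, under assumed shapes: the spatial stream encodes all joints at each time step, the temporal stream encodes each joint's trajectory, and multi-head attention fuses the two. The joint count, dimensions, and pooling choices are illustrative assumptions, and the RGB context branch is omitted.

```python
# A minimal sketch of spatial/temporal pose streams fused by attention.
import torch
import torch.nn as nn

class PoseStreams(nn.Module):
    def __init__(self, joints=25, dim=128):
        super().__init__()
        self.spatial = nn.Linear(joints * 3, dim)          # all joints, one step
        self.temporal = nn.GRU(3, dim, batch_first=True)   # one joint over time
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, pose):                     # pose: (B, T, J, 3)
        B, T, J, _ = pose.shape
        s = self.spatial(pose.flatten(2))        # (B, T, dim) spatial stream
        _, t = self.temporal(pose.permute(0, 2, 1, 3).flatten(0, 1))
        t = t[-1].view(B, J, -1)                 # (B, J, dim) temporal stream
        fused, _ = self.fuse(s, t, t)            # spatial queries attend to joints
        return fused.mean(dim=1)                 # pooled pose representation
```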
RIT-18: A Novel Dataset for Compositional Group Activity Understanding
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020
Group activity understanding is a challenging task, as multiple people are involved and their relations may vary over time. Currently, the group activity literature is limited to group activity recognition, because videos are trimmed to very short durations and focus on a single activity, which slows progress in the group activity domain. In this paper, we propose RIT-18, a new large-scale untrimmed compositional group activity dataset based on volleyball games captured from YouTube. Each clip in our dataset depicts an entire rally, spanning the duration from serve to a point being scored. Comprehensive annotations are provided, including group activity labels, temporal boundaries of activities, key persons, and winning teams. We describe group activity recognition, future activity anticipation, and rally-level winner prediction challenges, and evaluate several baseline methods on them. We report their performance on our dataset and demonstrate further eff...
Skeleton-based relational reasoning for group activity analysis
Pattern Recognition, 2021
Research on group activity recognition mostly leans on the standard two-stream approach (RGB and optical flow) for its input features. Few works have explored explicit pose information, and none use it directly to reason about the interactions between persons. In this paper, we leverage skeleton information to learn the interactions between individuals directly from it. With our proposed method, GIRN, multiple relationship types are inferred from independent modules that describe the relations between body joints pair by pair. In addition to joint relations, we also experiment with the previously unexplored relationship between individuals and relevant objects (e.g. the volleyball). The individuals' distinct relations are then merged through an attention mechanism that gives more importance to the individuals most relevant for distinguishing the group activity. We evaluate our method on the Volleyball dataset, obtaining results competitive with the state of the art. Our experiments demonstrate the potential of skeleton-based approaches for modeling multi-person interactions.
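A rough sketch of skeleton-based relational reasoning in this spirit: a small relation network scores every pair of joints across all persons, per-person relation vectors are pooled from those scores, and attention over individuals produces the group descriptor. The module shapes, the single relation type, and the pooling scheme are illustrative assumptions, not GIRN's exact design.

```python
# A rough sketch of pairwise joint relations with attention over individuals.
import torch
import torch.nn as nn

class PairwiseRelation(nn.Module):
    def __init__(self, joint_dim=3, dim=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * joint_dim, dim), nn.ReLU())
        self.person_attn = nn.Linear(dim, 1)

    def forward(self, joints):                  # joints: (B, N, J, 3)
        B, N, J, D = joints.shape
        flat = joints.flatten(1, 2)             # (B, N*J, 3) all joints as a set
        a = flat.unsqueeze(2).expand(B, N * J, N * J, D)
        b = flat.unsqueeze(1).expand(B, N * J, N * J, D)
        rel = self.g(torch.cat([a, b], dim=-1))             # pairwise relations
        person = rel.mean(dim=2).view(B, N, J, -1).mean(dim=2)  # (B, N, dim)
        w = torch.softmax(self.person_attn(person), dim=1)  # (B, N, 1) weights
        return (w * person).sum(dim=1)          # attended group descriptor
```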
Improved Sport Activity Recognition using Spatio-temporal Context
2014
Activity recognition in sport is an attractive field for computer vision research. Game, player, and team analysis are of great interest, and research topics within this field emerge with the goal of automated analysis. As the execution of the same activity differs between players, and activities cannot be modeled by local description alone, additional information is needed. Inspired by the concept of group context ([Choi11], [Lan12], [Zhu13]), we employ contextual information to support activity recognition. Compared to other sport activity recognition systems, e.g. that proposed by [Bialkowski13], we focus on single-player activities rather than on general team activities.