Human action recognition using extreme learning machine based on visual vocabularies

Human action recognition using extreme learning machine via multiple types of features

2010

This paper introduces a human action recognition framework based on multiple types of features. Taking advantage of the motion-selectivity property of the 3D dual-tree complex wavelet transform (3D DT-CWT) and the affine-SIFT local image detector, spatio-temporal and local static features are first extracted. No assumptions about scene background, location, objects of interest, or point of view are made. Bidirectional two-dimensional PCA (2D-PCA) is employed for dimensionality reduction, which offers enhanced capabilities to preserve structure and correlation among neighboring pixels of a video frame. The proposed technique is significantly faster than traditional methods due to volumetric processing of the input video, and offers a rich representation of human actions with fewer artifacts. Experimental examples illustrate the effectiveness of the approach.
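The bidirectional 2D-PCA step can be sketched as follows: each frame is projected from both the row and the column direction using the top eigenvectors of the two image covariance matrices, so neighborhood structure within a frame is preserved without flattening it to a vector. This is a minimal interpretation of bidirectional 2D-PCA, not the paper's exact implementation; frame sizes and the numbers of retained components are illustrative.

```python
import numpy as np

def bidirectional_2dpca(frames, p, q):
    """Reduce each r x c frame to a p x q matrix via row- and column-wise 2D-PCA."""
    A = np.asarray(frames, dtype=float)          # shape (N, r, c)
    D = A - A.mean(axis=0)
    # column-direction (c x c) and row-direction (r x r) image covariances
    Gc = np.einsum('nij,nik->jk', D, D) / len(A)
    Gr = np.einsum('nij,nkj->ik', D, D) / len(A)
    # eigh returns eigenvalues in ascending order, so reverse the columns
    _, Vc = np.linalg.eigh(Gc)
    _, Vr = np.linalg.eigh(Gr)
    X = Vc[:, ::-1][:, :q]                       # c x q column projection
    Z = Vr[:, ::-1][:, :p]                       # r x p row projection
    return np.array([Z.T @ a @ X for a in A])    # (N, p, q)

frames = np.random.rand(20, 32, 24)              # 20 toy frames of size 32 x 24
reduced = bidirectional_2dpca(frames, p=8, q=6)
```

Each 32 x 24 frame is compressed to an 8 x 6 matrix while both row-wise and column-wise pixel correlations contribute to the projection.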

An efficient human action recognition framework with pose-based spatiotemporal features

Engineering Science and Technology, an International Journal, 2020

In the past two decades, human action recognition has been among the most challenging tasks in computer vision. Recently, extracting accurate and cost-efficient skeleton information has become possible thanks to cutting-edge deep learning algorithms and low-cost depth sensors. In this paper, we propose a novel framework to recognize human actions using 3D skeleton information. The main components of the framework are pose representation and encoding. Assuming that human actions can be represented by spatiotemporal poses, we define a pose descriptor consisting of three elements. The first element contains the normalized coordinates of the raw skeleton joints. The second element contains the temporal displacement relative to a predefined temporal offset, and the third element holds the displacement relative to the previous timestamp in the temporal resolution. The final descriptor of the whole sequence is the concatenation of the frame-wise descriptors. To avoid problems of high dimensionality, Principal Component Analysis (PCA) is applied to the descriptors. The resulting descriptors are encoded with a Fisher Vector (FV) representation before being trained with an Extreme Learning Machine (ELM). The performance of the proposed framework is evaluated on three public benchmark datasets. The proposed method achieved competitive results compared with other methods in the literature.
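The three-element pose descriptor can be sketched as below. This is a minimal interpretation under stated assumptions: joint 0 is taken as the root for normalization, the temporal offset is fixed at 5 frames, and the PCA/FV/ELM stages are omitted; the paper's exact normalization may differ.

```python
import numpy as np

def pose_descriptor(joints, t, offset=5):
    """Frame-wise descriptor from a (T, J, 3) array of 3D skeleton joints.

    Concatenates (1) joints normalized relative to an assumed root joint,
    (2) displacement w.r.t. a fixed temporal offset, and (3) displacement
    w.r.t. the previous frame.
    """
    pose = joints[t] - joints[t, 0]                  # root-relative coordinates
    off = joints[t] - joints[max(t - offset, 0)]     # displacement vs. temporal offset
    prev = joints[t] - joints[max(t - 1, 0)]         # displacement vs. previous frame
    return np.concatenate([pose.ravel(), off.ravel(), prev.ravel()])

joints = np.random.rand(30, 20, 3)                   # 30 frames, 20 joints
# whole-sequence descriptor: concatenation of the frame-wise descriptors
seq = np.concatenate([pose_descriptor(joints, t) for t in range(30)])
```

With 20 joints, each frame yields 3 x 20 x 3 = 180 values, and the sequence descriptor grows linearly with its length, which is why PCA is applied afterwards.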

Laplacian one class extreme learning machines for human action recognition

2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP), 2016

A novel OCC method for human action recognition, namely Laplacian One-Class Extreme Learning Machines, is presented. The proposed method exploits local geometric data information within the OC-ELM optimization process. It is shown that emphasizing the preservation of the local geometry of the data leads to a regularized solution that models the target class more efficiently than the standard OC-ELM algorithm. The proposed method is extended to operate in feature spaces determined by the network's hidden-layer outputs, as well as in ELM spaces of arbitrary dimensions. Its superior performance over other OCC options is consistent across five publicly available human action recognition datasets.
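One plausible formulation of such a graph-regularized one-class ELM can be sketched as follows. This is not the authors' exact derivation: the objective, the closed-form solution, and the use of an identity matrix in place of a real graph Laplacian are all assumptions made to keep the sketch short; in practice `L` would be the Laplacian of a k-NN graph over the training data.

```python
import numpy as np

def laplacian_oc_elm(X, L, n_hidden=50, C=10.0, lam=0.1, seed=0):
    """One-class ELM with a graph-Laplacian regularizer (illustrative sketch).

    Minimizes 0.5||b||^2 + 0.5*C*||Hb - 1||^2 + 0.5*lam*(Hb)' L (Hb),
    which has the closed form b = C (I + C H'H + lam H'LH)^{-1} H'1.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights
    bias = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + bias)                         # hidden-layer outputs
    I = np.eye(n_hidden)
    b = C * np.linalg.solve(I + C * H.T @ H + lam * H.T @ L @ H,
                            H.T @ np.ones(len(X)))
    return W, bias, b

X = np.random.rand(40, 10)                            # 40 target-class samples
W, bias, b = laplacian_oc_elm(X, L=np.eye(40))        # identity Laplacian for brevity
scores = np.tanh(X @ W + bias) @ b                    # near 1 for the target class
```

Setting `lam=0` recovers a standard regularized OC-ELM, which makes the role of the Laplacian term easy to isolate in experiments.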

Pose-based 3D Human Motion Analysis Using Extreme Learning Machine

In pose-based 3D human motion analysis, the main problem is how to classify multi-class label activities based on primitive action (pose) inputs efficiently, in terms of both accuracy and processing time. This is because a pose is not unique: the same pose can appear in different activity classes. In this paper, we evaluate the effectiveness of the Extreme Learning Machine (ELM) in 3D human motion analysis based on pose clusters. ELM has a reputation as an eager classifier with fast training and testing times, but its classification results still suffer from low testing accuracy, even when the number of hidden nodes is increased and more training data is added. To achieve better accuracy, we pursue a feature selection method that reduces the dimension of the pose-cluster training data in time sequence. We propose to use the frequency of pose occurrence. This method is similar to bag of words: a sparse vector of occurrence counts of poses in a histogram serves as the feature for the training data (bag of poses). By using the bag of poses as the optimal feature selection, ELM performance can be improved without adding network complexity (number of hidden nodes and amount of training data).
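The bag-of-poses feature described above can be sketched in a few lines. Assuming each frame has already been assigned to a pose cluster, the sequence becomes a fixed-length histogram of cluster occurrences (normalized here so that sequence length cancels out; whether the paper normalizes is an assumption).

```python
import numpy as np

def bag_of_poses(pose_labels, n_clusters):
    """Histogram of pose-cluster occurrences over a sequence (bag of poses)."""
    hist = np.bincount(pose_labels, minlength=n_clusters).astype(float)
    return hist / hist.sum()          # normalize away the sequence length

labels = np.array([0, 2, 2, 1, 0, 2])  # per-frame pose-cluster ids
h = bag_of_poses(labels, n_clusters=4)
```

A 6-frame sequence over 4 clusters collapses to a 4-dimensional vector, which is what keeps the ELM's input dimension independent of sequence length.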

Robust human action recognition using improved BOW and hybrid features

2012 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2012

Recognizing human actions in video has many applications in computer vision and robotics. It is a challenging task not only because of the variations produced by general factors like illumination, background clutter, occlusion, or intra-class variation, but also because of subtle behavioral patterns among interacting people or between people and objects, and it has attracted much attention in recent years. However, this research is not yet fully realized due to the lack of an effective feature to represent human actions. In this paper, we present a novel human action representation based on hybrid features combining local and global features. Firstly, a local feature descriptor is built by combining motion and SURF features. Secondly, an improved BoW with k-means++ and a soft-weighting scheme is used to yield the histogram of word occurrences (HoWO) representing the action in a video. Thirdly, HOG/HOF features are extracted from the video as global features. Next, the hybrid feature is created by concatenating HoWO and HOG/HOF. Lastly, a Support Vector Machine is used for classification on the KTH, Weizmann, and YouTube datasets. The experimental results indicate that the feature extraction is effective and show the feasibility of our proposal. In addition, compared with other approaches, our approach is more robust, more flexible, easier to implement, and simpler to comprehend.
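The soft-weighting step of the improved BoW can be sketched as below. The paper does not spell out its weighting function, so this sketch assumes the common scheme of letting each descriptor vote for its k nearest visual words with geometrically decaying weights (1, 1/2, 1/4, ...); the codebook would come from k-means++ clustering of training descriptors.

```python
import numpy as np

def soft_weighted_bow(descriptors, codebook, k=4):
    """Soft-weighted histogram of word occurrences (HoWO): each descriptor
    votes for its k nearest visual words with weight 1/2^i instead of a
    single hard assignment."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        for i, w in enumerate(np.argsort(dists)[:k]):
            hist[w] += 0.5 ** i
    return hist / np.linalg.norm(hist)    # L2-normalize before the SVM

rng = np.random.default_rng(0)
codebook = rng.random((16, 8))            # e.g. 16 words from k-means++
descs = rng.random((50, 8))               # local motion/SURF descriptors
howo = soft_weighted_bow(descs, codebook)
```

Soft assignment smooths quantization errors near cluster boundaries, which is the usual motivation for preferring it over hard BoW counts.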

Gradient local auto-correlation features for depth human action recognition

2021

Human action classification is a dynamic research topic in computer vision, with applications in video surveillance, human–computer interaction, and sign-language recognition. This paper presents an approach for categorizing human actions in depth video. In the approach, enhanced motion and static history images are computed, and a set of 2D auto-correlation gradient feature vectors is obtained from them to describe an action. A kernel-based Extreme Learning Machine is used with the extracted features to distinguish the diverse action types promisingly. The proposed approach is thoroughly assessed on the MSRAction3D, DHA, and UTD-MHAD action datasets. The approach achieves an accuracy of 97.44% on MSRAction3D, 99.13% on DHA, and 88.37% on UTD-MHAD. The experimental results and analysis demonstrate that the classification performance of the proposed method is considerable and surpasses state-of-the-art human action classification methods. Beside...
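A simplified version of the gradient local auto-correlation idea can be sketched as follows, assuming the standard GLAC construction: per-pixel gradient orientations are quantized into bins and weighted by magnitude, then 0th-order (orientation histogram) and 1st-order (auto-correlation between a pixel and a shifted neighbor) statistics are accumulated. The full method uses several displacement vectors; this sketch uses one, so the numbers are illustrative only.

```python
import numpy as np

def glac(img, n_bins=8, shift=(1, 1)):
    """Simplified gradient local auto-correlation (GLAC) feature."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    # per-pixel orientation coding weighted by gradient magnitude
    h = np.zeros(img.shape + (n_bins,))
    ii, jj = np.indices(img.shape)
    h[ii, jj, bins] = mag
    f0 = h.sum(axis=(0, 1))                            # 0th order: n_bins dims
    dy, dx = shift                                      # 1st order: correlate each
    f1 = np.einsum('ijd,ije->de',                       # pixel with its shifted
                   h[:-dy, :-dx], h[dy:, dx:]).ravel()  # neighbor -> n_bins^2 dims
    return np.concatenate([f0, f1])

feat = glac(np.random.rand(32, 32))                     # 8 + 64 = 72 dimensions
```

In the paper's pipeline such vectors would be computed on the motion and static history images and passed to the kernel ELM.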

Combination for Fast Human Action Recognition

2010

In this paper, we study the human action recognition problem based on motion features directly extracted from video. In order to implement a fast human action recognition system, we select simple features that can be obtained with non-intensive computation. We propose to use the motion history image (MHI) as our fundamental representation of the motion. This is then further processed to give a histogram of the MHI and the Haar wavelet transform of the MHI. The processed MHI thus allows a combined feature vector to be computed cheaply, with a lower dimension than the original MHI. Finally, this feature vector is used in an SVM-based human action recognition system. Experimental results demonstrate the method to be efficient, allowing it to be used in real-time human action classification systems.
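The MHI-based feature can be sketched as below. The frame-difference threshold, the decay duration, the histogram bin count, and keeping only the low-low band of a single-level Haar transform are all assumptions for illustration; the resulting vector is much smaller than the raw MHI, as the abstract describes.

```python
import numpy as np

def motion_history_image(frames, tau=20, thresh=0.1):
    """Motion history image: recent motion is bright, older motion decays."""
    mhi = np.zeros(frames[0].shape)
    for prev, cur in zip(frames[:-1], frames[1:]):
        moving = np.abs(cur - prev) > thresh          # simple frame differencing
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi

def haar_ll(img):
    """One level of the 2D Haar transform, keeping only the low-low band
    (a quarter-size approximation of the MHI)."""
    rows = (img[0::2] + img[1::2]) / 2
    return (rows[:, 0::2] + rows[:, 1::2]) / 2

frames = np.random.rand(30, 64, 64)                   # toy grayscale clip
mhi = motion_history_image(frames)
feature = np.concatenate([np.histogram(mhi, bins=16, range=(0, 20))[0],
                          haar_ll(mhi).ravel()])      # 16 + 32*32 = 1040 dims
```

A 64 x 64 MHI (4096 values) shrinks to a 1040-dimensional vector here, which is the kind of reduction that makes the SVM stage cheap.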

An enhanced method for human action recognition

This paper presents a fast and simple method for human action recognition. The proposed technique relies on detecting interest points using SIFT (scale-invariant feature transform) in each frame of the video. A fine-tuning step is used to limit the number of interest points according to the amount of detail. Then the popular bag-of-video-words approach is applied with a new normalization technique, which remarkably improves the results. Finally, a multi-class linear Support Vector Machine (SVM) is used for classification. Experiments were conducted on the KTH and Weizmann datasets. The results demonstrate that our approach outperforms most existing methods, achieving an accuracy of 97.89% on KTH and 96.66% on Weizmann.

Motion feature combination for human action recognition in video

… Vision and Computer Graphics. Theory and …, 2009

We study the human action recognition problem based on motion features directly extracted from video. In order to implement a fast human action recognition system, we select simple features that can be obtained with non-intensive computation. We propose to use the motion history image (MHI) as our fundamental representation of the motion. This is then further processed to give a histogram of the MHI and the Haar wavelet transform of the MHI. The combination of these two features is computed cheaply and has a lower dimension than the original MHI. The combined feature vector is tested in a Support Vector Machine (SVM)-based human action recognition system, and a significant performance improvement has been achieved. The system is efficient enough to be used in real-time human action classification systems.

Human Action Recognition using Ensemble of Shape, Texture and Motion features

2018

Even though many approaches have been proposed for human action recognition, challenges like illumination variation, occlusion, camera view, and background clutter keep this topic open for further research. Devising a robust descriptor for representing an action that gives good classification accuracy is a demanding task. In this work, a new feature descriptor is introduced, named the 'Spatio-Temporal Shape-Texture-Motion' (STSTM) descriptor. The STSTM descriptor uses a hybrid approach combining local and global features. Salient points are extracted using the Spatio-Temporal Interest Points (STIP) algorithm and are further encoded using the Discrete Wavelet Transform (DWT). The extracted DWT coefficients represent the local motion information of the object. Shape and texture features are extracted using the Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) algorithms, respectively. To achieve dimensionality reduction, Principal Component Analysis is applied separately to t...