Teja Kumar - Academia.edu
Papers by Teja Kumar
Algorithms for intelligent systems, Sep 16, 2022
IOP Conference Series: Materials Science and Engineering, 2022
The application spectrum of natural fiber reinforced polymer composites has been increasing tremendously because of enhanced tribological behavior. The present study investigates the wear and coefficient of friction of castor oil fiber reinforced composites for development as a new tribo-material. Composites with 40 vol.% short fibers of 5 mm length were fabricated by the hand lay-up method. The tribology tests were performed on a pin-on-disc tribometer at normal loads of 15 N, 30 N and 45 N and sliding distances of 1000 m, 2000 m and 3000 m. A fuzzy logic model was developed to predict and analyze the wear and friction characteristics at unknown test cases. Each input variable was divided into three linguistic variables and each output variable into five linguistic variables. A triangular membership function was used to define all the variables. The capabilities of the fuzzy logic model were tested by confirmatory experiments. The model predicted the wear results with an error of 3...
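As a minimal sketch of the kind of fuzzy inference described above, the snippet below evaluates triangular membership functions and the firing strength of one Mamdani-style rule. The universe ranges, breakpoints and the rule itself are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-12),
                                 (c - x) / (c - b + 1e-12)), 0.0)

# Hypothetical crisp inputs: load in N, sliding distance in m (not the paper's exact ranges).
load = 30.0
distance = 2000.0

# Three linguistic terms per input (low, medium, high), as in the abstract.
load_low, load_med, load_high = trimf(load, 0, 15, 30), trimf(load, 15, 30, 45), trimf(load, 30, 45, 60)
dist_low, dist_med, dist_high = trimf(distance, 0, 1000, 2000), trimf(distance, 1000, 2000, 3000), trimf(distance, 2000, 3000, 4000)

# One illustrative rule: IF load is medium AND distance is medium THEN wear is moderate.
rule_strength = min(load_med, dist_med)
print(f"firing strength of the example rule: {rule_strength:.2f}")
```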
2019 IEEE International Conference on Intelligent Systems and Green Technology (ICISGT), 2019
Extraction and recognition of human gestures in 3D sign language is a challenging task. 3D sign language gestures are a set of hand and finger movements with respect to the face, head and body. 3D motion capture technology involves capturing 3D sign gesture videos that are often affected by background, self-occlusions and lighting. This paper investigates the relation between joints on the 3D skeleton. Kernel-based methods can be remarkably effective for recognizing 2D and 3D signs. This work explores the potential of global alignment kernels, using 3D motion capture data for 3D sign language recognition. Accordingly, this paper encodes five 3D relational geometric features (distance, angle, surface, area and perimeter) into a global alignment kernel based on similarities between query and database sequences. The proposed framework has been tested on an 800-gesture, five-subject (i.e., 5 × 800 = 4000 samples) sign language dataset and three other publicly available action datasets, namely, CMU, HDM05 and NTU RGB-D. The proposed method outperforms other state-of-the-art methods on the above datasets.
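A sketch of how some of the relational geometric features named above (distance, angle, area, perimeter) could be computed per frame for one set of four joints. The joint indices, the choice of reference vectors and the two-triangle area formula are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def quad_features(p):
    """p: (4, 3) array of 3D joint positions for one frame.
    Returns a distance, an angle, and the area and perimeter of the joint quadrilateral."""
    a, b, c, d = p
    dist = np.linalg.norm(a - b)                                   # a pairwise joint distance
    v1, v2 = b - a, c - a
    angle = np.arccos(np.clip(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12), -1.0, 1.0))
    # Area of the quadrilateral a-b-c-d as two triangles (a, b, c) and (a, c, d).
    area = 0.5 * (np.linalg.norm(np.cross(b - a, c - a)) + np.linalg.norm(np.cross(c - a, d - a)))
    perimeter = sum(np.linalg.norm(p[i] - p[(i + 1) % 4]) for i in range(4))
    return dist, angle, area, perimeter

# Illustrative usage on random mocap-like data: 40 frames, 20 joints, 3 coordinates.
seq = np.random.rand(40, 20, 3)
joint_ids = [3, 7, 11, 15]                                         # hypothetical 4-joint subset
feats = np.array([quad_features(frame[joint_ids]) for frame in seq])   # (40, 4) feature sequence
```

Feature sequences like `feats` would then be compared between query and database gestures through a global alignment kernel; the kernel itself is not reproduced here.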
2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020
To deal with the limitations of human action recognition systems that apply deep neural networks (DNNs) to 3D skeletal feature maps, we propose an improved set of features that enable better pattern discrimination when using a spectrally enriched circular convolutional neural network (CCNN). These new features exploit the local relationships between joint movements based on 3D quadrilaterals constructed for all possible sets of four joints. Next, we compute the volumes of these time-varying quadrilaterals, generating color-coded images named spatio-temporal quad-joint relative volume feature maps (QjRVMs). To preserve the pixel frequency distribution while training a DNN, which is otherwise lost due to vanishing gradients and random dropouts, we propose a new architecture, the CCNN. CCNNs use cyclic multi-resolution filters in a four-stream architecture, requiring only batch normalization and ReLU operations to identify multiple pixel pattern variations simultaneously. Applying the proposed CCNN to QjRVM images shows that combining multi-resolution features enhances the overall classification accuracy. Finally, we evaluate our proposed human action framework using our own 102-class, 5-subject action dataset, created using 3D motion capture technology and named KLHA3D-102. We also evaluate our framework using 3 publicly available datasets: CMU, HDM05, and NTU RGB-D.
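A minimal sketch of how per-frame volumes over all 4-joint sets could be turned into a color-codable feature map. It assumes the volume is that of the tetrahedron spanned by the four joints and that the map is min-max normalized to an 8-bit range; both are assumptions, not the paper's exact recipe.

```python
import numpy as np
from itertools import combinations

def tetra_volume(p0, p1, p2, p3):
    """Volume of the tetrahedron spanned by four 3D joints."""
    return abs(np.linalg.det(np.stack([p1 - p0, p2 - p0, p3 - p0]))) / 6.0

def quad_joint_volume_map(seq):
    """seq: (T, J, 3) joint positions. Returns a (num_quads, T) map of volumes,
    min-max normalized to [0, 255] so each quad becomes one row of a color-coded image."""
    T, J, _ = seq.shape
    quads = list(combinations(range(J), 4))
    vol = np.array([[tetra_volume(*seq[t, list(q)]) for t in range(T)] for q in quads])
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-12)
    return (vol * 255).astype(np.uint8)

# Illustrative usage: 30 frames of a 10-joint skeleton -> a 210 x 30 feature map.
feature_map = quad_joint_volume_map(np.random.rand(30, 10, 3))
print(feature_map.shape)
```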
IEEE Access, 2021
3D skeletal action recognition is commonly practiced with features extracted from joint positional sequences modeled on deep learning frameworks. However, the spatial ordering of skeletal joints is fixed across datasets and frameworks throughout the action recognition lifecycle. This inspired us to investigate, through experimentation, the influence of multiple randomly ordered skeletal joint features on the performance of deep learning systems. The question is: is joint-order-independent learning practicable for skeletal action recognition? If practicable, the goal is to discover how many different types of randomly ordered joint feature representations are sufficient for training deep networks. We further investigated which features and deep networks recorded the highest performance on jumbled joints. This work proposes the novel idea of learning skeletal joint volumetric features on a spectrally graded CNN to achieve joint order independence. We propose 4-joint features, called quad joint volumetric features (QJVF), which are found to offer better spatio-temporal relationships between time series joint data than existing features. Consequently, we propose a spectrally graded convolutional neural network (SgCNN) to characterize spatially divergent features extracted from jumbled skeletal joints. Finally, the proposed hypothesis has been evaluated on our 3D skeletal action datasets KLHA3D102 and KLYOGA3D, along with the benchmarks HDM05, CMU and NTU RGB-D. The results demonstrate that joint order independent feature learning is achievable on CNNs trained on quantified spatio-temporal feature maps extracted from randomly shuffled skeletal joints in action sequences. Index terms: Human action recognition, 3D motion capture, spectrally graded CNNs, skeletal joint ordering.
Multimedia Tools and Applications, 2020
Appearance- and depth-based action recognition has been researched extensively for improving recognition accuracy by considering motion and shape recovery particulars from RGB-D video data. Convolutional neural networks (CNNs) have shown evidence of superiority on action classification problems with spatial and apparent-motion inputs. The current generation of CNNs uses spatial RGB videos and depth maps to recognize action classes from RGB-D video. In this work, we propose a 4-stream CNN architecture that has two spatial RGB-D video data streams and two apparent-motion streams, with inputs extracted from the optical flow of RGB-D videos. Each CNN stream consists of 8 convolutional layers, 2 dense layers and one SoftMax layer, and a score fusion model merges the scores from the four streams. The performance of the proposed 4-stream action recognition framework is tested on our own action dataset and three benchmark datasets for action recognition. The usefulness of the proposed model is evaluated against state-of-the-art CNN architectures for action recognition.
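A sketch of the per-stream layout and late score fusion described above, written with PyTorch. The channel counts, input size, number of classes and the choice of averaging SoftMax scores are assumptions; the paper's exact layer sizes and fusion model may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stream(nn.Module):
    """One CNN stream: 8 conv layers, 2 dense layers, SoftMax scores,
    loosely following the per-stream layout described in the abstract."""
    def __init__(self, in_channels=3, num_classes=60):
        super().__init__()
        layers, c = [], in_channels
        for i, out_c in enumerate([32, 32, 64, 64, 128, 128, 256, 256]):   # 8 conv layers
            layers += [nn.Conv2d(c, out_c, 3, padding=1), nn.ReLU()]
            if i % 2 == 1:
                layers.append(nn.MaxPool2d(2))                              # downsample after each conv pair
            c = out_c
        self.features = nn.Sequential(*layers)
        self.fc1 = nn.Linear(256, 512)                                      # 2 dense layers
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x).mean(dim=(2, 3))                               # global average pool -> (N, 256)
        return F.softmax(self.fc2(F.relu(self.fc1(x))), dim=1)

# Late fusion over the four streams (e.g. RGB, depth, RGB flow, depth flow) by averaging SoftMax scores.
# All inputs are assumed 3-channel here purely for illustration.
streams = [Stream() for _ in range(4)]
inputs = [torch.randn(2, 3, 112, 112) for _ in range(4)]                    # hypothetical batch of 2 clips
scores = torch.stack([s(x) for s, x in zip(streams, inputs)]).mean(dim=0)    # fused class scores, (2, 60)
```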
Journal of King Saud University - Computer and Information Sciences, 2018
Machine translation of sign language is a critical task of computer vision. In this work, we propose to use 3D motion capture technology for sign capture and graph matching for sign recognition. Two problems related to 3D sign matching are addressed in this work: (1) how to identify same signs with different number
Neurocomputing, 2019
Currently, one of the most challenging and interesting human action recognition (HAR) problems is the 3D sign language recognition problem. The sign in a 3D video can be characterized in the form of 3D joint location information in 3D space over time. Therefore, the objective of this study is to construct color-coded topographical descriptors from joint distances and angles computed from joint locations. We call these two color-coded images the joint distance topographical descriptor (JDTD) and the joint angle topographical descriptor (JATD), respectively. For classification we propose a two-stream convolutional neural network (2CNN) architecture, which takes the color-coded images JDTD and JATD as input. The two independent streams are merged by concatenating the features from both streams in the dense layer. For a given query 3D sign (or action), a list of class scores is obtained and mapped to a text label corresponding to the sign. The results show an improvement in classifier performance over the predecessors, owing to the mixing of distance and angular features for predicting closely related spatio-temporal discriminative features. To benchmark the performance of our proposed model, we compared our results with state-of-the-art baseline action recognition frameworks using our own 3D sign language dataset and two publicly available 3D mocap action datasets, namely, HDM05 and CMU.
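A sketch of how a joint-distance-based descriptor of this kind could be built from a mocap clip. The min-max normalization to an 8-bit range (to be rendered through some colormap) is an assumption; the angular counterpart (JATD) would follow the same layout with per-triple angles instead of per-pair distances.

```python
import numpy as np

def joint_distance_descriptor(seq):
    """seq: (T, J, 3) joint positions over time.
    Builds a (num_pairs, T) matrix of pairwise joint distances and
    min-max normalizes it to [0, 255] so it can be rendered as a color-coded image."""
    T, J, _ = seq.shape
    rows = []
    for i in range(J):
        for j in range(i + 1, J):
            rows.append(np.linalg.norm(seq[:, i] - seq[:, j], axis=1))   # distance of pair (i, j) per frame
    d = np.array(rows)                                                    # (J*(J-1)/2, T)
    d = (d - d.min()) / (d.max() - d.min() + 1e-12)
    return (d * 255).astype(np.uint8)

# Illustrative usage: 60-frame clip with 25 joints -> a 300 x 60 descriptor image.
jdtd = joint_distance_descriptor(np.random.rand(60, 25, 3))
print(jdtd.shape)
```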
Advances in Multimedia, 2018
Extracting and recognizing complex human movements from unconstrained online/offline video sequences is a challenging task in computer vision. This paper proposes the classification of Indian classical dance actions using a powerful artificial intelligence tool: convolutional neural networks (CNNs). In this work, human action recognition on Indian classical dance videos is performed on recordings from both offline (controlled recording) and online (live performances, YouTube) data. The offline data is created with ten different subjects performing 200 familiar dance mudras/poses from different Indian classical dance forms under various background environments. The online dance data is collected from YouTube for ten different subjects. Each dance pose occupies 60 frames of a video in both cases. CNN training is performed with 8 different sample sizes, each consisting of multiple sets of subjects. The remaining 2 samples are used for testing the trained CNN. Differe...
International Journal of Engineering & Technology, 2017
Human action recognition is a vibrant area of research with multiple application areas in the human-machine interface. In this work, we propose human action recognition based on spatial graph kernels on 3D skeletal data. Spatial joint features are extracted using joint distances between human joint distributions in 3D space. A spatial graph is constructed using 3D points as vertices and the computed joint distances as edges for each action frame in the video sequence. Spatial graph kernels between the training set and testing set are constructed to extract the similarity between the two action sets. Two spatial graph kernels are constructed, with vertex and edge data represented by joint positions and joint distances. To test the proposed method, we use 4 publicly available 3D skeletal datasets: G3D, MSR Action 3D, UT Kinect and NTU RGB+D. The proposed spatial graph kernels result in better classification accuracies compared to state-of-the-art models.
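A simplified sketch of the graph construction and a kernel comparison between a training and a testing clip. The RBF kernel on edge-weight matrices, averaged over frames, is a stand-in assumption; the paper's vertex and edge kernels are defined differently.

```python
import numpy as np

def frame_graph_edges(frame):
    """frame: (J, 3) joint positions. Edge weights are pairwise joint distances."""
    diff = frame[:, None, :] - frame[None, :, :]
    return np.linalg.norm(diff, axis=2)                       # (J, J) edge-weight (distance) matrix

def spatial_graph_kernel(seq_a, seq_b, gamma=0.1):
    """RBF kernel between two equal-length action sequences, averaged over frames.
    A simplified stand-in for the paper's vertex/edge graph kernels."""
    vals = [np.exp(-gamma * np.linalg.norm(frame_graph_edges(a) - frame_graph_edges(b)) ** 2)
            for a, b in zip(seq_a, seq_b)]
    return float(np.mean(vals))

# Illustrative usage: similarity between a training and a testing clip (30 frames, 20 joints each).
train_clip, test_clip = np.random.rand(30, 20, 3), np.random.rand(30, 20, 3)
print(spatial_graph_kernel(train_clip, test_clip))
```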
Journal of Computer Languages, 2019
Extracting hand movements using a single RGB video camera is a necessary step in developing an automated sign language recognition system. Local spatio-temporal methods have shown encouraging outcomes for hand extraction using color cues. However, color intensities do not behave as an independent entity during video capture in real environments. This has become a roadblock in the development of sign language machine translators for processing video data in real-world environments. Not surprisingly, the result is more accurate when additional information is provided in the form of depth for sign language recognition in real environments. In this paper, we make use of a multi-modal feature sharing mechanism with a four-stream convolutional neural network (CNN) for RGB-D based sign language recognition. Unlike multi-stream CNNs, where output class prediction is based on two or three independently operated modal streams due to scale variations, we propose a feature-sharing multi-stream CNN on multi-modal data for sign language recognition. The proposed 4-stream CNN divides into two input data groupings under the training and testing spaces. The training space uses four inputs: RGB spatial in the main stream, and depth spatial, RGB temporal and depth temporal in the Region of Interest mapping (ROIM) stream. The testing space uses only RGB and RGB temporal data for prediction from the trained model. The ROIM stream shares the multi-modal data to generate ROI maps of the human subject, which are used to regulate the feature maps in the RGB stream. The scale variations in the three streams are managed by translating the depth map to fit the RGB data. Sharing multi-modal features with RGB spatial features during training has circumvented overfitting on RGB video data. To validate the proposed CNN architecture, the accuracy of the classifier is investigated with RGB-D sign language data and three benchmark action datasets. The results show remarkable behaviour of the classifier in handling a missing depth modality during testing. The robustness of the system against state-of-the-art action recognition methods is studied using contrasting datasets.
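One plausible reading of "ROI maps regulating the RGB feature maps" is a resized gating mask multiplied onto the RGB stream's activations; the sketch below shows that reading only. The multiplicative gating form and the shapes are assumptions, not the paper's confirmed mechanism.

```python
import torch
import torch.nn.functional as F

def regulate_with_roi(rgb_features, roi_map):
    """Gate RGB feature maps with a subject ROI map produced by an ROIM stream.
    rgb_features: (N, C, H, W); roi_map: (N, 1, H0, W0) with values in [0, 1].
    The ROI map is resized to the feature resolution before gating (an assumption)."""
    roi = F.interpolate(roi_map, size=rgb_features.shape[2:], mode="bilinear", align_corners=False)
    return rgb_features * roi                                   # suppress activations outside the subject

# Illustrative usage with hypothetical shapes.
feats = torch.randn(2, 64, 28, 28)
roi = torch.rand(2, 1, 112, 112)
print(regulate_with_roi(feats, roi).shape)
```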
International Journal of Intelligent Systems and Applications, 2018
Extraction of complex head and hand movements along with their constantly changing shapes for the recognition of sign language is considered a difficult problem in computer vision. This paper proposes the recognition of Indian sign language gestures using a powerful artificial intelligence tool, convolutional neural networks (CNNs). Selfie-mode continuous sign language video is the capture method used in this work, so that a hearing-impaired person can operate the sign language recognition (SLR) mobile application independently. Due to the non-availability of datasets on mobile selfie sign language, we created a dataset with five different subjects performing 200 signs from 5 different viewing angles under various background environments. Each sign occupies 60 frames of a video. CNN training is performed with 3 different sample sizes, each consisting of multiple sets of subjects and viewing angles. The remaining 2 samples are used for testing the trained CNN. Different CNN architectures were designed and tested with our selfie sign language data to obtain better recognition accuracy. We achieved a 92.88% recognition rate, compared with other classifier models reported on the same dataset.
IEEE Transactions on Multimedia, 2019
Representing 3D motion-capture sensor data with 2D color-coded joint distance maps (JDMs) as input to a deep neural network has been shown to be effective for 3D skeletal-based human action recognition tasks. However, joint distances are limited in their ability to represent rotational joint movements, which account for a considerable amount of information in human action classification tasks. Moreover, for subject, view and time invariance in the recognition process, the deep classifier needs training on JDMs along different coordinate axes from multiple streams. To overcome the above shortcomings of JDMs, we propose integrating joint angular movements along with joint distances in a spatiotemporal color-coded image called a joint angular displacement map (JADM). In the literature, multi-stream deep convolutional neural networks (CNNs) have been employed to achieve invariance across subjects and views for 3D human action data, which is achieved by sacrificing training time for accuracy. To improve recognition accuracy with reduced training time, we propose to test our JADMs with a single-stream deep CNN model. To test and analyze the proposed method, we chose video sequences of yoga. The 3D motion-capture data represent a complex set of actions with lateral and rotational spatiotemporal variations. We validated the proposed method using 3D traditional human action data from the publicly available datasets HDM05 and CMU. The proposed model can accurately recognize 3D yoga actions, which may help in building a 3D model-based yoga assistant tool.
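A minimal sketch of an angular-displacement map of this flavor: for each joint pair, the angle of the connecting vector against a reference axis is tracked, and frame-to-frame displacements are stacked into a color-codable image. The reference axis, the per-pair angle definition and the normalization are assumptions.

```python
import numpy as np

def pair_angle(frame, i, j):
    """Angle of the vector from joint i to joint j against a fixed reference axis (an assumption)."""
    v = frame[j] - frame[i]
    ref = np.array([1.0, 0.0, 0.0])
    return np.arccos(np.clip(v @ ref / (np.linalg.norm(v) + 1e-12), -1.0, 1.0))

def joint_angular_displacement_map(seq):
    """seq: (T, J, 3). Rows are joint pairs, columns are frame-to-frame angular displacements,
    min-max normalized to [0, 255] for color coding."""
    T, J, _ = seq.shape
    rows = []
    for i in range(J):
        for j in range(i + 1, J):
            ang = np.array([pair_angle(seq[t], i, j) for t in range(T)])
            rows.append(np.abs(np.diff(ang)))                  # angular displacement between consecutive frames
    m = np.array(rows)
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)
    return (m * 255).astype(np.uint8)

# Illustrative usage: 60 frames, 20 joints -> a 190 x 59 map fed to a single-stream CNN.
jadm = joint_angular_displacement_map(np.random.rand(60, 20, 3))
print(jadm.shape)
```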
IEEE Signal Processing Letters, 2018
Locations, angles, edges, and surfaces are spatial joint features that have predominantly been used for characterizing three-dimensional (3-D) skeletal data in human action recognition. Despite their demonstrated success, these features have difficulty representing relational change among joint movements in 3-D space for classifying human actions. To characterize the relation between joints on the 3-D skeleton, we propose spatial 3-D relational geometric features (S3DRGFs). S3DRGFs are calculated on subsets of four joints taken in chronological order, covering all joints on the skeleton. Each set of four joints forms a polygon that reshapes spatially and temporally with respect to the sign (action) in the 3-D video. Consequently, we construct the spatio-temporal features (S3DRGFs) by computing the area and perimeter of these polygons. The query 3-D sign (action) recognition process then transforms the joint area and perimeter features (JAF and JPF) into global alignment kernels based on the computed similarity scores with the dataset features. The similarity scores from the JAF and JPF kernels are averaged for recognition. The proposed framework has been tested on our own 3-D sign language dataset (BVC3DSL) and three other publicly available skeletal datasets: HDM05, CMU, and NTU RGB-D. The results show higher levels of accuracy in decoding 3-D sign language into text for building a 3-D model-based sign language translator.
IEEE Signal Processing Letters, 2018
The objective of this letter is to design a unique spatiotemporal feature map characterization for three-dimensional (3-D) sign (or action) data. Current maps characterize geometric features, such as joint distances, angles or both, which cannot accurately model the relative joint variations in 3-D sign (or action) location data. Therefore, we propose a new color-coded feature map called the joint angular velocity map to accurately model 3-D joint motions. Instead of using traditional convolutional neural networks (CNNs), we develop a new ResNet architecture called the connived feature ResNet, which has a CNN layer in the feedforward loop of the densely connected standard ResNet architecture. We show that this architecture avoids using dropout in the last layers and achieves the desired goal in fewer iterations than other ResNet- and CNN-based architectures used for sign (action) classification. To test our proposed model, we use our own motion-captured 3-D sign language data (BVC3DSL) and other publicly available skeletal action data: CMU, HDM05, and NTU RGB-D.
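One plausible reading of "a CNN layer in the feedforward loop" of a ResNet block is an extra convolution placed on the skip path; the PyTorch sketch below shows that reading under assumed channel counts and kernel sizes. It is not the letter's confirmed block definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnivedBlock(nn.Module):
    """Residual block with an extra convolution on the feedforward (skip) path,
    one possible interpretation of the connived feature ResNet block."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.skip_conv = nn.Conv2d(channels, channels, 1)       # extra conv on the identity/feedforward path

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.skip_conv(x))                  # fuse convolved skip with the residual branch

# Illustrative usage on a batch of 64-channel feature maps (e.g. from a stem conv over velocity maps).
block = ConnivedBlock(64)
print(block(torch.randn(2, 64, 56, 56)).shape)
```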
IEEE Signal Processing Letters, 2018
Convolutional neural networks (CNNs) can be remarkably effective for recognizing 2D and 3D actions. To further explore the potential of CNNs, we applied them to the recognition of 3D motion-captured sign language. The 3D spatiotemporal information of each sign was interpreted using joint angular displacement maps (JADMs), which encode the sign as a color texture image; JADMs were calculated for all joint pairs. Multiple CNN layers then capitalized on the differences between these images and identified discriminative spatio-temporal features. We then compared the performance of our proposed model against state-of-the-art baseline models using our own 3D sign language dataset and two other benchmark action datasets, namely, HDM05 and CMU.
IEEE Access, 2018
Traditional level sets suffer from two major limitations: 1) they are unable to detect touching object boundaries, and 2) they cannot segment partially occluded objects. In this paper, we present a model and simulation of a level set functional with unified knowledge of object region, boundary, and shape models. The proposed model was tested on high-speed videos of train rolling stock for bogie part segmentation. It resolves single- and multi-object segmentation of touching boundaries and partially occluded mechanical parts on a train bogie. Simulations on high-speed videos of four trains with 10,720 frames resulted in near-perfect segmentation of 10 touching and occluded bogie parts. The proposed model performed better than state-of-the-art level set segmentation models, showing faster and more accurate segmentation of moving mechanical parts in high-speed videos. Index terms: Level sets, train rolling stock, automated maintenance, hybrid high-speed video segmentation, computer vision based condition monitoring.
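The paper's functional unifies region, boundary and shape terms; as a minimal illustration of how a level set is evolved at all, the sketch below implements only a Chan-Vese-style region term on a synthetic frame. The step size, smoothing parameter and initialization are assumptions, and the boundary and shape terms are omitted.

```python
import numpy as np

def chan_vese_step(phi, img, dt=0.5, eps=1.0):
    """One gradient step of the region (Chan-Vese) term of a level-set functional.
    phi: level-set function, img: grayscale frame; both (H, W) float arrays."""
    inside, outside = phi > 0, phi <= 0
    c1 = img[inside].mean() if inside.any() else 0.0            # mean intensity inside the contour
    c2 = img[outside].mean() if outside.any() else 0.0          # mean intensity outside the contour
    delta = eps / (np.pi * (eps ** 2 + phi ** 2))               # smoothed Dirac delta of phi
    force = -(img - c1) ** 2 + (img - c2) ** 2                  # region competition force
    return phi + dt * delta * force

# Illustrative usage: evolve a circular initial contour on a synthetic frame for a few iterations.
h, w = 128, 128
yy, xx = np.mgrid[:h, :w]
phi = 30.0 - np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)     # signed distance to an initial circle
img = (np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2) < 20).astype(float)  # bright disc on dark background
for _ in range(50):
    phi = chan_vese_step(phi, img)
```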