Juan David Marciales Niebles - Profile on Academia.edu
Papers by Juan David Marciales Niebles
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
This paper proposes a new framework for estimating the Manhattan Frame (MF) of an indoor scene from a single RGB-D image. Our technique formulates this problem as the estimation of a rotation matrix that best aligns the normals of the captured scene to the canonical world axes. By introducing sparsity constraints, our method can simultaneously estimate the scene MF, the surfaces in the scene that are best aligned to one of the three coordinate axes, and the outlier surfaces that do not align with any of the axes. To test our approach, we contribute a new set of annotations that determine the ground-truth MF in each image of the popular NYUv2 dataset. We use this new benchmark to experimentally demonstrate that our method is more accurate, faster, more reliable, and more robust than existing methods in the literature. We further motivate our technique by showing how it can be used to address the RGB-D SLAM problem in indoor scenes: incorporating it into a popular RGB-D SLAM method improves that method's performance.
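At the heart of the MF formulation is the alignment of measured surface normals with a rotated set of canonical axes. The sketch below is a minimal least-squares stand-in for that step, not the paper's sparsity-constrained estimator: it alternates nearest-axis assignment with an orthogonal Procrustes (Kabsch) update, and assumes `normals` is an N×3 array of surface normals. The function name and iteration scheme are illustrative.

```python
import numpy as np

def estimate_manhattan_frame(normals, n_iters=20):
    """Estimate a rotation R aligning scene surface normals to the canonical
    world axes by alternating (a) assignment of each normal to its nearest
    signed axis and (b) an orthogonal Procrustes (Kabsch) update."""
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    axes = np.vstack([np.eye(3), -np.eye(3)])   # the six signed axes
    R = np.eye(3)
    for _ in range(n_iters):
        rotated = normals @ R.T                  # normals in the current frame
        targets = axes[np.argmax(rotated @ axes.T, axis=1)]
        # Kabsch: R = argmin_R sum_i ||R n_i - t_i||^2
        H = normals.T @ targets
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R
```

The paper's sparsity constraints additionally separate axis-aligned surfaces from outlier surfaces, which this plain alternation does not attempt.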
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
One of the cornerstone principles of deep models is their abstraction capacity, i.e. their ability to learn abstract concepts from 'simpler' ones. Through extensive experiments, we characterize the nature of the relationship between abstract concepts (specifically objects in images) learned by popular, high-performing convolutional networks (conv-nets) and established mid-level representations used in computer vision (specifically semantic visual attributes). We focus on attributes due to their impact on several applications, such as object description, retrieval and mining, and active (and zero-shot) learning. Among our findings, we show empirical evidence of the existence of Attribute Centric Nodes (ACNs) within a conv-net that is trained to recognize objects (not attributes) in images. These special conv-net nodes (1) collectively encode information pertinent to visual attribute representation and discrimination, (2) are unevenly and sparsely distributed across all layers of the conv-net, and (3) play an important role in conv-net based object recognition.
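As a toy illustration of how one might probe for such nodes, the sketch below scores each node by how well its activation alone separates images with and without a given attribute (AUC), then ranks nodes. The inputs `activations` (images × nodes) and `attribute_labels` are hypothetical names, and this single-node AUC probe is only a simplified proxy for the paper's analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_attribute_centric_nodes(activations, attribute_labels, top_k=10):
    """Score each conv-net node by how well its activation alone
    discriminates one binary visual attribute (AUC), then return the
    indices of the top-scoring nodes."""
    scores = np.array([
        roc_auc_score(attribute_labels, activations[:, j])
        for j in range(activations.shape[1])
    ])
    # Nodes can be anti-correlated with the attribute; fold AUC around 0.5.
    discrim = np.abs(scores - 0.5)
    return np.argsort(discrim)[::-1][:top_k], scores
```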
Automation in Construction, 2014
Workface assessment, the process of determining the overall activity rates of onsite construction workers throughout a day, typically involves manual visual observations which are time-consuming and labor-intensive. To minimize subjectivity and the time required for conducting detailed assessments, and to allow managers to spend their time on the more important task of assessing and implementing improvements, we propose a new, inexpensive vision-based method using RGB-D sensors that is applicable to interior construction operations. This is a particularly challenging task, as construction activities have a large range of intra-class variability, including varying sequences of body posture and time spent on each individual activity. Skeleton extraction algorithms for RGB-D sequences produce noisy outputs when workers interact with tools or when there is significant body occlusion within the camera's field of view. Existing vision-based methods are also limited in that they can primarily classify "atomic" activities from RGB-D sequences involving one worker conducting a single activity. To address these limitations, our method includes three components: 1) an algorithm for detecting, tracking, and extracting body skeleton features from depth images; 2) a discriminative bag-of-poses activity classifier for classifying single visual activities from a given body skeleton sequence; and 3) a Hidden Markov Model that represents emission probabilities in the form of a statistical distribution over the single-activity classifiers. For training and testing purposes, we introduce a new dataset of eleven RGB-D sequences of interior drywall construction operations involving three actual construction workers conducting eight different activities in various interior locations. Our results, with an average accuracy of 76% on the testing dataset, show the promise of vision-based methods using RGB-D sequences for facilitating workface assessment through activity analysis.
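The HMM component implies decoding an activity sequence from per-segment classifier outputs. Below is a minimal Viterbi decoder over such scores; the array shapes and log-space convention are illustrative assumptions, and estimating the transition and emission distributions (the paper's contribution) is not shown.

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_prior):
    """Most likely activity sequence given per-segment emission scores.

    log_emissions: (T, K) log-likelihood of segment t under activity k
                   (e.g. derived from single-activity classifier outputs).
    log_trans:     (K, K) log transition probabilities between activities.
    log_prior:     (K,)   log initial activity probabilities.
    """
    T, K = log_emissions.shape
    delta = log_prior + log_emissions[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans    # (prev activity, next activity)
        back[t] = np.argmax(cand, axis=0)
        delta = np.max(cand, axis=0) + log_emissions[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 2, -1, -1):           # backtrack best predecessors
        path[t] = back[t + 1, path[t + 1]]
    return path
```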
2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009
This paper presents a method that considers not only patch appearances, but also patch relationships in the form of adjectives and prepositions, for natural scene recognition. Most existing scene categorization approaches use only patch appearances or co-occurrences of patch appearances to determine scene categories, while the relationships among patches remain ignored. Those relationships are, however, critical for recognition and understanding. For example, a 'beach' scene can be characterized by a 'sky' region above 'sand', and a 'water' region between 'sky' and 'sand'. We believe that exploiting such relations between image regions can improve scene recognition. In our approach, each image is represented as a spatial pyramid, from which we obtain a collection of patch appearances with spatial layout information. We apply a feature mining approach to obtain discriminative patch combinations. The mined patch combinations can be interpreted as adjectives or prepositions, which are used for scene understanding and recognition. Experimental results on a fifteen-class scene dataset show that our approach achieves competitive state-of-the-art recognition accuracy, while providing a rich description of the scene classes in terms of the mined adjectives and prepositions.
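The spatial-pyramid representation is a standard construction: quantized patches are histogrammed per cell over successively finer grids and the cell histograms are concatenated. A minimal sketch with illustrative argument names (the paper's patch mining and relation extraction steps are not shown):

```python
import numpy as np

def spatial_pyramid(word_ids, xs, ys, width, height, vocab_size, levels=3):
    """Build a spatial-pyramid histogram from quantized image patches.

    word_ids: visual-word index per patch; (xs, ys): patch centers.
    Concatenates per-cell word histograms over 1x1, 2x2, 4x4, ... grids.
    """
    feats = []
    for level in range(levels):
        n = 2 ** level                          # n x n grid at this level
        cx = np.clip((xs * n / width).astype(int), 0, n - 1)
        cy = np.clip((ys * n / height).astype(int), 0, n - 1)
        cell = cy * n + cx
        hist = np.zeros((n * n, vocab_size))
        np.add.at(hist, (cell, word_ids), 1.0)  # scatter-add patch counts
        feats.append(hist.ravel())
    out = np.concatenate(feats)
    return out / max(out.sum(), 1.0)            # L1-normalize
```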
2013 IEEE International Conference on Computer Vision Workshops, 2013
We introduce a new method for representing the dynamics of human-object interactions in videos. Previous algorithms tend to focus on modeling the spatial relationships between objects and actors, but ignore the evolving nature of this relationship through time. Our algorithm captures the dynamic nature of human-object interactions by modeling how these patterns evolve over time. Our experiments show that encoding such temporal evolution is crucial for correctly discriminating human actions that involve similar objects and spatial human-object relationships but differ only in the temporal aspect of the interaction, e.g. answering a phone versus dialing a phone. We validate our approach on two human activity datasets and show performance improvements over competing state-of-the-art representations.
Real-Time and Automated Recognition and 2D Tracking of Construction Workers and Equipment from Site Video Streams
Computing in Civil Engineering (2012), 2012
This paper presents an automated and real-time algorithm for recognition and 2D tracking of construction workers and equipment from site video streams. In recent years, several research studies have proposed semi-automated vision-based methods for tracking of construction workers and equipment. Nonetheless, there is still a need for automated initial recognition and real-time tracking of these resources in video streams. To address these limitations, a new algorithm based on Histograms of Oriented Gradients (HOG) is ...
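A HOG-based detector of this kind typically slides a fixed-size window over an image pyramid and scores each window with a trained binary SVM. The sketch below uses scikit-image's `hog` and a pre-trained scikit-learn SVM as stand-ins; the window size, stride, and threshold are illustrative assumptions, not the paper's real-time pipeline.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import pyramid_gaussian

def detect(image, svm, window=(128, 64), stride=16, threshold=0.5):
    """Slide a fixed-size window over an image pyramid, score each HOG
    descriptor with a trained binary SVM, and keep confident hits.

    image: 2-D grayscale array. Returns (row, col, scale, score) tuples;
    non-maximum suppression is omitted for brevity.
    """
    hits = []
    for scale_idx, layer in enumerate(pyramid_gaussian(image, downscale=1.25)):
        h, w = layer.shape[:2]
        if h < window[0] or w < window[1]:
            break                                # pyramid too small to scan
        for r in range(0, h - window[0] + 1, stride):
            for c in range(0, w - window[1] + 1, stride):
                patch = layer[r:r + window[0], c:c + window[1]]
                desc = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2))
                score = svm.decision_function([desc])[0]
                if score > threshold:
                    hits.append((r, c, 1.25 ** scale_idx, score))
    return hits
```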
Construction Research Congress 2012, 2012
Video recording of construction operations provides understandable data that can be used to analyze and improve construction performance. Despite the benefits, manual stopwatch study of previously recorded videos can be labor-intensive, may suffer from observer bias, and is impractical after a substantial period of observation. To address these limitations, this paper presents a new vision-based method for automated action recognition of construction equipment from different camera viewpoints. This is a particularly challenging task, as construction equipment can be partially occluded and usually comes in a wide variety of sizes and appearances. The scale and pose of an equipment action can also vary significantly with the camera configuration. In the proposed method, a video is first represented as a collection of spatio-temporal features by extracting space-time interest points and describing each feature with a histogram of oriented gradients (HOG). The algorithm automatically learns the probability distributions of the spatio-temporal features and action categories using multiple binary Support Vector Machine (SVM) classifiers. This strategy handles noisy feature points arising from typical dynamic backgrounds. Given a novel video sequence, the multiple binary SVM classifiers recognize and localize multiple equipment actions in long and dynamic video sequences. We have exhaustively tested our algorithm on 1,200 videos from earthmoving operations. Results with an average accuracy of 85% across all categories of equipment actions reflect the promise of the proposed method for automated performance monitoring.
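The classification stage amounts to one binary SVM per action category, with the highest-scoring category winning. A minimal scikit-learn sketch of that one-vs-rest arrangement (the feature extraction and localization stages are not shown, and the class and argument names are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsRestActionClassifier:
    """One binary SVM per action category over bag-of-features histograms;
    the category whose SVM scores highest wins."""

    def __init__(self, categories, C=1.0):
        self.categories = list(categories)
        self.svms = {c: LinearSVC(C=C) for c in self.categories}

    def fit(self, histograms, labels):
        labels = np.asarray(labels)
        for c, svm in self.svms.items():
            # Train each SVM to separate one category from all the rest.
            svm.fit(histograms, (labels == c).astype(int))
        return self

    def predict(self, histograms):
        scores = np.column_stack(
            [self.svms[c].decision_function(histograms)
             for c in self.categories])
        return [self.categories[i] for i in np.argmax(scores, axis=1)]
```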
2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010
We present an automatic and efficient method to extract spatio-temporal human volumes from video, which combines top-down model-based and bottom-up appearance-based approaches. From the top-down perspective, our algorithm applies shape priors probabilistically to candidate image regions obtained by pedestrian detection, and provides accurate estimates of the human body areas which serve as important constraints for bottom-up processing. Temporal propagation of the identified region is performed with bottom-up cues in an efficient level-set framework, which takes advantage of the sparse top-down information that is available. Our formulation also optimizes the extracted human volume across frames through belief propagation and provides temporally coherent human regions. We demonstrate the ability of our method to extract human body regions efficiently and automatically from a large, challenging dataset collected from YouTube.
2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007
2008 IEEE Workshop on Motion and video Computing, 2008
Spatial-temporal local motion features have shown promising results in complex human action classification. Most previous works [6],[16],[21] treat these spatial-temporal features as a bag of video words, omitting any long-range, global information in either the spatial or temporal domain. Other ways of learning the temporal signature of motion tend to impose a fixed trajectory on the features or on parts of the human body returned by tracking algorithms. This leaves little flexibility for the algorithm to learn the optimal temporal pattern describing these motions. In this paper, we propose the use of spatial-temporal correlograms to encode flexible long-range temporal information into the spatial-temporal motion features. This results in a much richer description of human actions. We then apply an unsupervised generative model to learn different classes of human actions from these ST-correlograms. The KTH dataset, one of the most challenging and popular human action datasets, is used for experimental evaluation. Our algorithm achieves the highest classification accuracy reported for this dataset under an unsupervised learning scheme.
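In essence, a temporal correlogram counts how often pairs of video words co-occur at a given temporal offset. A minimal sketch under that reading (the paper's exact binning and normalization may differ, and the argument names are illustrative):

```python
import numpy as np

def temporal_correlogram(frame_words, vocab_size, max_lag=8):
    """Count how often word pairs (w1, w2) co-occur at temporal offset d.

    frame_words: list over frames, each an iterable of visual-word ids
    detected in that frame. Returns an array of shape
    (max_lag, vocab_size, vocab_size), normalized to sum to one.
    """
    corr = np.zeros((max_lag, vocab_size, vocab_size))
    T = len(frame_words)
    for t in range(T):
        for d in range(1, max_lag + 1):
            if t + d >= T:
                break
            for w1 in frame_words[t]:
                for w2 in frame_words[t + d]:
                    corr[d - 1, w1, w2] += 1.0
    total = corr.sum()
    return corr / total if total else corr
```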
Proceedings of International Conference on Multimedia Retrieval, 2014
Recent efforts in computer vision tackle the problem of human activity understanding in video sequences. Traditionally, these algorithms require annotated video data to learn models. In this paper, we introduce a novel data collection framework to take advantage of the large amount of video data available on the web. We use this new framework to retrieve videos of human activities in order to build datasets for training and evaluating computer vision algorithms. We rely on Amazon Mechanical Turk workers to obtain high-accuracy annotations. An agglomerative clustering technique makes it possible to achieve reliable and consistent annotations for temporal localization of human activities in videos. Using two different datasets, Olympic Sports and our novel Daily Human Activities dataset, we show that our collection/annotation framework achieves robust annotations for human activities in large amounts of video data.
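One plausible reading of the clustering step: several workers mark (start, end) intervals for the same activity, and agglomerative clustering merges the noisy intervals into consensus segments. A SciPy sketch under that assumption (the `max_gap` threshold and the midpoint/median choices are illustrative, not the paper's exact procedure):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def consensus_intervals(annotations, max_gap=2.0):
    """Merge noisy (start, end) annotations from several workers into
    consensus temporal segments: cluster interval midpoints hierarchically,
    then take per-cluster median boundaries. Requires >= 2 annotations."""
    ann = np.asarray(annotations, dtype=float)      # shape (n, 2), seconds
    mids = ann.mean(axis=1, keepdims=True)
    Z = linkage(mids, method='average')
    groups = fcluster(Z, t=max_gap, criterion='distance')
    return [tuple(np.median(ann[groups == g], axis=0))
            for g in np.unique(groups)]
```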
Construction Research Congress 2012, 2012
In this paper we present a novel method for reliable recognition of construction workers and their actions using color and depth data from a Microsoft Kinect sensor. Our algorithm is based on machine learning techniques, in which meaningful visual features are extracted based on the estimated body pose of workers. We adopt a bag-of-poses representation for worker actions and combine it with powerful discriminative classifiers to achieve accurate action recognition. The discriminative framework is able to focus on the visual aspects that are distinctive, and can detect and recognize actions from different workers. We train and test our algorithm using 80 videos of four workers involved in five drywall-related construction activities. These videos were all collected from drywall construction activities inside a dining-hall facility under construction. The proposed algorithm is further validated by recognizing the actions of a construction worker who was not seen in the training dataset. Experimental results show that our method achieves an average precision of 85.28 percent. The results reflect the promise of the proposed method for automated assessment of craftsmen's productivity, safety, and occupational health in indoor environments.
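A bag-of-poses representation can be sketched in a few lines: cluster per-frame skeleton vectors into a pose codebook, then describe each sequence by its normalized histogram of pose words, and feed the histograms to a discriminative classifier. A scikit-learn sketch, with illustrative names and codebook size:

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_poses_features(pose_sequences, n_poses=50, seed=0):
    """Quantize per-frame skeleton vectors into a pose codebook, then
    represent each sequence as a normalized histogram of pose words.

    pose_sequences: list of (n_frames, n_joint_coords) arrays.
    """
    codebook = KMeans(n_clusters=n_poses, n_init=10, random_state=seed)
    codebook.fit(np.vstack(pose_sequences))     # learn the pose vocabulary
    feats = []
    for seq in pose_sequences:
        hist = np.bincount(codebook.predict(seq),
                           minlength=n_poses).astype(float)
        feats.append(hist / hist.sum())
    return np.vstack(feats), codebook

# X, codebook = bag_of_poses_features(train_sequences)
# clf = sklearn.svm.LinearSVC().fit(X, train_action_labels)
```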
2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014
This paper proposes a framework for recognizing complex human activities in videos. Our method describes human activities in a hierarchical discriminative model that operates at three semantic levels. At the lowest level, body poses are encoded in a representative but discriminative pose dictionary. At the intermediate level, encoded poses span a space where simple human actions are composed. At the highest level, our model captures temporal and spatial compositions of actions into complex human activities. Our human activity classifier simultaneously models which body parts are relevant to the action of interest, as well as their appearance and composition, using a discriminative approach. By formulating model learning in a max-margin framework, our approach achieves powerful multi-class discrimination while providing useful annotations at the intermediate semantic level. We show how our hierarchical compositional model provides natural handling of occlusions. To evaluate the effectiveness of our proposed framework, we introduce a new dataset of composed human activities. We provide empirical evidence that our method achieves state-of-the-art activity classification performance on several benchmark datasets.
International Journal of Computer Vision, 2008
We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using a probabilistic Latent Semantic Analysis (pLSA) model. Given a novel video sequence, the model can categorize and localize the human action(s) contained in the video. We test our algorithm on two challenging datasets: the KTH human action dataset and a recent dataset of figure skating actions. Our results are on par with or slightly better than the best reported results. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
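pLSA models each video (document) as a mixture of latent topics over visual words, P(w|d) = Σ_z P(w|z) P(z|d), fit by EM. A compact NumPy sketch of the EM updates on a word-by-video count matrix; it materializes dense 3-D responsibilities, so it only suits small vocabularies, and the paper's inference details may differ.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=100, seed=0):
    """Fit pLSA by EM on a (n_words, n_docs) count matrix of
    spatial-temporal words per video. Returns P(w|z) and P(z|d);
    each video can then be labeled by its dominant latent topic."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_docs)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | w, d), shape (words, topics, docs).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight responsibilities by observed counts.
        weighted = joint * counts[:, None, :]
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```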
Automation in Construction, 2013
This paper presents a computer vision based algorithm for automated 2D detection of construction workers and equipment from site video streams. The state-of-the-art research proposes semi-automated detection methods for tracking of construction workers and equipment. Considering the number of active equipment and workers on jobsites and their frequency of appearance in a camera's field of view, the application of semi-automated techniques can be time-consuming. To address this limitation, a new algorithm based on Histograms of Oriented Gradients and Colors (HOG+C) is proposed. Our detector uses a single sliding window at multiple scales to identify potential candidates for the locations of equipment and workers in 2D. Each detection window is first divided into small spatial regions, and then the gradient orientations and hue-saturation colors are locally histogrammed and concatenated to form the HOG+C descriptors. Tiling the sliding detection window with a dense and overlapping grid of these descriptors, and using a binary Support Vector Machine (SVM) classifier for each resource, enables automated 2D detection of workers and equipment. A new comprehensive benchmark dataset containing over 8,000 annotated video frames, including equipment and workers from different construction projects, is introduced. This dataset contains a large range of pose, scale, background, illumination, and occlusion variation. Our preliminary results on detection of standing workers, excavators, and dump trucks, with average accuracies of 98.83%, 82.10%, and 84.88% respectively, indicate the applicability of the proposed method for automated activity analysis of workers and equipment from single video cameras. Unlike other state-of-the-art algorithms in automated resource tracking, this method in particular detects idle resources and does not need manual or semi-automated initialization of the resource locations in 2D video frames. The experimental results and the perceived benefits of the proposed method are discussed in detail.
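The HOG+C idea, concatenating gradient-orientation histograms with hue-saturation color histograms per detection window, can be sketched with scikit-image utilities as below; the bin counts, cell sizes, and normalization are illustrative choices rather than the paper's settings.

```python
import numpy as np
from skimage.color import rgb2gray, rgb2hsv
from skimage.feature import hog

def hog_c_descriptor(window_rgb, hs_bins=8):
    """Concatenate a HOG descriptor with a hue-saturation color histogram
    for one RGB detection window, echoing the HOG+C construction."""
    gray = rgb2gray(window_rgb)
    hog_part = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    # 2-D histogram over the hue and saturation channels (both in [0, 1]).
    hsv = rgb2hsv(window_rgb)
    color_part, _ = np.histogramdd(hsv[:, :, :2].reshape(-1, 2),
                                   bins=hs_bins, range=[(0, 1), (0, 1)])
    color_part = color_part.ravel()
    color_part /= max(color_part.sum(), 1.0)
    return np.concatenate([hog_part, color_part])
```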
Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers
Advanced Engineering Informatics, 2013
Video recordings of earthmoving construction operations provide understandable data that can be used for benchmarking and analyzing their performance. These recordings further support project managers in taking corrective actions on performance deviations and in turn improving operational efficiency. Despite these benefits, manual stopwatch studies of previously recorded videos can be labor-intensive, may suffer from observer bias, and are impractical after a substantial period of observation. This paper presents a new computer vision based algorithm for recognizing single actions of earthmoving construction equipment. This is a particularly challenging task, as equipment can be partially occluded in site video streams and usually comes in a wide variety of sizes and appearances. The scale and pose of equipment actions can also vary significantly with the camera configuration. In the proposed method, a video is initially represented as a collection of spatio-temporal visual features by extracting space-time interest points and describing each feature with a Histogram of Oriented Gradients (HOG). The algorithm automatically learns the distributions of the spatio-temporal features and action categories using a multi-class Support Vector Machine (SVM) classifier. This strategy handles noisy feature points arising from typical dynamic backgrounds. Given a video sequence captured from a fixed camera, the multi-class SVM classifier recognizes and localizes equipment actions. For the purpose of evaluation, a new video dataset is introduced which contains 859 sequences of excavator and truck actions. This dataset contains large variations of equipment pose and scale, and has varied backgrounds and levels of occlusion. The experimental results, with average accuracies of 86.33% and 98.33%, show that our supervised method outperforms previous algorithms for excavator and truck action recognition. The results hold promise for the applicability of the proposed method to construction activity analysis.
Computer Vision – ECCV 2010, 2010
Much recent research in human activity recognition has focused on the problem of recognizing simple repetitive actions (walking, running, waving) and punctual actions (sitting up, opening a door, hugging). However, many interesting human activities are characterized by a complex temporal composition of simple actions. Automatic recognition of such complex actions can benefit from a good understanding of the temporal structures. We present in this paper a framework for modeling motion by exploiting the temporal structure of human activities. In our framework, we represent activities as temporal compositions of motion segments. We train a discriminative model that encodes a temporal decomposition of video sequences, and appearance models for each motion segment. In recognition, a query video is matched to the model according to the learned appearances and motion segment decomposition. Classification is based on the quality of matching between the motion segment classifiers and the temporal segments in the query sequence. To validate our approach, we introduce a new dataset of complex Olympic Sports activities. We show that our algorithm performs better than other state-of-the-art methods.
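Matching ordered motion-segment classifiers to a query video can be posed as finding the contiguous segmentation of the frames that maximizes the summed classifier scores, which dynamic programming solves exactly. A minimal sketch under that simplified reading (the paper's model additionally learns anchor positions and appearance jointly):

```python
import numpy as np

def best_temporal_decomposition(seg_scores):
    """Match K ordered motion-segment classifiers to a T-frame video:
    find contiguous segment boundaries maximizing the total score by DP.

    seg_scores[k, t] = score of frame t under segment classifier k.
    Returns (total_score, starts) where starts[k] is the first frame
    assigned to segment k.
    """
    K, T = seg_scores.shape
    dp = np.full((K, T), -np.inf)
    stay = np.zeros((K, T), dtype=bool)   # True: frame t-1 in same segment
    dp[0] = np.cumsum(seg_scores[0])
    for k in range(1, K):
        for t in range(k, T):             # segment k needs >= 1 frame
            prev_same, prev_switch = dp[k, t - 1], dp[k - 1, t - 1]
            stay[k, t] = prev_same >= prev_switch
            dp[k, t] = seg_scores[k, t] + max(prev_same, prev_switch)
    # Backtrack the segment start frames.
    starts = [0] * K
    k, t = K - 1, T - 1
    while k > 0:
        if stay[k, t]:
            t -= 1
        else:
            starts[k] = t
            k, t = k - 1, t - 1
    return dp[K - 1, T - 1], starts
```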