Alexander Hauptmann | Carnegie Mellon University

Papers by Alexander Hauptmann

Proceedings of the 24th ACM international conference on Multimedia, 2016

Describing videos with natural language is one of the ultimate goals of video understanding. Video records multimodal information including image, motion, aural, speech and so on. The MSR Video to Language Challenge provides a good chance to study multi-modality fusion in the captioning task. In this paper, we propose a multi-modal fusion encoder and integrate it with a text sequence decoder into an end-to-end video captioning framework. Features from the visual, aural, speech and meta modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) are then used as the decoder to generate natural language sentences. Experimental results show the effectiveness of the multi-modal fusion encoder trained in the end-to-end framework, which achieved top performance in both common metrics evaluation and human evaluation.

IEEE Transactions on Pattern Analysis and Machine Intelligence

A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, people look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize objects in the image, but also understand the relationship between objects (visual relationship detection), and generate a text description (image captioning) based on the image content. Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval), etc. These tasks require a higher level of understanding and reasoning for image vision tasks. The scene graph is just such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of the current scene graph research. More specifically, we first summarize the general definition of the scene graph, then conduct a comprehensive and systematic discussion on the generation method of the scene graph (SGG) and the SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs.

This paper describes CMU and USC/ISI’s OPERA system that performs end-to-end information extraction from multiple media, integrates results across English, Russian, and Ukrainian, produces Knowledge Bases containing the extracted information, and performs hypothesis reasoning over the results.

Proceedings of the AAAI Conference on Artificial Intelligence

Matrix factorization (MF) has been attracting much attention due to its wide applications. However, since MF models are generally non-convex, most of the existing methods easily get stuck in bad local minima, especially in the presence of outliers and missing data. To alleviate this deficiency, in this study we present a new MF learning methodology that gradually includes matrix elements in MF training, from easy to complex. This corresponds to a recently proposed learning regime called self-paced learning (SPL), which has been demonstrated to be beneficial in avoiding bad local minima. We also generalize the conventional binary (hard) weighting scheme for SPL to a more effective real-valued (soft) weighting scheme. The effectiveness of the proposed self-paced MF method is substantiated by a series of experiments on synthetic, structure from motion and background subtraction data.
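The difference between hard and soft self-paced weighting can be sketched in a few lines. This is a minimal illustration, not the paper's formulation: the function names, the threshold parameter `lam`, and the linear form of the soft weight are all assumptions chosen for clarity.

```python
def hard_weight(loss, lam):
    """Binary SPL weight: an element joins training only if its loss is below lam."""
    return 1.0 if loss < lam else 0.0


def soft_weight(loss, lam):
    """Real-valued weight (assumed linear form): 1 at zero loss, falling to 0 at lam."""
    return max(0.0, 1.0 - loss / lam)


# Hypothetical per-element reconstruction losses, from easy to hard.
losses = [0.1, 0.4, 0.9, 2.5]

# Raising lam over training rounds gradually admits harder elements.
for lam in (0.5, 1.0, 3.0):
    hard = [hard_weight(l, lam) for l in losses]
    soft = [round(soft_weight(l, lam), 2) for l in losses]
    print(f"lam={lam}: hard={hard} soft={soft}")
```

With the hard scheme an element is either fully trusted or ignored, while the soft scheme down-weights borderline elements instead of discarding them.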

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06, 2006

Combining the output from multiple retrieval sources over the same document collection is of great importance to a number of retrieval tasks such as multimedia retrieval, web retrieval and meta-search. To merge retrieval sources adaptively according to query topics, we propose a series of new approaches called probabilistic latent query analysis (pLQA), which can associate non-identical combination weights with latent classes underlying the query space. Compared with previous query independent and query-class based combination methods, the proposed approaches have the advantage of being able to discover latent query classes automatically without using prior human knowledge, to assign one query to a mixture of query classes, and to determine the number of query classes under a model selection principle. Experimental results on two retrieval tasks, i.e., multimedia retrieval and meta-search, demonstrate that the proposed methods can uncover sensible latent classes from training data, and can achieve considerable performance gains.
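The idea of assigning a query to a mixture of latent classes, each with its own combination weights, can be illustrated as follows. This is a simplified sketch under stated assumptions: the linear mixture, the function name `combine_scores`, and all numbers are hypothetical, and the actual pLQA model may take a different functional form.

```python
def combine_scores(source_scores, class_probs, class_weights):
    """Fuse per-source retrieval scores for one document.

    source_scores: one score per retrieval source.
    class_probs:   assumed P(latent class | query), summing to 1.
    class_weights: one weight vector per latent class, aligned with source_scores.
    """
    total = 0.0
    for prob, weights in zip(class_probs, class_weights):
        class_score = sum(w * s for w, s in zip(weights, source_scores))
        total += prob * class_score
    return total


# Hypothetical example: two sources (say text and image) and two latent classes.
scores = [0.8, 0.2]
probs = [0.75, 0.25]                   # the query mostly belongs to class 0
weights = [[1.0, 0.0], [0.0, 1.0]]     # class 0 trusts text, class 1 trusts image
print(combine_scores(scores, probs, weights))
```

A query-independent method would correspond to a single class, and a hard query-class method to a one-hot `class_probs` vector.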

Computer Vision – ECCV 2020 Workshops, 2020

We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline pre-trained short-term tracker is SuperDiMP, which combines the bounding-box regressor of PrDiMP with the standard DiMP classifier. Our tracker RLT-DiMP improves SuperDiMP in the following three aspects: (1) Uncertainty reduction using random erasing: To make our model robust, we treat agreement among multiple images, each with small random rectangular areas erased, as a measure of certainty, and then correct the tracking state of our model accordingly. (2) Random search with spatio-temporal constraints: We propose a robust random search method with a score penalty applied to prevent the problem of sudden detection at a distance. (3) Background augmentation for more discriminative feature learning: We augment various backgrounds that are not included in the search area to train a model that is more robust to background clutter. In experiments on the VOT-LT2020 benchmark dataset, the proposed method achieves comparable performance to the state-of-the-art long-term trackers.

2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)

This paper presents a novel approach to aid face recognition: Using multiple views of a face, we construct a 3D model instead of directly using the 2D images for recognition. Our framework is designed for videos, which contain many instances of a target face from a sequence of slightly differing views, as opposed to a single static picture of the face. Specifically, we reconstruct the 3D face shapes from two orthogonal views and select features based on pairwise distances between landmark points on the model using Fisher's Linear Discriminant. While 3D face shape reconstruction is sensitive to the quality of the feature point localization, our experiments show that 3D reconstruction together with the regularized Fisher's Linear Discriminant can provide highly accurate face recognition from multiple facial views. Experiments on the Carnegie Mellon PIE (Pose, Illumination and Expressions) database containing 68 people's faces with at least 3 expressions under varying lighting conditions demonstrate vastly improved performance.
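The raw features mentioned above, pairwise distances between 3D landmark points, can be computed as in the sketch below. The landmark coordinates and the function name are hypothetical, and the subsequent feature selection with Fisher's Linear Discriminant is not shown.

```python
import math
from itertools import combinations


def pairwise_distances(landmarks):
    """Return the Euclidean distance for every unordered pair of 3D landmarks."""
    return [math.dist(a, b) for a, b in combinations(landmarks, 2)]


# Hypothetical landmark coordinates on a reconstructed face model.
landmarks = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
print(pairwise_distances(landmarks))
```

For n landmarks this yields n(n-1)/2 candidate features, which is why a selection step is needed before recognition.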

CMU's Informedia project has collected and automatically processed a multi-terabyte video corpus containing 8 years of CNN broadcasts and other video sources [5]. Previous work has demonstrated multi-modal querying by text, image, time, and location, and the ability to summarize a single document or a set of documents matching a query. We now plan to organize the corpus or a subset along multiple dimensions, or perspectives, adding relevant background material, significantly expanding and accelerating the viewer's comprehension and integration of knowledge. A perspective can provide factual background information, a history of an issue, the view of a biased source, a technical or medical perspective, or any of dozens of others. This abstract proposes a cityscape metaphor for organizing visual context in terms of perspectives.

Television news has been the predominant way of understanding the world around us, but individual news broadcasters can frame or mislead audiences' understanding of political and social issues. We aim to develop a computer system that can automatically identify highly biased television news, which may prompt audiences to seek news stories from contrasting viewpoints. But can computers determine if news videos were produced by broadcasters holding differing ideological beliefs? We developed a method of identifying differing ideological perspectives based on a large-scale visual concept ontology, and the experimental results were promising.

Our prototype automatic title generation system, inspired by statistical machine-translation approaches [1], treats the document title like a translation of the document. Titles can be generated without extracting words from the document. A large corpus of documents with human-assigned titles is required for training title "translation" models. On an F1 evaluation score our approach outperformed another approach based on Bayesian probability estimates [7].

Nearly 2.5 million Americans currently reside in nursing homes and assisted living facilities in the United States, accounting for approximately 5% of persons 65 years and older [1]. The aging of the "Baby Boomer" generation is expected to lead to an exponential growth in the need for some form of long-term care (LTC) for this segment of the population within the next 25 years. In light of these sobering demographic shifts, there is an urgency to address the profound concerns that exist about the quality-of-care (QoC) and quality-of-life (QoL) of this frailest segment of our population.

AdaBoost has proved to be an effective method to improve the performance of base classifiers both theoretically and empirically. However, previous studies have shown that AdaBoost might suffer from the overfitting problem, especially for noisy data. In addition, most current work on boosting assumes that the combination weights are fixed constants and therefore does not take particular input patterns into consideration. In this paper, we present a new boosting algorithm, "WeightBoost", which tries to solve these two problems by introducing an input-dependent regularization factor to the combination weight. Similarly to AdaBoost, we derive a learning procedure for WeightBoost, which is guaranteed to minimize training errors. Empirical studies on eight different UCI data sets and one text categorization data set show that WeightBoost almost always achieves a considerably better classification accuracy than AdaBoost. Furthermore, experiments on data with artificially controll...

Logistic Regression (LR) has been widely used in statistics for many years, and has recently received extensive study in the machine learning community due to its close relations to Support Vector Machines (SVM) and AdaBoost. In this paper, we use a modified version of LR to approximate the optimization of SVM by a sequence of unconstrained optimization problems. We prove that our approximation will converge to SVM, and propose an iterative algorithm called "MLR-CG" which uses Conjugate Gradient as its inner loop. A multiclass version, "MMLR-CG", is also obtained after simple modifications. We compare MLR-CG with SVMlight over different text categorization collections, and show that our algorithm is much more efficient than SVMlight when the number of training examples is very large. Results of the multiclass version MMLR-CG are also reported.

The McKIZ Aware Community will enable us to move the paradigm of an aware and assistive home to the development of an aware and assistive community infrastructure by incorporating devices and methods into a small urban community of homes, recreation facilities, retail and service providers, on city streets with vehicular traffic and public transportation. This broadens our research, data collection and evaluation of persons with disabilities and aging residents to include instrumental activities of daily living and quality of life that extend beyond the confines of their home.

Active learning has been demonstrated to be a useful tool to reduce human labeling effort for many multimedia applications. However, most of the previous work on multimedia active learning has largely glossed over the multi-modality problem. Several experimental results show that multi-modality fusion plays an important role in boosting the performance of multimedia classification. In this paper, we present a multi-modality active learning approach which extends the active learning process from a single modality to multiple modalities. The experimental results on the TRECVID 2004 semantic feature extraction task show that the proposed active learning approach works more effectively than the single-modality approach and also significantly reduces the amount of labeled data required.

In this paper, we present and compare automatically generated titles for machine-translated documents using several different statistics-based methods. A Naïve Bayesian, a K-Nearest Neighbour, a TF-IDF and an iterative Expectation-Maximization method for title generation were applied to 1000 original English news documents and again to the same documents translated from English into Portuguese, French or German and back to English using SYSTRAN. The AutoSummarization function of Microsoft Word was used as a baseline. Results on several metrics show that the statistics-based methods of title generation for machine-translated documents are fairly language independent and title generation is possible at a level approaching the accuracy of titles generated for the original English documents.

At TRECVID 2005, CMU participated in the low-level feature extraction task, the semantic concept feature extraction task, automatic, manual and interactive search tasks and the BBC stock footage challenge.

For the first time in 2007, TRECVID considered structured evaluation of automated video summarization, utilizing BBC rushes video. This paper discusses in detail our approaches for producing the submitted summaries to TRECVID, including the two baseline methods. The cluster method performed well in terms of coverage, and adequately in terms of user satisfaction, but did take longer to review. We conducted additional evaluations using the same TRECVID assessment interface to judge 2 additional methods for summary generation: 25x (simple speed-up by 25 times), and pz (emphasizing pans and zooms). Data from 4 human assessors shows significant differences between the cluster, pz, and 25x approaches. The best coverage (text inclusion performance) is obtained by 25x, but at the expense of taking the most time to evaluate and perceived as the most redundant. Method pz was easier to use than cluster and had better performance on pan/zoom recall tasks, leading into discussions on how summari...
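The 25x method amounts to uniform frame subsampling. A minimal sketch, assuming the video is available as an in-memory sequence of frames (the function name and the integer frame stand-ins are illustrative, not taken from the paper):

```python
def speed_up(frames, factor=25):
    """Keep every `factor`-th frame, giving a summary about `factor` times shorter."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return frames[::factor]


# Stand-in for decoded video frames: 100 frame indices.
frames = list(range(100))
print(speed_up(frames))
```

Because no content analysis is involved, nearly all of the source material remains visible, which is consistent with the high coverage and high redundancy reported for this method.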

We submitted a number of semantic classifiers, most of which were merely trained on keyframes. We also experimented with runs of classifiers that were trained exclusively on text data and relative time within the video, while a few were trained using all available multiple modalities.

The Informedia team participated in the tasks of rushes summarization, high-level feature extraction and event detection in surveillance video. For the rushes summarization, our basic idea was to use subsampled video at the appropriate rate, showing almost the whole video faster, and then modify the result to remove garbage frames. Simply subsampling the frames proved to be the best method for summarizing BBC rushes video, with other improvements not improving the basic inclusion rate, nor appreciably affecting the other subjective metrics. For the high-level feature detection, we trained exclusively on TRECVID'05 data and tried to assess and predict the reliability of the detectors. The voting scheme for combining multiple classifiers performed best, marginally better than trying to predict the best classifier based on a robustness calculation from within dataset cross-domain performance. For event detection, we found that the overall approach was effective at characterizing a...
