Yong Rui - Academia.edu

Papers by Yong Rui

1997 Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries, 1997

... The bottom architecture is an example of how the model is used to describe an image object. ... Various features, such as color, texture, shape, layout, motion parameters, etc., are extracted to make the MIR system flexible enough to support the different information needs of different ...

Workshop on Multimedia Information Systems, 1997

Introduction: With the advances in storage technology and the advent of the World Wide Web, there has been an explosion in the amount and complexity of digital information being generated, analyzed, stored, accessed and transmitted. Most of this data is multimedia in nature, including digital images, video, audio and simple text data. To make use of this vast amount of multimedia data, we need ...

Person-based indices and timelines can enable fast and non-linear access to recorded meetings. This paper focuses on how to automatically construct those indices and timelines by using face recognition techniques. While there exists extensive research in generic face recognition, recognizing faces in recorded meetings is still an understudied area. Real-world meeting videos impose several interesting and unique challenges, including complex lighting, low imaging quality, and large variations in head pose and size. In this paper, a promising approach based on MRC-Boosting is presented to address these challenges. It achieves encouraging performance on real-world meeting videos and shows superior accuracy and robustness compared to two popular existing approaches.

... free paper. Printed in the United States of America. To my parents, Wei, and Michael. -Sean To my parents, Dongqin, and Olivia. -Yong To Margaret; Caroline, Marjorie, Thomas, Gregory. -Tom Contents 1 ...

Machine-aided retrieval of multimedia information (image [44], video [170], or audio [195], etc.) is achieved based on representations in the form of descriptors (or feature vectors). Two issues arise: one is the effectiveness of the representation, i.e., to what extent can the meaningful contents of the media be represented in these vectors? The other is the selection of a similarity metric during the retrieval process. This is an important issue because the similarity metric dynamically depends upon the user and the user-defined query class, ...
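The dependence of the similarity metric on the user can be illustrated with a small sketch. It assumes a weighted Euclidean distance over feature vectors, which is one common choice and not necessarily the metric used in the work above; the names `weighted_distance`, `rank`, and the two-dimensional "color, texture" vectors are hypothetical.

```python
import math

def weighted_distance(query, candidate, weights):
    """Weighted Euclidean distance between two equal-length feature vectors."""
    if not (len(query) == len(candidate) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(
        sum(w * (q - c) ** 2 for q, c, w in zip(query, candidate, weights))
    )

def rank(query, database, weights):
    """Return database keys ordered from most to least similar to the query."""
    return sorted(
        database, key=lambda k: weighted_distance(query, database[k], weights)
    )

# Hypothetical descriptors: [color feature, texture feature].
database = {"a": [0.9, 0.1], "b": [0.1, 0.9]}
query = [0.8, 0.8]

# A user who only cares about color vs. one who only cares about texture.
color_user = rank(query, database, weights=[1.0, 0.0])
texture_user = rank(query, database, weights=[0.0, 1.0])
```

The same query and the same database give opposite rankings for the two users, which is why the choice of weights (the metric) matters as much as the descriptors themselves.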

Abstract: Different from existing work that focuses on emotion type detection, the approach proposed in this paper gives users the flexibility to pick their preferred affective content by choosing either emotion intensity levels or emotion types. Specifically, we propose a hierarchical structure for movie emotions and analyze emotion intensity and emotion type hierarchically, using arousal-related and valence-related features. First, three emotion intensity levels are detected by applying fuzzy c-means clustering to arousal features. Fuzzy clustering provides a mathematical model to represent vagueness, which is close to human perception. Then, valence-related features are used to detect five emotion types. Because video is continuous time-series data and the occurrence of a certain emotion is affected by recent emotional history, conditional random fields (CRFs) are used to capture the context information. The CRF outperforms the hidden Markov model (HMM): it relaxes the independence assumption for states required by the HMM and avoids the bias problem. Experimental results show that the CRF-based hierarchical method outperforms the one-step method on emotion type detection. A user study shows that the majority of viewers prefer to have the option of accessing movie content by emotion intensity level, and that the majority of users are satisfied with the proposed emotion detection.
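The fuzzy clustering step can be sketched as follows. This is a minimal, textbook one-dimensional fuzzy c-means, not the authors' implementation: the scalar "arousal" values, the initial centers, and the fuzzifier `m=2.0` are all assumptions made for illustration.

```python
def membership_row(x, centers, m=2.0):
    """Fuzzy memberships of scalar x in each cluster (non-negative, sum to 1)."""
    dists = [abs(x - c) for c in centers]
    if 0.0 in dists:
        # x coincides with one or more centers: share full membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]

def fuzzy_c_means(values, initial_centers, m=2.0, iterations=50):
    """Textbook 1-D fuzzy c-means; returns (centers, memberships)."""
    centers = list(initial_centers)
    for _ in range(iterations):
        memberships = [membership_row(x, centers, m) for x in values]
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total > 0.0:
                centers[j] = sum(w * x for w, x in zip(weights, values)) / total
    # Recompute memberships so they match the final centers.
    memberships = [membership_row(x, centers, m) for x in values]
    return centers, memberships

# Hypothetical per-segment arousal scores in [0, 1].
arousal = [0.10, 0.15, 0.20, 0.50, 0.55, 0.90, 0.95]
centers, memberships = fuzzy_c_means(arousal, initial_centers=[0.0, 0.5, 1.0])

# Hard intensity level = cluster with the highest membership (0 low, 1 mid, 2 high).
levels = [row.index(max(row)) for row in memberships]
```

Unlike hard k-means, each segment keeps a graded membership in all three intensity levels, which is the "vagueness" the abstract refers to; the hard `levels` list is only a convenience for display.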

This paper will first briefly survey the existing impact of multimedia information retrieval (MIR) in applications. It will then analyze the current trends of MIR research that can have an influence on future applications. It will then detail the future possibilities and bottlenecks in applying MIR research results in the main target application areas, such as the consumer (e.g., personal video recorders, web information retrieval), public safety (e.g., automated smart surveillance systems), and professional (e.g., automated meeting capture and summarization) domains. In particular, recommendations will be made to the research community regarding the challenges that need to be met to make knowledge transfer towards applications more efficient and effective. It will also attempt to study the trends in applications that can inform the MIR community on directing intellectual resources towards MIR problems that can have maximal real-world impact.

... Furthermore, the HSV, CIE-LAB, and Munsell color spaces also attempt to make the color space perceptually uniform. ... We chose the HSV (hue, saturation, and value) color space for simplicity. Spatial space is just the 2-D Cartesian space spanned ...
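As a rough illustration of working in HSV, the sketch below converts RGB pixels with Python's standard `colorsys` module and builds a normalized color histogram. The histogram descriptor, the bin counts, and the intersection measure are assumptions chosen for the example; the snippet above does not say which color feature the paper actually computes.

```python
import colorsys

def hsv_histogram(pixels, hue_bins=8, sat_bins=2, val_bins=2):
    """Normalized HSV histogram of an iterable of (r, g, b) pixels in 0..255."""
    hist = [0.0] * (hue_bins * sat_bins * val_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that a component equal to 1.0 falls into the last bin.
        hi = min(int(h * hue_bins), hue_bins - 1)
        si = min(int(s * sat_bins), sat_bins - 1)
        vi = min(int(v * val_bins), val_bins - 1)
        hist[(hi * sat_bins + si) * val_bins + vi] += 1.0
    total = sum(hist)
    return [count / total for count in hist] if total else hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two tiny hypothetical "images": all red and all blue.
red_image = [(255, 0, 0)] * 4
blue_image = [(0, 0, 255)] * 4
red_hist = hsv_histogram(red_image)
blue_hist = hsv_histogram(blue_image)
same = histogram_intersection(red_hist, red_hist)
different = histogram_intersection(red_hist, blue_hist)
```

Hue is given more bins than saturation and value here because hue carries most of the perceived color difference; that split is a design choice for the sketch, not something taken from the paper.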

Sports video semantic event detection is essential for sports video summarization and retrieval. Extensive research efforts have been devoted to this area in recent years. However, existing sports video event detection approaches rely heavily either on the video content itself, which faces the difficulty of extracting high-level semantic information from video content using computer vision and image processing techniques, or on a manually generated video ontology, which is domain specific and difficult to align automatically with the video content. In this paper, we present a novel approach for sports video semantic event detection based on analysis and alignment of webcast text and broadcast video. Webcast text is a text broadcast channel for sports games that is co-produced with the broadcast video and is easily obtained from the web. We first analyze webcast text to cluster and detect text events in an unsupervised way using probabilistic latent semantic analysis (pLSA). Based on the detected text events and video structure analysis, we employ a conditional random field model (CRFM) to align text events and video events by detecting the event moment and event boundary in the video. Incorporating webcast text into sports video analysis significantly facilitates sports video semantic event detection. We conducted experiments on 33 hours of soccer and basketball games for webcast analysis, broadcast video analysis and text/video semantic alignment. The results are encouraging when compared with the manually labeled ground truth.

Given rapid improvements in storage devices, network infrastructure and streaming-media technologies, a large number of corporations and universities are recording lectures and making them available online for anytime, anywhere access. However, producing high-quality lecture videos is still labor intensive and expensive. Fortunately, recent technology advances are making it feasible to build automated camera management systems to capture lectures. In this paper we report our design of such a system, including system configuration, audio-visual tracking techniques, software architecture, and user study. Motivated by different roles in a professional video production team, we have developed a multi-cinematographer single-director camera management system. The system performs lecturer tracking, audience tracking, and video editing all fully automatically, and offers quality close to that of human-operated systems.

Combining learning with vision techniques in interactive image retrieval has been an active research topic during the past few years. However, existing learning techniques either are based on heuristics or fail to analyze the working conditions. Furthermore, there is almost no in-depth study ...

unscented Kalman filter to exploit object dynamics in nonlinear systems for robust contour tracking.

Supporting multimedia search has emerged as an important research topic. There are three paradigms on the research spectrum, which ranges from the least automatic to the most automatic. On the far left end, there is the pure manual labeling paradigm, which labels multimedia content, e.g., images and video clips, manually with text labels and then uses text search to search ...

Abstract: Content-Based Image Retrieval (CBIR) has become one of the most active research areas in the past few years. Many visual feature representations have been explored and many systems built. While these research efforts establish the basis of CBIR, the usefulness of ...

We propose a new multiple instance learning (MIL) algorithm to learn image categories. Unlike existing MIL algorithms, in which the individual instances in a bag are assumed to be independent of each other, we develop concurrent tensors to explicitly model the inter-dependency ...

Journal of visual communication and image …, 1999

This paper provides a comprehensive survey of the technical achievements in the research area of image retrieval, especially content-based image retrieval, an area that has been so active and prosperous in the past few years. The survey includes 100+ papers covering the ...

Decrypting the secret of beauty or attractiveness has been the pursuit of artists and philosophers for centuries. To date, computational models for attractiveness estimation have been actively explored in the computer vision and multimedia communities, yet with the focus mainly on facial features. In this article, we conduct a comprehensive study on female attractiveness conveyed by single or multiple modalities of cues, that is, face, dressing and/or voice, and aim to discover how different modalities individually and collectively affect the human sense of beauty. To investigate the problem extensively, we collect the Multi-Modality Beauty (M2B) dataset, which is annotated with attractiveness levels converted from manual k-wise ratings and with semantic attributes of the different modalities. Inspired by the common consensus that middle-level attribute prediction can assist higher-level computer vision tasks, we manually labeled many attributes for each modality. Next, a tri-layer Dual-supervised Feature-Attribute-Task (DFAT) network is proposed to jointly learn the attribute model and the attractiveness model of single or multiple modalities. To remedy the possible loss of information caused by incomplete manual attributes, we also propose a novel Latent Dual-supervised Feature-Attribute-Task (LDFAT) network, in which latent attributes are combined with manual attributes to contribute to the final attractiveness estimation. Extensive experimental evaluations on the collected M2B dataset demonstrate the effectiveness of the proposed DFAT and LDFAT networks for female attractiveness prediction.

2006 IEEE International Conference on Multimedia and Expo, 2006

Group-to-individual (G2I) distributed meetings are an important but understudied area. Because of the asymmetry between the parties in G2I meetings, they pose two unique challenges: 1) the remote participant tends to be ignored by the local participants; and 2) the remote participant has an inferior audio, video, and data experience compared with the local participants. To address these issues, in this paper we present PING, a system explicitly designed for G2I distributed meetings that combines recent advances in both hardware, e.g., microphone arrays and remote person stand-in devices, and software, e.g., audio-video processing, to improve users' G2I meeting experience. We report how PING addresses the above two challenges, along with its system design and implementation.
