COST292 experimental framework for TRECVID 2006

The COST292 experimental framework for TRECVID 2007

2007

In this paper, we give an overview of the four tasks submitted to TRECVID 2007 by COST292. In the shot boundary (SB) detection task, four SB detectors have been developed and their results are merged using two merging algorithms. The framework developed for the high-level feature extraction task comprises four systems. The first system transforms a set of low-level descriptors into the semantic space using Latent Semantic Analysis and utilises neural networks for feature detection. The second system uses a Bayesian classifier trained with a "bag of subregions". The third system uses a multi-modal classifier based on SVMs and several descriptors. The fourth system uses two image classifiers based on ant colony optimisation and particle swarm optimisation, respectively. The system submitted to the search task is an interactive retrieval application combining retrieval functionalities in various modalities with a user interface supporting automatic and interactive search over all submitted queries. Finally, the rushes task submission is based on a video summarisation and browsing system comprising two different interest curve algorithms and three features.
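
As a side note, the LSA step in the first system can be pictured with a short sketch: project the shot-by-descriptor matrix into a latent space via truncated SVD (the linear-algebra core of LSA) and train a small neural network on the projected vectors as a per-concept detector. The data shapes, component count, and sklearn classifier below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: project low-level descriptors into a latent
# semantic space with truncated SVD, then train a small neural network
# as a per-concept feature detector. All data here is toy data.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 64))        # 200 shots x 64 low-level descriptor dims
y = rng.integers(0, 2, 200)      # toy binary labels for one high-level feature

lsa = TruncatedSVD(n_components=16, random_state=0)
X_sem = lsa.fit_transform(X)     # descriptors mapped into the "semantic" space

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_sem, y)
scores = clf.predict_proba(lsa.transform(X))[:, 1]  # per-shot confidence
```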

Florida International University and University of Miami TRECVID 2009 - High Level Feature Extraction

2009

In this paper, the details of the FIU-UM group's TRECVID 2009 high-level feature extraction task submission are presented. Six runs were conducted using different feature sets, data pruning approaches, classification algorithms, and ranking methods. A proportion of the TRECVID 2009 development data was randomly sampled from the whole development data archive (all TRECVID 2007 development and test data), including all positive data instances (target high-level feature data) and some of the negative data instances (around one third of the non-target high-level feature data) for each high-level feature. Two strategies for dealing with the skipping/not-sure shots were also introduced: the first four runs treated the skipping/not-sure data instances as positive instances in the training data (ALL), and the last two runs excluded these instances from the training data (PURE).

• FIU-UM-3: SF+ALL+DB+SB, training on partial TRECVID 2009 development data with the full positive set (ALL) and using shot-based low-level features (SF), distance-based pruning (DB), a subspace-based classifier (SB), and a ranking process that used the subspace-based scores from the classifier.
• FIU-UM-4: SF+ALL+DB+SB+SVMC, training on partial TRECVID 2009 development data with the full positive set (ALL) and using shot-based low-level features (SF), distance-based pruning (DB), a subspace-based classifier (SB), and the SVMC ranking method. The SVMC method takes the retrieval results from an SVM with a chi-square kernel (SVMC) and treats these results as additional scores, which are later combined with the subspace-based scores to form the final ranking scores (a sketch of this combination follows this abstract).
• FIU-UM-5: KF+PURE+CB+MCA+RANK, training on partial TRECVID 2009 development data with the pure positive set (PURE) and using key-frame-based low-level features (KF), correlation-based pruning (CB), an MCA-based classifier (MCA), and a ranking method (RANK).
• FIU-UM-6: SF+PURE+DB+SB, training on partial TRECVID 2009 development data with the pure positive set (PURE) and using shot-based low-level features (SF), distance-based pruning (DB), a subspace-based classifier (SB), and a ranking process that used the subspace-based scores from the classifier.

In the TRECVID 2009 high-level feature extraction task submission, we were able to improve the framework in several ways. First, more key-frame-based visual features (513) were extracted in addition to the 28 old shot-based features, and different normalization methods were applied. Second, all development data (219 videos) and testing data (619 videos) were processed. Third, a key-frame detection algorithm was implemented to extract key-frames from the testing videos, which are not provided by TRECVID. Fourth, different data pruning methods were proposed to address the data imbalance issue, and other experimental results show that our proposed methods perform well at removing noisy data and selecting typical positive and negative data instances. Fifth, two new classifiers were proposed in our framework rather than using existing classifiers such as Support Vector Machines or Decision Trees. Finally, in addition to concept detection, we were able to extend our framework to the area of video retrieval; in other words, we proposed several scoring methods to rank the retrieved results. However, we still face a number of challenges. First, as can be seen from the description of each run, the three runs utilizing the CB+MCA model were trained only on key-frame-based low/mid-level visual features. Adding some low-level audio features would likely improve the extraction performance for certain high-level features, such as person-playing-a-musical-instrument, people-dancing, and singing. Similarly, more visual features would help the runs trained only on the shot-based feature data. Therefore, how to integrate audio features with the key-frame-based features and add more visual features to the shot-based features still needs to be worked out. Second, to address the data imbalance problem, the negative data instances were randomly sampled. This is risky, since it may enlarge the difference between the distributions of the training set and the testing set; even if the training performance is good, as in our experiments, the testing results may not be as good as expected. Therefore, more investigation of data sampling and data pruning is needed. Third, the results show that the ranking methods are not yet good enough, and more research on ranking the retrieved results is needed.
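
A minimal sketch of the SVMC-style score combination described in run FIU-UM-4 above, assuming both classifiers already produce per-shot scores. The toy data, the equal combination weights, and the sklearn calls are illustrative assumptions rather than the submission's actual code.

```python
# Hypothetical sketch: scores from an SVM with a chi-square kernel are
# combined with subspace-based scores to form the final ranking.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.random((100, 28))     # 28 shot-based features, non-negative for chi2
X_test = rng.random((40, 28))
y_train = rng.integers(0, 2, 100)
subspace_scores = rng.random(40)    # stand-in for the subspace classifier's scores

K_train = chi2_kernel(X_train, X_train)
K_test = chi2_kernel(X_test, X_train)
svm = SVC(kernel="precomputed", probability=True, random_state=0)
svm.fit(K_train, y_train)
svm_scores = svm.predict_proba(K_test)[:, 1]

# Combine the two score streams (equal weights here, purely an assumption)
final_scores = 0.5 * svm_scores + 0.5 * subspace_scores
ranking = np.argsort(-final_scores)  # shots ordered for submission
```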

COST292 experimental framework for TRECVID 2008

2008

In this paper, we give an overview of the four tasks submitted to TRECVID 2008 by COST292. The high-level feature extraction framework comprises four systems. The first system transforms a set of low-level descriptors into the semantic space using Latent Semantic Analysis and utilises neural networks for feature detection. The second system uses a multi-modal classifier based on SVMs and several descriptors.

COST292 experiments for TRECVID 2006

2006

In this paper we give an overview of the four TRECVID tasks submitted by COST292, a European network of institutions in the area of semantic multimodal analysis and retrieval of digital video media. Initially, we present a shot boundary (SB) evaluation method based on results merged using a confidence measure. The two SB detectors used here are presented, one from the Technical University of Delft and one from LaBRI, University of Bordeaux 1, followed by a description of the merging algorithm. The high-level feature extraction task comprises three separate systems. The first system, developed by the National Technical University of Athens (NTUA), utilises a set of MPEG-7 low-level descriptors and Latent Semantic Analysis to detect the features. The second system, developed by Bilkent University, uses a Bayesian classifier trained with a "bag of subregions" for each keyframe. The third system, by the Middle East Technical University (METU), exploits textual information in the video using character recognition methodology. The system submitted to the search task is an interactive retrieval application developed by Queen Mary, University of London, the University of Zilina and ITI from Thessaloniki, combining basic retrieval functionalities in various modalities (i.e. visual, audio, textual) with a user interface supporting the submission of queries using any combination of the available retrieval tools and the accumulation of relevant retrieval results over all queries submitted by a single user during a specified time interval. Finally, the rushes task submission comprises a video summarisation and browsing system specifically designed to present rushes material intuitively and efficiently in a video production environment. This system is the result of joint work.
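
A toy sketch of the kind of confidence-based merging the SB submission describes, assuming each detector emits (frame, confidence) pairs; the matching tolerance and the confidence threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: boundaries that (nearly) coincide across the two
# detectors are kept with the max confidence; unmatched boundaries survive
# only if their confidence clears a threshold.
def merge_boundaries(det_a, det_b, tol=5, thresh=0.6):
    """det_a, det_b: lists of (frame, confidence); returns merged list."""
    merged, used_b = [], set()
    for frame_a, conf_a in det_a:
        match = next((j for j, (f, _) in enumerate(det_b)
                      if j not in used_b and abs(f - frame_a) <= tol), None)
        if match is not None:
            used_b.add(match)
            merged.append((frame_a, max(conf_a, det_b[match][1])))
        elif conf_a >= thresh:
            merged.append((frame_a, conf_a))
    merged += [(f, c) for j, (f, c) in enumerate(det_b)
               if j not in used_b and c >= thresh]
    return sorted(merged)

print(merge_boundaries([(120, 0.9), (300, 0.4)], [(122, 0.7), (450, 0.8)]))
# -> [(120, 0.9), (450, 0.8)]  (the weak unmatched boundary at 300 is dropped)
```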

Florida International University and University of Miami TRECVID 2008 - High Level Feature Extraction

TREC Video Retrieval Evaluation, 2008

This paper describes the FIU-UM group's TRECVID 2008 high-level feature extraction task submission. We used a correlation-based video semantic concept detection system for this task. The system first extracts shot-based low-level audiovisual features from the raw data sources (audio and video files). The resulting numerical feature set is then discretized. Multiple correspondence analysis (MCA) is then used to explore the correlation between items, which are the feature-value pairs generated by the discretization process, and the different concepts. This process generates both positive and negative rules. During the classification process, each instance (shot) is tested against each rule, and the resulting score for each instance determines the final classification. We conducted two runs using two different predetermined values as the score threshold for classification.
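
The scoring step the abstract describes (each shot tested against positive and negative rules, with the summed score thresholded) can be sketched roughly as follows; the rule contents, weights, and threshold are invented for illustration and are not the paper's actual rules.

```python
# Hypothetical sketch: discretized feature-value pairs ("items") are matched
# against positive and negative rules, and the summed rule weights are
# thresholded to classify the shot.
rules = [
    {"item": ("color_moment_1", "high"), "weight": 0.8},   # positive rule
    {"item": ("audio_energy", "low"),    "weight": -0.5},  # negative rule
    {"item": ("edge_ratio", "medium"),   "weight": 0.3},   # positive rule
]

def score_shot(items, rules, threshold=0.4):
    """items: dict of discretized feature -> bin label for one shot."""
    score = sum(r["weight"] for r in rules
                if items.get(r["item"][0]) == r["item"][1])
    return score, score >= threshold

shot = {"color_moment_1": "high", "audio_energy": "low", "edge_ratio": "medium"}
print(score_shot(shot, rules))   # (0.6, True): matched weights sum above threshold
```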

MSRA-USTC-SJTU at TRECVID 2007: High-level feature extraction and search

2007

This paper describes the MSRA-USTC-SJTU experiments for TRECVID 2007. We performed experiments in the high-level feature extraction and automatic search tasks. For high-level feature extraction, we investigated the benefit of unlabeled data via semi-supervised learning, the multi-layer (ML) multi-instance (MI) relations embedded in video via an MLMI kernel, and the correlations between concepts via correlative multi-label learning. For automatic search, we fuse text, visual-example, and concept-based models while using temporal consistency and face information for re-ranking and result refinement.
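
A rough sketch of what temporal-consistency re-ranking can look like, assuming one relevance score per shot in temporal order; the window size and blending weight are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: each shot's score is blended with the mean of its
# temporal neighborhood, damping isolated spikes in low-scoring stretches.
import numpy as np

def temporal_rerank(scores, window=2, alpha=0.5):
    scores = np.asarray(scores, dtype=float)
    smoothed = np.empty_like(scores)
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        smoothed[i] = scores[lo:hi].mean()
    return alpha * scores + (1 - alpha) * smoothed  # blend raw and neighborhood

print(temporal_rerank([0.1, 0.9, 0.1, 0.8, 0.7, 0.9]))
```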

National Institute of Informatics, Japan at TRECVID 2008

2008

This paper reports our experiments for the TRECVID 2008 tasks: high-level feature extraction, search, and content-based copy detection. For the high-level feature extraction task, we use baseline features such as color moments, edge orientation histograms and local binary patterns with SVM classifiers and nearest neighbor classifiers. For the search task, we use different approaches, including search by the baseline features and search by concept suggestion. For the video copy detection task, we study two approaches, based on the pattern of motions in feature point trajectories and on matching all frame pairs using normalized cross correlation. Our approaches can be considered baseline approaches for the evaluation of these tasks.
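
The frame-pair matching idea can be illustrated with a minimal normalized cross correlation (NCC) computation between two grayscale frames; the toy frames below stand in for real video data, and a real copy-detection system would compare many such pairs.

```python
# Hypothetical sketch: NCC between two grayscale frames as a similarity
# score in [-1, 1]; near-duplicates score close to 1.
import numpy as np

def ncc(frame_a, frame_b):
    a = frame_a.astype(float) - frame_a.mean()
    b = frame_b.astype(float) - frame_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

rng = np.random.default_rng(2)
f1 = rng.integers(0, 256, (120, 160))
f2 = f1 + rng.integers(-10, 10, (120, 160))       # slightly perturbed copy
print(ncc(f1, f2))                                # close to 1.0 for a near-duplicate
print(ncc(f1, rng.integers(0, 256, (120, 160))))  # near 0 for unrelated frames
```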

TÜBITAK UZAY at TRECVID 2009: High-Level Feature Extraction and Content-Based Copy Detection

2009

In this notebook paper, we discuss and give an overview of our participation in the High-Level Feature Extraction (HLFE) and Content-Based Copy Detection (CBCD) tasks of TRECVID 2009. In our HLFE system, both visual and audio concept detection have been implemented, and complementary standalone detectors have also been incorporated into the system. For visual concept detection, a generalized visual feature extraction method based on a codebook approach is employed. Our audio system, on the other hand, is a hierarchical system with continuous static spectral features and Gaussian Mixture Model classifiers. Furthermore, for the CBCD task our group submitted runs to the video-only, audio-only and audio + video subtasks. Our video-based CBCD system utilizes local interest points and the bag-of-words concept, while for audio copy detection a voting-based audio fingerprint matching method has been utilized.
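
A minimal sketch of the codebook ("bag of visual words") idea underlying both the HLFE visual features and the CBCD system, assuming local descriptors have already been extracted; the descriptor dimensionality and codebook size are illustrative assumptions.

```python
# Hypothetical sketch: cluster local descriptors with k-means to learn a
# visual-word codebook, then represent each image as a normalized histogram
# over the learned words.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
descriptors = rng.random((1000, 128))   # e.g. 128-d local interest point descriptors

codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(image_descriptors, codebook):
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()            # L1-normalized visual-word histogram

print(bow_histogram(rng.random((80, 128)), codebook).shape)  # (50,)
```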

Columbia University TRECVID-2005 Video Search and High-Level Feature Extraction

2005

High-level feature extraction runs:
• A_DCON1_1: choose the best-performing classifier from the following runs for each concept.
• A_DCON2_2: linear weighted fusion of 4 SVM classifiers using color/texture, a parts-based classifier, and a tf-idf text classifier (a sketch of this fusion follows the list).
• A_DCON3_3: same as above, except that a new input-adaptive fusion method was used.
• A_DCON4_4: average fusion of 4 SVM classifiers using color/texture, a parts-based classifier, and a naïve Bayesian text classifier.
• A_DCON5_5: same as above, except that the naïve Bayesian text classifier was not fused.
• A_DCON6_6: choose the best-performing uni-modal visual classifier among the 4 SVMs and the parts-based classifier for each concept; no classifier fusion was done.
• A_DCON7_7: a single SVM classifier using color and texture trained on the whole training set (90 videos) with 20% negative samples.
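
A minimal sketch of the linear weighted fusion named in A_DCON2_2, assuming each classifier outputs one score per shot; the min-max normalization choice and the weights are illustrative assumptions, not Columbia's tuned values.

```python
# Hypothetical sketch: per-classifier scores are min-max normalized, then
# combined with (here arbitrary) concept-specific weights into a fused score.
import numpy as np

def normalize(s):
    s = np.asarray(s, dtype=float)
    span = s.max() - s.min()
    return (s - s.min()) / span if span else np.zeros_like(s)

def weighted_fusion(score_lists, weights):
    scores = np.stack([normalize(s) for s in score_lists])
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ scores       # one fused score per shot

svm_color = [0.9, 0.2, 0.5]
svm_texture = [0.7, 0.1, 0.6]
parts_based = [0.8, 0.3, 0.4]
tfidf_text = [0.1, 0.9, 0.2]
fused = weighted_fusion([svm_color, svm_texture, parts_based, tfidf_text],
                        weights=[0.4, 0.3, 0.2, 0.1])
print(np.argsort(-fused))               # ranked shot indices
```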