Andrew Zisserman - Academia.edu
Papers by Andrew Zisserman
Lecture Notes in Computer Science, 2004
Medical Image Analysis, 2002
ArXiv, 2017
We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.
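The abstract mentions a preliminary check of class imbalance. A minimal sketch of such a check, assuming a Kinetics-style annotation CSV with one row per clip and a "label" column (the file name and schema here are illustrative, not part of the release):

    import pandas as pd

    # Count clips per action class in an annotation file
    # (assumed schema: one row per clip, with a "label" column).
    ann = pd.read_csv("kinetics400_train.csv")  # hypothetical file name
    counts = ann["label"].value_counts()
    print(len(counts), "classes")
    print("clips per class: min", counts.min(),
          "median", int(counts.median()),
          "max", counts.max())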
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters.
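The inflation idea can be summarised compactly: a pretrained 2D filter is tiled along a new temporal axis and rescaled, so that a static video reproduces the image model's activations. A minimal NumPy sketch of that idea (the function name and tensor layout are illustrative, not taken from the paper's code):

    import numpy as np

    def inflate_2d_filter(w2d, time_depth):
        # w2d: (out_c, in_c, kH, kW) filter from a pretrained 2D ConvNet.
        # Tile it time_depth times along a new temporal axis and rescale
        # by 1/time_depth, so a "boring" static video yields the same
        # responses as the original image model.
        w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_depth, axis=2)
        return w3d / time_depth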
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself: the correspondence between the visual and the audio streams. We introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task and, more interestingly, to result in good visual and audio representations. These features set the new state of the art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
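A hedged sketch of the correspondence task as a training signal: a vision network and an audio network are fused to predict whether a frame and an audio snippet come from the same video. The layer sizes below are placeholders, not the paper's actual architecture:

    import torch
    import torch.nn as nn

    class AVCNet(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.vision = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
            self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())
            self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 1))

        def forward(self, frame, spectrogram):
            v = self.vision(frame)
            a = self.audio(spectrogram)
            return self.fuse(torch.cat([v, a], dim=1))  # logit: same video?

    # Positives pair a frame with audio from the same clip; negatives
    # pair it with audio drawn from a different video.
    loss_fn = nn.BCEWithLogitsLoss()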
ArXiv, 2020
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.
We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Computer Vision – ECCV 2018
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
… Vision, Graphics and …, 2006
We investigate to what extent 'bag of visual words' models can be used to distinguish categories which have significant visual similarity. To this end we develop and optimize a nearest neighbour classifier architecture, which is evaluated on a very challenging database of flower images. The flower categories are chosen to be indistinguishable on colour alone (for example), ...
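As a concrete stand-in for the kind of classifier described, here is a minimal nearest-neighbour rule over bag-of-visual-words histograms. The chi-squared distance is a common choice for comparing such histograms, though not necessarily the paper's exact distance or combination of cues:

    import numpy as np

    def chi2(h1, h2, eps=1e-10):
        # chi-squared distance between two visual-word histograms
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

    def nn_classify(query_hist, train_hists, train_labels):
        # assign the label of the nearest training histogram
        dists = [chi2(query_hist, h) for h in train_hists]
        return train_labels[int(np.argmin(dists))]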
Computer Vision, 2003. Proceedings. …, 2003
Advances in Neural Information Processing Systems, Oct 1, 2003
Super-resolution aims to produce a high-resolution image from a set of one or more low-resolution images by recovering or inventing plausible high-frequency image content. Typical approaches try to reconstruct a high-resolution image using the sub-pixel displacements of several low-resolution images, usually regularized by a generic smoothness prior over the high-resolution image space. Other methods use training data to learn low-to-high-resolution matches, and have been highly successful even in the single- ...
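The reconstruction-based formulation sketched above can be written as a data term over warped, blurred, decimated copies of the high-resolution image, plus a smoothness prior. A rough sketch, with the imaging operators passed in as assumed callables rather than the paper's specific model:

    import numpy as np

    def data_term(x_hi, lows, warps, blur, decimate):
        # sum_k || y_k - D(B(W_k(x))) ||^2 : each low-res frame y_k is a
        # warped (W_k), blurred (B), decimated (D) view of the unknown
        # high-resolution image x.
        return sum(np.sum((y - decimate(blur(warp(x_hi)))) ** 2)
                   for y, warp in zip(lows, warps))

    def smoothness_prior(x_hi, lam=0.01):
        # generic quadratic penalty on horizontal/vertical image gradients
        dx = np.diff(x_hi, axis=1)
        dy = np.diff(x_hi, axis=0)
        return lam * (np.sum(dx ** 2) + np.sum(dy ** 2))

    # MAP estimate: minimise data_term(x, ...) + smoothness_prior(x) over x.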
Lecture Notes in Computer Science, 2015
We describe a method of automated reconstruction of buildings from a set of uncalibrated photographs. The method proceeds in two steps: (i) recovering the camera corresponding to each photograph and a set of sparse scene features, using uncalibrated structure-from-motion techniques developed in the Computer Vision community; (ii) a novel plane-sweep algorithm which progressively constructs a piecewise planar 3D model of the building. In both steps, the rich geometric constraints present in architectural scenes are utilized. It is also demonstrated that window indentations may be computed automatically. The methods are illustrated on an image triplet of a college court at Oxford, and on the CIPA reference image set of Zurich City Hall.
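To make step (ii) concrete, here is an illustrative toy version of a plane sweep: a family of parallel planes is swept through the sparse reconstruction, and each hypothesis is scored by the number of 3D features it supports. The plane parametrisation and inlier threshold are assumptions for this sketch, not the paper's exact algorithm:

    import numpy as np

    def sweep_planes(points3d, normal, offsets, tol=0.02):
        # Sweep planes n.X = d through the scene for each candidate
        # offset d, counting sparse 3D features within tol of each plane,
        # and return the best-supported hypothesis.
        support = [np.sum(np.abs(points3d @ normal - d) < tol)
                   for d in offsets]
        best = int(np.argmax(support))
        return offsets[best], support[best]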