Andrew Zisserman - Academia.edu

Papers by Andrew Zisserman

Automatic learning of British Sign Language from signed TV broadcasts

A Linguistic Feature Vector for the Visual Interpretation of Sign Language

Lecture Notes in Computer Science, 2004

Estimation of the partial volume effect in MRI

Medical Image Analysis, 2002

The Kinetics Human Action Video Dataset

arXiv, 2017

We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.
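
The dataset's organisation lends itself to a simple illustration: each clip is a labelled ~10-second interval of a YouTube video. Below is a minimal sketch of counting clips per class from an annotation file; the file name and column names are assumptions for illustration, not taken from the paper.

```python
import csv
from collections import Counter

def class_counts(annotation_csv):
    """Count clips per action class in a Kinetics-style annotation file.

    Assumed columns (hypothetical): label, youtube_id, time_start, time_end.
    """
    counts = Counter()
    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["label"]] += 1
    return counts

counts = class_counts("kinetics400_train.csv")  # assumed file name
print(len(counts), "classes; smallest class has", min(counts.values()), "clips")
```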

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics. We also introduce a new Two-Stream Inflated 3D ConvNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while...
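
The inflation recipe is concrete enough to sketch: each pretrained 2D kernel is repeated along a new temporal axis and rescaled by the temporal depth, so the inflated 3D filter initially reproduces the 2D network's response on a video of repeated frames. A minimal NumPy sketch under that reading; the array layout follows a common (out, in, kH, kW) convention and is an assumption, not code from the paper.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D conv kernel (O, I, kH, kW) into a 3D kernel
    (O, I, t, kH, kW) by repeating along time and rescaling by 1/t,
    so the 3D filter's response on a temporally constant input
    matches the original 2D filter's response on a single frame."""
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t

w2d = np.random.randn(64, 3, 7, 7).astype(np.float32)
w3d = inflate_2d_kernel(w2d, t=7)
assert w3d.shape == (64, 3, 7, 7, 7)
# Summing the temporal copies recovers the original 2D kernel.
assert np.allclose(w3d.sum(axis=2), w2d, atol=1e-5)
```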

Look, Listen and Learn

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.
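
The correspondence task reduces to binary labels over sampled pairs: a frame paired with audio from the same video is a positive, a frame paired with audio from a different video is a negative. A minimal sketch of that pairing logic; the function and field names are illustrative, not from the paper.

```python
import random

def make_avc_pairs(clips, n_pairs):
    """Build (frame, audio, label) training tuples for an
    audio-visual correspondence task: label 1 if the audio
    comes from the same clip as the frame, 0 otherwise."""
    pairs = []
    for _ in range(n_pairs):
        clip = random.choice(clips)
        if random.random() < 0.5:
            # Positive: frame and audio from the same clip.
            pairs.append((clip["frame"], clip["audio"], 1))
        else:
            # Negative: audio sampled from a different clip.
            other = random.choice([c for c in clips if c is not clip])
            pairs.append((clip["frame"], other["audio"], 0))
    return pairs
```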

A Short Note on the Kinetics-700-2020 Human Action Dataset

arXiv, 2020

We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.

A Short Note on the Kinetics-700 Human Action Dataset

We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.

The Visual Centrifuge: Model-Free Layered Video Representations

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Massively Parallel Video Networks

Computer Vision – ECCV 2018

Exploiting Temporal Context for 3D Human Pose Estimation in the Wild

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Using weak continuity constraints

Learning Class-Specific Edges for Object Detection and Segmentation

Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2006

A visual vocabulary for flower classification

We investigate to what extent 'bag of visual words' models can be used to distinguish categories which have significant visual similarity. To this end we develop and optimize a nearest neighbour classifier architecture, which is evaluated on a very challenging database of flower images. The flower categories are chosen to be indistinguishable on colour alone (for example), ...
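
The pipeline behind such models is standard: quantize local descriptors against a visual vocabulary, build a normalized histogram per image, and classify by nearest neighbour under a histogram distance. A minimal NumPy sketch; the chi-squared distance used here is one common choice, and the paper's exact distance and vocabulary are not specified in this excerpt.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize local descriptors (N, D) against a vocabulary (K, D)
    and return an L1-normalized K-bin histogram of visual words."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2_nn(query, train_hists, train_labels, eps=1e-10):
    """Return the label of the nearest training histogram
    under the chi-squared distance."""
    d = 0.5 * (((train_hists - query) ** 2) / (train_hists + query + eps)).sum(axis=1)
    return train_labels[d.argmin()]
```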

Video Google: A text retrieval approach to object matching in videos

Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), 2003

A sampled texture prior for image super-resolution

Advances in Neural Information Processing Systems, 2003

Super-resolution aims to produce a high-resolution image from a set of one or more low-resolution images by recovering or inventing plausible high-frequency image content. Typical approaches try to reconstruct a high-resolution image using the sub-pixel displacements of several low-resolution images, usually regularized by a generic smoothness prior over the high-resolution image space. Other methods use training data to learn low-to-high-resolution matches, and have been highly successful even in the single-...
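
The reconstruction-plus-prior formulation summarized above is often written as a MAP estimate; in generic notation (the symbols here are illustrative, not the paper's):

```latex
% Generative model: each low-resolution image y_k is a warped,
% blurred, downsampled view of the high-resolution image x.
y_k = W_k x + \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, \sigma^2 I)

% MAP reconstruction with a generic smoothness prior \|\Gamma x\|^2:
\hat{x} = \arg\min_x \; \sum_k \| y_k - W_k x \|^2 \;+\; \lambda \| \Gamma x \|^2
```

As the title suggests, the paper's contribution is to replace the generic smoothness term with a texture prior sampled from example images.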

DisLocation: Scalable Descriptor Distinctiveness for Location Recognition

Lecture Notes in Computer Science, 2015

Overcoming Registration Uncertainty in Image Super-Resolution: Maximize or Marginalize?

Automated architecture reconstruction from close-range photogrammetry

We describe a method of automated reconstruction of buildings from a set of uncalibrated photographs. The method proceeds in two steps: (i) recovering the camera corresponding to each photograph and a set of sparse scene features, using uncalibrated structure-from-motion techniques developed in the Computer Vision community; (ii) a novel plane-sweep algorithm which progressively constructs a piecewise planar 3D model of the building. In both steps, the rich geometric constraints present in architectural scenes are utilized. It is also demonstrated that window indentations may be computed automatically. The methods are illustrated on an image triplet of a college court at Oxford, and on the CIPA reference image set of Zurich City Hall.
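
The plane-sweep step can be illustrated independently of the paper's specifics: sweep a candidate plane along a direction suggested by the scene and score each position by how many reconstructed points it supports. A hypothetical scoring loop follows; the paper's actual algorithm is richer, exploiting architectural constraints.

```python
import numpy as np

def sweep_plane(points, normal, offsets, tol=0.05):
    """Score candidate planes n.X = d for each offset d by counting
    3D points within tol of the plane; return the best offset and its
    support. points: (N, 3) reconstructed scene points; normal: unit (3,)."""
    dist = points @ normal  # signed distance to plane through origin (unit normal)
    support = [(np.abs(dist - d) < tol).sum() for d in offsets]
    return offsets[int(np.argmax(support))], max(support)

pts = np.random.rand(1000, 3)  # stand-in for a sparse reconstruction
best_d, votes = sweep_plane(pts, np.array([0.0, 0.0, 1.0]), np.linspace(0, 1, 50))
```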

Motion from point matches using affine epipolar geometry
