Angjoo Kanazawa | University of Maryland, College Park
Papers by Angjoo Kanazawa
ArXiv, 2015
With the rise of Augmented Reality, Virtual Reality, and 3D printing, methods for acquiring 3D models from the real world are more important than ever. One approach to generating 3D models is to modify an existing template 3D mesh to fit the pose and shape of similar objects in images. To model the pose of a highly articulated and deformable object, it is essential to understand how an object class can articulate and deform. In this paper we propose to learn a class model of articulation and deformation from a set of annotated Internet images. To do so, we incorporate the idea of local stiffness, which specifies the amount of distortion allowed for a local region. Our system jointly learns the stiffness as it deforms a template 3D mesh to fit the pose of the objects in images. We show that this seemingly complex task can be solved with a sequence of convex optimization programs. We demonstrate our approach on two highly articulated and deformable animals, cats and horses. Our appro...
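To make the local-stiffness idea concrete, here is a minimal sketch of a stiffness-weighted deformation energy over mesh edges. It is an illustrative toy objective under assumed variable names, not the paper's actual convex formulation, and the log-barrier regularizer is an assumption added so the sketch is well posed.

```python
import numpy as np

def stiffness_weighted_energy(V_template, V_deformed, edges, stiffness, reg=0.1):
    """Toy deformation energy with per-edge stiffness weights.

    V_template, V_deformed: (N, 3) vertex arrays of the template and deformed mesh.
    edges: (E, 2) integer array of vertex index pairs.
    stiffness: (E,) weights in (0, 1]; higher = less distortion allowed locally.
    """
    d_t = V_template[edges[:, 0]] - V_template[edges[:, 1]]   # template edge vectors
    d_d = V_deformed[edges[:, 0]] - V_deformed[edges[:, 1]]   # deformed edge vectors
    distortion = np.sum((d_d - d_t) ** 2, axis=1)             # per-edge squared distortion
    # Stiff edges pay more for distortion; a log barrier discourages the trivial
    # solution of driving all stiffness values to zero.
    return np.sum(stiffness * distortion) - reg * np.sum(np.log(stiffness))
```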
2019 IEEE/CVF International Conference on Computer Vision (ICCV)
Figure 1: Autoregressive prediction of human 3D motion from video. We present Predicting Human Dynamics (PHD), a neural autoregressive framework that takes past video frames as input to predict the motion of a 3D human body model. As shown, PHD takes in a video sequence of a person and predicts the future 3D human motion; the predictions are shown from two different viewpoints.
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
We present an approach to matching images of objects in fine-grained datasets without using part annotations, with an application to the challenging problem of weakly supervised single-view reconstruction. This is in contrast to prior works that require part annotations, since matching objects across class and pose variations is challenging with appearance features alone. We overcome this challenge through a novel deep learning architecture, WarpNet, that aligns an object in one image with a different object in another. We exploit the structure of the fine-grained dataset to create artificial data for training this network in an unsupervised-discriminative learning approach. The output of the network acts as a spatial prior that allows generalization at test time to match real images across variations in appearance, viewpoint, and articulation. On the CUB-200-2011 dataset of bird categories, we improve the AP over an appearance-only network by 13.6%. We further demonstrate that our WarpNet matches, together with the structure of fine-grained datasets, allow single-view reconstructions with quality comparable to using annotated point correspondences.
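For intuition, the sketch below shows the general pattern of predicting a parametric warp from an image pair and applying it differentiably. It uses a plain affine warp and a toy network for illustration; WarpNet's actual architecture and warp parameterization are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpRegressor(nn.Module):
    """Predict a parametric warp that aligns a source image to a target image.

    Simplified to a 6-parameter affine warp; the real method's warp
    parameterization and network design differ.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, 6)
        # Initialize to the identity transform.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, source, target):
        x = torch.cat([source, target], dim=1)          # (B, 6, H, W)
        theta = self.fc(self.features(x).flatten(1))    # (B, 6) affine parameters
        grid = F.affine_grid(theta.view(-1, 2, 3), source.shape, align_corners=False)
        return F.grid_sample(source, grid, align_corners=False)  # warped source
```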
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
2019 IEEE/CVF International Conference on Computer Vision (ICCV)
We present the first method to perform automatic 3D pose, shape, and texture capture of animals from images acquired in the wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape, and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, they travel in herds, and they look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training. Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision. Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture), goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose, and texture. Code and data are available at https://github.com/silviazuffi/smalst.
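As an illustration of the photometric loss mentioned above, here is a hedged sketch assuming a differentiable renderer `render` (a hypothetical callable) that returns an RGB image and a silhouette for given vertices, texture map, and camera; the exact SMALST objective differs.

```python
import torch

def photometric_loss(pred_verts, pred_texture, camera, image, mask, render):
    """Compare a rendered, textured mesh against the observed image.

    `render` is assumed to be a differentiable renderer returning an RGB image
    and a silhouette for the given vertices, texture map, and camera.
    """
    rendered_rgb, rendered_sil = render(pred_verts, pred_texture, camera)
    # Only penalize pixels where both the prediction and the ground truth cover
    # the animal, so background clutter does not dominate the loss.
    overlap = (rendered_sil * mask).unsqueeze(1)
    photo = torch.sum(overlap * (rendered_rgb - image) ** 2) / (overlap.sum() + 1e-8)
    sil = torch.mean((rendered_sil - mask) ** 2)
    return photo + sil
```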
2017 International Conference on 3D Vision (3DV)
Existing markerless motion capture methods often assume known backgrounds, static cameras, and sequence-specific motion priors, limiting their application scenarios. Here we present a fully automatic method that, given multi-view videos, estimates 3D human pose and body shape. We take the recently proposed SMPLify method [12] as the base method and extend it in several ways. First, we fit a 3D human body model to 2D features detected in multi-view images. Second, we use a CNN method to segment the person in each image and fit the 3D body model to the contours, further improving accuracy. Third, we utilize a generic and robust DCT temporal prior to handle the left-right swapping issue sometimes introduced by the 2D pose estimator. Validation on standard benchmarks shows our results are comparable to the state of the art and also provide a realistic 3D shape avatar. We also demonstrate accurate results on HumanEva and on challenging monocular sequences of dancing from YouTube.
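A DCT temporal prior of this kind can be illustrated by projecting each joint trajectory onto a low-frequency DCT basis and penalizing the residual. The sketch below is a simplified version of such a prior; the number of retained frequencies is an arbitrary choice, not the paper's setting.

```python
import numpy as np

def dct_temporal_prior(joint_traj, num_low_freq=5):
    """Penalize high-frequency motion of one joint's trajectory.

    joint_traj: (T, 3) positions of a joint over T frames.
    Projects the trajectory onto the first `num_low_freq` DCT basis vectors and
    returns the energy of the residual (the high-frequency part).
    """
    T = joint_traj.shape[0]
    t = np.arange(T)
    # Orthonormal DCT-II basis, one basis vector per row.
    basis = np.cos(np.pi * (t[None, :] + 0.5) * np.arange(T)[:, None] / T)
    basis[0] *= np.sqrt(1.0 / T)
    basis[1:] *= np.sqrt(2.0 / T)
    coeffs = basis @ joint_traj                                  # (T, 3) DCT coefficients
    low_freq = basis[:num_low_freq].T @ coeffs[:num_low_freq]    # smooth reconstruction
    return np.sum((joint_traj - low_freq) ** 2)
```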
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Figure 1: Human Mesh Recovery (HMR): end-to-end adversarial learning of human pose and shape. Each example shows the input image, the reconstruction, side and top-down views, and a part segmentation. We describe a real-time framework for recovering the 3D joint angles and shape of the body from a single RGB image. The first two rows show results from our model trained with some 2D-to-3D supervision; the bottom row shows results from a model trained in a fully weakly-supervised manner without any paired 2D-to-3D supervision. We infer the full 3D body even in cases of occlusion and truncation. Note that we capture head and limb orientations.
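One concrete piece of such a framework is the 2D reprojection loss under a weak-perspective camera. The sketch below shows that loss in a simplified form, with tensor shapes chosen for illustration rather than taken from the released code.

```python
import torch

def reprojection_loss(joints3d, camera, joints2d_gt, visibility):
    """2D keypoint reprojection loss with a weak-perspective camera.

    joints3d:    (B, K, 3) predicted 3D joints.
    camera:      (B, 3) per-image [scale, tx, ty] weak-perspective parameters.
    joints2d_gt: (B, K, 2) annotated 2D keypoints; visibility: (B, K) in {0, 1}.
    """
    scale = camera[:, None, :1]                       # (B, 1, 1)
    trans = camera[:, None, 1:]                       # (B, 1, 2)
    joints2d = scale * joints3d[..., :2] + trans      # orthographic projection
    err = torch.abs(joints2d - joints2d_gt).sum(-1)   # L1 error per keypoint
    return (visibility * err).sum() / (visibility.sum() + 1e-8)
```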
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation gives rise to smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. Though annotated data is always limited, there are millions of videos uploaded daily on the Internet. In this work, we harvest this Internet-scale source of unlabeled data by training our model on unlabeled video with pseudo-ground-truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground-truth 2D pose monotonically improves 3D prediction performance. We evaluate our model, Human Mesh and Motion Recovery (HMMR), on the recent challenging dataset of 3D Poses in the Wild and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning. The project website with video, code, and data can be found at https://akanazawa.github.io/human_dynamics/.
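A temporal encoding of image features can be as simple as 1D convolutions over per-frame features. The following is a minimal sketch of that idea; the layer sizes are arbitrary placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """1D temporal convolution over per-frame image features.

    Per-frame features (e.g., from a ResNet) are fused across a short window so
    the pooled representation reflects motion, not just a single frame.
    """
    def __init__(self, feat_dim=2048, hidden=1024):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, frame_feats):          # (B, T, feat_dim)
        x = frame_feats.transpose(1, 2)      # (B, feat_dim, T) for Conv1d
        return self.conv(x).transpose(1, 2)  # (B, T, hidden) temporal features
```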
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Animals from images. We learn an articulated, 3D, statistical shape model of animals using very little training data. We fit the shape and pose of the model to 2D image cues, showing how it generalizes to previously unseen shapes.
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
We present SfSNet, an end-to-end learning framework for producing an accurate decomposition of an unconstrained human face image into shape, reflectance, and illuminance. SfSNet is designed to reflect a physical Lambertian rendering model. SfSNet learns from a mixture of labeled synthetic and unlabeled real-world images. This allows the network to capture low-frequency variations from synthetic images and high-frequency details from real images through the photometric reconstruction loss. SfSNet consists of a new decomposition architecture with residual blocks that learns a complete separation of albedo and normal. This is used along with the original image to predict lighting. SfSNet produces significantly better quantitative and qualitative results than state-of-the-art methods for inverse rendering and for independent normal and illumination estimation.
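The Lambertian model underlying such a decomposition reconstructs the image as albedo times shading, with shading computed from per-pixel normals and second-order spherical-harmonic lighting. A minimal sketch follows; normalization constants of the SH basis are omitted for brevity, so this is illustrative rather than a drop-in renderer.

```python
import numpy as np

def sh_shading(normals, light):
    """Lambertian shading from per-pixel normals and 2nd-order SH lighting.

    normals: (H, W, 3) unit normals; light: (9,) spherical-harmonic coefficients.
    Returns (H, W) shading; the reconstructed image is albedo * shading.
    SH normalization constants are omitted for clarity.
    """
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    basis = np.stack([
        np.ones_like(nx),            # constant term
        ny, nz, nx,                  # first-order terms
        nx * ny, ny * nz,            # second-order terms
        3 * nz ** 2 - 1,
        nx * nz, nx ** 2 - ny ** 2,
    ], axis=-1)                      # (H, W, 9)
    return basis @ light             # (H, W) shading
```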
Computer Vision – ECCV 2018
We present a learning framework for recovering the 3D shape, camera, and texture of an object from a single image. The shape is represented as a deformable 3D mesh model of an object category, where a shape is parameterized by a learned mean shape and a per-instance predicted deformation. Our approach allows leveraging an annotated image collection for training, where the deformable model and the 3D prediction mechanism are learned without relying on ground-truth 3D or multi-view supervision. Our representation enables us to go beyond existing 3D prediction approaches by incorporating texture inference as prediction of an image in a canonical appearance space. Additionally, we show that semantic keypoints can be easily associated with the predicted shapes. We present qualitative and quantitative results of our approach on the CUB and PASCAL3D datasets and show that we can learn to predict diverse shapes and textures across objects using only annotated image collections. The project website can be found at https://akanazawa.github.io/cmr/.
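The shape representation, a learned category-level mean plus a per-instance predicted deformation, can be sketched as follows. The module and feature dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CategoryShapePredictor(nn.Module):
    """Shape as a learned category-level mean plus per-instance deformation.

    `mean_shape` is a learned (V, 3) template shared across the category; the
    network predicts a per-vertex offset from image features.
    """
    def __init__(self, num_verts, feat_dim=512):
        super().__init__()
        self.mean_shape = nn.Parameter(torch.randn(num_verts, 3) * 0.01)
        self.deform = nn.Linear(feat_dim, num_verts * 3)

    def forward(self, img_feat):                       # (B, feat_dim)
        delta = self.deform(img_feat).view(-1, self.mean_shape.shape[0], 3)
        return self.mean_shape.unsqueeze(0) + delta    # (B, V, 3) predicted mesh
```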
2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct 1, 2019
Pixel-aligned Implicit Function (PIFu): We present the pixel-aligned implicit function (PIFu), which allows recovery of high-resolution 3D textured surfaces of clothed humans from a single input image (top row). Our approach can digitize intricate variations in clothing, such as wrinkled skirts and high heels, as well as complex hairstyles. The shape and texture can be fully recovered, including largely unseen regions such as the back of the subject. PIFu can also be naturally extended to multi-view input images (bottom row).
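A pixel-aligned implicit function evaluates a 3D query point by sampling the image feature at the point's projection and passing the (feature, depth) pair to an MLP. The sketch below illustrates this core step, assuming a user-supplied `project` callable and omitting the full surface-reconstruction pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedImplicitFn(nn.Module):
    """Occupancy of a 3D point from its pixel-aligned image feature and depth.

    Project a 3D query point into the image, sample the feature map at that
    pixel, and feed (feature, depth) to an MLP that predicts inside/outside.
    """
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, feat_map, points, project):
        # feat_map: (B, C, H, W) image features; points: (B, N, 3) query points.
        # `project` maps 3D points to normalized image coords (B, N, 2) in [-1, 1]
        # and depth values (B, N, 1).
        uv, z = project(points)
        sampled = F.grid_sample(feat_map, uv.unsqueeze(2),      # (B, C, N, 1)
                                align_corners=False).squeeze(-1).transpose(1, 2)
        return self.mlp(torch.cat([sampled, z], dim=-1))        # (B, N, 1) occupancy
```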
Convolutional Neural Networks (ConvNets) have shown excellent results on many visual classification tasks. With the exception of ImageNet, these datasets are carefully crafted such that objects are well-aligned at similar scales. Naturally, the feature learning problem gets more challenging as the amount of variation in the data increases, since the models have to learn to be invariant to certain changes in appearance. Recent results on the ImageNet dataset show that, given enough data, ConvNets can learn such invariances and produce very discriminative features [1]. But could we do more: use fewer parameters and less data, and learn more discriminative features, if certain invariances were built into the learning process? In this paper we present a simple model that allows ConvNets to learn features in a locally scale-invariant manner without increasing the number of model parameters. We show on a modified MNIST dataset that, when faced with scale variation, building in scale invariance allows ConvNets to learn more discriminative features with a reduced chance of overfitting.
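One way to build in local scale invariance without extra parameters is to apply the same filters to several rescaled copies of the input and max-pool the responses over scale at each location. The sketch below illustrates this pattern; the scales and layer sizes are assumptions, and it is not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleInvariantConv(nn.Module):
    """Locally scale-invariant convolution by pooling responses over scale.

    The same filters are applied to several rescaled copies of the input; the
    responses are resized back and max-pooled over scale at every location, so
    no extra filter parameters are introduced.
    """
    def __init__(self, in_ch, out_ch, scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.scales = scales

    def forward(self, x):
        h, w = x.shape[-2:]
        responses = []
        for s in self.scales:
            xs = F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            r = self.conv(xs)
            responses.append(F.interpolate(r, size=(h, w), mode='bilinear',
                                           align_corners=False))
        return torch.stack(responses, dim=0).max(dim=0).values  # max over scales
```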
Lecture Notes in Computer Science, 2012
We propose a novel approach to fine-grained image classification in which instances from different classes share common parts but have wide variation in shape and appearance. We use dog breed identification as a test case to show that extracting corresponding parts improves classification performance. This domain is especially challenging since the appearance of corresponding parts can vary dramatically; for example, the faces of bulldogs and beagles are very different. To find accurate correspondences, we build exemplar-based geometric and appearance models of dog breeds and their face parts. Part correspondence allows us to extract and compare descriptors at corresponding image locations. Our approach also features a hierarchy of parts (e.g., face and eyes) and breed-specific part localization. We achieve a 67% recognition rate on a large real-world dataset of 133 dog breeds and 8,351 images, and experimental results show that accurate part localization significantly increases classification performance compared to state-of-the-art approaches.