Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models
Seoul National University¹, Naver Webtoon AI²
*Indicates Equal Contribution
Given a 3D object mesh, we generate numerous 3D human-object interaction samples and learn a novel affordance representation, Comprehensive Affordance (ComA), which models both contact and non-contact patterns.
Abstract
Understanding the inherent human knowledge in interacting with a given environment (e.g., affordance) is essential for improving AI to better assist humans. While existing approaches primarily focus on human-object contacts during interactions, such an affordance representation cannot fully address other important aspects of human-object interactions (HOIs), i.e., patterns of relative positions and orientations. In this paper, we introduce a novel affordance representation, named Comprehensive Affordance (ComA). Given a 3D object mesh, ComA models the distribution of relative orientation and proximity of vertices in interacting human meshes, capturing plausible patterns of contact, relative orientations, and spatial relationships. To construct the distribution, we present a novel pipeline that synthesizes diverse and realistic 3D HOI samples given any 3D target object mesh. The pipeline leverages a pre-trained 2D inpainting diffusion model to generate HOI images from object renderings and lifts them into 3D. To avoid the generation of false affordances, we propose a new inpainting framework, Adaptive Mask Inpainting. Since ComA is built on synthetic samples, it can extend to any object in an unbounded manner. Through extensive experiments, we demonstrate that ComA outperforms competitors that rely on human annotations in modeling contact-based affordance. Importantly, we also showcase the potential of ComA to reconstruct human-object interactions in 3D through an optimization framework, highlighting its advantage in incorporating both contact and non-contact properties.
Key Takeaways
Traditional affordance representations focus on contact in human-object interactions. However, important patterns like relative orientations and positions cannot be expressed through contact alone. These overlooked aspects are crucial for a fuller understanding of affordance.
Comprehensive Affordance (ComA) is the first to capture both high-resolution contact and non-contact interactions, offering a complete view of object affordances.
Comprehensive Affordance (ComA) is essentially the joint distribution of relative orientation and proximity between human surface points and object surface points.
Such distributions contain rich pointwise information on proximity and orientation, which can be used to derive various forms of affordance, including contact, orientational tendency, and spatial relation.
Since ComA models pointwise distributions, we can infer richer contact information such as local contact correspondences.
We present a scalable method to learn ComA for any 3D object. In a nutshell, (1) we leverage a pre-trained diffusion model to generate large-scale samples of 3D humans interacting with the given object, and (2) we use the generated dataset to learn ComA (sketched below).
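To make the representation concrete, here is a minimal NumPy sketch of how one generated HOI sample could be accumulated into the per-point joint distribution. The function and binning choices are our own illustration, not the paper's implementation.

```python
import numpy as np

def accumulate_coma(obj_vertex, human_vertices, human_normals,
                    hist, dist_bins, cos_bins):
    """Accumulate one HOI sample into the ComA histogram of a single
    object surface point (hist is updated in place).

    obj_vertex: (3,) position of the object surface point.
    human_vertices, human_normals: (N, 3) human mesh vertices and normals.
    hist: (D, C) joint histogram over proximity x relative orientation.
    dist_bins, cos_bins: bin edges, of lengths D+1 and C+1.
    """
    offsets = human_vertices - obj_vertex              # (N, 3) relative positions
    dists = np.linalg.norm(offsets, axis=1)            # proximity per human point
    dirs = offsets / np.clip(dists[:, None], 1e-8, None)
    # relative orientation: cosine between each human surface normal and the
    # direction from the object point toward that human point
    cos = np.einsum('ij,ij->i', human_normals, dirs)
    d_idx = np.clip(np.digitize(dists, dist_bins) - 1, 0, hist.shape[0] - 1)
    c_idx = np.clip(np.digitize(cos, cos_bins) - 1, 0, hist.shape[1] - 1)
    np.add.at(hist, (d_idx, c_idx), 1.0)               # accumulate joint counts
```

Under this view, contact affordance can be read off as the mass in the smallest-distance bins, orientational tendency as the marginal over the cosine axis, and spatial affordance from where high-mass human points concentrate around the object.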
Overview
Our 3D human-object interaction sample generation pipeline (1) renders the 3D object from multiple viewpoints, (2) inserts humans interacting with the object into these renderings using a pre-trained inpainting diffusion model, and (3) lifts the inferred humans back into 3D space by resolving depth ambiguities through our specialized optimization pipeline.
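A compact skeleton of the three stages is sketched below. All stage callables are hypothetical placeholders (each corresponds to a separate model or optimizer in practice), so they are injected as arguments rather than presented as a fixed API.

```python
from typing import Any, Callable, List, Tuple

def generate_hoi_samples(object_mesh: Any,
                         sample_viewpoints: Callable,
                         render: Callable,
                         inpaint_human: Callable,
                         estimate_human_mesh: Callable,
                         resolve_depth: Callable,
                         n_per_view: int = 4) -> List[Any]:
    """Skeleton of the three-stage generation pipeline (names are ours)."""
    samples: List[Tuple[Any, Any]] = []
    for camera in sample_viewpoints(object_mesh):      # (1) multi-view renders
        rendering = render(object_mesh, camera)
        for _ in range(n_per_view):
            hoi_image = inpaint_human(rendering)       # (2) adaptive mask inpainting
            human = estimate_human_mesh(hoi_image)     # image -> posed human mesh
            samples.append((human, camera))
    # (3) lift into 3D: fix each sample's depth using the other samples
    # as weak auxiliary cues (see the depth sketch further below)
    return [resolve_depth(human, camera, samples) for human, camera in samples]
```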
Adaptive Mask Inpainting
When inserting humans into an image via inpainting, the object geometry and details within the mask region are not preserved, resulting in false affordances. Our Adaptive Mask Inpainting alleviates this by progressively shrinking the inpainting region over diffusion timesteps. Check out the Hugging Face pipeline here.
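The released pipeline applies this inside the denoising loop; the sketch below isolates just the mask-update rule in plain NumPy, assuming a person segmentation of the intermediate denoised estimate is available at each timestep (all names and the simple dilation are our own).

```python
import numpy as np

def adapt_inpaint_mask(initial_mask: np.ndarray,
                       person_mask: np.ndarray,
                       slack_iters: int) -> np.ndarray:
    """One adaptive-mask update: tighten the inpainting region around the
    person currently emerging in the denoised estimate.

    initial_mask: (H, W) bool, user-given region where a human may appear.
    person_mask:  (H, W) bool, person segmentation of the intermediate
                  denoised image at the current diffusion timestep.
    slack_iters:  dilation radius kept around the person; scheduled to
                  decrease over timesteps so the mask shrinks progressively.
    """
    region = person_mask.copy()
    for _ in range(slack_iters):
        # 4-neighborhood binary dilation, keeping slack around the person
        region = (region
                  | np.roll(region, 1, axis=0) | np.roll(region, -1, axis=0)
                  | np.roll(region, 1, axis=1) | np.roll(region, -1, axis=1))
    # never inpaint outside the original mask; pixels dropped from the mask
    # are restored from the object rendering, preserving its geometry
    return initial_mask & region
```

Because the mask only ever shrinks toward the person, object surfaces that the human does not actually occlude are copied back from the rendering instead of being hallucinated, which is what suppresses false affordances.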
Depth Optimization using Weak Auxiliary Cues
For each provided image, we find similar images containing relevant human poses from different viewpoints using RANSAC on joint reprojection error. These images serve as weak auxiliary cues to optimize depth and resolve ambiguities in 3D space.
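As a simplified sketch of the underlying idea: once an auxiliary view with matched joints and a known relative camera pose is found, the ambiguous depth can be grid-searched by reprojection error. The names and the single-offset parameterization below are our assumptions, not the paper's exact optimizer.

```python
import numpy as np

def project(K: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Pinhole projection of (J, 3) camera-space points with intrinsics K."""
    x = (K @ X.T).T
    return x[:, :2] / x[:, 2:3]

def resolve_root_depth(joints_cam, joints_2d_aux, K_aux, R, t, depth_candidates):
    """Grid-search the depth offset that best reprojects into an auxiliary view.

    joints_cam: (J, 3) joints in the reference camera frame; their position
                along the viewing ray (depth) is the ambiguous quantity.
    joints_2d_aux: (J, 2) matched joint locations in the auxiliary image.
    R, t: reference-to-auxiliary rigid transform; K_aux: aux intrinsics.
    """
    ray = joints_cam.mean(axis=0)
    ray = ray / np.linalg.norm(ray)              # viewing ray through the body
    best_depth, best_err = None, np.inf
    for d in depth_candidates:
        shifted = joints_cam + d * ray           # slide the body along the ray
        in_aux = shifted @ R.T + t               # move into the auxiliary frame
        err = np.linalg.norm(project(K_aux, in_aux) - joints_2d_aux, axis=1).mean()
        if err < best_err:
            best_depth, best_err = d, err
    return best_depth
```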
ComA enables diverse applications, including reconstructing human-object interactions (see figure). We can adapt these applications to any 3D object using our dataset generation method.
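One way both contact and non-contact properties could drive such a reconstruction is to score a candidate human placement by the likelihood of its observed (proximity, orientation) pairs under the learned ComA histogram (same binning as the earlier sketch). The objective below is illustrative, not the paper's exact loss.

```python
import numpy as np

def coma_nll(hist, dist_bins, cos_bins, dists, coses, eps=1e-8):
    """Negative log-likelihood of observed (proximity, orientation) pairs
    under a learned ComA histogram for one object point (illustrative)."""
    p = hist / max(hist.sum(), eps)              # normalize counts to a distribution
    d_idx = np.clip(np.digitize(dists, dist_bins) - 1, 0, p.shape[0] - 1)
    c_idx = np.clip(np.digitize(coses, cos_bins) - 1, 0, p.shape[1] - 1)
    return -np.log(p[d_idx, c_idx] + eps).mean()
```

Minimizing such a term over the human's pose and translation pulls the body toward configurations that ComA deems plausible, penalizing implausible contacts and implausible free-space placements alike.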
Video
Results
Contact-based Affordance
Motorcycle
Keyboard
Skateboard
Soccer Ball
Suitcase
Tennis Racket
Orientational Affordance
Spatial Affordance
Input
Full Body
Hand
Face
BibTeX
@inproceedings{ComA,
  title={Beyond the Contact: Discovering Comprehensive Affordance for 3D Objects from Pre-trained 2D Diffusion Models},
  author={Kim, Hyeonwoo and Han, Sookwan and Kwon, Patrick and Joo, Hanbyul},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}