MeshAvatar: Learning High-quality Triangular
Human Avatars from Multi-view Videos
Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, Yebin Liu
Tsinghua University, NNKosmos Technology
ECCV 2024
Given multi-view videos of a specific subject, our method learns a triangular avatar of that subject with intrinsic material decomposition. After training, the avatar not only supports synthesis under novel poses and novel lighting conditions, but also enables texture editing and material manipulation.
Abstract
We present a novel pipeline for learning high-quality triangular human avatars from multi-view videos. Recent methods for avatar learning are typically based on neural radiance fields (NeRF), which are not compatible with the traditional graphics pipeline and pose great challenges for operations such as editing or synthesis under different environments. To overcome these limitations, our method represents the avatar with an explicit triangular mesh extracted from an implicit SDF field, complemented by an implicit material field conditioned on the given pose. Leveraging this triangular avatar representation, we incorporate physics-based rendering to accurately decompose geometry and texture. To enhance both geometric and appearance details, we further employ a 2D UNet as the network backbone and introduce pseudo normal ground truth as additional supervision. Experiments show that our method can learn triangular avatars with high-quality geometry reconstruction and plausible material decomposition, inherently supporting editing, manipulation, and relighting operations.
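To make this representation concrete, below is a minimal sketch in PyTorch of the two components the abstract describes: an implicit SDF from which an explicit triangle mesh is extracted via marching cubes, and a pose-conditioned material field queried at surface points. The plain MLPs (the paper uses a 2D UNet backbone), network sizes, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: plain MLPs stand in for the paper's 2D UNet backbone.
import torch
import torch.nn as nn
from skimage import measure  # marching cubes for mesh extraction


class SDFField(nn.Module):
    """Implicit signed distance field: 3D point -> signed distance."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):          # x: (N, 3) points
        return self.net(x)         # (N, 1) signed distances


class MaterialField(nn.Module):
    """Pose-conditioned material field: (point, pose) -> albedo, roughness."""
    def __init__(self, pose_dim=72, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 albedo channels + 1 roughness
        )

    def forward(self, x, pose):    # x: (N, 3); pose: (1, pose_dim)
        out = self.net(torch.cat([x, pose.expand(x.shape[0], -1)], dim=-1))
        return torch.sigmoid(out[:, :3]), torch.sigmoid(out[:, 3:])


@torch.no_grad()
def extract_mesh(sdf, resolution=128, bound=1.0):
    """Extract an explicit triangle mesh from the SDF via marching cubes."""
    grid = torch.linspace(-bound, bound, resolution)
    pts = torch.stack(torch.meshgrid(grid, grid, grid, indexing="ij"), dim=-1)
    vals = sdf(pts.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(vals.numpy(), level=0.0)
    verts = verts / (resolution - 1) * 2 * bound - bound  # index -> world coords
    return verts, faces
```

In the full method, the extracted mesh is skinned to the target pose and shaded with physics-based ray tracing using the predicted materials, which is what makes relighting and material editing possible.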
Method Overview
Our pipeline learns a hybrid human avatar represented as (a) an explicit skinned mesh and (b) implicit pose-dependent material fields. Such a representation inherently supports (c) physics-based ray tracing and can be trained end to end using (d) normal estimation as an additional supervision signal.
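As a rough illustration of the normal supervision in (d), the sketch below compares normals rendered from the current mesh against pseudo ground-truth normals predicted by an off-the-shelf estimator. The cosine-plus-L1 form and the 0.1 weight are assumptions chosen for illustration; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F


def normal_loss(rendered_normals, pseudo_gt_normals, mask):
    """Supervise rendered normals with estimated pseudo ground truth.

    rendered_normals:  (H, W, 3) normals rasterized from the current mesh
    pseudo_gt_normals: (H, W, 3) normals from an off-the-shelf estimator
    mask:              (H, W) boolean foreground mask of valid pixels
    """
    n_render = F.normalize(rendered_normals[mask], dim=-1)   # (N, 3)
    n_pseudo = F.normalize(pseudo_gt_normals[mask], dim=-1)  # (N, 3)
    # Cosine term penalizes angular error; L1 term stabilizes early training.
    cos = 1.0 - (n_render * n_pseudo).sum(dim=-1)
    l1 = (n_render - n_pseudo).abs().sum(dim=-1)
    return cos.mean() + 0.1 * l1.mean()
```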
Video Presentation
Comparisons
Quantitative Results
| | Ours | AvatarReX | AnimatableGaussians | AnimatableGaussians* | Xu et al. | Lin et al. | IntrinsicAvatar |
|---|---|---|---|---|---|---|---|
| Representation | hybrid | SDF | 3DGS | 3DGS | SDF | SDF | SDF |
| Relightable? | ✔ | | | ✔ | ✔ | ✔ | ✔ |
| Training Time (~100 frames) | ~3h | | | | | 2.5 days | 4h (mono.) |
| Training Time (~1000 frames) | ~16h | 2 days | 2 days (RTX 4090) | 2 days (RTX 4090) | 30h | | |
| Inference Time (per image) | 180ms | 30s | 100ms | 4~10s | 5s | 40s | 20s |
Comparisons with recent state-of-the-art methods on neural avatars. Our method achieves roughly 20× faster inference than previous relightable methods.
Qualitative Results
We evaluate our method on the AvatarReX [1] and ActorsHQ [2] datasets. Our method reconstructs fine-grained dynamic human geometry.
References
[1] Zheng, Zerong, et al. "AvatarReX: Real-time Expressive Full-body Avatars." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-19.
[2] Işık, Mustafa, et al. "HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-12.
BibTeX
@misc{chen2024meshavatar,
  title={MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos},
  author={Yushuo Chen and Zerong Zheng and Zhe Li and Chao Xu and Yebin Liu},
  year={2024},
  eprint={2407.08414},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.08414},
}