GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Motivation
GPT-4V (or another multimodal LLM) can understand 3D content when given multi-view images of it as input.
GPT-4V Caption: Intricately detailed steampunk apparatus, primarily of mechanical design nature, appearing three-dimensional. With worn metallic and glassy texture. Showcasing a central clock face and multiple gauges, and accentuated by pipes, gears, and levers. Crafted mainly from aged bronze and accented with glass and wood. Intended for time display and possible atmospheric measurements, and is static. Exhibiting a Victorian steampunk style, set in an industrial workshop environment with a nostalgic and inventive mood & atmosphere.
GPT-4V Caption: Detailed potted plant on a rugged terrain, primarily of organic and naturalistic structure, appearing full and lifelike. Showcasing a vibrant green plant with yellow flowers and accompanied by smaller pink blossoms, and accentuated by a scattering of pebbles and rocks. Crafted mainly from digital textures mimicking natural materials and accented with subtle shading. Intended for environmental visualization and is static. Exhibiting a contemporary and natural aesthetic, set in an outdoor-like setting with a serene and peaceful atmosphere.
Prompt Distribution
Controllable prompt generator. More complex or more creative prompts often lead to a more challenging evaluation setting. Our prompt generator can produce prompts at various levels of creativity and complexity, which allows us to examine the performance of text-to-3D models in different cases more efficiently.
Different Complexity Levels
A large, multi-layered, symmetrical wedding cake, with smooth fondant, delicate piping, and lifelike sugar flowers in full bloom, displayed on a silver stand.
A solid, symmetrical, smooth stone fountain, with water cascading over its edges into a clear, circular pond surrounded by blooming lilies, in the center of a sunlit courtyard.
Different Creativity Levels
Orange monarch butterfly resting on a dandelion.
Frog with a translucent skin displaying a mechanical heart beating.
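As a rough illustration of what "controllable" could mean here, the sketch below builds an instruction for a language model from a requested complexity and creativity level. Everything in it is an assumption made for illustration: the three-step `LEVELS` scale, the `PromptSpec` fields, and the instruction wording are hypothetical and are not taken from the authors' generator.

```python
from dataclasses import dataclass

# Hypothetical three-step scale; the page does not say how levels are encoded.
LEVELS = ("low", "medium", "high")


@dataclass(frozen=True)
class PromptSpec:
    """Requested properties of one text-to-3D prompt (illustrative fields)."""
    subject: str
    complexity: str
    creativity: str


def build_generator_instruction(spec: PromptSpec) -> str:
    """Return an instruction asking a language model for one prompt."""
    for name, level in (("complexity", spec.complexity),
                        ("creativity", spec.creativity)):
        if level not in LEVELS:
            raise ValueError(f"{name} must be one of {LEVELS}, got {level!r}")
    # Illustrative wording only, not the authors' actual instruction.
    return (
        f"Write one text-to-3D prompt about: {spec.subject}.\n"
        f"Complexity level: {spec.complexity}.\n"
        f"Creativity level: {spec.creativity}.\n"
        "Return only the prompt text."
    )


if __name__ == "__main__":
    print(build_generator_instruction(PromptSpec("a frog", "low", "high")))
```

Sweeping one level while holding the other fixed would give prompt sets that isolate complexity from creativity, which is the kind of case-by-case comparison the generator is meant to enable.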
Method Overview
We create a customizable instruction template that contains the information GPT-4V needs to compare two 3D assets. We complete this template with different evaluation criteria, images of the input 3D assets, and random seeds to create the final 3D-aware prompts for GPT-4V. GPT-4V then takes these inputs and outputs its assessments. Finally, we combine GPT-4V’s answers to create a robust final estimate for the task.
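The pipeline described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the `TEMPLATE` wording, the `compare_assets` function, the `judge` callable (a stand-in for a GPT-4V call), the use of each seed to randomize left/right order, and the majority vote are all hypothetical choices made for illustration.

```python
import random
from collections import Counter
from typing import Callable, Iterable, Sequence

# Hypothetical instruction template; the wording is illustrative only.
TEMPLATE = (
    "Text prompt: {prompt}\n"
    "Criterion: {criterion}\n"
    "You are shown multi-view renders of two 3D assets (left and right).\n"
    "Answer 'left' or 'right' to indicate the better asset."
)

# Stand-in for a multimodal model call: (instruction, left views, right views) -> answer.
Judge = Callable[[str, Sequence[str], Sequence[str]], str]


def compare_assets(prompt: str, criterion: str,
                   views_a: Sequence[str], views_b: Sequence[str],
                   judge: Judge, seeds: Iterable[int]) -> str:
    """Query the judge once per seed and return 'A', 'B', or 'tie'."""
    votes = Counter()
    instruction = TEMPLATE.format(prompt=prompt, criterion=criterion)
    for seed in seeds:
        # Assumption: the seed decides which asset is shown on the left,
        # so that a positional preference of the judge averages out.
        swap = random.Random(seed).random() < 0.5
        left, right = (views_b, views_a) if swap else (views_a, views_b)
        answer = judge(instruction, left, right).strip().lower()
        if answer not in ("left", "right"):
            continue  # skip answers that cannot be parsed
        picked_left = answer == "left"
        # Map the positional answer back to the underlying asset.
        votes["B" if picked_left == swap else "A"] += 1
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"
```

Passing the judge in as a callable keeps the sketch independent of any particular model API; a real run would wrap an actual multimodal model call behind the same signature and repeat the comparison for each evaluation criterion.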
Examples
BibTeX
@inproceedings{wu2023gpteval3d,
  author    = {Tong Wu and Guandao Yang and Zhibing Li and Kai Zhang and
               Ziwei Liu and Leonidas Guibas and Dahua Lin and Gordon Wetzstein},
  title     = {GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation},
  booktitle = {CVPR},
  year      = {2024},
}
We thank Nerfies for providing this amazing project template.