End-to-End 3D Learning Workshop

Recent advances in foundation models—such as GPT for language and CLIP/Segment Anything for images—have driven rapid progress in two-dimensional (2D) and textual data processing. However, an expanding range of real-world applications, including large-scale autonomous systems, robotics, extended reality, and molecular modeling, requires deeper three-dimensional (3D) understanding. Moving beyond flat images to spatially grounded 3D representations is essential for addressing a broader set of physical and spatial challenges.

Many current 3D learning techniques (e.g., 3D reconstruction via Structure-from-Motion followed by dense stereo) still rely on sequential pipelines that are slow, error-prone, and difficult to scale to web-scale data. This workshop therefore explores how these fragmented approaches can be replaced by a single, differentiable framework that processes raw imagery and directly produces complete 3D outputs. Such an end-to-end design can be trained on large-scale unannotated data and can enable downstream tasks that require generalizable, real-time understanding of real-world geometry and semantics.
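
As a loose illustration of this contrast (not part of the workshop materials), the sketch below shows a tiny PyTorch model that maps a raw image directly to a set of 3D points in a single differentiable pass, so a loss on the output back-propagates through the whole system instead of through separate pipeline stages. The model name, point count, and loss choice are illustrative assumptions only.

    # Minimal sketch of the end-to-end idea: one differentiable model from raw
    # imagery to a 3D point set, rather than a sequential SfM + dense-stereo
    # pipeline. All names (TinyImageTo3D, NUM_POINTS) are assumptions.
    import torch
    import torch.nn as nn

    NUM_POINTS = 1024  # assumed size of the predicted point cloud


    class TinyImageTo3D(nn.Module):
        """Encode an image and regress NUM_POINTS xyz coordinates in one pass."""

        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),  # global feature, any input resolution
            )
            self.head = nn.Linear(64, NUM_POINTS * 3)

        def forward(self, images: torch.Tensor) -> torch.Tensor:
            # images: (B, 3, H, W) -> points: (B, NUM_POINTS, 3)
            feats = self.encoder(images).flatten(1)
            return self.head(feats).view(-1, NUM_POINTS, 3)


    if __name__ == "__main__":
        model = TinyImageTo3D()
        batch = torch.rand(2, 3, 128, 128)   # raw imagery in
        points = model(batch)                # 3D output, no intermediate stages
        print(points.shape)                  # torch.Size([2, 1024, 3])
        # Because every step is differentiable, a loss on `points` (e.g. a
        # Chamfer distance or a self-supervised reprojection loss) trains the
        # whole model jointly, without per-stage tuning.
        points.sum().backward()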

More specifically, the End-to-End 3D Learning (E2E3D) workshop will examine how best to unify modeling, inference, and optimization into a single data-driven architecture. By allowing AI systems to perceive images and generate 3D content, reason about geometric and semantic relationships, and integrate multi-modal signals, we can advance content creation, machine perception, and autonomous control.

Focus of the Workshop:

This workshop brings together researchers from computer vision, robotics, extended reality (XR), autonomous driving, scientific imaging, and related fields to foster interdisciplinary discussions on next-generation 3D systems. By spotlighting recent breakthroughs and identifying key challenges, we aim to inspire innovative research and practical applications across these domains.