Second Joint Egocentric Vision Workshop
Overview
Wearable cameras, smart glasses, and AR/VR headsets are gaining importance for research and commercial use. They feature various sensors like cameras, depth sensors, microphones, IMUs, and GPS. Advances in machine perception enable precise user localization (SLAM), eye tracking, and hand tracking. This data allows understanding user behavior, unlocking new interaction possibilities with augmented reality. Egocentric devices may soon automatically recognize user actions, surroundings, gestures, and social relationships. These devices have broad applications in assistive technology, education, fitness, entertainment, gaming, eldercare, robotics, and augmented reality, positively impacting society.
Previously, research in this field was held back by the limited availability of datasets for what is an inherently data-intensive problem. The community's recent efforts have addressed this issue by releasing numerous large-scale datasets covering various aspects of egocentric perception, including HoloAssist, Ego4D, Ego-Exo4D, EPIC-KITCHENS, and HD-EPIC.
The goal of this workshop is to offer an exciting discussion forum for researchers working in this challenging and fast-growing area, and to unlock the potential of data-driven research with our datasets to further the state of the art.
Challenges
We welcome submissions to the challenges from February to May (see important dates) through the leaderboards linked below. Participants in the challenges must submit a technical report on their method; this is a requirement for the competition. Reports should be 2-6 pages including references, use the CVPR format, and be submitted through the CMT website.
HoloAssist Challenges
HoloAssist is a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks.
Action Recognition
**Lead:** Taein Kwon, ETH Zurich, Switzerland
**Summary:** Action recognition on the HoloAssist dataset. Input can be RGB images alone or multiple modalities.
Mistake Detection
**Lead:** Mahdi Rad, Microsoft, Switzerland
**Summary:** Mistake detection follows the convention of Assembly101 but is applied to the fine-grained actions in our benchmark. We take the features of the fine-grained action clips from the beginning of the coarse-grained action until the end of the current action clip, and the model predicts a label from {correct, mistake}.
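For illustration, a minimal sketch of this setup is shown below: per-clip features are mean-pooled over the window described above and passed to a binary head. The feature dimension, pooling strategy, and architecture are assumptions made for the example, not the official HoloAssist baseline.

```python
import torch
import torch.nn as nn

class MistakeDetector(nn.Module):
    """Toy binary classifier over pooled fine-grained clip features.
    Illustrative only; the feature dimension and mean pooling are
    assumptions, not the official HoloAssist baseline."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (num_clips, feat_dim) -- features of the fine-grained
        # clips from the start of the coarse-grained action up to the current
        # clip. Pool over clips, then predict logits over {correct, mistake}.
        pooled = clip_feats.mean(dim=0)
        return self.head(pooled)
```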
Ego4D Challenges
Ego4D is a massive-scale egocentric dataset and benchmark suite collected across 74 worldwide locations in 9 countries, with over 3,670 hours of daily-life activity video. Details of our challenges are given below:
Ego4D Episodic Memory
Track: Visual Queries
Lead: Suyog Jain, Meta, US
**Summary:** Given an egocentric video, the goal is to answer queries of the form "Where did I last see object X?", where the query object X is specified as a static image.
Ego4D Episodic Memory
Track: Natural Language Queries
Lead: Suyog Jain, Meta, US
**Summary:** Given an egocentric video V and a natural language query Q, the goal is to identify a response track r, such that the answer to Q can be deduced from r.
Ego4D Episodic Memory
Track: Moment Queries
**Lead:** Chen Zhao & Merey Ramazanova, KAUST, SA
Summary: Given an input video and a query action category, the goal is to retrieve all the instances of this action category in the video.
Current SOTA: Paper
Previous Winner: 34.99
Ego4D Episodic Memory
Track: Goal Step
Lead: Yale Song, Meta, US
Summary: Given an untrimmed egocentric video, identify the temporal action segment corresponding to a natural language description of the step. Specifically, predict the (start_time, end_time) for a given keystep description.
Current SOTA: Paper
Previous Winner: 35.18 r@1, IoU=0.3
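For reference, the headline metric (recall@1 at a temporal IoU threshold of 0.3) can be sketched as follows. The segment format (start, end) in seconds and the one-prediction-per-query setup are assumptions made for the example; consult the official evaluation code for the exact protocol.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, iou_thresh=0.3):
    """Fraction of queries whose top-1 predicted segment overlaps the ground
    truth with IoU >= iou_thresh. Illustrative, not the official script."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_thresh
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```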
Ego4D Episodic Memory
Track: EgoSchema
Lead: Karttikeya Mangalam & Raiymbek Akshulakov, UC Berkeley, US
Summary: EgoSchema is a very long-form video question-answering dataset and benchmark to evaluate long video understanding capabilities of modern vision and language systems.
Current SOTA: 0.75 (report unavailable)
Previous Winner: N/A
Ego4D Social Interaction
Track: Looking at me
Lead: Xizi Wang, Indiana University, US
**Summary:** The task focuses on identifying communicative acts that are directed towards the camera wearer, as distinguished from those directed at other social partners.
Ego4D Social Interaction
Track: Talking to me
**Lead:** Xizi Wang, Indiana University, US
**Summary:** Given a video and audio segment with the same tracked faces and an additional label that identifies speaker status, classify whether each visible face is talking to the camera wearer.
Ego4D Forecasting
Track: Short-term object interaction anticipation
**Lead:** Francesco Ragusa, University of Catania, IT
**Summary:** This task aims to predict the next human-object interaction happening after a given timestamp. Given an input video, the goal is to anticipate 1) the spatial positions of the active objects, 2) the category of each detected next active object, 3) how each active object will be used (verb), and 4) when the interaction will begin.
Current SOTA: Paper 1; Paper 2
Previous Winner: Top-5 Overall mAP: 7.21
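The structure of a single prediction for this task can be pictured as in the sketch below. The field names are illustrative, not the official submission schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class NextActiveObjectPrediction:
    """One hypothesis for short-term object interaction anticipation.
    Field names are illustrative, not the official submission schema."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) of the next active object
    noun: str                               # predicted object category
    verb: str                               # how the object will be used
    time_to_contact: float                  # seconds until the interaction begins
    score: float                            # confidence, used for top-5 ranking
```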
Ego4D Forecasting
Track: Long-term action anticipation
Lead: Tushar Nagarajan, FAIR, US
Summary: This task aims to predict the next Z future actions after a given action. Given an input video up to a particular timestep (corresponding to the last visible action), the goal is to predict a list of action classes [(verb1, noun1), (verb2, noun2) ... (verbZ, nounZ)] that follow it.
Current SOTA: Paper
Previous Winner: N/A
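Long-term anticipation is commonly scored with an edit-distance-style metric over the predicted (verb, noun) sequence. The sketch below shows a plain Levenshtein distance between one predicted sequence and the ground truth; the official Ego4D evaluation differs in how it aggregates over multiple candidate sequences, so treat this only as an illustration.

```python
def edit_distance(pred_actions, gt_actions):
    """Levenshtein distance between two sequences of (verb, noun) pairs.
    A simplified illustration of sequence-level scoring, not the official
    Ego4D LTA evaluation."""
    m, n = len(pred_actions), len(gt_actions)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_actions[i - 1] == gt_actions[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]
```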
Ego-Exo4D Challenges
Ego-Exo4D is a diverse, large-scale, multi-modal, multi-view video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair).
EgoExo4D Pose Challenge
Track: Ego-Pose Body
Lead: Juanita Puentes Mozo & Gabriel Perez Santamaria, Universidad de los Andes
Summary: The EgoExo4D Body Pose Challenge aims to accurately estimate body pose using only first-person raw video and/or egocentric camera pose.
Current SOTA: EgoCast (MPJPE: 14.36)
Previous Winner: MPJPE: 15.32
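The leaderboard metric, MPJPE (mean per-joint position error), is the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch is below, assuming arrays of shape (num_frames, num_joints, 3); the official evaluation may additionally handle missing or invisible joints.

```python
import numpy as np

def mpjpe(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean distance between
    predicted and ground-truth 3D joint positions. Arrays are assumed to
    have shape (num_frames, num_joints, 3); illustrative, not the official
    evaluation script."""
    assert pred_joints.shape == gt_joints.shape
    per_joint_err = np.linalg.norm(pred_joints - gt_joints, axis=-1)
    return float(per_joint_err.mean())
```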
EgoExo4D Pose Challenge
Track: Ego-Pose Hands
**Lead:** Shan Shu, University of Pennsylvania, US
Summary:
Current SOTA:
Previous Winner:
EgoExo4D Proficiency Estimation
Track: Demonstrator Proficiency
Lead: Arjun Somayazulu, UT Austin, US
Summary: Given synchronized egocentric and exocentric video of a demonstrator performing a task, classify the proficiency skill level of the demonstrator.
Current SOTA: EgoExo4D benchmark baseline
Previous Winner: N/A
EgoExo4D Keysteps
Track: Fine-grained Keystep Recognition
Lead: Sherry Xue & Tushar Nagarajan, UT Austin, US
Summary:
Current SOTA:
Previous Winner:
EgoExo4D Relations
Track: Correspondence
Lead: Sanjay Haresh, Simon Fraser University, Canada
Summary: The challenge is aimed at methods for object correspondences across egocentric and exocentric views. Given a pair of time-synchronized egocentric and exocentric videos, as well as a query object track in one of the views, the goal is to output the corresponding mask for the same object instance in the other view for all frames where the object is visible in both views.
Current SOTA: Paper
Previous Winner: N/A
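Correspondence predictions are naturally compared to the ground truth with per-frame mask IoU; a minimal sketch is below, assuming boolean masks of equal size. This illustrates the comparison only and is not the official challenge scoring script.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two boolean segmentation masks of the same spatial size,
    e.g. a predicted exo-view mask vs. the ground-truth mask for one frame.
    Illustrative, not the official challenge evaluation."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:          # object absent in both masks
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union)
```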
EgoExo4D Keysteps
Track: Procedure Understanding
**Lead:** Antonino Furnari, University of Catania, IT
**Summary:** The objective of this task is to infer a procedure's underlying structure from observing natural videos of subjects performing the procedure.
Current SOTA:
Previous Winner:
EPIC-Kitchens Challenges
Please check the EPIC-KITCHENS website for more information on the EPIC-KITCHENS challenges. Links to the individual challenges are also provided below.
Action Recognition
Lead: Prajwal Gatti and Siddhant Bansal, University of Bristol, UK
Summary: Classify the verb and noun of the action depicted in a trimmed video clip.
Current SOTA: Paper
Previous Winner: 48.1% - top 1 / 77.4% - top 5
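The top-1 / top-5 figures above can be reproduced from per-clip class scores as in the sketch below, which assumes a (num_clips, num_classes) score matrix and integer labels; the official evaluation reports verb, noun, and action accuracy separately.

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of clips whose ground-truth class is among the k highest-scoring
    classes. scores: (num_clips, num_classes); labels: (num_clips,).
    Illustrative sketch, not the official evaluation script."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    correct = (topk == labels[:, None]).any(axis=1)
    return float(correct.mean())
```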
Action Detection
Lead: Francesco Ragusa and Antonino Furnari, University of Catania, IT
Summary: The challenge requires detecting and recognising all action instances within an untrimmed video. The challenge will be carried out on the EPIC-KITCHENS-100 dataset.
Current SOTA: Results
Previous Winner: Action Avg. mAP 31.97
Domain Adaptation Challenge for Action Recognition
Lead: Saptarshi Sinha and Prajwal Gatti, University of Bristol, UK
Summary: Given labelled videos from the source domain and unlabelled videos from the target domain, the goal is to classify actions in the target domain. An action is defined as a verb and noun depicted in a trimmed video clip.
Current SOTA: Paper
Previous Winner: 43.17 for action accuracy
Multi-Instance Retrieval
Lead: Prajwal Gatti and Michael Wray, University of Bristol, UK
Summary: Perform cross-modal retrieval by searching between vision and text modalities.
Current SOTA: Paper
Previous Winner: Normalised Discounted Cumulative Gain (%) Avg. - 74.25
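The retrieval metric, normalised Discounted Cumulative Gain (nDCG), rewards relevant items retrieved at high ranks. A minimal single-query sketch is below, assuming a list of graded relevance values in the order the system ranked the items; the official evaluation averages over queries and both retrieval directions.

```python
import numpy as np

def ndcg(relevances_in_ranked_order) -> float:
    """Normalised DCG for one query. The input holds the graded relevance of
    each retrieved item, in the order the system ranked them. Illustrative
    sketch, not the official evaluation script."""
    rel = np.asarray(relevances_in_ranked_order, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # rank 1 -> 1/log2(2)
    dcg = float((rel * discounts).sum())
    ideal = float((np.sort(rel)[::-1] * discounts).sum())
    return dcg / ideal if ideal > 0 else 0.0
```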
Semi-Supervised Video-Object Segmentation
Lead: Rhodri Guerrier and Ahmad Darkhalil, University of Bristol, UK
**Summary:** Given a sub-sequence of frames with M object masks in the first frame, the goal of this challenge is to segment these objects through the remaining frames. Objects not present in the first frame of the sub-sequence are excluded from this benchmark.
Current SOTA: Webpage
EPIC-SOUNDS Audio-Based Interaction Recognition
Lead: Omar Emara and Jacob Chalk, University of Bristol, UK
Summary: Recognise interactions from the audio data of EPIC-SOUNDS (i.e., classify the audio segment).
Current SOTA: User: JMCarrot
Previous Winner: N/A
EPIC-SOUNDS Audio-Based Interaction Detection
Lead: Omar Emara and Jacob Chalk, University of Bristol, UK
Summary: Detect all audio-based interactions in the audio data of EPIC-SOUNDS for a given video, predicting the start and end time of each interaction and classifying it (recognition).
Current SOTA: User: shuming
Previous Winner: N/A
HD-EPIC Challenge
Please check the HD-EPIC website for more information on the HD-EPIC challenges. Links to the individual challenges are also provided below.
HD-EPIC Challenges - VQA
Lead: Prajwal Gatti (University of Bristol, UK) and Kaiting Liu (Leiden University, Netherlands)
Summary: Given a question belonging to any one of the seven types defined in the HD-EPIC VQA benchmark, the goal is to predict the correct answer among the five listed choices.
Current SOTA: Gemini Pro
Previous Winner: N/A
Call for Abstracts
You are invited to submit extended abstracts to the second edition of the Joint Egocentric Vision Workshop, which will be held alongside CVPR 2025 in Nashville.
These abstracts represent existing or ongoing work and will not be published as part of any proceedings. We welcome all work within the egocentric domain; it is not necessary to use the Ego4D dataset in your work. We expect a submission to cover one or more of the following topics (this is a non-exhaustive list):
- Egocentric vision for human activity analysis and understanding, including action recognition, action detection, audio-visual action perception and object state change detection
- Egocentric vision for anticipating human behaviour, actions and objects
- Egocentric vision for 3D perception and interaction, including dynamic scene reconstruction, hand-object reconstruction, long-term object tracking, NLQ and visual queries, long-term video understanding
- Head-mounted eye tracking and gaze estimation including attention modelling and next fixation prediction
- Egocentric vision for object/event recognition and retrieval
- Egocentric vision for summarization
- Daily life and activity monitoring
- Egocentric vision for human skill learning, assistance, and robotics
- Egocentric vision for social interaction and human behaviour understanding
- Privacy and ethical concerns with wearable sensors and egocentric vision
- Egocentric vision for health and social good
- Symbiotic human-machine vision systems, human-wearable devices interaction
- Interactive AR/VR and Egocentric online/real-time perception
Format
The length of the extended abstracts is 2-4 pages, including figures, tables, and references. We invite submissions of ongoing or already published work, as well as reports on demonstrations and prototypes. The joint egocentric vision workshop gives authors the opportunity to present their work to the egocentric community to provoke discussion and feedback. Accepted work will be presented either as an oral presentation (virtual or in-person) or as a poster. The review will be single-blind, so there is no need to anonymize your work; submissions should otherwise follow the CVPR format (information can be found here). Accepted abstracts will not be published as part of any proceedings, so they can be uploaded to ArXiv etc., and the links will be provided on the workshop's webpage. Submissions will be managed through the CMT website.
Important Dates
Event | Date |
---|---|
Challenges Leaderboards Open | Feb 2025 |
Challenges Leaderboards Close | 19 May 2025 (some challenges have extended their deadline; please check the respective challenge's webpage) |
Challenges Technical Reports Deadline (on CMT) | 23 May 2025 (11:55 PM AoE) (some challenges have extended their deadline; please check the respective challenge's webpage) |
Notification to Challenge Winners | 30 May 2025 |
Challenge Reports ArXiv Deadline | 6 June 2025 |
Extended Abstract Deadline (on CMT) | |
Extended Abstract Notification to Authors | 23 May 2025 |
Extended Abstracts ArXiv Deadline | 2 June 2025 |
Workshop Date | 12 June 2025 |
Program
All dates are local to Nashville's time, CST.
Workshop Location: Room Grand B1
Time | Event |
---|---|
08:50-09:00 | Welcome and Introductions |
09:00-09:30 | Invited Keynote 1: Siyu Tang, ETH Zürich, CH. Talk Title: Towards an egocentric multimodal foundation model |
09:30-10:00 | HoloAssist Challenges |
10:00-11:00 | Coffee Break and Poster Session |
11:00-11:30 | Invited Keynote 2: Kris Kitani, CMU, USA |
11:30-12:00 | EPIC-KITCHENS & HD-EPIC Challenges |
12:00-12:30 | Oral Presentations (Group 1): EgoLife: Towards Egocentric Life Assistant; HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos; VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation; Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning |
12:30-13:30 | Lunch Break |
13:30-14:00 | EgoVis Distinguished Papers Award |
14:00-14:30 | Invited Keynote 3: Xiaolong Wang, UCSD, USA |
14:30-15:30 | Ego4D & Ego-Exo4D Challenges |
15:30-16:00 | Coffee Break |
16:00-16:30 | Invited Keynote 4: Arsha Nagrani, Google DeepMind |
16:30-17:05 | Aria Gen2 |
17:05-17:35 | Oral Presentations (Group 2): FIction: 4D Future Interaction Prediction from Video; EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision; Estimating Body and Hand Motion in an Ego‑sensed World; Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory |
17:35-17:45 | Conclusion |
Papers
Note to authors: Please hang your poster in Hall D against the boards bearing the workshop's name. Posters can be put up ONLY during the poster session time (10:00 - 11:00).
Follow CVPR poster guidelines for poster dimensions.
All workshop posters are in Exhibition Hall D.
Extended Abstracts
PosterBoard # | Title | Authors | arXiv Link |
---|---|---|---|
77 | Leadership Assessment in Pediatric Intensive Care Unit Team Training | Liangyang Ouyang (The University of Tokyo); Yuki Sakai (The University of Tokyo); Ryosuke Furuta (The University of Tokyo); Hisataka Nozawa (The University of Tokyo); Hikoro Matsui (The University of Tokyo); Yoichi Sato (The University of Tokyo) | link |
78 | What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning | Chi-Hsi Kung (Indiana University); Frangil Ramirez (Indiana University); Juhyung Ha (Indiana University); Yi-Ting Chen (National Yang-Ming Chiao-Tung University); David Crandall (Indiana University); Yi-Hsuan Tsai (Atmanity Inc.) | link |
79 | Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory | Zaira Manigrasso (University of Udine); Matteo Dunnhofer (University of Udine); Antonino Furnari (University of Catania); Moritz Nottebaum (University of Udine); Antonio Finocchiaro (University of Catania); Davide Marana (University of Udine); Rosario Forte (University of Catania); Giovanni Maria Farinella (University of Catania); Christian Micheloni (University of Udine) | link |
80 | Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning | Xueyi Ke (Nanyang Technological University); Satoshi Tsutsui (Nanyang Technological University, Singapore); Yayun Zhang (Max Planck Institute for Psycholinguistics); Bihan Wen (Nanyang Technological University) | link |
81 | PAIR-Net: Enhancing Egocentric Speaker Detection via Pretrained Audio–Visual Fusion and Alignment Loss | Yu Wang (Indiana University); Juhyung Ha (Indiana University Bloomington); David Crandall (Indiana University) | link |
82 | From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities | Dominick Reilly (University of North Carolina at Charlotte); Manish Kumar Govind (University of North Carolina at Charlotte); Le Xue (Salesforce AI Research); Srijan Das (University of North Carolina at Charlotte) | link |
83 | EASG-Bench : Video Q&A Benchmark with Egocentric Action Scene Graphs | Ivan Rodin (University of Catania); Tz-Yin Wu (Intel Labs); Kyle Min (Intel Labs); Sharath Sridhar (Intel Labs); Antonino Furnari (University of Catania); Subarna Tripathi (Intel Labs); Giovanni Maria Farinella (University of Catania) | link |
84 | ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition | Sanjoy Kundu (Auburn University); Shanmukha Vellamcheti (Auburn University); Sathyanarayan N. Aakur (Auburn University) | link |
85 | Reasoning on hierarchical representation of human behavior from Ego-videos | Simone Alberto Peirone (Politecnico di Torino); Francesca Pistilli (Politecnico di Torino); Giuseppe Averta (Politecnico di Torino) | link |
86 | Learning reusable concepts across different egocentric video understanding tasks | Simone Alberto Peirone (Politecnico di Torino); Francesca Pistilli (Politecnico di Torino); Antonio Alliegro (Politecnico di Torino); Tatiana Tommasi (Politecnico di Torino); Giuseppe Averta (Politecnico di Torino) | link |
87 | Efficient Egocentric Action Recognition with Multimodal Data | Marco Calzavara (ETH Zurich); Ard Kastrati (ETH Zurich); Matteo Macchini (Magic Leap); Dushan Vasilevski (Magic Leap); Roger Wattenhofer (ETH Zurich) | link |
88 | Sequence-Based Identification of First-Person Camera Wearers in Third-Person Views | Ziwei Zhao (Indiana University); Xizi Wang (Indiana University); Yuchen Wang (Indiana University); Feng Cheng (ByteDance); David Crandall (Indiana University) | link |
89 | Vid2Coach: Transforming How-To Videos into Task Assistants | Mina Huh (University of Texas at Austin); Zihui Xue (University of Texas at Austin); Ujjaini Das (University of Texas at Austin); Kumar Ashutosh (University of Texas at Austin); Kristen Grauman (University of Texas at Austin); Amy Pavel (University of Texas at Austin) | link |
90 | Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera | Zhengdi Yu (Imperial College London); Stefanos Zafeiriou (Imperial College London); Tolga Birdal (Imperial College London) | link |
91 | Keystep Recognition using Graph Neural Networks | Julia Romero (University of Colorado Boulder); Kyle Min (Intel Labs); Subarna Tripathi (Intel Labs); Morteza Karimzadeh (University of Colorado Boulder) | link |
92 | Improving Keystep Recognition in Ego-Video via Dexterous Focus | Zachary Chavis (University of Minnesota); Stephen Guy (University of Minnesota); Hyun Soo Park (University of Minnesota) | link |
Invited CVPR Papers
PosterBoard # | Title | Authors | Paper Link |
---|---|---|---|
93 | HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos | Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Engel, Tomas Hodan | link |
94 | Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision | Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori | link |
95 | FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video | Andrea Boscolo Camiletto, Jian Wang, Eduardo Alvarado, Rishabh Dabral, Thabo Beeler, Marc Habermann, Christian Theobalt | link |
96 | Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari | link |
97 | HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos | Jinglei Zhang, Jiankang Deng, Chao Ma, Rolandos Alexandros Potamias | link |
98 | Layered motion fusion: Lifting motion segmentation to 3D in egocentric videos | Vadim Tschernezki, Diane Larlus, Andrea Vedaldi, Iro Laina | link |
99 | REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning | Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih | link |
100 | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-seng Chua, Angela Yao | link |
101 | EgoLife: Towards Egocentric Life Assistant | Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Bo Li, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Ziwei Liu | link |
102 | DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Video | Lorenzo Mur-Labadia, Jose J. Guerrero, Ruben Martinez-Cantin | link |
103 | Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities | Michele Mazzamuto, Antonino Furnari, Yoichi Sato, Giovanni Maria Farinella | link |
104 | EgoLM: Multi-Modal Language Model of Egocentric Motions | Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, Lingni Ma | link |
105 | EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision | Yiming Zhao, Taein Kwon, Paul Streli, Marc Pollefeys, Christian Holz | link |
106 | HD-EPIC: A Highly-Detailed Egocentric Video Dataset | Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen | link |
107 | Estimating Body and Hand Motion in an Ego‑sensed World | Brent Yi, Vickie Ye, Maya Zheng, Yunqi Li, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa | link |
108 | Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Jungin Park, Jiyoung Lee, Kwanghoon Sohn | link |
109 | GEM: A Generalizable Ego-vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control | Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alex Alahi | link |
110 | Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning | Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman | link |
111 | Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos | Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman | link |
112 | ExpertAF: Expert Actionable Feedback from Video | Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman | link |
113 | FIction: 4D Future Interaction Prediction from Video | Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman | link |
114 | Progress-Aware Video Frame Captioning | Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman | link |
115 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani | link |
116 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal | link |
117 | VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation | Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, Stefan Leutenegger | link |
Invited Speakers
Workshop Organisers
Co-organizing Advisors
Related Past Events
This workshop follows in the footsteps of the following previous events:
- EgoVis@CVPR2024: First Joint Egocentric Vision (EgoVis) Workshop in conjunction with CVPR 2024;
EPIC-Kitchens and Ego4D Past Workshops:
- Ego4D&EPIC@CVPR2023: Joint International 3rd Ego4D and 11th EPIC Workshop in conjunction with CVPR 2023;
- 2nd International Ego4D Workshop in conjunction with ECCV 2022;
- Ego4D&EPIC@CVPR2022: Joint International 1st Ego4D and 10th EPIC Workshop in conjunction with CVPR 2022;
- EPIC@ICCV21: The Ninth International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ICCV 2021;
- EPIC@CVPR21: The Eighth International Workshop on Egocentric Perception, Interaction and Computing in conjunction with CVPR 2021;
- EPIC@ECCV20: The Seventh International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ECCV 2020;
- EPIC@CVPR20: The Sixth International Workshop on Egocentric Perception, Interaction and Computing in conjunction with CVPR 2020;
- EPIC@ICCV19: The Fifth International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ICCV 2019;
- EPIC@CVPR19: The Fourth International Workshop on Egocentric Perception, Interaction and Computing in conjunction with CVPR 2019;
- EPIC@ECCV18: The Third International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ECCV 2018;
- EPIC@ICCV17: The Second International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ICCV 2017;
- EPIC@ECCV16: The First International Workshop on Egocentric Perception, Interaction and Computing in conjunction with ECCV 2016;
Human Body, Hands, and Activities from Egocentric and Multi-view Cameras Past Workshops:
- Human Body, Hands, and Activities from Egocentric and Multi-view Cameras (HBHA) run alongside ECCV 2022;
Project Aria Past Tutorials:
- Aria tutorial run alongside CVPR 2023;
- Aria tutorial run alongside CVPR 2022;