GitHub - PolyU-ChenLab/ETBench: 👾 E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding (NeurIPS 2024)

Ye Liu¹˒², Zongyang Ma²˒³, Zhongang Qi², Yang Wu⁴, Ying Shan², Chang Wen Chen¹

¹The Hong Kong Polytechnic University  ²ARC Lab, Tencent PCG
³Institute of Automation, Chinese Academy of Sciences  ⁴Tencent AI Lab

E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark) is a comprehensive solution for open-ended event-level video-language understanding. This project consists of three contributions: the E.T. Bench benchmark, the E.T. Chat model, and the E.T. Instruct 164K dataset.

We focus on 4 essential capabilities for time-sensitive video understanding: referring, grounding, dense captioning, and complex understanding. The examples (categorized by background colors) are as follows.

🔥 News

🏆 Leaderboard

Our online leaderboard is under construction. Stay tuned!

🔮 Benchmark

Please refer to the Benchmark page for details about E.T. Bench.

🛠️ Model

Please refer to the Model page for training and testing E.T. Chat.

📦 Dataset

Please refer to the Dataset page for downloading E.T. Instruct 164K.

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}

💡 Acknowledgements

This project builds upon the following repositories. Many thanks to their authors.

LLaVA, LAVIS, EVA, LLaMA-VID, TimeChat, densevid_eval