
DreamFrame:

Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes

We propose DreamFrame, a novel framework designed to create synthetic, high-quality data for video understanding.


Install

Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short-video evaluation process is mainly based on the quantitative_evaluation code from Video-ChatGPT.

Clone this repository

```bash
git clone https://github.com/Deaddawn/DreamFrame-code.git
```

Install packages (tested on A100 and RTX 3090 with CUDA 11.8). We recommend sticking to the package versions we provide, as changes to the diffusers and transformers versions may lead to certain issues.

```bash
conda create -n DreamFrame python=3.10 -y
conda activate DreamFrame
cd DreamFrame-code
pip install -r requirements.txt
```

Generation

The data generation process of DreamFrame consists of three stages: (1) Movie Plot Generation, (2) Style Immobilization Process, and (3) Video Instruction Data Generation.

Movie Plot Generation

We adopt a story-expanding strategy that incrementally generates frame descriptions at three levels, and we provide example prompts for all three levels. Use any LLM (we use GPT-4) to generate frame descriptions and organize them into a JSON file like story_js; a minimal sketch of building such a file is shown below.
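The snippet below is only an illustrative sketch of collecting LLM-generated frame descriptions into a JSON file. The field names (`story_id`, `style_keyword`, `frames`, `description`) are hypothetical; consult the provided story_js example for the exact schema expected by the later generation step.

```python
import json

# Hypothetical schema, for illustration only -- follow the provided story_js
# example for the real field names and structure.
story = {
    "story_id": 0,
    "style_keyword": "Dramatic",
    "frames": [
        {"frame_id": 0, "description": "A lone detective walks down a rain-soaked street at night."},
        {"frame_id": 1, "description": "She pauses under a flickering streetlight, studying a photograph."},
    ],
}

# The generation scripts read JSON files from ./json/ inside StyleImmobilization.
with open("story_info_0.json", "w") as f:
    json.dump(story, f, indent=2)
```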

Style Immobilization Process

Style Immobilization learns a style embedding that can be used to generate style-consistent keyframes. Learning the embedding requires a style-related keyword and a set of style-related images. The keyword is obtained from stage one; for the style-related images, we simply use sdxl-1.0-base to generate them from the detailed style description (you can find an example in the prompt we provide), as sketched below.
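A minimal sketch of generating style-related images with diffusers and the SDXL base checkpoint. The style prompt here is illustrative; use the detailed style description from the provided prompt, and keep the pinned diffusers/transformers versions from requirements.txt.

```python
import os
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base (stabilityai/stable-diffusion-xl-base-1.0).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical detailed style description -- replace with the one from the
# provided example prompt.
style_prompt = "dramatic cinematic lighting, high-contrast shadows, film still"

# Generate a handful of style-related images for embedding training and save
# them to the directory passed as --image_path.
os.makedirs("style", exist_ok=True)
images = pipe(style_prompt, num_images_per_prompt=4).images
for i, img in enumerate(images):
    img.save(f"style/style_{i}.png")
```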

Here, we provide an example showing how to train a style embedding, using the keyword "Dramatic".

```bash
cd StyleImmobilization
python style_embedding.py --style_keyword Dramatic --image_path ./style
```

The learned style embedding will be saved in the "Embeddings" folder. Training should only take 5–10 minutes (tested on an A100).

Video Instruction Data Generation

After training a style embedding, you can generate style-consistent keyframes from the aforementioned JSON file like this:

```bash
cd StyleImmobilization
python generate.py --js_path ./json/story_info_0.json --embed_path ./Embeddings/story_0_Dramatic.pt --keyword Dramatic --save_path ./save_path
```
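generate.py is the supported entry point for this step. The snippet below is only a conceptual sketch of the idea: iterate over the frame descriptions in the story JSON and render one keyframe per description with the style keyword in the prompt. It assumes the hypothetical JSON schema from the earlier sketch, and it does not load the learned embedding the way the actual script does (the real pipeline injects the trained style embedding for the keyword token).

```python
import json
import os
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

with open("./json/story_info_0.json") as f:
    story = json.load(f)

# Hypothetical schema (see the provided story_js example for the real one):
# one prompt per frame description, prefixed with the style keyword so that
# all keyframes of a story share the same style. In the actual pipeline the
# keyword token carries the learned style embedding.
os.makedirs("./save_path", exist_ok=True)
for i, frame in enumerate(story["frames"]):
    image = pipe(f"Dramatic style, {frame['description']}").images[0]
    image.save(f"./save_path/frame_{i:03d}.png")
```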

Model

We provide our baseline model and the model trained on our generated dataset. For more detailed information, refer to LLaMA-VID-model. Please follow LLaMA-VID to prepare the necessary settings, and feel free to use our provided checkpoints.

| Type | Max Token | Base LLM | Finetuning Data | Finetuning Schedule | Download |
|------|-----------|----------|-----------------|---------------------|----------|
| Base Model | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | ckpt |
| DreamFrame-7B | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + DreamFrameQA | full_ft-1e | ckpt |

Dataset

Data generated by our pipeline consists of keyframe images with corresponding QA pairs and dialogues. You can download it here: DreamFrame-Data

Pipeline

Evaluation

We follow MVBench, Video-Bench and TempCompass to conduct evaluations.

Evaluation Results

Results

Generation Results

Comparison Results

Acknowledgement

We would like to thank the following repos for their great work: