DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
We propose DreamFrame, a novel framework designed to create synthetic, high-quality data for video understanding.
Contents
- Install
- Generation
- Model
- Dataset
- Evaluation
- Acknowledgement
Install
Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short-video evaluation process is mainly based on the quantitative_evaluation code from Video-ChatGPT.
Clone this repository
```bash
git clone https://github.com/Deaddawn/DreamFrame-code.git
```
Install packages (tested on A100 and RTX 3090 with CUDA 11.8). We recommend sticking to the package versions we provide, as changes in the versions of diffusers and transformers may lead to issues.
```bash
conda create -n DreamFrame python=3.10 -y
conda activate DreamFrame
cd DreamFrame-code
pip install -r requirements.txt
```
Generation
The data generation process of DreamFrame consists of three stages: (1) Movie Plot Generation, (2) Style Immobilization, and (3) Video Instruction Data Generation.
Movie Plot Generation
We adopt a story-expansion strategy that incrementally generates frame descriptions through three levels, and we provide example prompts for all three levels. Use any LLM (we use GPT-4) to generate the frame descriptions and organize them into a JSON file like this: story_js
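The LLM calls themselves are left to you, so the following is only a minimal sketch of how a three-level expansion loop could be wired up. `call_llm` is a placeholder for your own LLM client (we use GPT-4), and the JSON field names are hypothetical; match them to the provided story_js example.

```python
# Hypothetical sketch of the three-level story expansion.
# `call_llm` is a placeholder for your own LLM client (e.g. a GPT-4 wrapper);
# the JSON field names below are illustrative, not the repo's exact schema.
import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def expand_story(theme: str, level1_prompt: str, level2_prompt: str, level3_prompt: str) -> dict:
    # Level 1: theme -> overall movie plot
    plot = call_llm(level1_prompt.format(theme=theme))
    # Level 2: plot -> a list of scene summaries
    scenes = [s for s in call_llm(level2_prompt.format(plot=plot)).split("\n") if s.strip()]
    # Level 3: each scene -> detailed frame descriptions
    frames = [call_llm(level3_prompt.format(scene=s)) for s in scenes]
    return {"theme": theme, "plot": plot, "scenes": scenes, "frame_descriptions": frames}


if __name__ == "__main__":
    story = expand_story("a detective drama", "...", "...", "...")
    with open("./json/story_info_0.json", "w") as f:
        json.dump(story, f, indent=2)
```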
Style Immobilization Process
Style immobilization learns a style embedding that can be used to generate style-consistent keyframes. Learning the embedding requires a style-related keyword and a set of style-related images. The keyword can be obtained from stage one; for the style-related images, we simply use sdxl-1.0-base to generate them from the detailed style description (you can find an example in the prompts we provide).
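If you need to produce the style-related images yourself, a minimal diffusers sketch along the following lines should work. The style description and output paths here are placeholders, and the fp16/CUDA settings assume an NVIDIA GPU.

```python
# Minimal sketch: generate style-reference images with SDXL base 1.0 via diffusers.
# The style description below is a placeholder; use the detailed style description
# produced in stage one (see the example prompts we provide).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

style_description = "dramatic lighting, high-contrast cinematic movie still"  # placeholder
for i in range(8):
    image = pipe(prompt=style_description, num_inference_steps=30).images[0]
    image.save(f"./style/style_{i:02d}.png")
```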
Here is an example of how to train a style embedding, using the keyword "Dramatic":
```bash
cd StyleImmobilization
python style_embedding.py --style_keyword Dramatic --image_path ./style
```
The learned style embedding will be saved in the "Embeddings" folder. Training should only take 5-10 minutes (tested on an A100).
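For background only: the core update of a textual-inversion-style trainer (the technique our pipeline builds on; see Acknowledgement) looks roughly like the sketch below. It assumes a Stable Diffusion 1.5 backbone, a hypothetical placeholder token, and already-preprocessed style images; style_embedding.py's actual implementation and saved-file format may differ.

```python
# Illustrative textual-inversion update: only the new style token's embedding row is learned.
# The backbone (SD 1.5), token name, and save path are assumptions for this sketch.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Register a new token for the style keyword; only its embedding row is trained.
placeholder = "<dramatic-style>"
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(placeholder)

vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
orig_rows = embeddings.weight.data.clone()
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)


def training_step(pixel_values: torch.Tensor) -> float:
    """One textual-inversion step on a batch of style images scaled to [-1, 1]."""
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    prompt = f"a movie still in the style of {placeholder}"
    input_ids = tokenizer([prompt] * latents.shape[0], padding="max_length",
                          max_length=tokenizer.model_max_length,
                          truncation=True, return_tensors="pt").input_ids
    encoder_hidden_states = text_encoder(input_ids)[0]

    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Restore every embedding row except the new style token, so only it is learned.
    with torch.no_grad():
        keep = torch.ones(embeddings.weight.shape[0], dtype=torch.bool)
        keep[token_id] = False
        embeddings.weight.data[keep] = orig_rows[keep]
    return loss.item()

# After training, the single learned row could be saved, e.g.:
# torch.save(embeddings.weight.data[token_id].clone(), "./Embeddings/story_0_Dramatic.pt")
# (the actual format written by style_embedding.py may differ).
```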
Video Instruction Data Generation
After training a style embedding, you can generate style-consistent keyframes from the aforementioned JSON file like this:
```bash
cd StyleImmobilization
python generate.py --js_path ./json/story_info_0.json --embed_path ./Embeddings/story_0_Dramatic.pt --keyword Dramatic --save_path ./save_path
```
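If you have several story files, a small driver like the one below can run generate.py over all of them. The story_info_*.json / story_*_Dramatic.pt naming is inferred from the example above, so adjust it to however your files are actually named.

```python
# Hypothetical batch driver for generate.py; the file-naming pattern is inferred
# from the example above and may need adjusting.
import glob
import re
import subprocess

for js_path in sorted(glob.glob("./json/story_info_*.json")):
    idx = re.search(r"story_info_(\d+)\.json", js_path).group(1)
    embed_path = f"./Embeddings/story_{idx}_Dramatic.pt"
    subprocess.run([
        "python", "generate.py",
        "--js_path", js_path,
        "--embed_path", embed_path,
        "--keyword", "Dramatic",
        "--save_path", f"./save_path/story_{idx}",
    ], check=True)
```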
Model
We provide our baseline model and a model trained on our generated dataset. For more details, refer to LLaMA-VID-model. Please follow LLaMA-VID to prepare the necessary settings, and feel free to use our provided checkpoints.
| Type | Max Token | Base LLM | Finetuning Data | Finetuning Schedule | Download |
|---|---|---|---|---|---|
| Base Model | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | ckpt |
| DreamFrame-7B | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + DreamFrameQA | full_ft-1e | ckpt |
Dataset
Data generated by our pipeline consists of keyframe images, corresponding QA pairs, and dialogues. You can download it here: DreamFrame-Data
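As a quick sanity check after downloading, a loading sketch like the following can pair keyframe images with their QA entries. The annotation file name and field names (qa.json, image, question, answer) are assumptions here, so adapt them to the actual layout of DreamFrame-Data.

```python
# Hypothetical loader: pairs keyframe images with QA entries.
# File layout and field names are assumptions; adjust to the released data.
import json
from pathlib import Path

from PIL import Image

data_root = Path("./DreamFrame-Data")
with open(data_root / "qa.json") as f:  # assumed annotation file name
    qa_entries = json.load(f)

samples = []
for entry in qa_entries:
    image = Image.open(data_root / entry["image"]).convert("RGB")
    samples.append({"image": image, "question": entry["question"], "answer": entry["answer"]})

print(f"Loaded {len(samples)} QA samples")
```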
Pipeline
Evaluation
We follow MVBench, Video-Bench and TempCompass to conduct evaluations.
Evaluation Results
Results
Generation Results
Comparison Results
Acknowledgement
We would like to thank the following repos for their great work:
- Our model is trained based on LLaMA-VID.
- We build our pipeline based on textual-inversion.