DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
We propose DreamFrame, a novel framework designed to create synthetic, high-quality data for video understanding.
Contents
- Install
- Generation
- Model
- Dataset
- Evaluation
- Acknowledgement
Install
Please follow the instructions below to install the required packages. Our training process is mainly based on LLaMA-VID, and our short-video evaluation process is mainly based on the quantitative_evaluation code from Video-ChatGPT.
Clone this repository
```bash
git clone https://github.com/Deaddawn/DreamFrame-code.git
```
Install packages (tested on A100 and RTX 3090 with CUDA 11.8). We recommend sticking to the package versions we provide, as changes in the versions of diffusers and transformers may lead to issues.
```bash
conda create -n DreamFrame python=3.10 -y
conda activate DreamFrame
cd DreamFrame-code
pip install -r requirements.txt
```
Generation
The data generation process of DreamFrame consists of three stages: (1) Movie Plot Generation, (2) Style Immobilization, and (3) Video Instruction Data Generation.
Movie Plot Generation
We adopt a story-expansion strategy that incrementally generates frame descriptions through three levels, and we provide example prompts for all three levels. Use any LLM (we use GPT-4) to generate the frame descriptions and organize them into a JSON file like this: story_js
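The LLM calls themselves are left to you, so the following is only a minimal sketch of how a three-level expansion loop could be wired up. `call_llm` is a placeholder for your own LLM client (we use GPT-4), and the JSON field names are hypothetical; match them to the provided story_js example.

```python
# Hypothetical sketch of the three-level story expansion.
# `call_llm` is a placeholder for your own LLM client (e.g. a GPT-4 wrapper);
# the JSON field names below are illustrative, not the repo's exact schema.
import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")


def expand_story(theme: str, level1_prompt: str, level2_prompt: str, level3_prompt: str) -> dict:
    # Level 1: theme -> overall movie plot
    plot = call_llm(level1_prompt.format(theme=theme))
    # Level 2: plot -> a list of scene summaries
    scenes = [s for s in call_llm(level2_prompt.format(plot=plot)).split("\n") if s.strip()]
    # Level 3: each scene -> detailed frame descriptions
    frames = [call_llm(level3_prompt.format(scene=s)) for s in scenes]
    return {"theme": theme, "plot": plot, "scenes": scenes, "frame_descriptions": frames}


if __name__ == "__main__":
    story = expand_story("a detective drama", "...", "...", "...")
    with open("./json/story_info_0.json", "w") as f:
        json.dump(story, f, indent=2)
```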
Style Immobilization Process
Style immobilization learns a style embedding that can be used to generate style-consistent keyframes. Learning the embedding requires a style-related keyword and a set of style-related images. The keyword can be obtained from stage one; for the style-related images, we simply use sdxl-1.0-base to generate them from the detailed style description (you can find an example in the prompts we provide).
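If you need to produce the style-related images yourself, a minimal diffusers sketch along the following lines should work. The style description and output paths here are placeholders, and the fp16/CUDA settings assume an NVIDIA GPU.

```python
# Minimal sketch: generate style-reference images with SDXL base 1.0 via diffusers.
# The style description below is a placeholder; use the detailed style description
# produced in stage one (see the example prompts we provide).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

style_description = "dramatic lighting, high-contrast cinematic movie still"  # placeholder
for i in range(8):
    image = pipe(prompt=style_description, num_inference_steps=30).images[0]
    image.save(f"./style/style_{i:02d}.png")
```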
Here is an example of how to train a style embedding, using the keyword "Dramatic":
```bash
cd StyleImmobilization
python style_embedding.py --style_keyword Dramatic --image_path ./style
```
The learned style embedding will be saved in the "Embeddings" folder. Training should only take 5-10 minutes (tested on an A100).
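For background only: the core update of a textual-inversion-style trainer (the technique our pipeline builds on; see Acknowledgement) looks roughly like the sketch below. It assumes a Stable Diffusion 1.5 backbone, a hypothetical placeholder token, and already-preprocessed style images; style_embedding.py's actual implementation and saved-file format may differ.

```python
# Illustrative textual-inversion update: only the new style token's embedding row is learned.
# The backbone (SD 1.5), token name, and save path are assumptions for this sketch.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

# Register a new token for the style keyword; only its embedding row is trained.
placeholder = "<dramatic-style>"
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(placeholder)

vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)
orig_rows = embeddings.weight.data.clone()
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)


def training_step(pixel_values: torch.Tensor) -> float:
    """One textual-inversion step on a batch of style images scaled to [-1, 1]."""
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    prompt = f"a movie still in the style of {placeholder}"
    input_ids = tokenizer([prompt] * latents.shape[0], padding="max_length",
                          max_length=tokenizer.model_max_length,
                          truncation=True, return_tensors="pt").input_ids
    encoder_hidden_states = text_encoder(input_ids)[0]

    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Restore every embedding row except the new style token, so only it is learned.
    with torch.no_grad():
        keep = torch.ones(embeddings.weight.shape[0], dtype=torch.bool)
        keep[token_id] = False
        embeddings.weight.data[keep] = orig_rows[keep]
    return loss.item()

# After training, the single learned row could be saved, e.g.:
# torch.save(embeddings.weight.data[token_id].clone(), "./Embeddings/story_0_Dramatic.pt")
# (the actual format written by style_embedding.py may differ).
```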
Video Instruction Data Generation
After training a style embedding, you can generate style-consistent keyframes from the aforementioned JSON file like this:
```bash
cd StyleImmobilization
python generate.py --js_path ./json/story_info_0.json --embed_path ./Embeddings/story_0_Dramatic.pt --keyword Dramatic --save_path ./save_path
```
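If you have several story files, a small driver like the one below can run generate.py over all of them. The story_info_*.json / story_*_Dramatic.pt naming is inferred from the example above, so adjust it to however your files are actually named.

```python
# Hypothetical batch driver for generate.py; the file-naming pattern is inferred
# from the example above and may need adjusting.
import glob
import re
import subprocess

for js_path in sorted(glob.glob("./json/story_info_*.json")):
    idx = re.search(r"story_info_(\d+)\.json", js_path).group(1)
    embed_path = f"./Embeddings/story_{idx}_Dramatic.pt"
    subprocess.run([
        "python", "generate.py",
        "--js_path", js_path,
        "--embed_path", embed_path,
        "--keyword", "Dramatic",
        "--save_path", f"./save_path/story_{idx}",
    ], check=True)
```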
Model
We provide our baseline model and a model trained on our generated dataset. For more details, refer to LLaMA-VID-model. Please follow LLaMA-VID to prepare the necessary settings, and feel free to use our provided checkpoints.
| Type | Max Token | Base LLM | Finetuning Data | Finetuning Schedule | Download |
|---|---|---|---|---|---|
| Base Model | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct | full_ft-1e | ckpt |
| DreamFrame-7B | 64K | Vicuna-7B-v1.5 | LLaVA1.5-VideoChatGPT-Instruct + DreamFrameQA | full_ft-1e | ckpt |
Dataset
Data generated by our pipeline consists of keyframe images, corresponding QA pairs, and dialogues. You can download it here: DreamFrame-Data
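As a quick sanity check after downloading, a loading sketch like the following can pair keyframe images with their QA entries. The annotation file name and field names (qa.json, image, question, answer) are assumptions here, so adapt them to the actual layout of DreamFrame-Data.

```python
# Hypothetical loader: pairs keyframe images with QA entries.
# File layout and field names are assumptions; adjust to the released data.
import json
from pathlib import Path

from PIL import Image

data_root = Path("./DreamFrame-Data")
with open(data_root / "qa.json") as f:  # assumed annotation file name
    qa_entries = json.load(f)

samples = []
for entry in qa_entries:
    image = Image.open(data_root / entry["image"]).convert("RGB")
    samples.append({"image": image, "question": entry["question"], "answer": entry["answer"]})

print(f"Loaded {len(samples)} QA samples")
```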
Pipeline
Evaluation
We follow MVBench, Video-Bench and TempCompass to conduct evaluations.
Evaluation Results
Results
Generation Results
Comparison Results
Acknowledgement
We would like to thank the following repos for their great work:
- Our model is trained based on LLaMA-VID.
- We build our pipeline based on textual-inversion.