
LAVIS - A Library for Language-Vision Intelligence

What's New: πŸŽ‰

A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.

A text-to-image generation model that trains 20x faster than DreamBooth. It also enables zero-shot subject-driven generation and editing.

A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.

A generic and efficient pre-training strategy that readily leverages off-the-shelf pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new state of the art on zero-shot captioning (121.6 CIDEr on NoCaps vs the previous best of 113.2). In addition, when equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 unlocks new zero-shot instructed vision-to-language generation capabilities for a variety of interesting applications.

A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3) while requiring no end-to-end training.

A modular zero-shot VQA framework that requires no training of pretrained language models (PLMs), achieving state-of-the-art zero-shot VQA performance.

Technical Report and Citing LAVIS

You can find more details in our technical report.

If you're using LAVIS in your research or applications, please cite it using this BibTeX:

```bibtex
@inproceedings{li-etal-2023-lavis,
    title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
    author = "Li, Dongxu and Li, Junnan and Le, Hung and Wang, Guangsen and Savarese, Silvio and Hoi, Steven C.H.",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-demo.3",
    pages = "31--41",
    abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
}
```


Introduction

LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. The library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and to benchmark them across standard and customized datasets. It features a unified interface to easily access state-of-the-art image-language and video-language models and common datasets.

Key features of LAVIS include a unified and modular interface to models, tasks and datasets; easy off-the-shelf inference and feature extraction with pretrained models; a model zoo of pretrained and finetuned checkpoints; and a dataset zoo with automatic downloading and organization tools.

The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.

| Tasks | Supported Models | Supported Datasets |
| --- | --- | --- |
| Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU, ConceptualCaptions |
| Image-text Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Text-image Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
| Visual Question Answering | ALBEF, BLIP | VQAv2, OKVQA, A-OKVQA |
| Image Captioning | BLIP | COCO, NoCaps |
| Image Classification | CLIP | ImageNet |
| Natural Language Visual Reasoning (NLVR) | ALBEF, BLIP | NLVR2 |
| Visual Entailment (VE) | ALBEF | SNLI-VE |
| Visual Dialogue | BLIP | VisDial |
| Video-text Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Text-video Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
| Video Question Answering (VideoQA) | BLIP, ALPRO | MSRVTT, MSVD |
| Video Dialogue | VGD-GPT | AVSD |
| Multimodal Feature Extraction | ALBEF, CLIP, BLIP, ALPRO | customized |
| Text-to-image Generation | [COMING SOON] | |

Installation

  1. (Optional) Create a conda environment:

```bash
conda create -n lavis python=3.8
conda activate lavis
```

  2. Install from PyPI:

```bash
pip install salesforce-lavis
```

  3. Or, for development, build from source:

```bash
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```

Getting Started

Model Zoo

The model zoo summarizes the models supported in LAVIS. To view it:

```python
from lavis.models import model_zoo
print(model_zoo)
```

```
==================================================
Architectures                  Types
==================================================
albef_classification           ve
albef_feature_extractor        base
albef_nlvr                     nlvr
albef_pretrain                 base
albef_retrieval                coco, flickr
albef_vqa                      vqav2
alpro_qa                       msrvtt, msvd
alpro_retrieval                msrvtt, didemo
blip_caption                   base_coco, large_coco
blip_classification            base
blip_feature_extractor         base
blip_nlvr                      nlvr
blip_pretrain                  base
blip_retrieval                 coco, flickr
blip_vqa                       vqav2, okvqa, aokvqa
clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
gpt_dialogue                   base
```
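Any architecture/type pair listed above can be loaded by name. As a minimal sketch, the load_model helper in lavis.models returns the bare model without its preprocessors (the more convenient load_model_and_preprocess is used in the examples below):

```python
import torch
from lavis.models import load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the BLIP captioning model (architecture "blip_caption", type "base_coco")
# without its preprocessors; preprocessing must then be handled manually
model = load_model(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
```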

Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from the local filesystem.

```python
import torch
from PIL import Image

# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
```

This example image shows Merlion park (source), a landmark in Singapore.
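If you are not running from a checkout of the repository, the path above will not exist; any RGB image works. Below is a minimal sketch that downloads an image over HTTP instead (the URL is only a placeholder, and requests is not part of LAVIS):

```python
import requests
from PIL import Image

# placeholder URL; substitute any image you want to run inference on
url = "https://example.com/some_image.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
```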

Image Captioning

In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each pre-trained model with its preprocessors (transforms), accessed via load_model_and_preprocess().

```python
import torch
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# generate caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']
```
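generate() can also return multiple, more diverse captions via sampling. A short sketch, assuming the use_nucleus_sampling and num_captions arguments of the BLIP captioning model (other architectures may expose different generation arguments):

```python
# sample three captions with nucleus sampling instead of the default beam search
captions = model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
print(captions)  # a list of 3 caption strings
```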

Visual question answering (VQA)

The BLIP model is able to answer free-form questions about images in natural language. To access the VQA model, simply replace the name and model_type arguments passed to load_model_and_preprocess().

```python
from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)

# ask a random question.
question = "Which city is this photo taken?"

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)

model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
# ['singapore']
```
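Besides free-form generation, the BLIP VQA model can also rank a fixed set of candidate answers. A rough sketch, assuming predict_answers() accepts inference_method="rank" together with an answer_list (the candidate answers below are made up for illustration):

```python
# rank a closed set of candidate answers instead of generating free-form text
candidates = ["singapore", "london", "new york"]  # illustrative candidates only
model.predict_answers(
    samples={"image": image, "text_input": question},
    answer_list=candidates,
    inference_method="rank",
    num_ans_candidates=len(candidates),
)
```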

Unified Feature Extraction Interface

LAVIS provides a unified interface to extract features from each architecture. To extract features, we load the feature extractor variant of each model. The multimodal features can be used for multimodal classification, while the low-dimensional unimodal features can be used to compute cross-modal similarity.

```python
from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)

caption = "a large fountain spewing water into the air"
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"](caption)
sample = {"image": image, "text_input": [text_input]}

features_multimodal = model.extract_features(sample)
print(features_multimodal.multimodal_embeds.shape)
# torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
print(features_image.image_embeds.shape)
# torch.Size([1, 197, 768])
print(features_text.text_embeds.shape)
# torch.Size([1, 12, 768])

# low-dimensional projected features
print(features_image.image_embeds_proj.shape)
# torch.Size([1, 197, 256])
print(features_text.text_embeds_proj.shape)
# torch.Size([1, 12, 256])

similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
print(similarity)
# tensor([[0.2622]])
```
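The same projected features can be used to rank several candidate texts against one image. A small sketch that builds directly on the feature extractor loaded above (the candidate captions are made up for illustration):

```python
# score several candidate captions against the image using the projected [CLS] features
candidates = [
    "a large fountain spewing water into the air",
    "a cat sleeping on a sofa",
    "people walking along a busy street",
]
image_feat = features_image.image_embeds_proj[:, 0, :]  # shape (1, 256)

for text in candidates:
    text_input = txt_processors["eval"](text)
    text_feats = model.extract_features({"image": image, "text_input": [text_input]}, mode="text")
    text_feat = text_feats.text_embeds_proj[:, 0, :]     # shape (1, 256)
    score = (image_feat @ text_feat.t()).item()
    print(f"{score:.4f}  {text}")
```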

Load Datasets

LAVIS supports a wide variety of common language-vision datasets and provides automatic download tools to help download and organize them. To view the supported datasets, use the following code:

```python
from lavis.datasets.builders import dataset_zoo

dataset_names = dataset_zoo.get_names()
print(dataset_names)
# ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
#  'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
#  'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
#  'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
```

After downloading the images, we can use load_dataset() to obtain the dataset.

```python
from lavis.datasets.builders import load_dataset

coco_dataset = load_dataset("coco_caption")

print(coco_dataset.keys())
# dict_keys(['train', 'val', 'test'])

print(len(coco_dataset["train"]))
# 566747

print(coco_dataset["train"][0])
# {'image': <PIL.Image.Image image mode=RGB size=640x480>,
#  'text_input': 'A woman wearing a net on her head cutting a cake. ',
#  'image_id': 0}
```

If you already host a local copy of the dataset, you can pass in the vis_path argument to change the default location to load images.

```python
coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
```
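A dataset sample can be fed straight back into the models loaded earlier. A short sketch that reuses the blip_caption model and vis_processors from the image captioning example above to caption the first training image:

```python
# caption the first COCO training image with the blip_caption model loaded earlier
sample = coco_dataset["train"][0]
image = vis_processors["eval"](sample["image"]).unsqueeze(0).to(device)
print(model.generate({"image": image}))
print(sample["text_input"])  # the annotated caption, for comparison
```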

Jupyter Notebook Examples

See examples for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, zero-shot classification.

Resources and Tools

Documentation

For more details and advanced usages, please refer to the documentation.

Ethical and Responsible Use

We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and inappropriate behaviors in the future.

Contact us

If you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.

License

BSD 3-Clause License