
[Chinese Version]

InternImage: Large-Scale Vision Foundation Model


The official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Blog in Chinese]

Highlights

News

History

Introduction

InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
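
To make "adaptive spatial aggregation" concrete, here is a minimal, illustrative PyTorch sketch of the idea behind DCNv3-style deformable sampling: each output location samples the input at a few learned offsets and blends the samples with learned, softmax-normalized weights. This toy single-group version is built on `F.grid_sample` and is not the repository's actual operator, which is a fused CUDA kernel with grouped channels (see `tensorrt/modulated_deform_conv_v3`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeformableAggregation(nn.Module):
    """Toy DCNv3-style block: per-location learned offsets + modulation.
    Single-group and unfused; for illustration only."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.n = kernel_size * kernel_size
        # Predict 2 offsets (x, y) and one softmax-normalized weight per sampling tap.
        self.offset = nn.Conv2d(channels, 2 * self.n, 3, padding=1)
        self.weight = nn.Conv2d(channels, self.n, 3, padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = self.offset(x)                 # (B, 2n, H, W)
        weights = self.weight(x).softmax(dim=1)  # (B, n, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)     # (H, W, 2)
        scale = torch.tensor([w, h], device=x.device, dtype=x.dtype)

        out = 0
        for i in range(self.n):
            # Offsets are predicted in pixels relative to the query location;
            # convert them to the normalized coordinates grid_sample expects.
            off = offsets[:, 2 * i : 2 * i + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = base.unsqueeze(0) + 2 * off / scale
            out = out + F.grid_sample(x, grid, align_corners=False) * weights[:, i : i + 1]
        return self.proj(out)

# y = ToyDeformableAggregation(64)(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```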

Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."

Performance

Classification

| Task | Benchmark | acc@1 |
| :--- | :--- | :---: |
| Image Classification | ImageNet | 90.1 |
| Scene Classification | Places365 | 61.2 |
| Scene Classification | Places205 | 71.7 |
| Long-Tail Classification | iNaturalist 2018 | 92.6 |

Detection

| Task | Benchmark | Score |
| :--- | :--- | :---: |
| General Object Detection | COCO | 65.5 |
| General Object Detection | VOC 2007 | 94.0 |
| General Object Detection | VOC 2012 | 97.2 |
| General Object Detection | OpenImage | 74.1 |
| Long-Tail Object Detection | LVIS minival | 65.8 |
| Long-Tail Object Detection | LVIS val | 63.2 |
| Autonomous Driving Object Detection | BDD100K | 38.8 |
| Autonomous Driving Object Detection | nuScenes | 64.8 |
| Dense Object Detection | CrowdHuman | 97.2 |

Segmentation

| Task | Benchmark | mIoU |
| :--- | :--- | :---: |
| Semantic Segmentation | ADE20K | 62.9 |
| Semantic Segmentation | COCO-Stuff-10K | 59.6 |
| Semantic Segmentation | Pascal Context | 70.3 |
| Street Segmentation | Cityscapes | 87.0 |
| RGBD Segmentation | NYU Depth V2 | 68.1 |

Released Models

Open-Source Visual Pretrained Models

| name | pretrain | resolution | #param | download |
| :--- | :--- | :--- | :---: | :--- |
| InternImage-L | IN-22K | 384x384 | 223M | pth \| hf |
| InternImage-XL | IN-22K | 384x384 | 335M | pth \| hf |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B | pth \| hf |
| InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B | pth \| hf |
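
Released checkpoints are ordinary PyTorch `.pth` files, so a quick sanity check after downloading is to load one on CPU and inspect its state dict. The filename below is a placeholder for whichever checkpoint you grabbed from the table above:

```python
import torch

# Placeholder filename -- substitute the .pth you downloaded.
ckpt = torch.load("internimage_l_22k_384.pth", map_location="cpu")

# Checkpoints are often dicts that wrap the weights under a key such as "model".
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} tensors, for example:")
for name, tensor in list(state.items())[:5]:
    print(f"  {name}: {tuple(tensor.shape)}")
```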

ImageNet-1K Image Classification

| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :--- |
| InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G | pth \| hf \| cfg |
| InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G | pth \| hf \| cfg |
| InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G | pth \| hf \| cfg |
| InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G | pth \| hf \| cfg |
| InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G | pth \| hf \| cfg |
| InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G | pth \| hf \| cfg |
| InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G | pth \| hf \| cfg |
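
As a quick sketch of how these classification numbers are produced, the snippet below runs single-image inference with the standard ImageNet preprocessing at the 224x224 evaluation resolution. `build_model` and the file names are hypothetical placeholders; the real entry points live in this repo's `classification/` code.

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet eval preprocessing for the 224x224 models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = build_model("internimage_t_1k_224")  # hypothetical helper, not this repo's API
model.load_state_dict(torch.load("internimage_t_1k_224.pth", map_location="cpu")["model"])
model.eval()

with torch.no_grad():
    x = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
    probs, classes = model(x).softmax(dim=-1).topk(5)  # top-5 predictions
```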

COCO Object Detection and Instance Segmentation

| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :--- |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | ckpt \| cfg |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 1x | 54.9 | 47.7 | 277M | 1399G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 3x | 56.1 | 48.5 | 277M | 1399G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 1x | 55.3 | 48.1 | 387M | 1782G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 3x | 56.2 | 48.8 | 387M | 1782G | ckpt \| cfg |

| backbone | method | box mAP (val/test) | #param | download |
| :--- | :--- | :---: | :---: | :--- |
| CB-InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | ckpt \| cfg |
| CB-InternImage-G | DINO (TTA) | 65.3 / 65.5 | 6B | TODO |
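
The COCO numbers above follow the standard evaluation protocol, so if you dump detector outputs to a results JSON you can score them with `pycocotools` directly. The file paths below are placeholders for your own annotation and prediction files:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")          # [{image_id, category_id, bbox, score}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox") # use iouType="segm" for mask mAP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line is AP@[.50:.95], the box mAP reported above
```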

ADE20K Semantic Segmentation

| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :--- |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | ckpt \| cfg |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | ckpt \| cfg |
| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | ckpt \| cfg |
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | ckpt \| cfg |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | ckpt \| cfg |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | ckpt \| cfg |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |
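
In the mIoU (ss/ms) column, "ss" denotes single-scale testing and "ms" denotes multi-scale testing with flipping. Below is a minimal sketch of the multi-scale protocol, assuming a generic segmentor that returns per-pixel class logits (the scale set is illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_logits(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5), flip=True):
    """Average class logits over rescaled (and horizontally flipped) inputs.
    `model` maps (B, 3, h, w) images to (B, num_classes, h, w) logits."""
    b, c, h, w = image.shape
    total = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        logits = model(x)
        if flip:
            # Run the flipped input and flip the logits back before summing.
            logits = logits + torch.flip(model(torch.flip(x, dims=[-1])), dims=[-1])
        # Resize logits to the original resolution before accumulating.
        total = total + F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
    return total

# pred = multi_scale_logits(segmentor, img).argmax(dim=1)  # (B, h, w) label map
```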

Main Results of FPS

Export classification model from PyTorch to TensorRT

Export detection model from PyTorch to TensorRT

Export segmentation model from PyTorch to TensorRT

| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :--- | :--- | :---: | :---: | :---: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
| InternImage-B | 224x224 | 97M | 16G | 116 |
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
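
For a rough comparison on your own hardware, batch-1 throughput of the eager PyTorch model can be timed with CUDA events as below. Note that the table reports TensorRT-engine numbers, which this sketch does not reproduce exactly:

```python
import torch

@torch.no_grad()
def batch1_fps(model, resolution=224, warmup=20, iters=100):
    """Approximate batch-1 FPS of a PyTorch model on the current GPU."""
    model = model.cuda().eval()
    x = torch.randn(1, 3, resolution, resolution, device="cuda")
    for _ in range(warmup):          # warm up kernels and the allocator
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return iters / (start.elapsed_time(end) / 1000.0)  # elapsed_time is in milliseconds
```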

Before using mmdeploy to convert our PyTorch models to TensorRT, please make sure the DCNv3 custom operator is built correctly. You can build it with the following commands:

```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops; you can find them at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install mmdeploy after building custom ops
cd ${MMDEPLOY_DIR}
pip install -e .
```

For more details on building custom ops, please refer to this document.

Foundation Models

Autonomous Driving

Application in Challenges

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@inproceedings{wang2023internimage,
  title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={14408--14419},
  year={2023}
}
```