
[Chinese Version]

InternImage: Large-Scale Vision Foundation Model


The official implementation of InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions.

[Paper] [Blog in Chinese]

Highlights

News

History

Introduction

InternImage is an advanced vision foundation model developed by researchers from Shanghai AI Laboratory, Tsinghua University, and other institutions. Unlike models based on Transformers, InternImage employs DCNv3 as its core operator. This approach equips the model with dynamic and effective receptive fields required for downstream tasks like object detection and segmentation, while enabling adaptive spatial aggregation.
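
To make "adaptive spatial aggregation" concrete, here is a minimal, illustrative PyTorch sketch of the idea behind DCNv3-style deformable sampling: each output location samples the input at a few learned offsets and blends the samples with learned, softmax-normalized weights. This toy single-group version is built on `F.grid_sample` and is not the repository's actual operator, which is a fused CUDA kernel with grouped channels (see `tensorrt/modulated_deform_conv_v3`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeformableAggregation(nn.Module):
    """Toy DCNv3-style block: per-location learned offsets + modulation.
    Single-group and unfused; for illustration only."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.n = kernel_size * kernel_size
        # Predict 2 offsets (x, y) and one softmax-normalized weight per sampling tap.
        self.offset = nn.Conv2d(channels, 2 * self.n, 3, padding=1)
        self.weight = nn.Conv2d(channels, self.n, 3, padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        offsets = self.offset(x)                 # (B, 2n, H, W)
        weights = self.weight(x).softmax(dim=1)  # (B, n, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)     # (H, W, 2)
        scale = torch.tensor([w, h], device=x.device, dtype=x.dtype)

        out = 0
        for i in range(self.n):
            # Offsets are predicted in pixels relative to the query location;
            # convert them to the normalized coordinates grid_sample expects.
            off = offsets[:, 2 * i : 2 * i + 2].permute(0, 2, 3, 1)  # (B, H, W, 2)
            grid = base.unsqueeze(0) + 2 * off / scale
            out = out + F.grid_sample(x, grid, align_corners=False) * weights[:, i : i + 1]
        return self.proj(out)

# y = ToyDeformableAggregation(64)(torch.randn(1, 64, 56, 56))  # -> (1, 64, 56, 56)
```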

Some other projects related to InternImage include the pretraining algorithm "M3I-Pretraining," the general-purpose decoder series "Uni-Perceiver," and the autonomous driving perception encoder series "BEVFormer."

Performance

Classification

| Task | Benchmark | acc@1 |
| :--- | :--- | :---: |
| Image Classification | ImageNet | 90.1 |
| Scene Classification | Places365 | 61.2 |
| Scene Classification | Places205 | 71.7 |
| Long-Tail Classification | iNaturalist 2018 | 92.6 |

Detection

| Task | Benchmark | Score |
| :--- | :--- | :---: |
| General Object Detection | COCO | 65.5 |
| General Object Detection | VOC 2007 | 94.0 |
| General Object Detection | VOC 2012 | 97.2 |
| General Object Detection | OpenImage | 74.1 |
| Long-Tail Object Detection | LVIS minival | 65.8 |
| Long-Tail Object Detection | LVIS val | 63.2 |
| Autonomous Driving Object Detection | BDD100K | 38.8 |
| Autonomous Driving Object Detection | nuScenes | 64.8 |
| Dense Object Detection | CrowdHuman | 97.2 |

Segmentation

| Task | Benchmark | mIoU |
| :--- | :--- | :---: |
| Semantic Segmentation | ADE20K | 62.9 |
| Semantic Segmentation | COCO-Stuff-10K | 59.6 |
| Semantic Segmentation | Pascal Context | 70.3 |
| Street Segmentation | Cityscapes | 87.0 |
| RGBD Segmentation | NYU Depth V2 | 68.1 |

Released Models

Open-Source Visual Pretrained Models

| name | pretrain | resolution | #param | download |
| :--- | :--- | :--- | :---: | :--- |
| InternImage-L | IN-22K | 384x384 | 223M | pth \| hf |
| InternImage-XL | IN-22K | 384x384 | 335M | pth \| hf |
| InternImage-H | Joint 427M -> IN-22K | 384x384 | 1.08B | pth \| hf |
| InternImage-G | Joint 427M -> IN-22K | 384x384 | 3B | pth \| hf |
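
Released checkpoints are ordinary PyTorch `.pth` files, so a quick sanity check after downloading is to load one on CPU and inspect its state dict. The filename below is a placeholder for whichever checkpoint you grabbed from the table above:

```python
import torch

# Placeholder filename -- substitute the .pth you downloaded.
ckpt = torch.load("internimage_l_22k_384.pth", map_location="cpu")

# Checkpoints are often dicts that wrap the weights under a key such as "model".
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state)} tensors, for example:")
for name, tensor in list(state.items())[:5]:
    print(f"  {name}: {tuple(tensor.shape)}")
```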

ImageNet-1K Image Classification

| name | pretrain | resolution | acc@1 | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :--- |
| InternImage-T | IN-1K | 224x224 | 83.5 | 30M | 5G | pth \| hf \| cfg |
| InternImage-S | IN-1K | 224x224 | 84.2 | 50M | 8G | pth \| hf \| cfg |
| InternImage-B | IN-1K | 224x224 | 84.9 | 97M | 16G | pth \| hf \| cfg |
| InternImage-L | IN-22K | 384x384 | 87.7 | 223M | 108G | pth \| hf \| cfg |
| InternImage-XL | IN-22K | 384x384 | 88.0 | 335M | 163G | pth \| hf \| cfg |
| InternImage-H | Joint 427M -> IN-22K | 640x640 | 89.6 | 1.08B | 1478G | pth \| hf \| cfg |
| InternImage-G | Joint 427M -> IN-22K | 512x512 | 90.1 | 3B | 2700G | pth \| hf \| cfg |
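
As a quick sketch of how these classification numbers are produced, the snippet below runs single-image inference with the standard ImageNet preprocessing at the 224x224 evaluation resolution. `build_model` and the file names are hypothetical placeholders; the real entry points live in this repo's `classification/` code.

```python
import torch
from PIL import Image
from torchvision import transforms

# Standard ImageNet eval preprocessing for the 224x224 models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = build_model("internimage_t_1k_224")  # hypothetical helper, not this repo's API
model.load_state_dict(torch.load("internimage_t_1k_224.pth", map_location="cpu")["model"])
model.eval()

with torch.no_grad():
    x = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
    probs, classes = model(x).softmax(dim=-1).topk(5)  # top-5 predictions
```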

COCO Object Detection and Instance Segmentation

| backbone | method | schd | box mAP | mask mAP | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :--- |
| InternImage-T | Mask R-CNN | 1x | 47.2 | 42.5 | 49M | 270G | ckpt \| cfg |
| InternImage-T | Mask R-CNN | 3x | 49.1 | 43.7 | 49M | 270G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 1x | 47.8 | 43.3 | 69M | 340G | ckpt \| cfg |
| InternImage-S | Mask R-CNN | 3x | 49.7 | 44.5 | 69M | 340G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 1x | 48.8 | 44.0 | 115M | 501G | ckpt \| cfg |
| InternImage-B | Mask R-CNN | 3x | 50.3 | 44.8 | 115M | 501G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 1x | 54.9 | 47.7 | 277M | 1399G | ckpt \| cfg |
| InternImage-L | Cascade Mask R-CNN | 3x | 56.1 | 48.5 | 277M | 1399G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 1x | 55.3 | 48.1 | 387M | 1782G | ckpt \| cfg |
| InternImage-XL | Cascade Mask R-CNN | 3x | 56.2 | 48.8 | 387M | 1782G | ckpt \| cfg |

| backbone | method | box mAP (val/test) | #param | download |
| :--- | :--- | :---: | :---: | :--- |
| CB-InternImage-H | DINO (TTA) | 65.0 / 65.4 | 2.18B | ckpt \| cfg |
| CB-InternImage-G | DINO (TTA) | 65.3 / 65.5 | 6B | TODO |
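
The COCO numbers above follow the standard evaluation protocol, so if you dump detector outputs to a results JSON you can score them with `pycocotools` directly. The file paths below are placeholders for your own annotation and prediction files:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")          # [{image_id, category_id, bbox, score}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox") # use iouType="segm" for mask mAP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line is AP@[.50:.95], the box mAP reported above
```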

ADE20K Semantic Segmentation

| backbone | method | resolution | mIoU (ss/ms) | #param | FLOPs | download |
| :--- | :--- | :--- | :---: | :---: | :---: | :--- |
| InternImage-T | UperNet | 512x512 | 47.9 / 48.1 | 59M | 944G | ckpt \| cfg |
| InternImage-S | UperNet | 512x512 | 50.1 / 50.9 | 80M | 1017G | ckpt \| cfg |
| InternImage-B | UperNet | 512x512 | 50.8 / 51.3 | 128M | 1185G | ckpt \| cfg |
| InternImage-L | UperNet | 640x640 | 53.9 / 54.1 | 256M | 2526G | ckpt \| cfg |
| InternImage-XL | UperNet | 640x640 | 55.0 / 55.3 | 368M | 3142G | ckpt \| cfg |
| InternImage-H | UperNet | 896x896 | 59.9 / 60.3 | 1.12B | 3566G | ckpt \| cfg |
| InternImage-H | Mask2Former | 896x896 | 62.5 / 62.9 | 1.31B | 4635G | ckpt \| cfg |
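
In the mIoU (ss/ms) column, "ss" denotes single-scale testing and "ms" denotes multi-scale testing with flipping. Below is a minimal sketch of the multi-scale protocol, assuming a generic segmentor that returns per-pixel class logits (the scale set is illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_logits(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5), flip=True):
    """Average class logits over rescaled (and horizontally flipped) inputs.
    `model` maps (B, 3, h, w) images to (B, num_classes, h, w) logits."""
    b, c, h, w = image.shape
    total = 0.0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        logits = model(x)
        if flip:
            # Run the flipped input and flip the logits back before summing.
            logits = logits + torch.flip(model(torch.flip(x, dims=[-1])), dims=[-1])
        # Resize logits to the original resolution before accumulating.
        total = total + F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
    return total

# pred = multi_scale_logits(segmentor, img).argmax(dim=1)  # (B, h, w) label map
```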

Main Results of FPS

Export classification model from PyTorch to TensorRT

Export detection model from PyTorch to TensorRT

Export segmentation model from PyTorch to TensorRT

| name | resolution | #param | FLOPs | batch 1 FPS (TensorRT) |
| :--- | :--- | :---: | :---: | :---: |
| InternImage-T | 224x224 | 30M | 5G | 156 |
| InternImage-S | 224x224 | 50M | 8G | 129 |
| InternImage-B | 224x224 | 97M | 16G | 116 |
| InternImage-L | 384x384 | 223M | 108G | 56 |
| InternImage-XL | 384x384 | 335M | 163G | 47 |
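
For a rough comparison on your own hardware, batch-1 throughput of the eager PyTorch model can be timed with CUDA events as below. Note that the table reports TensorRT-engine numbers, which this sketch does not reproduce exactly:

```python
import torch

@torch.no_grad()
def batch1_fps(model, resolution=224, warmup=20, iters=100):
    """Approximate batch-1 FPS of a PyTorch model on the current GPU."""
    model = model.cuda().eval()
    x = torch.randn(1, 3, resolution, resolution, device="cuda")
    for _ in range(warmup):          # warm up kernels and the allocator
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return iters / (start.elapsed_time(end) / 1000.0)  # elapsed_time is in milliseconds
```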

Before using mmdeploy to convert our PyTorch models to TensorRT, please make sure the DCNv3 custom operator is built correctly. You can build it with the following commands:

```shell
export MMDEPLOY_DIR=/the/root/path/of/MMDeploy

# prepare our custom ops; you can find them at InternImage/tensorrt/modulated_deform_conv_v3
cp -r modulated_deform_conv_v3 ${MMDEPLOY_DIR}/csrc/mmdeploy/backend_ops/tensorrt

# build custom ops
cd ${MMDEPLOY_DIR}
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) && make install

# install mmdeploy after building custom ops
cd ${MMDEPLOY_DIR}
pip install -e .
```

For more details on building custom ops, please refer to this document.

Foundation Models

Autonomous Driving

Application in Challenges

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

```bibtex
@inproceedings{wang2023internimage,
  title={Internimage: Exploring large-scale vision foundation models with deformable convolutions},
  author={Wang, Wenhai and Dai, Jifeng and Chen, Zhe and Huang, Zhenhang and Li, Zhiqi and Zhu, Xizhou and Hu, Xiaowei and Lu, Tong and Lu, Lewei and Li, Hongsheng and others},
  booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages={14408--14419},
  year={2023}
}
```