Mask RCNN (original) (raw)
Mask R-CNN
Last Updated : 20 May, 2026
Mask R-CNN is an advanced computer vision model used for object detection and instance segmentation. It extends Faster R-CNN by adding a mask prediction branch that generates pixel-level segmentation masks for detected objects.
- Detects objects and predicts bounding boxes
- Generates segmentation masks for each object instance
- Uses a Fully Convolutional Network (FCN) for mask prediction
- Provides accurate pixel-level object segmentation
- Widely used in computer vision and image analysis applications
Instance Segmentation
Instance segmentation identifies and separates each individual object present in an image by assigning unique pixel-level masks to every object instance.
- Detects and segments each object separately
- Classifies individual pixels belonging to objects
- Generates segmentation masks for each object instance
- Provides detailed object boundaries and localisation
- Helps improve object understanding in images

Instance Segmentation
Working of Mask R-CNN
Mask R-CNN extends the two-stage Faster R-CNN architecture by adding a separate mask prediction branch for instance segmentation. It detects objects, classifies them and generates pixel-level segmentation masks for each object instance.
- Uses a Region Proposal Network (RPN) to generate object proposals
- Performs object classification and bounding box prediction
- Adds a parallel mask branch for segmentation mask generation
- Uses RoI Align for accurate pixel-to-pixel feature alignment
- Produces class labels, bounding boxes and segmentation masks as output
Mask R-CNN Architecture
Mask R-CNN was proposed by Kaiming He et al. in 2017 as an extension of Faster R-CNN for instance segmentation. Along with object detection and bounding box prediction, it also generates a binary segmentation mask for each detected object.
Mask R-CNN Architecture
Main components include:
- Backbone Network
- Region Proposal Network
- Mask Representation
- RoI Align
1. Backbone Network
The backbone network extracts feature maps from the input image using architectures like ResNet-C4 and ResNet-FPN.
- Uses deep CNNs for feature extraction
- Feature Pyramid Network (FPN) improves multi-scale object detection
- Uses 1×1 and 3×3 convolutions for efficient feature processing
- Generates feature maps such as P2, P3, P4, P5 and P6
Mask R-CNN backbone architecture
2. Region Proposal Network
The RPN generates candidate object regions from convolutional feature maps.
- Uses 3×3 convolution layers for proposal generation
- Predicts objectness scores and bounding box coordinates
- Uses anchor boxes with different aspect ratios
- Helps identify potential object locations efficiently
Anchor Generation Mask R-CNN
3. Mask Representation
The mask branch predicts segmentation masks for each Region of Interest (RoI).
- Uses a Fully Convolutional Network (FCN) for mask prediction
- Preserves pixel-level spatial information
- Generates an m×m segmentation mask for each class
- Uses RoI Align to create fixed-size feature maps for mask generation
4. RoI Align
RoI align has the same motive as of RoI pool, to generate the fixed size regions of interest from region proposals. It works in the following steps:
ROI Align
Given the feature map of the previous Convolution layer of size _h*w, divide this feature map into _M * N grids of equal size (we will NOT just take integer value).
The mask R-CNN inference speed is around _2 fps, which is good considering the addition of a segmentation branch in the architecture.
Applications
- Human pose estimation and body part detection
- Self-driving cars for object and lane detection
- Drone image mapping and aerial analysis
- Medical image segmentation and analysis
- Video surveillance and object tracking
- Image editing and augmented reality applications
Advantages
- Reduces computational cost compared to exhaustive search methods
- Flexible architecture that supports different backbone networks
- Achieves state-of-the-art performance in instance segmentation tasks
Limitations
- Requires high computing resources such as GPUs
- Needs detailed pixel-level annotated datasets for training
- Training and inference can be slower compared to simpler detection models
- Less suitable for real-time applications with strict latency requirements