What are the main steps in a typical Computer Vision Pipeline?

Last Updated : 11 Aug, 2025

Computer vision is a field of artificial intelligence (AI) that enables machines to interpret and understand the visual world. By using digital images from cameras and videos and deep learning models, machines can accurately identify and classify objects and then react to what they “see.” A computer vision pipeline outlines the steps required to process and analyze visual data. Here, we delve into the main steps of a typical computer vision pipeline.

1. Image Acquisition

Collect raw data with digital cameras, webcams, CCTV systems, smartphones, drones, satellites, X-ray, MRI or microscopy devices. Use still images, video frames or multi-spectral formats. Consider lighting, object position, environmental factors, sensor calibration, frame rate and image resolution at this stage.

```python
import cv2

# cv2.imread returns None instead of raising when the file cannot be read
img = cv2.imread('sample_image.jpg')
if img is None:
    raise FileNotFoundError(
        "sample_image.jpg not found. Please upload an image with this name.")
```

Loads an image named "sample_image.jpg". If the file is missing or unreadable, cv2.imread returns None, so the code raises a FileNotFoundError with an explanatory message.

2. Preprocessing

Clean and standardize raw images before analysis. Reduce noise with filters such as Gaussian blur, resize images to the input size the model expects and normalize pixel values to a common range. For training, augment the data with random flips and rotations.

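```python
import torch
from torchvision import transforms

# Reduce noise, resize to the 224x224 model input size, scale pixels to [0, 1]
img_blur = cv2.GaussianBlur(img, (5, 5), 0)
img_resized = cv2.resize(img_blur, (224, 224))
img_normalized = img_resized / 255.0

# Augmentation pipeline; ToTensor() also scales uint8 pixels to [0, 1]
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor()
])

# ToPILImage works reliably on uint8 RGB arrays, while OpenCV loads BGR
img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
img_tensor = transform(img_rgb)
img_tensor = torch.unsqueeze(img_tensor, 0)  # add a batch dimension
```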

Applies Gaussian blur for noise reduction, resizes the image to 224×224 pixels for deep learning model compatibility, normalizes pixel values and augments data using flip and rotation. Converts image to tensor format for model input.
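Scaling to [0, 1] is often only part of normalization: classification networks pretrained on ImageNet are typically fed inputs standardized per channel. The sketch below illustrates that step with NumPy alone. It assumes an RGB image already scaled to [0, 1] and uses the widely published ImageNet channel statistics; the function name standardize is illustrative and does not appear in the code above.

```python
import numpy as np

# Widely used ImageNet channel statistics (RGB order). This is an assumption:
# it is appropriate only for models pretrained on ImageNet.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def standardize(img_rgb01):
    """Standardize an HxWx3 RGB image with values in [0, 1], channel by channel."""
    img = np.asarray(img_rgb01, dtype=np.float32)
    # Broadcasting applies each channel's mean and std across all pixels
    return (img - IMAGENET_MEAN) / IMAGENET_STD

# Usage: a small mid-gray image stands in for real data
demo_gray = np.full((2, 2, 3), 0.5, dtype=np.float32)
demo_standardized = standardize(demo_gray)
```

After this step each channel is centered near zero, which is roughly the input distribution such a pretrained backbone saw during training.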

3. Image Segmentation

Apply simple thresholding (e.g., Otsu’s method), adaptive thresholding or edge detectors like Canny, Sobel, Prewitt or Laplacian. Explore region-based methods like Watershed, region growing, SLIC or Felzenszwalb superpixels. For pixel-wise analysis, use semantic segmentation models (U-Net, DeepLab, FCN) or apply instance segmentation such as Mask R-CNN. Refine segment masks with morphological operations like erosion or dilation.

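```python
import numpy as np

# Canny expects an 8-bit image, so scale the normalized image back to 0-255
edges = cv2.Canny((img_normalized * 255).astype(np.uint8), 100, 200)
```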

Performs edge detection on the normalized image using the Canny algorithm, highlighting object boundaries and regions of interest.
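Otsu's method, mentioned above as a simple thresholding option, picks the gray level that best separates dark and bright pixels. The following is a minimal NumPy sketch of the idea, assuming an 8-bit grayscale image; the function name otsu_threshold and the synthetic test image are illustrative only. In practice OpenCV provides this through cv2.threshold with the cv2.THRESH_OTSU flag.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (uint8 input)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of the class "<= t"
    w1 = 1.0 - w0                        # weight of the class "> t"
    cum_mean = np.cumsum(prob * levels)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance; undefined (0/0) where one class is empty
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))

# Usage: synthetic image with a dark half (50) and a bright half (200)
demo_gray = np.full((10, 10), 50, dtype=np.uint8)
demo_gray[:, 5:] = 200
t = otsu_threshold(demo_gray)
demo_mask = demo_gray > t  # True for foreground (bright) pixels
```

The resulting boolean mask can then be refined with morphological operations such as erosion or dilation.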

4. Feature Extraction

Use classical keypoint detectors like Harris, FAST, SIFT, SURF or ORB. Extract feature descriptors such as BRIEF, FREAK or LBP. Add shape descriptors (Hu Moments, Fourier descriptors) or texture features (Gabor filters, Haralick features). Analyze color and edge histograms. For modern pipelines, extract multi-scale features with convolutional neural networks, often leveraging transfer learning from pretrained models.

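```python
import torch
from torchvision import models

# Load ResNet-18 with ImageNet weights (torchvision >= 0.13;
# older releases use models.resnet18(pretrained=True))
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
```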

Initializes a ResNet-18 model, which can produce deep features from images via its convolutional layers, either for use directly or for transfer learning.
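For comparison with learned features, a hand-crafted descriptor can be very small. The sketch below builds a normalized color histogram, one of the classical features listed above, using NumPy only. The function name color_histogram, the choice of 8 bins and the random demo image are assumptions made for illustration.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenate per-channel histograms of an HxWx3 uint8 image into one vector."""
    features = []
    for c in range(image.shape[2]):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        features.append(hist / hist.sum())  # normalize so image size does not matter
    return np.concatenate(features)

# Usage: a random image stands in for real data
rng = np.random.default_rng(0)
demo_img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
demo_features = color_histogram(demo_img)  # vector of length 3 * bins
```

A vector like this can be passed to a classical classifier such as an SVM or k-NN, whereas CNN features are usually much higher-dimensional and learned from data.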

5. Object Detection

Use classical approaches like sliding windows with HOG+SVM or Viola-Jones. For deep learning, apply single-stage models (YOLO, SSD) for real-time tasks or two-stage models (Faster R-CNN) for high-accuracy projects. Output bounding box coordinates, class labels and confidence scores. Use anchor boxes and multi-scale detection when appropriate.

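```python
# Load Faster R-CNN with COCO weights (torchvision >= 0.13;
# older releases use pretrained=True)
detector = models.detection.fasterrcnn_resnet50_fpn(
    weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
detector.eval()  # switch to inference mode

with torch.no_grad():  # gradients are not needed for inference
    outputs = detector(img_tensor)
```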

Deploys a pre-trained Faster R-CNN model to detect objects within the image tensor. Returns bounding boxes, labels and scores for each detection.
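The classical sliding-window approach mentioned above can be sketched in a few lines. This example only generates candidate windows; in a real detector each patch would then be scored by a classifier such as HOG+SVM, which is omitted here. The function name sliding_windows, the window size and the stride are illustrative assumptions.

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) for every window that fits fully inside the image."""
    win_h, win_w = window
    height, width = image.shape[:2]
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

# Usage: a blank 128x128 image yields a 3x3 grid of overlapping windows
demo_image = np.zeros((128, 128), dtype=np.uint8)
demo_windows = list(sliding_windows(demo_image))
```

The number of windows grows quickly with image size and with the number of scales searched, which is one reason single-stage and two-stage deep detectors replaced this approach for most tasks.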

6. Object Recognition and Classification

Assign detected objects to classes using algorithms including SVM, k-NN, Decision Trees or deep CNNs. Connect softmax layers for multi-class tasks. Use transfer learning by fine-tuning existing networks for custom datasets. Address multi-label objects and hierarchical categories if necessary.

```python
if len(outputs[0]['labels']) > 0:
    # Detections are sorted by score, so index 0 is the most confident one
    best_label = outputs[0]['labels'][0].item()
    print(f'Predicted class of first detected object: {best_label}')
else:
    print('No objects detected.')
```

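Prints the class label of the first detected object. Because the model returns detections sorted by descending confidence, this is its most confident prediction. The label is an integer index into the COCO categories the model was trained on, so map it to a category name to make it readable.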
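The softmax layer mentioned above turns raw class scores (logits) into probabilities that sum to 1. Here is a minimal, numerically stable NumPy sketch; the logits and the class names are made up for illustration and do not come from the detector above.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Usage: hypothetical raw scores for three classes
class_names = ["cat", "dog", "bird"]  # illustrative labels only
probs = softmax([2.0, 1.0, 0.1])
predicted = class_names[int(np.argmax(probs))]
```

Subtracting the maximum before exponentiating leaves the result unchanged mathematically but prevents overflow for large logits.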

7. Post-Processing

Apply non-max suppression to remove overlapping or duplicate detections. Filter results by confidence thresholds. For video, aggregate and stabilize predictions using tracking algorithms like Kalman filter, SORT or DeepSORT.

```python
import matplotlib.pyplot as plt
import numpy as np

if len(outputs[0]['boxes']) > 0:
    boxes = outputs[0]['boxes'].detach().numpy()
    scores = outputs[0]['scores'].detach().numpy()

    # NMSBoxes expects [x, y, width, height]; the model returns [x1, y1, x2, y2]
    boxes_xywh = [[x1, y1, x2 - x1, y2 - y1] for x1, y1, x2, y2 in boxes.tolist()]
    indices = cv2.dnn.NMSBoxes(
        boxes_xywh, scores.tolist(), score_threshold=0.5, nms_threshold=0.3)

    if len(indices) > 0:
        for i in np.array(indices).flatten():
            box = boxes[i]
            cv2.rectangle(img_resized, (int(box[0]), int(box[1])),
                          (int(box[2]), int(box[3])), (255, 0, 0), 2)
        print("Image processing complete. Visualization requires displaying the image.")
        # OpenCV stores images as BGR; matplotlib expects RGB
        plt.imshow(cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB))
        plt.axis('off')
        plt.show()
    else:
        print("No objects remaining after NMS.")
else:
    print("No boxes detected by the model.")
```

Performs non-max suppression to refine detection results, draws bounding boxes on the image and displays the result with matplotlib, which also works in notebook environments where OpenCV's GUI windows are unavailable.
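To make the idea behind non-max suppression concrete, the following is a minimal greedy implementation in NumPy. It is a sketch of the general technique, not the algorithm as implemented inside OpenCV or torchvision; the function names iou and nms, the 0.5 threshold and the demo boxes are assumptions. Boxes use the [x1, y1, x2, y2] format.

```python
import numpy as np

def iou(box, others):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # survivors for the next round
    return keep

# Usage: two near-duplicate boxes and one separate box
demo_boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 150, 150]]
demo_scores = [0.9, 0.8, 0.7]
kept = nms(demo_boxes, demo_scores)
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box does not overlap and is kept.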

8. Visualization and Interpretation

Overlay bounding boxes, masks or keypoints on outputs. Display performance metrics including accuracy, precision, recall, F1-score, IoU and confusion matrices. Build interactive and real-time dashboards for monitoring and reporting.
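The metrics listed above follow directly from counts of true positives, false positives and false negatives. The sketch below computes precision, recall and F1-score for a binary task in plain Python; the function name and the label lists are made up for illustration. Libraries such as scikit-learn offer equivalent, more complete implementations.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage: made-up ground truth and predictions (3 TP, 1 FP, 1 FN)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With three true positives, one false positive and one false negative, all three metrics equal 0.75 in this example.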
