Image Basics, OpenCV Operations, Filters & Edge Detection, Object Detection, Segmentation, Deep Learning Models, Face Detection — CV mastery.
Understanding how images are represented digitally is fundamental to computer vision. Images are multi-dimensional arrays of pixel values, and manipulating these arrays is the foundation of all CV work.
import cv2
import numpy as np
from PIL import Image
# ── Image Representation ──
# Images are NumPy arrays: shape = (height, width, channels)
img = cv2.imread('photo.jpg') # BGR format (OpenCV default)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # Convert to RGB
gray = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE) # Grayscale
print(f"Shape: {img.shape}") # (H, W, 3) for color, (H, W) for gray
print(f"Data type: {img.dtype}") # uint8 (0-255)
print(f"Size: {img.size} values") # H * W * channels (total array elements, not pixel count)
print(f"Pixel (100,100): {img[100, 100]}") # BGR values at (row, col)
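No image file is needed to see this in action. A minimal NumPy-only sketch (the `photo.jpg` path above is just a placeholder) showing that an image really is nothing more than a uint8 array:

```python
import numpy as np

# Build a 2x2 BGR "image" by hand: just an (H, W, 3) uint8 array
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]   # blue pixel (BGR channel order)
img[1, 1] = [0, 0, 255]   # red pixel

print(img.shape, img.dtype, img.size)  # (2, 2, 3) uint8 12
```

Everything OpenCV does downstream — cropping, filtering, thresholding — is array manipulation on exactly this kind of object.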
# ── Color Spaces ──
img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # Hue, Saturation, Value
img_lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB) # L*a*b* (perceptual)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # Grayscale
# ── Image Properties ──
height, width = img.shape[:2]
center = (width // 2, height // 2)
total_pixels = height * width
aspect_ratio = width / height
# ── Pixel Operations ──
# Access and modify pixels
pixel = img[100, 100] # BGR values at row=100, col=100
img[100, 100] = [255, 0, 0] # Set pixel to blue
# Region of Interest (ROI)
roi = img[50:200, 100:300] # Crop region [y1:y2, x1:x2]
img[0:150, 0:200] = roi # Paste ROI elsewhere
# ── Image Arithmetic ──
img_add = cv2.add(img1, img2) # Saturated addition (clips at 255)
img_sub = cv2.subtract(img1, img2) # Saturated subtraction
img_blend = cv2.addWeighted(img1, 0.7, img2, 0.3, 0) # Blend images
img_bitwise = cv2.bitwise_and(img, img, mask=mask) # Keep only pixels where binary mask > 0
img_invert = cv2.bitwise_not(img) # Invert colors
| Format | Compression | Transparency | Animation | Best For |
|---|---|---|---|---|
| JPEG | Lossy | No | No | Photographs, web images |
| PNG | Lossless | Yes (alpha) | No | Graphics, screenshots, overlays |
| WebP | Both (lossy/lossless) | Yes | Yes | Modern web, smaller than JPEG/PNG |
| BMP | None | No | No | Raw image storage, Windows |
| TIFF | Lossless (LZW) | Yes | No | Printing, high-quality archival |
| GIF | Lossless | Yes (binary) | Yes | Simple animations, low-color graphics |
| AVIF | Both (lossy/lossless) | Yes | Yes | Next-gen web (smaller than JPEG at similar quality) |
| SVG | Vector | Yes | Yes (CSS) | Icons, logos, scalable graphics |
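One detail from the image-arithmetic block above deserves a demonstration: `cv2.add` saturates, while plain NumPy `+` wraps around on uint8 overflow. A NumPy-only check of the difference (no OpenCV required):

```python
import numpy as np

a = np.array([200], dtype=np.uint8)
b = np.array([100], dtype=np.uint8)

wrapped = a + b                               # uint8 overflow wraps: 300 % 256 = 44
saturated = np.clip(a.astype(np.int32) + b.astype(np.int32),
                    0, 255).astype(np.uint8)  # what cv2.add effectively does

print(wrapped[0], saturated[0])  # 44 255
```

This is why brightening an image with raw `+` produces dark artifacts in highlights, while `cv2.add` does not.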
# ── Geometric Transformations ──
# Resize (interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_CUBIC, INTER_LANCZOS4)
resized = cv2.resize(img, (300, 200)) # Absolute size
resized = cv2.resize(img, None, fx=0.5, fy=0.5) # Scale factor
resized = cv2.resize(img, (300, 200), interpolation=cv2.INTER_AREA) # Downscale
# Rotation
M = cv2.getRotationMatrix2D(center=center, angle=45, scale=1.0)
rotated = cv2.warpAffine(img, M, (width, height))
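Under the hood, `cv2.getRotationMatrix2D` just builds a 2x3 affine matrix from the standard rotation formula. A NumPy sketch of that construction (same formula as the OpenCV documentation), applied to a single point:

```python
import numpy as np

def rotation_matrix(center, angle_deg, scale=1.0):
    """Build the 2x3 affine matrix that cv2.getRotationMatrix2D returns."""
    cx, cy = center
    a = np.deg2rad(angle_deg)
    alpha, beta = scale * np.cos(a), scale * np.sin(a)
    return np.array([[alpha,  beta, (1 - alpha) * cx - beta * cy],
                     [-beta, alpha, beta * cx + (1 - alpha) * cy]])

# Rotate the point (1, 0) by 90 degrees CCW about the origin
M = rotation_matrix((0, 0), 90)
pt = M @ np.array([1.0, 0.0, 1.0])   # homogeneous coordinates
print(pt)  # ~[0, -1]: CCW in image coordinates, where y points down
```

`warpAffine` then applies this same matrix to every pixel coordinate.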
# Translation
M = np.float32([[1, 0, 50], [0, 1, 30]]) # Shift right 50, down 30
translated = cv2.warpAffine(img, M, (width, height))
# Affine Transformation (3 point pairs)
pts1 = np.float32([[50, 50], [200, 50], [50, 200]])
pts2 = np.float32([[10, 100], [200, 50], [100, 250]])
M = cv2.getAffineTransform(pts1, pts2)
warped = cv2.warpAffine(img, M, (width, height))
# Perspective Transform (4 point pairs)
pts1 = np.float32([[56, 65], [368, 52], [28, 387], [389, 390]])
pts2 = np.float32([[0, 0], [300, 0], [0, 300], [300, 300]])
M = cv2.getPerspectiveTransform(pts1, pts2)
warped = cv2.warpPerspective(img, M, (300, 300))
# Flip
flipped_h = cv2.flip(img, 1) # Horizontal
flipped_v = cv2.flip(img, 0) # Vertical
flipped_b = cv2.flip(img, -1) # Both axes
OpenCV provides a comprehensive set of image processing operations. Mastering these operations is essential for building any computer vision pipeline.
import cv2
import numpy as np
# ── Thresholding ──
gray = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)
# Simple threshold
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
# Adaptive threshold (handles uneven lighting)
adaptive = cv2.adaptiveThreshold(
gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY, blockSize=11, C=2
)
# Otsu's method (automatic threshold selection)
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
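Otsu's method simply tries every threshold and keeps the one that maximizes the variance between the two resulting classes. A brute-force NumPy sketch of that selection, run on a tiny synthetic bimodal image:

```python
import numpy as np

def otsu_threshold(gray):
    """Brute-force Otsu: choose t maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # all pixels on one side: not a valid split
        mu0 = (np.arange(t) * hist[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * hist[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated intensity clusters around 50 and 200
gray = np.array([[50, 52, 48], [200, 198, 202]], dtype=np.uint8)
print(otsu_threshold(gray))  # lands between the two clusters
```

OpenCV's implementation is equivalent but incremental, so it runs in a single pass over the histogram.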
# ── Morphological Operations ──
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
erosion = cv2.erode(binary, kernel, iterations=1) # Shrink white regions
dilation = cv2.dilate(binary, kernel, iterations=1) # Grow white regions
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel) # Erode then dilate (remove noise)
closing = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel) # Dilate then erode (fill holes)
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel) # Edge detection
tophat = cv2.morphologyEx(binary, cv2.MORPH_TOPHAT, kernel) # Difference between input and opening
blackhat = cv2.morphologyEx(binary, cv2.MORPH_BLACKHAT, kernel) # Difference between closing and input
# ── Contour Detection ──
contours, hierarchy = cv2.findContours(
binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
)
for contour in contours:
area = cv2.contourArea(contour) # Area
perimeter = cv2.arcLength(contour, True) # Perimeter
x, y, w, h = cv2.boundingRect(contour) # Bounding rectangle
    rot_rect = cv2.minAreaRect(contour) # Min-area rotated rect: ((cx, cy), (w, h), angle)
    (cx, cy), radius = cv2.minEnclosingCircle(contour) # Minimum enclosing circle
hull = cv2.convexHull(contour) # Convex hull
approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True) # Polygon approximation
if len(approx) == 4: # Rectangle detection
print("Rectangle detected")
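`cv2.contourArea` computes the polygon area of the contour points via Green's theorem; the discrete version is the shoelace formula, shown here in plain Python:

```python
# Shoelace formula: the polygon-area computation behind cv2.contourArea
def shoelace_area(pts):
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(shoelace_area(square))  # 16.0
```

This is why `contourArea` can differ slightly from the pixel count inside the contour: it measures the polygon, not the raster.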
# ── Drawing Functions ──
img_draw = img.copy()
cv2.rectangle(img_draw, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.circle(img_draw, (int(cx), int(cy)), int(radius), (255, 0, 0), -1) # Filled circle (int coords required)
cv2.line(img_draw, (0, 0), (100, 100), (0, 0, 255), 2)
cv2.putText(img_draw, 'Label', (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
cv2.polylines(img_draw, [points], True, (0, 255, 0), 2)
# ── Histogram Analysis ──
# Calculate histogram
hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
# Color histogram
color_hist_b = cv2.calcHist([img], [0], None, [256], [0, 256])
color_hist_g = cv2.calcHist([img], [1], None, [256], [0, 256])
color_hist_r = cv2.calcHist([img], [2], None, [256], [0, 256])
# ── Histogram Equalization ──
equalized = cv2.equalizeHist(gray)
# CLAHE (Contrast Limited Adaptive Histogram Equalization)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_img = clahe.apply(gray)
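What `equalizeHist` actually does is remap each intensity through the image's normalized cumulative histogram. A NumPy sketch of that lookup-table construction (assumes a non-constant image, so the CDF spans more than one value):

```python
import numpy as np

def equalize(gray):
    """Histogram equalization: remap intensities through the normalized CDF."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # first non-empty bin
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[gray]

gray = np.array([[0, 0], [128, 255]], dtype=np.uint8)
eq = equalize(gray)
print(eq.min(), eq.max())  # 0 255 -> contrast stretched to the full range
```

CLAHE applies the same idea per tile, with the clip limit preventing any single intensity from dominating the local CDF.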
# ── Image Blurring (Noise Reduction) ──
avg_blur = cv2.blur(img, (5, 5)) # Average blur
gaussian = cv2.GaussianBlur(img, (5, 5), 0) # Gaussian blur
median = cv2.medianBlur(img, 5) # Median blur (salt/pepper)
bilateral = cv2.bilateralFilter(img, 9, 75, 75) # Bilateral (edge-preserving)
# Non-local means (best quality, slow)
nlm = cv2.fastNlMeansDenoisingColored(img, None, h=10, hColor=10,
                                      templateWindowSize=7, searchWindowSize=21)
| Function | Purpose | Key Parameters |
|---|---|---|
| cv2.imread() | Read image from file | path, flags (IMREAD_COLOR/GRAYSCALE/UNCHANGED) |
| cv2.imwrite() | Save image to file | path, img, params (JPEG quality, PNG compression) |
| cv2.cvtColor() | Convert color space | src, code (COLOR_BGR2RGB, COLOR_BGR2GRAY, etc.) |
| cv2.resize() | Resize image | src, dsize or fx/fy, interpolation |
| cv2.threshold() | Binary threshold | src, thresh, maxval, type (BINARY, OTSU, TRUNC) |
| cv2.Canny() | Canny edge detection | image, threshold1, threshold2 |
| cv2.findContours() | Find contours | image, mode (RETR_EXTERNAL), method |
| cv2.warpAffine() | Apply affine transform | src, M, dsize |
| cv2.bilateralFilter() | Edge-preserving blur | src, d, sigmaColor, sigmaSpace |
| cv2.matchTemplate() | Template matching | image, templ, method (TM_CCOEFF_NORMED) |
Filters and edge detection are fundamental image processing techniques. Convolution-based filters can enhance, blur, sharpen, and detect edges in images.
import cv2
import numpy as np
# ── Canny Edge Detection (Most Popular) ──
edges = cv2.Canny(img, threshold1=50, threshold2=150)
# threshold1 = lower bound (weak edges)
# threshold2 = upper bound (strong edges)
# Edges between thresholds are kept only if connected to strong edges
# Automatic Canny edge detection (based on median pixel value)
def auto_canny(image, sigma=0.33):
median = np.median(image)
lower = int(max(0, (1.0 - sigma) * median))
upper = int(min(255, (1.0 + sigma) * median))
return cv2.Canny(image, lower, upper)
# ── Sobel Operator ──
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3) # x-gradient (responds to vertical edges)
sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3) # y-gradient (responds to horizontal edges)
sobel_mag = np.sqrt(sobel_x**2 + sobel_y**2) # Magnitude
sobel_mag = np.uint8(np.clip(sobel_mag, 0, 255))
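Every one of these operators is a small convolution. A plain-NumPy 'valid'-mode correlation, applied to a synthetic vertical step edge, makes the Sobel-x behavior concrete:

```python
import numpy as np

def conv2d(img, k):
    """'valid'-mode 2D correlation, the core of every gradient filter."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
# A vertical step edge: dark left half, bright right half
step = np.array([[0, 0, 255, 255]] * 4, dtype=float)
resp = conv2d(step, sobel_x)
print(resp)  # strong positive response along the vertical edge
```

`cv2.filter2D` does the same computation, vectorized and with configurable border handling.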
# ── Laplacian Operator ──
laplacian = cv2.Laplacian(gray, cv2.CV_64F)
laplacian = np.uint8(np.clip(np.abs(laplacian), 0, 255))
# ── Scharr Operator (more accurate than Sobel 3x3) ──
scharr_x = cv2.Scharr(gray, cv2.CV_64F, 1, 0)
scharr_y = cv2.Scharr(gray, cv2.CV_64F, 0, 1)
# ── LoG (Laplacian of Gaussian) ──
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
log = cv2.Laplacian(blurred, cv2.CV_64F, ksize=3)
# ── Custom Convolution Kernel ──
# Sharpening kernel
sharpen = np.array([[0, -1, 0],
[-1, 5, -1],
[0, -1, 0]])
sharpened = cv2.filter2D(img, -1, sharpen)
# Emboss kernel
emboss = np.array([[-2, -1, 0],
[-1, 1, 1],
[ 0, 1, 2]])
embossed = cv2.filter2D(img, -1, emboss)
| Method | Noise Sensitivity | Edge Quality | Parameter(s) | Best For |
|---|---|---|---|---|
| Canny | Low (built-in smoothing) | Clean, thin edges | t1, t2 thresholds | General purpose, most popular |
| Sobel | Medium | Thick edges, directional | ksize (3,5,7) | Gradient direction analysis |
| Laplacian | High | All edges, no direction | ksize | Fast edge detection |
| LoG (Marr-Hildreth) | Low (Gaussian smoothing) | Zero-crossing edges | sigma (kernel size) | Blob detection, scale-space |
| Scharr | Medium | More accurate than Sobel | None | Better rotation invariance |
| Kernel | 3x3 Matrix | Effect |
|---|---|---|
| Identity | [[0,0,0],[0,1,0],[0,0,0]] | No change (passthrough) |
| Box Blur | [[1,1,1],[1,1,1],[1,1,1]]/9 | Average blur (smooth) |
| Gaussian 3x3 | [[1,2,1],[2,4,2],[1,2,1]]/16 | Weighted blur (smooth) |
| Sharpen | [[0,-1,0],[-1,5,-1],[0,-1,0]] | Enhance edges |
| Laplacian (edge) | [[0,-1,0],[-1,4,-1],[0,-1,0]] | Edge response (kernel sums to 0) |
| Emboss | [[-2,-1,0],[-1,1,1],[0,1,2]] | 3D emboss effect |
| Sobel X | [[-1,0,1],[-2,0,2],[-1,0,1]] | Vertical edges (x-gradient) |
| Sobel Y | [[-1,-2,-1],[0,0,0],[1,2,1]] | Horizontal edges (y-gradient) |
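A quick sanity check on the kernels in the table: blur and sharpen kernels sum to 1, so they preserve overall brightness in flat regions, while gradient kernels sum to 0 and respond only where intensity changes:

```python
import numpy as np

sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

flat = np.full((3, 3), 42.0)     # a perfectly flat patch
print((flat * sharpen).sum())    # 42.0 -> kernel sums to 1, flat areas unchanged
print((flat * sobel_x).sum())    # 0.0  -> kernel sums to 0, no edge means no response
```

Checking the kernel sum is a useful habit when designing custom filters for `cv2.filter2D`.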
Object detection locates and classifies multiple objects in an image. Modern approaches range from two-stage detectors (Faster R-CNN) to single-stage (YOLO) to transformer-based (DETR).
import cv2
import numpy as np
# ── YOLO with OpenCV (DNN module) ──
# Load pre-trained YOLO model
net = cv2.dnn.readNetFromDarknet('yolov4.cfg', 'yolov4.weights')
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]
# Load COCO class names (80 classes)
with open('coco.names', 'r') as f:
classes = [line.strip() for line in f.readlines()]
# Process image
def detect_objects(image, conf_threshold=0.5, nms_threshold=0.4):
height, width = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)
boxes, confidences, class_ids = [], [], []
for output in outputs:
for detection in output:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > conf_threshold:
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
# Non-Maximum Suppression (remove overlapping boxes)
indices = cv2.dnn.NMSBoxes(boxes, confidences,
conf_threshold, nms_threshold)
results = []
    for i in np.array(indices).flatten(): # flat array in OpenCV 4.x; flatten() also handles older nested output
results.append({
'class': classes[class_ids[i]],
'confidence': confidences[i],
'box': boxes[i]
})
return results
results = detect_objects(img)
for r in results:
x, y, w, h = r['box']
cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.putText(img, f"{r['class']} {r['confidence']:.2f}",
                (x, y-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,255,0), 2)
# ── YOLO with Ultralytics (Recommended) ──
from ultralytics import YOLO
# Load pre-trained YOLOv8 model
model = YOLO('yolov8n.pt') # nano (fastest), s, m, l, x
# Train on custom dataset (YOLO format: images/ + labels/)
results = model.train(
data='dataset.yaml',
epochs=100,
imgsz=640,
batch=16,
device='0', # GPU device
workers=8,
pretrained=True,
optimizer='AdamW',
lr0=0.01,
augment=True,
mosaic=True,
)
# Inference
results = model('image.jpg')
for result in results:
boxes = result.boxes
for box in boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
xyxy = box.xyxy[0].tolist()
class_name = model.names[cls]
# Export to different formats
model.export(format='onnx') # ONNX
model.export(format='engine') # TensorRT
model.export(format='openvino') # OpenVINO
model.export(format='tflite') # TensorFlow Lite
| Model | Architecture | Speed (FPS) | mAP (COCO) | Best For |
|---|---|---|---|---|
| YOLOv8n | Single-stage | 130+ (T4) | 37.3 | Edge deployment, real-time |
| YOLOv8x | Single-stage | 25 (T4) | 53.9 | Highest accuracy YOLO |
| YOLOv9 | PGI + GELAN | 100+ (T4) | 55.6 | State-of-the-art single-stage |
| Faster R-CNN | Two-stage | 10-20 (T4) | 42.0 | High accuracy, slower |
| SSD | Single-stage | 30-60 (T4) | 25.1 | Balanced speed/accuracy |
| DETR | Transformer | 20 (T4) | 46.2 | End-to-end, no NMS needed |
| RT-DETR | Real-time DETR | 100+ (T4) | 53.1 | Real-time transformer-based |
| EfficientDet | BiFPN | 40-80 (T4) | 39.6 | Efficient for mobile |
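The IoU and NMS steps that `cv2.dnn.NMSBoxes` performs above are simple enough to sketch in plain NumPy — greedy NMS keeps the highest-scoring box and drops everything that overlaps it too much:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, thresh=0.4):
    """Greedy NMS: keep the best box, drop everything overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        mask = np.array([iou(boxes[i], boxes[j]) <= thresh for j in rest],
                        dtype=bool)
        order = rest[mask]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too heavily
```

Note that DETR-style models in the table skip this step entirely; their set-based loss makes duplicate boxes rare by construction.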
Image segmentation classifies every pixel in an image, going beyond bounding boxes to provide precise object boundaries. It ranges from semantic segmentation to instance segmentation.
# ── Semantic Segmentation with OpenCV (GrabCut) ──
import cv2
import numpy as np
img = cv2.imread('photo.jpg')
mask = np.zeros(img.shape[:2], np.uint8)
# Define foreground and background rectangles
rect = (50, 50, 450, 290)
# GrabCut algorithm
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
# Create binary mask (foreground = 1, background = 0)
binary_mask = np.where((mask == 2) | (mask == 0), 0, 1).astype('uint8')
segmented = img * binary_mask[:, :, np.newaxis]
# ── Watershed Segmentation ──
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Noise removal
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
# Sure background area
sure_bg = cv2.dilate(opening, kernel, iterations=3)
# Finding sure foreground area using distance transform
dist_transform = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
ret, sure_fg = cv2.threshold(dist_transform, 0.7 * dist_transform.max(), 255, 0)
# Unknown region
sure_fg = np.uint8(sure_fg)
unknown = cv2.subtract(sure_bg, sure_fg)
# Marker labeling
ret, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1 # Add 1 to all labels (background = 1)
markers[unknown == 255] = 0 # Unknown region = 0
# Watershed
markers = cv2.watershed(img, markers)
img[markers == -1] = [255, 0, 0] # Mark boundaries in red
# ── Segment Anything Model (SAM) ──
from segment_anything import SamPredictor, sam_model_registry
# Load model (vit_h, vit_l, vit_b)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
# Set image
predictor.set_image(image)
# Predict with point prompt
input_point = np.array([[500, 375]]) # x, y coordinates
input_label = np.array([1]) # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True
)
# masks: (num_masks, H, W) boolean arrays
# scores: confidence scores for each mask
# Predict with bounding box prompt
input_box = np.array([425, 600, 700, 875]) # x1, y1, x2, y2
masks, scores, logits = predictor.predict(
box=input_box,
multimask_output=False
)
| Type | Output | What It Does | Example Model | Use Case |
|---|---|---|---|---|
| Semantic | Class per pixel | Classifies each pixel into categories | DeepLabV3+, FCN, U-Net | Scene understanding, road segmentation |
| Instance | Mask per object | Separates individual objects of same class | Mask R-CNN, SOLOv2, YOLOv8-seg | Counting objects, individual tracking |
| Panoptic | Both semantic + instance | Complete scene understanding | Panoptic FPN, MaskFormer | Autonomous driving, robotics |
| Interactive | User-guided masks | Click/box-based segmentation | SAM (Meta), IRIS | Photo editing, annotation tools |
| Medical | Organ/lesion segmentation | Precise medical image analysis | U-Net, nnU-Net, Swin UNETR | Tumor detection, organ delineation |
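Segmentation quality is usually scored per pixel rather than per box; the Dice coefficient (close cousin of mask IoU) is the standard metric, shown here on toy masks:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(pred, gt))  # 2*2 / (3+3) ≈ 0.667
```

Dice doubles the weight of the intersection, which makes it more forgiving than IoU on small structures — one reason it dominates in medical segmentation (U-Net, nnU-Net).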
Deep learning has revolutionized computer vision. Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are the backbone of modern CV systems.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision import datasets
from torch.utils.data import DataLoader
# ── Pre-trained Models (Transfer Learning) ──
# ResNet-50 (pretrained on ImageNet)
resnet = models.resnet50(weights='IMAGENET1K_V2')
num_features = resnet.fc.in_features # 2048
resnet.fc = nn.Linear(num_features, num_classes) # Replace final layer
# EfficientNet-V2
efficientnet = models.efficientnet_v2_s(weights='IMAGENET1K_V1')
# Vision Transformer (ViT)
vit = models.vit_b_16(weights='IMAGENET1K_V1')
# ── Image Preprocessing Pipeline ──
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224), # Standard input size for most models
transforms.ToTensor(), # Convert PIL to Tensor [0, 1]
transforms.Normalize( # ImageNet statistics
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
),
])
# ── Training Loop ──
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = resnet.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
for epoch in range(num_epochs):
model.train()
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
    scheduler.step()
| Architecture | Year | Parameters | Top-1 Acc | Key Innovation | Best For |
|---|---|---|---|---|---|
| ResNet-50 | 2015 | 25.6M | 80.4% | Skip connections (residual learning) | General classification baseline |
| ResNet-152 | 2016 | 60.2M | 82.0% | Deeper residual network | High-accuracy classification |
| EfficientNet-V2-S | 2021 | 21.5M | 83.9% | Compound scaling + Fused-MBConv | Efficient, mobile-friendly |
| EfficientNet-V2-L | 2021 | 119M | 85.1% | Larger compound scaling | High accuracy with efficiency |
| ConvNeXt-L | 2022 | 199M | 85.1% | Modernized CNN (matches ViT) | CNN alternative to ViT |
| ViT-B/16 | 2020 | 86M | 81.8% | Pure transformer for images | Transfer learning, large datasets |
| ViT-L/16 | 2020 | 307M | 85.0% | Large ViT with more data | Best accuracy with large data |
| Swin Transformer | 2021 | 88M | 86.3% | Hierarchical window attention | Dense prediction tasks |
| DINOv2 | 2023 | 305M | 86.5% | Self-supervised ViT | Feature extraction, zero-shot |
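The training loop above minimizes cross-entropy. What `nn.CrossEntropyLoss` computes per sample — softmax followed by negative log-likelihood — fits in a few NumPy lines:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax + negative log-likelihood: what nn.CrossEntropyLoss computes."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, 0))  # small: correct class has the largest logit
print(cross_entropy(logits, 2))  # large: wrong class is penalized
```

The max-subtraction trick is the same one PyTorch uses internally to avoid overflow in `exp` for large logits.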
Face detection and recognition are among the most widely deployed CV applications. From phone unlock to surveillance, understanding face detection pipelines is essential.
import cv2
import numpy as np
# ── Haar Cascade Face Detection (Classic, Fast, OpenCV) ──
face_cascade = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
eye_cascade = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_eye.xml'
)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(
gray, scaleFactor=1.1, minNeighbors=5,
minSize=(30, 30), flags=cv2.CASCADE_SCALE_IMAGE
)
for (x, y, w, h) in faces:
cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
roi_gray = gray[y:y+h, x:x+w]
eyes = eye_cascade.detectMultiScale(roi_gray, scaleFactor=1.1, minNeighbors=5)
for (ex, ey, ew, eh) in eyes:
cv2.rectangle(img, (x+ex, y+ey), (x+ex+ew, y+ey+eh), (0, 255, 0), 2)
# ── DNN Face Detection (More Accurate) ──
face_net = cv2.dnn.readNetFromCaffe('deploy.prototxt', 'res10_300x300.caffemodel')
blob = cv2.dnn.blobFromImage(cv2.resize(img, (300, 300)), 1.0,
(104.0, 177.0, 123.0))
face_net.setInput(blob)
detections = face_net.forward()
h, w = img.shape[:2]
for i in range(detections.shape[2]):
confidence = detections[0, 0, i, 2]
if confidence > 0.7:
box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
(startX, startY, endX, endY) = box.astype('int')
        cv2.rectangle(img, (startX, startY), (endX, endY), (0, 255, 0), 2)
# ── Face Recognition with face_recognition library ──
# pip install face_recognition
import face_recognition
# Load and encode known faces
known_image = face_recognition.load_image_file("known_person.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]
# Find faces in unknown image
unknown_image = face_recognition.load_image_file("group_photo.jpg")
face_locations = face_recognition.face_locations(unknown_image)
face_encodings = face_recognition.face_encodings(unknown_image, face_locations)
# Compare faces
for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
matches = face_recognition.compare_faces([known_encoding], face_encoding, tolerance=0.6)
distance = face_recognition.face_distance([known_encoding], face_encoding)
    name = "Known Person" if matches[0] else "Unknown"
| Method | Speed | Accuracy | Lighting Robust | Best For |
|---|---|---|---|---|
| Haar Cascade | Very Fast (CPU) | Low-Medium | Low | Real-time, low-resource devices |
| DNN (Caffe) | Fast (CPU) | Medium-High | Medium | Better accuracy, still CPU-friendly |
| MTCNN | Medium | High | High | Landmark detection + alignment |
| RetinaFace | Fast (GPU) | Very High | High | Production face detection |
| MediaPipe | Very Fast (CPU/GPU) | High | High | Mobile, real-time, face mesh |
| BlazeFace | Very Fast (Mobile) | Medium | Medium | Android/iOS face detection |
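The matching rule behind `compare_faces` is just a Euclidean-distance threshold on 128-D embeddings. A NumPy sketch of that rule — the random vectors here are stand-ins for real face encodings, which come from a trained network:

```python
import numpy as np

def is_match(known, candidate, tolerance=0.6):
    """face_recognition-style match: Euclidean distance between embeddings."""
    return np.linalg.norm(known - candidate) <= tolerance

rng = np.random.default_rng(0)
known = rng.normal(size=128)                      # stand-in embedding
same = known + rng.normal(scale=0.01, size=128)   # near-duplicate of the same face
other = rng.normal(size=128)                      # unrelated embedding

print(is_match(known, same), is_match(known, other))  # True False
```

Lowering `tolerance` makes matching stricter (fewer false accepts, more false rejects), which is the main tuning knob in a recognition pipeline.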
Essential computer vision interview questions.
Q: What is the difference between object detection and image segmentation?
Answer: Object detection draws bounding rectangles around objects and classifies them. It answers "what objects are where?" but doesn't give precise pixel-level boundaries.
Segmentation classifies every pixel in the image. Semantic segmentation assigns a class to each pixel (road, car, person). Instance segmentation further distinguishes between individual objects of the same class (car #1, car #2). Segmentation provides much more precise boundaries than bounding boxes.
Q: How does a convolutional neural network (CNN) process images?
Answer: A CNN processes images through learnable filters (kernels) that slide across the image. Each filter detects a specific pattern: edges, textures, shapes, objects. Early layers detect simple patterns (edges, colors); deeper layers detect complex patterns (faces, car wheels).
A convolution operation multiplies the filter weights with the input patch and sums the results. This produces a feature map highlighting where the pattern appears. Pooling layers downsample feature maps. Fully connected layers at the end perform classification.
Key properties: Weight sharing (same filter applied everywhere), translation invariance (detects pattern anywhere), hierarchical features (simple to complex).
Q: What is data augmentation, and why is it important in computer vision?
Answer: Data augmentation creates variations of training images to artificially increase dataset size and improve model generalization. Common augmentations: random flips, rotations, crops, color jitter, Gaussian noise, perspective transforms, Cutout/MixUp/CutMix.
It is critical because: (1) CV models need large datasets, but labeled data is expensive. (2) It prevents overfitting by showing the model variations. (3) It makes the model invariant to transformations it should handle (e.g., horizontal flips for objects that have no inherent orientation).
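Several of these augmentations are one-liners on the raw array (a hypothetical toy image here; real pipelines use torchvision.transforms or albumentations):

```python
import numpy as np

img = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)   # tiny H x W x C image

h_flip = img[:, ::-1]                                   # horizontal flip
v_flip = img[::-1]                                      # vertical flip
brighter = np.clip(img.astype(np.int16) + 50, 0, 255).astype(np.uint8)  # brightness jitter

print(np.array_equal(h_flip[:, 0], img[:, -1]))         # True: columns reversed
```

For detection and segmentation, remember that the same geometric transform must be applied to boxes and masks, not just pixels.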
Q: What are IoU and mAP, and how do they evaluate object detectors?
Answer: IoU (Intersection over Union) measures the overlap between a predicted bounding box and ground truth. IoU = intersection area / union area. It ranges from 0 (no overlap) to 1 (perfect match). A detection is considered correct if IoU exceeds a threshold (typically 0.5).
mAP (mean Average Precision) is the mean of Average Precision across all classes. AP is the area under the precision-recall curve for a single class. mAP@0.5 uses IoU=0.5 as the correctness threshold. mAP@0.5:0.95 averages mAP across IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO's primary metric).
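A tiny worked AP computation under the trapezoidal rule (COCO actually uses 101-point interpolation; the precision/recall values here are illustrative):

```python
import numpy as np

# Precision at increasing recall levels for one class (illustrative values)
recall    = np.array([0.0, 0.5, 1.0])
precision = np.array([1.0, 1.0, 0.5])

# AP = area under the precision-recall curve (trapezoidal approximation)
ap = (((precision[1:] + precision[:-1]) / 2) * np.diff(recall)).sum()
print(ap)  # 0.875
```

mAP then averages this AP across classes, and mAP@0.5:0.95 additionally averages it across IoU thresholds.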
Q: How do object detectors handle foreground/background class imbalance?
Answer: Object detection naturally has class imbalance: most image patches are background (negative), few contain objects (positive). Common remedies are hard negative mining (keep a fixed negative-to-positive ratio, as in SSD), focal loss (down-weight easy background examples, introduced with RetinaNet), and two-stage designs where the region proposal network discards most background before classification.