Image Basics, OpenCV Operations, Filters & Edge Detection, Object Detection, Segmentation, Deep Learning Models, Face Detection — CV mastery.
Understanding how images are represented digitally is fundamental to computer vision. Images are multi-dimensional arrays of pixel values, and manipulating these arrays is the foundation of all CV work.
import cv2
import numpy as np
from PIL import Image
# ── Image Representation ──
# Images are NumPy arrays: shape = (height, width, channels)
img = cv2.imread('photo.jpg') # BGR format (OpenCV default)
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # Convert to RGB
gray = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE) # Grayscale
print(f"Shape: {img.shape}") # (H, W, 3) for color, (H, W) for gray
print(f"Data type: {img.dtype}") # uint8 (0-255)
print(f"Size: {img.size} values") # H * W * channels (total array elements, not pixel count)
print(f"Pixel (100,100): {img[100, 100]}") # BGR values at (row, col)
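No image file is needed to see this in action. A minimal NumPy-only sketch (the `photo.jpg` path above is just a placeholder) showing that an image really is nothing more than a uint8 array:

```python
import numpy as np

# Build a 2x2 BGR "image" by hand: just an (H, W, 3) uint8 array
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = [255, 0, 0]   # blue pixel (BGR channel order)
img[1, 1] = [0, 0, 255]   # red pixel

print(img.shape, img.dtype, img.size)  # (2, 2, 3) uint8 12
```

Everything OpenCV does downstream — cropping, filtering, thresholding — is array manipulation on exactly this kind of object.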
# ── Color Spaces ──
img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV) # Hue, Saturation, Value
img_lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB) # L*a*b* (perceptual)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # Grayscale
# ── Image Properties ──
height, width = img.shape[:2]
center = (width // 2, height // 2)
total_pixels = height * width
aspect_ratio = width / height
# ── Pixel Operations ──
# Access and modify pixels
pixel = img[100, 100] # BGR values at row=100, col=100
img[100, 100] = [255, 0, 0] # Set pixel to blue
# Region of Interest (ROI)
roi = img[50:200, 100:300] # Crop region [y1:y2, x1:x2]
img[0:150, 0:200] = roi # Paste ROI elsewhere
# ── Image Arithmetic ──
img_add = cv2.add(img1, img2) # Saturated addition (clips at 255)
img_sub = cv2.subtract(img1, img2) # Saturated subtraction
img_blend = cv2.addWeighted(img1, 0.7, img2, 0.3, 0) # Blend images
img_bitwise = cv2.bitwise_and(img, img, mask=mask) # Keep only pixels where binary mask > 0
img_invert = cv2.bitwise_not(img) # Invert colors
| Format | Compression | Transparency | Animation | Best For |
|---|---|---|---|---|
| JPEG | Lossy | No | No | Photographs, web images |
| PNG | Lossless | Yes (alpha) | No | Graphics, screenshots, overlays |
| WebP | Both (lossy/lossless) | Yes | Yes | Modern web, smaller than JPEG/PNG |
| BMP | None | No | No | Raw image storage, Windows |
| TIFF | Lossless (LZW) | Yes | No | Printing, high-quality archival |
| GIF | Lossless | Yes (binary) | Yes | Simple animations, low-color graphics |
| AVIF | Both (lossy/lossless) | Yes | Yes | Next-gen web (smaller than JPEG at similar quality) |
| SVG | Vector | Yes | Yes (CSS) | Icons, logos, scalable graphics |
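One detail from the image-arithmetic block above deserves a demonstration: `cv2.add` saturates, while plain NumPy `+` wraps around on uint8 overflow. A NumPy-only check of the difference (no OpenCV required):

```python
import numpy as np

a = np.array([200], dtype=np.uint8)
b = np.array([100], dtype=np.uint8)

wrapped = a + b                               # uint8 overflow wraps: 300 % 256 = 44
saturated = np.clip(a.astype(np.int32) + b.astype(np.int32),
                    0, 255).astype(np.uint8)  # what cv2.add effectively does

print(wrapped[0], saturated[0])  # 44 255
```

This is why brightening an image with raw `+` produces dark artifacts in highlights, while `cv2.add` does not.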
# ── Geometric Transformations ──
# Resize (interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_CUBIC, INTER_LANCZOS4)
resized = cv2.resize(img, (300, 200)) # Absolute size
resized = cv2.resize(img, None, fx=0.5, fy=0.5) # Scale factor
resized = cv2.resize(img, (300, 200), interpolation=cv2.INTER_AREA) # Downscale
# Rotation
M = cv2.getRotationMatrix2D(center=center, angle=45, scale=1.0)
rotated = cv2.warpAffine(img, M, (width, height))
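Under the hood, `cv2.getRotationMatrix2D` just builds a 2x3 affine matrix from the standard rotation formula. A NumPy sketch of that construction (same formula as the OpenCV documentation), applied to a single point:

```python
import numpy as np

def rotation_matrix(center, angle_deg, scale=1.0):
    """Build the 2x3 affine matrix that cv2.getRotationMatrix2D returns."""
    cx, cy = center
    a = np.deg2rad(angle_deg)
    alpha, beta = scale * np.cos(a), scale * np.sin(a)
    return np.array([[alpha,  beta, (1 - alpha) * cx - beta * cy],
                     [-beta, alpha, beta * cx + (1 - alpha) * cy]])

# Rotate the point (1, 0) by 90 degrees CCW about the origin
M = rotation_matrix((0, 0), 90)
pt = M @ np.array([1.0, 0.0, 1.0])   # homogeneous coordinates
print(pt)  # ~[0, -1]: CCW in image coordinates, where y points down
```

`warpAffine` then applies this same matrix to every pixel coordinate.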
# Translation
M = np.float32([[1, 0, 50], [0, 1, 30]]) # Shift right 50, down 30
translated = cv2.warpAffine(img, M, (width, height))
# Affine Transformation (3 point pairs)
pts1 = np.float32([[50, 50], [200, 50], [50, 200]])
pts2 = np.float32([[10, 100], [200, 50], [100, 250]])
M = cv2.getAffineTransform(pts1, pts2)
warped = cv2.warpAffine(img, M, (width, height))
# Perspective Transform (4 point pairs)
pts1 = np.float32([[56, 65], [368, 52], [28, 387], [389, 390]])
pts2 = np.float32([[0, 0], [300, 0], [0, 300], [300, 300]])
M = cv2.getPerspectiveTransform(pts1, pts2)
warped = cv2.warpPerspective(img, M, (300, 300))
# Flip
flipped_h = cv2.flip(img, 1) # Horizontal
flipped_v = cv2.flip(img, 0) # Vertical
flipped_b = cv2.flip(img, -1) # Both axes
OpenCV provides a comprehensive set of image processing operations. Mastering these operations is essential for building any computer vision pipeline.
import cv2
import numpy as np
# ── Thresholding ──
gray = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)
# Simple threshold
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
# Adaptive threshold (handles uneven lighting)
adaptive = cv2.adaptiveThreshold(
gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
cv2.THRESH_BINARY, blockSize=11, C=2
)
# Otsu's method (automatic threshold selection)
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
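Otsu's method simply tries every threshold and keeps the one that maximizes the variance between the two resulting classes. A brute-force NumPy sketch of that selection, run on a tiny synthetic bimodal image:

```python
import numpy as np

def otsu_threshold(gray):
    """Brute-force Otsu: choose t maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # all pixels on one side: not a valid split
        mu0 = (np.arange(t) * hist[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * hist[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Two well-separated intensity clusters around 50 and 200
gray = np.array([[50, 52, 48], [200, 198, 202]], dtype=np.uint8)
print(otsu_threshold(gray))  # lands between the two clusters
```

OpenCV's implementation is equivalent but incremental, so it runs in a single pass over the histogram.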
# ── Morphological Operations ──
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
erosion = cv2.erode(binary, kernel, iterations=1) # Shrink white regions
dilation = cv2.dilate(binary, kernel, iterations=1) # Grow white regions
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel) # Erode then dilate (remove noise)
closing = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel) # Dilate then erode (fill holes)
gradient = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT, kernel) # Edge detection
tophat = cv2.morphologyEx(binary, cv2.MORPH_TOPHAT, kernel) # Difference between input and opening
blackhat = cv2.morphologyEx(binary, cv2.MORPH_BLACKHAT, kernel) # Difference between closing and input
# ── Contour Detection ──
contours, hierarchy = cv2.findContours(
binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
)
for contour in contours:
area = cv2.contourArea(contour) # Area
perimeter = cv2.arcLength(contour, True) # Perimeter
x, y, w, h = cv2.boundingRect(contour) # Bounding rectangle
    rot_rect = cv2.minAreaRect(contour) # Min-area rotated rect: ((cx, cy), (w, h), angle)
    (cx, cy), radius = cv2.minEnclosingCircle(contour) # Minimum enclosing circle
hull = cv2.convexHull(contour) # Convex hull
approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True) # Polygon approximation
if len(approx) == 4: # Rectangle detection
print("Rectangle detected")
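`cv2.contourArea` computes the polygon area of the contour points via Green's theorem; the discrete version is the shoelace formula, shown here in plain Python:

```python
# Shoelace formula: the polygon-area computation behind cv2.contourArea
def shoelace_area(pts):
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(shoelace_area(square))  # 16.0
```

This is why `contourArea` can differ slightly from the pixel count inside the contour: it measures the polygon, not the raster.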
# ── Drawing Functions ──
img_draw = img.copy()
cv2.rectangle(img_draw, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.circle(img_draw, (int(cx), int(cy)), int(radius), (255, 0, 0), -1) # Filled circle (int coords required)
cv2.line(img_draw, (0, 0), (100, 100), (0, 0, 255), 2)
cv2.putText(img_draw, 'Label', (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
cv2.polylines(img_draw, [points], True, (0, 255, 0), 2)
# ── Histogram Analysis ──
# Calculate histogram
hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
# Color histogram
color_hist_b = cv2.calcHist([img], [0], None, [256], [0, 256])
color_hist_g = cv2.calcHist([img], [1], None, [256], [0, 256])
color_hist_r = cv2.calcHist([img], [2], None, [256], [0, 256])
# ── Histogram Equalization ──
equalized = cv2.equalizeHist(gray)
# CLAHE (Contrast Limited Adaptive Histogram Equalization)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_img = clahe.apply(gray)
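What `equalizeHist` actually does is remap each intensity through the image's normalized cumulative histogram. A NumPy sketch of that lookup-table construction (assumes a non-constant image, so the CDF spans more than one value):

```python
import numpy as np

def equalize(gray):
    """Histogram equalization: remap intensities through the normalized CDF."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]            # first non-empty bin
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[gray]

gray = np.array([[0, 0], [128, 255]], dtype=np.uint8)
eq = equalize(gray)
print(eq.min(), eq.max())  # 0 255 -> contrast stretched to the full range
```

CLAHE applies the same idea per tile, with the clip limit preventing any single intensity from dominating the local CDF.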
# ── Image Blurring (Noise Reduction) ──
avg_blur = cv2.blur(img, (5, 5)) # Average blur
gaussian = cv2.GaussianBlur(img, (5, 5), 0) # Gaussian blur
median = cv2.medianBlur(img, 5) # Median blur (salt/pepper)
bilateral = cv2.bilateralFilter(img, 9, 75, 75) # Bilateral (edge-preserving)
# Non-local means (best quality, slow)
nlm = cv2.fastNlMeansDenoisingColored(img, None, h=10, hColor=10,
                                      templateWindowSize=7, searchWindowSize=21)
| Function | Purpose | Key Parameters |
|---|---|---|
| cv2.imread() | Read image from file | path, flags (IMREAD_COLOR/GRAYSCALE/UNCHANGED) |
| cv2.imwrite() | Save image to file | path, img, params (JPEG quality, PNG compression) |
| cv2.cvtColor() | Convert color space | src, code (COLOR_BGR2RGB, COLOR_BGR2GRAY, etc.) |
| cv2.resize() | Resize image | src, dsize or fx/fy, interpolation |
| cv2.threshold() | Binary threshold | src, thresh, maxval, type (BINARY, OTSU, TRUNC) |
| cv2.Canny() | Canny edge detection | image, threshold1, threshold2 |
| cv2.findContours() | Find contours | image, mode (RETR_EXTERNAL), method |
| cv2.warpAffine() | Apply affine transform | src, M, dsize |
| cv2.bilateralFilter() | Edge-preserving blur | src, d, sigmaColor, sigmaSpace |
| cv2.matchTemplate() | Template matching | image, templ, method (TM_CCOEFF_NORMED) |
Filters and edge detection are fundamental image processing techniques. Convolution-based filters can enhance, blur, sharpen, and detect edges in images.
import cv2
import numpy as np
# ── Canny Edge Detection (Most Popular) ──
edges = cv2.Canny(img, threshold1=50, threshold2=150)
# threshold1 = lower bound (weak edges)
# threshold2 = upper bound (strong edges)
# Edges between thresholds are kept only if connected to strong edges
# Automatic Canny edge detection (based on median pixel value)
def auto_canny(image, sigma=0.33):
median = np.median(image)
lower = int(max(0, (1.0 - sigma) * median))
upper = int(min(255, (1.0 + sigma) * median))
return cv2.Canny(image, lower, upper)
# ── Sobel Operator ──
sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3) # x-gradient (responds to vertical edges)
sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3) # y-gradient (responds to horizontal edges)
sobel_mag = np.sqrt(sobel_x**2 + sobel_y**2) # Magnitude
sobel_mag = np.uint8(np.clip(sobel_mag, 0, 255))
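Every one of these operators is a small convolution. A plain-NumPy 'valid'-mode correlation, applied to a synthetic vertical step edge, makes the Sobel-x behavior concrete:

```python
import numpy as np

def conv2d(img, k):
    """'valid'-mode 2D correlation, the core of every gradient filter."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
# A vertical step edge: dark left half, bright right half
step = np.array([[0, 0, 255, 255]] * 4, dtype=float)
resp = conv2d(step, sobel_x)
print(resp)  # strong positive response along the vertical edge
```

`cv2.filter2D` does the same computation, vectorized and with configurable border handling.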
# ── Laplacian Operator ──
laplacian = cv2.Laplacian(gray, cv2.CV_64F)
laplacian = np.uint8(np.clip(np.abs(laplacian), 0, 255))
# ── Scharr Operator (more accurate than Sobel 3x3) ──
scharr_x = cv2.Scharr(gray, cv2.CV_64F, 1, 0)
scharr_y = cv2.Scharr(gray, cv2.CV_64F, 0, 1)
# ── LoG (Laplacian of Gaussian) ──
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
log = cv2.Laplacian(blurred, cv2.CV_64F, ksize=3)
# ── Custom Convolution Kernel ──
# Sharpening kernel
sharpen = np.array([[0, -1, 0],
[-1, 5, -1],
[0, -1, 0]])
sharpened = cv2.filter2D(img, -1, sharpen)
# Emboss kernel
emboss = np.array([[-2, -1, 0],
[-1, 1, 1],
[ 0, 1, 2]])
embossed = cv2.filter2D(img, -1, emboss)
| Method | Noise Sensitivity | Edge Quality | Parameter(s) | Best For |
|---|---|---|---|---|
| Canny | Low (built-in smoothing) | Clean, thin edges | t1, t2 thresholds | General purpose, most popular |
| Sobel | Medium | Thick edges, directional | ksize (3,5,7) | Gradient direction analysis |
| Laplacian | High | All edges, no direction | ksize | Fast edge detection |
| LoG (Marr-Hildreth) | Low (Gaussian smoothing) | Zero-crossing edges | sigma (kernel size) | Blob detection, scale-space |
| Scharr | Medium | More accurate than Sobel | None | Better rotation invariance |
| Kernel | 3x3 Matrix | Effect |
|---|---|---|
| Identity | [[0,0,0],[0,1,0],[0,0,0]] | No change (passthrough) |
| Box Blur | [[1,1,1],[1,1,1],[1,1,1]]/9 | Average blur (smooth) |
| Gaussian 3x3 | [[1,2,1],[2,4,2],[1,2,1]]/16 | Weighted blur (smooth) |
| Sharpen | [[0,-1,0],[-1,5,-1],[0,-1,0]] | Enhance edges |
| Laplacian (edge) | [[0,-1,0],[-1,4,-1],[0,-1,0]] | Edge response (kernel sums to 0) |
| Emboss | [[-2,-1,0],[-1,1,1],[0,1,2]] | 3D emboss effect |
| Sobel X | [[-1,0,1],[-2,0,2],[-1,0,1]] | Vertical edges (x-gradient) |
| Sobel Y | [[-1,-2,-1],[0,0,0],[1,2,1]] | Horizontal edges (y-gradient) |
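A quick sanity check on the kernels in the table: blur and sharpen kernels sum to 1, so they preserve overall brightness in flat regions, while gradient kernels sum to 0 and respond only where intensity changes:

```python
import numpy as np

sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

flat = np.full((3, 3), 42.0)     # a perfectly flat patch
print((flat * sharpen).sum())    # 42.0 -> kernel sums to 1, flat areas unchanged
print((flat * sobel_x).sum())    # 0.0  -> kernel sums to 0, no edge means no response
```

Checking the kernel sum is a useful habit when designing custom filters for `cv2.filter2D`.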
Object detection locates and classifies multiple objects in an image. Modern approaches range from two-stage detectors (Faster R-CNN) to single-stage (YOLO) to transformer-based (DETR).
import cv2
import numpy as np
# ── YOLO with OpenCV (DNN module) ──
# Load pre-trained YOLO model
net = cv2.dnn.readNetFromDarknet('yolov4.cfg', 'yolov4.weights')
layer_names = net.getLayerNames()
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]
# Load COCO class names (80 classes)
with open('coco.names', 'r') as f:
classes = [line.strip() for line in f.readlines()]
# Process image
def detect_objects(image, conf_threshold=0.5, nms_threshold=0.4):
height, width = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416),
swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(output_layers)
boxes, confidences, class_ids = [], [], []
for output in outputs:
for detection in output:
scores = detection[5:]
class_id = np.argmax(scores)
confidence = scores[class_id]
if confidence > conf_threshold:
center_x = int(detection[0] * width)
center_y = int(detection[1] * height)
w = int(detection[2] * width)
h = int(detection[3] * height)
x = int(center_x - w / 2)
y = int(center_y - h / 2)
boxes.append([x, y, w, h])
confidences.append(float(confidence))
class_ids.append(class_id)
# Non-Maximum Suppression (remove overlapping boxes)
indices = cv2.dnn.NMSBoxes(boxes, confidences,
conf_threshold, nms_threshold)
results = []
    for i in np.array(indices).flatten(): # flat array in OpenCV 4.x; flatten() also handles older nested output
results.append({
'class': classes[class_ids[i]],
'confidence': confidences[i],
'box': boxes[i]
})
return results
results = detect_objects(img)
for r in results:
x, y, w, h = r['box']
cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
cv2.putText(img, f"{r['class']} {r['confidence']:.2f}",
                (x, y-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,255,0), 2)
# ── YOLO with Ultralytics (Recommended) ──
from ultralytics import YOLO
# Load pre-trained YOLOv8 model
model = YOLO('yolov8n.pt') # nano (fastest), s, m, l, x
# Train on custom dataset (YOLO format: images/ + labels/)
results = model.train(
data='dataset.yaml',
epochs=100,
imgsz=640,
batch=16,
device='0', # GPU device
workers=8,
pretrained=True,
optimizer='AdamW',
lr0=0.01,
augment=True,
mosaic=True,
)
# Inference
results = model('image.jpg')
for result in results:
boxes = result.boxes
for box in boxes:
cls = int(box.cls[0])
conf = float(box.conf[0])
xyxy = box.xyxy[0].tolist()
class_name = model.names[cls]
# Export to different formats
model.export(format='onnx') # ONNX
model.export(format='engine') # TensorRT
model.export(format='openvino') # OpenVINO
model.export(format='tflite') # TensorFlow Lite
| Model | Architecture | Speed (FPS) | mAP (COCO) | Best For |
|---|---|---|---|---|
| YOLOv8n | Single-stage | 130+ (T4) | 37.3 | Edge deployment, real-time |
| YOLOv8x | Single-stage | 25 (T4) | 53.9 | Highest accuracy YOLO |
| YOLOv9 | PGI + GELAN | 100+ (T4) | 55.6 | State-of-the-art single-stage |
| Faster R-CNN | Two-stage | 10-20 (T4) | 42.0 | High accuracy, slower |
| SSD | Single-stage | 30-60 (T4) | 25.1 | Balanced speed/accuracy |
| DETR | Transformer | 20 (T4) | 46.2 | End-to-end, no NMS needed |
| RT-DETR | Real-time DETR | 100+ (T4) | 53.1 | Real-time transformer-based |
| EfficientDet | BiFPN | 40-80 (T4) | 39.6 | Efficient for mobile |
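The IoU and NMS steps that `cv2.dnn.NMSBoxes` performs above are simple enough to sketch in plain NumPy — greedy NMS keeps the highest-scoring box and drops everything that overlaps it too much:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, thresh=0.4):
    """Greedy NMS: keep the best box, drop everything overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        mask = np.array([iou(boxes[i], boxes[j]) <= thresh for j in rest],
                        dtype=bool)
        order = rest[mask]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too heavily
```

Note that DETR-style models in the table skip this step entirely; their set-based loss makes duplicate boxes rare by construction.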
Image segmentation classifies every pixel in an image, going beyond bounding boxes to provide precise object boundaries. It ranges from semantic segmentation to instance segmentation.
# ── Semantic Segmentation with OpenCV (GrabCut) ──
import cv2
import numpy as np
img = cv2.imread('photo.jpg')
mask = np.zeros(img.shape[:2], np.uint8)
# Define foreground and background rectangles
rect = (50, 50, 450, 290)
# GrabCut algorithm
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
# Create binary mask (foreground = 1, background = 0)
binary_mask = np.where((mask == 2) | (mask == 0), 0, 1).astype('uint8')
segmented = img * binary_mask[:, :, np.newaxis]
# ── Watershed Segmentation ──
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Noise removal
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
# Sure background area
sure_bg = cv2.dilate(opening, kernel, iterations=3)
# Finding sure foreground area using distance transform
dist_transform = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
ret, sure_fg = cv2.threshold(dist_transform, 0.7 * dist_transform.max(), 255, 0)
# Unknown region
sure_fg = np.uint8(sure_fg)
unknown = cv2.subtract(sure_bg, sure_fg)
# Marker labeling
ret, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1 # Add 1 to all labels (background = 1)
markers[unknown == 255] = 0 # Unknown region = 0
# Watershed
markers = cv2.watershed(img, markers)
img[markers == -1] = [255, 0, 0] # Mark boundaries in red
# ── Segment Anything Model (SAM) ──
from segment_anything import SamPredictor, sam_model_registry
# Load model (vit_h, vit_l, vit_b)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
# Set image
predictor.set_image(image)
# Predict with point prompt
input_point = np.array([[500, 375]]) # x, y coordinates
input_label = np.array([1]) # 1 = foreground, 0 = background
masks, scores, logits = predictor.predict(
point_coords=input_point,
point_labels=input_label,
multimask_output=True
)
# masks: (num_masks, H, W) boolean arrays
# scores: confidence scores for each mask
# Predict with bounding box prompt
input_box = np.array([425, 600, 700, 875]) # x1, y1, x2, y2
masks, scores, logits = predictor.predict(
box=input_box,
multimask_output=False
)
| Type | Output | What It Does | Example Model | Use Case |
|---|---|---|---|---|
| Semantic | Class per pixel | Classifies each pixel into categories | DeepLabV3+, FCN, U-Net | Scene understanding, road segmentation |
| Instance | Mask per object | Separates individual objects of same class | Mask R-CNN, SOLOv2, YOLOv8-seg | Counting objects, individual tracking |
| Panoptic | Both semantic + instance | Complete scene understanding | Panoptic FPN, MaskFormer | Autonomous driving, robotics |
| Interactive | User-guided masks | Click/box-based segmentation | SAM (Meta), IRIS | Photo editing, annotation tools |
| Medical | Organ/lesion segmentation | Precise medical image analysis | U-Net, nnU-Net, Swin UNETR | Tumor detection, organ delineation |
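Segmentation quality is usually scored per pixel rather than per box; the Dice coefficient (close cousin of mask IoU) is the standard metric, shown here on toy masks:

```python
import numpy as np

def dice(a, b):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(pred, gt))  # 2*2 / (3+3) ≈ 0.667
```

Dice doubles the weight of the intersection, which makes it more forgiving than IoU on small structures — one reason it dominates in medical segmentation (U-Net, nnU-Net).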
Deep learning has revolutionized computer vision. Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are the backbone of modern CV systems.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from torchvision import datasets
from torch.utils.data import DataLoader
# ── Pre-trained Models (Transfer Learning) ──
# ResNet-50 (pretrained on ImageNet)
resnet = models.resnet50(weights='IMAGENET1K_V2')
num_features = resnet.fc.in_features # 2048
resnet.fc = nn.Linear(num_features, num_classes) # Replace final layer
# EfficientNet-V2
efficientnet = models.efficientnet_v2_s(weights='IMAGENET1K_V1')
# Vision Transformer (ViT)
vit = models.vit_b_16(weights='IMAGENET1K_V1')
# ── Image Preprocessing Pipeline ──
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224), # Standard input size for most models
transforms.ToTensor(), # Convert PIL to Tensor [0, 1]
transforms.Normalize( # ImageNet statistics
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
),
])
# ── Training Loop ──
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = resnet.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
for epoch in range(num_epochs):
model.train()
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
loss = criterion(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
    scheduler.step()
| Architecture | Year | Parameters | Top-1 Acc | Key Innovation | Best For |
|---|---|---|---|---|---|
| ResNet-50 | 2015 | 25.6M | 80.4% | Skip connections (residual learning) | General classification baseline |
| ResNet-152 | 2016 | 60.2M | 82.0% | Deeper residual network | High-accuracy classification |
| EfficientNet-V2-S | 2021 | 21.5M | 83.9% | Compound scaling + Fused-MBConv | Efficient, mobile-friendly |
| EfficientNet-V2-L | 2021 | 119M | 85.1% | Larger compound scaling | High accuracy with efficiency |
| ConvNeXt-L | 2022 | 199M | 85.1% | Modernized CNN (matches ViT) | CNN alternative to ViT |
| ViT-B/16 | 2020 | 86M | 81.8% | Pure transformer for images | Transfer learning, large datasets |
| ViT-L/16 | 2020 | 307M | 85.0% | Large ViT with more data | Best accuracy with large data |
| Swin Transformer | 2021 | 88M | 86.3% | Hierarchical window attention | Dense prediction tasks |
| DINOv2 | 2023 | 305M | 86.5% | Self-supervised ViT | Feature extraction, zero-shot |
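The training loop above minimizes cross-entropy. What `nn.CrossEntropyLoss` computes per sample — softmax followed by negative log-likelihood — fits in a few NumPy lines:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax + negative log-likelihood: what nn.CrossEntropyLoss computes."""
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy(logits, 0))  # small: correct class has the largest logit
print(cross_entropy(logits, 2))  # large: wrong class is penalized
```

The max-subtraction trick is the same one PyTorch uses internally to avoid overflow in `exp` for large logits.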
Face detection and recognition are among the most widely deployed CV applications. From phone unlock to surveillance, understanding face detection pipelines is essential.
import cv2
import numpy as np
# ── Haar Cascade Face Detection (Classic, Fast, OpenCV) ──
face_cascade = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)
eye_cascade = cv2.CascadeClassifier(
cv2.data.haarcascades + 'haarcascade_eye.xml'
)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(
gray, scaleFactor=1.1, minNeighbors=5,
minSize=(30, 30), flags=cv2.CASCADE_SCALE_IMAGE
)
for (x, y, w, h) in faces:
cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)
roi_gray = gray[y:y+h, x:x+w]
eyes = eye_cascade.detectMultiScale(roi_gray, scaleFactor=1.1, minNeighbors=5)
for (ex, ey, ew, eh) in eyes:
cv2.rectangle(img, (x+ex, y+ey), (x+ex+ew, y+ey+eh), (0, 255, 0), 2)
# ── DNN Face Detection (More Accurate) ──
face_net = cv2.dnn.readNetFromCaffe('deploy.prototxt', 'res10_300x300.caffemodel')
blob = cv2.dnn.blobFromImage(cv2.resize(img, (300, 300)), 1.0,
(104.0, 177.0, 123.0))
face_net.setInput(blob)
detections = face_net.forward()
h, w = img.shape[:2]
for i in range(detections.shape[2]):
confidence = detections[0, 0, i, 2]
if confidence > 0.7:
box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
(startX, startY, endX, endY) = box.astype('int')
        cv2.rectangle(img, (startX, startY), (endX, endY), (0, 255, 0), 2)
# ── Face Recognition with face_recognition library ──
# pip install face_recognition
import face_recognition
# Load and encode known faces
known_image = face_recognition.load_image_file("known_person.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]
# Find faces in unknown image
unknown_image = face_recognition.load_image_file("group_photo.jpg")
face_locations = face_recognition.face_locations(unknown_image)
face_encodings = face_recognition.face_encodings(unknown_image, face_locations)
# Compare faces
for (top, right, bottom, left), face_encoding in zip(face_locations, face_encodings):
matches = face_recognition.compare_faces([known_encoding], face_encoding, tolerance=0.6)
distance = face_recognition.face_distance([known_encoding], face_encoding)
    name = "Known Person" if matches[0] else "Unknown"
| Method | Speed | Accuracy | Lighting Robust | Best For |
|---|---|---|---|---|
| Haar Cascade | Very Fast (CPU) | Low-Medium | Low | Real-time, low-resource devices |
| DNN (Caffe) | Fast (CPU) | Medium-High | Medium | Better accuracy, still CPU-friendly |
| MTCNN | Medium | High | High | Landmark detection + alignment |
| RetinaFace | Fast (GPU) | Very High | High | Production face detection |
| MediaPipe | Very Fast (CPU/GPU) | High | High | Mobile, real-time, face mesh |
| BlazeFace | Very Fast (Mobile) | Medium | Medium | Android/iOS face detection |
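The matching rule behind `compare_faces` is just a Euclidean-distance threshold on 128-D embeddings. A NumPy sketch of that rule — the random vectors here are stand-ins for real face encodings, which come from a trained network:

```python
import numpy as np

def is_match(known, candidate, tolerance=0.6):
    """face_recognition-style match: Euclidean distance between embeddings."""
    return np.linalg.norm(known - candidate) <= tolerance

rng = np.random.default_rng(0)
known = rng.normal(size=128)                      # stand-in embedding
same = known + rng.normal(scale=0.01, size=128)   # near-duplicate of the same face
other = rng.normal(size=128)                      # unrelated embedding

print(is_match(known, same), is_match(known, other))  # True False
```

Lowering `tolerance` makes matching stricter (fewer false accepts, more false rejects), which is the main tuning knob in a recognition pipeline.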
Essential computer vision interview questions.
Q: What is the difference between object detection and image segmentation?
Answer: Object detection draws bounding rectangles around objects and classifies them. It answers "what objects are where?" but doesn't give precise pixel-level boundaries.
Segmentation classifies every pixel in the image. Semantic segmentation assigns a class to each pixel (road, car, person). Instance segmentation further distinguishes between individual objects of the same class (car #1, car #2). Segmentation provides much more precise boundaries than bounding boxes.
Q: How does a convolutional neural network (CNN) process images?
Answer: A CNN processes images through learnable filters (kernels) that slide across the image. Each filter detects a specific pattern: edges, textures, shapes, objects. Early layers detect simple patterns (edges, colors); deeper layers detect complex patterns (faces, car wheels).
A convolution operation multiplies the filter weights with the input patch and sums the results. This produces a feature map highlighting where the pattern appears. Pooling layers downsample feature maps. Fully connected layers at the end perform classification.
Key properties: Weight sharing (same filter applied everywhere), translation invariance (detects pattern anywhere), hierarchical features (simple to complex).
Q: What is data augmentation, and why is it important in computer vision?
Answer: Data augmentation creates variations of training images to artificially increase dataset size and improve model generalization. Common augmentations: random flips, rotations, crops, color jitter, Gaussian noise, perspective transforms, Cutout/MixUp/CutMix.
It is critical because: (1) CV models need large datasets, but labeled data is expensive. (2) It prevents overfitting by showing the model variations. (3) It makes the model invariant to transformations it should handle (e.g., horizontal flips for objects that have no inherent orientation).
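Several of these augmentations are one-liners on the raw array (a hypothetical toy image here; real pipelines use torchvision.transforms or albumentations):

```python
import numpy as np

img = np.arange(24, dtype=np.uint8).reshape(2, 4, 3)   # tiny H x W x C image

h_flip = img[:, ::-1]                                   # horizontal flip
v_flip = img[::-1]                                      # vertical flip
brighter = np.clip(img.astype(np.int16) + 50, 0, 255).astype(np.uint8)  # brightness jitter

print(np.array_equal(h_flip[:, 0], img[:, -1]))         # True: columns reversed
```

For detection and segmentation, remember that the same geometric transform must be applied to boxes and masks, not just pixels.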
Q: What are IoU and mAP, and how do they evaluate object detectors?
Answer: IoU (Intersection over Union) measures the overlap between a predicted bounding box and ground truth. IoU = intersection area / union area. It ranges from 0 (no overlap) to 1 (perfect match). A detection is considered correct if IoU exceeds a threshold (typically 0.5).
mAP (mean Average Precision) is the mean of Average Precision across all classes. AP is the area under the precision-recall curve for a single class. mAP@0.5 uses IoU=0.5 as the correctness threshold. mAP@0.5:0.95 averages mAP across IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO's primary metric).
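A tiny worked AP computation under the trapezoidal rule (COCO actually uses 101-point interpolation; the precision/recall values here are illustrative):

```python
import numpy as np

# Precision at increasing recall levels for one class (illustrative values)
recall    = np.array([0.0, 0.5, 1.0])
precision = np.array([1.0, 1.0, 0.5])

# AP = area under the precision-recall curve (trapezoidal approximation)
ap = (((precision[1:] + precision[:-1]) / 2) * np.diff(recall)).sum()
print(ap)  # 0.875
```

mAP then averages this AP across classes, and mAP@0.5:0.95 additionally averages it across IoU thresholds.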
Q: How do object detectors handle foreground/background class imbalance?
Answer: Object detection naturally has class imbalance: most image patches are background (negative), few contain objects (positive). Common remedies are hard negative mining (keep a fixed negative-to-positive ratio, as in SSD), focal loss (down-weight easy background examples, introduced with RetinaNet), and two-stage designs where the region proposal network discards most background before classification.