Computer Vision & Deep Learning II

CMSC 178IP - Module 10

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Learning Objectives

By the end of this module, you will be able to:

Build image classification systems
Understand object detection architectures (YOLO, Faster R-CNN)
Implement semantic segmentation with U-Net
Evaluate models using IoU, mAP, and other metrics
Apply transfer learning for computer vision tasks

Image Classification

Assigning labels to images

MNIST Dataset

MNIST: 70,000 handwritten digits (60,000 training, 10,000 test) as 28×28 grayscale images. The "Hello World" of deep learning.

CIFAR-10 Dataset

CIFAR-10: 60,000 color images (32×32) in 10 classes. More challenging than MNIST.

CNN for Classification

Classification Pipeline:
Conv layers → Feature extraction
FC layers → Classification
Softmax → Probability distribution over classes
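
The softmax step at the end of the pipeline can be sketched in plain Python. This is a minimal illustration, not a training-ready implementation; the logit values are made up for the example, and in practice a framework routine would be applied to the output of the FC layers.

```python
import math

def softmax(logits):
    """Convert raw class scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow in exp() and does not
    # change the result, because softmax is invariant to constant shifts.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem (illustrative values only)
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The largest logit always receives the largest probability, so softmax does not change which class is predicted; it only turns the scores into a distribution that a loss such as cross-entropy can use.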

Feature Maps Visualization

What the CNN learns at different layers

Training and Evaluation

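Monitor: Loss and accuracy on the training and validation sets. A widening gap, where training loss keeps falling while validation loss stalls or rises, indicates overfitting.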

Confusion Matrix

Metrics from confusion matrix:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
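
The three formulas above can be checked with a small helper. This is a sketch for a single class; the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Guard against empty denominators (assumed convention: return 0.0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
```

For a multiclass confusion matrix, the same function applies per class: TP is the diagonal entry, FP is the rest of that class's predicted column, and FN is the rest of its true row (assuming rows are true labels and columns are predictions).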

Multiclass Predictions

Sample predictions showing correct classifications and errors

Object Detection

Finding and localizing objects

Object Detection Task

Object Detection: Not just "what" but "where".

Output: Class label + Bounding box (x, y, width, height)

YOLO Architecture

YOLO (You Only Look Once):

Single-shot detector - very fast
Divides image into grid cells
Each cell predicts B boxes + class probabilities
Real-time detection possible
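
The grid idea can be made concrete with a little arithmetic. The sketch below assumes the layout of the original YOLO (v1): an S×S grid, B boxes per cell with 5 numbers each (x, y, width, height, confidence), and C class probabilities shared per cell. The values S=7, B=2, C=20 and the 448×448 image size are that version's settings, not something stated in these slides, and the function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: one vector per grid cell."""
    # Each box contributes 5 values (x, y, w, h, confidence);
    # the C class probabilities are predicted once per cell.
    return (S, S, B * 5 + C)

def responsible_cell(cx, cy, img_w, img_h, S):
    """Grid cell (row, col) containing an object's center point."""
    # Clamp so a center on the right or bottom edge stays inside the grid
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

shape = yolo_output_shape(S=7, B=2, C=20)
cell = responsible_cell(cx=224, cy=100, img_w=448, img_h=448, S=7)
```

Because all cells are predicted in one forward pass, the whole image is processed once, which is where the name "You Only Look Once" comes from.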

Knowledge Check

Think About It

Why is YOLO faster than two-stage detectors like Faster R-CNN?

Faster R-CNN

Two-stage detector:

Region Proposal Network (RPN): Generate candidate boxes
Classification head: Classify and refine boxes

Generally more accurate than the original YOLO, but slower, because every image passes through two stages.

IoU (Intersection over Union)

IoU: Measures overlap between predicted and ground truth boxes.

IoU = Area of Intersection / Area of Union

A detection with IoU > 0.5 is typically counted as "correct".
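
A minimal sketch of the IoU computation, using the (x, y, width, height) box format from the detection output described earlier. It assumes (x, y) is the top-left corner, which the slides do not state; other conventions (center-based, or corner pairs) need a conversion first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height).

    Assumes (x, y) is the top-left corner of each box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

# Two 10x10 boxes offset by 5 pixels in each direction (illustrative values)
overlap = iou((0, 0, 10, 10), (5, 5, 10, 10))
```

Here the boxes share a 5×5 region, so the IoU is 25 / 175, well below the 0.5 threshold even though a quarter of each box overlaps.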

Non-Maximum Suppression

NMS: Remove duplicate detections.

Sort boxes by confidence score
Keep the highest-confidence box
Remove the remaining boxes whose IoU with the kept box exceeds the threshold
Repeat for the remaining boxes
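
The NMS steps can be sketched as a short greedy loop. This is an illustrative, class-agnostic version in plain Python; boxes are assumed to be (x, y, width, height) with a top-left origin, the 0.5 default threshold is an assumption, and real detectors usually run NMS per class with a vectorized library routine.

```python
def iou(box_a, box_b):
    """IoU of two (x, y, width, height) boxes with top-left (x, y)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy non-maximum suppression."""
    # Step 1: sort box indices by confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the highest-confidence remaining box
        best = order.pop(0)
        keep.append(best)
        # Step 3: drop boxes that overlap the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step 4: the loop repeats on whatever is left
    return keep

# Illustrative input: two near-duplicate boxes and one separate box
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the distant third box survives.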

Mean Average Precision

mAP: Standard metric for object detection.
Average of Average Precision (AP) across all classes.
AP = Area under Precision-Recall curve.
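
One way to compute AP and mAP is sketched below. It assumes the detections for a class are already sorted by descending confidence and already marked as true or false positives (for example, by the IoU > 0.5 rule), and it uses the all-point interpolated area under the precision-recall curve. Benchmarks differ in the exact interpolation and IoU thresholds, so treat this as one common variant rather than the definitive formula; the function names are hypothetical.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class.

    is_true_positive: booleans for detections sorted by descending confidence.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0

    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Replace each precision with the best precision at any higher recall,
    # which removes the zig-zag in the raw curve
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the rectangles under the curve wherever recall increases
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Illustrative data: 3 detections (hit, miss, hit) against 2 ground-truth objects
ap = average_precision([True, False, True], num_ground_truth=2)
m_ap = mean_average_precision({"cat": 1.0, "dog": 0.5})
```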

Semantic Segmentation

Pixel-level classification

Segmentation Types

Semantic
Classify every pixel
No instance distinction

Instance
Separate each object
Distinguishes individuals

Knowledge Check

Think About It

What is the difference between semantic and instance segmentation?

U-Net Architecture

U-Net: Encoder-decoder with skip connections.

Encoder: Downsample, extract features
Decoder: Upsample, recover spatial detail
Skip connections: Preserve fine details
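
To see how the three parts fit together, the sketch below traces tensor shapes through a U-Net-style network without any deep learning library. It assumes padded convolutions (so only pooling and upsampling change the spatial size), channel doubling at each encoder level, and skip connections implemented as channel concatenation; the 64-channel base and depth of 4 are illustrative defaults, not values from these slides.

```python
def unet_shapes(height, width, base_channels=64, depth=4):
    """Trace (channels, height, width) through an encoder-decoder with skips."""
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    # Encoder: save each level's features, then halve the resolution
    encoder = []
    c, h, w = base_channels, height, width
    for _ in range(depth):
        encoder.append((c, h, w))          # kept for the skip connection
        c, h, w = c * 2, h // 2, w // 2    # downsample, double the channels

    bottleneck = (c, h, w)

    # Decoder: upsample, concatenate the matching encoder features, convolve
    decoder = []
    for skip_c, skip_h, skip_w in reversed(encoder):
        c, h, w = c // 2, h * 2, w * 2     # upsample, halve the channels
        decoder.append({
            "upsampled": (c, h, w),
            "after_concat": (c + skip_c, h, w),  # skip adds encoder channels
            "after_conv": (c, h, w),             # convs reduce channels again
        })
    return encoder, bottleneck, decoder

encoder, bottleneck, decoder = unet_shapes(256, 256)
```

Each decoder level receives features at the same resolution from the encoder, which is how fine spatial detail lost during downsampling is made available again.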

Segmentation Example

Input image and pixel-wise segmentation output

Implementation

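import torch
from torchvision import models
from torchvision.models.detection import fasterrcnn_resnet50_fpn

num_classes = 10  # placeholder: set to the number of classes in your task

# Transfer learning with a pretrained ResNet-18
# (torchvision >= 0.13 uses `weights=`; `pretrained=True` is deprecated)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier for the new task; the new layer is trainable by default
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# Object detection with torchvision's pretrained Faster R-CNN
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()  # switch to inference mode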

Summary

Key Takeaways

Classification: Assign single label per image
Object Detection: YOLO (fast), Faster R-CNN (accurate)
IoU & NMS: Essential for evaluating and refining detections
mAP: Standard detection evaluation metric
Segmentation: Pixel-level classification (U-Net)
Transfer learning: Leverage pretrained models

Questions?

Thank you for your attention!

Next: Module 11 - Generative Models

End of Module 10
