
Artificial Neural Networks

CMSC 173 - Module 12

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What Are Neural Networks?

Artificial Neural Networks: Computing systems inspired by biological neural networks

Biological Inspiration

  • Neurons: Basic processing units
  • Synapses: Weighted connections
  • Learning: Adapting connection strengths
  • Parallel processing: Massive connectivity

Artificial Counterpart

  • Perceptrons: Mathematical neurons
  • Weights: Learnable parameters
  • Training: Gradient-based optimization
  • Layers: Organized processing units

Key Insight

Neural networks can learn complex non-linear mappings from data by adjusting weights through training.

Why Neural Networks?

Motivation: Limitations of Linear Models

Linear Models

  • Limited to linear decision boundaries
  • Cannot solve XOR problem
  • Restricted representational power
  • Simple but insufficient for complex data
Example: XOR Problem

\begin{tabular}{cc|c}
$x_1$ & $x_2$ & XOR \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\end{tabular}

No linear classifier can solve this!
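While no single linear unit can compute XOR, a two-layer network can. As a minimal sketch (the weights below are hand-chosen for illustration, not from the slides), XOR can be built as AND(OR, NAND) using step-activated perceptrons:

```python
# XOR via a two-layer step network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)).
# Weights and thresholds are hand-picked illustrative values.
def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)    # hidden unit 1: OR gate
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)   # hidden unit 2: NAND gate
    return step(1.0 * h1 + 1.0 * h2 - 1.5)  # output unit: AND of h1, h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))  # reproduces the XOR truth table
```

The hidden layer remaps the four inputs so that the output unit sees a linearly separable problem.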

Neural Networks

  • Non-linear decision boundaries
  • Universal approximation capability
  • Hierarchical feature learning
  • Scalable to complex problems
Universal Approximation Theorem: A neural network with a single hidden layer can approximate any continuous function to arbitrary accuracy (given sufficient neurons).

Key Advantages:
  • Automatic feature extraction
  • End-to-end learning
  • Flexible architectures

The Perceptron: Building Block of Neural Networks

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Input nodes - better vertical spacing and centered \node[input neuron] (x1) at (0,2.5) {$x_1$}; \node[input neuron] (x2) at (0,1.5) {$x_2$}; \node[input neuron] (x3) at (0,0.5) {$x_3$}; \node[bias neuron] (x0) at (0,-0.5) {$x_0$}; % \node[above=0.05cm of x0, font=\tiny] {bias}; % Intermediate processing nodes - better horizontal alignment \node[computation] (sum) at (3.5,1) {$\sum$}; \node[activation] (sigma) at (5.5,1) {$\sigma$}; % Output neuron - centered vertically with processing nodes \node[output neuron, large neuron] (y) at (7.5,1) {$y$}; % Connections from inputs to summation with better routing \coordinate (sumIn) at (2.8,1); \draw[strong connection] (x1) -- (sumIn); \draw[strong connection] (x2) -- (sumIn); \draw[strong connection] (x3) -- (sumIn); \draw[strong connection] (x0) -- (sumIn); % Weight labels positioned clearly above connections \node at (1.4, 2.2) [font=\small] {$w_1$}; \node at (1.4, 1.5) [font=\small] {$w_2$}; \node at (1.4, 0.8) [font=\small] {$w_3$}; \node at (1.4, 0.1) [font=\small] {$b$}; % Flow arrows between processing stages \draw[flow arrow] (sum) -- (sigma); \draw[flow arrow] (sigma) -- (y); % Mathematical formulation - positioned below with consistent spacing \node[below=0.8cm of sum, font=\small] {$z = \sum_{i=1}^{n} w_i x_i + b$}; \node[below=0.8cm of sigma, font=\small] {$y = \sigma(z)$}; % Input and Output labels - better positioning \node[left=0.2cm of x1, font=\bfseries\small] {Inputs}; \node[right=0.2cm of y, font=\bfseries\small] {Output}; \end{tikzpicture}

Mathematical Model

Linear Combination: $z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b$ Activation: $y = \sigma(z)$ where $\sigma$ is an activation function

Neural Network Components and Architecture

Single Processing Unit \begin{tikzpicture}[scale=0.8, every node/.style={scale=0.8}] % Single neuron diagram - centered at (3,2) \node[neuron, minimum size=1.2cm] (neuron) at (3,2) {$y$}; % Input nodes properly positioned \node[left] (x1) at (0,3.2) {$x_1$}; \node[left] (x2) at (0,2.4) {$x_2$}; \node[left] (xdots) at (0,1.6) {$⋮$}; \node[left] (xD) at (0,0.8) {$x_D$}; \node[left] (bias) at (0,3.8) {$w_0$}; % Input connections with clear weight labels \draw[connection] (x1) -- (neuron) node[pos=0.3, above] {$w_1$}; \draw[connection] (x2) -- (neuron) node[pos=0.3, above] {$w_2$}; \draw[connection] (xdots) -- (neuron); \draw[connection] (xD) -- (neuron) node[pos=0.3, below] {$w_D$}; \draw[connection] (bias) -- (neuron) node[pos=0.35, above, sloped, font=\tiny] {bias}; % Output with clear spacing \draw[connection] (neuron) -- (5.5,2) node[right] {$y := \sigma(z)$}; % Activation function annotation - better positioned \node[above=0.5cm of neuron, font=\tiny] {Activation}; \node[above=0.25cm of neuron, font=\tiny] {Function, $\sigma$}; \end{tikzpicture} \small{Single processing unit with inputs $x_1, …, x_D$, weights $w_1, …, w_D$, bias $w_0$, and activation function $\sigma$.}
Multi-Layer Perceptron \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] % Input layer - vertically centered \foreach \y in {1,2,3,4} { \node[input neuron] (I-\y) at (0,{4.5-\y}) {$x_\y$}; } % Hidden layer 1 - centered with 5 nodes \foreach \y in {1,2,3,4,5} { \node[hidden neuron] (H1-\y) at (2.8,{5-\y}) {}; } % Hidden layer 2 - centered with 3 nodes \foreach \y in {1,2,3} { \node[hidden neuron] (H2-\y) at (5.6,{3.5-\y}) {}; } % Output layer - centered \node[output neuron] (O-1) at (8.4,2) {$y$}; % Connections input to hidden1 \foreach \i in {1,2,3,4} { \foreach \j in {1,2,3,4,5} { \draw[connection, opacity=0.25] (I-\i) -- (H1-\j); } } % Connections hidden1 to hidden2 \foreach \i in {1,2,3,4,5} { \foreach \j in {1,2,3} { \draw[connection, opacity=0.25] (H1-\i) -- (H2-\j); } } % Connections hidden2 to output \foreach \i in {1,2,3} { \draw[connection, opacity=0.25] (H2-\i) -- (O-1); } % Layer labels - better positioned \node[layer label] at (0,-0.8) {Input}; \node[layer label] at (2.8,-0.8) {Hidden 1}; \node[layer label] at (5.6,-0.8) {Hidden 2}; \node[layer label] at (8.4,-0.8) {Output}; \end{tikzpicture} \small{Multi-layer perceptron with fully connected layers. Each connection represents a learnable weight parameter.}

Key Concepts

Processing Unit: $z = \sum_{i=1}^{D} w_i x_i + w_0$, then $y = \sigma(z)$ \\ Network: Multiple units arranged in layers with feedforward connections

Perceptron: Mathematical Formulation

Complete Mathematical Description: $$\begin{aligned}z &= \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b \\ y &= \sigma(z) = \sigma(\mathbf{w}^T\mathbf{x} + b)\end{aligned}$$ where:
  • $\mathbf{x} = [x_1, x_2, …, x_n]^T$: input vector
  • $\mathbf{w} = [w_1, w_2, …, w_n]^T$: weight vector
  • $b$: bias term
  • $\sigma(\cdot)$: activation function
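The two-step computation above can be sketched directly. This is a minimal illustration (the particular weights, bias, and inputs are made-up values, chosen to match the $z = 0.52$ case worked through later in the module):

```python
import math

# One perceptron: z = w.x + b (linear combination), y = sigma(z) (activation).
def perceptron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w^T x + b
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid activation

y = perceptron(x=[0.5, 0.8], w=[0.2, 0.4], b=0.1)
print(round(y, 3))  # sigma(0.52) = 0.627
```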

Step Function (Original)

$$\sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$ Problem: Not differentiable

Sigmoid Function (Modern)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$ Advantage: Smooth and differentiable

Perceptron Learning Algorithm

Goal: Learn weights $\mathbf{w}$ and bias $b$ to minimize prediction error

Original Perceptron Rule

For misclassified point $(x_i, y_i)$: $$w_j := w_j + \alpha (y_i - \hat{y}_i) x_{ij}$$ $$b := b + \alpha (y_i - \hat{y}_i)$$ where $\alpha$ is the learning rate. Convergence: Guaranteed for linearly separable data
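A minimal sketch of the original perceptron rule, trained on the linearly separable AND function (the learning rate, initial weights, and epoch count are arbitrary choices for illustration):

```python
# Perceptron learning rule: update only on misclassified points,
# w_j := w_j + alpha * (y - y_hat) * x_j, b := b + alpha * (y - y_hat).
def predict(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND function
w, b, alpha = [0.0, 0.0], 0.0, 0.1
for _ in range(20):                 # plenty of epochs for this tiny problem
    for x, y in data:
        err = y - predict(x, w, b)  # 0 when the point is classified correctly
        w[0] += alpha * err * x[0]
        w[1] += alpha * err * x[1]
        b    += alpha * err

print([predict(x, w, b) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the convergence guarantee applies and the loop settles on a separating line.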

Gradient Descent (Modern)

Define loss function: $L = \frac{1}{2}(y - \hat{y})^2$ Weight updates: $$\begin{aligned}w_j &:= w_j - \alpha \frac{\partial L}{\partial w_j} = w_j + \alpha (y - \hat{y}) \sigma'(z) x_j \\ b &:= b - \alpha \frac{\partial L}{\partial b} = b + \alpha (y - \hat{y}) \sigma'(z)\end{aligned}$$ Note the sign: $\frac{\partial L}{\partial w_j} = -(y - \hat{y})\,\sigma'(z)\,x_j$, so descending the gradient moves $\hat{y}$ toward $y$.

Limitation

Single perceptron can only learn linearly separable functions. Solution: Multi-layer networks!

Activation Functions: The Heart of Non-linearity

[Figure: ../figures/activation_functions.png]

Purpose

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.

Activation Functions: Mathematical Properties

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$ Properties:
  • Range: $(0, 1)$
  • Smooth and differentiable
  • Output interpretable as probability
Derivative: $$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$ Issues: Vanishing gradients for large $|x|$

Hyperbolic Tangent

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$ Properties:
  • Range: $(-1, 1)$
  • Zero-centered output
  • Steeper gradients than sigmoid
Derivative: $$\tanh'(x) = 1 - \tanh^2(x)$$ Advantage: Zero-centered outputs generally make it preferable to sigmoid for hidden layers, though it still suffers from vanishing gradients for large $|x|$

Activation Functions: ReLU Family

ReLU (Rectified Linear Unit)

$$\text{ReLU}(x) = \max(0, x)$$ Advantages:
  • Computationally efficient
  • No vanishing gradient for $x > 0$
  • Sparse activation
  • Most popular choice
Derivative: $$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

Leaky ReLU

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$ Advantages:
  • Avoids "dying ReLU" problem
  • Small gradient for negative inputs
  • Typically $\alpha = 0.01$
Derivative: $$\text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \leq 0 \end{cases}$$
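The four activations and their derivatives are small enough to write out directly. A minimal sketch (using the standard $\alpha = 0.01$ for Leaky ReLU):

```python
import math

# Activation functions and their derivatives, as defined above.
def sigmoid(x):      return 1.0 / (1.0 + math.exp(-x))
def d_sigmoid(x):    s = sigmoid(x); return s * (1.0 - s)
def d_tanh(x):       return 1.0 - math.tanh(x) ** 2
def relu(x):         return max(0.0, x)
def d_relu(x):       return 1.0 if x > 0 else 0.0
def leaky_relu(x, a=0.01):   return x if x > 0 else a * x
def d_leaky_relu(x, a=0.01): return 1.0 if x > 0 else a

print(d_sigmoid(0.0))  # 0.25 -- the MAXIMUM of sigma'; gradients shrink elsewhere
print(d_tanh(0.0))     # 1.0 -- steeper than sigmoid at the origin
print(d_relu(-2.0), d_leaky_relu(-2.0))  # 0.0 0.01 -- the "dying ReLU" fix
```

The printed values preview the vanishing-gradient discussion: sigmoid's derivative never exceeds $0.25$, while ReLU passes gradients through unchanged for positive inputs.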

Activation Function Derivatives

[Figure: ../figures/activation_derivatives.png]

Why Derivatives Matter

Derivatives are crucial for backpropagation - they determine how errors flow backward through the network during training.

Choosing Activation Functions

\begin{tipblock}{Guidelines} Hidden Layers:
  • ReLU: Default choice (fast, effective)
  • Leaky ReLU: If dying ReLU is a problem
  • Tanh: For zero-centered data
  • Sigmoid: Avoid (vanishing gradients)
Output Layer:
  • Sigmoid: Binary classification
  • Softmax: Multi-class classification
  • Linear: Regression
  • Tanh: Regression (bounded output)
\end{tipblock}

Common Issues

Vanishing Gradients:
  • Sigmoid/Tanh derivatives $\rightarrow 0$ for large inputs
  • Deep networks suffer from this
  • Solution: ReLU activations
Dying ReLU:
  • Neurons get stuck at zero output
  • No gradient flows through
  • Solution: Leaky ReLU, initialization

Best Practice

Start with ReLU for hidden layers and choose output activation based on your task.

Multi-Layer Neural Network Architecture

\begin{tikzpicture}[scale=0.85, every node/.style={scale=0.85}] % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (I-\y) at (0,{3.5-\y}) {$x_{\y}$}; } % Hidden layer 1 - centered with 5 nodes \foreach \y in {0,1,2,3,4} { \node[hidden neuron] (H1-\y) at (3,{4-\y}) {}; } % Hidden layer 2 - centered with 3 nodes \foreach \y in {0,1,2} { \node[hidden neuron] (H2-\y) at (6,{3-\y}) {}; } % Output layer - centered with 2 nodes \foreach \y in {0,1} { \node[output neuron] (O-\y) at (9,{2-\y}) {$y_{\y}$}; } % Connections input to hidden1 \foreach \i in {0,1,2,3} { \foreach \j in {0,1,2,3,4} { \draw[connection, opacity=0.3] (I-\i) -- (H1-\j); } } % Connections hidden1 to hidden2 \foreach \i in {0,1,2,3,4} { \foreach \j in {0,1,2} { \draw[connection, opacity=0.3] (H1-\i) -- (H2-\j); } } % Connections hidden2 to output \foreach \i in {0,1,2} { \foreach \j in {0,1} { \draw[connection, opacity=0.3] (H2-\i) -- (O-\j); } } % Layer labels - better positioned \node[layer label] at (0,-1.3) {Input}; \node[layer label] at (3,-1.3) {Hidden 1}; \node[layer label] at (6,-1.3) {Hidden 2}; \node[layer label] at (9,-1.3) {Output}; % Weight matrix labels - positioned above network \node[above, font=\small] at (1.5,4.5) {$\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$}; \node[above, font=\small] at (4.5,4.5) {$\mathbf{W}^{(2)}, \mathbf{b}^{(2)}$}; \node[above, font=\small] at (7.5,4.5) {$\mathbf{W}^{(3)}, \mathbf{b}^{(3)}$}; % Forward propagation arrows \foreach \x in {1.5,4.5,7.5} { \draw[flow arrow] (\x,-0.8) -- (\x+1,-0.8); } \node[below] at (4.5,-0.8) {Forward Propagation}; \end{tikzpicture}

Key Components

Layers: Input → Hidden → Hidden → ... → Output

Connections: Each neuron connects to all neurons in the next layer (fully connected)

Network Architecture: Mathematical Representation

For a network with $L$ layers: $$\begin{aligned}\mathbf{a}^{(0)} &= \mathbf{x} && \text{(input layer)} \\ \mathbf{z}^{(l)} &= \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} && \text{for } l = 1, 2, \dots, L \\ \mathbf{a}^{(l)} &= \sigma^{(l)}(\mathbf{z}^{(l)}) && \text{for } l = 1, 2, \dots, L \\ \hat{\mathbf{y}} &= \mathbf{a}^{(L)} && \text{(output layer)}\end{aligned}$$ where:
  • $\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$: weight matrix for layer $l$
  • $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$: bias vector for layer $l$
  • $n_l$: number of neurons in layer $l$
  • $\sigma^{(l)}$: activation function for layer $l$

Network Dimensions and Parameters

Matrix Dimensions

For layer $l$:
  • Input: $\mathbf{a}^{(l-1)}$ has shape $(n_{l-1}, 1)$
  • Weights: $\mathbf{W}^{(l)}$ has shape $(n_l, n_{l-1})$
  • Output: $\mathbf{a}^{(l)}$ has shape $(n_l, 1)$
Batch Processing:
  • Input batch: $\mathbf{A}^{(l-1)}$ has shape $(n_{l-1}, m)$
  • Output batch: $\mathbf{A}^{(l)}$ has shape $(n_l, m)$
  • where $m$ is the batch size

Parameter Count

Total parameters: $$\sum_{l=1}^{L} (n_l \times n_{l-1} + n_l)$$ Example: 784 → 128 → 64 → 10 $$\begin{aligned}784 \times 128 + 128 \\ + 128 \times 64 + 64 \\ + 64 \times 10 + 10 \\ = 109,386 \text{ parameters}\end{aligned}$$ Memory scales with:
  • Network depth
  • Layer width
  • Batch size

Network Design Considerations

Depth vs Width

Deeper Networks:
  • More layers, fewer neurons per layer
  • Better feature hierarchies
  • Can represent more complex functions
  • Risk: vanishing gradients
Wider Networks:
  • Fewer layers, more neurons per layer
  • More parameters at each level
  • Easier to train
  • Risk: overfitting
\begin{tipblock}{Architecture Guidelines} Hidden Layer Size:
  • Start with 1-2 hidden layers
  • Size between input and output dimensions
  • Rule of thumb: $\sqrt{n_{input} \times n_{output}}$
Number of Layers:
  • Simple problems: 1-2 hidden layers
  • Complex problems: 3+ layers
  • Very deep: Requires special techniques
\end{tipblock}

Rule of Thumb

Start simple and gradually increase complexity. Use validation performance to guide architecture choices.

Forward Propagation: Information Flow

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Network structure (simplified 3-layer) - centered vertically \node[input neuron] (x1) at (0,2) {$x_1$}; \node[input neuron] (x2) at (0,1) {$x_2$}; \node[input neuron] (x3) at (0,0) {$x_3$}; \node[hidden neuron] (h1) at (3.5,1.5) {$h_1$}; \node[hidden neuron] (h2) at (3.5,0.5) {$h_2$}; \node[output neuron] (y) at (7,1) {$y$}; % Connections with reduced opacity \draw[strong connection, opacity=0.25] (x1) -- (h1); \draw[strong connection, opacity=0.25] (x1) -- (h2); \draw[strong connection, opacity=0.25] (x2) -- (h1); \draw[strong connection, opacity=0.25] (x2) -- (h2); \draw[strong connection, opacity=0.25] (x3) -- (h1); \draw[strong connection, opacity=0.25] (x3) -- (h2); \draw[strong connection, opacity=0.25] (h1) -- (y); \draw[strong connection, opacity=0.25] (h2) -- (y); % Flow arrows and computations - positioned above network \node[computation] (z1) at (1.75,2.8) {$\mathbf{z}^{(1)}$}; \node[activation] (a1) at (1.75,3.6) {$\sigma$}; \draw[flow arrow] (z1) -- (a1); \node[above=0.1cm of a1, font=\small] {$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$}; \node[computation] (z2) at (5.25,2.8) {$\mathbf{z}^{(2)}$}; \node[activation] (a2) at (5.25,3.6) {$\sigma$}; \draw[flow arrow] (z2) -- (a2); \node[above=0.1cm of a2, font=\small] {$\mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$}; % Mathematical formulation - better positioning \node[below, font=\small] at (1.75,-0.8) {$\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$}; \node[below, font=\small] at (5.25,-0.8) {$\mathbf{z}^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}$}; % Layer labels \node[layer label] at (0,-1.5) {Input}; \node[layer label] at (3.5,-1.5) {Hidden}; \node[layer label] at (7,-1.5) {Output}; % Forward flow arrows - positioned below labels \draw[flow arrow, very thick] (0.8,-1.3) -- (2.7,-1.3); \draw[flow arrow, very thick] (4.3,-1.3) -- (6.2,-1.3); \node[below, font=\small\bfseries] at (3.5,-1.1) {Forward 
Propagation}; \end{tikzpicture}

Forward Pass

Information flows from input to output, layer by layer, to compute predictions.

Forward Propagation Algorithm

Step-by-step Process: \begin{algorithm}[H] \caption{Forward Propagation} \begin{algorithmic}[1] \STATE Input: $\mathbf{x}$, weights $\{\mathbf{W}^{(l)}\}$, biases $\{\mathbf{b}^{(l)}\}$ \STATE Set $\mathbf{a}^{(0)} = \mathbf{x}$ \FOR{$l = 1$ to $L$} \STATE Compute pre-activation: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ \STATE Apply activation: $\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$ \ENDFOR \STATE Output: $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$ \end{algorithmic} \end{algorithm}
\begin{exampleblock}{Vectorized Implementation} For batch processing: $$\begin{aligned}\mathbf{Z}^{(l)} &= \mathbf{A}^{(l-1)} \mathbf{W}^{(l)T} + \mathbf{b}^{(l)} \\ \mathbf{A}^{(l)} &= \sigma^{(l)}(\mathbf{Z}^{(l)})\end{aligned}$$ where $\mathbf{A}^{(l)}$ has shape $(m, n_l)$ for $m$ examples. Computational Complexity: $O(L \cdot N^2 \cdot M)$ where $L$ = layers, $N$ = max neurons/layer, $M$ = batch size (layer $l$ costs $n_l \cdot n_{l-1} \cdot M$ multiply-adds) \end{exampleblock}
\begin{exampleblock}{Example Calculation} Network: $2 \rightarrow 3 \rightarrow 1$ Input: $\mathbf{x} = [0.5, 0.8]^T$ Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$ $\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$ Layer 2: $z^{(2)} = \mathbf{w}^{(2)T} \mathbf{a}^{(1)} + b^{(2)}$ $\hat{y} = \sigma(z^{(2)})$ All intermediate values $\mathbf{z}^{(l)}, \mathbf{a}^{(l)}$ are stored for backpropagation. \end{exampleblock}
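The forward propagation algorithm above can be sketched as a short loop. This is an illustrative implementation (plain Python lists, sigmoid at every layer, and made-up weights for the $2 \to 3 \to 1$ example); note how all $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$ are cached for backpropagation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward propagation for an L-layer network. weights[l] is a list of rows
# (shape n_l x n_{l-1}); all z and a values are stored for the backward pass.
def forward(x, weights, biases):
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = [sum(wij * aj for wij, aj in zip(row, a)) + bi
             for row, bi in zip(W, b)]        # z^(l) = W^(l) a^(l-1) + b^(l)
        a = [sigmoid(zi) for zi in z]         # a^(l) = sigma(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations

# The 2 -> 3 -> 1 network from the example block, with illustrative weights.
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W2 = [[0.7, 0.8, 0.9]]
zs, acts = forward([0.5, 0.8], [W1, W2], [[0.0, 0.0, 0.0], [0.0]])
print(round(acts[-1][0], 3))  # the network's prediction y_hat
```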

Forward Propagation: Implementation Details

Memory Considerations

Storage Requirements:
  • Store all activations $\mathbf{a}^{(l)}$
  • Store all pre-activations $\mathbf{z}^{(l)}$
  • Needed for backpropagation
Memory Usage: $$\text{Memory} \propto \sum_{l=0}^{L} n_l \times \text{batch\_size}$$ Trade-offs:
  • Larger batches: More memory, better GPU utilization
  • Smaller batches: Less memory, more gradient noise

Numerical Stability

Common Issues:
  • Overflow: Large intermediate values
  • Underflow: Very small values → 0
  • NaN propagation: Invalid operations
Solutions:
  • Proper weight initialization
  • Batch normalization
  • Gradient clipping
  • Use stable activation functions (ReLU)

Key Insight

Forward propagation is computationally straightforward, but proper implementation requires attention to memory usage and numerical stability.

Forward Pass: Handworked Example

Network: 2 inputs → 2 hidden → 1 output (sigmoid activation)

Given

Input: $\mathbf{x} = \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix}$ Weights \& Biases: $$\mathbf{W}^{(1)} = \begin{bmatrix} 0.2 & 0.4 \\ 0.3 & 0.1 \end{bmatrix}, \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$\mathbf{W}^{(2)} = \begin{bmatrix} 0.6 & 0.5 \end{bmatrix}, b^{(2)} = 0.3$$ Activation: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Step 1: Hidden Layer

$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$ $$= \begin{bmatrix} 0.2 & 0.4 \\ 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$= \begin{bmatrix} 0.2(0.5) + 0.4(0.8) \\ 0.3(0.5) + 0.1(0.8) \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$= \begin{bmatrix} 0.1 + 0.32 \\ 0.15 + 0.08 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.52 \\ 0.43 \end{bmatrix}$$

Forward Pass: Handworked Example (continued)

Step 2: Hidden Activations

$$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \sigma\left(\begin{bmatrix} 0.52 \\ 0.43 \end{bmatrix}\right)$$ $$a_1^{(1)} = \sigma(0.52) = \frac{1}{1 + e^{-0.52}} = \frac{1}{1 + 0.595} = 0.627$$ $$a_2^{(1)} = \sigma(0.43) = \frac{1}{1 + e^{-0.43}} = \frac{1}{1 + 0.651} = 0.606$$ $$\mathbf{a}^{(1)} = \begin{bmatrix} 0.627 \\ 0.606 \end{bmatrix}$$

Step 3: Output Layer

$$z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)}$$ $$= \begin{bmatrix} 0.6 & 0.5 \end{bmatrix} \begin{bmatrix} 0.627 \\ 0.606 \end{bmatrix} + 0.3$$ $$= 0.6(0.627) + 0.5(0.606) + 0.3$$ $$= 0.376 + 0.303 + 0.3 = 0.979$$ Final Output: $$\hat{y} = \sigma(0.979) = \frac{1}{1 + e^{-0.979}} = 0.727$$

Summary

Input $[0.5, 0.8]$ → Hidden $[0.627, 0.606]$ → Output $0.727$
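The handworked numbers above can be checked by running the same arithmetic, with no values beyond those given in the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce the handworked 2 -> 2 -> 1 forward pass.
x = [0.5, 0.8]
W1, b1 = [[0.2, 0.4], [0.3, 0.1]], [0.1, 0.2]
W2, b2 = [0.6, 0.5], 0.3

z1 = [W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i] for i in range(2)]
a1 = [sigmoid(z) for z in z1]
z2 = W2[0] * a1[0] + W2[1] * a1[1] + b2
y_hat = sigmoid(z2)

print([round(z, 2) for z in z1])  # [0.52, 0.43]
print([round(a, 3) for a in a1])  # [0.627, 0.606]
print(round(y_hat, 3))            # 0.727
```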

Backpropagation: Error Flow

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Network structure (same as forward, but show backward flow) \node[input neuron] (x1) at (0,2) {$x_1$}; \node[input neuron] (x2) at (0,1) {$x_2$}; \node[input neuron] (x3) at (0,0) {$x_3$}; \node[hidden neuron] (h1) at (3.5,1.5) {$h_1$}; \node[hidden neuron] (h2) at (3.5,0.5) {$h_2$}; \node[output neuron] (y) at (7,1) {$y$}; % Forward connections (lighter) \draw[weak connection, opacity=0.15] (x1) -- (h1); \draw[weak connection, opacity=0.15] (x1) -- (h2); \draw[weak connection, opacity=0.15] (x2) -- (h1); \draw[weak connection, opacity=0.15] (x2) -- (h2); \draw[weak connection, opacity=0.15] (x3) -- (h1); \draw[weak connection, opacity=0.15] (x3) -- (h2); \draw[weak connection, opacity=0.15] (h1) -- (y); \draw[weak connection, opacity=0.15] (h2) -- (y); % Error/gradient flow (backward arrows) - positioned below \draw[error arrow, very thick] (6.2,-1.3) -- (4.3,-1.3); \draw[error arrow, very thick] (2.7,-1.3) -- (0.8,-1.3); \node[below, font=\small\bfseries] at (3.5,-1.1) {Error Backpropagation}; % Loss function \node[function box] (loss) at (8.2,1) {$L$}; \draw[error arrow] (y) -- (loss); \node[right=0.05cm of loss, font=\small] {Loss}; % Gradient computations - positioned above network \node[computation] (grad2) at (5.25,2.8) {$\boldsymbol{\delta}^{(2)}$}; \node[computation] (grad1) at (1.75,2.8) {$\boldsymbol{\delta}^{(1)}$}; % Chain rule arrows \draw[gradient arrow] (loss) to[bend left=20] (grad2); \draw[gradient arrow] (grad2) to[bend left=20] (grad1); % Mathematical formulation - better positioning \node[below, font=\small] at (1.75,-0.8) {$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^T$}; \node[below, font=\small] at (5.25,-0.8) {$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \boldsymbol{\delta}^{(2)} (\mathbf{a}^{(1)})^T$}; % Layer labels \node[layer label] at (0,-1.5) {Input}; \node[layer label] at (3.5,-1.5) {Hidden}; \node[layer label] at (7,-1.5) 
{Output}; % Gradient flow labels - above network \node[above=0.1cm of grad1, font=\scriptsize] {$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)})^T \boldsymbol{\delta}^{(2)} \odot \sigma'(\mathbf{z}^{(1)})$}; \node[above=0.1cm of grad2, font=\scriptsize] {$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{a}^{(2)}} \odot \sigma'(\mathbf{z}^{(2)})$}; \end{tikzpicture}

Backpropagation

Efficient algorithm to compute gradients by propagating errors backward through the network using the chain rule.

Mathematical Foundation: Chain Rule

Goal: Compute $\frac{\partial L}{\partial \mathbf{W}^{(l)}}$ and $\frac{\partial L}{\partial \mathbf{b}^{(l)}}$ for all layers Chain Rule Application: $$\begin{aligned}\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}} \\ \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{b}^{(l)}} \\ \frac{\partial L}{\partial \mathbf{a}^{(l-1)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{a}^{(l-1)}}\end{aligned}$$ Key Insight: Define error terms $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$

Gradient Computations

$$\begin{aligned}\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T \\ \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} \\ \boldsymbol{\delta}^{(l-1)} = (\mathbf{W}^{(l)})^T \boldsymbol{\delta}^{(l)} \odot \sigma'(\mathbf{z}^{(l-1)})\end{aligned}$$

Output Layer

For output layer $L$: $$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$ Common case (MSE + sigmoid): $$\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \mathbf{a}^{(L)} \odot (1 - \mathbf{a}^{(L)})$$
where $\odot$ denotes element-wise multiplication.

Backpropagation Algorithm

\begin{algorithm}[H] \caption{Backpropagation} \begin{algorithmic}[1] \STATE Input: Training example $(\mathbf{x}, \mathbf{y})$, network weights \STATE Forward Pass: Compute all $\mathbf{a}^{(l)}$ and $\mathbf{z}^{(l)}$ (store them!) \STATE Compute Output Error: $\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$ \FOR{$l = L-1$ down to $1$} \STATE Propagate Error: $\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})$ \ENDFOR \FOR{$l = 1$ to $L$} \STATE Compute Gradients: \STATE $\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$ \STATE $\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$ \ENDFOR \end{algorithmic} \end{algorithm}
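As a minimal sketch of the algorithm, here is one full forward-backward pass on the $2 \to 2 \to 1$ network from the handworked example, with MSE loss and an assumed target $y = 1$ (the target is an illustrative choice, not from the slides):

```python
import math

def sigmoid(z):   return 1.0 / (1.0 + math.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)

# Network and input from the worked example; target y is made up.
x, y = [0.5, 0.8], 1.0
W1, b1 = [[0.2, 0.4], [0.3, 0.1]], [0.1, 0.2]
W2, b2 = [0.6, 0.5], 0.3

# Forward pass (store z and a -- the backward pass needs them).
z1 = [W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i] for i in range(2)]
a1 = [sigmoid(z) for z in z1]
z2 = sum(W2[i] * a1[i] for i in range(2)) + b2
y_hat = sigmoid(z2)

# Backward pass.
delta2 = (y_hat - y) * d_sigmoid(z2)                             # output error
delta1 = [W2[i] * delta2 * d_sigmoid(z1[i]) for i in range(2)]   # propagate back

# Gradients: dL/dW^(l) = delta^(l) (a^(l-1))^T, dL/db^(l) = delta^(l).
dW2 = [delta2 * a1[i] for i in range(2)]
db2 = delta2
dW1 = [[delta1[i] * x[j] for j in range(2)] for i in range(2)]
db1 = delta1
print(round(delta2, 4))  # negative, since y_hat = 0.727 undershoots y = 1
```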

Computational Complexity

Time: $O(\text{number of weights})$
  • Same order as forward pass
  • Very efficient vs numerical gradients
Space: $O(\text{network size})$
  • Must store all activations
  • Memory scales with depth

Why Backpropagation Works

  • Efficiency: Reuses computations via chain rule
  • Automatic: No manual gradient derivation
  • Exact: Computes exact gradients
  • General: Works for any differentiable network
Historical Impact:
  • Rumelhart, Hinton, Williams (1986)
  • Made deep learning practical

Computational Graph Perspective

\begin{tikzpicture}[scale=0.9, every node/.style={scale=0.9}] % Input nodes - better vertical alignment \node[input neuron] (x) at (0,1.8) {$x$}; \node[input neuron] (w) at (0,0.2) {$w$}; % Operation nodes (rectangles for operations) - aligned at y=1 \node[function box] (mul) at (2.5,1) {$\times$}; \node[function box] (add) at (5,1) {$+$}; \node[function box] (sig) at (7.5,1) {$\sigma$}; % Intermediate results - positioned above operations \node[computation] (z1) at (2.5,2.2) {$z_1$}; \node[computation] (z2) at (5,2.2) {$z_2$}; \node[output neuron] (y) at (10,1) {$y$}; % Bias - positioned below mul operation \node[bias neuron] (b) at (2.5,-0.6) {$b$}; % Forward edges \draw[flow arrow] (x) -- (mul); \draw[flow arrow] (w) -- (mul); \draw[flow arrow] (mul) -- (add); \draw[flow arrow] (b) -- (add); \draw[flow arrow] (add) -- (sig); \draw[flow arrow] (sig) -- (y); % Intermediate value labels - better positioning \draw[connection] (mul) -- (z1) node[midway, right, font=\tiny] {$wx$}; \draw[connection] (add) -- (z2) node[midway, right, font=\tiny] {$wx + b$}; % Loss function \node[function box] (loss) at (12,1) {$L$}; \draw[error arrow] (y) -- (loss); % Backward gradients (dashed red arrows) \draw[gradient arrow] (loss) to[bend left=12] (sig); \draw[gradient arrow] (sig) to[bend left=12] (add); \draw[gradient arrow] (add) to[bend left=12] (mul); \draw[gradient arrow] (add) to[bend right=12] (b); \draw[gradient arrow] (mul) to[bend left=12] (x); \draw[gradient arrow] (mul) to[bend right=12] (w); % Gradient labels - cleaner positioning \node[above, font=\tiny] at (11,1.4) {$\frac{\partial L}{\partial y}$}; \node[above, font=\tiny] at (8.75,1.4) {$\frac{\partial L}{\partial z_2}$}; \node[above, font=\tiny] at (6.25,1.4) {$\frac{\partial L}{\partial z_1}$}; \node[left, font=\tiny] at (1.3,0.6) {$\frac{\partial L}{\partial w}$}; \node[right, font=\tiny] at (3.2,-0.4) {$\frac{\partial L}{\partial b}$}; % Mathematical operations - positioned below \node[below, font=\small] at 
(2.5,-1.2) {$z_1 = w \cdot x$}; \node[below, font=\small] at (5,-1.2) {$z_2 = z_1 + b$}; \node[below, font=\small] at (7.5,-1.2) {$y = \sigma(z_2)$}; \end{tikzpicture}

Modern View

Backpropagation is automatic differentiation applied to computational graphs. Modern frameworks (TensorFlow, PyTorch) build these graphs automatically.

4-Layer Neural Network: Differential Equation Derivation

Network Structure: Input → Hidden1 → Hidden2 → Hidden3 → Output Forward Pass Equations: $$\begin{aligned}\mathbf{a}^{(0)} &= \mathbf{x} \quad \text{(input)} \\ \mathbf{z}^{(1)} &= \mathbf{W}^{(1)} \mathbf{a}^{(0)} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) \\ \mathbf{z}^{(2)} &= \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)}) \\ \mathbf{z}^{(3)} &= \mathbf{W}^{(3)} \mathbf{a}^{(2)} + \mathbf{b}^{(3)}, \quad \mathbf{a}^{(3)} = \sigma(\mathbf{z}^{(3)}) \\ \mathbf{z}^{(4)} &= \mathbf{W}^{(4)} \mathbf{a}^{(3)} + \mathbf{b}^{(4)}, \quad \mathbf{a}^{(4)} = \sigma(\mathbf{z}^{(4)}) \quad \text{(output)}\end{aligned}$$ Loss Function: $L = \frac{1}{2}||\mathbf{a}^{(4)} - \mathbf{y}||^2$

Output Layer Error

Starting from the output layer: $$\begin{aligned}\boldsymbol{\delta}^{(4)} &= \frac{\partial L}{\partial \mathbf{z}^{(4)}} \\ &= \frac{\partial L}{\partial \mathbf{a}^{(4)}} \odot \frac{\partial \mathbf{a}^{(4)}}{\partial \mathbf{z}^{(4)}} \\ &= (\mathbf{a}^{(4)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(4)})\end{aligned}$$

Chain Rule Application

For hidden layers ($l = 3, 2, 1$): $$\begin{aligned}\boldsymbol{\delta}^{(l)} &= \frac{\partial L}{\partial \mathbf{z}^{(l)}} \\ &= \frac{\partial L}{\partial \mathbf{z}^{(l+1)}} \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{a}^{(l)}} \frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} \\ &= (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})\end{aligned}$$

4-Layer Network: Complete Backpropagation Derivation

Step-by-Step Gradient Computation:

Error Propagation

Layer 4 (Output): $$\boldsymbol{\delta}^{(4)} = (\mathbf{a}^{(4)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(4)})$$ Layer 3: $$\boldsymbol{\delta}^{(3)} = (\mathbf{W}^{(4)})^T \boldsymbol{\delta}^{(4)} \odot \sigma'(\mathbf{z}^{(3)})$$ Layer 2: $$\boldsymbol{\delta}^{(2)} = (\mathbf{W}^{(3)})^T \boldsymbol{\delta}^{(3)} \odot \sigma'(\mathbf{z}^{(2)})$$ Layer 1: $$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)})^T \boldsymbol{\delta}^{(2)} \odot \sigma'(\mathbf{z}^{(1)})$$

Weight and Bias Gradients

For each layer $l = 1, 2, 3, 4$: Weight Gradients: $$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$$ Bias Gradients: $$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$ Update Rules: $$\begin{aligned}\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{W}^{(l)}} \\ \mathbf{b}^{(l)} := \mathbf{b}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{b}^{(l)}}\end{aligned}$$

Key Insight

The error flows backward through the network, with each layer's error depending on the next layer's error multiplied by the transpose of the connecting weights.

Gradient Descent Optimization

[Figure: ../figures/gradient_descent_visualization.png]

Weight Update Rule

$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{W}^{(l)}}$$ $$\mathbf{b}^{(l)} := \mathbf{b}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{b}^{(l)}}$$ where $\alpha$ is the learning rate.
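A minimal sketch of one update step for a single sigmoid output unit with MSE loss (the learning rate $\alpha = 0.5$ and the data point are arbitrary illustrative choices): after stepping against the gradient, the loss should be smaller.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single sigmoid unit, MSE loss L = 1/2 (y_hat - y)^2. Values are illustrative.
w, b, alpha = [0.6, 0.5], 0.3, 0.5
x, y = [0.627, 0.606], 1.0

def loss(w, b):
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 0.5 * (y_hat - y) ** 2

before = loss(w, b)
z = sum(wi * xi for wi, xi in zip(w, x)) + b
y_hat = sigmoid(z)
delta = (y_hat - y) * y_hat * (1 - y_hat)          # dL/dz for MSE + sigmoid
w = [wi - alpha * delta * xi for wi, xi in zip(w, x)]  # W := W - alpha dL/dW
b = b - alpha * delta                                  # b := b - alpha dL/db
after = loss(w, b)

print(after < before)  # True: the gradient step reduced the loss
```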

The Overfitting Problem

[Figure: ../figures/overfitting_regularization_demo.png]

Overfitting

Model learns training data too well, memorizing noise instead of generalizable patterns.

L1 and L2 Regularization

Add penalty terms to the loss function to control model complexity

L2 Regularization (Ridge)

$$L_{total} = L_{data} + \lambda \sum_{l} ||\mathbf{W}^{(l)}||_2^2$$ where $||\mathbf{W}^{(l)}||_2^2 = \sum_i \sum_j (W_{ij}^{(l)})^2$ Effect:
  • Shrinks weights towards zero
  • Uniform penalty on all weights
  • Smooth weight distributions
  • Preferred for most applications
Gradient Modification: $$\frac{\partial L_{total}}{\partial \mathbf{W}^{(l)}} = \frac{\partial L_{data}}{\partial \mathbf{W}^{(l)}} + 2\lambda \mathbf{W}^{(l)}$$

L1 Regularization (Lasso)

$$L_{total} = L_{data} + \lambda \sum_{l} ||\mathbf{W}^{(l)}||_1$$ where $||\mathbf{W}^{(l)}||_1 = \sum_i \sum_j |W_{ij}^{(l)}|$ Effect:
  • Promotes sparsity (many weights → 0)
  • Automatic feature selection
  • Creates sparse networks
  • Useful for interpretability
Gradient Modification: $$\frac{\partial L_{total}}{\partial \mathbf{W}^{(l)}} = \frac{\partial L_{data}}{\partial \mathbf{W}^{(l)}} + \lambda \text{sign}(\mathbf{W}^{(l)})$$
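Both gradient modifications amount to one extra term added to the data-loss gradient. A minimal sketch (the function name and `lam` parameter are illustrative, not from the slides):

```python
import numpy as np

def regularized_grad(data_grad, W, lam, kind="l2"):
    """Add the L1 or L2 penalty term to a data-loss gradient."""
    if kind == "l2":
        return data_grad + 2.0 * lam * W       # d/dW of lam * ||W||_2^2
    if kind == "l1":
        return data_grad + lam * np.sign(W)    # subgradient of lam * ||W||_1
    raise ValueError(f"unknown kind: {kind}")

W = np.array([[0.5, -2.0], [0.0, 1.0]])
g = np.zeros_like(W)                            # zero data gradient, for clarity
l2_term = regularized_grad(g, W, lam=0.1, kind="l2")  # proportional to W itself
l1_term = regularized_grad(g, W, lam=0.1, kind="l1")  # constant-magnitude push to 0
```

The comparison makes the sparsity effect visible: the L2 term shrinks large weights more than small ones, while the L1 term pushes every nonzero weight toward zero at the same rate.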

Hyperparameter $\lambda$

Controls regularization strength: larger $\lambda$ → more regularization → simpler model

L1 vs L2 Regularization Comparison

[Figure: ../figures/l1_vs_l2_regularization.png]

When to Use L2

  • General-purpose regularization
  • All features potentially relevant
  • Want smooth weight shrinkage
  • Most common choice

When to Use L1

  • Feature selection needed
  • Many irrelevant features
  • Want sparse models
  • Interpretability important

Dropout: A Different Approach

\begin{tikzpicture}[scale=0.8, every node/.style={scale=0.8}] % Training network (with dropout) \node[above, font=\bfseries] at (2.25,3.8) {Training (with Dropout)}; % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (TI-\y) at (0,{3-\y}) {$x_{\y}$}; } % Hidden layer with some dropped out neurons - centered \node[hidden neuron] (TH-0) at (2.5,{3-0}) {}; \node[hidden neuron, fill=gray!50, draw=gray] (TH-1) at (2.5,{3-1}) {\scriptsize X}; % Dropped out \node[hidden neuron] (TH-2) at (2.5,{3-2}) {}; \node[hidden neuron, fill=gray!50, draw=gray] (TH-3) at (2.5,{3-3}) {\scriptsize X}; % Dropped out % Output layer - centered \node[output neuron] (TO) at (5,1.5) {$y$}; % Active connections only - reduced opacity \draw[strong connection, opacity=0.3] (TI-0) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-0) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-1) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-1) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-2) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-2) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-3) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-3) -- (TH-2); \draw[strong connection, opacity=0.3] (TH-0) -- (TO); \draw[strong connection, opacity=0.3] (TH-2) -- (TO); % Testing network (no dropout) \node[above, font=\bfseries] at (8.75,3.8) {Testing (no Dropout)}; % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (EI-\y) at (6.5,{3-\y}) {$x_{\y}$}; } % Hidden layer - all active, centered \foreach \y in {0,1,2,3} { \node[hidden neuron] (EH-\y) at (9,{3-\y}) {}; } % Output layer - centered \node[output neuron] (EO) at (11.5,1.5) {$y$}; % All connections active - very light \foreach \i in {0,1,2,3} { \foreach \j in {0,1,2,3} { \draw[connection, opacity=0.2] (EI-\i) -- (EH-\j); } } \foreach \j in {0,1,2,3} { \draw[connection, opacity=0.2] (EH-\j) -- (EO); } % Dropout probability label - better positioning \node[below, 
font=\footnotesize] at (2.5,-0.5) {Dropout rate: $p = 0.5$}; \node[below, font=\footnotesize] at (9,-0.5) {All neurons active}; % Layer labels \node[layer label] at (0,-1.1) {Input}; \node[layer label] at (2.5,-1.1) {Hidden}; \node[layer label] at (5,-1.1) {Output}; \node[layer label] at (6.5,-1.1) {Input}; \node[layer label] at (9,-1.1) {Hidden}; \node[layer label] at (11.5,-1.1) {Output}; \end{tikzpicture}

Dropout Technique

Randomly set neurons to zero during training to prevent co-adaptation and improve generalization.

Dropout: Mathematical Formulation

Training Phase: $$\begin{aligned}\mathbf{r}^{(l)} &\sim \text{Bernoulli}(p) &&\text{(dropout mask; each unit kept with probability } p\text{)} \\ \tilde{\mathbf{a}}^{(l)} &= \mathbf{r}^{(l)} \odot \mathbf{a}^{(l)} &&\text{(apply mask)} \\ \mathbf{z}^{(l+1)} &= \mathbf{W}^{(l+1)} \tilde{\mathbf{a}}^{(l)} + \mathbf{b}^{(l+1)}\end{aligned}$$ Testing Phase: $$\mathbf{z}^{(l+1)} = p \cdot \mathbf{W}^{(l+1)} \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} \text{ (scale weights)}$$
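This formulation can be sketched as a single forward-pass helper, where $p$ is the keep probability as in the equations above (the array shapes are illustrative):

```python
import numpy as np

def dropout_forward(a, p, training, rng):
    """Dropout as formulated above: p is the KEEP probability.
    Training: multiply activations by a Bernoulli(p) mask.
    Testing: scale activations by p instead."""
    if training:
        mask = (rng.random(a.shape) < p).astype(a.dtype)
        return mask * a
    return p * a

rng = np.random.default_rng(0)
a = np.ones((1000, 1))
train_out = dropout_forward(a, p=0.8, training=True, rng=rng)
test_out = dropout_forward(a, p=0.8, training=False, rng=rng)
# On average, mean(train_out) ≈ p * mean(a) = mean(test_out),
# which is exactly why the test-time scaling preserves expected activations.
```

Modern frameworks typically use the equivalent "inverted dropout" variant, dividing by $p$ at training time so the test-time pass needs no scaling at all.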

Dropout Benefits

  • Prevents overfitting: Reduces complex co-adaptations
  • Model averaging: Approximates ensemble of networks
  • Robust features: Forces redundant representations
  • Easy to implement: Simple modification to forward pass
Typical rates: 0.2-0.5 for hidden layers, 0.1-0.2 for input
\begin{exampleblock}{Implementation Notes} Training vs Testing:
  • Training: Randomly drop neurons
  • Testing: Use all neurons but scale outputs
  • Modern frameworks handle this automatically
Why Scaling Works:
  • Training: Each neuron is "on" with probability $p$
  • Testing: All neurons are "on"
  • Scaling by $p$ maintains expected activation levels
\end{exampleblock}

Best Practice

Use dropout in hidden layers only, not in output layer. Start with rate 0.5 and tune.

Regularization Comparison

[Figure: ../figures/regularization_comparison.png]

Choosing Regularization

Start with:
  • L2 regularization ($\lambda = 0.01$)
  • Dropout (rate = 0.5)
  • Early stopping
If still overfitting:
  • Increase regularization strength
  • Add more dropout
  • Reduce model complexity

Other Techniques

Early Stopping:
  • Monitor validation loss
  • Stop when it starts increasing
  • Simple and effective
Data Augmentation:
  • Artificially increase training data
  • Add noise, rotations, etc.
  • Domain-specific techniques
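The early-stopping rule above reduces to a patience counter over the validation-loss history. A sketch (the loss values are hypothetical, chosen to show a curve that improves and then overfits):

```python
def early_stopping(val_losses, patience=5):
    """Return the epoch at which to stop: when validation loss has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch   # stop; restore the weights saved at best_epoch
    return len(val_losses) - 1

# Hypothetical validation curve: improves until epoch 3, then degrades
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9]
stop = early_stopping(losses, patience=3)   # → 6
```

In practice the model weights are checkpointed at each new best epoch, so stopping also means rolling back to the best checkpoint rather than keeping the final weights.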

Training Curves with Regularization

[Figure: ../figures/training_curves_regularization.png]

Monitoring Training

Use validation curves to detect overfitting and choose regularization strength.

Weight Initialization

Proper initialization is crucial for successful training

Poor Initialization

All zeros: No learning (symmetry) $$W_{ij} = 0 \Rightarrow \text{no gradient flow}$$ Too large: Exploding gradients $$W_{ij} \sim \mathcal{N}(0, 1) \Rightarrow \text{saturation}$$ Too small: Vanishing gradients $$W_{ij} \sim \mathcal{N}(0, 0.01) \Rightarrow \text{weak signals}$$

Good Initialization

Xavier/Glorot (Sigmoid/Tanh): $$W_{ij} \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$ He initialization (ReLU): $$W_{ij} \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}$$ Bias initialization: $$b_i = 0 \text{ (usually sufficient)}$$

Why These Work

Maintain activation variance and gradient variance across layers at initialization.
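Both schemes can be sketched in a few lines (the function name and layer sizes are illustrative):

```python
import numpy as np

def init_weights(n_in, n_out, scheme="he", rng=None):
    """Xavier/Glorot for sigmoid/tanh layers, He for ReLU layers.
    Returns an (n_out, n_in) weight matrix with the scheme's std."""
    rng = rng or np.random.default_rng()
    if scheme == "xavier":
        std = np.sqrt(2.0 / (n_in + n_out))
    elif scheme == "he":
        std = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, std, size=(n_out, n_in))

W = init_weights(512, 256, "he", np.random.default_rng(0))
# The empirical std should be close to sqrt(2/512) ≈ 0.0625
```

Swapping the scheme to match the layer's activation is a one-line change, which is why most frameworks expose it as an initializer argument on the layer.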

Learning Rate and Optimization

Learning Rate Selection

Too high: Overshooting, instability
  • Loss explodes or oscillates
  • Network doesn't converge
  • Weights become very large
Too low: Slow convergence
  • Training takes forever
  • Gets stuck in local minima
  • Poor final performance
Good range: Typically $10^{-4}$ to $10^{-1}$

Advanced Optimizers

SGD with Momentum: $$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \nabla L$$ $$\mathbf{W} := \mathbf{W} - \alpha \mathbf{v}_t$$ Adam (Adaptive Moments): $$\begin{aligned}\mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla L \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) (\nabla L)^2 \\ \hat{\mathbf{m}}_t &= \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)} \\ \mathbf{W} &:= \mathbf{W} - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\end{aligned}$$ Default choice: Adam with $\alpha = 0.001$
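A single Adam step, including the bias-correction terms that standard Adam applies to the raw moment estimates, can be sketched as follows (the quadratic toy objective is illustrative):

```python
import numpy as np

def adam_step(W, grad, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter.
    Returns the new weights and updated moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)   # bias correction for the second moment
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# Toy objective: ||W - target||^2, so grad = 2 * (W - target)
target = np.array([1.0, -2.0, 0.5])
W = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 101):
    grad = 2.0 * (W - target)
    W, m, v = adam_step(W, grad, m, v, t)
# Each coordinate drifts toward the target at roughly alpha per step,
# regardless of gradient magnitude (Adam normalizes the step size).
```

The near-constant step size is the key behavioral difference from plain SGD, where the step scales directly with the gradient magnitude.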

Learning Rate Scheduling

Decay strategies: Step decay, exponential decay, cosine annealing. Start high, reduce during training.

Training Diagnostics

Monitor these metrics during training:

Loss Monitoring

  • Training loss: Should trend steadily downward (small minibatch fluctuations are normal)
  • Validation loss: Should decrease, then stabilize
  • Gap: Indicates overfitting if too large
Warning Signs:
  • Loss increases: Learning rate too high
  • Loss plateaus early: Learning rate too low
  • Validation loss increases: Overfitting

Gradient Monitoring

  • Gradient norms: Should be reasonable ($10^{-6}$ to $10^{-1}$)
  • Vanishing: Gradients → 0 in early layers
  • Exploding: Gradients become very large

Activation Monitoring

  • Activation statistics: Mean, std, sparsity
  • Dead neurons: Always output zero
  • Saturated neurons: Always in saturation region
Healthy activations:
  • Reasonable variance (not too small/large)
  • Some sparsity (for ReLU)
  • No layers completely dead

Weight Monitoring

  • Weight distributions: Should be reasonable
  • Weight updates: $|\Delta W| / |W| \approx 10^{-3}$
  • Layer-wise learning rates: May need adjustment

Tools

Use TensorBoard, Weights \& Biases, or similar tools for comprehensive monitoring and visualization.

Common Problems and Solutions

Problem: Vanishing Gradients

Symptoms:
  • Early layers don't learn
  • Gradients approach zero
Solutions:
  • Use ReLU activations
  • Proper weight initialization
  • Batch normalization

Problem: Overfitting

Symptoms:
  • Training accuracy >> validation accuracy
  • Validation loss increases
Solutions:
  • Add regularization (L2, dropout)
  • Reduce model complexity
  • More training data

Problem: Exploding Gradients

Symptoms:
  • Loss becomes NaN
  • Weights blow up
Solutions:
  • Gradient clipping
  • Lower learning rate
  • Better initialization
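Gradient clipping, the first solution listed, rescales all gradients together so their global L2 norm stays bounded. A sketch (the function name is illustrative; it mirrors what frameworks like PyTorch provide built in):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined (global)
    L2 norm does not exceed max_norm. Returns the clipped gradients
    and the norm measured before clipping."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# Deliberately huge gradients, as seen when gradients explode
grads = [np.full((2, 2), 10.0), np.full((2,), 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
# norm_before = sqrt(600) ≈ 24.5; after clipping the global norm is 1.0
```

Clipping by the global norm (rather than per tensor) preserves the direction of the overall update, only shrinking its length.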

Problem: Slow Convergence

Symptoms:
  • Loss decreases slowly
  • Gets stuck in plateaus
Solutions:
  • Increase learning rate
  • Use adaptive optimizers

Neural Networks: Key Takeaways

Core Concepts

  • Perceptron: Basic building block
  • Multi-layer: Enable complex mappings
  • Activation functions: Provide non-linearity
  • Forward propagation: Compute predictions
  • Backpropagation: Compute gradients efficiently
  • Regularization: Prevent overfitting

Mathematical Foundation

  • Matrix operations for efficiency
  • Chain rule for gradient computation
  • Optimization theory for training
  • Probability theory for interpretation

Best Practices

  • Architecture: Start simple, add complexity gradually
  • Initialization: Xavier/He for proper gradient flow
  • Optimization: Adam optimizer with proper learning rate
  • Regularization: L2 + Dropout for generalization
  • Monitoring: Track loss, gradients, activations
  • Debugging: Systematic approach to problems

When to Use Neural Networks

  • Large datasets available
  • Complex non-linear patterns
  • End-to-end learning desired
  • Feature engineering is difficult

Modern Deep Learning

These fundamentals scale to modern architectures: CNNs, RNNs, Transformers, ResNets, etc.

Applications \& Real-World Impact

Computer Vision

  • Image classification: ResNet, EfficientNet
  • Object detection: YOLO, R-CNN
  • Segmentation: U-Net, Mask R-CNN
  • Face recognition: DeepFace, FaceNet
  • Medical imaging: Cancer detection, radiology

Natural Language Processing

  • Language models: GPT, BERT, T5
  • Translation: Google Translate, DeepL
  • Chatbots: ChatGPT, virtual assistants
  • Text analysis: Sentiment, summarization

Other Domains

  • Speech: Recognition, synthesis, processing
  • Recommendation: Netflix, Amazon, Spotify
  • Games: AlphaGo, OpenAI Five, StarCraft
  • Robotics: Control, perception, planning
  • Finance: Trading, fraud detection, risk
  • Science: Drug discovery, climate modeling

Emerging Areas

  • Generative AI: DALL-E, Midjourney, Stable Diffusion
  • Multimodal: CLIP, GPT-4V
  • Reinforcement Learning: Autonomous systems
  • Scientific Computing: Physics, chemistry, biology

Impact

Neural networks have revolutionized AI and are now fundamental to most modern machine learning applications.

Looking Forward: Advanced Topics

What's Next After This Foundation?

Specialized Architectures

  • Convolutional Neural Networks (CNNs)
    • Spatial structure exploitation
    • Translation invariance
    • Computer vision applications
  • Recurrent Neural Networks (RNNs)
    • Sequential data processing
    • Memory and temporal dynamics
    • LSTM, GRU variants
  • Transformer Networks
    • Attention mechanisms
    • Parallel processing
    • Modern NLP backbone

Advanced Techniques

  • Batch Normalization
    • Internal covariate shift
    • Training acceleration
  • Residual Connections
    • Very deep networks
    • Gradient flow improvement
  • Attention Mechanisms
    • Selective focus
    • Long-range dependencies
  • Generative Models
    • VAEs, GANs, Diffusion
    • Creative AI applications

Next Steps

Practice implementation, experiment with real datasets, and explore specialized architectures for your domain of interest.

End of Module 12

Artificial Neural Networks

Questions?