
Artificial Neural Networks

CMSC 173 - Module 12

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What Are Neural Networks?

Artificial Neural Networks: Computing systems inspired by biological neural networks

Biological Inspiration

  • Neurons: Basic processing units
  • Synapses: Weighted connections
  • Learning: Adapting connection strengths
  • Parallel processing: Massive connectivity

Artificial Counterpart

  • Perceptrons: Mathematical neurons
  • Weights: Learnable parameters
  • Training: Gradient-based optimization
  • Layers: Organized processing units

Key Insight

Neural networks can learn complex non-linear mappings from data by adjusting weights through training.

Why Neural Networks?

Motivation: Limitations of Linear Models

Linear Models

  • Limited to linear decision boundaries
  • Cannot solve XOR problem
  • Restricted representational power
  • Simple but insufficient for complex data
Example: XOR Problem

\begin{tabular}{cc|c}
$x_1$ & $x_2$ & XOR \\ \hline
0 & 0 & 0 \\
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\end{tabular}

No linear classifier can solve this!
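While no single linear unit can compute XOR, a two-layer network can. As a minimal sketch (the weights below are hand-chosen for illustration, not from the slides), XOR can be built as AND(OR, NAND) using step-activated perceptrons:

```python
# XOR via a two-layer step network: XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)).
# Weights and thresholds are hand-picked illustrative values.
def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)    # hidden unit 1: OR gate
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)   # hidden unit 2: NAND gate
    return step(1.0 * h1 + 1.0 * h2 - 1.5)  # output unit: AND of h1, h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))  # reproduces the XOR truth table
```

The hidden layer remaps the four inputs so that the output unit sees a linearly separable problem.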

Neural Networks

  • Non-linear decision boundaries
  • Universal approximation capability
  • Hierarchical feature learning
  • Scalable to complex problems
Universal Approximation Theorem: A neural network with a single hidden layer can approximate any continuous function to arbitrary accuracy (given sufficient neurons).

Key Advantages:
  • Automatic feature extraction
  • End-to-end learning
  • Flexible architectures

The Perceptron: Building Block of Neural Networks

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Input nodes - better vertical spacing and centered \node[input neuron] (x1) at (0,2.5) {$x_1$}; \node[input neuron] (x2) at (0,1.5) {$x_2$}; \node[input neuron] (x3) at (0,0.5) {$x_3$}; \node[bias neuron] (x0) at (0,-0.5) {$x_0$}; % \node[above=0.05cm of x0, font=\tiny] {bias}; % Intermediate processing nodes - better horizontal alignment \node[computation] (sum) at (3.5,1) {$\sum$}; \node[activation] (sigma) at (5.5,1) {$\sigma$}; % Output neuron - centered vertically with processing nodes \node[output neuron, large neuron] (y) at (7.5,1) {$y$}; % Connections from inputs to summation with better routing \coordinate (sumIn) at (2.8,1); \draw[strong connection] (x1) -- (sumIn); \draw[strong connection] (x2) -- (sumIn); \draw[strong connection] (x3) -- (sumIn); \draw[strong connection] (x0) -- (sumIn); % Weight labels positioned clearly above connections \node at (1.4, 2.2) [font=\small] {$w_1$}; \node at (1.4, 1.5) [font=\small] {$w_2$}; \node at (1.4, 0.8) [font=\small] {$w_3$}; \node at (1.4, 0.1) [font=\small] {$b$}; % Flow arrows between processing stages \draw[flow arrow] (sum) -- (sigma); \draw[flow arrow] (sigma) -- (y); % Mathematical formulation - positioned below with consistent spacing \node[below=0.8cm of sum, font=\small] {$z = \sum_{i=1}^{n} w_i x_i + b$}; \node[below=0.8cm of sigma, font=\small] {$y = \sigma(z)$}; % Input and Output labels - better positioning \node[left=0.2cm of x1, font=\bfseries\small] {Inputs}; \node[right=0.2cm of y, font=\bfseries\small] {Output}; \end{tikzpicture}

Mathematical Model

Linear Combination: $z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b$ Activation: $y = \sigma(z)$ where $\sigma$ is an activation function

Neural Network Components and Architecture

Single Processing Unit \begin{tikzpicture}[scale=0.8, every node/.style={scale=0.8}] % Single neuron diagram - centered at (3,2) \node[neuron, minimum size=1.2cm] (neuron) at (3,2) {$y$}; % Input nodes properly positioned \node[left] (x1) at (0,3.2) {$x_1$}; \node[left] (x2) at (0,2.4) {$x_2$}; \node[left] (xdots) at (0,1.6) {$⋮$}; \node[left] (xD) at (0,0.8) {$x_D$}; \node[left] (bias) at (0,3.8) {$w_0$}; % Input connections with clear weight labels \draw[connection] (x1) -- (neuron) node[pos=0.3, above] {$w_1$}; \draw[connection] (x2) -- (neuron) node[pos=0.3, above] {$w_2$}; \draw[connection] (xdots) -- (neuron); \draw[connection] (xD) -- (neuron) node[pos=0.3, below] {$w_D$}; \draw[connection] (bias) -- (neuron) node[pos=0.35, above, sloped, font=\tiny] {bias}; % Output with clear spacing \draw[connection] (neuron) -- (5.5,2) node[right] {$y := \sigma(z)$}; % Activation function annotation - better positioned \node[above=0.5cm of neuron, font=\tiny] {Activation}; \node[above=0.25cm of neuron, font=\tiny] {Function, $\sigma$}; \end{tikzpicture} \small{Single processing unit with inputs $x_1, …, x_D$, weights $w_1, …, w_D$, bias $w_0$, and activation function $\sigma$.}
Multi-Layer Perceptron \begin{tikzpicture}[scale=0.7, every node/.style={scale=0.7}] % Input layer - vertically centered \foreach \y in {1,2,3,4} { \node[input neuron] (I-\y) at (0,{4.5-\y}) {$x_\y$}; } % Hidden layer 1 - centered with 5 nodes \foreach \y in {1,2,3,4,5} { \node[hidden neuron] (H1-\y) at (2.8,{5-\y}) {}; } % Hidden layer 2 - centered with 3 nodes \foreach \y in {1,2,3} { \node[hidden neuron] (H2-\y) at (5.6,{3.5-\y}) {}; } % Output layer - centered \node[output neuron] (O-1) at (8.4,2) {$y$}; % Connections input to hidden1 \foreach \i in {1,2,3,4} { \foreach \j in {1,2,3,4,5} { \draw[connection, opacity=0.25] (I-\i) -- (H1-\j); } } % Connections hidden1 to hidden2 \foreach \i in {1,2,3,4,5} { \foreach \j in {1,2,3} { \draw[connection, opacity=0.25] (H1-\i) -- (H2-\j); } } % Connections hidden2 to output \foreach \i in {1,2,3} { \draw[connection, opacity=0.25] (H2-\i) -- (O-1); } % Layer labels - better positioned \node[layer label] at (0,-0.8) {Input}; \node[layer label] at (2.8,-0.8) {Hidden 1}; \node[layer label] at (5.6,-0.8) {Hidden 2}; \node[layer label] at (8.4,-0.8) {Output}; \end{tikzpicture} \small{Multi-layer perceptron with fully connected layers. Each connection represents a learnable weight parameter.}

Key Concepts

Processing Unit: $z = \sum_{i=1}^{D} w_i x_i + w_0$, then $y = \sigma(z)$ \\ Network: Multiple units arranged in layers with feedforward connections

Perceptron: Mathematical Formulation

Complete Mathematical Description: $$\begin{aligned}z &= \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b \\ y &= \sigma(z) = \sigma(\mathbf{w}^T\mathbf{x} + b)\end{aligned}$$ where:
  • $\mathbf{x} = [x_1, x_2, …, x_n]^T$: input vector
  • $\mathbf{w} = [w_1, w_2, …, w_n]^T$: weight vector
  • $b$: bias term
  • $\sigma(\cdot)$: activation function
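The two-step computation above can be sketched directly. This is a minimal illustration (the particular weights, bias, and inputs are made-up values, chosen to match the $z = 0.52$ case worked through later in the module):

```python
import math

# One perceptron: z = w.x + b (linear combination), y = sigma(z) (activation).
def perceptron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w^T x + b
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid activation

y = perceptron(x=[0.5, 0.8], w=[0.2, 0.4], b=0.1)
print(round(y, 3))  # sigma(0.52) = 0.627
```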

Step Function (Original)

$$\sigma(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$ Problem: Not differentiable

Sigmoid Function (Modern)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$ Advantage: Smooth and differentiable

Perceptron Learning Algorithm

Goal: Learn weights $\mathbf{w}$ and bias $b$ to minimize prediction error

Original Perceptron Rule

For misclassified point $(x_i, y_i)$: $$w_j := w_j + \alpha (y_i - \hat{y}_i) x_{ij}$$ $$b := b + \alpha (y_i - \hat{y}_i)$$ where $\alpha$ is the learning rate. Convergence: Guaranteed for linearly separable data
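A minimal sketch of the original perceptron rule, trained on the linearly separable AND function (the learning rate, initial weights, and epoch count are arbitrary choices for illustration):

```python
# Perceptron learning rule: update only on misclassified points,
# w_j := w_j + alpha * (y - y_hat) * x_j, b := b + alpha * (y - y_hat).
def predict(x, w, b):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else 0

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # AND function
w, b, alpha = [0.0, 0.0], 0.0, 0.1
for _ in range(20):                 # plenty of epochs for this tiny problem
    for x, y in data:
        err = y - predict(x, w, b)  # 0 when the point is classified correctly
        w[0] += alpha * err * x[0]
        w[1] += alpha * err * x[1]
        b    += alpha * err

print([predict(x, w, b) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the convergence guarantee applies and the loop settles on a separating line.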

Gradient Descent (Modern)

Define loss function: $L = \frac{1}{2}(y - \hat{y})^2$ Weight updates: $$\begin{aligned}w_j &:= w_j - \alpha \frac{\partial L}{\partial w_j} = w_j + \alpha (y - \hat{y}) \sigma'(z) x_j \\ b &:= b - \alpha \frac{\partial L}{\partial b} = b + \alpha (y - \hat{y}) \sigma'(z)\end{aligned}$$ Note the sign: $\frac{\partial L}{\partial w_j} = -(y - \hat{y})\,\sigma'(z)\,x_j$, so descending the gradient moves $\hat{y}$ toward $y$.

Limitation

Single perceptron can only learn linearly separable functions. Solution: Multi-layer networks!

Activation Functions: The Heart of Non-linearity

[Figure: ../figures/activation_functions.png]

Purpose

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.

Activation Functions: Mathematical Properties

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$ Properties:
  • Range: $(0, 1)$
  • Smooth and differentiable
  • Output interpretable as probability
Derivative: $$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$ Issues: Vanishing gradients for large $|x|$

Hyperbolic Tangent

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$ Properties:
  • Range: $(-1, 1)$
  • Zero-centered output
  • Steeper gradients than sigmoid
Derivative: $$\tanh'(x) = 1 - \tanh^2(x)$$ Advantage: Zero-centered outputs generally make it preferable to sigmoid for hidden layers, though it still suffers from vanishing gradients for large $|x|$

Activation Functions: ReLU Family

ReLU (Rectified Linear Unit)

$$\text{ReLU}(x) = \max(0, x)$$ Advantages:
  • Computationally efficient
  • No vanishing gradient for $x > 0$
  • Sparse activation
  • Most popular choice
Derivative: $$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

Leaky ReLU

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$ Advantages:
  • Avoids "dying ReLU" problem
  • Small gradient for negative inputs
  • Typically $\alpha = 0.01$
Derivative: $$\text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \leq 0 \end{cases}$$
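The four activations and their derivatives are small enough to write out directly. A minimal sketch (using the standard $\alpha = 0.01$ for Leaky ReLU):

```python
import math

# Activation functions and their derivatives, as defined above.
def sigmoid(x):      return 1.0 / (1.0 + math.exp(-x))
def d_sigmoid(x):    s = sigmoid(x); return s * (1.0 - s)
def d_tanh(x):       return 1.0 - math.tanh(x) ** 2
def relu(x):         return max(0.0, x)
def d_relu(x):       return 1.0 if x > 0 else 0.0
def leaky_relu(x, a=0.01):   return x if x > 0 else a * x
def d_leaky_relu(x, a=0.01): return 1.0 if x > 0 else a

print(d_sigmoid(0.0))  # 0.25 -- the MAXIMUM of sigma'; gradients shrink elsewhere
print(d_tanh(0.0))     # 1.0 -- steeper than sigmoid at the origin
print(d_relu(-2.0), d_leaky_relu(-2.0))  # 0.0 0.01 -- the "dying ReLU" fix
```

The printed values preview the vanishing-gradient discussion: sigmoid's derivative never exceeds $0.25$, while ReLU passes gradients through unchanged for positive inputs.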

Activation Function Derivatives

[Figure: ../figures/activation_derivatives.png]

Why Derivatives Matter

Derivatives are crucial for backpropagation - they determine how errors flow backward through the network during training.

Choosing Activation Functions

\begin{tipblock}{Guidelines} Hidden Layers:
  • ReLU: Default choice (fast, effective)
  • Leaky ReLU: If dying ReLU is a problem
  • Tanh: For zero-centered data
  • Sigmoid: Avoid (vanishing gradients)
Output Layer:
  • Sigmoid: Binary classification
  • Softmax: Multi-class classification
  • Linear: Regression
  • Tanh: Regression (bounded output)
\end{tipblock}

Common Issues

Vanishing Gradients:
  • Sigmoid/Tanh derivatives $\rightarrow 0$ for large inputs
  • Deep networks suffer from this
  • Solution: ReLU activations
Dying ReLU:
  • Neurons get stuck at zero output
  • No gradient flows through
  • Solution: Leaky ReLU, initialization

Best Practice

Start with ReLU for hidden layers and choose output activation based on your task.

Multi-Layer Neural Network Architecture

\begin{tikzpicture}[scale=0.85, every node/.style={scale=0.85}] % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (I-\y) at (0,{3.5-\y}) {$x_{\y}$}; } % Hidden layer 1 - centered with 5 nodes \foreach \y in {0,1,2,3,4} { \node[hidden neuron] (H1-\y) at (3,{4-\y}) {}; } % Hidden layer 2 - centered with 3 nodes \foreach \y in {0,1,2} { \node[hidden neuron] (H2-\y) at (6,{3-\y}) {}; } % Output layer - centered with 2 nodes \foreach \y in {0,1} { \node[output neuron] (O-\y) at (9,{2-\y}) {$y_{\y}$}; } % Connections input to hidden1 \foreach \i in {0,1,2,3} { \foreach \j in {0,1,2,3,4} { \draw[connection, opacity=0.3] (I-\i) -- (H1-\j); } } % Connections hidden1 to hidden2 \foreach \i in {0,1,2,3,4} { \foreach \j in {0,1,2} { \draw[connection, opacity=0.3] (H1-\i) -- (H2-\j); } } % Connections hidden2 to output \foreach \i in {0,1,2} { \foreach \j in {0,1} { \draw[connection, opacity=0.3] (H2-\i) -- (O-\j); } } % Layer labels - better positioned \node[layer label] at (0,-1.3) {Input}; \node[layer label] at (3,-1.3) {Hidden 1}; \node[layer label] at (6,-1.3) {Hidden 2}; \node[layer label] at (9,-1.3) {Output}; % Weight matrix labels - positioned above network \node[above, font=\small] at (1.5,4.5) {$\mathbf{W}^{(1)}, \mathbf{b}^{(1)}$}; \node[above, font=\small] at (4.5,4.5) {$\mathbf{W}^{(2)}, \mathbf{b}^{(2)}$}; \node[above, font=\small] at (7.5,4.5) {$\mathbf{W}^{(3)}, \mathbf{b}^{(3)}$}; % Forward propagation arrows \foreach \x in {1.5,4.5,7.5} { \draw[flow arrow] (\x,-0.8) -- (\x+1,-0.8); } \node[below] at (4.5,-0.8) {Forward Propagation}; \end{tikzpicture}

Key Components

Layers: Input → Hidden → Hidden → ... → Output

Connections: Each neuron connects to all neurons in the next layer (fully connected)

Network Architecture: Mathematical Representation

For a network with $L$ layers: $$\begin{aligned}\mathbf{a}^{(0)} &= \mathbf{x} && \text{(input layer)} \\ \mathbf{z}^{(l)} &= \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} && \text{for } l = 1, 2, \dots, L \\ \mathbf{a}^{(l)} &= \sigma^{(l)}(\mathbf{z}^{(l)}) && \text{for } l = 1, 2, \dots, L \\ \hat{\mathbf{y}} &= \mathbf{a}^{(L)} && \text{(output layer)}\end{aligned}$$ where:
  • $\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$: weight matrix for layer $l$
  • $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$: bias vector for layer $l$
  • $n_l$: number of neurons in layer $l$
  • $\sigma^{(l)}$: activation function for layer $l$

Network Dimensions and Parameters

Matrix Dimensions

For layer $l$:
  • Input: $\mathbf{a}^{(l-1)}$ has shape $(n_{l-1}, 1)$
  • Weights: $\mathbf{W}^{(l)}$ has shape $(n_l, n_{l-1})$
  • Output: $\mathbf{a}^{(l)}$ has shape $(n_l, 1)$
Batch Processing:
  • Input batch: $\mathbf{A}^{(l-1)}$ has shape $(n_{l-1}, m)$
  • Output batch: $\mathbf{A}^{(l)}$ has shape $(n_l, m)$
  • where $m$ is the batch size

Parameter Count

Total parameters: $$\sum_{l=1}^{L} (n_l \times n_{l-1} + n_l)$$ Example: 784 → 128 → 64 → 10 $$\begin{aligned}784 \times 128 + 128 \\ + 128 \times 64 + 64 \\ + 64 \times 10 + 10 \\ = 109,386 \text{ parameters}\end{aligned}$$ Memory scales with:
  • Network depth
  • Layer width
  • Batch size

Network Design Considerations

Depth vs Width

Deeper Networks:
  • More layers, fewer neurons per layer
  • Better feature hierarchies
  • Can represent more complex functions
  • Risk: vanishing gradients
Wider Networks:
  • Fewer layers, more neurons per layer
  • More parameters at each level
  • Easier to train
  • Risk: overfitting
\begin{tipblock}{Architecture Guidelines} Hidden Layer Size:
  • Start with 1-2 hidden layers
  • Size between input and output dimensions
  • Rule of thumb: $\sqrt{n_{input} \times n_{output}}$
Number of Layers:
  • Simple problems: 1-2 hidden layers
  • Complex problems: 3+ layers
  • Very deep: Requires special techniques
\end{tipblock}

Rule of Thumb

Start simple and gradually increase complexity. Use validation performance to guide architecture choices.

Forward Propagation: Information Flow

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Network structure (simplified 3-layer) - centered vertically \node[input neuron] (x1) at (0,2) {$x_1$}; \node[input neuron] (x2) at (0,1) {$x_2$}; \node[input neuron] (x3) at (0,0) {$x_3$}; \node[hidden neuron] (h1) at (3.5,1.5) {$h_1$}; \node[hidden neuron] (h2) at (3.5,0.5) {$h_2$}; \node[output neuron] (y) at (7,1) {$y$}; % Connections with reduced opacity \draw[strong connection, opacity=0.25] (x1) -- (h1); \draw[strong connection, opacity=0.25] (x1) -- (h2); \draw[strong connection, opacity=0.25] (x2) -- (h1); \draw[strong connection, opacity=0.25] (x2) -- (h2); \draw[strong connection, opacity=0.25] (x3) -- (h1); \draw[strong connection, opacity=0.25] (x3) -- (h2); \draw[strong connection, opacity=0.25] (h1) -- (y); \draw[strong connection, opacity=0.25] (h2) -- (y); % Flow arrows and computations - positioned above network \node[computation] (z1) at (1.75,2.8) {$\mathbf{z}^{(1)}$}; \node[activation] (a1) at (1.75,3.6) {$\sigma$}; \draw[flow arrow] (z1) -- (a1); \node[above=0.1cm of a1, font=\small] {$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$}; \node[computation] (z2) at (5.25,2.8) {$\mathbf{z}^{(2)}$}; \node[activation] (a2) at (5.25,3.6) {$\sigma$}; \draw[flow arrow] (z2) -- (a2); \node[above=0.1cm of a2, font=\small] {$\mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$}; % Mathematical formulation - better positioning \node[below, font=\small] at (1.75,-0.8) {$\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$}; \node[below, font=\small] at (5.25,-0.8) {$\mathbf{z}^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}$}; % Layer labels \node[layer label] at (0,-1.5) {Input}; \node[layer label] at (3.5,-1.5) {Hidden}; \node[layer label] at (7,-1.5) {Output}; % Forward flow arrows - positioned below labels \draw[flow arrow, very thick] (0.8,-1.3) -- (2.7,-1.3); \draw[flow arrow, very thick] (4.3,-1.3) -- (6.2,-1.3); \node[below, font=\small\bfseries] at (3.5,-1.1) {Forward 
Propagation}; \end{tikzpicture}

Forward Pass

Information flows from input to output, layer by layer, to compute predictions.

Forward Propagation Algorithm

Step-by-step Process: \begin{algorithm}[H] \caption{Forward Propagation} \begin{algorithmic}[1] \STATE Input: $\mathbf{x}$, weights $\{\mathbf{W}^{(l)}\}$, biases $\{\mathbf{b}^{(l)}\}$ \STATE Set $\mathbf{a}^{(0)} = \mathbf{x}$ \FOR{$l = 1$ to $L$} \STATE Compute pre-activation: $\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$ \STATE Apply activation: $\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})$ \ENDFOR \STATE Output: $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$ \end{algorithmic} \end{algorithm}
\begin{exampleblock}{Vectorized Implementation} For batch processing: $$\begin{aligned}\mathbf{Z}^{(l)} &= \mathbf{A}^{(l-1)} \mathbf{W}^{(l)T} + \mathbf{b}^{(l)} \\ \mathbf{A}^{(l)} &= \sigma^{(l)}(\mathbf{Z}^{(l)})\end{aligned}$$ where $\mathbf{A}^{(l)}$ has shape $(m, n_l)$ for $m$ examples. Computational Complexity: $O(L \cdot N^2 \cdot M)$ where $L$ = layers, $N$ = max neurons/layer, $M$ = batch size (layer $l$ costs $n_l \cdot n_{l-1} \cdot M$ multiply-adds) \end{exampleblock}
\begin{exampleblock}{Example Calculation} Network: $2 \rightarrow 3 \rightarrow 1$ Input: $\mathbf{x} = [0.5, 0.8]^T$ Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$ $\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$ Layer 2: $z^{(2)} = \mathbf{w}^{(2)T} \mathbf{a}^{(1)} + b^{(2)}$ $\hat{y} = \sigma(z^{(2)})$ All intermediate values $\mathbf{z}^{(l)}, \mathbf{a}^{(l)}$ are stored for backpropagation. \end{exampleblock}
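The forward propagation algorithm above can be sketched as a short loop. This is an illustrative implementation (plain Python lists, sigmoid at every layer, and made-up weights for the $2 \to 3 \to 1$ example); note how all $\mathbf{z}^{(l)}$ and $\mathbf{a}^{(l)}$ are cached for backpropagation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward propagation for an L-layer network. weights[l] is a list of rows
# (shape n_l x n_{l-1}); all z and a values are stored for the backward pass.
def forward(x, weights, biases):
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = [sum(wij * aj for wij, aj in zip(row, a)) + bi
             for row, bi in zip(W, b)]        # z^(l) = W^(l) a^(l-1) + b^(l)
        a = [sigmoid(zi) for zi in z]         # a^(l) = sigma(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations

# The 2 -> 3 -> 1 network from the example block, with illustrative weights.
W1 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W2 = [[0.7, 0.8, 0.9]]
zs, acts = forward([0.5, 0.8], [W1, W2], [[0.0, 0.0, 0.0], [0.0]])
print(round(acts[-1][0], 3))  # the network's prediction y_hat
```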

Forward Propagation: Implementation Details

Memory Considerations

Storage Requirements:
  • Store all activations $\mathbf{a}^{(l)}$
  • Store all pre-activations $\mathbf{z}^{(l)}$
  • Needed for backpropagation
Memory Usage: $$\text{Memory} \propto \sum_{l=0}^{L} n_l \times \text{batch\_size}$$ Trade-offs:
  • Larger batches: More memory, better GPU utilization
  • Smaller batches: Less memory, more gradient noise

Numerical Stability

Common Issues:
  • Overflow: Large intermediate values
  • Underflow: Very small values → 0
  • NaN propagation: Invalid operations
Solutions:
  • Proper weight initialization
  • Batch normalization
  • Gradient clipping
  • Use stable activation functions (ReLU)

Key Insight

Forward propagation is computationally straightforward, but proper implementation requires attention to memory usage and numerical stability.

Forward Pass: Handworked Example

Network: 2 inputs → 2 hidden → 1 output (sigmoid activation)

Given

Input: $\mathbf{x} = \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix}$ Weights \& Biases: $$\mathbf{W}^{(1)} = \begin{bmatrix} 0.2 & 0.4 \\ 0.3 & 0.1 \end{bmatrix}, \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$\mathbf{W}^{(2)} = \begin{bmatrix} 0.6 & 0.5 \end{bmatrix}, b^{(2)} = 0.3$$ Activation: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Step 1: Hidden Layer

$$\mathbf{z}^{(1)} = \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$$ $$= \begin{bmatrix} 0.2 & 0.4 \\ 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.8 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$= \begin{bmatrix} 0.2(0.5) + 0.4(0.8) \\ 0.3(0.5) + 0.1(0.8) \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$ $$= \begin{bmatrix} 0.1 + 0.32 \\ 0.15 + 0.08 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.52 \\ 0.43 \end{bmatrix}$$

Forward Pass: Handworked Example (continued)

Step 2: Hidden Activations

$$\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) = \sigma\left(\begin{bmatrix} 0.52 \\ 0.43 \end{bmatrix}\right)$$ $$a_1^{(1)} = \sigma(0.52) = \frac{1}{1 + e^{-0.52}} = \frac{1}{1 + 0.595} = 0.627$$ $$a_2^{(1)} = \sigma(0.43) = \frac{1}{1 + e^{-0.43}} = \frac{1}{1 + 0.651} = 0.606$$ $$\mathbf{a}^{(1)} = \begin{bmatrix} 0.627 \\ 0.606 \end{bmatrix}$$

Step 3: Output Layer

$$z^{(2)} = \mathbf{W}^{(2)} \mathbf{a}^{(1)} + b^{(2)}$$ $$= \begin{bmatrix} 0.6 & 0.5 \end{bmatrix} \begin{bmatrix} 0.627 \\ 0.606 \end{bmatrix} + 0.3$$ $$= 0.6(0.627) + 0.5(0.606) + 0.3$$ $$= 0.376 + 0.303 + 0.3 = 0.979$$ Final Output: $$\hat{y} = \sigma(0.979) = \frac{1}{1 + e^{-0.979}} = 0.727$$

Summary

Input $[0.5, 0.8]$ → Hidden $[0.627, 0.606]$ → Output $0.727$
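The handworked numbers above can be checked by running the same arithmetic, with no values beyond those given in the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Reproduce the handworked 2 -> 2 -> 1 forward pass.
x = [0.5, 0.8]
W1, b1 = [[0.2, 0.4], [0.3, 0.1]], [0.1, 0.2]
W2, b2 = [0.6, 0.5], 0.3

z1 = [W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i] for i in range(2)]
a1 = [sigmoid(z) for z in z1]
z2 = W2[0] * a1[0] + W2[1] * a1[1] + b2
y_hat = sigmoid(z2)

print([round(z, 2) for z in z1])  # [0.52, 0.43]
print([round(a, 3) for a in a1])  # [0.627, 0.606]
print(round(y_hat, 3))            # 0.727
```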

Backpropagation: Error Flow

\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}] % Network structure (same as forward, but show backward flow) \node[input neuron] (x1) at (0,2) {$x_1$}; \node[input neuron] (x2) at (0,1) {$x_2$}; \node[input neuron] (x3) at (0,0) {$x_3$}; \node[hidden neuron] (h1) at (3.5,1.5) {$h_1$}; \node[hidden neuron] (h2) at (3.5,0.5) {$h_2$}; \node[output neuron] (y) at (7,1) {$y$}; % Forward connections (lighter) \draw[weak connection, opacity=0.15] (x1) -- (h1); \draw[weak connection, opacity=0.15] (x1) -- (h2); \draw[weak connection, opacity=0.15] (x2) -- (h1); \draw[weak connection, opacity=0.15] (x2) -- (h2); \draw[weak connection, opacity=0.15] (x3) -- (h1); \draw[weak connection, opacity=0.15] (x3) -- (h2); \draw[weak connection, opacity=0.15] (h1) -- (y); \draw[weak connection, opacity=0.15] (h2) -- (y); % Error/gradient flow (backward arrows) - positioned below \draw[error arrow, very thick] (6.2,-1.3) -- (4.3,-1.3); \draw[error arrow, very thick] (2.7,-1.3) -- (0.8,-1.3); \node[below, font=\small\bfseries] at (3.5,-1.1) {Error Backpropagation}; % Loss function \node[function box] (loss) at (8.2,1) {$L$}; \draw[error arrow] (y) -- (loss); \node[right=0.05cm of loss, font=\small] {Loss}; % Gradient computations - positioned above network \node[computation] (grad2) at (5.25,2.8) {$\boldsymbol{\delta}^{(2)}$}; \node[computation] (grad1) at (1.75,2.8) {$\boldsymbol{\delta}^{(1)}$}; % Chain rule arrows \draw[gradient arrow] (loss) to[bend left=20] (grad2); \draw[gradient arrow] (grad2) to[bend left=20] (grad1); % Mathematical formulation - better positioning \node[below, font=\small] at (1.75,-0.8) {$\frac{\partial L}{\partial \mathbf{W}^{(1)}} = \boldsymbol{\delta}^{(1)} \mathbf{x}^T$}; \node[below, font=\small] at (5.25,-0.8) {$\frac{\partial L}{\partial \mathbf{W}^{(2)}} = \boldsymbol{\delta}^{(2)} (\mathbf{a}^{(1)})^T$}; % Layer labels \node[layer label] at (0,-1.5) {Input}; \node[layer label] at (3.5,-1.5) {Hidden}; \node[layer label] at (7,-1.5) 
{Output}; % Gradient flow labels - above network \node[above=0.1cm of grad1, font=\scriptsize] {$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)})^T \boldsymbol{\delta}^{(2)} \odot \sigma'(\mathbf{z}^{(1)})$}; \node[above=0.1cm of grad2, font=\scriptsize] {$\boldsymbol{\delta}^{(2)} = \frac{\partial L}{\partial \mathbf{a}^{(2)}} \odot \sigma'(\mathbf{z}^{(2)})$}; \end{tikzpicture}

Backpropagation

Efficient algorithm to compute gradients by propagating errors backward through the network using the chain rule.

Mathematical Foundation: Chain Rule

Goal: Compute $\frac{\partial L}{\partial \mathbf{W}^{(l)}}$ and $\frac{\partial L}{\partial \mathbf{b}^{(l)}}$ for all layers Chain Rule Application: $$\begin{aligned}\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{W}^{(l)}} \\ \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{b}^{(l)}} \\ \frac{\partial L}{\partial \mathbf{a}^{(l-1)}} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} \frac{\partial \mathbf{z}^{(l)}}{\partial \mathbf{a}^{(l-1)}}\end{aligned}$$ Key Insight: Define error terms $\boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}}$

Gradient Computations

$$\begin{aligned}\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T \\ \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} \\ \boldsymbol{\delta}^{(l-1)} = (\mathbf{W}^{(l)})^T \boldsymbol{\delta}^{(l)} \odot \sigma'(\mathbf{z}^{(l-1)})\end{aligned}$$

Output Layer

For output layer $L$: $$\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$$ Common case (MSE + sigmoid): $$\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \mathbf{a}^{(L)} \odot (1 - \mathbf{a}^{(L)})$$
where $\odot$ denotes element-wise multiplication.

Backpropagation Algorithm

\begin{algorithm}[H] \caption{Backpropagation} \begin{algorithmic}[1] \STATE Input: Training example $(\mathbf{x}, \mathbf{y})$, network weights \STATE Forward Pass: Compute all $\mathbf{a}^{(l)}$ and $\mathbf{z}^{(l)}$ (store them!) \STATE Compute Output Error: $\boldsymbol{\delta}^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot \sigma'(\mathbf{z}^{(L)})$ \FOR{$l = L-1$ down to $1$} \STATE Propagate Error: $\boldsymbol{\delta}^{(l)} = (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})$ \ENDFOR \FOR{$l = 1$ to $L$} \STATE Compute Gradients: \STATE $\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$ \STATE $\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$ \ENDFOR \end{algorithmic} \end{algorithm}
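As a minimal sketch of the algorithm, here is one full forward-backward pass on the $2 \to 2 \to 1$ network from the handworked example, with MSE loss and an assumed target $y = 1$ (the target is an illustrative choice, not from the slides):

```python
import math

def sigmoid(z):   return 1.0 / (1.0 + math.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1.0 - s)

# Network and input from the worked example; target y is made up.
x, y = [0.5, 0.8], 1.0
W1, b1 = [[0.2, 0.4], [0.3, 0.1]], [0.1, 0.2]
W2, b2 = [0.6, 0.5], 0.3

# Forward pass (store z and a -- the backward pass needs them).
z1 = [W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i] for i in range(2)]
a1 = [sigmoid(z) for z in z1]
z2 = sum(W2[i] * a1[i] for i in range(2)) + b2
y_hat = sigmoid(z2)

# Backward pass.
delta2 = (y_hat - y) * d_sigmoid(z2)                             # output error
delta1 = [W2[i] * delta2 * d_sigmoid(z1[i]) for i in range(2)]   # propagate back

# Gradients: dL/dW^(l) = delta^(l) (a^(l-1))^T, dL/db^(l) = delta^(l).
dW2 = [delta2 * a1[i] for i in range(2)]
db2 = delta2
dW1 = [[delta1[i] * x[j] for j in range(2)] for i in range(2)]
db1 = delta1
print(round(delta2, 4))  # negative, since y_hat = 0.727 undershoots y = 1
```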

Computational Complexity

Time: $O(\text{number of weights})$
  • Same order as forward pass
  • Very efficient vs numerical gradients
Space: $O(\text{network size})$
  • Must store all activations
  • Memory scales with depth

Why Backpropagation Works

  • Efficiency: Reuses computations via chain rule
  • Automatic: No manual gradient derivation
  • Exact: Computes exact gradients
  • General: Works for any differentiable network
Historical Impact:
  • Rumelhart, Hinton, Williams (1986)
  • Made deep learning practical

Computational Graph Perspective

\begin{tikzpicture}[scale=0.9, every node/.style={scale=0.9}] % Input nodes - better vertical alignment \node[input neuron] (x) at (0,1.8) {$x$}; \node[input neuron] (w) at (0,0.2) {$w$}; % Operation nodes (rectangles for operations) - aligned at y=1 \node[function box] (mul) at (2.5,1) {$\times$}; \node[function box] (add) at (5,1) {$+$}; \node[function box] (sig) at (7.5,1) {$\sigma$}; % Intermediate results - positioned above operations \node[computation] (z1) at (2.5,2.2) {$z_1$}; \node[computation] (z2) at (5,2.2) {$z_2$}; \node[output neuron] (y) at (10,1) {$y$}; % Bias - positioned below mul operation \node[bias neuron] (b) at (2.5,-0.6) {$b$}; % Forward edges \draw[flow arrow] (x) -- (mul); \draw[flow arrow] (w) -- (mul); \draw[flow arrow] (mul) -- (add); \draw[flow arrow] (b) -- (add); \draw[flow arrow] (add) -- (sig); \draw[flow arrow] (sig) -- (y); % Intermediate value labels - better positioning \draw[connection] (mul) -- (z1) node[midway, right, font=\tiny] {$wx$}; \draw[connection] (add) -- (z2) node[midway, right, font=\tiny] {$wx + b$}; % Loss function \node[function box] (loss) at (12,1) {$L$}; \draw[error arrow] (y) -- (loss); % Backward gradients (dashed red arrows) \draw[gradient arrow] (loss) to[bend left=12] (sig); \draw[gradient arrow] (sig) to[bend left=12] (add); \draw[gradient arrow] (add) to[bend left=12] (mul); \draw[gradient arrow] (add) to[bend right=12] (b); \draw[gradient arrow] (mul) to[bend left=12] (x); \draw[gradient arrow] (mul) to[bend right=12] (w); % Gradient labels - cleaner positioning \node[above, font=\tiny] at (11,1.4) {$\frac{\partial L}{\partial y}$}; \node[above, font=\tiny] at (8.75,1.4) {$\frac{\partial L}{\partial z_2}$}; \node[above, font=\tiny] at (6.25,1.4) {$\frac{\partial L}{\partial z_1}$}; \node[left, font=\tiny] at (1.3,0.6) {$\frac{\partial L}{\partial w}$}; \node[right, font=\tiny] at (3.2,-0.4) {$\frac{\partial L}{\partial b}$}; % Mathematical operations - positioned below \node[below, font=\small] at 
(2.5,-1.2) {$z_1 = w \cdot x$}; \node[below, font=\small] at (5,-1.2) {$z_2 = z_1 + b$}; \node[below, font=\small] at (7.5,-1.2) {$y = \sigma(z_2)$}; \end{tikzpicture}

Modern View

Backpropagation is automatic differentiation applied to computational graphs. Modern frameworks (TensorFlow, PyTorch) build these graphs automatically.

4-Layer Neural Network: Differential Equation Derivation

Network Structure: Input → Hidden1 → Hidden2 → Hidden3 → Output Forward Pass Equations: $$\begin{aligned}\mathbf{a}^{(0)} &= \mathbf{x} \quad \text{(input)} \\ \mathbf{z}^{(1)} &= \mathbf{W}^{(1)} \mathbf{a}^{(0)} + \mathbf{b}^{(1)}, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}) \\ \mathbf{z}^{(2)} &= \mathbf{W}^{(2)} \mathbf{a}^{(1)} + \mathbf{b}^{(2)}, \quad \mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)}) \\ \mathbf{z}^{(3)} &= \mathbf{W}^{(3)} \mathbf{a}^{(2)} + \mathbf{b}^{(3)}, \quad \mathbf{a}^{(3)} = \sigma(\mathbf{z}^{(3)}) \\ \mathbf{z}^{(4)} &= \mathbf{W}^{(4)} \mathbf{a}^{(3)} + \mathbf{b}^{(4)}, \quad \mathbf{a}^{(4)} = \sigma(\mathbf{z}^{(4)}) \quad \text{(output)}\end{aligned}$$ Loss Function: $L = \frac{1}{2}||\mathbf{a}^{(4)} - \mathbf{y}||^2$

Output Layer Error

Starting from the output layer: $$\begin{aligned}\boldsymbol{\delta}^{(4)} &= \frac{\partial L}{\partial \mathbf{z}^{(4)}} \\ &= \frac{\partial L}{\partial \mathbf{a}^{(4)}} \odot \frac{\partial \mathbf{a}^{(4)}}{\partial \mathbf{z}^{(4)}} \\ &= (\mathbf{a}^{(4)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(4)})\end{aligned}$$

Chain Rule Application

For hidden layers ($l = 3, 2, 1$): $$\begin{aligned}\boldsymbol{\delta}^{(l)} &= \frac{\partial L}{\partial \mathbf{z}^{(l)}} \\ &= \frac{\partial L}{\partial \mathbf{z}^{(l+1)}} \frac{\partial \mathbf{z}^{(l+1)}}{\partial \mathbf{a}^{(l)}} \frac{\partial \mathbf{a}^{(l)}}{\partial \mathbf{z}^{(l)}} \\ &= (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})\end{aligned}$$

4-Layer Network: Complete Backpropagation Derivation

Step-by-Step Gradient Computation:

Error Propagation

Layer 4 (Output): $$\boldsymbol{\delta}^{(4)} = (\mathbf{a}^{(4)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(4)})$$ Layer 3: $$\boldsymbol{\delta}^{(3)} = (\mathbf{W}^{(4)})^T \boldsymbol{\delta}^{(4)} \odot \sigma'(\mathbf{z}^{(3)})$$ Layer 2: $$\boldsymbol{\delta}^{(2)} = (\mathbf{W}^{(3)})^T \boldsymbol{\delta}^{(3)} \odot \sigma'(\mathbf{z}^{(2)})$$ Layer 1: $$\boldsymbol{\delta}^{(1)} = (\mathbf{W}^{(2)})^T \boldsymbol{\delta}^{(2)} \odot \sigma'(\mathbf{z}^{(1)})$$

Weight and Bias Gradients

For each layer $l = 1, 2, 3, 4$: Weight Gradients: $$\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T$$ Bias Gradients: $$\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$ Update Rules: $$\begin{aligned}\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{W}^{(l)}} \\ \mathbf{b}^{(l)} := \mathbf{b}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{b}^{(l)}}\end{aligned}$$

Key Insight

The error flows backward through the network, with each layer's error depending on the next layer's error multiplied by the transpose of the connecting weights.

Gradient Descent Optimization

[Figure: ../figures/gradient_descent_visualization.png]

Weight Update Rule

$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{W}^{(l)}}$$ $$\mathbf{b}^{(l)} := \mathbf{b}^{(l)} - \alpha \frac{\partial L}{\partial \mathbf{b}^{(l)}}$$ where $\alpha$ is the learning rate.
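A minimal sketch of one update step for a single sigmoid output unit with MSE loss (the learning rate $\alpha = 0.5$ and the data point are arbitrary illustrative choices): after stepping against the gradient, the loss should be smaller.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single sigmoid unit, MSE loss L = 1/2 (y_hat - y)^2. Values are illustrative.
w, b, alpha = [0.6, 0.5], 0.3, 0.5
x, y = [0.627, 0.606], 1.0

def loss(w, b):
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 0.5 * (y_hat - y) ** 2

before = loss(w, b)
z = sum(wi * xi for wi, xi in zip(w, x)) + b
y_hat = sigmoid(z)
delta = (y_hat - y) * y_hat * (1 - y_hat)          # dL/dz for MSE + sigmoid
w = [wi - alpha * delta * xi for wi, xi in zip(w, x)]  # W := W - alpha dL/dW
b = b - alpha * delta                                  # b := b - alpha dL/db
after = loss(w, b)

print(after < before)  # True: the gradient step reduced the loss
```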

The Overfitting Problem

[Figure: ../figures/overfitting_regularization_demo.png]

Overfitting

Model learns training data too well, memorizing noise instead of generalizable patterns.

L1 and L2 Regularization

Add penalty terms to the loss function to control model complexity

L2 Regularization (Ridge)

$$L_{total} = L_{data} + \lambda \sum_{l} ||\mathbf{W}^{(l)}||_2^2$$ where $||\mathbf{W}^{(l)}||_2^2 = \sum_i \sum_j (W_{ij}^{(l)})^2$ Effect:
  • Shrinks weights towards zero
  • Uniform penalty on all weights
  • Smooth weight distributions
  • Preferred for most applications
Gradient Modification: $$\frac{\partial L_{total}}{\partial \mathbf{W}^{(l)}} = \frac{\partial L_{data}}{\partial \mathbf{W}^{(l)}} + 2\lambda \mathbf{W}^{(l)}$$

L1 Regularization (Lasso)

$$L_{total} = L_{data} + \lambda \sum_{l} ||\mathbf{W}^{(l)}||_1$$ where $||\mathbf{W}^{(l)}||_1 = \sum_i \sum_j |W_{ij}^{(l)}|$ Effect:
  • Promotes sparsity (many weights → 0)
  • Automatic feature selection
  • Creates sparse networks
  • Useful for interpretability
Gradient Modification: $$\frac{\partial L_{total}}{\partial \mathbf{W}^{(l)}} = \frac{\partial L_{data}}{\partial \mathbf{W}^{(l)}} + \lambda \text{sign}(\mathbf{W}^{(l)})$$
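Both gradient modifications amount to one extra term added to the data-loss gradient. A minimal sketch (the function name and `lam` parameter are illustrative, not from the slides):

```python
import numpy as np

def regularized_grad(data_grad, W, lam, kind="l2"):
    """Add the L1 or L2 penalty term to a data-loss gradient."""
    if kind == "l2":
        return data_grad + 2.0 * lam * W       # d/dW of lam * ||W||_2^2
    if kind == "l1":
        return data_grad + lam * np.sign(W)    # subgradient of lam * ||W||_1
    raise ValueError(f"unknown kind: {kind}")

W = np.array([[0.5, -2.0], [0.0, 1.0]])
g = np.zeros_like(W)                            # zero data gradient, for clarity
l2_term = regularized_grad(g, W, lam=0.1, kind="l2")  # proportional to W itself
l1_term = regularized_grad(g, W, lam=0.1, kind="l1")  # constant-magnitude push to 0
```

The comparison makes the sparsity effect visible: the L2 term shrinks large weights more than small ones, while the L1 term pushes every nonzero weight toward zero at the same rate.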

Hyperparameter $\lambda$

Controls regularization strength: larger $\lambda$ → more regularization → simpler model

L1 vs L2 Regularization Comparison

[Figure: ../figures/l1_vs_l2_regularization.png]

When to Use L2

  • General-purpose regularization
  • All features potentially relevant
  • Want smooth weight shrinkage
  • Most common choice

When to Use L1

  • Feature selection needed
  • Many irrelevant features
  • Want sparse models
  • Interpretability important

Dropout: A Different Approach

\begin{tikzpicture}[scale=0.8, every node/.style={scale=0.8}] % Training network (with dropout) \node[above, font=\bfseries] at (2.25,3.8) {Training (with Dropout)}; % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (TI-\y) at (0,{3-\y}) {$x_{\y}$}; } % Hidden layer with some dropped out neurons - centered \node[hidden neuron] (TH-0) at (2.5,{3-0}) {}; \node[hidden neuron, fill=gray!50, draw=gray] (TH-1) at (2.5,{3-1}) {\scriptsize X}; % Dropped out \node[hidden neuron] (TH-2) at (2.5,{3-2}) {}; \node[hidden neuron, fill=gray!50, draw=gray] (TH-3) at (2.5,{3-3}) {\scriptsize X}; % Dropped out % Output layer - centered \node[output neuron] (TO) at (5,1.5) {$y$}; % Active connections only - reduced opacity \draw[strong connection, opacity=0.3] (TI-0) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-0) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-1) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-1) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-2) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-2) -- (TH-2); \draw[strong connection, opacity=0.3] (TI-3) -- (TH-0); \draw[strong connection, opacity=0.3] (TI-3) -- (TH-2); \draw[strong connection, opacity=0.3] (TH-0) -- (TO); \draw[strong connection, opacity=0.3] (TH-2) -- (TO); % Testing network (no dropout) \node[above, font=\bfseries] at (8.75,3.8) {Testing (no Dropout)}; % Input layer - centered vertically \foreach \y in {0,1,2,3} { \node[input neuron] (EI-\y) at (6.5,{3-\y}) {$x_{\y}$}; } % Hidden layer - all active, centered \foreach \y in {0,1,2,3} { \node[hidden neuron] (EH-\y) at (9,{3-\y}) {}; } % Output layer - centered \node[output neuron] (EO) at (11.5,1.5) {$y$}; % All connections active - very light \foreach \i in {0,1,2,3} { \foreach \j in {0,1,2,3} { \draw[connection, opacity=0.2] (EI-\i) -- (EH-\j); } } \foreach \j in {0,1,2,3} { \draw[connection, opacity=0.2] (EH-\j) -- (EO); } % Dropout probability label - better positioning \node[below, 
font=\footnotesize] at (2.5,-0.5) {Dropout rate: $p = 0.5$}; \node[below, font=\footnotesize] at (9,-0.5) {All neurons active}; % Layer labels \node[layer label] at (0,-1.1) {Input}; \node[layer label] at (2.5,-1.1) {Hidden}; \node[layer label] at (5,-1.1) {Output}; \node[layer label] at (6.5,-1.1) {Input}; \node[layer label] at (9,-1.1) {Hidden}; \node[layer label] at (11.5,-1.1) {Output}; \end{tikzpicture}

Dropout Technique

Randomly set neurons to zero during training to prevent co-adaptation and improve generalization.

Dropout: Mathematical Formulation

Training Phase: $$\begin{aligned}\mathbf{r}^{(l)} &\sim \text{Bernoulli}(p) &&\text{(dropout mask; each unit kept with probability } p\text{)} \\ \tilde{\mathbf{a}}^{(l)} &= \mathbf{r}^{(l)} \odot \mathbf{a}^{(l)} &&\text{(apply mask)} \\ \mathbf{z}^{(l+1)} &= \mathbf{W}^{(l+1)} \tilde{\mathbf{a}}^{(l)} + \mathbf{b}^{(l+1)}\end{aligned}$$ Testing Phase: $$\mathbf{z}^{(l+1)} = p \cdot \mathbf{W}^{(l+1)} \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} \text{ (scale weights)}$$
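This formulation can be sketched as a single forward-pass helper, where $p$ is the keep probability as in the equations above (the array shapes are illustrative):

```python
import numpy as np

def dropout_forward(a, p, training, rng):
    """Dropout as formulated above: p is the KEEP probability.
    Training: multiply activations by a Bernoulli(p) mask.
    Testing: scale activations by p instead."""
    if training:
        mask = (rng.random(a.shape) < p).astype(a.dtype)
        return mask * a
    return p * a

rng = np.random.default_rng(0)
a = np.ones((1000, 1))
train_out = dropout_forward(a, p=0.8, training=True, rng=rng)
test_out = dropout_forward(a, p=0.8, training=False, rng=rng)
# On average, mean(train_out) ≈ p * mean(a) = mean(test_out),
# which is exactly why the test-time scaling preserves expected activations.
```

Modern frameworks typically use the equivalent "inverted dropout" variant, dividing by $p$ at training time so the test-time pass needs no scaling at all.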

Dropout Benefits

  • Prevents overfitting: Reduces complex co-adaptations
  • Model averaging: Approximates ensemble of networks
  • Robust features: Forces redundant representations
  • Easy to implement: Simple modification to forward pass
Typical rates: 0.2-0.5 for hidden layers, 0.1-0.2 for input
\begin{exampleblock}{Implementation Notes} Training vs Testing:
  • Training: Randomly drop neurons
  • Testing: Use all neurons but scale outputs
  • Modern frameworks handle this automatically
Why Scaling Works:
  • Training: Each neuron is "on" with probability $p$
  • Testing: All neurons are "on"
  • Scaling by $p$ maintains expected activation levels
\end{exampleblock}

Best Practice

Use dropout in hidden layers only, not in output layer. Start with rate 0.5 and tune.

Regularization Comparison

[Figure: ../figures/regularization_comparison.png]

Choosing Regularization

Start with:
  • L2 regularization ($\lambda = 0.01$)
  • Dropout (rate = 0.5)
  • Early stopping
If still overfitting:
  • Increase regularization strength
  • Add more dropout
  • Reduce model complexity

Other Techniques

Early Stopping:
  • Monitor validation loss
  • Stop when it starts increasing
  • Simple and effective
Data Augmentation:
  • Artificially increase training data
  • Add noise, rotations, etc.
  • Domain-specific techniques
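The early-stopping rule above reduces to a patience counter over the validation-loss history. A sketch (the loss values are hypothetical, chosen to show a curve that improves and then overfits):

```python
def early_stopping(val_losses, patience=5):
    """Return the epoch at which to stop: when validation loss has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch   # stop; restore the weights saved at best_epoch
    return len(val_losses) - 1

# Hypothetical validation curve: improves until epoch 3, then degrades
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9]
stop = early_stopping(losses, patience=3)   # → 6
```

In practice the model weights are checkpointed at each new best epoch, so stopping also means rolling back to the best checkpoint rather than keeping the final weights.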

Training Curves with Regularization

[Figure: ../figures/training_curves_regularization.png]

Monitoring Training

Use validation curves to detect overfitting and choose regularization strength.

Weight Initialization

Proper initialization is crucial for successful training

Poor Initialization

All zeros: No learning (symmetry) $$W_{ij} = 0 \Rightarrow \text{no gradient flow}$$ Too large: Exploding gradients $$W_{ij} \sim \mathcal{N}(0, 1) \Rightarrow \text{saturation}$$ Too small: Vanishing gradients $$W_{ij} \sim \mathcal{N}(0, 0.01) \Rightarrow \text{weak signals}$$

Good Initialization

Xavier/Glorot (Sigmoid/Tanh): $$W_{ij} \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$ He initialization (ReLU): $$W_{ij} \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}$$ Bias initialization: $$b_i = 0 \text{ (usually sufficient)}$$

Why These Work

Maintain activation variance and gradient variance across layers at initialization.
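Both schemes can be sketched in a few lines (the function name and layer sizes are illustrative):

```python
import numpy as np

def init_weights(n_in, n_out, scheme="he", rng=None):
    """Xavier/Glorot for sigmoid/tanh layers, He for ReLU layers.
    Returns an (n_out, n_in) weight matrix with the scheme's std."""
    rng = rng or np.random.default_rng()
    if scheme == "xavier":
        std = np.sqrt(2.0 / (n_in + n_out))
    elif scheme == "he":
        std = np.sqrt(2.0 / n_in)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, std, size=(n_out, n_in))

W = init_weights(512, 256, "he", np.random.default_rng(0))
# The empirical std should be close to sqrt(2/512) ≈ 0.0625
```

Swapping the scheme to match the layer's activation is a one-line change, which is why most frameworks expose it as an initializer argument on the layer.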

Learning Rate and Optimization

Learning Rate Selection

Too high: Overshooting, instability
  • Loss explodes or oscillates
  • Network doesn't converge
  • Weights become very large
Too low: Slow convergence
  • Training takes forever
  • Gets stuck in local minima
  • Poor final performance
Good range: Typically $10^{-4}$ to $10^{-1}$

Advanced Optimizers

SGD with Momentum: $$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta) \nabla L$$ $$\mathbf{W} := \mathbf{W} - \alpha \mathbf{v}_t$$ Adam (Adaptive Moments): $$\begin{aligned}\mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \nabla L \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) (\nabla L)^2 \\ \hat{\mathbf{m}}_t &= \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)} \\ \mathbf{W} &:= \mathbf{W} - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}\end{aligned}$$ Default choice: Adam with $\alpha = 0.001$
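A single Adam step, including the bias-correction terms that standard Adam applies to the raw moment estimates, can be sketched as follows (the quadratic toy objective is illustrative):

```python
import numpy as np

def adam_step(W, grad, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter.
    Returns the new weights and updated moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)   # bias correction for the second moment
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

# Toy objective: ||W - target||^2, so grad = 2 * (W - target)
target = np.array([1.0, -2.0, 0.5])
W = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
for t in range(1, 101):
    grad = 2.0 * (W - target)
    W, m, v = adam_step(W, grad, m, v, t)
# Each coordinate drifts toward the target at roughly alpha per step,
# regardless of gradient magnitude (Adam normalizes the step size).
```

The near-constant step size is the key behavioral difference from plain SGD, where the step scales directly with the gradient magnitude.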

Learning Rate Scheduling

Decay strategies: Step decay, exponential decay, cosine annealing. Start high, reduce during training.

Training Diagnostics

Monitor these metrics during training:

Loss Monitoring

  • Training loss: Should trend steadily downward (small minibatch fluctuations are normal)
  • Validation loss: Should decrease, then stabilize
  • Gap: Indicates overfitting if too large
Warning Signs:
  • Loss increases: Learning rate too high
  • Loss plateaus early: Learning rate too low
  • Validation loss increases: Overfitting

Gradient Monitoring

  • Gradient norms: Should be reasonable ($10^{-6}$ to $10^{-1}$)
  • Vanishing: Gradients → 0 in early layers
  • Exploding: Gradients become very large

Activation Monitoring

  • Activation statistics: Mean, std, sparsity
  • Dead neurons: Always output zero
  • Saturated neurons: Always in saturation region
Healthy activations:
  • Reasonable variance (not too small/large)
  • Some sparsity (for ReLU)
  • No layers completely dead

Weight Monitoring

  • Weight distributions: Should be reasonable
  • Weight updates: $|\Delta W| / |W| \approx 10^{-3}$
  • Layer-wise learning rates: May need adjustment

Tools

Use TensorBoard, Weights \& Biases, or similar tools for comprehensive monitoring and visualization.

Common Problems and Solutions

Problem: Vanishing Gradients

Symptoms:
  • Early layers don't learn
  • Gradients approach zero
Solutions:
  • Use ReLU activations
  • Proper weight initialization
  • Batch normalization

Problem: Overfitting

Symptoms:
  • Training accuracy >> validation accuracy
  • Validation loss increases
Solutions:
  • Add regularization (L2, dropout)
  • Reduce model complexity
  • More training data

Problem: Exploding Gradients

Symptoms:
  • Loss becomes NaN
  • Weights blow up
Solutions:
  • Gradient clipping
  • Lower learning rate
  • Better initialization
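Gradient clipping, the first solution listed, rescales all gradients together so their global L2 norm stays bounded. A sketch (the function name is illustrative; it mirrors what frameworks like PyTorch provide built in):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined (global)
    L2 norm does not exceed max_norm. Returns the clipped gradients
    and the norm measured before clipping."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# Deliberately huge gradients, as seen when gradients explode
grads = [np.full((2, 2), 10.0), np.full((2,), 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
# norm_before = sqrt(600) ≈ 24.5; after clipping the global norm is 1.0
```

Clipping by the global norm (rather than per tensor) preserves the direction of the overall update, only shrinking its length.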

Problem: Slow Convergence

Symptoms:
  • Loss decreases slowly
  • Gets stuck in plateaus
Solutions:
  • Increase learning rate
  • Use adaptive optimizers

Neural Networks: Key Takeaways

Core Concepts

  • Perceptron: Basic building block
  • Multi-layer: Enable complex mappings
  • Activation functions: Provide non-linearity
  • Forward propagation: Compute predictions
  • Backpropagation: Compute gradients efficiently
  • Regularization: Prevent overfitting

Mathematical Foundation

  • Matrix operations for efficiency
  • Chain rule for gradient computation
  • Optimization theory for training
  • Probability theory for interpretation

Best Practices

  • Architecture: Start simple, add complexity gradually
  • Initialization: Xavier/He for proper gradient flow
  • Optimization: Adam optimizer with proper learning rate
  • Regularization: L2 + Dropout for generalization
  • Monitoring: Track loss, gradients, activations
  • Debugging: Systematic approach to problems

When to Use Neural Networks

  • Large datasets available
  • Complex non-linear patterns
  • End-to-end learning desired
  • Feature engineering is difficult

Modern Deep Learning

These fundamentals scale to modern architectures: CNNs, RNNs, Transformers, ResNets, etc.

Applications \& Real-World Impact

Computer Vision

  • Image classification: ResNet, EfficientNet
  • Object detection: YOLO, R-CNN
  • Segmentation: U-Net, Mask R-CNN
  • Face recognition: DeepFace, FaceNet
  • Medical imaging: Cancer detection, radiology

Natural Language Processing

  • Language models: GPT, BERT, T5
  • Translation: Google Translate, DeepL
  • Chatbots: ChatGPT, virtual assistants
  • Text analysis: Sentiment, summarization

Other Domains

  • Speech: Recognition, synthesis, processing
  • Recommendation: Netflix, Amazon, Spotify
  • Games: AlphaGo, OpenAI Five, StarCraft
  • Robotics: Control, perception, planning
  • Finance: Trading, fraud detection, risk
  • Science: Drug discovery, climate modeling

Emerging Areas

  • Generative AI: DALL-E, Midjourney, Stable Diffusion
  • Multimodal: CLIP, GPT-4V
  • Reinforcement Learning: Autonomous systems
  • Scientific Computing: Physics, chemistry, biology

Impact

Neural networks have revolutionized AI and are now fundamental to most modern machine learning applications.

Looking Forward: Advanced Topics

What's Next After This Foundation?

Specialized Architectures

  • Convolutional Neural Networks (CNNs)
    • Spatial structure exploitation
    • Translation invariance
    • Computer vision applications
  • Recurrent Neural Networks (RNNs)
    • Sequential data processing
    • Memory and temporal dynamics
    • LSTM, GRU variants
  • Transformer Networks
    • Attention mechanisms
    • Parallel processing
    • Modern NLP backbone

Advanced Techniques

  • Batch Normalization
    • Internal covariate shift
    • Training acceleration
  • Residual Connections
    • Very deep networks
    • Gradient flow improvement
  • Attention Mechanisms
    • Selective focus
    • Long-range dependencies
  • Generative Models
    • VAEs, GANs, Diffusion
    • Creative AI applications

Next Steps

Practice implementation, experiment with real datasets, and explore specialized architectures for your domain of interest.

End of Module 12

Artificial Neural Networks

Questions?