From Linear Regression to Deep Networks
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
Neural Networks
Predicting continuous values
Optimizing parameters
Binary classification
Preventing overfitting
80 years of progress
The perceptron
Hidden layers
Deep networks
Explain linear and logistic regression, compute predictions by hand, and understand residuals & cost functions
Derive the MSE gradient step-by-step using chain rule, and apply gradient descent to optimize parameters
Describe overfitting vs. underfitting and explain how L1, L2, and dropout prevent memorization
Trace 80 years of neural network development from McCulloch-Pitts to ChatGPT
Compute forward passes through perceptrons and build AND, OR, NOT gates from single neurons
Explain how hidden layers solve XOR, work through backpropagation by hand, and count FCNN parameters
A supervised learning method that models the linear relationship between a dependent variable \(y\) and one or more independent variables \(x\).
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
Or in vector form: \(\hat{y} = \mathbf{w}^T \mathbf{x} + b\)
Find the "best fit" line that minimizes the distance between predicted and actual values.
Monthly charge = base fee + rate per kWh × usage
\(\text{Bill} = 500 + 12 \times \text{kWh used}\)
This is exactly \(\hat{y} = w_0 + w_1 x\) where:
Linear regression finds the best straight line through your data — the \(w_0\) and \(w_1\) that fit reality closest.
Predict price from area, bedrooms, location
Predict salary from years of experience
Predict temperature from historical data
Predict revenue from ad spend
Predict house price ($1000s) from area (sq ft)
| Area \(x\) (sq ft) | Price \(y\) ($1000s) |
|---|---|
| 1000 | 150 |
| 1500 | 200 |
| 2000 | 250 |
| 2500 | 300 |
| 3000 | 350 |
Find \(\hat{y} = w_0 + w_1 x\) that best fits this data.
$$w_1 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$$
\(\bar{x} = \frac{1000+1500+2000+2500+3000}{5} = 2000\)
\(\bar{y} = \frac{150+200+250+300+350}{5} = 250\)
Numerator: \(250{,}000\)
Denominator: \(2{,}500{,}000\)
\(w_1 = \frac{250{,}000}{2{,}500{,}000} = \mathbf{0.1}\)
$$w_0 = \bar{y} - w_1 \bar{x} = 250 - 0.1 \times 2000 = \mathbf{50}$$
Final Model: \(\hat{y} = 50 + 0.1x\)
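The closed-form computation above can be checked in a few lines of Python (a sketch; variable names are my own):

```python
# Housing data from the table: area (sq ft) and price ($1000s)
xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]

x_bar = sum(xs) / len(xs)   # 2000.0
y_bar = sum(ys) / len(ys)   # 250.0

# Closed-form least squares for one feature:
# w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

# Predict the price of a 1,200 sq ft house: about 170 ($1000s)
price_1200 = w0 + w1 * 1200
```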
Using our model \(\hat{y} = 50 + 0.1x\):
| Area \(x\) | Computation | Predicted Price \(\hat{y}\) |
|---|---|---|
| 1,200 sq ft | \(50 + 0.1(1200)\) | $170K |
| 1,800 sq ft | \(50 + 0.1(1800)\) | $230K |
| 2,200 sq ft | \(50 + 0.1(2200)\) | $270K |
| 3,500 sq ft | \(50 + 0.1(3500)\) | $400K |
The model can predict for any area value — even ones not in the training data.

The vertical dashed lines are the residuals (errors): \(e_i = \hat{y}_i - y_i\). Each one measures how far off our prediction is. The "best" line minimizes these gaps overall.
Positive and negative errors cancel out! A line through the middle of a scattered cloud could have total error = 0 despite being terrible.
\(\sum e_i\)
Cancels out — useless
\(\sum |e_i|\)
Works, but not differentiable at 0
\(\sum e_i^2\)
Always positive, differentiable, penalizes big errors more
We need to take the derivative of the cost function to do gradient descent. Squared errors give us smooth derivatives — crucial for optimization.
By minimizing the average squared distance between predictions and actual values.
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$
\(J = \frac{1}{10}[(150-150)^2 + (200-200)^2 + \ldots] = 0\). Perfect fit!
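The cost can be written directly from the formula (a sketch; the function name is mine):

```python
def mse_cost(w0, w1, xs, ys):
    """J(w0, w1) = 1/(2m) * sum((y_hat - y)^2), with y_hat = w0 + w1 * x."""
    m = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]
cost = mse_cost(50, 0.1, xs, ys)   # essentially 0: a perfect fit
```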
How do we efficiently minimize \(J\) when data isn't perfectly linear? Enter: Gradient Descent
An iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps proportional to the negative of the gradient.
$$w := w - \alpha \frac{\partial J}{\partial w}$$
Where \(\alpha\) is the learning rate (step size)
Imagine standing on a foggy hill. You can only feel the slope under your feet. The gradient tells you which direction is steepest — walk the opposite way to go downhill.
Linear regression, logistic regression, neural networks, SVMs, transformers — virtually every ML model is trained with some form of gradient descent.
For simple linear regression with 1 feature, yes — we used the closed-form formula to get \(w_1 = 0.1\). But what about...
Gradient descent is the universal optimizer — it's how we train everything from logistic regression to GPT-4.
The derivative \(\frac{df}{dx}\) tells you the slope (rate of change) of \(f\) at any point. Positive slope = function is increasing. Negative slope = decreasing.
| Rule | Formula | Example |
|---|---|---|
| Power Rule | \(\frac{d}{dx}[x^n] = nx^{n-1}\) | \(\frac{d}{dx}[x^2] = 2x\) |
| Constant Rule | \(\frac{d}{dx}[cf(x)] = c \cdot f'(x)\) | \(\frac{d}{dx}[3x^2] = 6x\) |
| Chain Rule | \(\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)\) | \(\frac{d}{dx}[(2x+1)^2] = 2(2x+1) \cdot 2\) |
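A quick numerical sanity check of the chain-rule example, using finite differences (a sketch, not part of the slides):

```python
def f(x):
    return (2 * x + 1) ** 2

def f_prime(x):
    # Chain rule: outer derivative 2*(2x+1), times inner derivative 2
    return 2 * (2 * x + 1) * 2

# A central finite difference approximates the true slope
h = 1e-6
x = 1.0
numeric = (f(x + h) - f(x - h)) / (2 * h)
```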
With these 3 rules, you can derive every gradient in this lecture. Let's try it on MSE.
You're standing on a hillside in dense fog. You can only feel the slope under your feet.
Strategy: Always step in the steepest downhill direction. Eventually, you reach the valley (minimum).

When the gradient is negative, subtracting it increases \(w\) → we move right (toward the minimum)
When the gradient is positive, subtracting it decreases \(w\) → we move left (toward the minimum)
Either way, stepping along the negative gradient moves us downhill, toward lower cost!
Optimize millions of weights via backpropagation
Find optimal parameters for prediction models
Train CNNs for object detection and classification
Train transformers for language understanding
1. Initialize weights w randomly
2. Repeat until convergence:
a. Compute gradient: g = dJ/dw
b. Update weights: w = w - α × g
3. Return w

The gradient \(\frac{\partial J}{\partial w}\) tells us the direction of steepest ascent. We go in the opposite direction (hence the minus sign).
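The pseudocode above maps directly to a few lines of Python. A minimal sketch (the demo function and names are mine):

```python
def gradient_descent(grad, w, alpha=0.1, iters=100):
    """Minimize a function, given its gradient `grad`, starting from w."""
    for _ in range(iters):
        w = w - alpha * grad(w)   # step opposite the gradient
    return w

# Demo: minimize J(w) = w^2, whose gradient is 2w; the minimum is at w = 0
w_star = gradient_descent(lambda w: 2 * w, w=5.0)
```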
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$
Since \(\hat{y}_i = w_0 + w_1 x_i\), we can write:
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i)^2$$
\(\frac{\partial J}{\partial w_1}\) — how does the cost change when we adjust the slope?
\(\frac{\partial J}{\partial w_0}\) — how does the cost change when we adjust the intercept?
$$\frac{\partial J}{\partial w_1} = \frac{1}{2m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\left[(w_0 + w_1 x_i - y_i)^2\right]$$
$$= \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i)$$
The inner derivative w.r.t. \(w_1\): \(\frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i) = x_i\)
$$= \frac{1}{m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i) \cdot x_i$$
$$\boxed{\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \cdot x_i}$$
$$\frac{\partial J}{\partial w_0} = \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i)$$
\(\frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i) = \mathbf{1}\)
Compare: for \(w_1\) it was \(x_i\)
$$\frac{\partial J}{\partial w_0} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$
No \(x_i\) — just the average error!
Intuition: \(\frac{\partial J}{\partial w_0}\) is the average prediction error. \(\frac{\partial J}{\partial w_1}\) is the error weighted by the input — larger inputs contribute more.
$$w_1 := w_1 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \cdot x_i$$
$$w_0 := w_0 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$
These two equations are all you need to train linear regression. Repeat until convergence.
Start: \(w_0 = 0,\; w_1 = 0,\; \alpha = 0.0000001\) (tiny LR because features are large)
Errors: \(\hat{y}_i - y_i = [0-150,\; 0-200,\; 0-250,\; 0-300,\; 0-350] = [-150, -200, -250, -300, -350]\)
\(= \frac{1}{5}(-150-200-250-300-350)\)
\(= \frac{-1250}{5} = \mathbf{-250}\)
\(= \frac{1}{5}[(-150)(1000) + (-200)(1500) + (-250)(2000) + (-300)(2500) + (-350)(3000)]\)
\(= \frac{-2{,}750{,}000}{5} = \mathbf{-550{,}000}\)
\(w_0' = 0 - 10^{-7}(-250) = 0.000025\), \(w_1' = 0 - 10^{-7}(-550000) = 0.055\). One step closer!
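This first hand-computed step can be verified in code (a sketch with my own variable names):

```python
xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]
w0, w1, alpha = 0.0, 0.0, 1e-7
m = len(xs)

errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]   # y_hat - y
grad_w0 = sum(errors) / m                              # average error: -250
grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m   # error weighted by input: -550000

w0 -= alpha * grad_w0   # 0.000025
w1 -= alpha * grad_w1   # ~0.055
```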
| Iteration | \(w_0\) | \(w_1\) | \(J\) (cost) |
|---|---|---|---|
| 0 | 0.000 | 0.000 | 33,750 |
| 1 | 0.000025 | 0.055 | 10,306 |
| 10 | ≈ 0.000 | 0.122 | 139 |
| 10,000 | 0.006 | 0.122 | ≈ 139 |
| ~\(10^8\) | → 50.0 | → 0.100 | → 0 |
The slope \(w_1\) locks in within a few steps, but the intercept \(w_0\) crawls: with an unscaled feature (\(x\) in the thousands) the cost surface is a long, narrow valley, so full convergence takes on the order of \(10^8\) iterations. Feature scaling fixes this.
GD converges to \(w_0 = 50,\; w_1 = 0.1\) — exactly the same as the closed-form formula.
Unlike the formula, GD works for any differentiable model — logistic regression, neural networks, transformers...
Minimize \(J(w) = (w - 3)^2\)
\(J(w)\) is minimized when \(w = 3\). Let's see if gradient descent finds it!
Since \(J(w) = (w-3)^2\) is a perfect parabola with minimum at \(w=3\), GD should converge there. Starting at \(w=0\), the gradient is negative (slope points downhill to the right), so GD will move \(w\) to the right toward 3.
| Iter | \(w\) | \(\frac{dJ}{dw} = 2(w-3)\) | Update | New \(w\) |
|---|---|---|---|---|
| 0 | 0.000 | \(2(0-3) = -6.0\) | \(0 - 0.2(-6.0)\) | 1.200 |
| 1 | 1.200 | \(2(1.2-3) = -3.6\) | \(1.2 - 0.2(-3.6)\) | 1.920 |
| 2 | 1.920 | \(2(1.92-3) = -2.16\) | \(1.92 - 0.2(-2.16)\) | 2.352 |
| 3 | 2.352 | \(2(2.352-3) = -1.296\) | \(2.352 - 0.2(-1.296)\) | 2.611 |
| 4 | 2.611 | \(2(2.611-3) = -0.778\) | \(2.611 - 0.2(-0.778)\) | 2.767 |
Converging toward \(w = 3\)! Each step gets us closer to the minimum.
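The iteration table can be reproduced with a short loop (a sketch):

```python
w, alpha = 0.0, 0.2
trajectory = []
for _ in range(5):
    grad = 2 * (w - 3)        # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * grad      # step opposite the gradient
    trajectory.append(w)      # 1.2, 1.92, 2.352, 2.6112, 2.76672
```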

Steps get smaller as we approach the minimum — the gradient decreases!

Slow convergence

Optimal convergence

Overshoots — may diverge!
Non-convex functions have multiple valleys. GD might get stuck in a shallow one instead of finding the deepest.
Points where gradient = 0 but it's not a minimum. The surface curves up in one direction and down in another.
If features have very different ranges (e.g., age 0–100 vs salary 0–1M), GD zig-zags instead of going straight to the minimum.
Common criteria: gradient < threshold, cost change < \(\epsilon\), or fixed number of iterations.
MSE is convex (one valley), so GD always finds the global minimum. Neural networks? Not so lucky — but it works anyway!
| Variant | Data per Step | Speed | Stability |
|---|---|---|---|
| Batch GD | All samples | Slow | Very stable |
| Stochastic GD | 1 sample | Fast | Noisy |
| Mini-batch GD | \(k\) samples | Balanced | Balanced |
Mini-batch gradient descent (batch size 32–256) is the standard in deep learning. It balances speed with stable convergence.
Linear regression can predict -0.2 or 1.3 — what does a negative probability mean?
A single outlier can shift the entire line, changing the decision boundary dramatically.
We need a function that squishes any real number into the range \((0, 1)\) → Enter: the sigmoid function
A classification algorithm that models the probability of a binary outcome using the sigmoid (logistic) function.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
The sigmoid maps any real number to \((0, 1)\) — perfect for probabilities. Output > 0.5 → class 1, otherwise class 0.
Named after the logistic function, coined by Pierre-François Verhulst (1845) to model population growth. Same S-curve, different application!

| \(z\) | \(e^{-z}\) | \(1 + e^{-z}\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Interpretation |
|---|---|---|---|---|
| \(-3\) | \(e^3 = 20.09\) | \(21.09\) | 0.047 | Very likely class 0 |
| \(-1\) | \(e^1 = 2.718\) | \(3.718\) | 0.269 | Probably class 0 |
| \(0\) | \(e^0 = 1\) | \(2\) | 0.500 | Coin flip! |
| \(1\) | \(e^{-1} = 0.368\) | \(1.368\) | 0.731 | Probably class 1 |
| \(3\) | \(e^{-3} = 0.050\) | \(1.050\) | 0.953 | Very likely class 1 |
\(\sigma(0) = 0.5\) always. The function is symmetric around \(z = 0\).
For \(|z| > 5\), the sigmoid is nearly 0 or 1. The neuron is "very confident."
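The table values follow from a one-line sigmoid (a sketch):

```python
import math

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Matches the table: sigmoid(0) = 0.5 exactly; large |z| saturates near 0 or 1,
# and symmetry gives sigmoid(z) + sigmoid(-z) = 1
```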
\(z = \mathbf{w}^T\mathbf{x} + b\)
Can be any real number
\(p = \sigma(z)\)
Squished to (0, 1)
\(p \geq 0.5 \rightarrow\) Class 1
\(p < 0.5 \rightarrow\) Class 0
Classify emails as spam or not spam
Predict disease presence from symptoms
Flag suspicious credit card transactions
Predict outcome from study hours
Predict pass/fail from study hours
| Hours \(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Result | Fail | Fail | Fail | Pass | Pass | Pass | Pass |
\(w_1 = 1.5\), \(b = -5.0\). Model: \(\hat{y} = \sigma(1.5x - 5.0)\), threshold = 0.5
| Hours \(x\) | \(z = 1.5x - 5.0\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Prediction |
|---|---|---|---|
| 2 | \(1.5(2)-5 = -2.0\) | \(\frac{1}{1+e^{2.0}} = 0.119\) | Fail ✗ |
| 3 | \(1.5(3)-5 = -0.5\) | \(\frac{1}{1+e^{0.5}} = 0.378\) | Fail ✗ |
| 4 | \(1.5(4)-5 = 1.0\) | \(\frac{1}{1+e^{-1.0}} = 0.731\) | Pass ✓ |
| 5 | \(1.5(5)-5 = 2.5\) | \(\frac{1}{1+e^{-2.5}} = 0.924\) | Pass ✓ |
All predictions match! The sigmoid converts the linear score into a probability.
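The worked predictions can be checked against the slide's weights (a sketch; threshold 0.5 as above):

```python
import math

def predict_pass(hours, w=1.5, b=-5.0):
    """Return (probability, label) under the study-hours model sigma(1.5x - 5)."""
    p = 1 / (1 + math.exp(-(w * hours + b)))
    return p, ("Pass" if p >= 0.5 else "Fail")
```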
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
MSE with sigmoid creates a non-convex loss surface with many local minima. Cross-entropy is convex — gradient descent always finds the global minimum.

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)x_j^{(i)}$$
Surprise! This looks identical to the linear regression gradient — but \(\hat{y}\) uses the sigmoid.
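Because the gradient has the same form, the training loop is the linear-regression loop with a sigmoid inserted. A minimal sketch on the study-hours data (the learning rate and iteration count are my own illustrative choices):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 1, 1]            # 0 = Fail, 1 = Pass
w, b, alpha, m = 0.0, 0.0, 0.1, len(xs)

for _ in range(50_000):
    # Same (y_hat - y) gradient form as linear regression, y_hat now via sigmoid
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    w -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
    b -= alpha * sum(errors) / m

labels = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
```

After training, the learned decision boundary falls between 3 and 4 study hours, so every example is classified correctly.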
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous value | Probability (0–1) |
| Activation | None (identity) | Sigmoid |
| Loss Function | Mean Squared Error | Binary Cross-Entropy |
| Use Case | Regression (predict amount) | Classification (predict class) |
| Decision | No threshold | Threshold at 0.5 |
Logistic regression is essentially linear regression + sigmoid. This idea of "linear combination + activation" is the foundation of neural networks!
A technique that adds a penalty term to the loss function to prevent overfitting by discouraging overly complex models.
$$J_{\text{reg}} = J_{\text{original}} + \lambda \cdot R(\mathbf{w})$$
Where \(\lambda\) controls regularization strength
Regularization is like a teacher telling a student: "I don't just want the right answer — I want a simple explanation." Simpler models generalize better.
Models may memorize training data instead of learning general patterns — performing perfectly on training data but failing on new data.
\(\lambda = 0\): no penalty (may overfit). \(\lambda \to \infty\): all weights → 0 (underfits). Cross-validation finds the sweet spot.
Memorizes every answer from past exams word-for-word
Learns the underlying concepts and problem-solving strategies
Regularization forces the model to "understand" rather than "memorize" — slightly worse on training data, much better on new data.

High Bias — too simple

Good Balance

High Variance — too complex
$$J_{L1} = J + \lambda \sum_{i} |w_i|$$
Encourages sparsity — some weights become exactly 0, effectively performing feature selection.
$$J_{L2} = J + \lambda \sum_{i} w_i^2$$
Shrinks all weights toward 0 but never exactly 0. Smoother, more stable solutions.
You suspect many features are irrelevant and want the model to automatically ignore them. L1 acts as a built-in feature selector.
All features matter but you want to prevent any single weight from dominating. L2 keeps everything small and balanced — the default choice.
$$J_{EN} = J + \lambda_1 \sum|w_i| + \lambda_2 \sum w_i^2$$
Combines the best of both: feature selection and weight shrinkage. Used in scikit-learn's ElasticNet.
Randomly "turn off" neurons during training (e.g., 50% chance). Forces the network to not rely on any single neuron.
Like studying with random pages removed — you learn the core concepts, not surface patterns.
We'll explore dropout in detail when we cover deep network training. For now, remember: regularization = adding constraints to prevent overfitting.
Automatically remove irrelevant features by zeroing their weights
Prevent deep networks from memorizing training data
Keep models simple and interpretable
Models perform better on unseen test data
Original model: \(w_0 = 50\) (bias), \(w_1 = 0.1\) (weight), and \(\lambda = 0.01\)
\(J = 0\) (perfect fit)
\(J_{\text{reg}} = 0 + 0.01 \times (0.1^2) = 0 + 0.01 \times 0.01 = \mathbf{0.0001}\)
\(\frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{\partial J}{\partial w_j} + 2\lambda w_j\) — only feature weights; bias \(w_0\) is not penalized. The penalty pushes large weights smaller during training.
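The penalized cost and the extra gradient term, as a sketch (feature weights only, bias excluded):

```python
def ridge_cost(J, weights, lam):
    """L2-regularized cost: J + lambda * sum(w^2). Pass feature weights only, no bias."""
    return J + lam * sum(w ** 2 for w in weights)

def ridge_grad_term(w_j, lam):
    """Extra gradient contribution from the L2 penalty: 2 * lambda * w_j."""
    return 2 * lam * w_j

cost = ridge_cost(0.0, [0.1], lam=0.01)   # 0.0001, matching the worked example
```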

We now have the classical building blocks:
Let's look at the history that led from these ideas to neural networks...
| Method | Penalty | Effect | Best for |
|---|---|---|---|
| L1 | \(\lambda\sum|w_i|\) | Zeros out weights | Feature selection |
| L2 | \(\lambda\sum w_i^2\) | Shrinks weights | Preventing large weights |
| Elastic Net | L1 + L2 | Both | Correlated features |
| Dropout | Random neuron removal | Ensemble effect | Neural networks |
From the 1940s to Today
McCulloch & Pitts — First mathematical model of an artificial neuron. Showed neurons could compute logical functions.
Frank Rosenblatt's Perceptron — First trainable neural network, implemented in hardware. Could learn to classify simple patterns.
Minsky & Papert — Published "Perceptrons", proving single-layer networks cannot solve XOR. Revealed fundamental limitations.
Funding and interest dried up. Neural network research was largely abandoned for over a decade.
Rumelhart, Hinton & Williams — Popularized backpropagation, enabling training of multi-layer networks. The key breakthrough.
Yann LeCun — Applied CNNs to handwritten digit recognition (MNIST). First practical deep learning success.
Hochreiter & Schmidhuber — Invented LSTM for sequential data, solving the vanishing gradient problem.
Geoffrey Hinton — Deep Belief Networks. The term "Deep Learning" enters mainstream AI vocabulary.
Before 1986, there was no efficient way to train multi-layer networks. Backpropagation gave us a systematic method to compute gradients for every weight — not just the output layer, but all hidden layers too. This was the missing piece.
AlexNet wins ImageNet by a huge margin. 8 layers, GPU training. The "Big Bang" of modern deep learning.
GANs (Goodfellow) — Generative Adversarial Networks can create realistic images from noise.
"Attention Is All You Need" — The Transformer architecture revolutionizes NLP and later all of AI.
ChatGPT & LLMs — Large language models reach mainstream. AI becomes a daily tool for millions.
Internet, smartphones, and sensors generate massive datasets. Deep networks need lots of data to learn effectively.
GPUs, TPUs, and cloud computing make training billion-parameter models feasible. 10,000x faster than 2000s hardware.
ReLU, batch normalization, dropout, residual connections, Adam optimizer — all invented in the 2010s.
A neural network is built from the same building blocks we've been studying — weighted sums, activation functions, and gradient-based optimization.
Why did neural networks fail in the 1970s but succeed in the 2010s?
What changed in terms of data, compute, and algorithms?
How much data existed in the 1970s vs. the age of the internet?
What hardware breakthrough made matrix multiplication 100× faster?
What training method was missing before 1986?
1970s: Limited data, no GPUs, only single-layer networks (couldn't solve XOR). 2010s: Internet-scale data, GPU parallelism (NVIDIA CUDA), and breakthroughs like backpropagation, ReLU, and batch normalization made deep networks trainable.
The Perceptron
A computational unit that takes weighted inputs, sums them, adds a bias, and applies an activation function.
$$z = \sum_{i=1}^{n} w_i x_i + b, \quad a = f(z)$$
| Dendrites | → Inputs \(x_1, x_2, \ldots\) |
| Synaptic strength | → Weights \(w_1, w_2, \ldots\) |
| Soma (cell body) | → Summation \(\Sigma\) |
| Axon hillock | → Activation function \(f\) |
| Axon | → Output \(\hat{y}\) |
Real neurons are vastly more complex — they use timing, chemical signals, and recurrent connections. The artificial neuron is a very rough approximation.
\(f(z) = \frac{1}{1+e^{-z}}\)
Range: (0, 1)
\(f(z) = \max(0, z)\)
Range: [0, ∞)
\(f(z) = \tanh(z)\)
Range: (-1, 1)
Fast to compute, mitigates the vanishing gradient problem, and works well in practice.
A single neuron with sigmoid activation is logistic regression! Same weighted sum + same activation function.
\(w_1 = 1,\; w_2 = 1,\; b = -1.5\), step activation \(f(z) = \begin{cases}1 & z \geq 0\\0 & z < 0\end{cases}\)
| \(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 1.5\) | \(f(z)\) | AND | Match? |
|---|---|---|---|---|---|
| 0 | 0 | \(0 + 0 - 1.5 = -1.5\) | 0 | 0 | ✓ |
| 0 | 1 | \(0 + 1 - 1.5 = -0.5\) | 0 | 0 | ✓ |
| 1 | 0 | \(1 + 0 - 1.5 = -0.5\) | 0 | 0 | ✓ |
| 1 | 1 | \(1 + 1 - 1.5 = 0.5\) | 1 | 1 | ✓ |
The perceptron correctly computes AND! A single neuron can learn any linearly separable function.
\(w_1 = 1,\; w_2 = 1,\; b = -0.5\), same step activation
| \(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 0.5\) | \(f(z)\) | OR | Match? |
|---|---|---|---|---|---|
| 0 | 0 | \(0 + 0 - 0.5 = -0.5\) | 0 | 0 | ✓ |
| 0 | 1 | \(0 + 1 - 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | 0 | \(1 + 0 - 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | 1 | \(1 + 1 - 0.5 = 1.5\) | 1 | 1 | ✓ |
Just by changing the bias from \(-1.5\) to \(-0.5\), the neuron switches from AND to OR!
\(w_1 = -1,\; b = 0.5\), step activation. Only one input.
| \(x\) | \(z = -x + 0.5\) | \(f(z)\) | NOT \(x\) | Match? |
|---|---|---|---|---|
| 0 | \(-0 + 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | \(-1 + 0.5 = -0.5\) | 0 | 0 | ✓ |
With AND, OR, and NOT, neurons can compute any Boolean function... if we stack them in layers.
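All three truth tables can be verified with a single perceptron function (a sketch):

```python
def perceptron(inputs, weights, bias):
    """Step-activated neuron: fires (1) when the weighted sum plus bias is >= 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

def AND(a, b): return perceptron([a, b], [1, 1], -1.5)
def OR(a, b):  return perceptron([a, b], [1, 1], -0.5)
def NOT(a):    return perceptron([a], [-1], 0.5)
```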
| \(x_1\) | \(x_2\) | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No! XOR is not linearly separable. No single straight line can separate the 1s from the 0s.

Output: 1 ✓ Correct!
A single neuron can only learn linearly separable patterns. It draws one straight line through feature space.
Stack multiple neurons in layers. Each neuron learns a different feature. Together, they can solve XOR and any non-linear problem!
Enter: The Multilayer Perceptron (MLP)
Minsky & Papert (1969) formally proved no single perceptron can solve XOR. This wasn't opinion — it was a theorem. The search for multilayer networks was the direct response.
One hidden neuron can learn one decision boundary. Two hidden neurons = two boundaries. Combined, they carve out any region in feature space.
Hidden Layers
A feedforward neural network with one or more hidden layers between the input and output layers. Each layer is fully connected to the next.
Input Layer → Hidden Layer(s) → Output Layer
An MLP with just one hidden layer (enough neurons) can approximate any continuous function. This was proven by Cybenko in 1989.
Feedforward = data flows one direction. Dense layer = fully connected layer. Hidden = not directly observed (neither input nor output).
Detects simple patterns — lines, curves, light/dark transitions
Combines edges into shapes — eyes, noses, wheels, letters
Combines parts into concepts — faces, cars, words
Each layer builds more abstract representations from the layer before. The network automatically learns what features matter — no manual feature engineering needed.
Each circle = a neuron. Each line = a weight. Blue = inputs, purple = hidden neurons, green = output.
Weights: \(2 \times 3 + 3 \times 1 = 9\). Biases: \(3 + 1 = 4\). Total: 13 parameters.
Solve problems that single neurons cannot
MNIST dataset — the "Hello World" of deep learning
Universal approximation theorem: can approximate any continuous function
Structured data with complex feature interactions
2 inputs, 2 hidden neurons (step activation), 1 output
\(w_{11}=1,\; w_{12}=1,\; b_1=-0.5\)
\(h_1 = f(x_1 + x_2 - 0.5)\)
\(w_{21}=1,\; w_{22}=1,\; b_2=-1.5\)
\(h_2 = f(x_1 + x_2 - 1.5)\)
\(v_1 = 1,\; v_2 = -2,\; b_o = -0.5\). \(\hat{y} = f(h_1 - 2h_2 - 0.5)\) — computes "\(h_1\) AND NOT \(h_2\)"
| \(x_1\) | \(x_2\) | \(h_1\) = f(\(x_1+x_2-0.5\)) | \(h_2\) = f(\(x_1+x_2-1.5\)) | \(\hat{y}\) = f(\(h_1-2h_2-0.5\)) | XOR | |
|---|---|---|---|---|---|---|
| 0 | 0 | f(-0.5) = 0 | f(-1.5) = 0 | f(-0.5) = 0 | 0 | ✓ |
| 0 | 1 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1 | ✓ |
| 1 | 0 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1 | ✓ |
| 1 | 1 | f(1.5) = 1 | f(0.5) = 1 | f(-1.5) = 0 | 0 | ✓ |
XOR Solved! The hidden layer creates intermediate features that make the problem linearly separable.
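The full XOR network fits in a few lines, using the weights above (a sketch):

```python
def step(z):
    return 1 if z >= 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden neuron 1: OR
    h2 = step(x1 + x2 - 1.5)        # hidden neuron 2: AND
    return step(h1 - 2 * h2 - 0.5)  # output: h1 AND NOT h2
```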
We know the output is wrong, but how do we know which hidden weights to blame? The hidden neurons don't have direct targets — only the output does.
Imagine a factory assembly line with 3 stations. The final product is defective. You trace backward:
Backpropagation does exactly this — it traces the error backward through each layer.
The math tool that makes this possible: the chain rule from calculus.
$$\frac{\partial J}{\partial w^{[l]}} = \frac{\partial J}{\partial a^{[L]}} \cdot \frac{\partial a^{[L]}}{\partial z^{[L]}} \cdot \ldots \cdot \frac{\partial z^{[l]}}{\partial w^{[l]}}$$
Backpropagation = the chain rule applied systematically across layers
Input: \(x = 1\), Target: \(y = 1\), Learning rate: \(\alpha = 0.5\)
Weights: \(w_1 = 0.5,\; b_1 = 0.2\) (hidden) | \(w_2 = 0.8,\; b_2 = 0.1\) (output)
\(z_1 = w_1 x + b_1 = 0.5(1) + 0.2 = 0.7\)
\(h = \sigma(0.7) = \frac{1}{1+e^{-0.7}} = \mathbf{0.668}\)
\(z_2 = w_2 h + b_2 = 0.8(0.668) + 0.1 = 0.634\)
\(\hat{y} = \sigma(0.634) = \frac{1}{1+e^{-0.634}} = \mathbf{0.653}\)
\(L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.653 - 1)^2 = \frac{1}{2}(0.120) = \mathbf{0.0602}\). We want to reduce this!
\(\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = 0.653 - 1 = -0.347\)
\(\frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1-\hat{y}) = 0.653 \times 0.347 = 0.2266\) (sigmoid derivative)
\(\frac{\partial z_2}{\partial w_2} = h = 0.668\)
Chain rule: \(\frac{\partial L}{\partial w_2} = (-0.347)(0.2266)(0.668) = \mathbf{-0.0525}\)
Continue the chain through \(w_2\): \(\frac{\partial z_2}{\partial h} = w_2 = 0.8\), \(\frac{\partial h}{\partial z_1} = h(1-h) = 0.668 \times 0.332 = 0.2218\), \(\frac{\partial z_1}{\partial w_1} = x = 1\)
Chain rule: \(\frac{\partial L}{\partial w_1} = (-0.347)(0.2266)(0.8)(0.2218)(1) = \mathbf{-0.01394}\)
\(w_2' = 0.8 - 0.5(-0.0525) = \mathbf{0.826}\)
\(w_1' = 0.5 - 0.5(-0.01394) = \mathbf{0.507}\)
\(z_1' = 0.507(1) + 0.2 = 0.707\)
\(h' = \sigma(0.707) = 0.670\)
\(z_2' = 0.826(0.670) + 0.1 = 0.653\)
\(\hat{y}' = \sigma(0.653) = \mathbf{0.658}\)
One step of backprop reduced the loss! Repeat this thousands of times and the network learns.
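The whole worked example, forward pass, backward pass, and update, as a sketch (biases held fixed, as in the slides):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y, alpha = 1.0, 1.0, 0.5
w1, b1, w2, b2 = 0.5, 0.2, 0.8, 0.1

# Forward pass
h = sigmoid(w1 * x + b1)          # ~0.668
y_hat = sigmoid(w2 * h + b2)      # ~0.653
loss = 0.5 * (y_hat - y) ** 2     # ~0.060

# Backward pass (chain rule)
dL_dz2 = (y_hat - y) * y_hat * (1 - y_hat)
dL_dw2 = dL_dz2 * h
dL_dw1 = dL_dz2 * w2 * h * (1 - h) * x

# Gradient descent update
w2 -= alpha * dL_dw2
w1 -= alpha * dL_dw1

# Second forward pass: the loss went down
y_hat2 = sigmoid(w2 * sigmoid(w1 * x + b1) + b2)   # ~0.658
loss2 = 0.5 * (y_hat2 - y) ** 2
```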

They transform the input space into a representation where the problem becomes linearly separable.
Going Deep
A deep neural network where every neuron in one layer is connected to every neuron in the next layer. Also called a Dense Network or Deep Feedforward Network.
An MLP with 2+ hidden layers = a Fully Connected Neural Network. "Deep" means multiple hidden layers.
More layers = more abstraction. Early layers learn simple features, deeper layers learn complex combinations.
A network with 2+ hidden layers is called "deep." More depth = more capacity to learn complex patterns. Most modern networks have 10–100+ layers.
Our XOR solver: 2 weight layers, 9 params. MNIST digit recognizer: 3 weight layers, ~109K params. GPT-4: ~120 layers, an estimated 1.8T params.
Parameters: \((3 \times 4 + 4) + (4 \times 4 + 4) + (4 \times 2 + 2) = 16 + 20 + 10 = \mathbf{46}\) total
A forward pass through a network requires one multiplication per weight and one addition per bias. The total cost scales with parameter count.
A CPU does operations one-by-one. A GPU does thousands in parallel. Training GPT-4 on a single CPU would take ~300 years. On 25,000 GPUs: ~3 months.
Flatten pixels into a vector and classify (before CNNs took over)
Process word embeddings for sentiment analysis and text classification
FCNNs remain the go-to for structured feature data (customer churn, fraud)
Policy and value networks in game-playing agents (DQN)
2 inputs → 2 hidden (ReLU) → 1 output (sigmoid). Input: \(\mathbf{x} = [0.5,\; 0.8]\)
$$W^{[1]} = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.3 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$
\(z_1^{[1]} = 0.2(0.5) + 0.4(0.8) + 0.1 = \mathbf{0.52}\)
\(z_2^{[1]} = 0.6(0.5) + 0.3(0.8) + 0.2 = \mathbf{0.74}\)
\(a^{[1]} = \text{ReLU}(z^{[1]}) = [\max(0, 0.52),\; \max(0, 0.74)] = \mathbf{[0.52,\; 0.74]}\)
$$W^{[2]} = \begin{bmatrix} 0.5 & 0.7 \end{bmatrix}, \quad b^{[2]} = 0.1$$
\(z^{[2]} = 0.5(0.52) + 0.7(0.74) + 0.1 = 0.26 + 0.518 + 0.1 = \mathbf{0.878}\)
\(a^{[2]} = \sigma(0.878) = \frac{1}{1 + e^{-0.878}} = \mathbf{0.706}\)
Prediction: \(\hat{y} = 0.706\)
Probability of class 1 = 70.6%. With threshold 0.5 → Predict Class 1
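The matrix forward pass, written out with plain lists (a sketch; the variable names are mine):

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [0.5, 0.8]
W1 = [[0.2, 0.4],    # row i holds the weights into hidden neuron i
      [0.6, 0.3]]
b1 = [0.1, 0.2]
W2 = [0.5, 0.7]
b2 = 0.1

# Hidden layer: z = W1 @ x + b1, then ReLU -> [0.52, 0.74]
a1 = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
# Output layer: sigmoid(W2 @ a1 + b2) -> ~0.706
y_hat = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
```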
Gradients shrink exponentially through many layers, making early layers barely learn.
Gradients grow exponentially, causing unstable weight updates and NaN values.
Large networks can easily memorize training data. Need regularization + dropout.
More parameters = more computation. GPUs are essential for training.
ReLU activation, batch normalization, dropout, residual connections (skip connections), and the Adam optimizer.
Layers, neurons, activations
Xavier or He init
Compute prediction & loss
Compute gradients
Repeat from step 3
This is the same recipe whether you're training a 46-parameter XOR solver or a 175-billion-parameter GPT.
Calculate the total parameters in this FCNN:
Parameters per layer = \((\text{inputs} \times \text{outputs}) + \text{outputs}\). The first term is weights, the second is biases.
Layer 1: \(784 \times 128 + 128 = 100{,}480\)
Layer 2: \(128 \times 64 + 64 = 8{,}256\)
Layer 3: \(64 \times 10 + 10 = 650\)
Total: 109,386 parameters — and this is considered a small network!
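The per-layer rule generalizes to any list of layer sizes (a sketch):

```python
def count_params(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive layer pair."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

mnist_params = count_params([784, 128, 64, 10])   # 109,386
```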
109K parameters is tiny. GPT-3 has 175 billion. GPT-4 is estimated at 1.8 trillion. Yet this small MNIST network achieves ~98% accuracy on handwritten digits.
A neural network is just stacked regression units trained by gradient descent. Everything in Sections 5–8 builds on Sections 1–4. The math is the same — just applied to more layers.
Specialized architecture for images — filters learn edges, textures, shapes
Process sequences — text, time series, speech with memory cells
The architecture behind GPT, BERT, and modern AI breakthroughs
Build and train networks with PyTorch / TensorFlow
Introduction to Neural Networks
Questions?
CMSC 194.2 • University of the Philippines Cebu