From Linear Regression to Deep Networks
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
Neural Networks
Predicting continuous values
Optimizing parameters
Binary classification
Preventing overfitting
80 years of progress
The perceptron
Hidden layers
Deep networks
Explain linear and logistic regression, compute predictions by hand, and understand residuals & cost functions
Derive the MSE gradient step-by-step using chain rule, and apply gradient descent to optimize parameters
Describe overfitting vs. underfitting and explain how L1, L2, and dropout prevent memorization
Trace 80 years of neural network development from McCulloch-Pitts to ChatGPT
Compute forward passes through perceptrons and build AND, OR, NOT gates from single neurons
Explain how hidden layers solve XOR, work through backpropagation by hand, and count FCNN parameters
A supervised learning method that models the linear relationship between a dependent variable \(y\) and one or more independent variables \(x\).
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
Or in vector form: \(\hat{y} = \mathbf{w}^T \mathbf{x} + b\)
Find the "best fit" line that minimizes the distance between predicted and actual values.
Monthly charge = base fee + rate per kWh × usage
\(\text{Bill} = 500 + 12 \times \text{kWh used}\)
This is exactly \(\hat{y} = w_0 + w_1 x\) where:
Linear regression finds the best straight line through your data — the \(w_0\) and \(w_1\) that fit reality closest.
Predict price from area, bedrooms, location
Predict salary from years of experience
Predict temperature from historical data
Predict revenue from ad spend
Predict house price ($1000s) from area (sq ft)
| Area \(x\) (sq ft) | Price \(y\) ($1000s) |
|---|---|
| 1000 | 150 |
| 1500 | 200 |
| 2000 | 250 |
| 2500 | 300 |
| 3000 | 350 |
Find \(\hat{y} = w_0 + w_1 x\) that best fits this data.
$$w_1 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$$
\(\bar{x} = \frac{1000+1500+2000+2500+3000}{5} = 2000\)
\(\bar{y} = \frac{150+200+250+300+350}{5} = 250\)
Numerator: \(250{,}000\)
Denominator: \(2{,}500{,}000\)
\(w_1 = \frac{250{,}000}{2{,}500{,}000} = \mathbf{0.1}\)
$$w_0 = \bar{y} - w_1 \bar{x} = 250 - 0.1 \times 2000 = \mathbf{50}$$
Final Model: \(\hat{y} = 50 + 0.1x\)
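The closed-form computation above can be checked in a few lines of Python (a sketch; variable names are my own):

```python
# Housing data from the table: area (sq ft) and price ($1000s)
xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]

x_bar = sum(xs) / len(xs)   # 2000.0
y_bar = sum(ys) / len(ys)   # 250.0

# Closed-form least squares for one feature:
# w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar

# Predict the price of a 1,200 sq ft house: about 170 ($1000s)
price_1200 = w0 + w1 * 1200
```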
Using our model \(\hat{y} = 50 + 0.1x\):
| Area \(x\) | Computation | Predicted Price \(\hat{y}\) |
|---|---|---|
| 1,200 sq ft | \(50 + 0.1(1200)\) | $170K |
| 1,800 sq ft | \(50 + 0.1(1800)\) | $230K |
| 2,200 sq ft | \(50 + 0.1(2200)\) | $270K |
| 3,500 sq ft | \(50 + 0.1(3500)\) | $400K |
The model can predict for any area value — even ones not in the training data.

The vertical dashed lines are the residuals (errors): \(e_i = \hat{y}_i - y_i\). Each one measures how far off our prediction is. The "best" line minimizes these gaps overall.
Positive and negative errors cancel out! A line through the middle of a scattered cloud could have total error = 0 despite being terrible.
\(\sum e_i\)
Cancels out — useless
\(\sum |e_i|\)
Works, but not differentiable at 0
\(\sum e_i^2\)
Always positive, differentiable, penalizes big errors more
We need to take the derivative of the cost function to do gradient descent. Squared errors give us smooth derivatives — crucial for optimization.
By minimizing the average squared distance between predictions and actual values.
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$
\(J = \frac{1}{10}[(150-150)^2 + (200-200)^2 + \ldots] = 0\). Perfect fit!
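The cost can be written directly from the formula (a sketch; the function name is mine):

```python
def mse_cost(w0, w1, xs, ys):
    """J(w0, w1) = 1/(2m) * sum((y_hat - y)^2), with y_hat = w0 + w1 * x."""
    m = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]
cost = mse_cost(50, 0.1, xs, ys)   # essentially 0: a perfect fit
```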
How do we efficiently minimize \(J\) when data isn't perfectly linear? Enter: Gradient Descent
An iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps proportional to the negative of the gradient.
$$w := w - \alpha \frac{\partial J}{\partial w}$$
Where \(\alpha\) is the learning rate (step size)
Imagine standing on a foggy hill. You can only feel the slope under your feet. The gradient tells you which direction is steepest — walk the opposite way to go downhill.
Linear regression, logistic regression, neural networks, SVMs, transformers — virtually every ML model is trained with some form of gradient descent.
For simple linear regression with 1 feature, yes — we used the closed-form formula to get \(w_1 = 0.1\). But what about...
Gradient descent is the universal optimizer — it's how we train everything from logistic regression to GPT-4.
The derivative \(\frac{df}{dx}\) tells you the slope (rate of change) of \(f\) at any point. Positive slope = function is increasing. Negative slope = decreasing.
| Rule | Formula | Example |
|---|---|---|
| Power Rule | \(\frac{d}{dx}[x^n] = nx^{n-1}\) | \(\frac{d}{dx}[x^2] = 2x\) |
| Constant Rule | \(\frac{d}{dx}[cf(x)] = c \cdot f'(x)\) | \(\frac{d}{dx}[3x^2] = 6x\) |
| Chain Rule | \(\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)\) | \(\frac{d}{dx}[(2x+1)^2] = 2(2x+1) \cdot 2\) |
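A quick numerical sanity check of the chain-rule example, using finite differences (a sketch, not part of the slides):

```python
def f(x):
    return (2 * x + 1) ** 2

def f_prime(x):
    # Chain rule: outer derivative 2*(2x+1), times inner derivative 2
    return 2 * (2 * x + 1) * 2

# A central finite difference approximates the true slope
h = 1e-6
x = 1.0
numeric = (f(x + h) - f(x - h)) / (2 * h)
```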
With these 3 rules, you can derive every gradient in this lecture. Let's try it on MSE.
You're standing on a hillside in dense fog. You can only feel the slope under your feet.
Strategy: Always step in the steepest downhill direction. Eventually, you reach the valley (minimum).

When the gradient is negative, subtracting it increases \(w\) → we move right (toward the minimum)
When the gradient is positive, subtracting it decreases \(w\) → we move left (toward the minimum)
Either way, stepping along the negative gradient moves us downhill, toward lower cost!
Optimize millions of weights via backpropagation
Find optimal parameters for prediction models
Train CNNs for object detection and classification
Train transformers for language understanding
1. Initialize weights w randomly
2. Repeat until convergence:
a. Compute gradient: g = dJ/dw
b. Update weights: w = w - α × g
3. Return w

The gradient \(\frac{\partial J}{\partial w}\) tells us the direction of steepest ascent. We go in the opposite direction (hence the minus sign).
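The pseudocode above maps directly to a few lines of Python. A minimal sketch (the demo function and names are mine):

```python
def gradient_descent(grad, w, alpha=0.1, iters=100):
    """Minimize a function, given its gradient `grad`, starting from w."""
    for _ in range(iters):
        w = w - alpha * grad(w)   # step opposite the gradient
    return w

# Demo: minimize J(w) = w^2, whose gradient is 2w; the minimum is at w = 0
w_star = gradient_descent(lambda w: 2 * w, w=5.0)
```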
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$
Since \(\hat{y}_i = w_0 + w_1 x_i\), we can write:
$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i)^2$$
\(\frac{\partial J}{\partial w_1}\) — how does the cost change when we adjust the slope?
\(\frac{\partial J}{\partial w_0}\) — how does the cost change when we adjust the intercept?
$$\frac{\partial J}{\partial w_1} = \frac{1}{2m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\left[(w_0 + w_1 x_i - y_i)^2\right]$$
$$= \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i)$$
The inner derivative w.r.t. \(w_1\): \(\frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i) = x_i\)
$$= \frac{1}{m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i) \cdot x_i$$
$$\boxed{\frac{\partial J}{\partial w_1} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \cdot x_i}$$
$$\frac{\partial J}{\partial w_0} = \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i)$$
\(\frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i) = \mathbf{1}\)
Compare: for \(w_1\) it was \(x_i\)
$$\frac{\partial J}{\partial w_0} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$
No \(x_i\) — just the average error!
Intuition: \(\frac{\partial J}{\partial w_0}\) is the average prediction error. \(\frac{\partial J}{\partial w_1}\) is the error weighted by the input — larger inputs contribute more.
$$w_1 := w_1 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \cdot x_i$$
$$w_0 := w_0 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$
These two equations are all you need to train linear regression. Repeat until convergence.
Start: \(w_0 = 0,\; w_1 = 0,\; \alpha = 0.0000001\) (tiny LR because features are large)
Errors: \(\hat{y}_i - y_i = [0-150,\; 0-200,\; 0-250,\; 0-300,\; 0-350] = [-150, -200, -250, -300, -350]\)
\(= \frac{1}{5}(-150-200-250-300-350)\)
\(= \frac{-1250}{5} = \mathbf{-250}\)
\(= \frac{1}{5}[(-150)(1000) + (-200)(1500) + (-250)(2000) + (-300)(2500) + (-350)(3000)]\)
\(= \frac{-2{,}750{,}000}{5} = \mathbf{-550{,}000}\)
\(w_0' = 0 - 10^{-7}(-250) = 0.000025\), \(w_1' = 0 - 10^{-7}(-550000) = 0.055\). One step closer!
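This first hand-computed step can be verified in code (a sketch with my own variable names):

```python
xs = [1000, 1500, 2000, 2500, 3000]
ys = [150, 200, 250, 300, 350]
w0, w1, alpha = 0.0, 0.0, 1e-7
m = len(xs)

errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]   # y_hat - y
grad_w0 = sum(errors) / m                              # average error: -250
grad_w1 = sum(e * x for e, x in zip(errors, xs)) / m   # error weighted by input: -550000

w0 -= alpha * grad_w0   # 0.000025
w1 -= alpha * grad_w1   # ~0.055
```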
| Iteration | \(w_0\) | \(w_1\) | \(J\) (cost) |
|---|---|---|---|
| 0 | 0.000 | 0.000 | 33,750 |
| 1 | 0.000025 | 0.055 | 10,306 |
| 10 | ≈ 0.000 | 0.122 | 139 |
| 10,000 | 0.006 | 0.122 | ≈ 139 |
| ~\(10^8\) | → 50.0 | → 0.100 | → 0 |
The slope \(w_1\) locks in within a few steps, but the intercept \(w_0\) crawls: with an unscaled feature (\(x\) in the thousands) the cost surface is a long, narrow valley, so full convergence takes on the order of \(10^8\) iterations. Feature scaling fixes this.
GD converges to \(w_0 = 50,\; w_1 = 0.1\) — exactly the same as the closed-form formula.
Unlike the formula, GD works for any differentiable model — logistic regression, neural networks, transformers...
Minimize \(J(w) = (w - 3)^2\)
\(J(w)\) is minimized when \(w = 3\). Let's see if gradient descent finds it!
Since \(J(w) = (w-3)^2\) is a perfect parabola with minimum at \(w=3\), GD should converge there. Starting at \(w=0\), the gradient is negative (slope points downhill to the right), so GD will move \(w\) to the right toward 3.
| Iter | \(w\) | \(\frac{dJ}{dw} = 2(w-3)\) | Update | New \(w\) |
|---|---|---|---|---|
| 0 | 0.000 | \(2(0-3) = -6.0\) | \(0 - 0.2(-6.0)\) | 1.200 |
| 1 | 1.200 | \(2(1.2-3) = -3.6\) | \(1.2 - 0.2(-3.6)\) | 1.920 |
| 2 | 1.920 | \(2(1.92-3) = -2.16\) | \(1.92 - 0.2(-2.16)\) | 2.352 |
| 3 | 2.352 | \(2(2.352-3) = -1.296\) | \(2.352 - 0.2(-1.296)\) | 2.611 |
| 4 | 2.611 | \(2(2.611-3) = -0.778\) | \(2.611 - 0.2(-0.778)\) | 2.767 |
Converging toward \(w = 3\)! Each step gets us closer to the minimum.
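The iteration table can be reproduced with a short loop (a sketch):

```python
w, alpha = 0.0, 0.2
trajectory = []
for _ in range(5):
    grad = 2 * (w - 3)        # dJ/dw for J(w) = (w - 3)^2
    w = w - alpha * grad      # step opposite the gradient
    trajectory.append(w)      # 1.2, 1.92, 2.352, 2.6112, 2.76672
```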

Steps get smaller as we approach the minimum — the gradient decreases!

Slow convergence

Optimal convergence

Overshoots — may diverge!
Non-convex functions have multiple valleys. GD might get stuck in a shallow one instead of finding the deepest.
Points where gradient = 0 but it's not a minimum. The surface curves up in one direction and down in another.
If features have very different ranges (e.g., age 0–100 vs salary 0–1M), GD zig-zags instead of going straight to the minimum.
Common criteria: gradient < threshold, cost change < \(\epsilon\), or fixed number of iterations.
MSE is convex (one valley), so GD always finds the global minimum. Neural networks? Not so lucky — but it works anyway!
| Variant | Data per Step | Speed | Stability |
|---|---|---|---|
| Batch GD | All samples | Slow | Very stable |
| Stochastic GD | 1 sample | Fast | Noisy |
| Mini-batch GD | \(k\) samples | Balanced | Balanced |
Mini-batch gradient descent (batch size 32–256) is the standard in deep learning. It balances speed with stable convergence.
Linear regression can predict -0.2 or 1.3 — what does a negative probability mean?
A single outlier can shift the entire line, changing the decision boundary dramatically.
We need a function that squishes any real number into the range \((0, 1)\) → Enter: the sigmoid function
A classification algorithm that models the probability of a binary outcome using the sigmoid (logistic) function.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$
The sigmoid maps any real number to \((0, 1)\) — perfect for probabilities. Output > 0.5 → class 1, otherwise class 0.
Named after the logistic function, coined by Pierre-François Verhulst (1845) to model population growth. Same S-curve, different application!

| \(z\) | \(e^{-z}\) | \(1 + e^{-z}\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Interpretation |
|---|---|---|---|---|
| \(-3\) | \(e^3 = 20.09\) | \(21.09\) | 0.047 | Very likely class 0 |
| \(-1\) | \(e^1 = 2.718\) | \(3.718\) | 0.269 | Probably class 0 |
| \(0\) | \(e^0 = 1\) | \(2\) | 0.500 | Coin flip! |
| \(1\) | \(e^{-1} = 0.368\) | \(1.368\) | 0.731 | Probably class 1 |
| \(3\) | \(e^{-3} = 0.050\) | \(1.050\) | 0.953 | Very likely class 1 |
\(\sigma(0) = 0.5\) always. The function is symmetric around \(z = 0\).
For \(|z| > 5\), the sigmoid is nearly 0 or 1. The neuron is "very confident."
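The table values follow from a one-line sigmoid (a sketch):

```python
import math

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Matches the table: sigmoid(0) = 0.5 exactly; large |z| saturates near 0 or 1,
# and symmetry gives sigmoid(z) + sigmoid(-z) = 1
```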
\(z = \mathbf{w}^T\mathbf{x} + b\)
Can be any real number
\(p = \sigma(z)\)
Squished to (0, 1)
\(p \geq 0.5 \rightarrow\) Class 1
\(p < 0.5 \rightarrow\) Class 0
Classify emails as spam or not spam
Predict disease presence from symptoms
Flag suspicious credit card transactions
Predict outcome from study hours
Predict pass/fail from study hours
| Hours \(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Result | Fail | Fail | Fail | Pass | Pass | Pass | Pass |
\(w_1 = 1.5\), \(b = -5.0\). Model: \(\hat{y} = \sigma(1.5x - 5.0)\), threshold = 0.5
| Hours \(x\) | \(z = 1.5x - 5.0\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Prediction |
|---|---|---|---|
| 2 | \(1.5(2)-5 = -2.0\) | \(\frac{1}{1+e^{2.0}} = 0.119\) | Fail ✗ |
| 3 | \(1.5(3)-5 = -0.5\) | \(\frac{1}{1+e^{0.5}} = 0.378\) | Fail ✗ |
| 4 | \(1.5(4)-5 = 1.0\) | \(\frac{1}{1+e^{-1.0}} = 0.731\) | Pass ✓ |
| 5 | \(1.5(5)-5 = 2.5\) | \(\frac{1}{1+e^{-2.5}} = 0.924\) | Pass ✓ |
All predictions match! The sigmoid converts the linear score into a probability.
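The worked predictions can be checked against the slide's weights (a sketch; threshold 0.5 as above):

```python
import math

def predict_pass(hours, w=1.5, b=-5.0):
    """Return (probability, label) under the study-hours model sigma(1.5x - 5)."""
    p = 1 / (1 + math.exp(-(w * hours + b)))
    return p, ("Pass" if p >= 0.5 else "Fail")
```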
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
MSE with sigmoid creates a non-convex loss surface with many local minima. Cross-entropy is convex — gradient descent always finds the global minimum.

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)x_j^{(i)}$$
Surprise! This looks identical to the linear regression gradient — but \(\hat{y}\) uses the sigmoid.
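Because the gradient has the same form, the training loop is the linear-regression loop with a sigmoid inserted. A minimal sketch on the study-hours data (the learning rate and iteration count are my own illustrative choices):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 1, 1]            # 0 = Fail, 1 = Pass
w, b, alpha, m = 0.0, 0.0, 0.1, len(xs)

for _ in range(50_000):
    # Same (y_hat - y) gradient form as linear regression, y_hat now via sigmoid
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    w -= alpha * sum(e * x for e, x in zip(errors, xs)) / m
    b -= alpha * sum(errors) / m

labels = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
```

After training, the learned decision boundary falls between 3 and 4 study hours, so every example is classified correctly.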
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous value | Probability (0–1) |
| Activation | None (identity) | Sigmoid |
| Loss Function | Mean Squared Error | Binary Cross-Entropy |
| Use Case | Regression (predict amount) | Classification (predict class) |
| Decision | No threshold | Threshold at 0.5 |
Logistic regression is essentially linear regression + sigmoid. This idea of "linear combination + activation" is the foundation of neural networks!
A technique that adds a penalty term to the loss function to prevent overfitting by discouraging overly complex models.
$$J_{\text{reg}} = J_{\text{original}} + \lambda \cdot R(\mathbf{w})$$
Where \(\lambda\) controls regularization strength
Regularization is like a teacher telling a student: "I don't just want the right answer — I want a simple explanation." Simpler models generalize better.
Models may memorize training data instead of learning general patterns — performing perfectly on training data but failing on new data.
\(\lambda = 0\): no penalty (may overfit). \(\lambda \to \infty\): all weights → 0 (underfits). Cross-validation finds the sweet spot.
Memorizes every answer from past exams word-for-word
Learns the underlying concepts and problem-solving strategies
Regularization forces the model to "understand" rather than "memorize" — slightly worse on training data, much better on new data.

High Bias — too simple

Good Balance

High Variance — too complex
$$J_{L1} = J + \lambda \sum_{i} |w_i|$$
Encourages sparsity — some weights become exactly 0, effectively performing feature selection.
$$J_{L2} = J + \lambda \sum_{i} w_i^2$$
Shrinks all weights toward 0 but never exactly 0. Smoother, more stable solutions.
You suspect many features are irrelevant and want the model to automatically ignore them. L1 acts as a built-in feature selector.
All features matter but you want to prevent any single weight from dominating. L2 keeps everything small and balanced — the default choice.
$$J_{EN} = J + \lambda_1 \sum|w_i| + \lambda_2 \sum w_i^2$$
Combines the best of both: feature selection and weight shrinkage. Used in scikit-learn's ElasticNet.
Randomly "turn off" neurons during training (e.g., 50% chance). Forces the network to not rely on any single neuron.
Like studying with random pages removed — you learn the core concepts, not surface patterns.
We'll explore dropout in detail when we cover deep network training. For now, remember: regularization = adding constraints to prevent overfitting.
Automatically remove irrelevant features by zeroing their weights
Prevent deep networks from memorizing training data
Keep models simple and interpretable
Models perform better on unseen test data
Original model: \(w_0 = 50\) (bias), \(w_1 = 0.1\) (weight), and \(\lambda = 0.01\)
\(J = 0\) (perfect fit)
\(J_{\text{reg}} = 0 + 0.01 \times (0.1^2) = 0 + 0.01 \times 0.01 = \mathbf{0.0001}\)
\(\frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{\partial J}{\partial w_j} + 2\lambda w_j\) — only feature weights; bias \(w_0\) is not penalized. The penalty pushes large weights smaller during training.
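The penalized cost and the extra gradient term, as a sketch (feature weights only, bias excluded):

```python
def ridge_cost(J, weights, lam):
    """L2-regularized cost: J + lambda * sum(w^2). Pass feature weights only, no bias."""
    return J + lam * sum(w ** 2 for w in weights)

def ridge_grad_term(w_j, lam):
    """Extra gradient contribution from the L2 penalty: 2 * lambda * w_j."""
    return 2 * lam * w_j

cost = ridge_cost(0.0, [0.1], lam=0.01)   # 0.0001, matching the worked example
```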

We now have the classical building blocks:
Let's look at the history that led from these ideas to neural networks...
| Method | Penalty | Effect | Best for |
|---|---|---|---|
| L1 | \(\lambda\sum|w_i|\) | Zeros out weights | Feature selection |
| L2 | \(\lambda\sum w_i^2\) | Shrinks weights | Preventing large weights |
| Elastic Net | L1 + L2 | Both | Correlated features |
| Dropout | Random neuron removal | Ensemble effect | Neural networks |
From the 1940s to Today
McCulloch & Pitts — First mathematical model of an artificial neuron. Showed neurons could compute logical functions.
Frank Rosenblatt's Perceptron — First trainable neural network, implemented in hardware. Could learn to classify simple patterns.
Minsky & Papert — Published "Perceptrons", proving single-layer networks cannot solve XOR. Revealed fundamental limitations.
Funding and interest dried up. Neural network research was largely abandoned for over a decade.
Rumelhart, Hinton & Williams — Popularized backpropagation, enabling training of multi-layer networks. The key breakthrough.
Yann LeCun — Applied CNNs to handwritten digit recognition (MNIST). First practical deep learning success.
Hochreiter & Schmidhuber — Invented LSTM for sequential data, solving the vanishing gradient problem.
Geoffrey Hinton — Deep Belief Networks. The term "Deep Learning" enters mainstream AI vocabulary.
Before 1986, there was no efficient way to train multi-layer networks. Backpropagation gave us a systematic method to compute gradients for every weight — not just the output layer, but all hidden layers too. This was the missing piece.
AlexNet wins ImageNet by a huge margin. 8 layers, GPU training. The "Big Bang" of modern deep learning.
GANs (Goodfellow) — Generative Adversarial Networks can create realistic images from noise.
"Attention Is All You Need" — The Transformer architecture revolutionizes NLP and later all of AI.
ChatGPT & LLMs — Large language models reach mainstream. AI becomes a daily tool for millions.
Internet, smartphones, and sensors generate massive datasets. Deep networks need lots of data to learn effectively.
GPUs, TPUs, and cloud computing make training billion-parameter models feasible. 10,000x faster than 2000s hardware.
ReLU, batch normalization, dropout, residual connections, Adam optimizer — all invented in the 2010s.
A neural network is built from the same building blocks we've been studying — weighted sums, activation functions, and gradient-based optimization.
Why did neural networks fail in the 1970s but succeed in the 2010s?
What changed in terms of data, compute, and algorithms?
How much data existed in the 1970s vs. the age of the internet?
What hardware breakthrough made matrix multiplication 100× faster?
What training method was missing before 1986?
1970s: Limited data, no GPUs, only single-layer networks (couldn't solve XOR). 2010s: Internet-scale data, GPU parallelism (NVIDIA CUDA), and breakthroughs like backpropagation, ReLU, and batch normalization made deep networks trainable.
The Perceptron
A computational unit that takes weighted inputs, sums them, adds a bias, and applies an activation function.
$$z = \sum_{i=1}^{n} w_i x_i + b, \quad a = f(z)$$
| Dendrites | → Inputs \(x_1, x_2, \ldots\) |
| Synaptic strength | → Weights \(w_1, w_2, \ldots\) |
| Soma (cell body) | → Summation \(\Sigma\) |
| Axon hillock | → Activation function \(f\) |
| Axon | → Output \(\hat{y}\) |
Real neurons are vastly more complex — they use timing, chemical signals, and recurrent connections. The artificial neuron is a very rough approximation.
\(f(z) = \frac{1}{1+e^{-z}}\)
Range: (0, 1)
\(f(z) = \max(0, z)\)
Range: [0, ∞)
\(f(z) = \tanh(z)\)
Range: (-1, 1)
Fast to compute, mitigates the vanishing gradient problem, and works well in practice.
A single neuron with sigmoid activation is logistic regression! Same weighted sum + same activation function.
\(w_1 = 1,\; w_2 = 1,\; b = -1.5\), step activation \(f(z) = \begin{cases}1 & z \geq 0\\0 & z < 0\end{cases}\)
| \(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 1.5\) | \(f(z)\) | AND | Match? |
|---|---|---|---|---|---|
| 0 | 0 | \(0 + 0 - 1.5 = -1.5\) | 0 | 0 | ✓ |
| 0 | 1 | \(0 + 1 - 1.5 = -0.5\) | 0 | 0 | ✓ |
| 1 | 0 | \(1 + 0 - 1.5 = -0.5\) | 0 | 0 | ✓ |
| 1 | 1 | \(1 + 1 - 1.5 = 0.5\) | 1 | 1 | ✓ |
The perceptron correctly computes AND! A single neuron can learn any linearly separable function.
\(w_1 = 1,\; w_2 = 1,\; b = -0.5\), same step activation
| \(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 0.5\) | \(f(z)\) | OR | Match? |
|---|---|---|---|---|---|
| 0 | 0 | \(0 + 0 - 0.5 = -0.5\) | 0 | 0 | ✓ |
| 0 | 1 | \(0 + 1 - 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | 0 | \(1 + 0 - 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | 1 | \(1 + 1 - 0.5 = 1.5\) | 1 | 1 | ✓ |
Just by changing the bias from \(-1.5\) to \(-0.5\), the neuron switches from AND to OR!
\(w_1 = -1,\; b = 0.5\), step activation. Only one input.
| \(x\) | \(z = -x + 0.5\) | \(f(z)\) | NOT \(x\) | Match? |
|---|---|---|---|---|
| 0 | \(-0 + 0.5 = 0.5\) | 1 | 1 | ✓ |
| 1 | \(-1 + 0.5 = -0.5\) | 0 | 0 | ✓ |
With AND, OR, and NOT, neurons can compute any Boolean function... if we stack them in layers.
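All three truth tables can be verified with a single perceptron function (a sketch):

```python
def perceptron(inputs, weights, bias):
    """Step-activated neuron: fires (1) when the weighted sum plus bias is >= 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

def AND(a, b): return perceptron([a, b], [1, 1], -1.5)
def OR(a, b):  return perceptron([a, b], [1, 1], -0.5)
def NOT(a):    return perceptron([a], [-1], 0.5)
```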
| \(x_1\) | \(x_2\) | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No! XOR is not linearly separable. No single straight line can separate the 1s from the 0s.

Output: 1 ✓ Correct!
A single neuron can only learn linearly separable patterns. It draws one straight line through feature space.
Stack multiple neurons in layers. Each neuron learns a different feature. Together, they can solve XOR and any non-linear problem!
Enter: The Multilayer Perceptron (MLP)
Minsky & Papert (1969) formally proved no single perceptron can solve XOR. This wasn't opinion — it was a theorem. The search for multilayer networks was the direct response.
One hidden neuron can learn one decision boundary. Two hidden neurons = two boundaries. Combined, they carve out any region in feature space.
Hidden Layers
A feedforward neural network with one or more hidden layers between the input and output layers. Each layer is fully connected to the next.
Input Layer → Hidden Layer(s) → Output Layer
An MLP with just one hidden layer (enough neurons) can approximate any continuous function. This was proven by Cybenko in 1989.
Feedforward = data flows one direction. Dense layer = fully connected layer. Hidden = not directly observed (neither input nor output).
Detects simple patterns — lines, curves, light/dark transitions
Combines edges into shapes — eyes, noses, wheels, letters
Combines parts into concepts — faces, cars, words
Each layer builds more abstract representations from the layer before. The network automatically learns what features matter — no manual feature engineering needed.
Each circle = a neuron. Each line = a weight. Blue = inputs, purple = hidden neurons, green = output.
Weights: \(2 \times 3 + 3 \times 1 = 9\). Biases: \(3 + 1 = 4\). Total: 13 parameters.
Solve problems that single neurons cannot
MNIST dataset — the "Hello World" of deep learning
Universal approximation theorem: can approximate any continuous function
Structured data with complex feature interactions
2 inputs, 2 hidden neurons (step activation), 1 output
\(w_{11}=1,\; w_{12}=1,\; b_1=-0.5\)
\(h_1 = f(x_1 + x_2 - 0.5)\)
\(w_{21}=1,\; w_{22}=1,\; b_2=-1.5\)
\(h_2 = f(x_1 + x_2 - 1.5)\)
\(v_1 = 1,\; v_2 = -2,\; b_o = -0.5\). \(\hat{y} = f(h_1 - 2h_2 - 0.5)\) — computes "\(h_1\) AND NOT \(h_2\)"
| \(x_1\) | \(x_2\) | \(h_1\) = f(\(x_1+x_2-0.5\)) | \(h_2\) = f(\(x_1+x_2-1.5\)) | \(\hat{y}\) = f(\(h_1-2h_2-0.5\)) | XOR | |
|---|---|---|---|---|---|---|
| 0 | 0 | f(-0.5) = 0 | f(-1.5) = 0 | f(-0.5) = 0 | 0 | ✓ |
| 0 | 1 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1 | ✓ |
| 1 | 0 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1 | ✓ |
| 1 | 1 | f(1.5) = 1 | f(0.5) = 1 | f(-1.5) = 0 | 0 | ✓ |
XOR Solved! The hidden layer creates intermediate features that make the problem linearly separable.
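The full XOR network fits in a few lines, using the weights above (a sketch):

```python
def step(z):
    return 1 if z >= 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden neuron 1: OR
    h2 = step(x1 + x2 - 1.5)        # hidden neuron 2: AND
    return step(h1 - 2 * h2 - 0.5)  # output: h1 AND NOT h2
```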
We know the output is wrong, but how do we know which hidden weights to blame? The hidden neurons don't have direct targets — only the output does.
Imagine a factory assembly line with 3 stations. The final product is defective. You trace backward:
Backpropagation does exactly this — it traces the error backward through each layer.
The math tool that makes this possible: the chain rule from calculus.
$$\frac{\partial J}{\partial w^{[l]}} = \frac{\partial J}{\partial a^{[L]}} \cdot \frac{\partial a^{[L]}}{\partial z^{[L]}} \cdot \ldots \cdot \frac{\partial z^{[l]}}{\partial w^{[l]}}$$
Backpropagation = the chain rule applied systematically across layers
Input: \(x = 1\), Target: \(y = 1\), Learning rate: \(\alpha = 0.5\)
Weights: \(w_1 = 0.5,\; b_1 = 0.2\) (hidden) | \(w_2 = 0.8,\; b_2 = 0.1\) (output)
\(z_1 = w_1 x + b_1 = 0.5(1) + 0.2 = 0.7\)
\(h = \sigma(0.7) = \frac{1}{1+e^{-0.7}} = \mathbf{0.668}\)
\(z_2 = w_2 h + b_2 = 0.8(0.668) + 0.1 = 0.634\)
\(\hat{y} = \sigma(0.634) = \frac{1}{1+e^{-0.634}} = \mathbf{0.653}\)
\(L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.653 - 1)^2 = \frac{1}{2}(0.120) = \mathbf{0.0602}\). We want to reduce this!
\(\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = 0.653 - 1 = -0.347\)
\(\frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1-\hat{y}) = 0.653 \times 0.347 = 0.2266\) (sigmoid derivative)
\(\frac{\partial z_2}{\partial w_2} = h = 0.668\)
Chain rule: \(\frac{\partial L}{\partial w_2} = (-0.347)(0.2266)(0.668) = \mathbf{-0.0525}\)
Continue the chain through \(w_2\): \(\frac{\partial z_2}{\partial h} = w_2 = 0.8\), \(\frac{\partial h}{\partial z_1} = h(1-h) = 0.668 \times 0.332 = 0.2218\), \(\frac{\partial z_1}{\partial w_1} = x = 1\)
Chain rule: \(\frac{\partial L}{\partial w_1} = (-0.347)(0.2266)(0.8)(0.2218)(1) = \mathbf{-0.01394}\)
\(w_2' = 0.8 - 0.5(-0.0525) = \mathbf{0.826}\)
\(w_1' = 0.5 - 0.5(-0.01394) = \mathbf{0.507}\)
\(z_1' = 0.507(1) + 0.2 = 0.707\)
\(h' = \sigma(0.707) = 0.670\)
\(z_2' = 0.826(0.670) + 0.1 = 0.653\)
\(\hat{y}' = \sigma(0.653) = \mathbf{0.658}\)
One step of backprop reduced the loss! Repeat this thousands of times and the network learns.
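The whole worked example, forward pass, backward pass, and update, as a sketch (biases held fixed, as in the slides):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y, alpha = 1.0, 1.0, 0.5
w1, b1, w2, b2 = 0.5, 0.2, 0.8, 0.1

# Forward pass
h = sigmoid(w1 * x + b1)          # ~0.668
y_hat = sigmoid(w2 * h + b2)      # ~0.653
loss = 0.5 * (y_hat - y) ** 2     # ~0.060

# Backward pass (chain rule)
dL_dz2 = (y_hat - y) * y_hat * (1 - y_hat)
dL_dw2 = dL_dz2 * h
dL_dw1 = dL_dz2 * w2 * h * (1 - h) * x

# Gradient descent update
w2 -= alpha * dL_dw2
w1 -= alpha * dL_dw1

# Second forward pass: the loss went down
y_hat2 = sigmoid(w2 * sigmoid(w1 * x + b1) + b2)   # ~0.658
loss2 = 0.5 * (y_hat2 - y) ** 2
```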

They transform the input space into a representation where the problem becomes linearly separable.
Going Deep
A deep neural network where every neuron in one layer is connected to every neuron in the next layer. Also called a Dense Network or Deep Feedforward Network.
An MLP with 2+ hidden layers = a Fully Connected Neural Network. "Deep" means multiple hidden layers.
More layers = more abstraction. Early layers learn simple features, deeper layers learn complex combinations.
A network with 2+ hidden layers is called "deep." More depth = more capacity to learn complex patterns. Most modern networks have 10–100+ layers.
Our XOR solver: 2 weight layers, 9 params. MNIST digit recognizer: 3 weight layers, ~109K params. GPT-4: ~120 layers, an estimated 1.8T params.
Parameters: \((3 \times 4 + 4) + (4 \times 4 + 4) + (4 \times 2 + 2) = 16 + 20 + 10 = \mathbf{46}\) total
A forward pass through a network requires one multiplication per weight and one addition per bias. The total cost scales with parameter count.
A CPU does operations one-by-one. A GPU does thousands in parallel. Training GPT-4 on a single CPU would take ~300 years. On 25,000 GPUs: ~3 months.
Flatten pixels into a vector and classify (before CNNs took over)
Process word embeddings for sentiment analysis and text classification
FCNNs remain the go-to for structured feature data (customer churn, fraud)
Policy and value networks in game-playing agents (DQN)
2 inputs → 2 hidden (ReLU) → 1 output (sigmoid). Input: \(\mathbf{x} = [0.5,\; 0.8]\)
$$W^{[1]} = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.3 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$
\(z_1^{[1]} = 0.2(0.5) + 0.4(0.8) + 0.1 = \mathbf{0.52}\)
\(z_2^{[1]} = 0.6(0.5) + 0.3(0.8) + 0.2 = \mathbf{0.74}\)
\(a^{[1]} = \text{ReLU}(z^{[1]}) = [\max(0, 0.52),\; \max(0, 0.74)] = \mathbf{[0.52,\; 0.74]}\)
$$W^{[2]} = \begin{bmatrix} 0.5 & 0.7 \end{bmatrix}, \quad b^{[2]} = 0.1$$
\(z^{[2]} = 0.5(0.52) + 0.7(0.74) + 0.1 = 0.26 + 0.518 + 0.1 = \mathbf{0.878}\)
\(a^{[2]} = \sigma(0.878) = \frac{1}{1 + e^{-0.878}} = \mathbf{0.706}\)
Prediction: \(\hat{y} = 0.706\)
Probability of class 1 = 70.6%. With threshold 0.5 → Predict Class 1
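The matrix forward pass, written out with plain lists (a sketch; the variable names are mine):

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [0.5, 0.8]
W1 = [[0.2, 0.4],    # row i holds the weights into hidden neuron i
      [0.6, 0.3]]
b1 = [0.1, 0.2]
W2 = [0.5, 0.7]
b2 = 0.1

# Hidden layer: z = W1 @ x + b1, then ReLU -> [0.52, 0.74]
a1 = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
# Output layer: sigmoid(W2 @ a1 + b2) -> ~0.706
y_hat = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
```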
Gradients shrink exponentially through many layers, making early layers barely learn.
Gradients grow exponentially, causing unstable weight updates and NaN values.
Large networks can easily memorize training data. Need regularization + dropout.
More parameters = more computation. GPUs are essential for training.
ReLU activation, batch normalization, dropout, residual connections (skip connections), and the Adam optimizer.
Layers, neurons, activations
Xavier or He init
Compute prediction & loss
Compute gradients
Repeat from step 3
This is the same recipe whether you're training a 46-parameter XOR solver or a 175-billion-parameter GPT.
Calculate the total parameters in this FCNN:
Parameters per layer = \((\text{inputs} \times \text{outputs}) + \text{outputs}\). The first term is weights, the second is biases.
Layer 1: \(784 \times 128 + 128 = 100{,}480\)
Layer 2: \(128 \times 64 + 64 = 8{,}256\)
Layer 3: \(64 \times 10 + 10 = 650\)
Total: 109,386 parameters — and this is considered a small network!
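The per-layer rule generalizes to any list of layer sizes (a sketch):

```python
def count_params(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive layer pair."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

mnist_params = count_params([784, 128, 64, 10])   # 109,386
```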
109K parameters is tiny. GPT-3 has 175 billion. GPT-4 is estimated at 1.8 trillion. Yet this small MNIST network achieves ~98% accuracy on handwritten digits.
A neural network is just stacked regression units trained by gradient descent. Everything in Sections 5–8 builds on Sections 1–4. The math is the same — just applied to more layers.
Specialized architecture for images — filters learn edges, textures, shapes
Process sequences — text, time series, speech with memory cells
The architecture behind GPT, BERT, and modern AI breakthroughs
Build and train networks with PyTorch / TensorFlow
Introduction to Neural Networks
Questions?
CMSC 194.2 • University of the Philippines Cebu