CMSC 194.2

Introduction to Neural Networks

From Linear Regression to Deep Networks

Noel Jeffrey Pinton

Department of Computer Science
University of the Philippines Cebu

Neural Networks

Overview

What We'll Cover

Linear Regression

Predicting continuous values

Gradient Descent

Optimizing parameters

Logistic Regression

Binary classification

Regularization

Preventing overfitting

History of NNs

80 years of progress

Simple Neuron

The perceptron

Multilayer Perceptron

Hidden layers

Fully Connected NNs

Deep networks

Objectives

Learning Objectives

Regression

Explain linear and logistic regression, compute predictions by hand, and understand residuals & cost functions

Gradient Descent

Derive the MSE gradient step-by-step using chain rule, and apply gradient descent to optimize parameters

Regularization

Describe overfitting vs. underfitting and explain how L1, L2, and dropout prevent memorization

History

Trace 80 years of neural network development from McCulloch-Pitts to ChatGPT

Neurons & Logic

Compute forward passes through perceptrons and build AND, OR, NOT gates from single neurons

Deep Networks

Explain how hidden layers solve XOR, work through backpropagation by hand, and count FCNN parameters

SECTION 1

Linear Regression

Linear Regression

What is Linear Regression?

Definition

A supervised learning method that models the linear relationship between a dependent variable \(y\) and one or more independent variables \(x\).

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$

Or in vector form: \(\hat{y} = \mathbf{w}^T \mathbf{x} + b\)

Linear Regression

Why Linear? A Real-Life Analogy

Your electricity bill

Monthly charge = base fee + rate per kWh × usage

\(\text{Bill} = 500 + 12 \times \text{kWh used}\)

The connection

This is exactly \(\hat{y} = w_0 + w_1 x\) where:

  • \(w_0 = 500\) (base fee / intercept)
  • \(w_1 = 12\) (rate / slope)
  • \(x\) = kWh used (input feature)
  • \(\hat{y}\) = bill amount (prediction)

Many things are "roughly linear"

  • Distance traveled = speed × time
  • Taxi fare = base + rate × km
  • Exam score ≈ hours studied × factor
  • Crop yield ≈ fertilizer amount × factor
Linear Regression

Use Cases for Linear Regression

House Price Prediction

Predict price from area, bedrooms, location

Salary Estimation

Predict salary from years of experience

Weather Forecasting

Predict temperature from historical data

Sales Forecasting

Predict revenue from ad spend

Linear Regression

Worked Example: Setup

Problem

Predict house price ($1000s) from area (sq ft)

Area \(x\) (sq ft) | Price \(y\) ($1000s)
1000 | 150
1500 | 200
2000 | 250
2500 | 300
3000 | 350
Linear Regression

Computing the Slope \(w_1\)

$$w_1 = \frac{\sum_{i=1}^{m}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{m}(x_i - \bar{x})^2}$$

Step 1: Compute means

\(\bar{x} = \frac{1000+1500+2000+2500+3000}{5} = 2000\)

\(\bar{y} = \frac{150+200+250+300+350}{5} = 250\)

Step 2: Compute sums

Numerator: \(\sum(x_i - \bar{x})(y_i - \bar{y}) = 250{,}000\)

Denominator: \(\sum(x_i - \bar{x})^2 = 2{,}500{,}000\)

Step 3: Divide

\(w_1 = \frac{250{,}000}{2{,}500{,}000} = \mathbf{0.1}\)

Linear Regression

Intercept & Final Model

$$w_0 = \bar{y} - w_1 \bar{x} = 250 - 0.1 \times 2000 = \mathbf{50}$$

Final Model: \(\hat{y} = 50 + 0.1x\)
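The slope and intercept above can be checked in a few lines of NumPy (a sketch using exactly the closed-form formulas from these slides):

```python
import numpy as np

# House data from the worked example: area (sq ft) vs. price ($1000s)
x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([150, 200, 250, 300, 350], dtype=float)

# Closed-form least-squares estimates
x_bar, y_bar = x.mean(), y.mean()
w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
w0 = y_bar - w1 * x_bar

print(round(w1, 3), round(w0, 3))  # 0.1 50.0
```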

Linear Regression

Making Predictions

Using our model \(\hat{y} = 50 + 0.1x\):

Area \(x\) | Computation | Predicted Price \(\hat{y}\)
1,200 sq ft | \(50 + 0.1(1200)\) | $170K
1,800 sq ft | \(50 + 0.1(1800)\) | $230K
2,200 sq ft | \(50 + 0.1(2200)\) | $270K
3,500 sq ft | \(50 + 0.1(3500)\) | $400K
Linear Regression

Visualization

Scatter plot of house area vs price with the regression line y = 50 + 0.1x fitting all 5 data points
Linear Regression

What Does "Error" Mean?

Scatter plot of area (sq ft) vs. price ($K) with the line ŷ = 50 + 0.1x; vertical dashed segments labeled "error" connect each point to the line

Residuals

The vertical dashed lines are the residuals (errors): \(e_i = \hat{y}_i - y_i\). Each one measures how far off our prediction is. The "best" line minimizes these gaps overall.

Linear Regression

Why Squared Errors?

Why not just sum the errors \(\sum(e_i)\)?

Positive and negative errors cancel out! A line through the middle of a scattered cloud could have total error = 0 despite being terrible.

Sum of errors

\(\sum e_i\)

Cancels out — useless

Absolute errors

\(\sum |e_i|\)

Works, but not differentiable at 0

Squared errors

\(\sum e_i^2\)

Always positive, differentiable, penalizes big errors more

Linear Regression

Cost Function: Mean Squared Error

How do we measure "best fit"?

By minimizing the average squared distance between predictions and actual values.

$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$
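As a quick check, this cost can be computed directly for our fitted line and for an untrained model (a NumPy sketch using the house data above):

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([150, 200, 250, 300, 350], dtype=float)

def mse_cost(w0, w1, x, y):
    """J(w0, w1) = (1/2m) * sum((y_hat - y)^2)."""
    y_hat = w0 + w1 * x
    return np.mean((y_hat - y) ** 2) / 2

print(round(mse_cost(50, 0.1, x, y), 6))  # 0.0 — the fitted line is (numerically) exact
print(mse_cost(0, 0, x, y))               # 33750.0 — the cost at w0 = w1 = 0
```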

SECTION 2

Gradient Descent

Gradient Descent

What is Gradient Descent?

Definition

An iterative optimization algorithm that finds the minimum of a function by repeatedly taking steps proportional to the negative of the gradient.

$$w := w - \alpha \frac{\partial J}{\partial w}$$

Where \(\alpha\) is the learning rate (step size)

Gradient Descent

Why Do We Need Gradient Descent?

Can't we just use a formula?

For simple linear regression with 1 feature, yes — we used the closed-form formula to get \(w_1 = 0.1\). But what about...

Formula fails when...

  • You have 100+ features (matrix inversion is \(O(n^3)\))
  • The model is non-linear (neural networks)
  • Dataset is too large for memory

GD works for...

  • Any differentiable function
  • Any number of parameters
  • Any dataset size (mini-batches)
  • Linear models and neural networks
Gradient Descent

The Calculus You Need: 3 Rules

What is a derivative?

The derivative \(\frac{df}{dx}\) tells you the slope (rate of change) of \(f\) at any point. Positive slope = function is increasing. Negative slope = decreasing.

Rule | Formula | Example
Power Rule | \(\frac{d}{dx}[x^n] = nx^{n-1}\) | \(\frac{d}{dx}[x^2] = 2x\)
Constant Multiple Rule | \(\frac{d}{dx}[c \cdot f(x)] = c \cdot f'(x)\) | \(\frac{d}{dx}[3x^2] = 6x\)
Chain Rule | \(\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)\) | \(\frac{d}{dx}[(2x+1)^2] = 2(2x+1) \cdot 2\)
Gradient Descent

Intuition: Rolling Downhill

Imagine

You're standing on a hillside in dense fog. You can only feel the slope under your feet.

Strategy: Always step in the steepest downhill direction. Eventually, you reach the valley (minimum).

Cost function parabola J(w) = (w-3) squared showing a ball rolling downhill toward the minimum
Gradient Descent

Why the Negative Gradient?

Diagram: to the left of the minimum of \(J(w)\) the gradient is negative, so \(-\)gradient is positive and we step RIGHT ✓; to the right the gradient is positive, so \(-\)gradient is negative and we step LEFT ✓ — both moves head toward the minimum.

If gradient is negative

Subtracting a negative number = adding → we move right (toward the minimum)

If gradient is positive

Subtracting a positive number = decreasing → we move left (toward the minimum)

Gradient Descent

Use Cases for Gradient Descent

Training Neural Networks

Optimize millions of weights via backpropagation

Linear/Logistic Regression

Find optimal parameters for prediction models

Image Recognition

Train CNNs for object detection and classification

Natural Language Processing

Train transformers for language understanding

Gradient Descent

The Gradient Descent Algorithm

Gradient Descent
1. Initialize weights w randomly
2. Repeat until convergence:
   a. Compute gradient: g = dJ/dw
   b. Update weights:  w = w - α × g
3. Return w
Gradient Descent

Deriving the MSE Gradient — Setup

Our cost function (MSE)

$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$$

Substitute the model

Since \(\hat{y}_i = w_0 + w_1 x_i\), we can write:

$$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i)^2$$

We need two partial derivatives

\(\frac{\partial J}{\partial w_1}\) — how does the cost change when we adjust the slope?

\(\frac{\partial J}{\partial w_0}\) — how does the cost change when we adjust the intercept?

Gradient Descent

Deriving \(\frac{\partial J}{\partial w_1}\) — Step by Step

Step 1: Apply the derivative to the sum

$$\frac{\partial J}{\partial w_1} = \frac{1}{2m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\left[(w_0 + w_1 x_i - y_i)^2\right]$$

Step 2: Chain rule — derivative of \((\text{something})^2\)

$$= \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i)$$

The inner derivative w.r.t. \(w_1\): \(\frac{\partial}{\partial w_1}(w_0 + w_1 x_i - y_i) = x_i\)

Step 3: Simplify — the 2 and \(\frac{1}{2}\) cancel!

$$= \frac{1}{m}\sum_{i=1}^{m}(w_0 + w_1 x_i - y_i) \cdot x_i$$

Gradient Descent

Deriving \(\frac{\partial J}{\partial w_0}\) — Step by Step

Same process, but the inner derivative changes

$$\frac{\partial J}{\partial w_0} = \frac{1}{2m}\sum_{i=1}^{m} 2(w_0 + w_1 x_i - y_i) \cdot \frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i)$$

Key difference

\(\frac{\partial}{\partial w_0}(w_0 + w_1 x_i - y_i) = \mathbf{1}\)

Compare: for \(w_1\) it was \(x_i\)

Result

$$\frac{\partial J}{\partial w_0} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$

No \(x_i\) — just the average error!

Gradient Descent

The Complete Update Rules

Gradient Descent for Linear Regression

$$w_1 := w_1 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i) \cdot x_i$$

$$w_0 := w_0 - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)$$

What we derived

  • Start from the cost function \(J\)
  • Take partial derivatives using chain rule
  • Get the gradients for each parameter
  • Subtract them (scaled by \(\alpha\))

Two equations — that's it!

These two equations are all you need to train linear regression. Repeat until convergence.
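The two update rules translate directly into NumPy. This is a sketch, not from the slides: it standardizes \(x\) first so an ordinary learning rate works (the worked example that follows instead keeps raw features and uses a tiny \(\alpha\)):

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([150, 200, 250, 300, 350], dtype=float)

# Standardize x so a normal learning rate works (see the pitfalls slide)
x_s = (x - x.mean()) / x.std()

w0, w1, alpha = 0.0, 0.0, 0.1
for _ in range(1000):
    err = (w0 + w1 * x_s) - y            # y_hat - y
    w0 -= alpha * err.mean()             # the w0 update rule
    w1 -= alpha * (err * x_s).mean()     # the w1 update rule

# In standardized units the optimum is w0 = mean(y) and w1 = 0.1 * x.std()
print(round(w0, 1), round(w1, 1))  # 250.0 70.7
```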

Gradient Descent

GD for Linear Regression — Worked Example

Using our house price data (5 points)

Start: \(w_0 = 0,\; w_1 = 0,\; \alpha = 0.0000001\) (tiny LR because features are large)

Iteration 0: Compute predictions (\(\hat{y} = 0 + 0 \cdot x = 0\) for all)

Errors: \(\hat{y}_i - y_i = [0-150,\; 0-200,\; 0-250,\; 0-300,\; 0-350] = [-150, -200, -250, -300, -350]\)

\(\frac{\partial J}{\partial w_0}\)

\(= \frac{1}{5}(-150-200-250-300-350)\)

\(= \frac{-1250}{5} = \mathbf{-250}\)

\(\frac{\partial J}{\partial w_1}\)

\(= \frac{1}{5}[(-150)(1000) + (-200)(1500) + (-250)(2000) + (-300)(2500) + (-350)(3000)]\)

\(= \frac{-2{,}750{,}000}{5} = \mathbf{-550{,}000}\)
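These iteration-0 gradients can be verified in a few lines (NumPy sketch, matching the numbers above):

```python
import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
y = np.array([150, 200, 250, 300, 350], dtype=float)

w0, w1 = 0.0, 0.0
err = (w0 + w1 * x) - y      # [-150, -200, -250, -300, -350]

grad_w0 = err.mean()         # dJ/dw0 = (1/m) * sum(err)
grad_w1 = (err * x).mean()   # dJ/dw1 = (1/m) * sum(err * x)
print(grad_w0, grad_w1)      # -250.0 -550000.0
```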

Gradient Descent

GD for Linear Regression — Convergence

Iteration | \(w_0\) | \(w_1\) | \(J\) (cost)
0 | 0.000 | 0.000 | 33,750
10 | 0.003 | 0.042 | 10,240
100 | 0.28 | 0.084 | 1,056
1,000 | 12.4 | 0.097 | 28.7
10,000 | 45.8 | 0.100 | 0.35
~50,000 | 50.0 | 0.100 | ≈ 0

(Illustrative trajectory — with raw, unscaled features the intercept \(w_0\) converges very slowly; feature scaling fixes this, see Common Pitfalls.)
Gradient Descent

Worked Example: Setup

Problem

Minimize \(J(w) = (w - 3)^2\)

Settings

  • Starting point: \(w_0 = 0\)
  • Learning rate: \(\alpha = 0.2\)
  • Gradient: \(\frac{dJ}{dw} = 2(w - 3)\)

Expected minimum

\(J(w)\) is minimized when \(w = 3\). Let's see if gradient descent finds it!

Gradient Descent

Step-by-Step Iterations

Iter | \(w\) | \(\frac{dJ}{dw} = 2(w-3)\) | Update | New \(w\)
0 | 0.000 | \(2(0 - 3) = -6.0\) | \(0 - 0.2(-6.0)\) | 1.200
1 | 1.200 | \(2(1.2 - 3) = -3.6\) | \(1.2 - 0.2(-3.6)\) | 1.920
2 | 1.920 | \(2(1.92 - 3) = -2.16\) | \(1.92 - 0.2(-2.16)\) | 2.352
3 | 2.352 | \(2(2.352 - 3) = -1.296\) | \(2.352 - 0.2(-1.296)\) | 2.611
4 | 2.611 | \(2(2.611 - 3) = -0.778\) | \(2.611 - 0.2(-0.778)\) | 2.767
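The whole table can be reproduced with a short loop (Python sketch):

```python
# Gradient descent on J(w) = (w - 3)^2 with alpha = 0.2, starting at w = 0
w, alpha = 0.0, 0.2
history = []
for _ in range(5):
    grad = 2 * (w - 3)        # dJ/dw
    w -= alpha * grad
    history.append(round(w, 3))

print(history)  # [1.2, 1.92, 2.352, 2.611, 2.767]
```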
Gradient Descent

Gradient Descent in Action

Gradient descent convergence showing 5 iterations on J(w) = (w-3) squared, converging from w=0 toward w=3
Gradient Descent

Effect of Learning Rate \(\alpha\)

Too Small (\(\alpha = 0.01\))

Learning rate too small: slow convergence

Slow convergence

Just Right (\(\alpha = 0.2\))

Learning rate just right: steady convergence

Optimal convergence

Too Large (\(\alpha = 0.9\))

Learning rate too large: overshooting and divergence

Overshoots — may diverge!

Gradient Descent

Common Pitfalls

Local Minima

Non-convex functions have multiple valleys. GD might get stuck in a shallow one instead of finding the deepest.

Saddle Points

Points where gradient = 0 but it's not a minimum. The surface curves up in one direction and down in another.

Feature Scaling

If features have very different ranges (e.g., age 0–100 vs salary 0–1M), GD zig-zags instead of going straight to the minimum.

When to Stop?

Common criteria: gradient < threshold, cost change < \(\epsilon\), or fixed number of iterations.

Gradient Descent

Gradient Descent Variants

Variant | Data per Step | Speed | Stability
Batch GD | All samples | Slow | Very stable
Stochastic GD | 1 sample | Fast | Noisy
Mini-batch GD | \(k\) samples | Balanced | Balanced
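The three variants differ only in how much data each update sees. A mini-batch sketch (not from the slides; `batch_size = m` recovers batch GD, `batch_size = 1` is SGD), shown on the standardized house data:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(x, y, alpha=0.05, batch_size=2, epochs=2000):
    """Mini-batch GD for y_hat = w0 + w1*x (a sketch, not from the slides)."""
    w0, w1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        idx = rng.permutation(m)              # reshuffle each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            err = (w0 + w1 * x[batch]) - y[batch]
            w0 -= alpha * err.mean()
            w1 -= alpha * (err * x[batch]).mean()
    return w0, w1

x = np.array([1000., 1500., 2000., 2500., 3000.])
y = np.array([150., 200., 250., 300., 350.])
x_s = (x - x.mean()) / x.std()                # standardize, as before
w0, w1 = minibatch_gd(x_s, y)
print(round(w0, 1), round(w1, 1))             # same optimum as full-batch GD
```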
SECTION 3

Logistic Regression

Logistic Regression

Why Not Linear Regression for Classification?

Plot: a line fit to binary 0/1 labels produces predictions like \(\hat{y} = 1.3\) and \(\hat{y} = -0.2\) — outside [0, 1]!

Problem 1

Linear regression can predict -0.2 or 1.3 — what does a negative probability mean?

Problem 2

A single outlier can shift the entire line, changing the decision boundary dramatically.

Logistic Regression

What is Logistic Regression?

Definition

A classification algorithm that models the probability of a binary outcome using the sigmoid (logistic) function.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Logistic Regression

The Sigmoid Function

Sigmoid function S-curve with shaded regions: predict 0 below 0.5 threshold, predict 1 above
Logistic Regression

Sigmoid — Computing by Hand

\(z\) | \(e^{-z}\) | \(1 + e^{-z}\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Interpretation
\(-3\) | \(e^3 = 20.09\) | \(21.09\) | 0.047 | Very likely class 0
\(-1\) | \(e^1 = 2.718\) | \(3.718\) | 0.269 | Probably class 0
\(0\) | \(e^0 = 1\) | \(2\) | 0.500 | Coin flip!
\(1\) | \(e^{-1} = 0.368\) | \(1.368\) | 0.731 | Probably class 1
\(3\) | \(e^{-3} = 0.050\) | \(1.050\) | 0.953 | Very likely class 1
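Each row can be reproduced directly from the definition (Python sketch):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-3, -1, 0, 1, 3):
    print(z, round(sigmoid(z), 3))
# -3 -> 0.047 | -1 -> 0.269 | 0 -> 0.5 | 1 -> 0.731 | 3 -> 0.953
```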
Logistic Regression

From Probability to Decision

1

Linear Combination

\(z = \mathbf{w}^T\mathbf{x} + b\)

Can be any real number

2

Apply Sigmoid

\(p = \sigma(z)\)

Squished to (0, 1)

3

Apply Threshold

\(p \geq 0.5 \rightarrow\) Class 1

\(p < 0.5 \rightarrow\) Class 0

Number line from 0 to 1 with the threshold at 0.5: Predict 0 (Fail) below, Predict 1 (Pass) above
Logistic Regression

Use Cases for Logistic Regression

Email Spam Detection

Classify emails as spam or not spam

Medical Diagnosis

Predict disease presence from symptoms

Fraud Detection

Flag suspicious credit card transactions

Student Pass/Fail

Predict outcome from study hours

Logistic Regression

Worked Example: Setup

Problem

Predict pass/fail from study hours

Hours \(x\) | 1 | 2 | 3 | 4 | 5 | 6 | 7
Result | Fail | Fail | Fail | Pass | Pass | Pass | Pass
Logistic Regression

Forward Pass Computation

Assume trained parameters \(w = 1.5\), \(b = -5.0\), so \(z = 1.5x - 5.0\):

Hours \(x\) | \(z = 1.5x - 5.0\) | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | Prediction
2 | \(1.5(2) - 5 = -2.0\) | \(\frac{1}{1+e^{2.0}} = 0.119\) | Fail
3 | \(1.5(3) - 5 = -0.5\) | \(\frac{1}{1+e^{0.5}} = 0.378\) | Fail
4 | \(1.5(4) - 5 = 1.0\) | \(\frac{1}{1+e^{-1.0}} = 0.731\) | Pass
5 | \(1.5(5) - 5 = 2.5\) | \(\frac{1}{1+e^{-2.5}} = 0.924\) | Pass
Logistic Regression

Loss: Binary Cross-Entropy

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

Logistic Regression

Decision Boundary

Logistic regression decision boundary at 3.33 study hours, with fail points below and pass points above
Logistic Regression

Training Logistic Regression

Gradient of cross-entropy

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)x_j^{(i)}$$

Surprise! This looks identical to the linear regression gradient — but \(\hat{y}\) uses the sigmoid.

Training recipe

  1. Compute predictions: \(\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\)
  2. Compute loss: \(J\) (cross-entropy)
  3. Compute gradients: \(\frac{\partial J}{\partial w}\)
  4. Update weights: \(w := w - \alpha \nabla J\)
  5. Repeat until convergence
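The recipe above can be run end-to-end on the study-hours data. A NumPy sketch, not from the slides — the learned \(w\) and \(b\) will differ from the fixed \(1.5\) and \(-5.0\) used earlier, but the decision boundary should land between 3 and 4 hours, near the slide's 3.33:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Study-hours data from the worked example: Fail (0) for 1-3 h, Pass (1) for 4-7 h
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 1], dtype=float)

w, b, alpha = 0.0, 0.0, 0.1
for _ in range(20_000):
    err = sigmoid(w * x + b) - y       # y_hat - y: same gradient form as linear regression
    w -= alpha * (err * x).mean()
    b -= alpha * err.mean()

boundary = -b / w                      # where wx + b = 0, i.e. p = 0.5
print(round(boundary, 2))              # between the last Fail (3 h) and first Pass (4 h)
```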
Logistic Regression

Linear vs Logistic Regression

Feature | Linear Regression | Logistic Regression
Output | Continuous value | Probability (0–1)
Activation | None (identity) | Sigmoid
Loss Function | Mean Squared Error | Binary Cross-Entropy
Use Case | Regression (predict amount) | Classification (predict class)
Decision | No threshold | Threshold at 0.5
SECTION 4

Regularization

Regularization

What is Regularization?

Definition

A technique that adds a penalty term to the loss function to prevent overfitting by discouraging overly complex models.

$$J_{\text{reg}} = J_{\text{original}} + \lambda \cdot R(\mathbf{w})$$

Where \(\lambda\) controls regularization strength

Think of it as...

Regularization is like a teacher telling a student: "I don't just want the right answer — I want a simple explanation." Simpler models generalize better.

Regularization

The Overfitting Analogy

The Memorizer

Memorizes every answer from past exams word-for-word

100%
Practice exams
45%
New exam

The Understander

Learns the underlying concepts and problem-solving strategies

85%
Practice exams
82%
New exam
Regularization

Underfitting vs. Just Right vs. Overfitting

Underfitting

Underfitting: straight line poorly fitting curved data

High Bias — too simple

Just Right

Good fit: smooth curve following the data trend

Good Balance

Overfitting

Overfitting: wiggly curve passing through every data point

High Variance — too complex

Regularization

Types of Regularization

L1 Regularization (Lasso)

$$J_{L1} = J + \lambda \sum_{i} |w_i|$$

Encourages sparsity — some weights become exactly 0, effectively performing feature selection.

L2 Regularization (Ridge)

$$J_{L2} = J + \lambda \sum_{i} w_i^2$$

Shrinks all weights toward 0 but never exactly 0. Smoother, more stable solutions.

Regularization

Beyond L1 & L2: More Techniques

Elastic Net (L1 + L2)

$$J_{EN} = J + \lambda_1 \sum|w_i| + \lambda_2 \sum w_i^2$$

Combines the best of both: feature selection and weight shrinkage. Used in scikit-learn's ElasticNet.

Dropout (Neural Networks)

Randomly "turn off" neurons during training (e.g., 50% chance). Forces the network to not rely on any single neuron.

Like studying with random pages removed — you learn the core concepts, not surface patterns.

Regularization

Use Cases for Regularization

Feature Selection (L1)

Automatically remove irrelevant features by zeroing their weights

Neural Network Training

Prevent deep networks from memorizing training data

Reducing Complexity

Keep models simple and interpretable

Improving Generalization

Models perform better on unseen test data

Regularization

Worked Example: L2 Regularization

Applying L2 penalty to our linear regression

Original model: \(w_0 = 50\) (bias), \(w_1 = 0.1\) (weight), and \(\lambda = 0.01\)

Without regularization

\(J = 0\) (perfect fit)

With L2 regularization

\(J_{\text{reg}} = 0 + 0.01 \times (0.1^2) = 0 + 0.01 \times 0.01 = \mathbf{0.0001}\)
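The penalty computation is one line (a sketch; as on the slide, the bias \(w_0\) is conventionally left unpenalized):

```python
def l2_cost(j_original, weights, lam):
    """J_reg = J + lambda * sum(w_i^2), with the bias excluded from the penalty."""
    return j_original + lam * sum(w * w for w in weights)

# Perfect-fit model from the worked example: J = 0, w1 = 0.1, lambda = 0.01
print(round(l2_cost(0.0, [0.1], 0.01), 6))  # 0.0001
```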

Regularization

Effect of \(\lambda\)

Effect of regularization parameter lambda: lambda=0 overfits, lambda=0.1 is optimal, lambda=10 underfits
Regularization

Regularization Summary

Key Takeaways

  • Regularization controls model complexity via a penalty term
  • L1 (Lasso): Feature selection (sparse weights)
  • L2 (Ridge): Weight shrinkage (smooth models)
  • \(\lambda\) balances fit quality vs. complexity
  • Elastic Net: Combines L1 + L2 (best of both)

What's Next?

We now have the classical building blocks:

  • Regression (linear & logistic)
  • Optimization (gradient descent)
  • Regularization (preventing overfitting)

Let's look at the history that led from these ideas to neural networks...

SECTION 5

A Brief History of Neural Networks

From the 1940s to Today

History

The Early Years (1943–1969)

1943

McCulloch & Pitts — First mathematical model of an artificial neuron. Showed neurons could compute logical functions.

1958

Frank Rosenblatt's Perceptron — First trainable neural network, implemented in hardware. Could learn to classify simple patterns.

1969

Minsky & Papert — Published "Perceptrons", proving single-layer networks cannot solve XOR. Revealed fundamental limitations.

History

The Resurgence (1986–2006)

1986

Rumelhart, Hinton & Williams — Popularized backpropagation, enabling training of multi-layer networks. The key breakthrough.

1989

Yann LeCun — Applied CNNs to handwritten digit recognition (MNIST). First practical deep learning success.

1997

Hochreiter & Schmidhuber — Invented LSTM for sequential data, solving the vanishing gradient problem.

2006

Geoffrey Hinton — Deep Belief Networks. The term "Deep Learning" enters mainstream AI vocabulary.

History

The Deep Learning Era (2012–Present)

2012

AlexNet wins ImageNet by a huge margin. 8 layers, GPU training. The "Big Bang" of modern deep learning.

2014

GANs (Goodfellow) — Generative Adversarial Networks can create realistic images from noise.

2017

"Attention Is All You Need" — The Transformer architecture revolutionizes NLP and later all of AI.

2022–25

ChatGPT & LLMs — Large language models reach mainstream. AI becomes a daily tool for millions.

175B
Parameters in GPT-3
100M+
ChatGPT users in 2 months
History

Why Neural Networks Work Now

Big Data

Internet, smartphones, and sensors generate massive datasets. Deep networks need lots of data to learn effectively.

Compute Power

GPUs, TPUs, and cloud computing make training billion-parameter models feasible. 10,000x faster than 2000s hardware.

Better Algorithms

ReLU, batch normalization, dropout, residual connections, Adam optimizer — all invented in the 2010s.

History

From Classical ML to Neural Networks

Progression: Linear Regression → (+ sigmoid) → Logistic Regression → (+ threshold) → Perceptron → (+ hidden layers) → MLP → (+ more depth) → Deep Networks
History

Think-Pair-Share

Discussion Question

Why did neural networks fail in the 1970s but succeed in the 2010s?

What changed in terms of data, compute, and algorithms?

Hint: Data

How much data existed in the 1970s vs. the age of the internet?

Hint: Compute

What hardware breakthrough made matrix multiplication 100× faster?

Hint: Algorithms

What training method was missing before 1986?

Answer

1970s: Limited data, no GPUs, only single-layer networks (couldn't solve XOR). 2010s: Internet-scale data, GPU parallelism (NVIDIA CUDA), and breakthroughs like backpropagation, ReLU, and batch normalization made deep networks trainable.

SECTION 6

The Simple Neuron

The Perceptron

The Simple Neuron

What is an Artificial Neuron?

Definition

A computational unit that takes weighted inputs, sums them, adds a bias, and applies an activation function.

$$z = \sum_{i=1}^{n} w_i x_i + b, \quad a = f(z)$$

Diagram: inputs x₁, x₂, x₃ with weights w₁, w₂, w₃ feed the sum Σ + b = z, then the activation f(z), producing the output ŷ — Inputs → Sum → Activation → Output
The Simple Neuron

Biological vs. Artificial

Diagram of a biological neuron: dendrites → soma → axon hillock → axon → output signal

Mapping

  • Dendrites → Inputs \(x_1, x_2, \ldots\)
  • Synaptic strength → Weights \(w_1, w_2, \ldots\)
  • Soma (cell body) → Summation \(\Sigma\)
  • Axon hillock → Activation function \(f\)
  • Axon → Output \(\hat{y}\)

Important caveat

Real neurons are vastly more complex — they use timing, chemical signals, and recurrent connections. The artificial neuron is a very rough approximation.

The Simple Neuron

Common Activation Functions

Sigmoid

Sigmoid activation function curve from 0 to 1

\(f(z) = \frac{1}{1+e^{-z}}\)

Range: (0, 1)

ReLU

ReLU activation function: zero for negative, linear for positive

\(f(z) = \max(0, z)\)

Range: [0, ∞)

Tanh

Tanh activation function curve from -1 to 1

\(f(z) = \tanh(z)\)

Range: (-1, 1)

The Simple Neuron

What Can a Single Neuron Do?

Key Insight

A single neuron with sigmoid activation is logistic regression! Same weighted sum + same activation function.

Capabilities

  • Binary classification
  • Logic gates: AND, OR, NOT
  • Any linearly separable problem

Limitations

  • Cannot solve XOR
  • Cannot learn non-linear boundaries
  • Only one decision boundary (a line/plane)
The Simple Neuron

Worked Example: AND Gate

Neuron

\(w_1 = 1,\; w_2 = 1,\; b = -1.5\), step activation \(f(z) = \begin{cases}1 & z \geq 0\\0 & z < 0\end{cases}\)

\(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 1.5\) | \(f(z)\) | AND | Match?
0 | 0 | \(0 + 0 - 1.5 = -1.5\) | 0 | 0 | ✓
0 | 1 | \(0 + 1 - 1.5 = -0.5\) | 0 | 0 | ✓
1 | 0 | \(1 + 0 - 1.5 = -0.5\) | 0 | 0 | ✓
1 | 1 | \(1 + 1 - 1.5 = 0.5\) | 1 | 1 | ✓
The Simple Neuron

OR Gate — Change the Bias!

Neuron

\(w_1 = 1,\; w_2 = 1,\; b = -0.5\), same step activation

\(x_1\) | \(x_2\) | \(z = x_1 + x_2 - 0.5\) | \(f(z)\) | OR | Match?
0 | 0 | \(0 + 0 - 0.5 = -0.5\) | 0 | 0 | ✓
0 | 1 | \(0 + 1 - 0.5 = 0.5\) | 1 | 1 | ✓
1 | 0 | \(1 + 0 - 0.5 = 0.5\) | 1 | 1 | ✓
1 | 1 | \(1 + 1 - 0.5 = 1.5\) | 1 | 1 | ✓
The Simple Neuron

NOT Gate — Single Input

Neuron

\(w_1 = -1,\; b = 0.5\), step activation. Only one input.

\(x\) | \(z = -x + 0.5\) | \(f(z)\) | NOT \(x\) | Match?
0 | \(-0 + 0.5 = 0.5\) | 1 | 1 | ✓
1 | \(-1 + 0.5 = -0.5\) | 0 | 0 | ✓
  • AND: \(w_1 = w_2 = 1,\; b = -1.5\)
  • OR: \(w_1 = w_2 = 1,\; b = -0.5\)
  • NOT: \(w = -1,\; b = 0.5\)
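All three gates can be checked with a tiny sketch using the step activation defined above:

```python
def neuron(inputs, weights, bias):
    """Single neuron with step activation: 1 if z >= 0 else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0

def AND(x1, x2): return neuron([x1, x2], [1, 1], -1.5)
def OR(x1, x2):  return neuron([x1, x2], [1, 1], -0.5)
def NOT(x):      return neuron([x], [-1], 0.5)

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # [1, 0]
```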
The Simple Neuron

The XOR Problem

Can a single neuron compute XOR?

\(x_1\) | \(x_2\) | XOR
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

No! XOR is not linearly separable. No single straight line can separate the 1s from the 0s.

XOR problem: 4 data points showing non-linear separability
Diagram: AND neuron with inputs 1 and 1, weights w = 1 and w = 1, bias b = -1.5 into the sum Σ, giving z = 0.5, step activation, output 1
The Simple Neuron

AND Gate Neuron — Traced

Step-by-step for input (1, 1)

  1. \(z = w_1 x_1 + w_2 x_2 + b\)
  2. \(z = 1(1) + 1(1) + (-1.5)\)
  3. \(z = 2 - 1.5 = 0.5\)
  4. \(f(0.5) = 1\) (since \(0.5 \geq 0\))

Output: 1 ✓ Correct!

The Simple Neuron

We Need More Power

The Limitation

A single neuron can only learn linearly separable patterns. It draws one straight line through feature space.

The Solution

Stack multiple neurons in layers. Each neuron learns a different feature. Together, they can solve XOR and any non-linear problem!

Enter: The Multilayer Perceptron (MLP)

SECTION 7

Multilayer Perceptron

Hidden Layers

Multilayer Perceptron

What is an MLP?

Definition

A feedforward neural network with one or more hidden layers between the input and output layers. Each layer is fully connected to the next.

Structure

Input Layer → Hidden Layer(s) → Output Layer

  • Each connection has a weight
  • Each neuron has a bias and activation function
  • Information flows forward only (no loops)
Multilayer Perceptron

What Hidden Layers Actually Learn

Layer 1: Edges

Detects simple patterns — lines, curves, light/dark transitions

Layer 2: Parts

Combines edges into shapes — eyes, noses, wheels, letters

Layer 3: Objects

Combines parts into concepts — faces, cars, words

Multilayer Perceptron

MLP Architecture

Diagram: MLP with inputs x₁, x₂, hidden neurons h₁, h₂, h₃, and output ŷ — Input (2) → Hidden (3) → Output (1)
Multilayer Perceptron

Use Cases for MLP

XOR & Non-Linear Classification

Solve problems that single neurons cannot

Handwritten Digit Recognition

MNIST dataset — the "Hello World" of deep learning

Function Approximation

Universal approximation theorem: can approximate any continuous function

Tabular Data Prediction

Structured data with complex feature interactions

Multilayer Perceptron

Solving XOR with an MLP

Network

2 inputs, 2 hidden neurons (step activation), 1 output

Hidden neuron \(h_1\) (OR-like)

\(w_{11}=1,\; w_{12}=1,\; b_1=-0.5\)

\(h_1 = f(x_1 + x_2 - 0.5)\)

Hidden neuron \(h_2\) (AND)

\(w_{21}=1,\; w_{22}=1,\; b_2=-1.5\)

\(h_2 = f(x_1 + x_2 - 1.5)\)

Output neuron

\(w_1 = 1,\; w_2 = -2,\; b = -0.5\)

\(\hat{y} = f(h_1 - 2h_2 - 0.5)\)

Multilayer Perceptron

XOR — Full Computation

\(x_1\) | \(x_2\) | \(h_1 = f(x_1 + x_2 - 0.5)\) | \(h_2 = f(x_1 + x_2 - 1.5)\) | \(\hat{y} = f(h_1 - 2h_2 - 0.5)\) | XOR
0 | 0 | f(-0.5) = 0 | f(-1.5) = 0 | f(-0.5) = 0 | 0
0 | 1 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1
1 | 0 | f(0.5) = 1 | f(-0.5) = 0 | f(0.5) = 1 | 1
1 | 1 | f(1.5) = 1 | f(0.5) = 1 | f(-1.5) = 0 | 0
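The full truth table above can be verified with a few lines (Python sketch of the same 2-2-1 network):

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    """2-2-1 network from the slides: h1 is OR-like, h2 is AND."""
    h1 = step(x1 + x2 - 0.5)           # OR-like boundary
    h2 = step(x1 + x2 - 1.5)           # AND boundary
    return step(h1 - 2 * h2 - 0.5)     # fires when OR is on but AND is off

print([xor_mlp(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```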
Multilayer Perceptron

Backpropagation: The Key Idea

The Problem

We know the output is wrong, but how do we know which hidden weights to blame? The hidden neurons don't have direct targets — only the output does.

Factory Analogy

Imagine a factory assembly line with 3 stations. The final product is defective. You trace backward:

  • Station 3 (output): added the wrong label → fix station 3
  • Station 2 (hidden): provided a bent component → fix station 2
  • Station 1 (hidden): cut the material too short → fix station 1

Backpropagation does exactly this — it traces the error backward through each layer.

Multilayer Perceptron

Training MLPs: Backpropagation

Forward Pass

  1. Feed input through network
  2. Compute each layer's output
  3. Get final prediction \(\hat{y}\)
  4. Compute loss \(J\)

Backward Pass

  1. Compute output error
  2. Propagate error backward
  3. Compute gradients via chain rule
  4. Update all weights

Chain Rule

$$\frac{\partial J}{\partial w^{[l]}} = \frac{\partial J}{\partial a^{[L]}} \cdot \frac{\partial a^{[L]}}{\partial z^{[L]}} \cdot \ldots \cdot \frac{\partial z^{[l]}}{\partial w^{[l]}}$$

Backpropagation = the chain rule applied systematically across layers

Multilayer Perceptron

Backprop Example: Forward Pass

Tiny network: 1 input → 1 hidden (sigmoid) → 1 output (sigmoid)

Input: \(x = 1\), Target: \(y = 1\), Learning rate: \(\alpha = 0.5\), Loss: \(L = \frac{1}{2}(\hat{y} - y)^2\)

Weights: \(w_1 = 0.5,\; b_1 = 0.2\) (hidden)  |  \(w_2 = 0.8,\; b_2 = 0.1\) (output)

Hidden layer

\(z_1 = w_1 x + b_1 = 0.5(1) + 0.2 = 0.7\)

\(h = \sigma(0.7) = \frac{1}{1+e^{-0.7}} = \mathbf{0.668}\)

Output layer

\(z_2 = w_2 h + b_2 = 0.8(0.668) + 0.1 = 0.634\)

\(\hat{y} = \sigma(0.634) = \frac{1}{1+e^{-0.634}} = \mathbf{0.653}\)

Multilayer Perceptron

Backprop Example: Backward Pass

Output layer gradient (\(\frac{\partial L}{\partial w_2}\))

\(\frac{\partial L}{\partial \hat{y}} = \hat{y} - y = 0.653 - 1 = -0.347\)

\(\frac{\partial \hat{y}}{\partial z_2} = \hat{y}(1-\hat{y}) = 0.653 \times 0.347 = 0.2266\)   (sigmoid derivative)

\(\frac{\partial z_2}{\partial w_2} = h = 0.668\)

Chain rule: \(\frac{\partial L}{\partial w_2} = (-0.347)(0.2266)(0.668) = \mathbf{-0.0525}\)

Hidden layer gradient (\(\frac{\partial L}{\partial w_1}\))

Continue the chain through \(w_2\): \(\frac{\partial z_2}{\partial h} = w_2 = 0.8\), \(\frac{\partial h}{\partial z_1} = h(1-h) = 0.668 \times 0.332 = 0.2218\), \(\frac{\partial z_1}{\partial w_1} = x = 1\)

Chain rule: \(\frac{\partial L}{\partial w_1} = (-0.347)(0.2266)(0.8)(0.2218)(1) = \mathbf{-0.01394}\)

Multilayer Perceptron

Backprop Example: Update & Verify

Update weights (\(\alpha = 0.5\))

\(w_2' = 0.8 - 0.5(-0.0525) = \mathbf{0.826}\)

\(w_1' = 0.5 - 0.5(-0.01394) = \mathbf{0.507}\)

Verify: new forward pass

\(z_1' = 0.507(1) + 0.2 = 0.707\)

\(h' = \sigma(0.707) = 0.670\)

\(z_2' = 0.826(0.670) + 0.1 = 0.653\)

\(\hat{y}' = \sigma(0.653) = \mathbf{0.658}\)

Loss before: 0.0603 → Loss after: 0.0585 — one gradient step reduced the loss.
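The entire update can be replayed in a short script (Python sketch; like the slides, it updates only the two weights and uses the squared-error loss):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-1-1 network from the slides; loss L = (1/2)(y_hat - y)^2
x, y, alpha = 1.0, 1.0, 0.5
w1, b1, w2, b2 = 0.5, 0.2, 0.8, 0.1

# Forward pass
h = sigmoid(w1 * x + b1)                  # ≈ 0.668
y_hat = sigmoid(w2 * h + b2)              # ≈ 0.653

# Backward pass (chain rule)
d_z2 = (y_hat - y) * y_hat * (1 - y_hat)  # dL/dz2 through the sigmoid
grad_w2 = d_z2 * h                        # ≈ -0.052
grad_w1 = d_z2 * w2 * h * (1 - h) * x     # ≈ -0.014

# Update the two weights
w2 -= alpha * grad_w2
w1 -= alpha * grad_w1
print(round(w2, 3), round(w1, 3))         # 0.826 0.507
```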
XOR problem solved by MLP with two decision boundary lines
Multilayer Perceptron

How the MLP Solves XOR

What happened?

  • \(h_1\) draws one boundary line
  • \(h_2\) draws another boundary line
  • The output combines them: the region between the lines is class 1
  • Two linear boundaries → one non-linear decision region

The power of hidden layers

They transform the input space into a representation where the problem becomes linearly separable.

SECTION 8

Fully Connected Neural Networks

Going Deep

Fully Connected NN

What is a Fully Connected Neural Network?

Definition

A deep neural network where every neuron in one layer is connected to every neuron in the next layer. Also called a Dense Network or Deep Feedforward Network.

MLP vs FCNN

An MLP with 2+ hidden layers = a Fully Connected Neural Network. "Deep" means multiple hidden layers.

Key property

More layers = more abstraction. Early layers learn simple features, deeper layers learn complex combinations.

Fully Connected NN

FCNN Architecture

Diagram: FCNN with inputs x₁, x₂, x₃ and outputs y₁, y₂ — Input (3) → Hidden 1 (4) → Hidden 2 (4) → Output (2), every neuron connected to every neuron in the next layer
Fully Connected NN

How Many Operations? Thinking About Scale

Every parameter = 1 multiply + 1 add

A forward pass through a network requires one multiplication per weight and one addition per bias. The total cost scales with parameter count.

109K
Parameters (MNIST)
~218K
Arithmetic ops / image
~1.8T
Reported parameters (GPT-4, unofficial estimate)
Fully Connected NN

Use Cases for FCNNs

Image Classification

Flatten pixels into a vector and classify (before CNNs took over)

Natural Language Processing

Process word embeddings for sentiment analysis and text classification

Tabular / Structured Data

FCNNs remain the go-to for structured feature data (customer churn, fraud)

Reinforcement Learning

Policy and value networks in game-playing agents (DQN)

Fully Connected NN

Worked Example: Forward Pass (Layer 1)

FCNN

2 inputs → 2 hidden (ReLU) → 1 output (sigmoid). Input: \(\mathbf{x} = [0.5,\; 0.8]\)

Layer 1 weights & bias

$$W^{[1]} = \begin{bmatrix} 0.2 & 0.4 \\ 0.6 & 0.3 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}$$

Compute \(z^{[1]} = W^{[1]}\mathbf{x} + b^{[1]}\)

\(z_1^{[1]} = 0.2(0.5) + 0.4(0.8) + 0.1 = \mathbf{0.52}\)

\(z_2^{[1]} = 0.6(0.5) + 0.3(0.8) + 0.2 = \mathbf{0.74}\)

\(a^{[1]} = \text{ReLU}(z^{[1]}) = [\max(0, 0.52),\; \max(0, 0.74)] = \mathbf{[0.52,\; 0.74]}\)

Fully Connected NN

Worked Example: Forward Pass (Output)

Layer 2 (output) weights & bias

$$W^{[2]} = \begin{bmatrix} 0.5 & 0.7 \end{bmatrix}, \quad b^{[2]} = 0.1$$

Compute output

\(z^{[2]} = 0.5(0.52) + 0.7(0.74) + 0.1 = 0.26 + 0.518 + 0.1 = \mathbf{0.878}\)

\(a^{[2]} = \sigma(0.878) = \frac{1}{1 + e^{-0.878}} = \mathbf{0.706}\)
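Both layers of this forward pass fit in a few lines of NumPy (a sketch of the worked example above):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Network from the worked example: 2 inputs -> 2 hidden (ReLU) -> 1 output (sigmoid)
x = np.array([0.5, 0.8])
W1 = np.array([[0.2, 0.4],
               [0.6, 0.3]])
b1 = np.array([0.1, 0.2])
W2 = np.array([[0.5, 0.7]])
b2 = np.array([0.1])

a1 = relu(W1 @ x + b1)        # [0.52, 0.74]
a2 = sigmoid(W2 @ a1 + b2)    # ≈ [0.706]
print(np.round(a1, 2), np.round(a2, 3))
```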

Fully Connected NN

Challenges in Training Deep Networks

Vanishing Gradients

Gradients shrink exponentially through many layers, making early layers barely learn.

Exploding Gradients

Gradients grow exponentially, causing unstable weight updates and NaN values.

Overfitting

Large networks can easily memorize training data. Need regularization + dropout.

Computational Cost

More parameters = more computation. GPUs are essential for training.

Fully Connected NN

The Deep Learning Recipe

5 Steps — Every Neural Network Ever

1

Choose Architecture

Layers, neurons, activations

2

Initialize Weights

Xavier or He init

3

Forward Pass

Compute prediction & loss

4

Backward Pass

Compute gradients

5

Update Weights

Repeat from step 3

Fully Connected NN

Activity: Count the Parameters

Exercise

Calculate the total parameters in this FCNN:

  • Input: 784 neurons (28×28 image, flattened)
  • Hidden 1: 128 neurons
  • Hidden 2: 64 neurons
  • Output: 10 neurons (digits 0–9)

Formula hint

Parameters per layer = \((\text{inputs} \times \text{outputs}) + \text{outputs}\). The first term is weights, the second is biases.

Solution

Layer 1: \(784 \times 128 + 128 = 100{,}480\)
Layer 2: \(128 \times 64 + 64 = 8{,}256\)
Layer 3: \(64 \times 10 + 10 = 650\)
Total: 109,386 parameters — and this is considered a small network!
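The formula hint generalizes to any stack of dense layers (Python sketch):

```python
def count_params(layer_sizes):
    """Total weights + biases: (n_in * n_out) + n_out per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_params([784, 128, 64, 10]))  # 109386
```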

Summary

Key Takeaways

Classical Foundations

  • Linear Regression: Predict values with \(\hat{y} = wx + b\); MSE measures fit
  • Gradient Descent: Derived MSE gradients, applied chain rule, optimized iteratively
  • Logistic Regression: Sigmoid maps \(\mathbb{R} \to (0,1)\) for classification
  • Regularization: L1/L2/dropout prevent memorization

Neural Networks

  • Neuron: Weighted sum + activation; computes AND, OR, NOT
  • MLP: Hidden layers solve XOR; backprop trains via chain rule
  • FCNN: Deep networks with 100K+ parameters
  • Recipe: Init → forward → loss → backward → update → repeat
Looking Ahead

What's Next?

Convolutional Neural Networks

Specialized architecture for images — filters learn edges, textures, shapes

Recurrent Neural Networks

Process sequences — text, time series, speech with memory cells

Transformers & Attention

The architecture behind GPT, BERT, and modern AI breakthroughs

Hands-On Implementation

Build and train networks with PyTorch / TensorFlow

End of Lecture

Introduction to Neural Networks

Questions?

CMSC 194.2 • University of the Philippines Cebu