
Module 02: Linear Regression


Linear Regression

CMSC 173 - Module 02

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

What is Linear Regression?

Linear regression models the relationship between a dependent variable and one or more independent variables using a linear function.
Simple Linear Regression: $$y = \theta_0 + \theta_1 x + \epsilon$$

Components

  • $y$ — target/output
  • $x$ — feature/input
  • $\theta_0$ — intercept (bias)
  • $\theta_1$ — slope (weight)
  • $\epsilon$ — error term

Goal

  • Find $\theta_0, \theta_1$ that best fit the data
  • Minimize prediction errors
  • Enable predictions on new data
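As a minimal sketch (assuming NumPy and illustrative, unfitted values for $\theta_0$ and $\theta_1$), the model is just a line evaluated at each input:

```python
import numpy as np

# Illustrative parameters (not fitted to any data)
theta0, theta1 = 2.0, 0.5   # intercept and slope

x = np.array([1.0, 2.0, 3.0])   # feature values
y_hat = theta0 + theta1 * x     # predictions from the linear model
print(y_hat)                    # [2.5 3.  3.5]
```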

The Cost Function

Mean Squared Error (MSE): Measures how well our line fits the data.
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 x^{(i)}) \right)^2$$

Why This Formula?

  • Squared errors: Penalizes large errors more
  • Sum over all points: Considers entire dataset
  • Factor of 1/2: Simplifies derivative

Interpretation

  • $J = 0$ → Perfect fit
  • Large $J$ → Poor fit
  • Goal: Minimize $J$
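The cost function above can be written directly in code (a sketch assuming NumPy; `mse_cost` is an illustrative helper name). A line that passes through every point gives $J = 0$:

```python
import numpy as np

def mse_cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) with the 1/(2m) convention."""
    m = len(x)
    residuals = y - (theta0 + theta1 * x)
    return np.sum(residuals ** 2) / (2 * m)

# Data generated exactly by y = 1 + 2x, so the true line has zero cost
x = np.array([0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x
print(mse_cost(1.0, 2.0, x, y))  # 0.0
```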

Knowledge Check

Think About It

Why do we square the errors instead of just summing them?


Gradient Descent Algorithm

Gradient descent is an iterative optimization algorithm that finds the minimum of a function by following the negative gradient.

Update Rules

$$\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$

$$\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

Learning Rate ($\alpha$): Controls step size. Too large → overshoots. Too small → slow convergence.

Computing the Gradients

The partial derivatives tell us which direction to move each parameter.

Gradient Formulas

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}$$

Where $\hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}$ is the prediction.

Gradient Descent in Action

Each iteration updates the regression line to better fit the data.

Algorithm Steps

  1. Initialize $\theta_0, \theta_1$ (often to 0)
  2. Compute predictions: $\hat{y} = \theta_0 + \theta_1 x$
  3. Calculate gradients
  4. Update parameters
  5. Repeat until convergence
Convergence: Stop when cost $J$ changes very little between iterations.
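The five steps above can be sketched as a single loop (assuming NumPy; the learning rate, tolerance, and iteration cap are illustrative defaults, not tuned values):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, tol=1e-9, max_iters=10_000):
    """Fit y ≈ theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0 = theta1 = 0.0                     # step 1: initialize
    prev_cost = np.inf
    for _ in range(max_iters):
        y_hat = theta0 + theta1 * x           # step 2: predictions
        error = y_hat - y
        d0 = np.sum(error) / m                # step 3: gradients
        d1 = np.sum(error * x) / m
        theta0 -= alpha * d0                  # step 4: update both parameters
        theta1 -= alpha * d1
        cost = np.sum(error ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:       # step 5: stop when J stabilizes
            break
        prev_cost = cost
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                             # noiseless data from y = 1 + 2x
t0, t1 = gradient_descent(x, y)
print(round(t0, 3), round(t1, 3))             # recovers intercept 1 and slope 2
```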

Feature Normalization

Problem: When features have very different scales, gradient descent can be slow or unstable.
Z-score Normalization: $$x_{norm} = \frac{x - \mu}{\sigma}$$

Without Normalization

  • Need very small $\alpha$
  • Slow convergence
  • Risk of divergence

With Normalization

  • Can use larger $\alpha$
  • Faster convergence
  • More stable training
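Z-score normalization is a one-liner (a sketch assuming NumPy; the house sizes are illustrative). The normalized feature always has mean 0 and standard deviation 1:

```python
import numpy as np

def zscore(x):
    """Z-score normalization: subtract the mean, divide by the std dev."""
    return (x - x.mean()) / x.std()

sizes = np.array([50.0, 200.0, 500.0])   # house sizes in sqm
normed = zscore(sizes)
print(normed.mean(), normed.std())        # mean ≈ 0, std ≈ 1
```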

Knowledge Check

Think About It

If house sizes range from 50 to 500 sqm, what would be a normalized value for a 200 sqm house?



Closed-Form Solution (OLS)

Linear regression has an analytical solution — no iteration needed!

Least Squares Formulas

$$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$

$$\theta_0 = \bar{y} - \theta_1 \bar{x}$$

Where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$.
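The two formulas above fit a line with no iteration at all (a sketch assuming NumPy; `ols_fit` is an illustrative helper name):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least squares for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                 # noiseless data from y = 1 + 2x
t0, t1 = ols_fit(x, y)
print(t0, t1)                     # 1.0 2.0
```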

Matrix Form of Linear Regression

Design Matrix: Stack all training examples with a column of 1s for the intercept.
$$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

Normal Equation

$$\theta = (X^T X)^{-1} X^T y$$

This directly computes the optimal parameters in one step!
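The normal equation maps directly onto NumPy (a sketch; in practice a linear solve is preferred over forming the inverse explicitly, since it is faster and numerically more stable):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                 # noiseless data from y = 1 + 2x

# Design matrix: a column of 1s for the intercept, then the feature
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                      # recovers [theta0, theta1] = [1, 2]
```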

Gradient Descent vs Closed-Form

Gradient Descent

  • Pros:
      • Works with large datasets
      • Scales to many features
      • Memory efficient
  • Cons:
      • Requires tuning $\alpha$
      • Many iterations needed

Normal Equation

  • Pros:
      • No hyperparameters
      • One-step solution
      • Exact answer
  • Cons:
      • Slow when the number of features $n$ is large
      • Computing $(X^TX)^{-1}$ costs $O(n^3)$
Rule of thumb: Use normal equation if $n < 10,000$ features, otherwise use gradient descent.

The Loss Surface

The cost function $J(\theta_0, \theta_1)$ forms a 3D surface — gradient descent finds the lowest point.

Key Properties

  • Convex surface: Bowl-shaped, one global minimum
  • Gradient: Points uphill, we go opposite direction
  • Path: Series of steps from initial guess to minimum
For linear regression with MSE, the surface is always convex — gradient descent is guaranteed to find the global minimum.

Knowledge Check

Think About It

Why is convexity important for optimization?


Multiple Linear Regression

Multiple features: Extend to $n$ input variables.
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

Vector Notation

$$h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j$$

where $x_0 = 1$ (bias term)

All the same techniques apply — gradient descent and normal equation work for any number of features.
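As a sketch of that claim (assuming NumPy; the two-feature dataset and coefficients are illustrative), the same normal-equation solve recovers all $n + 1$ parameters at once:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features; true model is y = 3 + 1.0*x1 - 2.0*x2 (no noise)
m = 100
X_raw = rng.normal(size=(m, 2))
y = 3.0 + X_raw @ np.array([1.0, -2.0])

# Prepend x0 = 1 (bias term), then solve the normal equation
X = np.column_stack([np.ones(m), X_raw])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                      # close to [3, 1, -2]
```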

Practical Considerations

Debugging Tips

  • Plot $J$ vs iterations — should decrease
  • Try different learning rates
  • Check for feature scaling issues
  • Look for outliers in data

Common Issues

  • Divergence: $\alpha$ too large
  • Slow convergence: $\alpha$ too small
  • Poor fit: Need more features or different model
Always visualize! Plot predictions vs actual values to assess fit quality.

Summary

Key Takeaways

  • Linear regression models relationships as $y = \theta_0 + \theta_1 x$
  • Cost function (MSE) measures fit quality
  • Gradient descent iteratively minimizes cost
  • Normal equation gives closed-form solution
  • Feature normalization improves convergence
  • Multiple regression extends to many features
Next: Regularization — handling overfitting in linear models.

End of Module 02

Linear Regression

Questions?