
Module 02: Linear Regression


Linear Regression

CMSC 173 - Module 02

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

What is Linear Regression?

Linear regression models the relationship between a dependent variable and one or more independent variables using a linear function.
Simple Linear Regression: $$y = \theta_0 + \theta_1 x + \epsilon$$

Components

  • $y$ — target/output
  • $x$ — feature/input
  • $\theta_0$ — intercept (bias)
  • $\theta_1$ — slope (weight)
  • $\epsilon$ — error term

Goal

  • Find $\theta_0, \theta_1$ that best fit the data
  • Minimize prediction errors
  • Enable predictions on new data
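As a minimal sketch (assuming NumPy and illustrative, unfitted values for $\theta_0$ and $\theta_1$), the model is just a line evaluated at each input:

```python
import numpy as np

# Illustrative parameters (not fitted to any data)
theta0, theta1 = 2.0, 0.5   # intercept and slope

x = np.array([1.0, 2.0, 3.0])   # feature values
y_hat = theta0 + theta1 * x     # predictions from the linear model
print(y_hat)                    # [2.5 3.  3.5]
```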

The Cost Function

Mean Squared Error (MSE): Measures how well our line fits the data.
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 x^{(i)}) \right)^2$$

Why This Formula?

  • Squared errors: Penalizes large errors more
  • Sum over all points: Considers entire dataset
  • Factor of 1/2: Simplifies derivative

Interpretation

  • $J = 0$ → Perfect fit
  • Large $J$ → Poor fit
  • Goal: Minimize $J$
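The cost function above can be written directly in code (a sketch assuming NumPy; `mse_cost` is an illustrative helper name). A line that passes through every point gives $J = 0$:

```python
import numpy as np

def mse_cost(theta0, theta1, x, y):
    """Mean squared error J(theta0, theta1) with the 1/(2m) convention."""
    m = len(x)
    residuals = y - (theta0 + theta1 * x)
    return np.sum(residuals ** 2) / (2 * m)

# Data generated exactly by y = 1 + 2x, so the true line has zero cost
x = np.array([0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x
print(mse_cost(1.0, 2.0, x, y))  # 0.0
```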

Knowledge Check

Think About It

Why do we square the errors instead of just summing them?


Gradient Descent Algorithm

Gradient descent is an iterative optimization algorithm that finds the minimum of a function by following the negative gradient.

Update Rules

$$\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$

$$\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

Learning Rate ($\alpha$): Controls step size. Too large → overshoots. Too small → slow convergence.

Computing the Gradients

The partial derivatives tell us which direction to move each parameter.

Gradient Formulas

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}$$

Where $\hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}$ is the prediction.

Gradient Descent in Action

Each iteration updates the regression line to better fit the data.

Algorithm Steps

  1. Initialize $\theta_0, \theta_1$ (often to 0)
  2. Compute predictions: $\hat{y} = \theta_0 + \theta_1 x$
  3. Calculate gradients
  4. Update parameters
  5. Repeat until convergence
Convergence: Stop when cost $J$ changes very little between iterations.
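The five steps above can be sketched as a single loop (assuming NumPy; the learning rate, tolerance, and iteration cap are illustrative defaults, not tuned values):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, tol=1e-9, max_iters=10_000):
    """Fit y ≈ theta0 + theta1 * x by batch gradient descent."""
    m = len(x)
    theta0 = theta1 = 0.0                     # step 1: initialize
    prev_cost = np.inf
    for _ in range(max_iters):
        y_hat = theta0 + theta1 * x           # step 2: predictions
        error = y_hat - y
        d0 = np.sum(error) / m                # step 3: gradients
        d1 = np.sum(error * x) / m
        theta0 -= alpha * d0                  # step 4: update both parameters
        theta1 -= alpha * d1
        cost = np.sum(error ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:       # step 5: stop when J stabilizes
            break
        prev_cost = cost
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                             # noiseless data from y = 1 + 2x
t0, t1 = gradient_descent(x, y)
print(round(t0, 3), round(t1, 3))             # recovers intercept 1 and slope 2
```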

Feature Normalization

Problem: When features have very different scales, gradient descent can be slow or unstable.
Z-score Normalization: $$x_{norm} = \frac{x - \mu}{\sigma}$$

Without Normalization

  • Need very small $\alpha$
  • Slow convergence
  • Risk of divergence

With Normalization

  • Can use larger $\alpha$
  • Faster convergence
  • More stable training
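Z-score normalization is a one-liner (a sketch assuming NumPy; the house sizes are illustrative). The normalized feature always has mean 0 and standard deviation 1:

```python
import numpy as np

def zscore(x):
    """Z-score normalization: subtract the mean, divide by the std dev."""
    return (x - x.mean()) / x.std()

sizes = np.array([50.0, 200.0, 500.0])   # house sizes in sqm
normed = zscore(sizes)
print(normed.mean(), normed.std())        # mean ≈ 0, std ≈ 1
```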

Knowledge Check

Think About It

If house sizes range from 50 to 500 sqm, what would be a normalized value for a 200 sqm house?



Closed-Form Solution (OLS)

Linear regression has an analytical solution — no iteration needed!

Least Squares Formulas

$$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$

$$\theta_0 = \bar{y} - \theta_1 \bar{x}$$

Where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$.
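The two formulas above fit a line with no iteration at all (a sketch assuming NumPy; `ols_fit` is an illustrative helper name):

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least squares for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                 # noiseless data from y = 1 + 2x
t0, t1 = ols_fit(x, y)
print(t0, t1)                     # 1.0 2.0
```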

Matrix Form of Linear Regression

Design Matrix: Stack all training examples with a column of 1s for the intercept.
$$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$

Normal Equation

$$\theta = (X^T X)^{-1} X^T y$$

This directly computes the optimal parameters in one step!
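The normal equation maps directly onto NumPy (a sketch; in practice a linear solve is preferred over forming the inverse explicitly, since it is faster and numerically more stable):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                 # noiseless data from y = 1 + 2x

# Design matrix: a column of 1s for the intercept, then the feature
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                      # recovers [theta0, theta1] = [1, 2]
```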

Gradient Descent vs Closed-Form

Gradient Descent

  • Pros:
      • Works with large datasets
      • Scales to many features
      • Memory efficient
  • Cons:
      • Requires tuning $\alpha$
      • Many iterations needed

Normal Equation

  • Pros:
      • No hyperparameters
      • One-step solution
      • Exact answer
  • Cons:
      • Slow when the number of features $n$ is large
      • Computing $(X^TX)^{-1}$ costs $O(n^3)$
Rule of thumb: Use normal equation if $n < 10,000$ features, otherwise use gradient descent.

The Loss Surface

The cost function $J(\theta_0, \theta_1)$ forms a 3D surface — gradient descent finds the lowest point.

Key Properties

  • Convex surface: Bowl-shaped, one global minimum
  • Gradient: Points uphill, we go opposite direction
  • Path: Series of steps from initial guess to minimum
For linear regression with MSE, the surface is always convex — gradient descent is guaranteed to find the global minimum.

Knowledge Check

Think About It

Why is convexity important for optimization?


Multiple Linear Regression

Multiple features: Extend to $n$ input variables.
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

Vector Notation

$$h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j$$

where $x_0 = 1$ (bias term)

All the same techniques apply — gradient descent and normal equation work for any number of features.
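As a sketch of that claim (assuming NumPy; the two-feature dataset and coefficients are illustrative), the same normal-equation solve recovers all $n + 1$ parameters at once:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features; true model is y = 3 + 1.0*x1 - 2.0*x2 (no noise)
m = 100
X_raw = rng.normal(size=(m, 2))
y = 3.0 + X_raw @ np.array([1.0, -2.0])

# Prepend x0 = 1 (bias term), then solve the normal equation
X = np.column_stack([np.ones(m), X_raw])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                      # close to [3, 1, -2]
```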

Practical Considerations

Debugging Tips

  • Plot $J$ vs iterations — should decrease
  • Try different learning rates
  • Check for feature scaling issues
  • Look for outliers in data

Common Issues

  • Divergence: $\alpha$ too large
  • Slow convergence: $\alpha$ too small
  • Poor fit: Need more features or different model
Always visualize! Plot predictions vs actual values to assess fit quality.

Summary

Key Takeaways

  • Linear regression models relationships as $y = \theta_0 + \theta_1 x$
  • Cost function (MSE) measures fit quality
  • Gradient descent iteratively minimizes cost
  • Normal equation gives closed-form solution
  • Feature normalization improves convergence
  • Multiple regression extends to many features
Next: Regularization — handling overfitting in linear models.

End of Module 02

Linear Regression

Questions?