Linear Regression
CMSC 173 - Module 02
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
What is Linear Regression?
Linear regression models the relationship between a dependent variable and one or more independent variables using a linear function.
Simple Linear Regression: $$y = \theta_0 + \theta_1 x + \epsilon$$
Components
- $y$ — target/output
- $x$ — feature/input
- $\theta_0$ — intercept (bias)
- $\theta_1$ — slope (weight)
- $\epsilon$ — error term
Goal
- Find $\theta_0, \theta_1$ that best fit the data
- Minimize prediction errors
- Enable predictions on new data
The Cost Function
Mean Squared Error (MSE): Measures how well our line fits the data.
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - (\theta_0 + \theta_1 x^{(i)}) \right)^2$$
Why This Formula?
- Squared errors: Penalizes large errors more
- Sum over all points: Considers entire dataset
- Factor of 1/2: Simplifies derivative
Interpretation
- $J = 0$ → Perfect fit
- Large $J$ → Poor fit
- Goal: Minimize $J$
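The cost above maps directly to a few lines of NumPy. A minimal sketch (the data values are made up: points drawn exactly from $y = 1 + 2x$, so $J = 0$ at the true parameters):

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """MSE cost from the slide: J = (1/2m) * sum((y - (theta0 + theta1*x))^2)."""
    m = len(y)
    predictions = theta0 + theta1 * x
    return np.sum((y - predictions) ** 2) / (2 * m)

# Toy data generated from y = 1 + 2x, so the true parameters give J = 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(compute_cost(1.0, 2.0, x, y))  # 0.0
```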
Knowledge Check
Think About It
Why do we square the errors instead of just summing them?
Squaring ensures all errors are positive (negative errors don't cancel positive ones) and penalizes larger errors more heavily, making the model focus on reducing big mistakes.
Gradient Descent Algorithm
Gradient descent is an iterative optimization algorithm that finds the minimum of a function by following the negative gradient.
Update Rules
$$\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$
$$\theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$
Learning Rate ($\alpha$): Controls step size. Too large → overshoots. Too small → slow convergence.
Computing the Gradients
The partial derivatives tell us which direction to move each parameter.
Gradient Formulas
$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)$$
$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x^{(i)}$$
Where $\hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}$ is the prediction.
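The two gradient formulas share the residual $\hat{y}^{(i)} - y^{(i)}$, so they can be computed together. A sketch with made-up data (points drawn from $y = 3 + 0.5x$, where both gradients should vanish at the true parameters):

```python
import numpy as np

def compute_gradients(theta0, theta1, x, y):
    """Partial derivatives of J w.r.t. theta0 and theta1 (1/m factors included)."""
    m = len(y)
    error = (theta0 + theta1 * x) - y    # y_hat - y, shared by both gradients
    grad0 = np.sum(error) / m
    grad1 = np.sum(error * x) / m
    return grad0, grad1

x = np.array([1.0, 2.0, 3.0])
y = 3.0 + 0.5 * x                        # made-up line y = 3 + 0.5x
```

At the true parameters the residuals are zero, so both gradients are zero, confirming we are at the minimum.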
Gradient Descent in Action
Each iteration updates the regression line to better fit the data.
Algorithm Steps
- Initialize $\theta_0, \theta_1$ (often to 0)
- Compute predictions: $\hat{y} = \theta_0 + \theta_1 x$
- Calculate gradients
- Update parameters
- Repeat until convergence
Convergence: Stop when cost $J$ changes very little between iterations.
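The five steps above fit in one loop. A minimal sketch, with made-up data drawn from $y = 1 + 2x$ and a convergence check on the change in $J$:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=2000, tol=1e-12):
    """Steps 1-5: initialize, predict, compute gradients, update, repeat."""
    theta0 = theta1 = 0.0                      # step 1: initialize to 0
    m = len(y)
    prev_cost = np.inf
    for _ in range(iters):
        error = (theta0 + theta1 * x) - y      # step 2: residuals from predictions
        cost = np.sum(error ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:        # step 5: stop when J barely changes
            break
        prev_cost = cost
        theta0 -= alpha * np.sum(error) / m    # steps 3-4: gradients + update
        theta1 -= alpha * np.sum(error * x) / m
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                              # data generated from y = 1 + 2x
theta0, theta1 = gradient_descent(x, y)
```

On this noiseless toy problem the loop recovers $\theta_0 \approx 1$ and $\theta_1 \approx 2$.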
Feature Normalization
Problem: When features have very different scales, gradient descent can be slow or unstable.
Z-score Normalization: $$x_{norm} = \frac{x - \mu}{\sigma}$$
Without Normalization
- Need very small $\alpha$
- Slow convergence
- Risk of divergence
With Normalization
- Can use larger $\alpha$
- Faster convergence
- More stable training
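Z-score normalization is one line per statistic. A sketch using hypothetical house sizes in sqm (values invented for illustration); keeping $\mu$ and $\sigma$ matters because new inputs must be normalized with the training statistics:

```python
import numpy as np

def zscore(x):
    """Z-score normalization: center on the mean, scale by the std dev."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma, mu, sigma

sizes = np.array([50.0, 150.0, 275.0, 400.0, 500.0])  # hypothetical sqm values
sizes_norm, mu, sigma = zscore(sizes)
```

After normalization the feature has mean 0 and standard deviation 1, so all features move on comparable scales.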
Knowledge Check
Think About It
If house sizes range from 50-500 sqm, what would be a normalized value for a 200 sqm house?
Using z-score: if $\mu = 275$ and $\sigma = 130$, then $x_{norm} = (200 - 275) / 130 \approx -0.58$. The negative value indicates it's below average size.
Closed-Form Solution (OLS)
Linear regression has an analytical solution — no iteration needed!
Least Squares Formulas
$$\theta_1 = \frac{\sum_{i=1}^m (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{\sum_{i=1}^m (x^{(i)} - \bar{x})^2}$$
$$\theta_0 = \bar{y} - \theta_1 \bar{x}$$
Where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$.
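The two formulas translate directly to code. A sketch with made-up data lying approximately on $y = 2x$:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form least-squares slope and intercept from the formulas above."""
    x_bar, y_bar = x.mean(), y.mean()
    theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])   # approximately y = 2x (made-up data)
theta0, theta1 = ols_fit(x, y)
```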
Matrix Form of Linear Regression
Design Matrix: Stack all training examples with a column of 1s for the intercept.
$$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$$
Normal Equation
$$\theta = (X^T X)^{-1} X^T y$$
This directly computes the optimal parameters in one step!
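A sketch of the normal equation on made-up data roughly following $y = 1 + 2x$. Note that in code it is numerically safer to solve the linear system $X^TX\theta = X^Ty$ than to form the inverse explicitly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])           # roughly y = 1 + 2x (made-up)
X = np.column_stack([np.ones_like(x), x])    # design matrix with a column of 1s

# Solve X^T X theta = X^T y rather than computing (X^T X)^{-1} directly
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

The result matches the least-squares formulas from the previous slide exactly; the matrix form is just a compact way to write them.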
Gradient Descent vs Closed-Form
Gradient Descent
- Pros:
- Works with large datasets
- Scales to many features
- Memory efficient
- Cons:
- Requires tuning $\alpha$
- Many iterations needed
Normal Equation
- Pros:
- No hyperparameters
- One-step solution
- Exact answer
- Cons:
- Slow when the number of features $n$ is large
- $(X^TX)^{-1}$ is $O(n^3)$
Rule of thumb: Use normal equation if $n < 10,000$ features, otherwise use gradient descent.
The Loss Surface
The cost function $J(\theta_0, \theta_1)$ forms a 3D surface — gradient descent finds the lowest point.
Key Properties
- Convex surface: Bowl-shaped, one global minimum
- Gradient: Points uphill, we go opposite direction
- Path: Series of steps from initial guess to minimum
For linear regression with MSE, the surface is always convex — gradient descent is guaranteed to find the global minimum.
Knowledge Check
Think About It
Why is convexity important for optimization?
A convex function has only one minimum (the global minimum), so gradient descent will always converge to the best solution regardless of starting point. Non-convex functions may have local minima that trap the algorithm.
Multiple Linear Regression
Multiple features: Extend to $n$ input variables.
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Vector Notation
$$h_\theta(x) = \theta^T x = \sum_{j=0}^{n} \theta_j x_j$$
where $x_0 = 1$ (bias term)
All the same techniques apply — gradient descent and normal equation work for any number of features.
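The normal equation code from earlier works unchanged for multiple features, only the design matrix gets wider. A sketch on synthetic data (the parameter values and random features are made up; targets are noiseless so the recovery is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
features = rng.normal(size=(m, n))
X = np.column_stack([np.ones(m), features])   # x_0 = 1 bias column
true_theta = np.array([5.0, 1.0, -2.0, 0.5])  # made-up theta_0 .. theta_3
y = X @ true_theta                            # noiseless targets

# The same normal equation recovers all n+1 parameters at once
theta = np.linalg.solve(X.T @ X, X.T @ y)
```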
Practical Considerations
Debugging Tips
- Plot $J$ vs iterations — should decrease
- Try different learning rates
- Check for feature scaling issues
- Look for outliers in data
Common Issues
- Divergence: $\alpha$ too large
- Slow convergence: $\alpha$ too small
- Poor fit: Need more features or different model
Always visualize! Plot predictions vs actual values to assess fit quality.
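Recording $J$ at each iteration makes both failure modes visible. A minimal numeric sketch (plotting omitted; the data and learning rates are made up): with a safe $\alpha$ the cost falls every iteration, while a too-large $\alpha$ makes it blow up.

```python
import numpy as np

def gd_with_history(x, y, alpha, iters=50):
    """Run gradient descent, recording J each iteration for debugging."""
    theta0 = theta1 = 0.0
    m = len(y)
    history = []
    for _ in range(iters):
        error = (theta0 + theta1 * x) - y
        history.append(np.sum(error ** 2) / (2 * m))
        theta0 -= alpha * np.sum(error) / m
        theta1 -= alpha * np.sum(error * x) / m
    return history

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
good = gd_with_history(x, y, alpha=0.1)    # J decreases steadily
bad = gd_with_history(x, y, alpha=1.0)     # alpha too large: J diverges
```

Plotting these two histories side by side is usually the fastest way to diagnose a bad learning rate.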
Summary
Key Takeaways
- Linear regression models relationships as $y = \theta_0 + \theta_1 x$
- Cost function (MSE) measures fit quality
- Gradient descent iteratively minimizes cost
- Normal equation gives closed-form solution
- Feature normalization improves convergence
- Multiple regression extends to many features
Next: Regularization — handling overfitting in linear models.
End of Module 02
Linear Regression
Questions?