
Parameter Estimation

CMSC 173 - Module 01

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

What We'll Cover

  • Foundations: what is parameter estimation?
  • Method of Moments: match sample moments to theory
  • Maximum Likelihood: optimal estimation
  • Applications: real-world ML examples

What is Parameter Estimation?

[Figures: normal distribution with estimated parameters; log-likelihood function]
Definition

Inferring unknown distribution parameters from observed data samples.

The Problem:
  • Data: an observed sample $\{x_1, x_2, \ldots, x_n\}$
  • Model: a distribution $f(x|\theta)$ with unknown parameter $\theta$
  • Goal: an estimate $\hat{\theta}$ of $\theta$
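
A minimal NumPy sketch of this setup (the true values, sample size, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nature generates the data with parameters we pretend not to know.
true_mu, true_sigma = 5.0, 2.0
x = rng.normal(true_mu, true_sigma, size=1000)   # observed sample {x_1, ..., x_n}

# Estimate theta = (mu, sigma) from the sample alone.
mu_hat = x.mean()
sigma_hat = x.std()        # 1/n version; its small bias is discussed later
print(mu_hat, sigma_hat)   # close to 5.0 and 2.0
```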

Why It Matters in ML

  • Supervised: Model weights
  • Unsupervised: Cluster parameters
  • Time Series: ARIMA coefficients
  • Deep Learning: Network weights

Estimator Quality Criteria

[Figures: estimator bias; estimator variance]

Desirable Properties

  • Unbiased: $E[\hat{\theta}] = \theta$
  • Consistent: $\hat{\theta} \to \theta$ as $n \to \infty$
  • Efficient: minimum variance among unbiased estimators

Bias-Variance Tradeoff

[Figures: a high-bias fit; a high-variance fit]

Mean Squared Error

$$MSE = Bias^2 + Variance$$
Key Insight

Accepting a small amount of bias can reduce the overall MSE, as the simulation below illustrates.
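
A small simulation sketch of the decomposition, comparing the $1/n$ and $1/(n-1)$ variance estimators for normal data (sample size and repetition count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 4.0, 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mle = samples.var(axis=1, ddof=0)        # divides by n   (slightly biased, lower variance)
unbiased = samples.var(axis=1, ddof=1)   # divides by n-1 (unbiased)

def mse_parts(est):
    bias2 = (est.mean() - sigma2) ** 2
    return bias2 + est.var(), bias2, est.var()

print("1/n     estimator (MSE, Bias^2, Var):", mse_parts(mle))
print("1/(n-1) estimator (MSE, Bias^2, Var):", mse_parts(unbiased))
# For normal data the slightly biased 1/n estimator ends up with the smaller MSE.
```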

Key Notation

[Figures: normal distribution with its parameters; the estimation process]

Variables

  • $X$: Random variable
  • $\theta$: True parameter
  • $\hat{\theta}$: Estimate

Functions

  • $f(x|\theta)$: PDF/PMF
  • $L(\theta|x)$: Likelihood

Understanding Moments

[Figures: first moment (mean); second moment (variance)]

$k$-th Moment

$$m_k = E[X^k]$$
  • $m_1 = \mu$ (mean)
  • $m_2 - m_1^2 = \sigma^2$ (variance; the second central moment $\mu_2$)
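
A quick numerical check of these definitions with NumPy (distribution and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=100_000)

m1 = np.mean(x)           # first raw moment  -> estimates mu
m2 = np.mean(x**2)        # second raw moment -> estimates mu^2 + sigma^2
print(m1, m2 - m1**2)     # roughly 3.0 and 4.0
```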

Method of Moments: Core Idea

Core Principle

Match sample moments to theoretical moments to estimate parameters.

Theory

$m_k(\theta)$

Sample

$\hat{m}_k = \frac{1}{n}\sum x_i^k$

MoM Algorithm

[Figures: sample moments; theoretical moments]
  1. Express moments: $m_k(\theta)$
  2. Calculate sample: $\hat{m}_k$
  3. Set equal: $m_k(\theta) = \hat{m}_k$
  4. Solve for $\hat{\theta}$

MoM: Normal Distribution

Goal

Estimate $\mu$ and $\sigma^2$ for $N(\mu, \sigma^2)$

Theory

$m_1 = \mu$

$m_2 = \mu^2 + \sigma^2$

MoM Estimates

$\hat{\mu} = \bar{x}$

$\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$

MoM: Poisson Distribution

[Figures: Poisson distribution; Poisson MoM estimation]
Poisson($\lambda$)

Theory: $E[X] = \lambda$

MoM Estimate

$$\hat{\lambda} = \bar{x}$$

MoM: Gamma Distribution

Gamma($\alpha, \beta$)

$E[X] = \alpha\beta$, $Var(X) = \alpha\beta^2$

MoM Estimates

$$\hat{\beta} = \frac{\hat{\sigma}^2}{\bar{x}}, \quad \hat{\alpha} = \frac{\bar{x}^2}{\hat{\sigma}^2}$$
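
A NumPy sketch of these formulas, assuming the shape-scale parameterization used above (true values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_true, beta_true = 2.5, 1.5                   # shape, scale
x = rng.gamma(shape=alpha_true, scale=beta_true, size=50_000)

xbar, s2 = x.mean(), x.var()                       # sample mean and (1/n) variance
alpha_hat = xbar**2 / s2                           # shape estimate
beta_hat = s2 / xbar                               # scale estimate
print(alpha_hat, beta_hat)                         # roughly 2.5 and 1.5
```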

MoM Properties

Pros

Simple, consistent, general

Cons

Not statistically efficient; can produce estimates outside the valid parameter space

Maximum Likelihood: Core Idea

[Figures: likelihood function; maximum-likelihood point]
Core Principle

Find parameters that make observed data most likely.

The Likelihood Function


Likelihood

$$L(\theta) = \prod_{i=1}^n f(x_i | \theta)$$

Log-Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i | \theta)$$
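
A short SciPy sketch showing why the log form is preferred numerically (the normal model and values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=1000)

def log_likelihood(mu, sigma, data):
    # Sum of log densities; the raw product of 1000 small densities would underflow.
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

print(log_likelihood(5.0, 2.0, x))   # near the maximum
print(log_likelihood(0.0, 2.0, x))   # much smaller: this parameter fits the data poorly
```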

Finding the MLE

[Figures: log-likelihood curve; MLE optimization]

MLE

$$\hat{\theta}_{MLE} = \arg\max_\theta \ell(\theta)$$

Method: solve the score equation $\frac{d\ell}{d\theta} = 0$ and check that the solution is a maximum

MLE: Normal Distribution

Goal

Estimate $\mu$ and $\sigma^2$ for $N(\mu, \sigma^2)$

MLE Solutions

$$\hat{\mu}_{MLE} = \bar{x}, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2$$

Note

$\hat{\sigma}^2_{MLE}$ divides by $n$, so it is slightly biased downward; the unbiased sample variance divides by $n-1$.

MLE: Poisson Distribution

Log-likelihood

$\ell(\lambda) = (\sum x_i)\log\lambda - n\lambda$ (up to an additive constant that does not depend on $\lambda$)

Score

$\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$

MLE

$$\hat{\lambda} = \bar{x}$$
Note

Same as MoM for Poisson!

MLE: Exponential Distribution

[Figures: exponential distribution; exponential MLE]
Exponential($\lambda$)

$f(x) = \lambda e^{-\lambda x}$

MLE

$$\hat{\lambda} = \frac{1}{\bar{x}}$$

MLE Properties

  • Consistent: $\hat{\theta} \to \theta$ as $n \to \infty$
  • Asymptotically normal
  • Asymptotically efficient: attains the Cramér–Rao bound
  • Invariant: $g(\hat{\theta})$ is the MLE of $g(\theta)$

Fisher Information

Fisher Information

$$I(\theta) = -E\left[\frac{d^2\ell}{d\theta^2}\right]$$

Cramér-Rao Bound

For any unbiased estimator: $Var(\hat{\theta}) \geq \frac{1}{I(\theta)}$

Insight

Higher information = lower variance
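
A quick worked example, using the Poisson log-likelihood from the earlier slide:

$$\ell(\lambda) = \Big(\sum x_i\Big)\log\lambda - n\lambda + \text{const.}, \qquad \frac{d^2\ell}{d\lambda^2} = -\frac{\sum x_i}{\lambda^2}, \qquad I(\lambda) = -E\left[\frac{d^2\ell}{d\lambda^2}\right] = \frac{n}{\lambda}$$

so the Cramér–Rao bound is $\lambda/n$, which equals $Var(\bar{X})$: the MLE $\hat{\lambda} = \bar{x}$ attains the bound.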

Numerical MLE Methods

[Figures: numerical optimization; convergence]
When Needed

No closed-form solution

  • Newton-Raphson
  • Gradient ascent
  • EM algorithm
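
A minimal sketch of numerical MLE with scipy.optimize, using a Gamma model whose shape parameter has no closed-form MLE; the starting values follow the MoM tip later in this module (data and values are illustrative):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.5, scale=1.5, size=5000)

def neg_log_lik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:          # stay inside the valid parameter region
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=alpha, scale=beta))

# MoM estimates make convenient starting values.
start = [x.mean()**2 / x.var(), x.var() / x.mean()]
result = optimize.minimize(neg_log_lik, start, method="Nelder-Mead")
print(result.x)                          # roughly (2.5, 1.5)
```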

MoM vs MLE: Comparison

                MoM        MLE
Speed           Fast       Varies
Efficiency      Lower      Optimal
Complexity      Simple     Complex

Efficiency Comparison

[Figures: MoM variance; MLE variance]

Relative Efficiency

$$ARE = \frac{Var_{MLE}}{Var_{MoM}}$$
  • Normal $\mu$: ARE = 1
  • Normal $\sigma^2$: ARE = 0.5
  • Gamma: MLE wins

When to Use Which?

Use MoM

Quick estimates, starting values, simple distributions

Use MLE

Optimal estimates, inference, model comparison

Pro Tip

Use MoM estimates as starting values for MLE optimization.

Application: Linear Regression

[Figures: linear regression data; fitted regression line]
Model

$y = \beta_0 + \beta_1 x + \epsilon$

Estimates

$\hat{\beta}_1 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$

Under Gaussian errors $\epsilon \sim N(0, \sigma^2)$, these least-squares estimates are also the MLE.
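
A NumPy sketch of these formulas on simulated data (coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=200)   # true beta0 = 1, beta1 = 2

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)                        # roughly 1.0 and 2.0
```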

Application: Logistic Regression

[Figures: binary response data; fitted logistic curve]
Model

$P(Y=1 \mid X) = \frac{1}{1+e^{-(\beta_0+\beta_1 X)}}$

Note

No closed-form solution exists for the MLE; the log-likelihood must be maximized numerically.
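
A minimal sketch of that numerical optimization with scipy.optimize, minimizing the negative Bernoulli log-likelihood on simulated data (in practice statsmodels or scikit-learn would be used):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(7)
x = rng.normal(0, 1, size=500)
p = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * x)))        # true beta0 = -0.5, beta1 = 2
y = rng.binomial(1, p)

def neg_log_lik(beta):
    z = beta[0] + beta[1] * x
    # Bernoulli negative log-likelihood in a numerically stable form.
    return np.sum(np.logaddexp(0.0, z) - y * z)

result = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
print(result.x)                                    # roughly (-0.5, 2.0)
```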

Application: Gaussian Mixtures

[Figures: GMM data; GMM clustering]
GMM

$f(x) = \sum_k \pi_k N(x|\mu_k, \sigma_k^2)$

EM Algorithm

E-step: compute each point's responsibility (posterior probability of belonging to each component)
M-step: update $\pi_k, \mu_k, \sigma_k^2$ by weighted maximum likelihood
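
One way to fit a GMM in Python is scikit-learn's GaussianMixture, which runs EM internally; a minimal sketch, assuming scikit-learn is available (it is not among the tools listed later in this module):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
# Two one-dimensional Gaussian clusters mixed 30/70.
x = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(3.0, 1.0, 700)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_)              # mixing proportions pi_k
print(gmm.means_.ravel())        # component means mu_k
print(gmm.covariances_.ravel())  # component variances sigma_k^2
```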

Application: Time Series (ARIMA)

[Figures: time series data; ARIMA forecast]

MoM

Yule-Walker equations

MLE

State-space likelihood evaluated with the Kalman filter, maximized numerically
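
A sketch with statsmodels' ARIMA class, which maximizes the state-space likelihood, on simulated AR(1) data (argument values are illustrative):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(9)
# Simulate an AR(1) series: y_t = 0.7 * y_{t-1} + noise.
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.normal()

fit = ARIMA(y, order=(1, 0, 0)).fit()   # MLE via the state-space / Kalman-filter likelihood
print(fit.params)                       # constant, AR(1) coefficient (~0.7), noise variance
```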

Robust Estimation

[Figures: standard estimation with outliers; robust estimation]
Problem

MLE is sensitive to outliers: a single extreme observation can pull the estimate arbitrarily far.

Remedies

  • M-estimators
  • Huber loss
  • Trimmed means
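
A tiny illustration of the sensitivity, comparing the sample mean (the Gaussian MLE of location) with the median on data containing one outlier (numbers are made up):

```python
import numpy as np

x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 95.0])   # one gross outlier

print(np.mean(x))     # ~24.2: the Gaussian MLE of location is dragged toward the outlier
print(np.median(x))   # ~10.05: a robust location estimate barely moves
```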

Bootstrap Estimation

[Figures: bootstrap samples; bootstrap distribution]
Principle

Resample from data to estimate uncertainty.

  1. Draw $B$ bootstrap samples by resampling the data with replacement
  2. Compute $\hat{\theta}_b^*$ for each sample
  3. Use the distribution of $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ for standard errors and confidence intervals
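
A minimal NumPy sketch of the nonparametric bootstrap for the standard error of a mean (the sample and $B$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.exponential(scale=2.0, size=80)            # observed sample

B = 2000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)   # draw n points with replacement
    boot_means[b] = resample.mean()

print(boot_means.std())                            # bootstrap standard error of the mean
print(np.percentile(boot_means, [2.5, 97.5]))      # 95% percentile interval
```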

Model Selection Criteria

[Figures: AIC comparison; BIC comparison]

Information Criteria

$AIC = -2\ell + 2k$

$BIC = -2\ell + k\log n$

where $\ell$ is the maximized log-likelihood, $k$ the number of parameters, and $n$ the sample size.

Rule

Lower is better
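
A short sketch computing AIC and BIC for two candidate models of the same data with SciPy (data and candidates are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.gamma(shape=2.0, scale=1.0, size=300)      # positive, right-skewed data

def aic_bic(loglik, k, n):
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

# Candidate 1: exponential (k = 1), MLE scale = sample mean.
ll_exp = np.sum(stats.expon.logpdf(x, scale=x.mean()))
# Candidate 2: normal (k = 2), MLE mean and (1/n) standard deviation.
ll_norm = np.sum(stats.norm.logpdf(x, loc=x.mean(), scale=x.std()))

print("exponential:", aic_bic(ll_exp, k=1, n=x.size))
print("normal:     ", aic_bic(ll_norm, k=2, n=x.size))   # lower values are preferred
```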

Diagnostic Tools

  • Residual plots
  • Q-Q plots
  • Goodness-of-fit tests
  • Cross-validation

Computational Tools

  • Python: scipy, statsmodels
  • R: optim(), maxLik
  • Bayesian: Stan, PyMC
  • DL: PyTorch, TensorFlow

Common Pitfalls

Wrong Distribution

Use EDA and goodness-of-fit tests

Small Samples

Use the bootstrap or Bayesian methods

Outliers

Use robust methods

Overfitting

Use AIC/BIC, cross-validation

Best Practices

[Figures: diagnostic workflow; validation process]
Checklist
  • Start with MoM estimates
  • Use MoM as MLE starting values
  • Validate assumptions
  • Report confidence intervals
  • Check diagnostics

Key Takeaways

MoM

Match moments. Simple, quick, good for starting values.

MLE

Maximize likelihood. Optimal, efficient, asymptotically normal.

Remember

Parameter estimation is fundamental to statistical modeling and ML!

Next Steps

Advanced Topics

GMM, Regularization, Bayesian MCMC

Practice Tools

scipy.optimize, statsmodels, PyMC

Questions?

Ready for Module 02: Linear Regression

End of Module 01

Parameter Estimation

Questions?