
Exploratory Data Analysis (EDA)

CMSC 173 - Module 04

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What is Exploratory Data Analysis?

Definition

EDA is the process of investigating datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
Primary Goals:
  • Understand data structure and quality
  • Discover patterns and relationships
  • Identify anomalies and outliers
  • Guide feature engineering decisions
  • Inform modeling strategy
Key Questions EDA Answers:
  • What does my data look like?
  • Is my data clean and complete?
  • What patterns exist?
  • Which features are important?
EDA Process Overview: \begin{tikzpicture}[node distance=0.8cm] \node[rectangle, draw, fill=datacolor!20, text width=3cm, text centered, font=\small] (data) {Raw Data}; \node[rectangle, draw, fill=trendcolor!20, text width=3cm, text centered, font=\small, below=0.4cm of data] (explore) {Data Exploration}; \node[rectangle, draw, fill=featurecolor!20, text width=3cm, text centered, font=\small, below=0.4cm of explore] (clean) {Data Cleaning}; \node[rectangle, draw, fill=outliercolor!20, text width=3cm, text centered, font=\small, below=0.4cm of clean] (model) {Model Ready Data}; \draw[process arrow] (data) -- (explore); \draw[process arrow] (explore) -- (clean); \draw[process arrow] (clean) -- (model); \draw[process arrow] (explore.east) .. controls +(right:0.8cm) and +(right:0.8cm) .. (data.east); \end{tikzpicture}

Key Insight

EDA is iterative! Insights from one analysis often lead to new questions and deeper investigations.

The EDA Workflow

\begin{tikzpicture}[node distance=1.5cm, scale=0.9, transform shape] % Main workflow nodes \node[eda process, fill=datacolor!20] (load) {Data Loading}; \node[eda process, fill=trendcolor!20, right=of load] (inspect) {Initial Inspection}; \node[eda process, fill=featurecolor!20, right=of inspect] (clean) {Data Cleaning}; \node[eda process, fill=outliercolor!20, below=of clean] (viz) {Visualization}; \node[eda process, fill=tealaccent!20, left=of viz] (stats) {Statistical Analysis}; \node[eda process, fill=maizedark!20, left=of stats] (insights) {Insights \& Patterns}; % Arrows \draw[process arrow] (load) -- (inspect); \draw[process arrow] (inspect) -- (clean); \draw[process arrow] (clean) -- (viz); \draw[process arrow] (viz) -- (stats); \draw[process arrow] (stats) -- (insights); % Feedback loops \draw[process arrow, dashed] (viz) to[bend left=20] (clean); \draw[process arrow, dashed] (stats) to[bend left=30] (inspect); \draw[process arrow, dashed] (insights) to[bend left=40] (load); \end{tikzpicture}

Data Loading

  • Import datasets
  • Check file formats
  • Handle encoding issues
\begin{methodblock}{Statistical Analysis}
  • Descriptive statistics
  • Correlation analysis
  • Distribution testing
\end{methodblock}
\begin{tipblock}{Key Outcome}
  • Clean, understood data
  • Feature insights
  • Modeling strategy
\end{tipblock}

Why EDA is Critical for Machine Learning

Without EDA

Common Pitfalls:
  • Garbage In, Garbage Out
  • Poor model performance
  • Biased predictions
  • Overfitting to noise
  • Missing important patterns
  • Wasted computational resources
Statistical Foundation: \\ For dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$: $$\begin{aligned}\text{Data Quality} &= f(\text{Completeness, Accuracy, Consistency}) \\ \text{Model Performance} &\propto \text{Data Quality}\end{aligned}$$
\begin{exampleblock}{With Proper EDA} Benefits Achieved:
  • High-quality, clean data
  • Optimal feature selection
  • Appropriate model choice
  • Better generalization
  • Actionable insights
  • Efficient resource usage
\end{exampleblock} Impact Quantification: \\ Studies show that proper EDA can improve model performance by 15-30\% and reduce development time by 40-60\%. \begin{tikzpicture}[scale=0.7] \draw[fill=outliercolor!20] (0,0) rectangle (2,1) node[pos=.5] {\small No EDA}; \draw[fill=featurecolor!20] (0,1.2) rectangle (3,2.2) node[pos=.5] {\small With EDA}; \node[right] at (2.1,0.5) {\small 60\% Accuracy}; \node[right] at (3.1,1.7) {\small 85\% Accuracy}; \end{tikzpicture}

Understanding Your Dataset

[Figure: ../figures/01_data_types_overview.png]

First Steps

Always start with: \texttt{df.info()}, \texttt{df.describe()}, \texttt{df.shape}, and \texttt{df.head()}
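These first-look calls can be sketched as follows; the five-row frame is a hypothetical stand-in echoing the Titanic columns, not the real dataset:

```python
import pandas as pd

# Illustrative rows only -- a tiny stand-in for the Titanic frame
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

shape = df.shape            # (rows, columns)
head = df.head(3)           # first rows for a quick visual check
summary = df.describe()     # count/mean/std/quartiles of numeric columns
missing = df.isna().sum()   # missing values per column -- feeds the EDA workflow
```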

Data Types Classification

Numerical Data

Continuous Variables:
  • Can take any value in a range
  • Examples: age, salary, temperature
  • Mathematical operations meaningful
Discrete Variables:
  • Countable, distinct values
  • Examples: number of children, cars owned
  • Often integers
\begin{methodblock}{Mathematical Representation} For numerical variable $X$: $$X \in \mathbb{R} \text{ (continuous)} \text{ or } X \in \mathbb{Z} \text{ (discrete)}$$ \end{methodblock}

Categorical Data

Nominal Variables:
  • No natural ordering
  • Examples: color, gender, city
  • Cannot perform arithmetic
Ordinal Variables:
  • Natural ordering exists
  • Examples: education level, rating
  • Ranking meaningful, differences may not be
\begin{methodblock}{Mathematical Representation} For categorical variable $C$: $$C \in \{\text{category}_1, \text{category}_2, \dots, \text{category}_k\}$$ \end{methodblock}
\begin{tipblock}{Practical Tip} Encoding Strategy: Numerical $\rightarrow$ Keep as-is; Nominal $\rightarrow$ One-hot encoding; Ordinal $\rightarrow$ Label encoding \end{tipblock}
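The encoding strategy above can be sketched in pandas; the integer mapping for `Pclass` is an illustrative choice, not part of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female"],   # nominal: no natural order
    "Pclass": [3, 1, 2],                   # ordinal: 1st > 2nd > 3rd
})

# Nominal -> one-hot encoding (one indicator column per category)
onehot = pd.get_dummies(df["Sex"], prefix="Sex")

# Ordinal -> integer codes that preserve the ranking
# (illustrative mapping: higher code = better cabin class)
df["Pclass_ord"] = df["Pclass"].map({1: 3, 2: 2, 3: 1})
```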

Sample Dataset: Titanic Survival Analysis

Dataset Overview

Contains information on 891 passengers aboard the Titanic. Goal: Predict passenger survival based on their attributes.
\tiny \begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline PassengerId & Survived & Pclass & Sex & Age & SibSp & Parch & Fare & Embarked \\ \hline 1 & 0 & 3 & male & 22.0 & 1 & 0 & 7.25 & S \\ 2 & 1 & 1 & female & 38.0 & 1 & 0 & 71.28 & C \\ 3 & 1 & 3 & female & 26.0 & 0 & 0 & 7.92 & S \\ 4 & 1 & 1 & female & 35.0 & 1 & 0 & 53.10 & S \\ 5 & 0 & 3 & male & 35.0 & 0 & 0 & 8.05 & S \\ \hline \end{tabular}

Numerical Features

  • Age: Continuous (0-80)
  • Fare: Continuous (0-512)
  • SibSp: Discrete count
  • Parch: Discrete count
\begin{methodblock}{Categorical Features}
  • Sex: Nominal (M/F)
  • Embarked: Nominal (C/Q/S)
  • Pclass: Ordinal (1st, 2nd, 3rd)
\end{methodblock}
\begin{tipblock}{Target Variable}
  • Survived: Binary (0/1)
  • Classification Problem
  • 38.4\% survival rate
\end{tipblock}

Sample Dataset: Iris Species Classification

Dataset Overview

Classic dataset with 150 iris flowers from 3 species. Goal: Classify species based on flower measurements.
\tiny \begin{tabular}{|c|c|c|c|c|} \hline Sepal Length & Sepal Width & Petal Length & Petal Width & Species \\ \hline 5.1 & 3.5 & 1.4 & 0.2 & setosa \\ 4.9 & 3.0 & 1.4 & 0.2 & setosa \\ 6.2 & 2.9 & 4.3 & 1.3 & versicolor \\ 5.9 & 3.0 & 5.1 & 1.8 & virginica \\ 6.4 & 2.8 & 5.6 & 2.2 & virginica \\ \hline \end{tabular}

Numerical Features

  • Sepal Length: Continuous (4.3-7.9 cm)
  • Sepal Width: Continuous (2.0-4.4 cm)
  • Petal Length: Continuous (1.0-6.9 cm)
  • Petal Width: Continuous (0.1-2.5 cm)
\begin{tipblock}{Target Variable}
  • Species: 3-class categorical
  • Classes: setosa, versicolor, virginica
  • Balanced: 50 samples per class
  • Clean: No missing values
\end{tipblock}
\begin{methodblock}{EDA Advantages} Perfect for Learning: Small size, clean data, clear patterns, well-separated classes, interpretable features \end{methodblock}

Iris Dataset: Advanced Visualization Techniques

Correlogram Analysis

\begin{tikzpicture}[scale=0.8] % Draw correlation matrix \draw[fill=blue!30] (0,0) rectangle (1,1); \draw[fill=red!50] (1,0) rectangle (2,1); \draw[fill=red!70] (2,0) rectangle (3,1); \draw[fill=red!60] (3,0) rectangle (4,1); \draw[fill=red!50] (0,1) rectangle (1,2); \draw[fill=blue!30] (1,1) rectangle (2,2); \draw[fill=blue!20] (2,1) rectangle (3,2); \draw[fill=blue!10] (3,1) rectangle (4,2); \draw[fill=red!70] (0,2) rectangle (1,3); \draw[fill=blue!20] (1,2) rectangle (2,3); \draw[fill=blue!30] (2,2) rectangle (3,3); \draw[fill=red!90] (3,2) rectangle (4,3); \draw[fill=red!60] (0,3) rectangle (1,4); \draw[fill=blue!10] (1,3) rectangle (2,4); \draw[fill=red!90] (2,3) rectangle (3,4); \draw[fill=blue!30] (3,3) rectangle (4,4); % Labels \node at (0.5,-0.3) {\tiny Sep.L}; \node at (1.5,-0.3) {\tiny Sep.W}; \node at (2.5,-0.3) {\tiny Pet.L}; \node at (3.5,-0.3) {\tiny Pet.W}; \node at (-0.3,0.5) {\tiny Sep.L}; \node at (-0.3,1.5) {\tiny Sep.W}; \node at (-0.3,2.5) {\tiny Pet.L}; \node at (-0.3,3.5) {\tiny Pet.W}; % Correlation values \node at (2.5,0.5) {\tiny 0.96}; \node at (3.5,2.5) {\tiny 0.96}; \node at (1.5,0.5) {\tiny -0.12}; \end{tikzpicture} Strong correlation: Petal length ↔ Petal width (r=0.96)

Box Plot Insights

\begin{tikzpicture}[scale=0.7] % Three box plots for species \foreach \x/\species/\color in {1/setosa/green, 2.5/versicolor/blue, 4/virginica/red} { % Box plot structure \draw[thick, \color] (\x-0.3,0.5) rectangle (\x+0.3,1.5); \draw[thick, \color] (\x-0.3,1) -- (\x+0.3,1); \draw[thick, \color] (\x,0.2) -- (\x,0.5); \draw[thick, \color] (\x,1.5) -- (\x,1.8); % Species labels \node at (\x,-0.2) {\tiny \species}; % Add some outliers for virginica \ifnum\x=4 \fill[\color] (\x,2.2) circle (0.05); \fill[\color] (\x,0.1) circle (0.05); \fi } \node at (2.5,2.5) {\small Petal Length by Species}; \end{tikzpicture} Clear separation: Species distinguishable by petal features

Violin Plot Analysis

\begin{tikzpicture}[scale=0.9] % Draw three violin plots \foreach \x/\species/\color in {1.5/setosa/green!60, 3.5/versicolor/blue!60, 5.5/virginica/red!60} { % Violin shape (ellipse approximation) \fill[\color, opacity=0.7] (\x,0.5) ellipse (0.4 and 0.8); \draw[thick] (\x,0.5) ellipse (0.4 and 0.8); % Center line \draw[thick, black] (\x-0.1,0.5) -- (\x+0.1,0.5); % Species labels \node at (\x,-0.1) {\small \species}; } % Axis labels \node at (0.5,0.5) [rotate=90] {\small Sepal Width}; \node at (3.5,-0.5) {\small Species}; % Title \node at (3.5,1.8) {Distribution Shape \& Density by Species}; \end{tikzpicture} \begin{methodblock}{Violin Plot Advantages} Combines: Box plot summary statistics + density estimation + distribution shape visualization \end{methodblock}

Univariate Analysis - Numerical Variables

[Figure: ../figures/02_univariate_numerical.png]

Key Observations

Age: Right-skewed, missing values; Fare: Heavy right tail, potential outliers

Statistical Measures for Numerical Data

Central Tendency

For variable $X = \{x_1, x_2, \dots, x_n\}$: Mean (Arithmetic): $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ Median: Middle value when sorted $$\text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ even} \end{cases}$$ Mode: Most frequent value

Dispersion Measures

Variance: $$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$ Standard Deviation: $$\sigma = \sqrt{\sigma^2}$$ Interquartile Range: $$\text{IQR} = Q_3 - Q_1$$ Range: $$\text{Range} = x_{\max} - x_{\min}$$
\begin{tipblock}{Practical Guidelines} Skewed data: Use median \& IQR; Normal data: Use mean \& standard deviation \end{tipblock}
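A quick NumPy sketch of these measures on a small right-skewed toy sample, where the mean-median gap illustrates the guideline above:

```python
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0, 10.0])   # toy right-skewed sample

mean = x.mean()
median = np.median(x)                        # robust to the 10.0
var = x.var(ddof=1)                          # sample variance, n-1 denominator
std = x.std(ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                # robust dispersion measure
data_range = x.max() - x.min()
```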

Univariate Analysis - Categorical Variables

[Figure: ../figures/03_univariate_categorical.png]

Key Insights

Gender imbalance: 65\% male passengers; Class distribution: 55\% third class; Embarkation: 72\% from Southampton

Statistical Measures for Categorical Data

Frequency Analysis

For categorical variable $C$ with categories $\{c_1, c_2, \dots, c_k\}$: Frequency: $$f_i = \text{count}(C = c_i)$$ Relative Frequency: $$p_i = \frac{f_i}{n} \text{ where } \sum_{i=1}^{k} p_i = 1$$ Mode: Category with highest frequency $$\text{Mode} = \arg\max_{c_i} f_i$$
\begin{methodblock}{Entropy Measure} Information content: $$H(C) = -\sum_{i=1}^{k} p_i \log_2(p_i)$$ Higher entropy = more uniform distribution \end{methodblock}
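Frequency, mode, and entropy can be computed with the standard library alone; the values below are toy `Embarked`-style codes:

```python
import math
from collections import Counter

embarked = ["S", "S", "C", "S", "Q", "C", "S"]   # toy categorical column

counts = Counter(embarked)                        # absolute frequencies f_i
n = len(embarked)
probs = {c: f / n for c, f in counts.items()}     # relative frequencies p_i
mode = counts.most_common(1)[0][0]                # argmax of frequency

# Shannon entropy: higher = more uniform distribution
entropy = -sum(p * math.log2(p) for p in probs.values())
```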

Visualization Guidelines

Bar Charts:
  • Best for comparing categories
  • Order by frequency for impact
  • Use consistent colors
Pie Charts:
  • Good for showing proportions
  • Limit to $\leq 5$ categories
  • Start largest slice at 12 o'clock
\begin{exampleblock}{Chi-Square Test} Test for uniform distribution: $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$ where $O_i$ = observed, $E_i$ = expected \end{exampleblock}

Distribution Analysis \& Normality Testing

Common Distributions

Normal Distribution: $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ Log-Normal Distribution: $$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}$$ Skewness: $$\text{Skew} = \frac{E[(X-\mu)^3]}{\sigma^3}$$

Normality Tests

Shapiro-Wilk Test: $$W = \frac{(\sum_{i=1}^n a_i x_{(i)})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$ Kolmogorov-Smirnov Test: $$D = \sup_x |F_n(x) - F(x)|$$ Anderson-Darling Test: More sensitive to tail deviations

Decision Rule

If $p < 0.05$: Reject normality assumption; Consider transformations (log, square root, Box-Cox)
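A scipy sketch of the decision rule on synthetic data: a lognormal sample fails Shapiro-Wilk, and a log transform is the usual first remedy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0, sigma=1, size=200)   # clearly non-normal

w, p = stats.shapiro(skewed)
reject_normality = p < 0.05          # decision rule from the slide

# Candidate fix: log transform, then re-test
w_log, p_log = stats.shapiro(np.log(skewed))
```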

Correlation Analysis

[Figure: ../figures/04_correlation_analysis.png]

Correlation Insights

Moderate correlation: SibSp-Parch (0.41); Weak correlations: Fare-Survival (0.26), Age-Survival ($-0.07$)

Correlation Coefficients \& Interpretation

Pearson Correlation

For linear relationships: $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$ Range: $r \in [-1, 1]$
  • $r = 1$: Perfect positive correlation
  • $r = 0$: No linear correlation
  • $r = -1$: Perfect negative correlation
\begin{methodblock}{Significance Test} $$t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}$$ \end{methodblock}

Non-Linear Correlations

Spearman Rank Correlation: $$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$ where $d_i$ = rank difference. Kendall's Tau: $$\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}$$ where $n_c$ = concordant pairs, $n_d$ = discordant pairs
\begin{exampleblock}{Interpretation Guide}
  • $|r| < 0.3$: Weak relationship
  • $0.3 \leq |r| < 0.7$: Moderate relationship
  • $|r| \geq 0.7$: Strong relationship
\end{exampleblock}
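The three coefficients can be compared on a monotonic but non-linear toy relationship, where the rank-based measures reach 1 while Pearson stays below it:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 9, dtype=float)
y = x ** 2                      # monotonic but non-linear

r, p_r = stats.pearsonr(x, y)       # linear correlation: high but < 1
rho, p_rho = stats.spearmanr(x, y)  # rank correlation: exactly 1
tau, p_tau = stats.kendalltau(x, y) # concordance: exactly 1
```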

Remember

Correlation $\neq$ Causation! Always investigate the underlying mechanisms.

Bivariate Analysis - Feature Relationships

[Figure: ../figures/10_bivariate_analysis.png]

Key Patterns

Gender effect: Women had 74\% survival rate vs men 19\%; Class effect: 1st class 63\% vs 3rd class 24\%

Cross-Tabulation \& Contingency Tables

Contingency Table

For categorical variables $A$ and $B$: \small \begin{tabular}{|c|c|c|c|} \hline & $B_1$ & $B_2$ & Total \\ \hline $A_1$ & $n_{11}$ & $n_{12}$ & $n_{1.}$ \\ $A_2$ & $n_{21}$ & $n_{22}$ & $n_{2.}$ \\ \hline Total & $n_{.1}$ & $n_{.2}$ & $n$ \\ \hline \end{tabular} Joint Probability: $$P(A_i, B_j) = \frac{n_{ij}}{n}$$ Marginal Probability: $$P(A_i) = \frac{n_{i.}}{n}, P(B_j) = \frac{n_{.j}}{n}$$

Independence Test

Chi-Square Test: $$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$ where $E_{ij} = \frac{n_{i.} \times n_{.j}}{n}$ Degrees of freedom: $$df = (r-1)(c-1)$$ Cramér's V (Effect Size): $$V = \sqrt{\frac{\chi^2}{n \times \min(r-1, c-1)}}$$
\begin{tipblock}{Interpretation} $V \in [0,1]$: 0 = no association, 1 = perfect association \end{tipblock}
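A scipy sketch of the independence test plus effect size; the 2x2 counts are hypothetical values loosely echoing a sex-by-survival split:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = sex, cols = (died, survived)
table = np.array([[468, 109],
                  [ 81, 233]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V effect size in [0, 1]
n = table.sum()
r, c = table.shape
cramers_v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
```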

Missing Data Analysis

[Figure: ../figures/05_missing_data_analysis.png]

Missing Data Impact

Age: 20\% missing values; Pattern: May not be random - could be related to passenger class or survival

Types of Missing Data

MCAR: Missing Completely at Random

Definition: Missing data is independent of both observed and unobserved data. Mathematical condition: $$P(\text{Missing} | X, Y) = P(\text{Missing})$$ Example: Survey responses lost due to mail delivery issues. Implication: Can use any imputation method without bias.
\begin{methodblock}{Test for MCAR} Little's MCAR Test: Tests null hypothesis that data is MCAR using EM algorithm. \end{methodblock}

MAR: Missing at Random

Definition: Missingness depends on observed data, but not on the missing values themselves. Mathematical condition: $$P(\text{Missing} | X, Y) = P(\text{Missing} | X)$$ Example: Age is more often missing for third-class passengers (class is observed). MNAR: Missing Not at Random: missingness depends on the unobserved value itself. Example: High-income individuals not reporting income.

Handling Strategy

MAR: Multiple imputation; MNAR: Domain expertise required

Imputation Strategies

Simple Imputation

Mean/Mode Imputation: $$x_{\text{missing}} = \bar{x} \text{ or } \text{Mode}(x)$$ Median Imputation: $$x_{\text{missing}} = \text{Median}(x)$$ Forward/Backward Fill: for time-series data. Constant Value: domain-specific constant (e.g., 0, $-1$)

Limitations

Simple methods reduce variance and can introduce bias

Advanced Imputation

KNN Imputation: $$x_{\text{missing}} = \frac{1}{k}\sum_{i \in \text{k-nearest}} x_i$$ Multiple Imputation: Creates multiple complete datasets, analyzes each, pools results. Model-based: - Linear regression - Random Forest - Deep learning approaches
\begin{exampleblock}{Best Practice} Always analyze missing data pattern before choosing imputation method \end{exampleblock}
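A short sketch of both tiers on a toy frame (values are made up): median fill for the skewed column, then KNN imputation borrowing from similar rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age":  [22.0, 38.0, np.nan, 35.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 7.88],
})

# Simple: median imputation (preferred over mean for right-skewed Age)
median_age = df["Age"].median()
df["Age_median"] = df["Age"].fillna(median_age)

# Advanced: KNN imputation -- fills gaps from the k most similar rows
filled = KNNImputer(n_neighbors=2).fit_transform(df[["Age", "Fare"]])
```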

Outlier Detection Methods

[Figure: ../figures/06_outlier_detection.png]

Outlier Findings

Fare: 20 outliers detected using IQR method; Age: Few extreme values at high ages

Statistical Outlier Detection Methods

IQR Method

Interquartile Range: $$\text{IQR} = Q_3 - Q_1$$ Outlier bounds: $$\text{Lower bound} = Q_1 - 1.5 \times \text{IQR}$$ $$\text{Upper bound} = Q_3 + 1.5 \times \text{IQR}$$ Outlier condition: $$x < \text{Lower bound} \text{ or } x > \text{Upper bound}$$
\begin{methodblock}{Modified Z-Score} $$M_i = \frac{0.6745(x_i - \text{median})}{\text{MAD}}$$ where MAD = median absolute deviation \end{methodblock}

Z-Score Method

Standard Z-score: $$z_i = \frac{x_i - \bar{x}}{\sigma}$$ Outlier threshold: $$|z_i| > 2.5 \text{ or } |z_i| > 3$$ Limitation: Sensitive to outliers in mean and std calculation
\begin{exampleblock}{Isolation Forest} Anomaly Score: $$s(x,n) = 2^{-\frac{E(h(x))}{c(n)}}$$ where $E(h(x))$ = average path length of $x$ over the isolation trees, $c(n)$ = average path length of an unsuccessful search in a binary search tree with $n$ points \end{exampleblock}
\begin{tipblock}{Decision Framework} Normal distribution: Z-score; Skewed distribution: IQR; Multivariate: Isolation Forest \end{tipblock}
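The IQR method in NumPy on a toy fare-like sample, flagging values beyond the $1.5 \times \text{IQR}$ fences:

```python
import numpy as np

fare = np.array([7.2, 8.0, 7.9, 13.0, 26.0, 8.1, 7.8, 512.3, 15.5, 9.0])

q1, q3 = np.percentile(fare, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr                 # lower fence
upper = q3 + 1.5 * iqr                 # upper fence
outliers = fare[(fare < lower) | (fare > upper)]
```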

Multivariate Outlier Detection

Mahalanobis Distance

For multivariate data $\mathbf{x} \in \mathbb{R}^p$: $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$ where $\boldsymbol{\mu}$ = sample mean vector and $\boldsymbol{\Sigma}$ = sample covariance matrix. Outlier threshold: $$D_M(\mathbf{x}) > \sqrt{\chi^2_{p,\alpha}}$$
\begin{methodblock}{Cook's Distance} Measures influence of each observation on regression: $$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p \times \text{MSE}}$$ \end{methodblock}

Local Outlier Factor (LOF)

Local Reachability Density: $$\text{lrd}_k(A) = \left(\frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A,B)}{|N_k(A)|}\right)^{-1}$$ LOF Score: $$\text{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\text{lrd}_k(B)}{\text{lrd}_k(A)}}{|N_k(A)|}$$ Interpretation: LOF $\approx 1$: normal point; LOF $\gg 1$: outlier

Key Insight

Multivariate outliers may not be outliers in any single dimension
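A sketch of Mahalanobis-based detection on synthetic correlated data: the appended point is unremarkable in each coordinate but far off the correlation structure, exactly the case univariate rules miss:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [3.0, -3.0]])   # ordinary in each margin, off the trend

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance for every row
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

threshold = chi2.ppf(0.999, df=2)   # chi-square cutoff, p = 2 dimensions
is_outlier = d2 > threshold
```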

Feature Engineering Examples

[Figure: ../figures/07_feature_engineering.png]

Engineering Insights

Age binning: Creates interpretable groups; Family size: Combines multiple features; Title extraction: Captures social status

Feature Creation Techniques

Binning \& Discretization

Equal-width binning: $$\text{bin width} = \frac{x_{\max} - x_{\min}}{k}$$ Equal-frequency binning: Each bin contains $\frac{n}{k}$ observations. Quantile-based binning: Based on percentiles (quartiles, deciles). Domain-specific binning: Using expert knowledge (e.g., age groups)
\begin{methodblock}{Optimal Binning} Use information gain or chi-square test to determine optimal bin boundaries \end{methodblock}
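The binning strategies map to pandas as `cut`, `qcut`, and an explicit edge list; the age-group boundaries below are an illustrative domain choice:

```python
import pandas as pd

age = pd.Series([4, 15, 22, 29, 35, 41, 52, 63, 70, 8])   # toy ages

# Equal-width: four bins of identical width
width_bins = pd.cut(age, bins=4)

# Equal-frequency: four quantile-based bins with ~n/4 members each
freq_bins = pd.qcut(age, q=4)

# Domain-specific: hypothetical expert-chosen boundaries
groups = pd.cut(age, bins=[0, 12, 18, 60, 100],
                labels=["child", "teen", "adult", "senior"])
```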

Feature Combinations

Arithmetic Operations:
  • Addition: $x_1 + x_2$ (total family size)
  • Multiplication: $x_1 \times x_2$ (interaction terms)
  • Division: $x_1 / x_2$ (ratios, rates)
Boolean Operations:
  • Logical AND: $x_1 \land x_2$
  • Logical OR: $x_1 \lor x_2$
  • Conditional: 1 if $x_1 > \text{threshold}$, else 0
String Operations:
  • Length: $\text{len}(\text{string})$
  • Contains: pattern matching
  • Extract: regular expressions
\begin{exampleblock}{Feature Engineering Guidelines} Domain Knowledge: Most important factor; Iterative Process: Create, test, refine; Validation: Always validate on holdout set \end{exampleblock}

Mathematical Transformations

Power Transformations

Log Transformation: $$y = \log(x + c)$$ Reduces right skewness, stabilizes variance. Square Root: $$y = \sqrt{x}$$ Moderate variance stabilization. Box-Cox Transformation: $$y = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$ Optimal $\lambda$ found via maximum likelihood

Trigonometric Features

For cyclical data (time, angles): Sine/Cosine encoding: $$\sin\left(\frac{2\pi \times \text{value}}{\text{max\_value}}\right)$$ $$\cos\left(\frac{2\pi \times \text{value}}{\text{max\_value}}\right)$$ Example for hour of day: $$\text{hour\_sin} = \sin\left(\frac{2\pi \times \text{hour}}{24}\right)$$ $$\text{hour\_cos} = \cos\left(\frac{2\pi \times \text{hour}}{24}\right)$$
\begin{tipblock}{When to Transform} Transform when: skewed data, non-linear relationships, or specific model requirements \end{tipblock}
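A sketch of the transforms on synthetic right-skewed data, plus the cyclical hour encoding; `log1p` is used so zeros would be handled:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
fare = rng.lognormal(mean=3, sigma=1, size=500)   # positive, right-skewed

log_fare = np.log1p(fare)                 # log(x + 1)
boxcox_fare, lam = stats.boxcox(fare)     # lambda fit by maximum likelihood

# Cyclical (sine/cosine) encoding for hour-of-day
hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```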

Feature Selection Methods

[Figure: ../figures/09_feature_selection.png]

Selection Results

Statistical: Gender and fare most important; Tree-based: Consistent with domain knowledge about survival factors

Statistical Feature Selection

Filter Methods

Correlation-based: Select features with high correlation to target, low correlation to each other F-test (ANOVA): $$F = \frac{\text{MSB}}{\text{MSW}} = \frac{\sum_{i=1}^k n_i(\bar{x}_i - \bar{x})^2/(k-1)}{\sum_{i=1}^k \sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2/(n-k)}$$ Chi-square test: $$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$ Mutual Information: $$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

Wrapper Methods

Forward Selection: Start with empty set, add best features iteratively Backward Elimination: Start with all features, remove worst iteratively Recursive Feature Elimination: $$\text{rank}_i = f(\text{coef}_i, \text{importance}_i)$$ Genetic Algorithm: Evolutionary approach to feature subset selection
\begin{methodblock}{Embedded Methods} L1 Regularization (Lasso): $$\min_\beta \frac{1}{2n}||y - X\beta||^2_2 + \lambda||\beta||_1$$ \end{methodblock}
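Filter and embedded methods side by side on synthetic data with one informative feature and three noise features; the Lasso alpha is an arbitrary illustration:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=(n, 3))
y = (informative + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([informative, noise])   # feature 0 drives the target

F, pvals = f_classif(X, y)                  # filter: ANOVA F-test per feature
mi = mutual_info_classif(X, y, random_state=0)  # filter: mutual information
lasso = Lasso(alpha=0.1).fit(X, y)          # embedded: L1 zeroes weak features
```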

Tree-based Feature Importance

Random Forest Importance

Mean Decrease Impurity: $$\text{Importance}(x_j) = \frac{1}{T}\sum_{t=1}^T \sum_{v \in \text{splits}} p(v) \times \Delta I(v)$$ where: - $T$ = number of trees - $p(v)$ = proportion of samples reaching node $v$ - $\Delta I(v)$ = impurity decrease at node $v$ Mean Decrease Accuracy: Permutation-based importance measuring prediction accuracy drop when feature is shuffled

Gradient Boosting Importance

Gain-based Importance: $$\text{Importance}(x_j) = \sum_{t=1}^T \sum_{v \in \text{splits}_j} \text{gain}(v)$$ SHAP Values: Shapley Additive exPlanations provide unified measure: $$\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{j\}) - f(S)]$$

Caution

Tree-based importance can be biased toward high-cardinality categorical features

Normalization Comparison

[Figure: ../figures/08_normalization_comparison.png]

Normalization Effects

StandardScaler: Zero mean, unit variance; MinMaxScaler: [0,1] range; RobustScaler: Median-based, outlier resistant

Scaling Methods Mathematical Formulations

Standard Scaling (Z-score)

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$ where $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)^2}$ Properties: - Mean = 0, Std = 1 - Preserves distribution shape - Sensitive to outliers

Min-Max Scaling

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$ Properties: - Range: [0, 1] - Preserves relationships - Very sensitive to outliers

Robust Scaling

$$x_{\text{scaled}} = \frac{x - \text{Median}(x)}{\text{IQR}(x)}$$ where $\text{IQR} = Q_3 - Q_1$ Properties: - Median-centered - Uses interquartile range - Robust to outliers

Unit Vector Scaling

$$x_{\text{scaled}} = \frac{x}{||x||_2}$$ where $||x||_2 = \sqrt{\sum_{i=1}^n x_i^2}$ Use case: When magnitude matters more than individual values
\begin{tipblock}{Selection Guide} Normal data + no outliers: StandardScaler; Bounded range needed: MinMaxScaler; Outliers present: RobustScaler \end{tipblock}
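The three scalers applied to a toy fare column with one extreme value, matching the properties listed above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

fare = np.array([[7.25], [7.92], [8.05], [13.0], [26.0], [512.33]])  # outlier last

std = StandardScaler().fit_transform(fare)   # mean 0, (population) std 1
mm = MinMaxScaler().fit_transform(fare)      # squeezed into [0, 1]
rob = RobustScaler().fit_transform(fare)     # median 0, scaled by IQR
```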

When \& Why to Normalize

Algorithms Requiring Normalization

Distance-based:
  • k-NN, k-means clustering
  • SVM with RBF kernel
  • Neural networks
Gradient-based:
  • Logistic regression
  • Linear regression with regularization
  • Deep learning
Mathematical justification: Features with larger scales dominate distance calculations: $$d = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}$$

Algorithms Not Requiring Normalization

Tree-based methods:
  • Decision trees
  • Random Forest
  • Gradient boosting
Reason: Trees use split points, not absolute values
Other scale-invariant methods:
  • Naive Bayes (probability-based)
  • Association rules (rule-based)
Feature scales don't affect splitting decisions or probability calculations

Critical Rule

Always fit scaler on training data only! Apply same transformation to validation/test sets to avoid data leakage.
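The leakage-safe pattern is fit on train, transform on both; a minimal sklearn sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)   # toy feature column
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics learned on train ONLY
X_test_s = scaler.transform(X_test)         # same statistics reused -- no refit
```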

Target Variable Analysis

[Figure: ../figures/11_target_analysis.png]

Target Insights

Class imbalance: 62\% non-survival; Gender-class interaction: First-class women had 97\% survival rate

Business Insights from EDA

[Figure: ../figures/12_business_insights.png]

Actionable Insights

Revenue impact: Higher-paying passengers had better survival rates; Port differences: Embarkation port correlates with survival

Advanced Visualization Techniques

Dimensionality Reduction

Principal Component Analysis: $$\mathbf{Y} = \mathbf{XW}$$ where $\mathbf{W}$ contains eigenvectors of the covariance matrix. t-SNE: $$p_{j|i} = \frac{\exp(-||\mathbf{x}_i - \mathbf{x}_j||^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-||\mathbf{x}_i - \mathbf{x}_k||^2/2\sigma_i^2)}$$ UMAP (Uniform Manifold Approximation):
  • Preserves local and global structure
  • Faster than t-SNE
  • Better for clustering visualization

Interactive Visualizations

Plotly Benefits:
  • Zoom, pan, hover information
  • 3D scatter plots
  • Animated visualizations
  • Dashboard creation
Parallel Coordinates: Visualize high-dimensional data relationships
Sankey Diagrams: Show flow between categorical variables
Radar Charts: Compare multiple features simultaneously
\begin{tipblock}{Best Practice} Progressive Disclosure: Start with simple plots, add complexity as needed for deeper insights \end{tipblock}

Time Series EDA Considerations

Time Series Components

Decomposition: $$y(t) = \text{Trend}(t) + \text{Seasonal}(t) + \text{Noise}(t)$$ Stationarity Testing: Augmented Dickey-Fuller test: $$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \epsilon_t$$ Autocorrelation: $$\rho_k = \frac{\text{Cov}(y_t, y_{t-k})}{\text{Var}(y_t)}$$

Seasonal Analysis

Seasonal Decomposition: - STL (Seasonal and Trend decomposition using Loess) - X-12-ARIMA - Classical decomposition Periodogram: $$I(\omega) = \frac{1}{n}\left|\sum_{t=1}^n y_t e^{-i\omega t}\right|^2$$ Box-Cox for stabilization: Handle changing variance over time

Time Series EDA Goals

Identify trends, seasonality, outliers, structural breaks, and appropriate transformation needs

EDA to ML Pipeline Integration

[Figure: ../figures/13_ml_pipeline_demo.png]

Pipeline Success

EDA insights validated: Gender and class are top predictors; Model performance: 85\% accuracy achieved through proper preprocessing

From EDA to Model Development

EDA-Informed Decisions

Feature Engineering:
  • Age binning based on distribution analysis
  • Family size creation from SibSp + Parch
  • Title extraction from name patterns
Preprocessing Choices:
  • Median imputation for age (right-skewed)
  • StandardScaler for fare (wide range)
  • One-hot encoding for categorical variables
Model Selection:
  • Random Forest chosen for mixed data types
  • Handles non-linear relationships
  • Robust to outliers (detected in EDA)

Validation Strategy

Cross-Validation Design: Based on data size (891 samples) $\rightarrow$ 5-fold CV
Stratification: Maintain class balance (38.4\% survival rate)
Performance Metrics:
  • Accuracy: Overall performance
  • Precision/Recall: Handle class imbalance
  • F1-Score: Balanced measure
  • AUC-ROC: Threshold-independent
Feature Importance Validation: EDA findings confirmed by model:
1. Sex (gender) - highest importance
2. Fare - economic status indicator
3. Age - demographic factor

EDA Best Practices \& Common Pitfalls

Common Pitfalls

Data Leakage:
  • Using future information
  • Target leakage in features
  • Scaling on the entire dataset
Confirmation Bias:
  • Looking only for expected patterns
  • Ignoring contradictory evidence
  • Over-interpreting correlations
Statistical Errors:
  • Multiple testing without correction
  • Assuming causation from correlation
  • Ignoring sample size effects
\begin{exampleblock}{Best Practices} Systematic Approach:
  • Follow a structured EDA workflow
  • Document all findings and decisions
  • Version control EDA notebooks
Statistical Rigor:
  • Apply multiple testing corrections
  • Use appropriate statistical tests
  • Report confidence intervals
Reproducibility:
  • Set random seeds
  • Save preprocessing parameters
  • Create reusable functions
Communication:
  • Clear visualizations
  • Executive summaries
  • Actionable recommendations
\end{exampleblock}

EDA Checklist \& Quality Assurance

Data Quality Checklist

  • [$\checkmark$] Completeness: Missing value analysis
  • [$\checkmark$] Accuracy: Outlier detection \& validation
  • [$\checkmark$] Consistency: Data type verification
  • [$\checkmark$] Uniqueness: Duplicate detection
  • [$\checkmark$] Validity: Range \& format checking
  • [$\checkmark$] Timeliness: Temporal analysis
\begin{methodblock}{Statistical Validation}
  • [$\checkmark$] Distribution testing
  • [$\checkmark$] Correlation significance tests
  • [$\checkmark$] Independence assumptions
  • [$\checkmark$] Sample size adequacy
\end{methodblock}
\begin{tipblock}{Visualization Checklist}
  • [$\checkmark$] Clarity: Clear labels \& legends
  • [$\checkmark$] Completeness: All data represented
  • [$\checkmark$] Accuracy: Correct scales \& axes
  • [$\checkmark$] Aesthetics: Professional appearance
  • [$\checkmark$] Accessibility: Color-blind friendly
  • [$\checkmark$] Context: Meaningful titles \& captions
\end{tipblock}

Documentation Standards

  • [$\checkmark$] Data source \& collection methods
  • [$\checkmark$] Preprocessing steps \& rationale
  • [$\checkmark$] Key findings \& insights
  • [$\checkmark$] Limitations \& assumptions
  • [$\checkmark$] Next steps \& recommendations

Summary: Key Takeaways

Core EDA Principles

1. Systematic Approach:
  • Start with data overview
  • Progress from simple to complex
  • Document everything
2. Statistical Rigor:
  • Use appropriate tests
  • Check assumptions
  • Report confidence intervals
3. Visual Communication:
  • Clear, interpretable plots
  • Multiple visualization types
  • Story-driven presentation
\begin{exampleblock}{Practical Impact} Model Performance - 15-30\% improvement typical - Better feature selection - Reduced overfitting Business Value - Actionable insights - Risk identification - Decision support Efficiency Gains - 40-60\% time savings - Focused modeling efforts - Reduced iterations \end{exampleblock}
\begin{tikzpicture}[scale=0.8] \node[rectangle, draw, fill=datacolor!20, text width=2.5cm, text centered] (eda) {Quality EDA}; \node[rectangle, draw, fill=featurecolor!20, text width=2.5cm, text centered, right=1cm of eda] (model) {Better Models}; \node[rectangle, draw, fill=trendcolor!20, text width=2.5cm, text centered, right=1cm of model] (business) {Business Value}; \draw[process arrow, thick] (eda) -- (model); \draw[process arrow, thick] (model) -- (business); \end{tikzpicture}

Next Steps: Advanced EDA Topics

Advanced Techniques

Automated EDA: pandas-profiling, sweetviz, autoviz
Big Data EDA: sampling strategies, distributed computing, stream processing
Domain-Specific EDA: text data analysis, image data exploration, time series deep-dive

Integration Topics

MLOps Integration: automated data quality checks, feature store management, drift detection
Causal Inference: confounding variable identification, causal graph construction, treatment effect analysis
Ethics \& Fairness: bias detection in data, fairness metrics, responsible AI practices
\begin{tipblock}{Learning Path} Practice: Apply EDA to diverse datasets; Study: Read domain literature; Share: Present findings to stakeholders \end{tipblock}

Resources \& Further Reading

Essential Books

"Exploratory Data Analysis" - John Tukey \\[2pt] The foundational text for EDA principles
"Python for Data Analysis" - Wes McKinney \\[2pt] Practical pandas-based EDA
"The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman \\[2pt] Statistical foundations
"Fundamentals of Data Visualization" - Claus Wilke \\[2pt] Visualization best practices

Online Resources

Python Libraries: pandas, seaborn, matplotlib; plotly, bokeh (interactive); scipy, statsmodels (statistics)
R Libraries: ggplot2, dplyr; corrplot, VIM; DataExplorer, dlookr
Courses: Coursera (EDA with Python), edX (Data Science MicroMasters), Kaggle Learn (Data Visualization)
Questions \& Discussion \\[0.2cm] "The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey

End of Module 04

Exploratory Data Analysis (EDA)

Questions?