
Exploratory Data Analysis (EDA)

CMSC 173 - Module 04

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What is Exploratory Data Analysis?

Definition

EDA is the process of investigating datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
Primary Goals:
  • Understand data structure and quality
  • Discover patterns and relationships
  • Identify anomalies and outliers
  • Guide feature engineering decisions
  • Inform modeling strategy
Key Questions EDA Answers:
  • What does my data look like?
  • Is my data clean and complete?
  • What patterns exist?
  • Which features are important?
EDA Process Overview: \begin{tikzpicture}[node distance=0.8cm] \node[rectangle, draw, fill=datacolor!20, text width=3cm, text centered, font=\small] (data) {Raw Data}; \node[rectangle, draw, fill=trendcolor!20, text width=3cm, text centered, font=\small, below=0.4cm of data] (explore) {Data Exploration}; \node[rectangle, draw, fill=featurecolor!20, text width=3cm, text centered, font=\small, below=0.4cm of explore] (clean) {Data Cleaning}; \node[rectangle, draw, fill=outliercolor!20, text width=3cm, text centered, font=\small, below=0.4cm of clean] (model) {Model Ready Data}; \draw[process arrow] (data) -- (explore); \draw[process arrow] (explore) -- (clean); \draw[process arrow] (clean) -- (model); \draw[process arrow] (explore.east) .. controls +(right:0.8cm) and +(right:0.8cm) .. (data.east); \end{tikzpicture}

Key Insight

EDA is iterative! Insights from one analysis often lead to new questions and deeper investigations.

The EDA Workflow

\begin{tikzpicture}[node distance=1.5cm, scale=0.9, transform shape] % Main workflow nodes \node[eda process, fill=datacolor!20] (load) {Data Loading}; \node[eda process, fill=trendcolor!20, right=of load] (inspect) {Initial Inspection}; \node[eda process, fill=featurecolor!20, right=of inspect] (clean) {Data Cleaning}; \node[eda process, fill=outliercolor!20, below=of clean] (viz) {Visualization}; \node[eda process, fill=tealaccent!20, left=of viz] (stats) {Statistical Analysis}; \node[eda process, fill=maizedark!20, left=of stats] (insights) {Insights \& Patterns}; % Arrows \draw[process arrow] (load) -- (inspect); \draw[process arrow] (inspect) -- (clean); \draw[process arrow] (clean) -- (viz); \draw[process arrow] (viz) -- (stats); \draw[process arrow] (stats) -- (insights); % Feedback loops \draw[process arrow, dashed] (viz) to[bend left=20] (clean); \draw[process arrow, dashed] (stats) to[bend left=30] (inspect); \draw[process arrow, dashed] (insights) to[bend left=40] (load); \end{tikzpicture}

Data Loading

  • Import datasets
  • Check file formats
  • Handle encoding issues
\begin{methodblock}{Statistical Analysis}
  • Descriptive statistics
  • Correlation analysis
  • Distribution testing
\end{methodblock}
\begin{tipblock}{Key Outcome}
  • Clean, understood data
  • Feature insights
  • Modeling strategy
\end{tipblock}

Why EDA is Critical for Machine Learning

Without EDA

Common Pitfalls:
  • Garbage In, Garbage Out
  • Poor model performance
  • Biased predictions
  • Overfitting to noise
  • Missing important patterns
  • Wasted computational resources
Statistical Foundation: \\ For dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$: $$\begin{aligned}\text{Data Quality} &= f(\text{Completeness, Accuracy, Consistency}) \\ \text{Model Performance} &\propto \text{Data Quality}\end{aligned}$$
\begin{exampleblock}{With Proper EDA} Benefits Achieved:
  • High-quality, clean data
  • Optimal feature selection
  • Appropriate model choice
  • Better generalization
  • Actionable insights
  • Efficient resource usage
\end{exampleblock} Impact Quantification: \\ Studies show that proper EDA can improve model performance by 15-30\% and reduce development time by 40-60\%. \begin{tikzpicture}[scale=0.7] \draw[fill=outliercolor!20] (0,0) rectangle (2,1) node[pos=.5] {\small No EDA}; \draw[fill=featurecolor!20] (0,1.2) rectangle (3,2.2) node[pos=.5] {\small With EDA}; \node[right] at (2.1,0.5) {\small 60\% Accuracy}; \node[right] at (3.1,1.7) {\small 85\% Accuracy}; \end{tikzpicture}

Understanding Your Dataset

[Figure: ../figures/01_data_types_overview.png]

First Steps

Always start with: \texttt{df.info()}, \texttt{df.describe()}, \texttt{df.shape}, and \texttt{df.head()}
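These first-look calls can be sketched as follows; the five-row frame is a hypothetical stand-in echoing the Titanic columns, not the real dataset:

```python
import pandas as pd

# Illustrative rows only -- a tiny stand-in for the Titanic frame
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

shape = df.shape            # (rows, columns)
head = df.head(3)           # first rows for a quick visual check
summary = df.describe()     # count/mean/std/quartiles of numeric columns
missing = df.isna().sum()   # missing values per column -- feeds the EDA workflow
```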

Data Types Classification

Numerical Data

Continuous Variables:
  • Can take any value in a range
  • Examples: age, salary, temperature
  • Mathematical operations meaningful
Discrete Variables:
  • Countable, distinct values
  • Examples: number of children, cars owned
  • Often integers
\begin{methodblock}{Mathematical Representation} For numerical variable $X$: $$X \in \mathbb{R} \text{ (continuous)} \text{ or } X \in \mathbb{Z} \text{ (discrete)}$$ \end{methodblock}

Categorical Data

Nominal Variables:
  • No natural ordering
  • Examples: color, gender, city
  • Cannot perform arithmetic
Ordinal Variables:
  • Natural ordering exists
  • Examples: education level, rating
  • Ranking meaningful, differences may not be
\begin{methodblock}{Mathematical Representation} For categorical variable $C$: $$C \in \{\text{category}_1, \text{category}_2, \dots, \text{category}_k\}$$ \end{methodblock}
\begin{tipblock}{Practical Tip} Encoding Strategy: Numerical $\rightarrow$ Keep as-is; Nominal $\rightarrow$ One-hot encoding; Ordinal $\rightarrow$ Label encoding \end{tipblock}
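The encoding strategy above can be sketched in pandas; the integer mapping for `Pclass` is an illustrative choice, not part of the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female"],   # nominal: no natural order
    "Pclass": [3, 1, 2],                   # ordinal: 1st > 2nd > 3rd
})

# Nominal -> one-hot encoding (one indicator column per category)
onehot = pd.get_dummies(df["Sex"], prefix="Sex")

# Ordinal -> integer codes that preserve the ranking
# (illustrative mapping: higher code = better cabin class)
df["Pclass_ord"] = df["Pclass"].map({1: 3, 2: 2, 3: 1})
```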

Sample Dataset: Titanic Survival Analysis

Dataset Overview

Contains information on 891 passengers aboard the Titanic. Goal: Predict passenger survival based on their attributes.
\tiny \begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline PassengerId & Survived & Pclass & Sex & Age & SibSp & Parch & Fare & Embarked \\ \hline 1 & 0 & 3 & male & 22.0 & 1 & 0 & 7.25 & S \\ 2 & 1 & 1 & female & 38.0 & 1 & 0 & 71.28 & C \\ 3 & 1 & 3 & female & 26.0 & 0 & 0 & 7.92 & S \\ 4 & 1 & 1 & female & 35.0 & 1 & 0 & 53.10 & S \\ 5 & 0 & 3 & male & 35.0 & 0 & 0 & 8.05 & S \\ \hline \end{tabular}

Numerical Features

  • Age: Continuous (0-80)
  • Fare: Continuous (0-512)
  • SibSp: Discrete count
  • Parch: Discrete count
\begin{methodblock}{Categorical Features}
  • Sex: Nominal (M/F)
  • Embarked: Nominal (C/Q/S)
  • Pclass: Ordinal (1st, 2nd, 3rd)
\end{methodblock}
\begin{tipblock}{Target Variable}
  • Survived: Binary (0/1)
  • Classification Problem
  • 38.4\% survival rate
\end{tipblock}

Sample Dataset: Iris Species Classification

Dataset Overview

Classic dataset with 150 iris flowers from 3 species. Goal: Classify species based on flower measurements.
\tiny \begin{tabular}{|c|c|c|c|c|} \hline Sepal Length & Sepal Width & Petal Length & Petal Width & Species \\ \hline 5.1 & 3.5 & 1.4 & 0.2 & setosa \\ 4.9 & 3.0 & 1.4 & 0.2 & setosa \\ 6.2 & 2.9 & 4.3 & 1.3 & versicolor \\ 5.9 & 3.0 & 5.1 & 1.8 & virginica \\ 6.4 & 2.8 & 5.6 & 2.2 & virginica \\ \hline \end{tabular}

Numerical Features

  • Sepal Length: Continuous (4.3-7.9 cm)
  • Sepal Width: Continuous (2.0-4.4 cm)
  • Petal Length: Continuous (1.0-6.9 cm)
  • Petal Width: Continuous (0.1-2.5 cm)
\begin{tipblock}{Target Variable}
  • Species: 3-class categorical
  • Classes: setosa, versicolor, virginica
  • Balanced: 50 samples per class
  • Clean: No missing values
\end{tipblock}
\begin{methodblock}{EDA Advantages} Perfect for Learning: Small size, clean data, clear patterns, well-separated classes, interpretable features \end{methodblock}

Iris Dataset: Advanced Visualization Techniques

Correlogram Analysis

\begin{tikzpicture}[scale=0.8] % Draw correlation matrix \draw[fill=blue!30] (0,0) rectangle (1,1); \draw[fill=red!50] (1,0) rectangle (2,1); \draw[fill=red!70] (2,0) rectangle (3,1); \draw[fill=red!60] (3,0) rectangle (4,1); \draw[fill=red!50] (0,1) rectangle (1,2); \draw[fill=blue!30] (1,1) rectangle (2,2); \draw[fill=blue!20] (2,1) rectangle (3,2); \draw[fill=blue!10] (3,1) rectangle (4,2); \draw[fill=red!70] (0,2) rectangle (1,3); \draw[fill=blue!20] (1,2) rectangle (2,3); \draw[fill=blue!30] (2,2) rectangle (3,3); \draw[fill=red!90] (3,2) rectangle (4,3); \draw[fill=red!60] (0,3) rectangle (1,4); \draw[fill=blue!10] (1,3) rectangle (2,4); \draw[fill=red!90] (2,3) rectangle (3,4); \draw[fill=blue!30] (3,3) rectangle (4,4); % Labels \node at (0.5,-0.3) {\tiny Sep.L}; \node at (1.5,-0.3) {\tiny Sep.W}; \node at (2.5,-0.3) {\tiny Pet.L}; \node at (3.5,-0.3) {\tiny Pet.W}; \node at (-0.3,0.5) {\tiny Sep.L}; \node at (-0.3,1.5) {\tiny Sep.W}; \node at (-0.3,2.5) {\tiny Pet.L}; \node at (-0.3,3.5) {\tiny Pet.W}; % Correlation values \node at (2.5,0.5) {\tiny 0.96}; \node at (3.5,2.5) {\tiny 0.96}; \node at (1.5,0.5) {\tiny -0.12}; \end{tikzpicture} Strong correlation: Petal length ↔ Petal width (r=0.96)

Box Plot Insights

\begin{tikzpicture}[scale=0.7] % Three box plots for species \foreach \x/\species/\color in {1/setosa/green, 2.5/versicolor/blue, 4/virginica/red} { % Box plot structure \draw[thick, \color] (\x-0.3,0.5) rectangle (\x+0.3,1.5); \draw[thick, \color] (\x-0.3,1) -- (\x+0.3,1); \draw[thick, \color] (\x,0.2) -- (\x,0.5); \draw[thick, \color] (\x,1.5) -- (\x,1.8); % Species labels \node at (\x,-0.2) {\tiny \species}; % Add some outliers for virginica \ifnum\x=4 \fill[\color] (\x,2.2) circle (0.05); \fill[\color] (\x,0.1) circle (0.05); \fi } \node at (2.5,2.5) {\small Petal Length by Species}; \end{tikzpicture} Clear separation: Species distinguishable by petal features

Violin Plot Analysis

\begin{tikzpicture}[scale=0.9] % Draw three violin plots \foreach \x/\species/\color in {1.5/setosa/green!60, 3.5/versicolor/blue!60, 5.5/virginica/red!60} { % Violin shape (ellipse approximation) \fill[\color, opacity=0.7] (\x,0.5) ellipse (0.4 and 0.8); \draw[thick] (\x,0.5) ellipse (0.4 and 0.8); % Center line \draw[thick, black] (\x-0.1,0.5) -- (\x+0.1,0.5); % Species labels \node at (\x,-0.1) {\small \species}; } % Axis labels \node at (0.5,0.5) [rotate=90] {\small Sepal Width}; \node at (3.5,-0.5) {\small Species}; % Title \node at (3.5,1.8) {Distribution Shape \& Density by Species}; \end{tikzpicture} \begin{methodblock}{Violin Plot Advantages} Combines: Box plot summary statistics + density estimation + distribution shape visualization \end{methodblock}

Univariate Analysis - Numerical Variables

[Figure: ../figures/02_univariate_numerical.png]

Key Observations

Age: Right-skewed, missing values; Fare: Heavy right tail, potential outliers

Statistical Measures for Numerical Data

Central Tendency

For variable $X = \{x_1, x_2, \dots, x_n\}$: Mean (Arithmetic): $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ Median: Middle value when sorted $$\text{Median} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\ \frac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ even} \end{cases}$$ Mode: Most frequent value

Dispersion Measures

Variance: $$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$ Standard Deviation: $$\sigma = \sqrt{\sigma^2}$$ Interquartile Range: $$\text{IQR} = Q_3 - Q_1$$ Range: $$\text{Range} = x_{\max} - x_{\min}$$
\begin{tipblock}{Practical Guidelines} Skewed data: Use median \& IQR; Normal data: Use mean \& standard deviation \end{tipblock}
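A quick NumPy sketch of these measures on a small right-skewed toy sample, where the mean-median gap illustrates the guideline above:

```python
import numpy as np

x = np.array([2.0, 3.0, 3.0, 4.0, 10.0])   # toy right-skewed sample

mean = x.mean()
median = np.median(x)                        # robust to the 10.0
var = x.var(ddof=1)                          # sample variance, n-1 denominator
std = x.std(ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                # robust dispersion measure
data_range = x.max() - x.min()
```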

Univariate Analysis - Categorical Variables

[Figure: ../figures/03_univariate_categorical.png]

Key Insights

Gender imbalance: 65\% male passengers; Class distribution: 55\% third class; Embarkation: 72\% from Southampton

Statistical Measures for Categorical Data

Frequency Analysis

For categorical variable $C$ with categories $\{c_1, c_2, \dots, c_k\}$: Frequency: $$f_i = \text{count}(C = c_i)$$ Relative Frequency: $$p_i = \frac{f_i}{n} \text{ where } \sum_{i=1}^{k} p_i = 1$$ Mode: Category with highest frequency $$\text{Mode} = \arg\max_{c_i} f_i$$
\begin{methodblock}{Entropy Measure} Information content: $$H(C) = -\sum_{i=1}^{k} p_i \log_2(p_i)$$ Higher entropy = more uniform distribution \end{methodblock}
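Frequency, mode, and entropy can be computed with the standard library alone; the values below are toy `Embarked`-style codes:

```python
import math
from collections import Counter

embarked = ["S", "S", "C", "S", "Q", "C", "S"]   # toy categorical column

counts = Counter(embarked)                        # absolute frequencies f_i
n = len(embarked)
probs = {c: f / n for c, f in counts.items()}     # relative frequencies p_i
mode = counts.most_common(1)[0][0]                # argmax of frequency

# Shannon entropy: higher = more uniform distribution
entropy = -sum(p * math.log2(p) for p in probs.values())
```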

Visualization Guidelines

Bar Charts:
  • Best for comparing categories
  • Order by frequency for impact
  • Use consistent colors
Pie Charts:
  • Good for showing proportions
  • Limit to $\leq 5$ categories
  • Start largest slice at 12 o'clock
\begin{exampleblock}{Chi-Square Test} Test for uniform distribution: $$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$ where $O_i$ = observed, $E_i$ = expected \end{exampleblock}

Distribution Analysis \& Normality Testing

Common Distributions

Normal Distribution: $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ Log-Normal Distribution: $$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}$$ Skewness: $$\text{Skew} = \frac{E[(X-\mu)^3]}{\sigma^3}$$

Normality Tests

Shapiro-Wilk Test: $$W = \frac{(\sum_{i=1}^n a_i x_{(i)})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$ Kolmogorov-Smirnov Test: $$D = \sup_x |F_n(x) - F(x)|$$ Anderson-Darling Test: More sensitive to tail deviations

Decision Rule

If $p < 0.05$: Reject normality assumption; Consider transformations (log, square root, Box-Cox)
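A scipy sketch of the decision rule on synthetic data: a lognormal sample fails Shapiro-Wilk, and a log transform is the usual first remedy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0, sigma=1, size=200)   # clearly non-normal

w, p = stats.shapiro(skewed)
reject_normality = p < 0.05          # decision rule from the slide

# Candidate fix: log transform, then re-test
w_log, p_log = stats.shapiro(np.log(skewed))
```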

Correlation Analysis

[Figure: ../figures/04_correlation_analysis.png]

Correlation Insights

Moderate correlation: SibSp-Parch (0.41); Weak correlations: Fare-Survival (0.26), Age-Survival ($-0.07$)

Correlation Coefficients \& Interpretation

Pearson Correlation

For linear relationships: $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$ Range: $r \in [-1, 1]$
  • $r = 1$: Perfect positive correlation
  • $r = 0$: No linear correlation
  • $r = -1$: Perfect negative correlation
\begin{methodblock}{Significance Test} $$t = r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}$$ \end{methodblock}

Non-Linear Correlations

Spearman Rank Correlation: $$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$ where $d_i$ = rank difference. Kendall's Tau: $$\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}$$ where $n_c$ = concordant pairs, $n_d$ = discordant pairs
\begin{exampleblock}{Interpretation Guide}
  • $|r| < 0.3$: Weak relationship
  • $0.3 \leq |r| < 0.7$: Moderate relationship
  • $|r| \geq 0.7$: Strong relationship
\end{exampleblock}
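The three coefficients can be compared on a monotonic but non-linear toy relationship, where the rank-based measures reach 1 while Pearson stays below it:

```python
import numpy as np
from scipy import stats

x = np.arange(1, 9, dtype=float)
y = x ** 2                      # monotonic but non-linear

r, p_r = stats.pearsonr(x, y)       # linear correlation: high but < 1
rho, p_rho = stats.spearmanr(x, y)  # rank correlation: exactly 1
tau, p_tau = stats.kendalltau(x, y) # concordance: exactly 1
```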

Remember

Correlation $\neq$ Causation! Always investigate the underlying mechanisms.

Bivariate Analysis - Feature Relationships

[Figure: ../figures/10_bivariate_analysis.png]

Key Patterns

Gender effect: Women had 74\% survival rate vs men 19\%; Class effect: 1st class 63\% vs 3rd class 24\%

Cross-Tabulation \& Contingency Tables

Contingency Table

For categorical variables $A$ and $B$: \small \begin{tabular}{|c|c|c|c|} \hline & $B_1$ & $B_2$ & Total \\ \hline $A_1$ & $n_{11}$ & $n_{12}$ & $n_{1.}$ \\ $A_2$ & $n_{21}$ & $n_{22}$ & $n_{2.}$ \\ \hline Total & $n_{.1}$ & $n_{.2}$ & $n$ \\ \hline \end{tabular} Joint Probability: $$P(A_i, B_j) = \frac{n_{ij}}{n}$$ Marginal Probability: $$P(A_i) = \frac{n_{i.}}{n}, P(B_j) = \frac{n_{.j}}{n}$$

Independence Test

Chi-Square Test: $$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$ where $E_{ij} = \frac{n_{i.} \times n_{.j}}{n}$ Degrees of freedom: $$df = (r-1)(c-1)$$ Cramér's V (Effect Size): $$V = \sqrt{\frac{\chi^2}{n \times \min(r-1, c-1)}}$$
\begin{tipblock}{Interpretation} $V \in [0,1]$: 0 = no association, 1 = perfect association \end{tipblock}
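A scipy sketch of the independence test plus effect size; the 2x2 counts are hypothetical values loosely echoing a sex-by-survival split:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = sex, cols = (died, survived)
table = np.array([[468, 109],
                  [ 81, 233]])

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V effect size in [0, 1]
n = table.sum()
r, c = table.shape
cramers_v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
```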

Missing Data Analysis

[Figure: ../figures/05_missing_data_analysis.png]

Missing Data Impact

Age: 20\% missing values; Pattern: May not be random - could be related to passenger class or survival

Types of Missing Data

MCAR: Missing Completely at Random

Definition: Missing data is independent of both observed and unobserved data. Mathematical condition: $$P(\text{Missing} | X, Y) = P(\text{Missing})$$ Example: Survey responses lost due to mail delivery issues. Implication: Can use any imputation method without bias.
\begin{methodblock}{Test for MCAR} Little's MCAR Test: Tests null hypothesis that data is MCAR using EM algorithm. \end{methodblock}

MAR: Missing at Random

Definition: Missingness depends on observed data, but not on the missing values themselves. Mathematical condition: $$P(\text{Missing} | X, Y) = P(\text{Missing} | X)$$ Example: Age is more often missing for third-class passengers (class is observed). MNAR: Missing Not at Random: missingness depends on the unobserved value itself. Example: High-income individuals not reporting income.

Handling Strategy

MAR: Multiple imputation; MNAR: Domain expertise required

Imputation Strategies

Simple Imputation

Mean/Mode Imputation: $$x_{\text{missing}} = \bar{x} \text{ or } \text{Mode}(x)$$ Median Imputation: $$x_{\text{missing}} = \text{Median}(x)$$ Forward/Backward Fill: for time-series data. Constant Value: domain-specific constant (e.g., 0, $-1$)

Limitations

Simple methods reduce variance and can introduce bias

Advanced Imputation

KNN Imputation: $$x_{\text{missing}} = \frac{1}{k}\sum_{i \in \text{k-nearest}} x_i$$ Multiple Imputation: Creates multiple complete datasets, analyzes each, pools results. Model-based: - Linear regression - Random Forest - Deep learning approaches
\begin{exampleblock}{Best Practice} Always analyze missing data pattern before choosing imputation method \end{exampleblock}
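A short sketch of both tiers on a toy frame (values are made up): median fill for the skewed column, then KNN imputation borrowing from similar rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Age":  [22.0, 38.0, np.nan, 35.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 7.88],
})

# Simple: median imputation (preferred over mean for right-skewed Age)
median_age = df["Age"].median()
df["Age_median"] = df["Age"].fillna(median_age)

# Advanced: KNN imputation -- fills gaps from the k most similar rows
filled = KNNImputer(n_neighbors=2).fit_transform(df[["Age", "Fare"]])
```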

Outlier Detection Methods

[Figure: ../figures/06_outlier_detection.png]

Outlier Findings

Fare: 20 outliers detected using IQR method; Age: Few extreme values at high ages

Statistical Outlier Detection Methods

IQR Method

Interquartile Range: $$\text{IQR} = Q_3 - Q_1$$ Outlier bounds: $$\text{Lower bound} = Q_1 - 1.5 \times \text{IQR}$$ $$\text{Upper bound} = Q_3 + 1.5 \times \text{IQR}$$ Outlier condition: $$x < \text{Lower bound} \text{ or } x > \text{Upper bound}$$
\begin{methodblock}{Modified Z-Score} $$M_i = \frac{0.6745(x_i - \text{median})}{\text{MAD}}$$ where MAD = median absolute deviation \end{methodblock}

Z-Score Method

Standard Z-score: $$z_i = \frac{x_i - \bar{x}}{\sigma}$$ Outlier threshold: $$|z_i| > 2.5 \text{ or } |z_i| > 3$$ Limitation: Sensitive to outliers in mean and std calculation
\begin{exampleblock}{Isolation Forest} Anomaly Score: $$s(x,n) = 2^{-\frac{E(h(x))}{c(n)}}$$ where $E(h(x))$ = average path length of $x$ over the isolation trees, $c(n)$ = average path length of an unsuccessful search in a binary search tree with $n$ points \end{exampleblock}
\begin{tipblock}{Decision Framework} Normal distribution: Z-score; Skewed distribution: IQR; Multivariate: Isolation Forest \end{tipblock}
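The IQR method in NumPy on a toy fare-like sample, flagging values beyond the $1.5 \times \text{IQR}$ fences:

```python
import numpy as np

fare = np.array([7.2, 8.0, 7.9, 13.0, 26.0, 8.1, 7.8, 512.3, 15.5, 9.0])

q1, q3 = np.percentile(fare, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr                 # lower fence
upper = q3 + 1.5 * iqr                 # upper fence
outliers = fare[(fare < lower) | (fare > upper)]
```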

Multivariate Outlier Detection

Mahalanobis Distance

For multivariate data $\mathbf{x} \in \mathbb{R}^p$: $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$ where $\boldsymbol{\mu}$ = sample mean vector and $\boldsymbol{\Sigma}$ = sample covariance matrix. Outlier threshold: $$D_M(\mathbf{x}) > \sqrt{\chi^2_{p,\alpha}}$$
\begin{methodblock}{Cook's Distance} Measures influence of each observation on regression: $$D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p \times \text{MSE}}$$ \end{methodblock}

Local Outlier Factor (LOF)

Local Reachability Density: $$\text{lrd}_k(A) = \left(\frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A,B)}{|N_k(A)|}\right)^{-1}$$ LOF Score: $$\text{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\text{lrd}_k(B)}{\text{lrd}_k(A)}}{|N_k(A)|}$$ Interpretation: LOF $\approx 1$: normal point; LOF $\gg 1$: outlier

Key Insight

Multivariate outliers may not be outliers in any single dimension
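A sketch of Mahalanobis-based detection on synthetic correlated data: the appended point is unremarkable in each coordinate but far off the correlation structure, exactly the case univariate rules miss:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [3.0, -3.0]])   # ordinary in each margin, off the trend

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
# Squared Mahalanobis distance for every row
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

threshold = chi2.ppf(0.999, df=2)   # chi-square cutoff, p = 2 dimensions
is_outlier = d2 > threshold
```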

Feature Engineering Examples

[Figure: ../figures/07_feature_engineering.png]

Engineering Insights

Age binning: Creates interpretable groups; Family size: Combines multiple features; Title extraction: Captures social status

Feature Creation Techniques

Binning \& Discretization

Equal-width binning: $$\text{bin width} = \frac{x_{\max} - x_{\min}}{k}$$ Equal-frequency binning: Each bin contains $\frac{n}{k}$ observations. Quantile-based binning: Based on percentiles (quartiles, deciles). Domain-specific binning: Using expert knowledge (e.g., age groups)
\begin{methodblock}{Optimal Binning} Use information gain or chi-square test to determine optimal bin boundaries \end{methodblock}
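The binning strategies map to pandas as `cut`, `qcut`, and an explicit edge list; the age-group boundaries below are an illustrative domain choice:

```python
import pandas as pd

age = pd.Series([4, 15, 22, 29, 35, 41, 52, 63, 70, 8])   # toy ages

# Equal-width: four bins of identical width
width_bins = pd.cut(age, bins=4)

# Equal-frequency: four quantile-based bins with ~n/4 members each
freq_bins = pd.qcut(age, q=4)

# Domain-specific: hypothetical expert-chosen boundaries
groups = pd.cut(age, bins=[0, 12, 18, 60, 100],
                labels=["child", "teen", "adult", "senior"])
```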

Feature Combinations

Arithmetic Operations:
  • Addition: $x_1 + x_2$ (total family size)
  • Multiplication: $x_1 \times x_2$ (interaction terms)
  • Division: $x_1 / x_2$ (ratios, rates)
Boolean Operations:
  • Logical AND: $x_1 \land x_2$
  • Logical OR: $x_1 \lor x_2$
  • Conditional: 1 if $x_1 > \text{threshold}$, else 0
String Operations:
  • Length: $\text{len}(\text{string})$
  • Contains: pattern matching
  • Extract: regular expressions
\begin{exampleblock}{Feature Engineering Guidelines} Domain Knowledge: Most important factor; Iterative Process: Create, test, refine; Validation: Always validate on holdout set \end{exampleblock}

Mathematical Transformations

Power Transformations

Log Transformation: $$y = \log(x + c)$$ Reduces right skewness, stabilizes variance. Square Root: $$y = \sqrt{x}$$ Moderate variance stabilization. Box-Cox Transformation: $$y = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}$$ Optimal $\lambda$ found via maximum likelihood

Trigonometric Features

For cyclical data (time, angles): Sine/Cosine encoding: $$\sin\left(\frac{2\pi \times \text{value}}{\text{max\_value}}\right)$$ $$\cos\left(\frac{2\pi \times \text{value}}{\text{max\_value}}\right)$$ Example for hour of day: $$\text{hour\_sin} = \sin\left(\frac{2\pi \times \text{hour}}{24}\right)$$ $$\text{hour\_cos} = \cos\left(\frac{2\pi \times \text{hour}}{24}\right)$$
\begin{tipblock}{When to Transform} Transform when: skewed data, non-linear relationships, or specific model requirements \end{tipblock}
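A sketch of the transforms on synthetic right-skewed data, plus the cyclical hour encoding; `log1p` is used so zeros would be handled:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
fare = rng.lognormal(mean=3, sigma=1, size=500)   # positive, right-skewed

log_fare = np.log1p(fare)                 # log(x + 1)
boxcox_fare, lam = stats.boxcox(fare)     # lambda fit by maximum likelihood

# Cyclical (sine/cosine) encoding for hour-of-day
hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```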

Feature Selection Methods

[Figure: ../figures/09_feature_selection.png]

Selection Results

Statistical: Gender and fare most important; Tree-based: Consistent with domain knowledge about survival factors

Statistical Feature Selection

Filter Methods

Correlation-based: Select features with high correlation to target, low correlation to each other F-test (ANOVA): $$F = \frac{\text{MSB}}{\text{MSW}} = \frac{\sum_{i=1}^k n_i(\bar{x}_i - \bar{x})^2/(k-1)}{\sum_{i=1}^k \sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2/(n-k)}$$ Chi-square test: $$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$ Mutual Information: $$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$$

Wrapper Methods

Forward Selection: Start with empty set, add best features iteratively Backward Elimination: Start with all features, remove worst iteratively Recursive Feature Elimination: $$\text{rank}_i = f(\text{coef}_i, \text{importance}_i)$$ Genetic Algorithm: Evolutionary approach to feature subset selection
\begin{methodblock}{Embedded Methods} L1 Regularization (Lasso): $$\min_\beta \frac{1}{2n}||y - X\beta||^2_2 + \lambda||\beta||_1$$ \end{methodblock}
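Filter and embedded methods side by side on synthetic data with one informative feature and three noise features; the Lasso alpha is an arbitrary illustration:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=(n, 3))
y = (informative + 0.3 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([informative, noise])   # feature 0 drives the target

F, pvals = f_classif(X, y)                  # filter: ANOVA F-test per feature
mi = mutual_info_classif(X, y, random_state=0)  # filter: mutual information
lasso = Lasso(alpha=0.1).fit(X, y)          # embedded: L1 zeroes weak features
```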

Tree-based Feature Importance

Random Forest Importance

Mean Decrease Impurity: $$\text{Importance}(x_j) = \frac{1}{T}\sum_{t=1}^T \sum_{v \in \text{splits}} p(v) \times \Delta I(v)$$ where: - $T$ = number of trees - $p(v)$ = proportion of samples reaching node $v$ - $\Delta I(v)$ = impurity decrease at node $v$ Mean Decrease Accuracy: Permutation-based importance measuring prediction accuracy drop when feature is shuffled

Gradient Boosting Importance

Gain-based Importance: $$\text{Importance}(x_j) = \sum_{t=1}^T \sum_{v \in \text{splits}_j} \text{gain}(v)$$ SHAP Values: Shapley Additive exPlanations provide unified measure: $$\phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(|N|-|S|-1)!}{|N|!}[f(S \cup \{j\}) - f(S)]$$

Caution

Tree-based importance can be biased toward high-cardinality categorical features

Normalization Comparison

[Figure: ../figures/08_normalization_comparison.png]

Normalization Effects

StandardScaler: Zero mean, unit variance; MinMaxScaler: [0,1] range; RobustScaler: Median-based, outlier resistant

Scaling Methods Mathematical Formulations

Standard Scaling (Z-score)

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$ where $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)^2}$ Properties: - Mean = 0, Std = 1 - Preserves distribution shape - Sensitive to outliers

Min-Max Scaling

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$ Properties: - Range: [0, 1] - Preserves relationships - Very sensitive to outliers

Robust Scaling

$$x_{\text{scaled}} = \frac{x - \text{Median}(x)}{\text{IQR}(x)}$$ where $\text{IQR} = Q_3 - Q_1$ Properties: - Median-centered - Uses interquartile range - Robust to outliers

Unit Vector Scaling

$$x_{\text{scaled}} = \frac{x}{||x||_2}$$ where $||x||_2 = \sqrt{\sum_{i=1}^n x_i^2}$ Use case: When magnitude matters more than individual values
\begin{tipblock}{Selection Guide} Normal data + no outliers: StandardScaler; Bounded range needed: MinMaxScaler; Outliers present: RobustScaler \end{tipblock}
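The three scalers applied to a toy fare column with one extreme value, matching the properties listed above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

fare = np.array([[7.25], [7.92], [8.05], [13.0], [26.0], [512.33]])  # outlier last

std = StandardScaler().fit_transform(fare)   # mean 0, (population) std 1
mm = MinMaxScaler().fit_transform(fare)      # squeezed into [0, 1]
rob = RobustScaler().fit_transform(fare)     # median 0, scaled by IQR
```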

When \& Why to Normalize

Algorithms Requiring Normalization

Distance-based:
  • k-NN, k-means clustering
  • SVM with RBF kernel
  • Neural networks
Gradient-based:
  • Logistic regression
  • Linear regression with regularization
  • Deep learning
Mathematical justification: Features with larger scales dominate distance calculations: $$d = \sqrt{\sum_{i=1}^p (x_i - y_i)^2}$$

Algorithms Not Requiring Normalization

Tree-based methods:
  • Decision trees
  • Random Forest
  • Gradient boosting
Reason: Trees use split points, not absolute values
Other scale-invariant methods:
  • Naive Bayes (probability-based)
  • Association rules (rule-based)
Feature scales don't affect splitting decisions or probability calculations

Critical Rule

Always fit scaler on training data only! Apply same transformation to validation/test sets to avoid data leakage.
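The leakage-safe pattern is fit on train, transform on both; a minimal sklearn sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)   # toy feature column
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)   # statistics learned on train ONLY
X_test_s = scaler.transform(X_test)         # same statistics reused -- no refit
```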

Target Variable Analysis

[Figure: ../figures/11_target_analysis.png]

Target Insights

Class imbalance: 62\% non-survival; Gender-class interaction: First-class women had 97\% survival rate

Business Insights from EDA

[Figure: ../figures/12_business_insights.png]

Actionable Insights

Revenue impact: Higher-paying passengers had better survival rates; Port differences: Embarkation port correlates with survival

Advanced Visualization Techniques

Dimensionality Reduction

Principal Component Analysis: $$\mathbf{Y} = \mathbf{XW}$$ where $\mathbf{W}$ contains eigenvectors of the covariance matrix. t-SNE: $$p_{j|i} = \frac{\exp(-||\mathbf{x}_i - \mathbf{x}_j||^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-||\mathbf{x}_i - \mathbf{x}_k||^2/2\sigma_i^2)}$$ UMAP (Uniform Manifold Approximation):
  • Preserves local and global structure
  • Faster than t-SNE
  • Better for clustering visualization

Interactive Visualizations

Plotly Benefits:
  • Zoom, pan, hover information
  • 3D scatter plots
  • Animated visualizations
  • Dashboard creation
Parallel Coordinates: Visualize high-dimensional data relationships
Sankey Diagrams: Show flow between categorical variables
Radar Charts: Compare multiple features simultaneously
\begin{tipblock}{Best Practice} Progressive Disclosure: Start with simple plots, add complexity as needed for deeper insights \end{tipblock}

Time Series EDA Considerations

Time Series Components

Decomposition: $$y(t) = \text{Trend}(t) + \text{Seasonal}(t) + \text{Noise}(t)$$ Stationarity Testing: Augmented Dickey-Fuller test: $$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \epsilon_t$$ Autocorrelation: $$\rho_k = \frac{\text{Cov}(y_t, y_{t-k})}{\text{Var}(y_t)}$$

Seasonal Analysis

Seasonal Decomposition: - STL (Seasonal and Trend decomposition using Loess) - X-12-ARIMA - Classical decomposition Periodogram: $$I(\omega) = \frac{1}{n}\left|\sum_{t=1}^n y_t e^{-i\omega t}\right|^2$$ Box-Cox for stabilization: Handle changing variance over time

Time Series EDA Goals

Identify trends, seasonality, outliers, structural breaks, and appropriate transformation needs

EDA to ML Pipeline Integration

[Figure: ../figures/13_ml_pipeline_demo.png]

Pipeline Success

EDA insights validated: Gender and class are top predictors; Model performance: 85\% accuracy achieved through proper preprocessing

From EDA to Model Development

EDA-Informed Decisions

Feature Engineering:
  • Age binning based on distribution analysis
  • Family size creation from SibSp + Parch
  • Title extraction from name patterns
Preprocessing Choices:
  • Median imputation for age (right-skewed)
  • StandardScaler for fare (wide range)
  • One-hot encoding for categorical variables
Model Selection:
  • Random Forest chosen for mixed data types
  • Handles non-linear relationships
  • Robust to outliers (detected in EDA)

Validation Strategy

Cross-Validation Design: Based on data size (891 samples) $\rightarrow$ 5-fold CV
Stratification: Maintain class balance (38.4\% survival rate)
Performance Metrics:
  • Accuracy: Overall performance
  • Precision/Recall: Handle class imbalance
  • F1-Score: Balanced measure
  • AUC-ROC: Threshold-independent
Feature Importance Validation: EDA findings confirmed by model:
1. Sex (gender) - highest importance
2. Fare - economic status indicator
3. Age - demographic factor

EDA Best Practices \& Common Pitfalls

Common Pitfalls

Data Leakage:
  • Using future information
  • Target leakage in features
  • Scaling on the entire dataset
Confirmation Bias:
  • Looking only for expected patterns
  • Ignoring contradictory evidence
  • Over-interpreting correlations
Statistical Errors:
  • Multiple testing without correction
  • Assuming causation from correlation
  • Ignoring sample size effects
\begin{exampleblock}{Best Practices} Systematic Approach:
  • Follow a structured EDA workflow
  • Document all findings and decisions
  • Version control EDA notebooks
Statistical Rigor:
  • Apply multiple testing corrections
  • Use appropriate statistical tests
  • Report confidence intervals
Reproducibility:
  • Set random seeds
  • Save preprocessing parameters
  • Create reusable functions
Communication:
  • Clear visualizations
  • Executive summaries
  • Actionable recommendations
\end{exampleblock}

EDA Checklist \& Quality Assurance

Data Quality Checklist

  • [$\checkmark$] Completeness: Missing value analysis
  • [$\checkmark$] Accuracy: Outlier detection \& validation
  • [$\checkmark$] Consistency: Data type verification
  • [$\checkmark$] Uniqueness: Duplicate detection
  • [$\checkmark$] Validity: Range \& format checking
  • [$\checkmark$] Timeliness: Temporal analysis
\begin{methodblock}{Statistical Validation}
  • [$\checkmark$] Distribution testing
  • [$\checkmark$] Correlation significance tests
  • [$\checkmark$] Independence assumptions
  • [$\checkmark$] Sample size adequacy
\end{methodblock}
\begin{tipblock}{Visualization Checklist}
  • [$\checkmark$] Clarity: Clear labels \& legends
  • [$\checkmark$] Completeness: All data represented
  • [$\checkmark$] Accuracy: Correct scales \& axes
  • [$\checkmark$] Aesthetics: Professional appearance
  • [$\checkmark$] Accessibility: Color-blind friendly
  • [$\checkmark$] Context: Meaningful titles \& captions
\end{tipblock}

Documentation Standards

  • [$\checkmark$] Data source \& collection methods
  • [$\checkmark$] Preprocessing steps \& rationale
  • [$\checkmark$] Key findings \& insights
  • [$\checkmark$] Limitations \& assumptions
  • [$\checkmark$] Next steps \& recommendations

Summary: Key Takeaways

Core EDA Principles

1. Systematic Approach:
  • Start with data overview
  • Progress from simple to complex
  • Document everything
2. Statistical Rigor:
  • Use appropriate tests
  • Check assumptions
  • Report confidence intervals
3. Visual Communication:
  • Clear, interpretable plots
  • Multiple visualization types
  • Story-driven presentation
\begin{exampleblock}{Practical Impact} Model Performance - 15-30\% improvement typical - Better feature selection - Reduced overfitting Business Value - Actionable insights - Risk identification - Decision support Efficiency Gains - 40-60\% time savings - Focused modeling efforts - Reduced iterations \end{exampleblock}
\begin{tikzpicture}[scale=0.8] \node[rectangle, draw, fill=datacolor!20, text width=2.5cm, text centered] (eda) {Quality EDA}; \node[rectangle, draw, fill=featurecolor!20, text width=2.5cm, text centered, right=1cm of eda] (model) {Better Models}; \node[rectangle, draw, fill=trendcolor!20, text width=2.5cm, text centered, right=1cm of model] (business) {Business Value}; \draw[process arrow, thick] (eda) -- (model); \draw[process arrow, thick] (model) -- (business); \end{tikzpicture}

Next Steps: Advanced EDA Topics

Advanced Techniques

Automated EDA: pandas-profiling, sweetviz, autoviz
Big Data EDA: sampling strategies, distributed computing, stream processing
Domain-Specific EDA: text data analysis, image data exploration, time series deep-dive

Integration Topics

MLOps Integration: automated data quality checks, feature store management, drift detection
Causal Inference: confounding variable identification, causal graph construction, treatment effect analysis
Ethics \& Fairness: bias detection in data, fairness metrics, responsible AI practices
\begin{tipblock}{Learning Path} Practice: Apply EDA to diverse datasets; Study: Read domain literature; Share: Present findings to stakeholders \end{tipblock}

Resources \& Further Reading

Essential Books

"Exploratory Data Analysis" - John Tukey \\[2pt] The foundational text for EDA principles
"Python for Data Analysis" - Wes McKinney \\[2pt] Practical pandas-based EDA
"The Elements of Statistical Learning" - Hastie, Tibshirani, Friedman \\[2pt] Statistical foundations
"Fundamentals of Data Visualization" - Claus Wilke \\[2pt] Visualization best practices

Online Resources

Python Libraries: pandas, seaborn, matplotlib; plotly, bokeh (interactive); scipy, statsmodels (statistics)
R Libraries: ggplot2, dplyr; corrplot, VIM; DataExplorer, dlookr
Courses: Coursera (EDA with Python), edX (Data Science MicroMasters), Kaggle Learn (Data Visualization)
Questions \& Discussion \\[0.2cm] "The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey

End of Module 04

Exploratory Data Analysis (EDA)

Questions?