Descriptive Statistics and Summary Visualizations
Exploratory Data Analysis, or EDA, begins with understanding the basic structure, distribution, and behaviour of data. Descriptive statistics and summary visualizations help us quickly identify patterns, unusual values, missing data, skewness, class imbalance, and possible modelling challenges.
Before building a predictive model, we must first ask: What does the data look like? How are the variables distributed? Are there outliers? Are categories balanced? Are relationships visible? Descriptive statistics and visualizations answer these questions.
What Are Descriptive Statistics?
Descriptive statistics summarize the main characteristics of a dataset using numbers. Instead of looking at thousands or millions of rows one by one, we use statistics such as mean, median, standard deviation, minimum, maximum, and percentiles to understand the data quickly.
In predictive analytics, descriptive statistics help us understand data quality, variable behaviour, feature distribution, and possible preprocessing needs.
Core Idea: Descriptive statistics do not build the model directly, but they help us understand the data before modelling decisions are made.
Why Summary Statistics Matter for Predictive Modelling
Main Categories of Descriptive Statistics
Descriptive statistics can be grouped into three major categories: measures of central tendency, measures of spread, and measures of shape or position.
| Category | Statistics | What It Tells Us | Modelling Relevance |
|---|---|---|---|
| Central Tendency | Mean, Median, Mode | Typical or representative value of a variable. | Useful for imputation, comparison, and understanding average behaviour. |
| Spread (Variability) | Range, Variance, Standard Deviation, IQR | How much values differ from each other. | Helps detect outliers and decide scaling needs. |
| Distribution Shape | Skewness, Kurtosis, Percentiles | Whether data is symmetric, skewed, heavy-tailed, or extreme. | Guides transformations and robust modelling decisions. |
Measures of Central Tendency
Measures of central tendency describe the centre or typical value of a variable. The three most common measures are mean, median, and mode.
| Measure | Meaning | Example | When to Use |
|---|---|---|---|
| Mean | Arithmetic average of all values. | Average monthly sales. | Best when data is fairly symmetric and has no extreme outliers. |
| Median | Middle value when data is sorted. | Median customer income. | Best when data is skewed or contains outliers. |
| Mode | Most frequently occurring value or category. | Most common payment method. | Useful for categorical variables and mode imputation. |
Important: In skewed data, the mean can be misleading. For example, a few very high-income customers can make the average income look much higher than the income of a typical customer. In such cases, the median is often more reliable.
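The effect of a single extreme value on the mean can be seen in a minimal sketch. The income figures and payment methods below are invented for illustration, and only Python's standard `statistics` module is used.

```python
from statistics import mean, median, mode

# Hypothetical monthly incomes; the last value is a deliberate extreme outlier.
incomes = [30_000, 32_000, 35_000, 35_000, 38_000, 40_000, 500_000]

avg_income = mean(incomes)      # pulled upward by the single large value
mid_income = median(incomes)    # middle of the sorted values: 35000
common_income = mode(incomes)   # most frequent value: 35000

# Mode also works for categories, e.g. a hypothetical payment-method column.
payments = ["UPI", "Card", "UPI", "Cash", "UPI", "Card"]
common_payment = mode(payments)  # "UPI"

print(f"mean={avg_income:.0f} median={mid_income} mode={common_income}")
print(f"most common payment method: {common_payment}")
```

Here the mean lands above 100,000 even though six of the seven customers earn 40,000 or less, while the median stays at 35,000.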
Measures of Spread
Measures of spread describe how scattered or concentrated the values are. Spread matters for modelling: a feature with almost no variation carries little information, while a feature with an extremely wide range can dominate scale-sensitive algorithms unless it is scaled.
| Measure | Meaning | Modelling Importance |
|---|---|---|
| Range | Difference between maximum and minimum value. | Quickly reveals extreme value spread. |
| Variance | Average squared deviation from the mean. | Shows variability, but is harder to interpret because units are squared. |
| Standard Deviation | Square root of the variance; the typical distance of values from the mean, in the original units. | Useful for understanding scale and detecting unusual values. |
| Interquartile Range | Difference between Q3 and Q1. | Robust measure of spread, useful for outlier detection. |
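The four measures in the table can be computed as in the sketch below. The two features are made-up values chosen to share the same mean, and `spread_summary` is a hypothetical helper name. The sketch uses the population variance, which matches the "average squared deviation" definition above; `statistics.variance` would give the sample version that divides by n - 1.

```python
from statistics import mean, pstdev, pvariance, quantiles

# Two hypothetical features with the same mean but very different spread.
steady = [48, 49, 50, 51, 52]
volatile = [10, 30, 50, 70, 90]

def spread_summary(values):
    """Return range, population variance, population std dev, and IQR."""
    # method="inclusive" treats the data as the full population of interest.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "std_dev": pstdev(values),
        "iqr": q3 - q1,
    }

for name, feature in [("steady", steady), ("volatile", volatile)]:
    print(name, "mean =", mean(feature), spread_summary(feature))
```

Both features have a mean of 50, yet every spread measure for `volatile` is far larger, which is the kind of difference a mean alone would hide.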
Percentiles and Quartiles
Percentiles show the relative position of values in a dataset. For example, the 90th percentile is the value below which about 90% of observations fall, with the remaining 10% above it.
Quartiles divide the data into four parts. Q1 is the 25th percentile, Q2 is the median or 50th percentile, and Q3 is the 75th percentile.
| Statistic | Meaning | Example Interpretation |
|---|---|---|
| Q1 | 25th percentile. | 25% of customers spend less than this value. |
| Q2 | 50th percentile or median. | Half the customers spend below this value and half above it. |
| Q3 | 75th percentile. | 75% of customers spend below this value. |
| IQR | Q3 minus Q1. | Spread of the middle 50% of observations. |
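A short sketch of how quartiles and the IQR might be used to flag unusual values. The spend figures are hypothetical, and the 1.5 × IQR fences are a widely used convention assumed here for illustration, not a rule taken from this text. Different quantile methods can give slightly different quartiles on small samples.

```python
from statistics import quantiles

# Hypothetical customer spend values; 400 is an intentionally extreme value.
spend = [20, 25, 30, 35, 40, 45, 50, 55, 400]

# quantiles() sorts the data internally and returns the three cut points.
q1, q2, q3 = quantiles(spend, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in spend if x < lower_fence or x > upper_fence]

print(f"Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```

Flagged values are candidates for investigation, not automatic removal.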
Summary Visualizations
Summary visualizations convert data into charts that are easier to understand than raw numbers. Visuals reveal distribution shape, unusual values, category imbalance, relationships, and trends quickly.
Common Summary Visualizations
Choosing the Right Visualization
Different charts answer different questions. Choosing the right visualization depends on the type of variable and the analysis objective.
| Visualization | Best For | What It Reveals | Example Use |
|---|---|---|---|
| Histogram | Single numerical variable. | Distribution shape, skewness, gaps, and peaks. | Distribution of customer age or income. |
| Box Plot | Numerical variable and outlier detection. | Median, quartiles, spread, and extreme values. | Detecting high-value outliers in transaction amounts. |
| Bar Chart | Categorical variables. | Frequency of each category. | Number of customers by city or plan type. |
| Pie Chart | Simple category proportions. | Share of each category. | Payment method share, but only when categories are few. |
| Scatter Plot | Two numerical variables. | Relationship, clusters, and unusual points. | House area vs. house price. |
| Line Chart | Time-based data. | Trend, seasonality, spikes, and drops. | Monthly sales trend over time. |
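To make concrete what a histogram summarizes, the sketch below performs the binning step by hand and prints a text-only version. In practice a plotting library would draw the chart; the ages, the bin width, and the `histogram_counts` helper are all assumptions made for illustration.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin, where each bin covers [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    # Bins with no values are simply absent from the result.
    return dict(sorted(counts.items()))

# Hypothetical customer ages.
ages = [22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 52, 67]
bin_width = 10
age_bins = histogram_counts(ages, bin_width)

# Text rendering: one '#' per customer in each age band.
for start, count in age_bins.items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")
```

Even in this crude form, the peak in the 30s and the thinning tail toward older ages are visible, which is the shape information a histogram is meant to reveal.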
Descriptive Statistics Workflow in EDA
EDA Summary Workflow
EDA for Numerical Variables
Numerical variables should be analysed using measures such as mean, median, minimum, maximum, standard deviation, percentiles, and skewness, together with a check for outliers.
- Centre: mean, median, difference between mean and median, typical customer or transaction value.
- Spread: minimum and maximum, standard deviation, interquartile range, extreme values.
- Shape: skewness, long tails, multiple peaks, need for transformation.
- Charts: histogram, box plot, density plot, scatter plot.
EDA for Categorical Variables
Categorical variables should be summarized using frequency counts, percentage shares, and the number of unique categories, with attention to rare categories and category imbalance.
| Check | Meaning | Why It Matters |
|---|---|---|
| Frequency Count | Number of observations in each category. | Shows dominant and rare categories. |
| Percentage Share | Proportion of each category. | Helps identify imbalance. |
| Unique Count | Number of different categories. | Important for encoding strategy. |
| Rare Categories | Categories with very few records. | May need grouping into “Other”. |
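The four checks in the table can be sketched with `collections.Counter`. The city data, the 5% rarity threshold, and the `category_summary` helper are assumptions for illustration; a suitable threshold depends on the dataset and the model.

```python
from collections import Counter

# Hypothetical city column with two dominant and several small categories.
cities = ["Mumbai"] * 50 + ["Delhi"] * 40 + ["Pune"] * 6 + ["Surat"] * 3 + ["Kochi"] * 1

def category_summary(values, rare_threshold=0.05):
    """Return counts, percentage shares, and categories below the threshold."""
    counts = Counter(values)
    total = len(values)
    shares = {cat: n / total for cat, n in counts.items()}
    rare = sorted(cat for cat, share in shares.items() if share < rare_threshold)
    return counts, shares, rare

counts, shares, rare = category_summary(cities)
print("unique categories:", len(counts))
print("frequency counts:", counts.most_common())
print("rare categories:", rare)

# Group rare categories into "Other" before encoding.
rare_set = set(rare)
grouped = ["Other" if city in rare_set else city for city in cities]
print("after grouping:", Counter(grouped).most_common())
```

Grouping reduces the number of levels an encoder has to handle, at the cost of losing the distinction between the rare cities.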
Example: Descriptive Statistics for Customer Data
Business Problem
A retail company wants to build a model to predict whether a customer will make a repeat purchase. Before modelling, the analyst performs descriptive statistics and summary visualization.
| Variable | EDA Finding | Modelling Decision |
|---|---|---|
| Customer Age | Mean age is 34, median age is 31, and distribution is slightly right-skewed. | Check age groups and consider binning if behaviour differs by age segment. |
| Monthly Spend | Highly skewed with a few very high-spending customers. | Use log transformation or cap extreme values if needed. |
| City | Large number of cities with many rare categories. | Group rare cities into “Other” before encoding. |
| Payment Method | UPI and card payments dominate the dataset. | Use one-hot encoding and check relationship with repeat purchase. |
| Repeat Purchase Target | Only 18% of customers made a repeat purchase. | Use stratified splitting and classification metrics beyond accuracy. |
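The last row of the table can be illustrated with a hand-rolled sketch. The labels are synthetic, built to match the 18% repeat-purchase rate, and `stratified_split` is a hypothetical helper written for this example, not a library function. It shows why accuracy alone is misleading here and how a stratified split keeps the class share the same in both parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class shares kept roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic target: 1 = repeat purchase (18%), 0 = no repeat purchase (82%).
labels = [1] * 180 + [0] * 820

# A model that always predicts "no repeat purchase" is right 82% of the time.
baseline_accuracy = labels.count(0) / len(labels)

train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"positive rate  train: {train_rate:.2f}  test: {test_rate:.2f}")
```

Because a trivial model already reaches 82% accuracy, metrics such as precision and recall for the repeat-purchase class are more informative for this target.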
How Visualizations Support Modelling Decisions
Visualizations help us make better modelling choices by showing patterns that summary statistics alone may hide. For example, two variables may have the same mean but very different distributions.
| Visualization Finding | Possible Modelling Action |
|---|---|
| Histogram shows strong right skew | Apply a log transformation or use models that are less sensitive to skewed features, such as tree-based models. |
| Box plot shows extreme outliers | Investigate outliers and decide whether to cap, remove, or keep. |
| Bar chart shows rare categories | Group rare categories before encoding. |
| Line chart shows seasonality | Create month, season, holiday, or lag features. |
| Scatter plot shows non-linear relationship | Consider feature transformation, polynomial features, or tree-based models. |
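The first row of the table, right skew addressed by a log transformation, can be checked numerically. The sketch below assumes a made-up spend series and a hypothetical `skewness` helper that uses the population (Fisher-Pearson) definition, the mean of cubed z-scores. `math.log1p` is used so that zero values would not cause an error.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of cubed z-scores (0 for symmetric data)."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

# Hypothetical monthly spend with a long right tail.
spend = [10, 12, 15, 18, 20, 25, 30, 40, 80, 500]
log_spend = [math.log1p(x) for x in spend]  # log(1 + x)

raw_skew = skewness(spend)
log_skew = skewness(log_spend)

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

On this data the transformed values are still right-skewed but noticeably less so, which suggests checking the result after transforming and not assuming the skew is gone.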
Common Mistakes in Descriptive EDA
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Looking only at averages | Mean can hide skewness, outliers, and unequal distribution. | Check median, percentiles, histograms, and box plots. |
| Ignoring categorical imbalance | Rare categories may cause unstable model behaviour. | Check category counts and group rare levels when needed. |
| Not checking target distribution | Class imbalance or skewed target affects model strategy. | Always analyse the target variable separately. |
| Using wrong chart type | A poor chart can hide the real pattern. | Use histograms for numerical data and bar charts for categories. |
| Skipping business interpretation | Statistics without context may lead to wrong preprocessing decisions. | Interpret every finding in relation to the business problem. |
Best Practices for Summary EDA
Descriptive Statistics and Visualization Checklist
- Start with data types: Separate numerical, categorical, date/time, and text variables.
- Summarize numerical variables: Check mean, median, standard deviation, min, max, and percentiles.
- Summarize categorical variables: Check counts, percentages, unique values, and rare categories.
- Visualize distributions: Use histograms, box plots, and bar charts.
- Check the target variable: Understand class balance or target skewness.
- Look for outliers: Extreme values may need investigation or treatment.
- Look for skewness: Skewed variables may need transformation.
- Connect findings to modelling: Every EDA insight should guide preprocessing, feature engineering, or evaluation choices.
Why This Step Matters Before Predictive Modelling
Descriptive statistics and summary visualizations are the foundation of informed modelling. Without EDA, model building becomes guesswork. We may choose the wrong algorithm, ignore outliers, mishandle skewed features, overlook rare categories, or use inappropriate evaluation metrics.
Good EDA helps convert raw data into modelling insight. It allows analysts to understand what the data is saying before asking a machine learning algorithm to learn from it.
Practical Insight: A strong predictive model usually begins with strong data understanding. Descriptive statistics and summary visualizations are the first tools for building that understanding.
Key Takeaways
- Descriptive statistics summarize the main characteristics of data using numbers.
- Mean, median, and mode describe central tendency.
- Range, variance, standard deviation, and IQR describe spread.
- Percentiles and quartiles help understand value positions and outliers.
- Histograms, box plots, bar charts, scatter plots, and line charts are common summary visualizations.
- Numerical and categorical variables require different EDA techniques.
- EDA findings guide missing value treatment, outlier handling, feature engineering, transformations, and model selection.
- Good predictive modelling starts with careful data exploration.