Descriptive Statistics and Summary Visualizations

Exploratory Data Analysis, or EDA, begins with understanding the basic structure, distribution, and behaviour of data. Descriptive statistics and summary visualizations help us quickly identify patterns, unusual values, missing data, skewness, class imbalance, and possible modelling challenges.

Before building a predictive model, we must first ask: What does the data look like? How are the variables distributed? Are there outliers? Are categories balanced? Are relationships visible? Descriptive statistics and visualizations answer these questions.

What Are Descriptive Statistics?

Descriptive statistics summarize the main characteristics of a dataset using numbers. Instead of looking at thousands or millions of rows one by one, we use statistics such as mean, median, standard deviation, minimum, maximum, and percentiles to understand the data quickly.

In predictive analytics, descriptive statistics help us understand data quality, variable behaviour, feature distribution, and possible preprocessing needs.

Core Idea: Descriptive statistics do not build the model directly, but they help us understand the data before modelling decisions are made.

Why Summary Statistics Matter for Predictive Modelling

Understand Distribution: Statistics show whether values are concentrated, spread out, skewed, or extreme.
Detect Data Problems: Minimum, maximum, missing counts, and unusual ranges help identify errors and outliers.
Guide Preprocessing: Scaling, transformation, imputation, and outlier treatment choices depend on statistical summaries.
Improve Modelling Decisions: EDA helps select suitable algorithms, metrics, feature engineering techniques, and validation strategies.

Main Categories of Descriptive Statistics

Descriptive statistics can be grouped into three major categories: measures of central tendency, measures of spread, and measures of shape or position.

Category	Statistics	What It Tells Us	Modelling Relevance
Central Tendency	Mean, Median, Mode	Typical or representative value of a variable.	Useful for imputation, comparison, and understanding average behaviour.
Variability	Range, Variance, Standard Deviation, IQR	How much values differ from each other.	Helps detect outliers and decide scaling needs.
Distribution Shape	Skewness, Kurtosis, Percentiles	Whether data is symmetric, skewed, heavy-tailed, or extreme.	Guides transformations and robust modelling decisions.

Measures of Central Tendency

Measures of central tendency describe the centre or typical value of a variable. The three most common measures are mean, median, and mode.

Measure	Meaning	Example	When to Use
Mean	Arithmetic average of all values.	Average monthly sales.	Best when data is fairly symmetric and has no extreme outliers.
Median	Middle value when data is sorted.	Median customer income.	Best when data is skewed or contains outliers.
Mode	Most frequently occurring value or category.	Most common payment method.	Useful for categorical variables and mode imputation.

Important: In skewed data, the mean can be misleading. For example, a few very high-income customers can make the average income look much higher than the income of a typical customer. In such cases, the median is often a more reliable measure of the typical value.
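The effect described above can be checked on a small sample. The sketch below uses Python's standard `statistics` module; the income and payment values are invented for illustration and do not come from any real dataset.

```python
from statistics import mean, median, mode

# Hypothetical monthly incomes (in thousands); the last two are extreme earners.
incomes = [28, 30, 31, 32, 33, 35, 36, 38, 400, 650]

mean_income = mean(incomes)      # pulled upward by the two extreme values
median_income = median(incomes)  # middle of the sorted values, barely affected

# Mode works on categories as well as numbers.
payments = ["UPI", "Card", "UPI", "Cash", "UPI", "Card"]
common_payment = mode(payments)

print(f"mean={mean_income:.1f}, median={median_income:.1f}, mode={common_payment}")
```

Here the mean lands far above what eight of the ten customers earn, while the median stays close to the typical value.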

Measures of Spread

Measures of spread describe how scattered or concentrated the values are. Spread matters for modelling: a feature with almost no variation carries little predictive signal, while a feature with a very wide range can dominate scale-sensitive methods unless it is scaled.

Measure	Meaning	Modelling Importance
Range	Difference between maximum and minimum value.	Quickly reveals extreme value spread.
Variance	Average squared deviation from the mean.	Shows variability, but is harder to interpret because units are squared.
Standard Deviation	Typical distance of values from the mean.	Useful for understanding scale and detecting unusual values.
Interquartile Range	Difference between Q3 and Q1.	Robust measure of spread, useful for outlier detection.
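As a rough illustration of the four measures in the table, the following sketch computes each one with the standard `statistics` module. The transaction amounts are hypothetical, and the choice of population formulas (`pvariance`, `pstdev`) and the "inclusive" quartile method is an assumption; sample formulas would give slightly different numbers.

```python
from statistics import pstdev, pvariance, quantiles

# Hypothetical transaction amounts; 95 is far from the rest.
amounts = [12, 15, 15, 18, 20, 22, 25, 95]

value_range = max(amounts) - min(amounts)
variance = pvariance(amounts)   # population variance, in squared units
std_dev = pstdev(amounts)       # square root of the variance, in original units

# Quartile cut points; "inclusive" treats the data as the full population.
q1, q2, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1                   # spread of the middle 50% of values

print(f"range={value_range}, variance={variance:.2f}, std={std_dev:.2f}, IQR={iqr:.2f}")
```

The single large value inflates the range and standard deviation, while the IQR barely moves, which is why the table calls it a robust measure.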

Percentiles and Quartiles

Percentiles show the relative position of values in a dataset. For example, the 90th percentile is the value below which about 90% of observations fall, with the remaining 10% above it.

Quartiles divide the data into four parts. Q1 is the 25th percentile, Q2 is the median or 50th percentile, and Q3 is the 75th percentile.

Statistic	Meaning	Example Interpretation
Q1	25th percentile.	25% of customers spend less than this value.
Q2	50th percentile or median.	Half the customers spend below this value and half above it.
Q3	75th percentile.	75% of customers spend below this value.
IQR	Q3 minus Q1.	Spread of the middle 50% of observations.
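The quartiles above are often combined into a simple outlier check. The sketch below is one possible version: the spend values are hypothetical, and the 1.5 × IQR cut-off is a widely used convention (Tukey's rule) that this text does not itself prescribe.

```python
from statistics import quantiles

# Hypothetical customer spend values.
spend = [20, 22, 25, 27, 30, 31, 33, 35, 38, 40, 42, 250]

q1, q2, q3 = quantiles(spend, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: flag values more than 1.5 * IQR outside the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in spend if x < lower_fence or x > upper_fence]

# n=10 yields nine cut points (deciles); the last one is the 90th percentile.
p90 = quantiles(spend, n=10, method="inclusive")[-1]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```

Different quantile methods interpolate differently, so the exact cut points may vary slightly between tools.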

Summary Visualizations

Summary visualizations convert data into charts that are easier to understand than raw numbers. Visuals reveal distribution shape, unusual values, category imbalance, relationships, and trends quickly.

Figure: Common summary visualizations (histogram, box plot, scatter plot).

Choosing the Right Visualization

Different charts answer different questions. Choosing the right visualization depends on the type of variable and the analysis objective.

Visualization	Best For	What It Reveals	Example Use
Histogram	Single numerical variable.	Distribution shape, skewness, gaps, and peaks.	Distribution of customer age or income.
Box Plot	Numerical variable and outlier detection.	Median, quartiles, spread, and extreme values.	Detecting high-value outliers in transaction amounts.
Bar Chart	Categorical variables.	Frequency of each category.	Number of customers by city or plan type.
Pie Chart	Simple category proportions.	Share of each category.	Payment method share, but only when categories are few.
Scatter Plot	Two numerical variables.	Relationship, clusters, and unusual points.	House area vs. house price.
Line Chart	Time-based data.	Trend, seasonality, spikes, and drops.	Monthly sales trend over time.
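To make the histogram row concrete, here is a minimal text-only sketch of what a histogram computes: values are grouped into fixed-width bins and counted. The function name `text_histogram`, the ages, and the bin width are all hypothetical; in practice a plotting library would draw the bars.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Count values per fixed-width bin and return {bin_start: count}."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical customer ages.
ages = [19, 22, 23, 25, 27, 28, 31, 33, 34, 36, 41, 45, 52, 67]

bins = text_histogram(ages, bin_width=10)
for start, count in bins.items():
    # One '#' per observation in the bin.
    print(f"{start:>2}-{start + 9:<2} | {'#' * count}")
```

The printed bars thin out toward higher ages, the kind of right-skewed shape a histogram is meant to reveal.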

Descriptive Statistics Workflow in EDA

EDA Summary Workflow

Identify Data Types → Calculate Summary Statistics → Visualize Distributions → Detect Data Issues → Plan Preprocessing
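The first and fourth steps of this workflow can be sketched in a few lines. The example below is only an illustration: the records, the column names, and the `profile` helper are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "city": "Pune", "spend": 120.5},
    {"age": None, "city": "Delhi", "spend": 80.0},
    {"age": 29, "city": None, "spend": 4500.0},
    {"age": 41, "city": "Pune", "spend": None},
]

def profile(rows):
    """Return per-column kind ('numerical' or 'categorical') and missing count."""
    summary = {}
    for column in rows[0]:
        present = [r[column] for r in rows if r[column] is not None]
        is_numeric = all(isinstance(v, (int, float)) for v in present)
        summary[column] = {
            "kind": "numerical" if is_numeric else "categorical",
            "missing": len(rows) - len(present),
        }
    return summary

report = profile(records)
for column, info in report.items():
    print(column, info)
```

Knowing each column's kind decides which summary statistics and charts apply in the following steps.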

EDA for Numerical Variables

Numerical variables should be analysed using measures such as mean, median, minimum, maximum, standard deviation, percentiles, and skewness, together with a check for outliers.

Check Center

Mean
Median
Difference between mean and median
Typical customer or transaction value

Check Spread

Minimum and maximum
Standard deviation
Interquartile range
Extreme values

Check Shape

Skewness
Long tails
Multiple peaks
Need for transformation

Use Visuals

Histogram
Box plot
Density plot
Scatter plot
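The centre and shape checks listed above can be quantified. The sketch below computes the gap between mean and median and a moment-based skewness coefficient; the `skewness` helper and the spend values are hypothetical, and the population formula is one of several definitions in use.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population (moment) skewness: the average cubed z-score."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)

# Hypothetical monthly spend with a long right tail.
spend = [40, 45, 50, 55, 60, 65, 70, 80, 300, 900]

skew = skewness(spend)
gap = mean(spend) - median(spend)  # a positive gap is a quick hint of right skew

print(f"skewness={skew:.2f}, mean-median gap={gap:.1f}")
```

A clearly positive skewness and a mean well above the median both point to a long right tail and a possible need for transformation.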

EDA for Categorical Variables

Categorical variables should be summarized using frequency counts, percentage shares, and the number of unique categories, with particular attention to rare categories and category imbalance.

Check	Meaning	Why It Matters
Frequency Count	Number of observations in each category.	Shows dominant and rare categories.
Percentage Share	Proportion of each category.	Helps identify imbalance.
Unique Count	Number of different categories.	Important for encoding strategy.
Rare Categories	Categories with very few records.	May need grouping into “Other”.
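The four checks in the table map directly onto a frequency count. The following sketch uses `collections.Counter` on a hypothetical city column; the 5% threshold for a "rare" category is an assumption chosen for illustration, not a fixed rule.

```python
from collections import Counter

# Hypothetical city column.
cities = ["Pune"] * 40 + ["Delhi"] * 35 + ["Mumbai"] * 20 + ["Agra"] * 3 + ["Kochi"] * 2

counts = Counter(cities)                                    # frequency count
total = len(cities)
shares = {city: n / total for city, n in counts.items()}   # percentage share
unique_count = len(counts)                                  # unique count

# Assumed threshold: categories under 5% of rows are treated as rare.
RARE_SHARE = 0.05
rare = {city for city, share in shares.items() if share < RARE_SHARE}

# Group rare categories into "Other" before encoding.
grouped = ["Other" if c in rare else c for c in cities]
grouped_counts = Counter(grouped)

print(counts.most_common())
print(grouped_counts.most_common())
```

The right threshold depends on dataset size and on how the categories will be encoded.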

Example: Descriptive Statistics for Customer Data

Business Problem

A retail company wants to build a model to predict whether a customer will make a repeat purchase. Before modelling, the analyst performs descriptive statistics and summary visualization.

Variable	EDA Finding	Modelling Decision
Customer Age	Mean age is 34, median age is 31, and distribution is slightly right-skewed.	Check age groups and consider binning if behaviour differs by age segment.
Monthly Spend	Highly skewed with a few very high-spending customers.	Use log transformation or cap extreme values if needed.
City	Large number of cities with many rare categories.	Group rare cities into “Other” before encoding.
Payment Method	UPI and card payments dominate the dataset.	Use one-hot encoding and check relationship with repeat purchase.
Repeat Purchase Target	Only 18% of customers made a repeat purchase.	Use stratified splitting and classification metrics beyond accuracy.
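The last row of the table can be illustrated with a small sketch. It builds a hypothetical target with the same 18% positive rate, performs a simple stratified split, and shows why accuracy alone is a weak metric here. The `stratified_split` helper, the dataset size, and the 20% test fraction are assumptions made for illustration.

```python
import random

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical target matching the 18% repeat-purchase rate in the table.
labels = [1] * 180 + [0] * 820

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

# Always predicting "no repeat purchase" is right for every 0-labelled row,
# so accuracy looks high even though no repeat buyer is ever identified.
baseline_accuracy = labels.count(0) / len(labels)

print(f"test positive rate={test_rate:.2f}, baseline accuracy={baseline_accuracy:.2f}")
```

Because the split is done per class, the test set keeps the same positive rate as the full data, and the baseline shows that a model with no predictive value would still score 82% accuracy.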

How Visualizations Support Modelling Decisions

Visualizations help us make better modelling choices by showing patterns that summary statistics alone may hide. For example, two variables may have the same mean but very different distributions.

Visualization Finding	Possible Modelling Action
Histogram shows strong right skew	Apply log transformation or use robust models.
Box plot shows extreme outliers	Investigate outliers and decide whether to cap, remove, or keep.
Bar chart shows rare categories	Group rare categories before encoding.
Line chart shows seasonality	Create month, season, holiday, or lag features.
Scatter plot shows non-linear relationship	Consider feature transformation, polynomial features, or tree-based models.
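As one illustration of the first row, the sketch below applies a log transformation to a hypothetical right-skewed variable and compares a crude skew indicator before and after. The spend values and the mean-to-median ratio as an indicator are both assumptions made for this example.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed spend values.
spend = [10, 12, 15, 18, 20, 25, 30, 40, 500, 2000]

# log1p(x) = log(1 + x) is defined at zero and compresses large values.
log_spend = [math.log1p(x) for x in spend]

# Mean-to-median ratio: near 1 for roughly symmetric positive data,
# well above 1 when a long right tail pulls the mean upward.
ratio_raw = mean(spend) / median(spend)
ratio_log = mean(log_spend) / median(log_spend)

print(f"raw ratio={ratio_raw:.2f}, log ratio={ratio_log:.2f}")
```

After the transformation the mean and median sit much closer together. Whether to transform, cap, or keep the raw values still depends on the model and the business context.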

Common Mistakes in Descriptive EDA

Mistake	Why It Is Harmful	Better Approach
Looking only at averages	Mean can hide skewness, outliers, and unequal distribution.	Check median, percentiles, histograms, and box plots.
Ignoring categorical imbalance	Rare categories may cause unstable model behaviour.	Check category counts and group rare levels when needed.
Not checking target distribution	Class imbalance or skewed target affects model strategy.	Always analyse the target variable separately.
Using the wrong chart type	A poor chart can hide the real pattern.	Use histograms for numerical data and bar charts for categories.
Skipping business interpretation	Statistics without context may lead to wrong preprocessing decisions.	Interpret every finding in relation to the business problem.

Best Practices for Summary EDA

Descriptive Statistics and Visualization Checklist

Start with data types: Separate numerical, categorical, date/time, and text variables.
Summarize numerical variables: Check mean, median, standard deviation, min, max, and percentiles.
Summarize categorical variables: Check counts, percentages, unique values, and rare categories.
Visualize distributions: Use histograms, box plots, and bar charts.
Check the target variable: Understand class balance or target skewness.
Look for outliers: Extreme values may need investigation or treatment.
Look for skewness: Skewed variables may need transformation.
Connect findings to modelling: Every EDA insight should guide preprocessing, feature engineering, or evaluation choices.

Why This Step Matters Before Predictive Modelling

Descriptive statistics and summary visualizations are the foundation of informed modelling. Without EDA, model building becomes guesswork. We may choose the wrong algorithm, ignore outliers, mishandle skewed features, overlook rare categories, or use inappropriate evaluation metrics.

Good EDA helps convert raw data into modelling insight. It allows analysts to understand what the data is saying before asking a machine learning algorithm to learn from it.

Practical Insight: A strong predictive model usually begins with strong data understanding. Descriptive statistics and summary visualizations are the first tools for building that understanding.

Key Takeaways

Descriptive statistics summarize the main characteristics of data using numbers.
Mean, median, and mode describe central tendency.
Range, variance, standard deviation, and IQR describe spread.
Percentiles and quartiles help understand value positions and outliers.
Histograms, box plots, bar charts, scatter plots, and line charts are common summary visualizations.
Numerical and categorical variables require different EDA techniques.
EDA findings guide missing value treatment, outlier handling, feature engineering, transformations, and model selection.
Good predictive modelling starts with careful data exploration.
