Descriptive Statistics and Summary Visualizations

Exploratory Data Analysis, or EDA, begins with understanding the basic structure, distribution, and behaviour of data. Descriptive statistics and summary visualizations help us quickly identify patterns, unusual values, missing data, skewness, class imbalance, and possible modelling challenges.

Before building a predictive model, we must first ask: What does the data look like? How are the variables distributed? Are there outliers? Are categories balanced? Are relationships visible? Descriptive statistics and visualizations answer these questions.

What Are Descriptive Statistics?

Descriptive statistics summarize the main characteristics of a dataset using numbers. Instead of looking at thousands or millions of rows one by one, we use statistics such as mean, median, standard deviation, minimum, maximum, and percentiles to understand the data quickly.

In predictive analytics, descriptive statistics help us understand data quality, variable behaviour, feature distribution, and possible preprocessing needs.

Core Idea: Descriptive statistics do not build the model directly, but they help us understand the data before modelling decisions are made.

Why Summary Statistics Matter for Predictive Modelling

  • Understand Distribution: Statistics show whether values are concentrated, spread out, skewed, or extreme.
  • Detect Data Problems: Minimum, maximum, missing counts, and unusual ranges help identify errors and outliers.
  • Guide Preprocessing: Scaling, transformation, imputation, and outlier treatment choices depend on statistical summaries.
  • Improve Modelling Decisions: EDA helps select suitable algorithms, metrics, feature engineering techniques, and validation strategies.

Main Categories of Descriptive Statistics

Descriptive statistics can be grouped into three major categories: measures of central tendency, measures of spread, and measures of shape or position.

Category | Statistics | What It Tells Us | Modelling Relevance
Center (Central Tendency) | Mean, Median, Mode | Typical or representative value of a variable. | Useful for imputation, comparison, and understanding average behaviour.
Spread (Variability) | Range, Variance, Standard Deviation, IQR | How much values differ from each other. | Helps detect outliers and decide scaling needs.
Shape (Distribution Shape) | Skewness, Kurtosis, Percentiles | Whether data is symmetric, skewed, heavy-tailed, or extreme. | Guides transformations and robust modelling decisions.

Measures of Central Tendency

Measures of central tendency describe the centre or typical value of a variable. The three most common measures are mean, median, and mode.

Measure | Meaning | Example | When to Use
Mean | Arithmetic average of all values. | Average monthly sales. | Best when data is fairly symmetric and has no extreme outliers.
Median | Middle value when data is sorted. | Median customer income. | Best when data is skewed or contains outliers.
Mode | Most frequently occurring value or category. | Most common payment method. | Useful for categorical variables and mode imputation.

Important: In skewed data, the mean can be misleading. For example, a few very high-income customers can make average income look much higher than the income of a typical customer. In such cases, median is often more reliable.
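The effect described above can be checked with a few lines of Python. This is a minimal sketch using the standard library; the income and payment values are made up for illustration and do not come from any real dataset.

```python
from statistics import mean, median, mode

# Hypothetical annual incomes (in thousands); one very high earner skews the data.
incomes = [30, 32, 35, 38, 40, 42, 45, 500]

avg_income = mean(incomes)     # pulled upward by the single extreme value
mid_income = median(incomes)   # middle of the sorted values, barely affected by it

# Mode applies naturally to categories, such as payment method.
payments = ["UPI", "Card", "UPI", "Cash", "UPI", "Card"]
common_payment = mode(payments)

print(f"mean={avg_income}, median={mid_income}, mode={common_payment}")
```

Here the mean is more than twice the median, even though seven of the eight customers earn 45 or less, which is why the median is usually the safer summary for skewed variables.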

Measures of Spread

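Measures of spread describe how scattered or concentrated the values are. Spread matters for modelling: a feature with almost no variation carries little information for prediction, while a feature with a very wide range can dominate distance-based or gradient-based models unless it is scaled.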

Measure | Meaning | Modelling Importance
Range | Difference between maximum and minimum value. | Quickly reveals extreme value spread.
Variance | Average squared deviation from the mean. | Shows variability, but is harder to interpret because units are squared.
Standard Deviation | Typical distance of values from the mean. | Useful for understanding scale and detecting unusual values.
Interquartile Range | Difference between Q3 and Q1. | Robust measure of spread, useful for outlier detection.
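The four measures in the table can be computed as follows. This is a small sketch on made-up values, using the population formulas (`pvariance`, `pstdev`) to match the "average squared deviation" definition above. Libraries differ in how they interpolate quartiles, so the choice of `method="inclusive"` is an assumption, not the only valid convention.

```python
from statistics import pvariance, pstdev, quantiles

# Hypothetical feature values, chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)   # maximum minus minimum
variance = pvariance(values)              # average squared deviation from the mean
std_dev = pstdev(values)                  # square root of the variance, in original units

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                             # spread of the middle 50% of values

print(value_range, variance, std_dev, iqr)
```

For sample data, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n and give slightly larger values.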

Percentiles and Quartiles

Percentiles show the relative position of values in a dataset. For example, the 90th percentile is the value below which about 90% of observations fall, with the remaining 10% above it.

Quartiles divide the data into four parts. Q1 is the 25th percentile, Q2 is the median or 50th percentile, and Q3 is the 75th percentile.

Statistic | Meaning | Example Interpretation
Q1 | 25th percentile. | 25% of customers spend less than this value.
Q2 | 50th percentile or median. | Half the customers spend below this value and half above it.
Q3 | 75th percentile. | 75% of customers spend below this value.
IQR | Q3 minus Q1. | Spread of the middle 50% of observations.
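As an illustration, the quartiles and IQR can be computed with NumPy and then used to flag unusually large or small values. The spend figures below are hypothetical. The 1.5 × IQR fences are a widely used convention that is assumed here; the text above does not prescribe a specific rule.

```python
import numpy as np

# Hypothetical customer spend values with one very large transaction.
spend = np.array([12, 15, 18, 20, 22, 25, 27, 30, 35, 200], dtype=float)

# Quartiles and the 90th percentile (NumPy's default linear interpolation).
q1, q2, q3 = np.percentile(spend, [25, 50, 75])
p90 = np.percentile(spend, 90)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (an assumed rule of thumb, not a fixed law).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = spend[(spend < lower_fence) | (spend > upper_fence)]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, P90={p90}")
print("flagged as outliers:", outliers)
```

A flagged value is a candidate for investigation, not automatically an error; whether to keep, cap, or remove it depends on the business context.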

Summary Visualizations

Summary visualizations convert data into charts that are easier to understand than raw numbers. Visuals reveal distribution shape, unusual values, category imbalance, relationships, and trends quickly.

Common Summary Visualizations

[Figures: histogram, box plot, scatter plot]

Choosing the Right Visualization

Different charts answer different questions. Choosing the right visualization depends on the type of variable and the analysis objective.

Visualization | Best For | What It Reveals | Example Use
Histogram | Single numerical variable. | Distribution shape, skewness, gaps, and peaks. | Distribution of customer age or income.
Box Plot | Numerical variable and outlier detection. | Median, quartiles, spread, and extreme values. | Detecting high-value outliers in transaction amounts.
Bar Chart | Categorical variables. | Frequency of each category. | Number of customers by city or plan type.
Pie Chart | Simple category proportions. | Share of each category. | Payment method share, but only when categories are few.
Scatter Plot | Two numerical variables. | Relationship, clusters, and unusual points. | House area vs. house price.
Line Chart | Time-based data. | Trend, seasonality, spikes, and drops. | Monthly sales trend over time.
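Three of the chart types in the table can be drawn side by side with Matplotlib. The sketch below uses synthetic data generated with a fixed seed; the variable names (`income`, `area`, `price`) and the distribution parameters are assumptions chosen only to produce a right-skewed variable and a roughly linear relationship.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: a right-skewed income variable and an area/price pair.
rng = np.random.default_rng(seed=0)
income = rng.lognormal(mean=10, sigma=0.5, size=500)
area = rng.uniform(500, 3000, size=500)
price = 50 * area + rng.normal(0, 20000, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

axes[0].hist(income, bins=30)        # distribution shape and skewness
axes[0].set_title("Histogram: income")

axes[1].boxplot(income)              # median, quartiles, and extreme values
axes[1].set_title("Box plot: income")

axes[2].scatter(area, price, s=8)    # relationship between two numerical variables
axes[2].set_title("Scatter: area vs. price")
axes[2].set_xlabel("area")
axes[2].set_ylabel("price")

fig.tight_layout()
```

Calling `fig.savefig("summary_plots.png")` afterwards writes the figure to disk; the file name is only an example.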

Descriptive Statistics Workflow in EDA

EDA Summary Workflow

  1. Identify Data Types
  2. Calculate Summary Statistics
  3. Visualize Distributions
  4. Detect Data Issues
  5. Plan Preprocessing
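The workflow steps can be sketched with pandas roughly as follows. The small DataFrame and its column names are hypothetical, and the plotting step is left as a comment so the example stays short.

```python
import pandas as pd

# Hypothetical customer table with a few missing values.
df = pd.DataFrame({
    "age": [25, 31, 47, None, 38],
    "monthly_spend": [120.0, 80.5, 950.0, 60.0, 110.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", None],
})

# 1. Identify data types.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# 2. Calculate summary statistics.
numeric_summary = df[numeric_cols].describe()
category_counts = {col: df[col].value_counts() for col in categorical_cols}

# 3. Visualize distributions (for example with histograms and bar charts).

# 4. Detect data issues such as missing values.
missing_counts = df.isna().sum()

# 5. Plan preprocessing: list the columns that will need imputation.
needs_imputation = missing_counts[missing_counts > 0].index.tolist()

print(numeric_summary)
print("columns needing imputation:", needs_imputation)
```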

EDA for Numerical Variables

Numerical variables should be analysed using measures such as mean, median, minimum, maximum, standard deviation, percentiles, and skewness, together with a check for outliers.

Check Center
  • Mean
  • Median
  • Difference between mean and median
  • Typical customer or transaction value
Check Spread
  • Minimum and maximum
  • Standard deviation
  • Interquartile range
  • Extreme values
Check Shape
  • Skewness
  • Long tails
  • Multiple peaks
  • Need for transformation
Use Visuals
  • Histogram
  • Box plot
  • Density plot
  • Scatter plot
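The center, spread, and shape checks listed above can be gathered into one helper. This is a sketch only: `summarize_numeric` is a hypothetical function name, and the spend values are invented to show a right-skewed variable.

```python
import pandas as pd

def summarize_numeric(values: pd.Series) -> dict:
    """Collect center, spread, and shape statistics for one numerical column."""
    q1, q3 = values.quantile([0.25, 0.75])
    return {
        "mean": values.mean(),
        "median": values.median(),
        "mean_minus_median": values.mean() - values.median(),  # positive hints at right skew
        "min": values.min(),
        "max": values.max(),
        "std": values.std(),     # sample standard deviation
        "iqr": q3 - q1,
        "skew": values.skew(),   # greater than 0 for a long right tail
    }

# Hypothetical monthly spend with one very large customer.
monthly_spend = pd.Series([40, 45, 50, 55, 60, 65, 70, 400])
summary = summarize_numeric(monthly_spend)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A large positive gap between mean and median, combined with positive skewness, suggests looking at a histogram and considering a transformation.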

EDA for Categorical Variables

Categorical variables should be summarized using frequency counts, percentage shares, number of unique categories, rare categories, and category imbalance.

Check | Meaning | Why It Matters
Frequency Count | Number of observations in each category. | Shows dominant and rare categories.
Percentage Share | Proportion of each category. | Helps identify imbalance.
Unique Count | Number of different categories. | Important for encoding strategy.
Rare Categories | Categories with very few records. | May need grouping into “Other”.
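These four checks map directly onto pandas operations. The sketch below uses an invented `city` column, and the 15% rarity threshold (`min_share`) is an arbitrary assumption; a suitable cut-off depends on the dataset size and the encoding method.

```python
import pandas as pd

# Hypothetical city column: two common cities and two rare ones.
city = pd.Series(["Pune"] * 5 + ["Delhi"] * 3 + ["Mumbai", "Nagpur"], name="city")

counts = city.value_counts()                 # frequency count per category
shares = city.value_counts(normalize=True)   # percentage share per category
n_unique = city.nunique()                    # number of distinct categories

# Group categories below an assumed share threshold into "Other".
min_share = 0.15
rare_levels = shares[shares < min_share].index
city_grouped = city.where(~city.isin(rare_levels), "Other")

print(counts)
print(city_grouped.value_counts())
```

After grouping, the column has three levels instead of four, which keeps one-hot encoding compact.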

Example: Descriptive Statistics for Customer Data

Business Problem

A retail company wants to build a model to predict whether a customer will make a repeat purchase. Before modelling, the analyst performs descriptive statistics and summary visualization.

Variable | EDA Finding | Modelling Decision
Customer Age | Mean age is 34, median age is 31, and distribution is slightly right-skewed. | Check age groups and consider binning if behaviour differs by age segment.
Monthly Spend | Highly skewed with a few very high-spending customers. | Use log transformation or cap extreme values if needed.
City | Large number of cities with many rare categories. | Group rare cities into “Other” before encoding.
Payment Method | UPI and card payments dominate the dataset. | Use one-hot encoding and check relationship with repeat purchase.
Repeat Purchase Target | Only 18% of customers made a repeat purchase. | Use stratified splitting and classification metrics beyond accuracy.
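The last row is worth a quick calculation. Assuming a hypothetical sample of 1,000 customers with the same 18% repeat-purchase rate, a baseline that never predicts a repeat purchase already looks accurate while being useless for the business question:

```python
# 1,000 hypothetical customers, 18% of whom made a repeat purchase (1 = repeat).
actual = [1] * 180 + [0] * 820

# A naive baseline that always predicts "no repeat purchase".
predicted = [0] * len(actual)

# Accuracy: share of predictions that match the actual label.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Recall: share of actual repeat purchasers that the model identified.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline reaches 82% accuracy with zero recall, which is why metrics such as recall, precision, or F1 are worth tracking on an imbalanced target.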

How Visualizations Support Modelling Decisions

Visualizations help us make better modelling choices by showing patterns that summary statistics alone may hide. For example, two variables may have the same mean but very different distributions.
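A tiny numeric illustration of that point, using two made-up variables:

```python
from statistics import mean, pstdev

# Two hypothetical variables with identical means but very different spread.
steady = [48, 49, 50, 51, 52]
volatile = [0, 25, 50, 75, 100]

print("means:", mean(steady), mean(volatile))
print("standard deviations:", round(pstdev(steady), 2), round(pstdev(volatile), 2))
```

Both means equal 50, but the second variable is spread far more widely, a difference that a histogram or box plot shows at a glance.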

Visualization Finding | Possible Modelling Action
Histogram shows strong right skew | Apply log transformation or use robust models.
Box plot shows extreme outliers | Investigate outliers and decide whether to cap, remove, or keep.
Bar chart shows rare categories | Group rare categories before encoding.
Line chart shows seasonality | Create month, season, holiday, or lag features.
Scatter plot shows non-linear relationship | Consider feature transformation, polynomial features, or tree-based models.
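The first two actions in the table, log transformation and capping, might look like this in pandas. The spend values are hypothetical, and capping at the 90th percentile is an arbitrary choice for this ten-row example; larger datasets often use a higher percentile.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed spend values.
spend = pd.Series([10, 12, 15, 20, 30, 50, 90, 200, 600, 2500], dtype=float)

# Log transformation: log1p computes log(1 + x), so zero values remain valid.
log_spend = np.log1p(spend)

# Capping: limit values above an assumed upper percentile.
cap = spend.quantile(0.90)
capped_spend = spend.clip(upper=cap)

print("skewness before:", round(spend.skew(), 2))
print("skewness after log1p:", round(log_spend.skew(), 2))
print("cap value:", cap, "| max after capping:", capped_spend.max())
```

Comparing skewness before and after the transformation is a quick check that the change had the intended effect before the feature is passed to a model.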

Common Mistakes in Descriptive EDA

Mistake | Why It Is Harmful | Better Approach
Looking only at averages | Mean can hide skewness, outliers, and unequal distribution. | Check median, percentiles, histograms, and box plots.
Ignoring categorical imbalance | Rare categories may cause unstable model behaviour. | Check category counts and group rare levels when needed.
Not checking target distribution | Class imbalance or skewed target affects model strategy. | Always analyse the target variable separately.
Using wrong chart type | A poor chart can hide the real pattern. | Use histograms for numerical data and bar charts for categories.
Skipping business interpretation | Statistics without context may lead to wrong preprocessing decisions. | Interpret every finding in relation to the business problem.

Best Practices for Summary EDA

Descriptive Statistics and Visualization Checklist

  • Start with data types: Separate numerical, categorical, date/time, and text variables.
  • Summarize numerical variables: Check mean, median, standard deviation, min, max, and percentiles.
  • Summarize categorical variables: Check counts, percentages, unique values, and rare categories.
  • Visualize distributions: Use histograms, box plots, and bar charts.
  • Check the target variable: Understand class balance or target skewness.
  • Look for outliers: Extreme values may need investigation or treatment.
  • Look for skewness: Skewed variables may need transformation.
  • Connect findings to modelling: Every EDA insight should guide preprocessing, feature engineering, or evaluation choices.

Why This Step Matters Before Predictive Modelling

Descriptive statistics and summary visualizations are the foundation of informed modelling. Without EDA, model building becomes guesswork. We may choose the wrong algorithm, ignore outliers, mishandle skewed features, overlook rare categories, or use inappropriate evaluation metrics.

Good EDA helps convert raw data into modelling insight. It allows analysts to understand what the data is saying before asking a machine learning algorithm to learn from it.

Practical Insight: A strong predictive model usually begins with strong data understanding. Descriptive statistics and summary visualizations are the first tools for building that understanding.

Key Takeaways

  • Descriptive statistics summarize the main characteristics of data using numbers.
  • Mean, median, and mode describe central tendency.
  • Range, variance, standard deviation, and IQR describe spread.
  • Percentiles and quartiles help understand value positions and outliers.
  • Histograms, box plots, bar charts, scatter plots, and line charts are common summary visualizations.
  • Numerical and categorical variables require different EDA techniques.
  • EDA findings guide missing value treatment, outlier handling, feature engineering, transformations, and model selection.
  • Good predictive modelling starts with careful data exploration.