Univariate and Bivariate Analysis

Exploratory data analysis (EDA) becomes more powerful when we study variables at two levels: individually and in relation to each other. Univariate analysis helps us understand one variable at a time, while bivariate analysis helps us understand how two variables relate to each other.

These two techniques help identify distributions, outliers, category imbalance, relationships, trends, and predictors that may be useful for machine learning models.

What is Univariate Analysis?

Univariate analysis means analysing one variable at a time. The goal is to understand the basic behaviour of a single feature or target variable without considering its relationship with other variables.

For example, if we analyse only customer age, monthly income, transaction amount, product category, or churn status individually, we are performing univariate analysis.

Core Idea: Univariate analysis answers the question: “What does this one variable look like?”

What is Bivariate Analysis?

Bivariate analysis means analysing the relationship between two variables. The goal is to understand whether, and how, the values of one variable differ as the values of another variable change.

For example, if we analyse the relationship between house size and house price, income and loan default, advertising spend and sales, or customer complaints and churn, we are performing bivariate analysis.

Core Idea: Bivariate analysis answers the question: “How does one variable behave in relation to another variable?”

Univariate vs Bivariate Analysis

| Aspect | Univariate Analysis | Bivariate Analysis |
| --- | --- | --- |
| Number of Variables | One variable at a time. | Two variables together. |
| Main Question | What does this variable look like? | How are these two variables related? |
| Common Outputs | Distribution, frequency, spread, outliers. | Relationship, trend, comparison, association. |
| Common Visuals | Histogram, box plot, bar chart. | Scatter plot, grouped bar chart, box plot by category, cross-tabulation. |
| Modelling Use | Helps detect data quality issues and preprocessing needs. | Helps identify useful predictors and relationships with the target. |

Simple Difference Between Univariate and Bivariate Analysis

Univariate: One Variable Distribution
Bivariate: Relationship Between Two Variables

Why These Analyses Matter for Predictive Modelling

  • Understand Variables: Univariate analysis shows distribution, range, missing values, and unusual values for each variable.
  • Find Relationships: Bivariate analysis reveals whether features are related to each other or to the target variable.
  • Guide Feature Engineering: Patterns discovered during analysis can inspire transformations, bins, ratios, and interaction features.
  • Improve Model Strategy: These analyses help choose algorithms, preprocessing steps, metrics, and validation methods.

Univariate Analysis for Numerical Variables

For numerical variables, univariate analysis focuses on central tendency, spread, distribution shape, skewness, and outliers.

| Check | What to Look For | Recommended Visual | Modelling Action |
| --- | --- | --- | --- |
| Central Value | Mean, median, and whether they are far apart. | Summary table. | Decide whether mean or median better represents typical behaviour. |
| Spread | Minimum, maximum, range, standard deviation, IQR. | Box plot. | Check if scaling or outlier treatment is required. |
| Shape | Symmetry, skewness, long tails, multiple peaks. | Histogram. | Apply transformation if distribution is highly skewed. |
| Outliers | Extreme values that are unusually high or low. | Box plot or percentile table. | Investigate whether to remove, cap, transform, or keep. |
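
The numerical checks above can be sketched with pandas. This is a minimal illustration, not a prescribed procedure: the `charges` values are made up, the 1.5 × IQR outlier rule is one common convention among several, and `log1p` stands in for whatever transformation suits the data.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly charges; the last value is deliberately extreme.
charges = pd.Series([20, 22, 25, 27, 30, 32, 35, 38, 40, 200], name="monthly_charges")

# Central value, spread, and shape in one place.
summary = {
    "mean": charges.mean(),
    "median": charges.median(),
    "std": charges.std(),
    "skew": charges.skew(),  # a positive value indicates a longer right tail
}

# IQR rule (an assumed convention): flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]

# A log transform is one common way to reduce right skew.
log_skew = np.log1p(charges).skew()

print(summary)
print("Outliers:", outliers.tolist())
print("Skew after log1p:", round(log_skew, 2))
```

A mean well above the median, as here, is a quick hint that a few large values are pulling the average upward.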

Univariate Analysis for Categorical Variables

For categorical variables, univariate analysis focuses on frequency counts, category percentages, dominant categories, rare categories, and class imbalance.

| Check | What to Look For | Recommended Visual | Modelling Action |
| --- | --- | --- | --- |
| Frequency Count | Number of records in each category. | Bar chart. | Understand category distribution. |
| Dominant Category | One category appearing much more than others. | Bar chart or frequency table. | Check whether feature has low information value. |
| Rare Categories | Categories with very few observations. | Frequency table. | Group rare categories into “Other” before encoding. |
| Target Balance | Whether target classes are balanced or imbalanced. | Bar chart. | Use stratified splitting and appropriate metrics. |
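
As a rough sketch of the frequency and rare-category checks above, the following uses pandas `value_counts`. The `payment` data and the 5% rarity threshold are assumptions chosen for illustration; a suitable threshold depends on the dataset.

```python
import pandas as pd

# Hypothetical payment-method column with two very rare categories.
payment = pd.Series(
    ["card"] * 50 + ["bank_transfer"] * 40 + ["wallet"] * 8 + ["cheque"] + ["crypto"],
    name="payment_method",
)

counts = payment.value_counts()                # records per category
shares = payment.value_counts(normalize=True)  # proportion per category

# Assumed threshold: categories below 5% of rows are treated as rare.
rare = shares[shares < 0.05].index

# Keep common categories, replace rare ones with "Other".
payment_grouped = payment.where(~payment.isin(rare), "Other")

print(shares)
print(payment_grouped.value_counts())
```

The same `value_counts(normalize=True)` call applied to a target column gives a quick view of class balance.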

Bivariate Analysis: Choosing the Right Technique

The best bivariate analysis method depends on the data types of the two variables being compared. Numerical-to-numerical relationships are different from categorical-to-numerical or categorical-to-categorical relationships.

Bivariate Analysis Matrix

Numerical vs Numerical

Use scatter plots, correlation, trend lines, and pair plots.

Example: House area vs. house price.

Categorical vs Numerical

Use grouped summaries, box plots, violin plots, and bar charts of averages.

Example: Product category vs. sales amount.

Categorical vs Categorical

Use cross-tabulation, stacked bar charts, and proportion tables.

Example: Plan type vs. churn status.

Time vs Numerical

Use line charts, rolling averages, seasonal plots, and trend analysis.

Example: Month vs. sales revenue.
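
For the time-based case, a rolling average is one simple way to see the trend behind noisy values. The sketch below assumes made-up monthly revenue figures and a 3-month window; both are illustrative choices.

```python
import pandas as pd

# Hypothetical monthly revenue, indexed by the first day of each month.
months = pd.date_range("2023-01-01", periods=6, freq="MS")
revenue = pd.Series([100, 120, 90, 130, 150, 140], index=months, name="revenue")

# 3-month rolling mean; the first two entries are NaN because the window is incomplete.
rolling = revenue.rolling(window=3).mean()

print(pd.DataFrame({"revenue": revenue, "rolling_3m": rolling}))
```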

Numerical vs Numerical Analysis

When both variables are numerical, bivariate analysis helps identify whether they move together, move opposite to each other, or show no clear relationship.

| Analysis Method | What It Shows | Example | Modelling Insight |
| --- | --- | --- | --- |
| Scatter Plot | Pattern between two continuous variables. | Advertising spend vs. sales. | Shows linear, non-linear, clustered, or outlier patterns. |
| Correlation | Strength and direction of linear relationship. | Income vs. credit limit. | Helps detect useful predictors and multicollinearity. |
| Trend Line | Average direction of relationship. | Property size vs. price. | Shows whether relationship may be linear or non-linear. |
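
A small sketch of the correlation check, using invented advertising data. Pearson correlation measures linear association; a rank-based (Spearman) coefficient is added here as a complement that picks up any monotonic relationship. It is computed as Pearson on ranks so the example needs only pandas.

```python
import pandas as pd

# Hypothetical advertising data with a roughly linear relationship.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "sales": [25, 41, 62, 79, 105, 118],
})

# Pearson: strength and direction of the linear relationship.
pearson = df["ad_spend"].corr(df["sales"])

# Spearman: Pearson correlation of the ranks (monotonic association).
spearman = df["ad_spend"].rank().corr(df["sales"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients is often a cue to look at the scatter plot for curvature or outliers.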

Categorical vs Numerical Analysis

When one variable is categorical and the other is numerical, we compare the distribution or average value of the numerical variable across categories.

For example, we may compare average monthly spending across customer segments or compare house prices across different locations.

| Analysis Method | What It Shows | Example | Modelling Insight |
| --- | --- | --- | --- |
| Grouped Mean / Median | Average numerical value by category. | Average spend by customer segment. | Shows which categories have higher or lower outcomes. |
| Box Plot by Category | Distribution of numerical variable across groups. | Salary distribution by department. | Shows spread, outliers, and group differences. |
| Bar Chart of Averages | Comparison of average values between categories. | Average order value by region. | Helps identify categories with predictive value. |
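
The grouped summaries above can be sketched with a pandas `groupby`. The segment names and spend values are hypothetical; one premium customer is made deliberately large to show why the median is worth reporting next to the mean.

```python
import pandas as pd

# Hypothetical spend per customer; one premium customer spends far more than the rest.
df = pd.DataFrame({
    "segment": ["basic", "basic", "basic", "premium", "premium", "premium"],
    "monthly_spend": [20, 25, 30, 60, 70, 200],
})

# Count, mean, and median of the numerical variable within each category.
by_segment = df.groupby("segment")["monthly_spend"].agg(["count", "mean", "median"])

print(by_segment)
```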

Categorical vs Categorical Analysis

When both variables are categorical, bivariate analysis focuses on frequency combinations and proportions. This is especially useful in classification problems.

For example, in customer churn prediction, we may analyse whether churn rate differs across plan types, regions, payment methods, or complaint categories.

| Analysis Method | What It Shows | Example | Modelling Insight |
| --- | --- | --- | --- |
| Cross-Tabulation | Counts for combinations of two categories. | Plan type vs. churn status. | Shows whether categories are associated with the target. |
| Proportion Table | Percentage distribution within categories. | Churn rate by region. | Helps compare groups fairly even when group sizes differ. |
| Stacked Bar Chart | Visual comparison of category proportions. | Payment method vs. repeat purchase. | Highlights group-level differences in outcome behaviour. |
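
A minimal sketch of a cross-tabulation and the matching proportion table, using `pandas.crosstab`. The plan and churn values are invented; `normalize="index"` converts each row to proportions so that groups of different sizes can be compared.

```python
import pandas as pd

# Hypothetical plan type and churn status for ten customers.
df = pd.DataFrame({
    "plan": ["monthly"] * 6 + ["annual"] * 4,
    "churn": ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "no", "no"],
})

# Raw counts for each combination of plan and churn status.
counts = pd.crosstab(df["plan"], df["churn"])

# Row-wise proportions: the churn rate within each plan type.
rates = pd.crosstab(df["plan"], df["churn"], normalize="index")

print(counts)
print(rates)
```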

Target-Based Bivariate Analysis

In predictive modelling, one of the most important uses of bivariate analysis is studying each feature against the target variable. This helps identify which features may be useful predictors.

Feature-to-Target Analysis Workflow

  1. Select Feature
  2. Compare with Target
  3. Identify Pattern
  4. Check Business Logic
  5. Use in Modelling
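
The first three workflow steps can be roughly automated for categorical features and a binary target. In the sketch below, `target_rate_by_feature` is a hypothetical helper, the data is invented, and the "spread" of target rates across categories is only a crude screening heuristic. Checking business logic still has to be done by a person.

```python
import pandas as pd

def target_rate_by_feature(df, features, target):
    """Hypothetical helper: mean of a 0/1 target within each category of each feature."""
    return {feature: df.groupby(feature)[target].mean() for feature in features}

# Hypothetical data: 'churned' is 1 for customers who left.
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "monthly", "annual", "annual", "annual"],
    "region": ["north", "south", "north", "south", "north", "south"],
    "churned": [1, 1, 0, 0, 0, 0],
})

rates = target_rate_by_feature(df, ["contract", "region"], "churned")

# Crude screening heuristic: gap between the highest and lowest category rate.
spread = {feature: r.max() - r.min() for feature, r in rates.items()}

for feature, r in rates.items():
    print(feature, r.to_dict(), "spread:", round(spread[feature], 2))
```

In this toy data the churn rate differs by contract type but not by region, so contract type would be the first candidate to examine further.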

Example: Customer Churn Analysis

Business Problem

A telecom company wants to predict customer churn. Before building the model, analysts perform univariate and bivariate analysis to understand customer behaviour.

| Analysis Type | Variable or Relationship | Finding | Modelling Decision |
| --- | --- | --- | --- |
| Univariate | Monthly charges | Distribution is right-skewed with a few high-value customers. | Check outliers and consider transformation if needed. |
| Univariate | Contract type | Most customers are on monthly contracts. | One-hot encode contract type and check its relationship with the target. |
| Bivariate | Contract type vs. churn | Monthly contract customers have a much higher churn rate. | Contract type is likely an important predictor. |
| Bivariate | Tenure vs. churn | New customers churn more frequently than long-term customers. | Create tenure groups or use tenure directly as a feature. |
| Bivariate | Support tickets vs. churn | Customers with repeated complaints show higher churn risk. | Create a complaint frequency feature. |
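
The "create tenure groups" decision could look something like the sketch below, which bins tenure with `pandas.cut` and compares churn rates across the bins. The data, the bin edges, and the labels are assumptions for illustration, not values from the telecom example.

```python
import pandas as pd

# Hypothetical tenure (in months) and churn flag for ten customers.
df = pd.DataFrame({
    "tenure_months": [1, 3, 5, 8, 14, 20, 30, 40, 50, 60],
    "churned":       [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Assumed bin edges; intervals are right-closed: (0, 6], (6, 24], (24, 72].
df["tenure_group"] = pd.cut(
    df["tenure_months"], bins=[0, 6, 24, 72], labels=["0-6", "7-24", "25-72"]
)

# Churn rate within each tenure group.
churn_by_tenure = df.groupby("tenure_group", observed=True)["churned"].mean()

print(churn_by_tenure)
```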

Example: House Price Analysis

Regression Problem

A real estate company wants to predict house prices. Univariate analysis helps understand individual variables, while bivariate analysis helps understand what drives price.

  • Univariate: Analyse distribution of price, area, rooms, property age, and location.
  • Bivariate: Analyse area vs. price, location vs. price, rooms vs. price, and property age vs. price.
  • Insight: If area has a strong positive relationship with price, it becomes an important model feature.
  • Insight: If price differs strongly by location, location encoding becomes important.
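
For the numerical features in this example, one quick screening step is to correlate each feature with price. The sketch below uses invented listings and hypothetical column names; the coefficients describe linear association only and do not show what causes price differences.

```python
import pandas as pd

# Hypothetical listings; price is in thousands.
df = pd.DataFrame({
    "area_sqft": [600, 800, 1000, 1200, 1500, 2000],
    "rooms": [2, 2, 3, 3, 4, 5],
    "age_years": [30, 5, 20, 12, 25, 8],
    "price": [150, 210, 240, 300, 360, 500],
})

# Pearson correlation of every numerical feature with the target, strongest first.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)

# Features that correlate strongly with each other may carry overlapping information.
area_rooms = df["area_sqft"].corr(df["rooms"])

print(corr_with_price)
print("area vs rooms:", round(area_rooms, 2))
```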

Common Patterns Found During Bivariate Analysis

| Pattern | Meaning | Possible Modelling Action |
| --- | --- | --- |
| Positive Relationship | As one variable increases, the other also increases. | Use as predictor; consider linear relationship. |
| Negative Relationship | As one variable increases, the other decreases. | Use as predictor; check business interpretation. |
| No Clear Relationship | Variables do not show visible association. | Feature may have weak individual predictive power. |
| Non-Linear Relationship | Relationship changes direction or shape. | Use transformations, bins, or tree-based models. |
| Group Difference | Numerical outcome differs across categories. | Encode category carefully and consider interaction features. |
| Outlier Relationship | Some points behave very differently from the pattern. | Investigate outliers and decide whether to treat them. |

Common Mistakes to Avoid

| Mistake | Why It Is Harmful | Better Approach |
| --- | --- | --- |
| Skipping univariate analysis | Data quality issues, skewness, and outliers may remain hidden. | Analyse every important variable individually first. |
| Using only correlation | Correlation captures mainly linear relationships and may miss non-linear patterns. | Use scatter plots and grouped summaries along with correlation. |
| Ignoring categorical relationships | Important category-level patterns may be missed. | Use cross-tabs, stacked bars, and group-wise target rates. |
| Confusing association with causation | A relationship between two variables does not prove one causes the other. | Interpret relationships carefully and validate with business logic. |
| Not analysing features against target | Important predictive signals may remain undiscovered. | Perform feature-to-target analysis for every meaningful feature. |
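
To illustrate the "using only correlation" mistake, the sketch below builds a deliberately artificial U-shaped relationship. The Pearson coefficient is essentially zero even though `y` is fully determined by `x`, while a binned summary exposes the pattern. The bin edges and labels are arbitrary choices.

```python
import pandas as pd

# Artificial example: y depends entirely on x, but not linearly.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

# Pearson correlation is (numerically) zero for this symmetric U-shape.
pearson = x.corr(y)

# A binned summary shows that y is high at both ends and low in the middle.
x_bin = pd.cut(x, bins=[-4, -1.5, 1.5, 4], labels=["low", "mid", "high"])
y_by_bin = y.groupby(x_bin, observed=True).mean()

print("Pearson:", round(pearson, 3))
print(y_by_bin)
```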

Best Practices for Univariate and Bivariate Analysis

Analysis Checklist

  • Start with univariate analysis: Understand each variable before studying relationships.
  • Separate numerical and categorical variables: Use different summaries and visuals for each type.
  • Analyse the target variable carefully: Check class imbalance, skewness, and unusual values.
  • Use bivariate analysis with the target: Identify features that may have predictive value.
  • Use the right chart: Histograms for numerical distributions, bar charts for categories, scatter plots for numerical relationships.
  • Compare groups carefully: Use proportions, medians, and distributions, not only raw counts.
  • Look for non-linear relationships: Not all predictive patterns are straight-line relationships.
  • Connect findings to feature engineering: Convert EDA insights into useful model inputs.
  • Validate with business logic: Statistical patterns should make practical sense.

How This Analysis Improves Predictive Models

Univariate analysis improves modelling by revealing data quality issues, outliers, skewness, missingness, and imbalance. Bivariate analysis improves modelling by revealing relationships, target patterns, useful features, and possible transformations.

Together, these methods help analysts move from raw data to modelling strategy. They guide decisions about encoding, scaling, transformations, feature selection, feature engineering, and model evaluation.

Practical Rule: Do not start modelling before asking two questions: “What does each variable look like?” and “How does each important variable relate to the target?”

Key Takeaways

  • Univariate analysis studies one variable at a time.
  • Bivariate analysis studies the relationship between two variables.
  • Univariate analysis helps detect distributions, outliers, rare categories, and imbalance.
  • Bivariate analysis helps identify relationships and useful predictors.
  • Numerical, categorical, and time-based variables require different analysis techniques.
  • Feature-to-target analysis is especially important for predictive modelling.
  • EDA findings should guide preprocessing, feature engineering, model selection, and evaluation strategy.
  • Strong predictive modelling begins with careful univariate and bivariate analysis.