Correlation Analysis and Heatmaps
Correlation analysis helps us understand how numerical variables move together. In predictive modelling, it is commonly used to identify relationships between features, detect redundant variables, understand target associations, and diagnose multicollinearity.
Heatmaps make correlation analysis easier by converting a correlation matrix into a visual format. Instead of reading many numbers, we can quickly identify strong positive relationships, strong negative relationships, and weak relationships using color intensity.
What is Correlation?
Correlation is a statistical measure that describes the strength and direction of the relationship between two numerical variables. It tells us whether two variables tend to increase together, move in opposite directions, or have little visible linear relationship.
Correlation values always fall between -1 and +1. A value close to +1 indicates a strong positive relationship, a value close to -1 indicates a strong negative relationship, and a value close to 0 indicates a weak or absent linear relationship.
Core Idea: Correlation answers the question: “When one numerical variable changes, does another numerical variable tend to change with it?”
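As a minimal sketch of the idea, the snippet below computes the Pearson correlation by hand (covariance divided by the product of the two standard deviations) and compares it with NumPy's built-in `np.corrcoef`. The helper name `pearson_r` and the temperature values are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# Illustrative values: Fahrenheit is an exact linear transformation of Celsius.
celsius = np.array([-10.0, 0.0, 12.5, 25.0, 40.0])
fahrenheit = celsius * 9 / 5 + 32

r_same = pearson_r(celsius, fahrenheit)        # essentially +1
r_opposite = pearson_r(celsius, -fahrenheit)   # essentially -1
r_numpy = float(np.corrcoef(celsius, fahrenheit)[0, 1])  # built-in equivalent

print(round(r_same, 6), round(r_opposite, 6), round(r_numpy, 6))
```

Flipping the sign of one variable flips the sign of the correlation but leaves its strength unchanged.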
Understanding Correlation Values
| Correlation Value | Relationship Type | Meaning | Example |
|---|---|---|---|
| +1.0 | Perfect Positive | Both variables increase together perfectly. | Celsius and Fahrenheit temperature scales. |
| +0.7 to +0.9 | Strong Positive | Higher values of one variable are strongly associated with higher values of another. | House area and house price. |
| +0.3 to +0.6 | Moderate Positive | Variables move together, but not perfectly. | Advertising spend and sales revenue. |
| Around 0 | Weak / No Linear | No clear linear relationship is visible. | Customer ID and purchase amount. |
| -0.3 to -0.6 | Moderate Negative | As one variable increases, the other tends to decrease. | Product price and demand in some markets. |
| -0.7 to -0.9 | Strong Negative | Higher values of one variable are strongly associated with lower values of another. | Loan repayment discipline and default risk. |
| -1.0 | Perfect Negative | One variable increases exactly as the other decreases. | Rare in business datasets but possible in mathematical transformations. |
Visual Understanding of Correlation
Figure: Three common correlation patterns.
Why Correlation Matters in Predictive Modelling
Types of Correlation
Different correlation methods are used for different types of relationships. The most common methods in predictive analytics are Pearson correlation and Spearman correlation, with Kendall correlation as an alternative for small or ordinal datasets.
| Correlation Type | What It Measures | Best Used When | Example |
|---|---|---|---|
| Pearson Correlation | Linear relationship between two numerical variables. | Data is continuous and relationship is approximately linear. | House area vs. house price. |
| Spearman Correlation | Monotonic relationship using ranks. | Relationship is not perfectly linear or data has outliers/skewness. | Customer satisfaction rank vs. renewal likelihood. |
| Kendall Correlation | Ordinal association based on rank agreement. | Small datasets or ordinal ranking problems. | Ranked preference scores vs. product ratings. |
Pearson Correlation
Pearson correlation measures the strength and direction of a linear relationship between two numerical variables. It is the most commonly used correlation method during EDA.
However, Pearson correlation can miss non-linear relationships. It can also be affected by outliers because extreme values may pull the relationship upward or downward.
Important: Pearson correlation is useful for linear relationships, but a low Pearson correlation does not always mean there is no relationship. The relationship may be non-linear.
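A small sketch of this caveat, using made-up data: `y` is completely determined by `x`, yet because the relationship is a symmetric curve rather than a line, the Pearson correlation is essentially zero.

```python
import numpy as np

# Illustrative data: x is symmetric around zero and y depends only on x.
x = np.linspace(-3, 3, 61)
y = x ** 2  # perfect, but non-linear, dependence

r_quadratic = float(np.corrcoef(x, y)[0, 1])
print(f"Pearson r for y = x^2: {r_quadratic:.4f}")  # practically zero
```

A scatter plot of `x` against `y` would reveal the U-shaped pattern immediately, which is why plots should accompany correlation numbers.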
Spearman Correlation
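Spearman correlation measures whether two variables move together in a consistent ranked order: it is the Pearson correlation computed on the ranks of the values rather than on the values themselves. It is useful when the relationship is monotonic but not necessarily linear.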
For example, as customer satisfaction increases, renewal likelihood may also increase, but not in a perfectly straight-line pattern. Spearman correlation can capture this type of ranked relationship better than Pearson correlation.
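The difference can be sketched with pandas, assuming a hypothetical `satisfaction` score and a `renewal_index` that always rises with it but along a steep curve. The column names and the exponential shape are assumptions chosen only to make the contrast visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: renewal_index always rises with satisfaction, but not linearly.
scores = pd.DataFrame({"satisfaction": np.arange(1, 11, dtype=float)})
scores["renewal_index"] = np.exp(scores["satisfaction"])

pearson_r = scores.corr(method="pearson").loc["satisfaction", "renewal_index"]
spearman_r = scores.corr(method="spearman").loc["satisfaction", "renewal_index"]

# Spearman sees a perfectly consistent ordering; Pearson is pulled below 1
# because the points do not lie on a straight line.
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```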
Correlation Matrix
A correlation matrix is a table that shows correlation values between multiple numerical variables. Each cell shows the correlation between two variables.
In a correlation matrix, the diagonal values are always 1 because every variable is perfectly correlated with itself. The matrix is also symmetric, meaning the correlation between A and B is the same as the correlation between B and A.
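The following sketch builds a correlation matrix with pandas `DataFrame.corr()` and checks the two properties just described. The data is synthetic and the column names (`area`, `rooms`, `age`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic housing-style data, for illustration only.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)
features = pd.DataFrame({
    "area": area,
    "rooms": area / 400 + rng.normal(0, 0.5, n),  # loosely tied to area
    "age": rng.uniform(0, 40, n),                 # unrelated to area
})

corr_matrix = features.corr()  # Pearson is the default method
print(corr_matrix.round(2))

# Diagonal is 1 (each variable with itself) and the matrix is symmetric.
diagonal_is_one = np.allclose(np.diag(corr_matrix), 1.0)
is_symmetric = np.allclose(corr_matrix, corr_matrix.T)
print(diagonal_is_one, is_symmetric)
```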
Example Correlation Heatmap
How to Read a Correlation Heatmap
A heatmap uses color intensity to represent correlation values. Strong positive relationships, strong negative relationships, and weak relationships become visually easy to identify.
| Heatmap Observation | Meaning | Possible Modelling Action |
|---|---|---|
| Dark positive cells | Two variables move strongly in the same direction. | Check whether both features are needed or redundant. |
| Dark negative cells | One variable increases while the other decreases. | Interpret business logic and consider feature importance. |
| Near-zero cells | Weak linear relationship. | Feature may still be useful if the relationship is non-linear or if it interacts with other features. |
| Feature highly correlated with target | Feature may be predictive. | Investigate further and validate during modelling. |
| Features highly correlated with each other | Possible multicollinearity or duplicated information. | Remove one feature, combine features, or use regularization if needed. |
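One possible way to draw such a heatmap is sketched below with Matplotlib's `imshow`. The data is synthetic, the column names are assumptions, and the diverging `coolwarm` colormap with fixed limits of -1 and +1 is a design choice so that zero sits at the neutral midpoint of the color scale.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
n = 300
area = rng.normal(1500, 300, n)
houses = pd.DataFrame({
    "area": area,
    "rooms": area / 300 + rng.normal(0, 0.5, n),
    "age": rng.uniform(0, 40, n),
})
houses["price"] = 200 * houses["area"] - 3000 * houses["age"] + rng.normal(0, 20000, n)

corr = houses.corr()
labels = list(corr.columns)

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color limits so the same color always means the same correlation.
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write the numeric value inside each cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Calling `plt.show()` displays the figure in an interactive session. The seaborn library's `heatmap` function with `annot=True` is a common shortcut that produces a similar plot.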
Correlation Analysis Workflow
Figure: Practical correlation analysis pipeline.
Correlation with the Target Variable
In predictive modelling, one of the most useful checks is correlation between numerical features and the target variable. This helps identify features that may have predictive value.
For regression problems, the target is numerical, so direct correlation with features is straightforward. For binary classification problems, the Pearson correlation with a 0/1 encoded target (the point-biserial correlation) can indicate the direction of an association, but its magnitude should be interpreted with caution because it also depends on the class balance.
Example: House Price Prediction
Suppose a real estate company wants to predict house price. Correlation analysis may show:
| Feature | Correlation with Price | Interpretation |
|---|---|---|
| House Area | +0.82 | Larger houses tend to have higher prices. |
| Number of Rooms | +0.68 | More rooms are generally associated with higher prices. |
| Property Age | -0.45 | Older properties may have lower prices, depending on location and condition. |
| Distance from City Centre | -0.58 | Properties farther from the city centre may have lower prices. |
These correlations help identify important predictors, but final feature selection should still be validated using model performance.
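A feature-to-target ranking like the one in the table can be sketched as follows. The data is synthetic (it is not the dataset behind the figures above), and the column names and coefficients are assumptions chosen only to produce positive and negative associations.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
n = 500
houses = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "age": rng.uniform(0, 40, n),
    "distance_km": rng.uniform(0, 30, n),
})
houses["price"] = (
    200 * houses["area"]
    - 3000 * houses["age"]
    - 5000 * houses["distance_km"]
    + rng.normal(0, 20000, n)
)

# Correlation of every feature with the target, excluding the target itself.
target_corr = houses.corr()["price"].drop("price")

# Rank by absolute strength so strong negative features are not buried.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Sorting by absolute value matters: a feature with a correlation of -0.58 is more strongly associated with the target than one with +0.45.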
Multicollinearity
Multicollinearity occurs when two or more input features are highly correlated with each other. This can make it difficult to interpret the separate effect of each feature, especially in linear regression and logistic regression.
For example, house area and number of rooms may be highly correlated. If both are used in a linear model, the model may struggle to estimate their individual effects clearly.
Signs of multicollinearity:
- Two or more features have very high correlation.
- Model coefficients become unstable.
- Feature importance becomes difficult to interpret.
- Small data changes cause large coefficient changes.
Ways to handle multicollinearity:
- Remove one of the highly correlated features.
- Combine features into a new ratio or index.
- Use regularization such as Ridge or Lasso.
- Use tree-based models if interpretability allows.
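A first screening step for highly correlated feature pairs can be sketched as below. The helper `high_correlation_pairs` is a hypothetical name, the 0.8 threshold is an assumed rule of thumb rather than a fixed standard, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (a variable with itself) is ignored.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: rooms is almost a rescaled copy of area, age is unrelated.
rng = np.random.default_rng(3)
n = 300
area = rng.normal(1500, 300, n)
features = pd.DataFrame({
    "area": area,
    "rooms": area / 300 + rng.normal(0, 0.3, n),
    "age": rng.uniform(0, 40, n),
})

flagged = high_correlation_pairs(features, threshold=0.8)
print(flagged)
```

Pairs flagged this way are candidates for review, not automatic removal. Pairwise screening also cannot detect a feature that is a combination of several others.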
Correlation Does Not Mean Causation
Correlation only shows association. It does not prove that one variable causes another. Two variables may move together because of a third hidden variable, coincidence, seasonality, or business process effects.
Important: If ice cream sales and cold drink sales are correlated, it does not mean ice cream sales cause cold drink sales. A third factor, such as hot weather, may influence both.
Limitations of Correlation Analysis
| Limitation | Why It Matters | Better Practice |
|---|---|---|
| Captures mostly linear relationships | A non-linear relationship may show low correlation. | Use scatter plots and non-linear models when needed. |
| Sensitive to outliers | Extreme values can inflate or reduce correlation. | Inspect outliers before trusting correlation values. |
| Does not prove causation | Association may not represent cause and effect. | Use business logic, experiments, or causal analysis when needed. |
| Mostly for numerical variables | Categorical relationships need different methods. | Use cross-tabs, target rates, chi-square tests, or group summaries for categories. |
| Feature interactions may be missed | A feature may be useful only when combined with another feature. | Explore interactions and use model-based feature importance. |
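The outlier limitation can be illustrated with a few made-up points: ten observations with almost no linear relationship, followed by a single extreme observation added to both variables.

```python
import numpy as np

# Ten made-up observations with essentially no linear relationship.
x = np.arange(1, 11, dtype=float)
y = np.array([5, 3, 6, 2, 7, 4, 6, 3, 5, 4], dtype=float)
r_without_outlier = float(np.corrcoef(x, y)[0, 1])

# Add one extreme point far from the rest of the data.
x_out = np.append(x, 50.0)
y_out = np.append(y, 50.0)
r_with_outlier = float(np.corrcoef(x_out, y_out)[0, 1])

print(f"Without outlier: {r_without_outlier:.2f}")
print(f"With outlier:    {r_with_outlier:.2f}")
```

A single point is enough to turn a near-zero correlation into one that looks very strong, which is why comparing values before and after outlier treatment is worthwhile.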
How Correlation Supports Feature Selection
Correlation analysis can help remove redundant variables and identify promising predictors, but it should not be the only feature selection method. Some features with low correlation may still be useful in non-linear models or through interactions with other features.
| Correlation Finding | Possible Action |
|---|---|
| Feature has high correlation with target | Consider it as a potentially important predictor. |
| Two features are highly correlated with each other | Consider removing one, combining them, or using regularization. |
| Feature has near-zero correlation with target | Do not remove blindly; check non-linear patterns and model-based importance. |
| Correlation changes across segments | Create segment-specific features or interaction terms. |
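The last row of the table can be checked by computing correlations per segment. In this sketch the segments, column names, and the opposite-sign relationships are all assumptions, constructed so that the overall correlation hides what happens inside each group.

```python
import numpy as np
import pandas as pd

# Synthetic data: spend rises with usage in segment A and falls in segment B.
rng = np.random.default_rng(5)
n = 200
usage_a = rng.uniform(0, 10, n)
usage_b = rng.uniform(0, 10, n)
customers = pd.DataFrame({
    "segment": ["A"] * n + ["B"] * n,
    "usage": np.concatenate([usage_a, usage_b]),
    "spend": np.concatenate([
        2 * usage_a + rng.normal(0, 1, n),
        20 - 2 * usage_b + rng.normal(0, 1, n),
    ]),
})

overall_r = customers["usage"].corr(customers["spend"])
segment_r = {
    name: group["usage"].corr(group["spend"])
    for name, group in customers.groupby("segment")
}

print(f"Overall: {overall_r:.2f}")
for name, r in segment_r.items():
    print(f"Segment {name}: {r:.2f}")
```

When the per-segment values differ this much from the overall value, a segment indicator or an interaction term is usually worth testing in the model.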
Example: Correlation Analysis in Customer Churn
Business Problem
A telecom company wants to predict customer churn. The target variable is encoded as 1 for churned and 0 for not churned. Analysts examine correlations between numerical features and churn.
| Feature | Correlation with Churn | Business Interpretation | Possible Action |
|---|---|---|---|
| Customer Tenure | -0.70 | Longer-tenure customers are less likely to churn. | Create tenure groups and use tenure as a key feature. |
| Monthly Charges | +0.38 | Higher charges may be associated with higher churn risk. | Check pricing sensitivity by customer segment. |
| Support Tickets | +0.55 | More complaints may indicate dissatisfaction. | Create complaint frequency and unresolved-ticket features. |
| Total Spend | -0.48 | Customers with higher lifetime value may be more loyal. | Use lifetime value as a retention-focused feature. |
These correlations provide useful clues, but the final churn model should still be evaluated using validation and test data.
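As a rough sketch of correlating a numerical feature with a 0/1 churn flag, the snippet below simulates customers whose churn probability falls as tenure grows. The data, the column names, and the logistic shape are assumptions and do not reproduce the figures in the table above.

```python
import numpy as np
import pandas as pd

# Simulated customers: churn becomes less likely as tenure increases.
rng = np.random.default_rng(7)
n = 1000
tenure = rng.uniform(1, 72, n)
churn_prob = 1 / (1 + np.exp(0.08 * (tenure - 24)))
telecom = pd.DataFrame({
    "tenure_months": tenure,
    "churn": (rng.uniform(size=n) < churn_prob).astype(int),  # 1 = churned
})

# Pearson correlation with a 0/1 target gives the direction of the association.
tenure_churn_r = telecom["tenure_months"].corr(telecom["churn"])

# Group means are an easier-to-explain cross-check of the same direction.
mean_tenure_by_churn = telecom.groupby("churn")["tenure_months"].mean()

print(f"Correlation with churn: {tenure_churn_r:.2f}")
print(mean_tenure_by_churn.round(1))
```

A negative correlation here goes together with a lower average tenure among churned customers, and the group means are often the easier result to communicate.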
Best Practices for Correlation Analysis
Correlation and Heatmap Checklist
- Use numerical variables first: Correlation is mainly designed for numerical relationships.
- Check scatter plots: Do not rely only on correlation numbers.
- Choose the right method: Use Pearson for linear relationships and Spearman for ranked or monotonic relationships.
- Inspect outliers: Extreme values can distort correlation values.
- Look at feature-target correlations: They may reveal useful predictors.
- Look at feature-feature correlations: They may reveal redundancy or multicollinearity.
- Use heatmaps for many variables: They help quickly identify strong relationship patterns.
- Do not confuse correlation with causation: Always interpret relationships with business logic.
- Validate with modelling: Correlation is an EDA tool, not final proof of feature usefulness.
Common Mistakes to Avoid
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Assuming high correlation means causation | Can lead to wrong business decisions. | Use business reasoning, experiments, or causal methods when needed. |
| Removing all low-correlation features | Some features may be useful in non-linear models or interactions. | Use model-based feature importance and validation performance. |
| Ignoring multicollinearity | Linear model coefficients may become unstable. | Remove redundant features or use regularization. |
| Using Pearson correlation blindly | May miss monotonic or non-linear relationships. | Use Spearman correlation and scatter plots when appropriate. |
| Not checking outliers | Outliers can create misleading correlation values. | Inspect distributions and compare correlations before and after treatment. |
Why Heatmaps are Useful in EDA
Heatmaps make large correlation matrices easier to understand. When a dataset has many numerical variables, reading pairwise correlations from a table can be difficult. A heatmap highlights strong relationships visually using color.
This helps analysts quickly identify which variables may be connected to the target, which predictors may be redundant, and where deeper investigation is needed.
Practical Insight: A correlation heatmap is not the final answer. It is a map that tells you where to investigate further before modelling.
Key Takeaways
- Correlation measures the strength and direction of the relationship between numerical variables.
- Correlation values range from -1 to +1.
- Positive correlation means variables move in the same direction.
- Negative correlation means variables move in opposite directions.
- Pearson correlation measures linear relationships.
- Spearman correlation measures ranked or monotonic relationships.
- Heatmaps visually display correlation matrices using color intensity.
- Correlation analysis helps identify useful predictors, redundant features, and multicollinearity.
- Correlation does not prove causation and should always be interpreted with business logic.