Handling Missing Values: Imputation and Deletion
Missing values are one of the most common problems in real-world datasets. Before building a predictive model, we must identify missing data, understand why it is missing, and decide whether to remove it or fill it using suitable techniques.
The way we handle missing values directly affects model accuracy, bias, reliability, and business interpretation. Careless treatment of missing values can silently damage the entire predictive modelling process.
What are Missing Values?
Missing values are blank, unavailable, unknown, or invalid entries in a dataset. They occur when information is not recorded, not collected, lost during data transfer, or intentionally skipped.
In datasets, missing values may appear as empty cells, NULL, NaN, NA, None, a zero entered in place of an unknown value, or special placeholders such as “Unknown” or “Not Available”.
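Because these encodings differ, it usually helps to convert them to one missing marker before counting anything. The following is a minimal pandas sketch on made-up data; the placeholder list and the decision to treat 0 and -1 as missing income are assumptions for this toy example, not rules that hold for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which missing values are encoded in several ways.
raw = pd.DataFrame({
    "city": ["Pune", "Unknown", "", "Delhi", None],
    "income": [52000, 0, 61000, -1, 48000],
})

# Text placeholders that should count as missing (an assumed list; adjust per dataset).
text_placeholders = ["", "Unknown", "Not Available", "NA", "NULL"]

clean = raw.copy()
clean["city"] = clean["city"].replace(text_placeholders, np.nan)

# Assumption for this example only: an income of 0 or -1 is a placeholder code,
# not a real value. A genuine zero must not be converted this way.
clean["income"] = clean["income"].replace([0, -1], np.nan)

# Number of missing entries per column after normalisation.
missing_counts = clean.isna().sum()
print(missing_counts)
```

Whether a zero is a real value or a placeholder is a business question, so that replacement should be confirmed with whoever owns the data.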
Core Idea: Missing values are not just technical errors. They may contain important information about customer behaviour, business processes, system failures, or data collection gaps.
Why Missing Values Matter in Predictive Modelling
Many machine learning algorithms cannot work directly with missing values. Even when an algorithm can handle them, missing data can distort the distributions and relationships the model learns and, in turn, its performance.
Common Reasons for Missing Values
Before deciding how to handle missing values, we should understand why they are missing. The reason often determines the best treatment method.
| Reason | Example | Possible Meaning |
|---|---|---|
| Not Collected | A customer did not provide income information. | The field may be optional or sensitive. |
| System Error | Sensor failed to record temperature. | Technical issue in data capture. |
| Not Applicable | Loan amount is missing for customers who never applied for a loan. | The value is logically not relevant. |
| Manual Entry Error | Salesperson forgot to enter customer location. | Human error in data recording. |
| Data Integration Issue | Customer ID exists in one system but not another. | Mismatch during data merging. |
| Intentional Non-Response | Survey respondent skips salary question. | Missingness may be related to privacy or behaviour. |
Types of Missing Data
Statistically, missing data is often classified into three types. Understanding these types helps us choose among deletion, simple imputation, and more advanced methods.
| Missing Data Type | Meaning | Example | Handling Difficulty |
|---|---|---|---|
| MCAR (Missing Completely at Random) | Missingness has no relationship with any observed or unobserved variable. | A few records are lost due to random system failure. | Comparatively easier to handle. |
| MAR (Missing at Random) | Missingness is related to other observed variables. | Income is more often missing for younger customers, and age is available. | Can often be handled with informed imputation. |
| MNAR (Missing Not at Random) | Missingness is related to the missing value itself. | High-income customers avoid disclosing income. | Most difficult and may introduce bias. |
Important: If missing values are not random, deleting or blindly filling them can create biased models. Always investigate the pattern of missingness before applying a treatment.
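One simple way to start investigating the pattern is to measure how much is missing per column and whether the missing rate differs across groups of an observed variable. The sketch below uses invented customer data that mirrors the age and income example; the 30-year cut-off and the band labels are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; income is missing mostly for younger customers.
df = pd.DataFrame({
    "age": [22, 24, 23, 25, 41, 45, 52, 38],
    "income": [np.nan, np.nan, 30000, np.nan, 72000, 65000, np.nan, 58000],
})

# Percentage of missing values in each column.
missing_pct = df.isna().mean() * 100

# Missing rate of income within each age band (the cut-off is an assumption).
age_band = np.where(df["age"] < 30, "under_30", "30_plus")
income_missing_rate = df["income"].isna().groupby(age_band).mean()

print(missing_pct)
print(income_missing_rate)
```

A clear difference between groups suggests the data is not MCAR. It cannot distinguish MAR from MNAR, because MNAR depends on the values that were never observed.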
Step-by-Step Missing Value Treatment Workflow
Missing Value Handling Pipeline
Deletion Methods
Deletion means removing missing values from the dataset. This can be done by removing rows or columns. It is simple, but it can lead to information loss if used carelessly.
1. Row Deletion
In row deletion, entire records are removed if they contain missing values. This is appropriate only when the number of affected records is small and the missingness is mostly random.
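As a rough illustration, row deletion in pandas can be sketched with `dropna`. The data frame and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing income.
df = pd.DataFrame({
    "age": [34, 45, np.nan, 29, 51],
    "income": [52000, np.nan, 61000, 48000, 75000],
    "city": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
})

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
rows_with_income = df.dropna(subset=["income"])

# Check how many records the stricter rule removed.
rows_lost = len(df) - len(complete_rows)
print(f"Removed {rows_lost} of {len(df)} rows")
```

Reporting the number of removed rows makes the cost of deletion visible before it is accepted.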
2. Column Deletion
In column deletion, an entire feature is removed if it contains too many missing values or has very limited predictive value.
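A possible sketch of column deletion drops features whose share of missing values exceeds a threshold. The 50% threshold and the column names are assumptions for illustration; a real cut-off should also consider the business value of the feature.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which one column is almost entirely empty.
df = pd.DataFrame({
    "age": [34, 45, np.nan, 29, 51],
    "fax_number": [np.nan, np.nan, np.nan, "555-0100", np.nan],
    "income": [52000, np.nan, 61000, 48000, 75000],
})

max_missing_share = 0.5  # assumed threshold, not a universal rule

# Share of missing values per column, between 0 and 1.
missing_share = df.isna().mean()

# Columns whose missing share exceeds the threshold.
columns_to_drop = missing_share[missing_share > max_missing_share].index.tolist()
reduced = df.drop(columns=columns_to_drop)

print(columns_to_drop)
```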
| Deletion Method | When to Use | Advantages | Risks |
|---|---|---|---|
| Row Deletion | Few rows have missing values and data loss is minimal. | Simple and quick. | Can reduce sample size and introduce bias. |
| Column Deletion | A feature has very high missingness and low business value. | Removes unreliable variables. | May remove a potentially important predictor. |
Practical Rule: Deletion is safe only when the share of missing data is small, the missingness is random, and it is not strongly related to the target variable.
Imputation Methods
Imputation means replacing missing values with estimated values. Instead of removing data, imputation tries to preserve records by filling gaps intelligently.
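The simplest form of this idea fills each gap with a summary statistic of the same column. The sketch below uses pandas on invented data: mean for a roughly symmetric column, median for a skewed one, and the most frequent value for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age is roughly symmetric, income has one large outlier.
df = pd.DataFrame({
    "age": [30.0, 40.0, np.nan, 50.0, 40.0],
    "income": [40000.0, 45000.0, np.nan, 50000.0, 400000.0],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai"],
})

filled = df.copy()

# Mean for a roughly symmetric numerical column.
filled["age"] = df["age"].fillna(df["age"].mean())

# Median for a skewed numerical column; the 400000 outlier barely moves it.
filled["income"] = df["income"].fillna(df["income"].median())

# Most frequent category for a categorical column.
filled["city"] = df["city"].fillna(df["city"].mode()[0])

# The filled column is less spread out than the observed values were.
print(df["age"].std(), filled["age"].std())
```

In this toy data the mean of income is 133750 while the median is 47500, which shows why the median is usually the safer fill value when outliers are present. The standard deviation of age also shrinks after filling, since every filled value sits exactly at the centre.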
Common Imputation Techniques
| Imputation Method | Best For | Example | Limitation |
|---|---|---|---|
| Mean Imputation | Numerical data with roughly symmetric distribution. | Replace missing age with average age. | Sensitive to outliers and reduces variance. |
| Median Imputation | Numerical data with skewness or outliers. | Replace missing income with median income. | May still oversimplify the true pattern. |
| Mode Imputation | Categorical variables. | Replace missing city with most common city. | Can overrepresent the most frequent category. |
| Forward Fill (time-based) | Time series or ordered records. | Use previous day’s stock price for a missing day. | Can be misleading if values change quickly. |
| Backward Fill (time-based) | Time series where future nearby values are acceptable for analysis. | Use next recorded sensor reading to fill a missing reading. | Can introduce future information if used incorrectly in modelling. |
| KNN Imputation (advanced) | Datasets where similar records can help estimate missing values. | Fill missing income based on similar customers. | Computationally heavier and affected by feature scaling. |
| Regression Imputation (advanced) | When the missing variable can be predicted from other variables. | Predict missing house size using price, location, and rooms. | Can create overly confident estimates. |
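The two time-based methods in the table can be sketched with the pandas `ffill` and `bfill` methods. The price series below is invented, and the `limit=1` setting is just one way to avoid carrying a stale value across a long gap.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices with gaps.
prices = pd.Series(
    [101.0, np.nan, np.nan, 104.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="close_price",
)

# Forward fill: carry the last known value forward.
forward_filled = prices.ffill()

# Backward fill: pull the next known value backward.
# The final gap stays empty because no later value exists.
backward_filled = prices.bfill()

# Forward fill at most one consecutive gap, leaving longer gaps visible.
limited = prices.ffill(limit=1)

print(forward_filled.tolist())
```

Backward fill uses values recorded after the gap, so it is generally unsuitable for features that feed a forecasting model.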
When to Use Deletion vs Imputation
The right method depends on the amount of missing data, reason for missingness, business importance of the variable, and model sensitivity.
Use deletion when:
- Missing values are very few.
- Missingness appears random.
- Removing records does not reduce dataset size significantly.
- The column has extremely high missingness and low value.
Use simple imputation when:
- Rows contain useful information in other columns.
- The feature is important for prediction.
- Missingness is moderate and manageable.
- You want to preserve dataset size.
Use advanced imputation when:
- Missing values depend on other variables.
- Simple mean or mode imputation may distort patterns.
- The dataset has enough related features.
- Model accuracy is highly sensitive to missing treatment.
Use a missing indicator when:
- The fact that a value is missing may itself carry meaning.
- Customers intentionally skip sensitive questions.
- Missingness is related to behaviour or risk.
- You want the model to learn from missingness patterns.
Missing Indicator Variables
Sometimes, the absence of a value is informative. In such cases, we can create a missing indicator variable that marks whether the original value was missing.
For example, if customers who do not disclose income are more likely to default on loans, then “income missing” becomes a useful predictive signal.
Example: Instead of only filling missing income values, create an additional column called income_missing with values 1 for missing and 0 for available. This allows the model to learn whether missingness itself is important.
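A minimal pandas sketch of this idea follows. The loan data is invented, and the key detail is that the indicator must be created before the gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data; two applicants did not disclose income.
loans = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, np.nan, 48000.0],
    "defaulted": [0, 1, 0, 1, 0],
})

# Create the flag first: once the gaps are filled, this information is gone.
loans["income_missing"] = loans["income"].isna().astype(int)

# Then fill the gaps, here with the median of the observed incomes.
loans["income"] = loans["income"].fillna(loans["income"].median())

print(loans)
```

In scikit-learn, `SimpleImputer(add_indicator=True)` produces a similar flag column as part of the imputation step.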
Example: Handling Missing Values in Loan Data
Business Problem
A bank wants to predict whether a loan applicant may default. The dataset contains missing values in income, employment type, credit score, and loan purpose.
| Variable | Missing Problem | Recommended Treatment | Reason |
|---|---|---|---|
| Income | Some applicants did not disclose income. | Median imputation + missing indicator. | Income is often skewed and missingness may carry risk information. |
| Employment Type | Few missing category values. | Mode imputation or “Unknown” category. | Categorical variable where missingness may represent unreported employment status. |
| Credit Score | Important numerical variable with moderate missingness. | KNN or regression-based imputation. | Credit score is highly predictive, so simple deletion may lose valuable records. |
| Loan Purpose | Very few missing values. | Mode imputation. | Simple treatment is acceptable when missingness is low. |
This example shows that different variables may require different missing value treatments within the same dataset.
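One way to apply a different treatment to each variable is a scikit-learn `ColumnTransformer`. The sketch below follows the table above on a tiny invented dataset. The column names, the extra `loan_amount` feature used to find neighbours, and `n_neighbors=2` are assumptions, and the "Unknown" category is chosen for employment type where mode imputation would be the alternative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical applicants; object dtype keeps NaN as the missing marker for text.
applicants = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 48000.0, np.nan, 75000.0],
    "credit_score": [710.0, 640.0, np.nan, 690.0, 600.0, 760.0],
    "loan_amount": [10000.0, 15000.0, 12000.0, 9000.0, 20000.0, 11000.0],
    "employment_type": pd.Series(
        ["salaried", np.nan, "self_employed", "salaried", "salaried", np.nan],
        dtype=object,
    ),
    "loan_purpose": pd.Series(
        ["home", "car", np.nan, "home", "education", "home"], dtype=object
    ),
})

preprocess = ColumnTransformer([
    # Income: median fill plus a 0/1 missing-indicator column.
    ("income", SimpleImputer(strategy="median", add_indicator=True), ["income"]),
    # Credit score: KNN is distance based, so scale first.
    # StandardScaler ignores NaN when fitting and keeps it when transforming.
    ("credit", make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2)),
     ["credit_score", "loan_amount"]),
    # Employment type: explicit "Unknown" category.
    ("employment", SimpleImputer(strategy="constant", fill_value="Unknown"),
     ["employment_type"]),
    # Loan purpose: most frequent category.
    ("purpose", SimpleImputer(strategy="most_frequent"), ["loan_purpose"]),
])

# Output columns: income, income indicator, scaled credit score,
# scaled loan amount, employment type, loan purpose.
imputed = preprocess.fit_transform(applicants)
print(imputed.shape)
```

The credit score and loan amount come out in standardised units because of the scaling step. In a real project the transformer would be fitted on training data only and the categorical columns would still need encoding before modelling.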
Best Practices for Handling Missing Values
Missing Value Treatment Checklist
- Measure missingness: Calculate missing percentage for every column.
- Understand the reason: Investigate why the value is missing.
- Check the target relationship: See whether missingness is related to the outcome.
- Avoid blind deletion: Do not remove rows or columns without checking impact.
- Use median for skewed numerical data: It is more robust to outliers than mean.
- Use mode or “Unknown” for categorical data: Choose based on business meaning.
- Add missing indicators when useful: Missingness may carry predictive value.
- Validate model performance: Compare different treatments using validation data.
- Avoid data leakage: Fit imputation rules only on training data, then apply them to validation and test data.
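The last item of the checklist can be sketched with scikit-learn as follows. The data is invented and the split settings are arbitrary; the point is that `fit` sees only training rows, and the test rows are filled with the statistic learned there.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical feature and target.
X = pd.DataFrame({
    "income": [40000.0, np.nan, 52000.0, 61000.0, np.nan, 48000.0, 75000.0, 58000.0]
})
y = pd.Series([0, 1, 0, 0, 1, 0, 1, 0])

# Split first, so the test rows play no part in the imputation rule.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # learns the median from training rows only
X_test_filled = imputer.transform(X_test)        # reuses the training median

# The learned statistic equals the median of the training column.
train_median = X_train["income"].median()
print(imputer.statistics_[0], train_median)
```

Placing the imputer inside a scikit-learn `Pipeline` gives the same guarantee automatically during cross-validation.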
Common Mistakes to Avoid
| Mistake | Why It Is Harmful | Better Approach |
|---|---|---|
| Deleting all rows with any missing value | Can remove too much useful data and reduce model reliability. | Check missing percentage and use imputation where appropriate. |
| Using mean for all numerical columns | Mean is distorted by outliers and may not represent skewed data. | Use median for skewed distributions and compare results. |
| Fitting imputation on the full dataset before the train-test split | Can leak information from test data into training. | Fit imputation on training data only. |
| Ignoring missingness patterns | Missing values may reveal important behaviour or risk. | Create missing indicators when absence itself is meaningful. |
| Using one method for all variables | Different variables have different meanings and distributions. | Select treatment based on data type, business logic, and model impact. |
How Missing Value Handling Affects Models
Missing value treatment changes the data distribution and may affect model predictions. For example, mean imputation can reduce variance, while row deletion can change the sample population. Advanced imputation can preserve patterns better but may add complexity.
Therefore, missing value handling should not be treated as a mechanical step. It should be evaluated as part of the modelling process using validation performance and business logic.
Practical Insight: The best missing value strategy is not always the most complex one. The best strategy is the one that preserves useful information, reduces bias, and improves model performance on unseen data.
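One way to compare strategies on unseen data is to cross-validate the same model with different imputers. The sketch below uses fully synthetic data, so the scores say nothing about real loan portfolios; the feature names, the 20% missing rate, and the choice of logistic regression are all assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic features: a skewed income and a debt ratio between 0 and 1.
income = rng.lognormal(mean=10.5, sigma=0.6, size=n)
debt_ratio = rng.uniform(0.0, 1.0, size=n)

# Synthetic target: default becomes more likely as the debt ratio rises.
default = (debt_ratio + rng.normal(0.0, 0.3, size=n) > 0.6).astype(int)

# Remove roughly 20% of income values completely at random.
income_observed = income.copy()
income_observed[rng.random(n) < 0.2] = np.nan
X = np.column_stack([income_observed, debt_ratio])

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "median+indicator": SimpleImputer(strategy="median", add_indicator=True),
}

scores = {}
for name, imputer in candidates.items():
    # The imputer sits inside the pipeline, so it is refitted on each training fold.
    model = make_pipeline(imputer, StandardScaler(), LogisticRegression())
    scores[name] = cross_val_score(model, X, default, cv=5).mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because the missingness here is random by construction, the strategies will likely score about the same. Larger differences are more plausible when missingness is related to the target.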
Key Takeaways
- Missing values are common in real-world predictive analytics datasets.
- Missing data can reduce accuracy, introduce bias, and cause algorithm errors.
- Before treatment, always understand the reason and pattern of missingness.
- Deletion is simple but can cause information loss when used carelessly.
- Imputation preserves data by filling missing values using estimated values.
- Mean, median, mode, forward fill, backward fill, KNN, and regression imputation are common methods.
- Missing indicator variables can help models learn from the fact that a value is missing.
- Imputation should be fitted only on training data to avoid data leakage.