Handling Missing Values: Imputation and Deletion

Missing values are one of the most common problems in real-world datasets. Before building a predictive model, we must identify missing data, understand why it is missing, and decide whether to remove it or fill it using suitable techniques.

The way we handle missing values directly affects model accuracy, bias, reliability, and business interpretation. Careless treatment of missing values can silently damage the entire predictive modelling process.

What are Missing Values?

Missing values are blank, unavailable, unknown, or invalid entries in a dataset. They occur when information is not recorded, not collected, lost during data transfer, or intentionally skipped.

In datasets, missing values may appear as empty cells, NULL, NaN, NA, None, a zero entered in place of an unknown value, or special placeholders such as “Unknown” or “Not Available”.
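As a minimal sketch of how these different encodings can be brought into one consistent form, the example below converts text placeholders and a misused zero to NaN with pandas. The column names, the sample rows, and the placeholder list are illustrative assumptions, and treating 0 as "unknown" only makes sense where a true zero is implausible.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which missing values are encoded several ways.
raw = pd.DataFrame({
    "city": ["Pune", "Unknown", "", "Delhi", None],
    "income": [52000, 0, 61000, np.nan, 48000],
})

# Text placeholders that should count as missing (an assumed list; adjust per dataset).
placeholders = ["", "Unknown", "Not Available", "NA", "NULL"]

clean = raw.copy()
clean["city"] = clean["city"].replace(placeholders, np.nan)

# Assumption: an income of 0 is a stand-in for "unknown", not a real value.
clean["income"] = clean["income"].replace(0, np.nan)

# After normalisation, every missing entry is visible to isna().
missing_counts = clean.isna().sum()
print(missing_counts)
```

Once every encoding is mapped to NaN, the detection and imputation steps later in this section can rely on a single definition of "missing".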

Core Idea: Missing values are not just technical errors. They may contain important information about customer behaviour, business processes, system failures, or data collection gaps.

Why Missing Values Matter in Predictive Modelling

Many machine learning algorithms cannot directly work with missing values. Even when an algorithm can handle them, missing data can affect patterns, distributions, relationships, and model performance.

Reduced Accuracy: Important information may be lost when missing values are ignored or handled incorrectly.
Biased Results: If missingness is not random, the model may learn distorted patterns from incomplete data.
Wrong Relationships: Poor imputation may create artificial relationships that do not exist in reality.
Algorithm Errors: Some algorithms fail completely when missing values are present in the input data.

Common Reasons for Missing Values

Before deciding how to handle missing values, we should understand why they are missing. The reason often determines the best treatment method.

Reason	Example	Possible Meaning
Not Collected	A customer did not provide income information.	The field may be optional or sensitive.
System Error	Sensor failed to record temperature.	Technical issue in data capture.
Not Applicable	Loan amount is missing for customers who never applied for a loan.	The value is logically not relevant.
Manual Entry Error	Salesperson forgot to enter customer location.	Human error in data recording.
Data Integration Issue	Customer ID exists in one system but not another.	Mismatch during data merging.
Intentional Non-Response	Survey respondent skips salary question.	Missingness may be related to privacy or behaviour.

Types of Missing Data

Statistically, missing data is often classified into three types. Understanding these types helps us choose between deletion, simple imputation, or more advanced methods.

Missing Data Type	Meaning	Example	Handling Difficulty
MCAR (Missing Completely at Random)	Missingness has no relationship with any observed or unobserved variable.	A few records are lost due to random system failure.	Comparatively easier to handle.
MAR (Missing at Random)	Missingness is related to other observed variables, but not to the missing value itself once those variables are accounted for.	Income is more often missing for younger customers, and age is available.	Can often be handled with informed imputation.
MNAR (Missing Not at Random)	Missingness is related to the missing value itself.	High-income customers avoid disclosing income.	Most difficult and may introduce bias.

Important: If missing values are not random, deleting or blindly filling them can create biased models. Always investigate the pattern of missingness before applying a treatment.
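One simple way to investigate the pattern is to compare the missing rate of a variable across groups of another observed variable. The sketch below does this for the income and age example from the table; the data and the age cutoff of 30 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: income is missing mostly for younger customers.
df = pd.DataFrame({
    "age": [22, 24, 25, 23, 41, 45, 52, 38],
    "income": [np.nan, np.nan, 30000, np.nan, 65000, 72000, np.nan, 58000],
})

# Flag rows where income is missing, then compare the missing rate across age bands.
df["income_missing"] = df["income"].isna()
df["age_band"] = np.where(df["age"] < 30, "under_30", "30_plus")

missing_rate_by_band = df.groupby("age_band")["income_missing"].mean()
print(missing_rate_by_band)
```

A large gap between groups suggests the data is not MCAR. This kind of check cannot separate MAR from MNAR, because MNAR depends on the unobserved values themselves; that judgement usually needs domain knowledge.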

Step-by-Step Missing Value Treatment Workflow

Missing Value Handling Pipeline

Detect Missing Values → Measure Missing Percentage → Understand Missing Pattern → Choose Treatment → Validate Model Impact
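The first two steps of this pipeline, detection and measurement, can be sketched with pandas as follows. The small DataFrame is a hypothetical stand-in for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in two of its three columns.
df = pd.DataFrame({
    "income": [52000, np.nan, 61000, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "age": [34, 29, 41, 37],
})

# Step 1: detect. isna() marks each missing cell; sum() counts them per column.
missing_count = df.isna().sum()

# Step 2: measure. The mean of a boolean mask is the missing share; x100 gives a percentage.
missing_pct = df.isna().mean() * 100

summary = pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct})
summary = summary.sort_values("missing_pct", ascending=False)
print(summary)
```

Sorting by missing percentage puts the most affected columns first, which is usually where the treatment decision matters most.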

Deletion Methods

Deletion means removing missing values from the dataset. This can be done by removing rows or columns. It is simple, but it can lead to information loss if used carelessly.

1. Row Deletion

In row deletion, entire records are removed if they contain missing values. This works well only when the share of affected records is small and the missingness is mostly random.

2. Column Deletion

In column deletion, an entire feature is removed if it contains too many missing values or has very limited predictive value.

Deletion Method	When to Use	Advantages	Risks
Row Deletion	Few rows have missing values and data loss is minimal.	Simple and quick.	Can reduce sample size and introduce bias.
Column Deletion	A feature has very high missingness and low business value.	Removes unreliable variables.	May remove a potentially important predictor.

Practical Rule: Deletion is safe only when the amount of missing data is small, the missingness is random, and it is not strongly related to the target variable.

Imputation Methods

Imputation means replacing missing values with estimated values. Instead of removing data, imputation tries to preserve records by filling gaps intelligently.

Mean Imputation: Replaces missing numerical values with the average value of the column.
Median Imputation: Replaces missing numerical values with the middle value. Useful when outliers are present.
Mode Imputation: Replaces missing categorical values with the most frequent category.
Forward / Backward Fill: Uses nearby values to fill missing observations in time-based data.
Model-Based Imputation: Uses other variables to predict and fill missing values more intelligently.
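The three simplest methods listed above (mean, median, and mode) can be sketched with pandas as follows. The columns and values are hypothetical, and the single very large income is there to show why the median is the safer choice for that column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column; income contains an outlier.
df = pd.DataFrame({
    "age": [30.0, 40.0, np.nan, 50.0, 40.0],
    "income": [40000.0, 45000.0, np.nan, 50000.0, 900000.0],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai"],
})

filled = df.copy()

# Mean imputation: reasonable for roughly symmetric numeric data.
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Median imputation: not pulled upward by the 900000 outlier.
filled["income"] = filled["income"].fillna(filled["income"].median())

# Mode imputation: most frequent category (mode() ignores missing entries).
filled["city"] = filled["city"].fillna(filled["city"].mode().iloc[0])

print(filled)
```

In this sample the mean income of the observed rows is 258750, while the median is 47500, so the choice between the two changes the filled value considerably.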

Common Imputation Techniques

Imputation Method	Best For	Example	Limitation
Mean Imputation	Numerical data with roughly symmetric distribution.	Replace missing age with average age.	Sensitive to outliers and reduces variance.
Median Imputation	Numerical data with skewness or outliers.	Replace missing income with median income.	May still oversimplify the true pattern.
Mode Imputation	Categorical variables.	Replace missing city with most common city.	Can overrepresent the most frequent category.
Forward Fill (time-based)	Time series or ordered records.	Use previous day’s stock price for a missing day.	Can be misleading if values change quickly.
Backward Fill (time-based)	Time series where future nearby values are acceptable for analysis.	Use next recorded sensor reading to fill a missing reading.	Can introduce future information if used incorrectly in modelling.
KNN Imputation (advanced)	Datasets where similar records can help estimate missing values.	Fill missing income based on similar customers.	Computationally heavier and affected by feature scaling.
Regression Imputation (advanced)	When missing variable can be predicted from other variables.	Predict missing house size using price, location, and rooms.	Can create overly confident estimates.
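The difference between the two time-based methods in the table is easiest to see side by side. The sketch below uses a hypothetical daily price series; the dates and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with gaps in the middle and at the end.
prices = pd.Series(
    [101.0, np.nan, np.nan, 104.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last observed value forward in time.
forward = prices.ffill()

# Backward fill: pull the next observed value back in time.
# The final gap stays missing because no later observation exists.
backward = prices.bfill()

print(pd.DataFrame({"original": prices, "ffill": forward, "bfill": backward}))
```

Backward fill assigns the 4 January price to 2 and 3 January, which is exactly the future information the table warns about. For forecasting tasks, forward fill is usually the safer of the two.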

When to Use Deletion vs Imputation

The right method depends on the amount of missing data, reason for missingness, business importance of the variable, and model sensitivity.

Use Deletion When

Missing values are very few.
Missingness appears random.
Removing records does not reduce dataset size significantly.
The column has extremely high missingness and low value.

Use Imputation When

Rows contain useful information in other columns.
The feature is important for prediction.
Missingness is moderate and manageable.
You want to preserve dataset size.

Use Advanced Imputation When

Missing values depend on other variables.
Simple mean or mode imputation may distort patterns.
The dataset has enough related features.
Model accuracy is highly sensitive to missing treatment.

Use Missing Indicator When

The fact that a value is missing may itself carry meaning.
Customers intentionally skip sensitive questions.
Missingness is related to behaviour or risk.
You want the model to learn from missingness patterns.

Missing Indicator Variables

Sometimes, the absence of a value is informative. In such cases, we can create a missing indicator variable that marks whether the original value was missing.

For example, if customers who do not disclose income are more likely to default on loans, then “income missing” becomes a useful predictive signal.

Example: Instead of only filling missing income values, create an additional column called income_missing with values 1 for missing and 0 for available. This allows the model to learn whether missingness itself is important.
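A minimal sketch of the income_missing idea, assuming a pandas DataFrame with an income column and made-up values. The median fill is one possible choice for the companion imputation, not a requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical loan applicants; two did not disclose income.
loans = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan, 48000.0]})

# Create the indicator BEFORE filling; once the gaps are filled the signal is gone.
loans["income_missing"] = loans["income"].isna().astype(int)

# Then fill the original column so that models requiring complete data can use it.
loans["income"] = loans["income"].fillna(loans["income"].median())

print(loans)
```

The order of the two steps is the important part: the indicator has to be derived from the original column, not the filled one.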

Example: Handling Missing Values in Loan Data

Business Problem

A bank wants to predict whether a loan applicant may default. The dataset contains missing values in income, employment type, credit score, and loan purpose.

Variable	Missing Problem	Recommended Treatment	Reason
Income	Some applicants did not disclose income.	Median imputation + missing indicator.	Income is often skewed and missingness may carry risk information.
Employment Type	Few missing category values.	Mode imputation or “Unknown” category.	Categorical variable where missingness may represent unreported employment status.
Credit Score	Important numerical variable with moderate missingness.	KNN or regression-based imputation.	Credit score is highly predictive, so simple deletion may lose valuable records.
Loan Purpose	Very few missing values.	Mode imputation.	Simple treatment is acceptable when missingness is low.

This example shows that different variables may require different missing value treatments within the same dataset.
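The treatments in the table could be sketched as follows with pandas and scikit-learn. The rows, the loan_amount helper column, and n_neighbors=2 are assumptions made for illustration. For brevity everything is fitted on the full sample; in a real project each step would be fitted on training data only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical loan applications (loan_amount is an assumed extra feature).
loans = pd.DataFrame({
    "income": [30000.0, np.nan, 52000.0, 250000.0, 45000.0, np.nan],
    "employment_type": ["Salaried", None, "Self-employed", "Salaried", "Salaried", "Salaried"],
    "credit_score": [640.0, 700.0, np.nan, 790.0, 680.0, 610.0],
    "loan_amount": [5000.0, 12000.0, 9000.0, 40000.0, 8000.0, 4000.0],
    "loan_purpose": ["Home", "Car", None, "Home", "Home", "Education"],
})
treated = loans.copy()

# Income: missing indicator first, then median (robust to the 250000 outlier).
treated["income_missing"] = treated["income"].isna().astype(int)
treated["income"] = treated["income"].fillna(treated["income"].median())

# Employment type: an explicit "Unknown" category keeps the missingness visible.
treated["employment_type"] = treated["employment_type"].fillna("Unknown")

# Loan purpose: mode imputation.
treated["loan_purpose"] = treated["loan_purpose"].fillna(treated["loan_purpose"].mode().iloc[0])

# Credit score: KNN imputation on standardised numeric features, because
# KNN distances are dominated by large-scale columns otherwise.
numeric_cols = ["income", "loan_amount", "credit_score"]
scaler = StandardScaler()  # ignores NaN when fitting and keeps it when transforming
scaled = scaler.fit_transform(treated[numeric_cols])
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
restored = scaler.inverse_transform(imputed_scaled)

# Fill only the gaps so observed credit scores stay exactly as recorded.
knn_scores = pd.Series(restored[:, 2], index=treated.index)
treated["credit_score"] = treated["credit_score"].fillna(knn_scores)

print(treated)
```

The imputed credit score is the average of the two most similar applicants by income and loan amount, so it always falls inside the range of observed scores.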

Best Practices for Handling Missing Values

Missing Value Treatment Checklist

Measure missingness: Calculate missing percentage for every column.
Understand the reason: Investigate why the value is missing.
Check the target relationship: See whether missingness is related to the outcome.
Avoid blind deletion: Do not remove rows or columns without checking impact.
Use median for skewed numerical data: It is more robust to outliers than the mean.
Use mode or “Unknown” for categorical data: Choose based on business meaning.
Add missing indicators when useful: Missingness may carry predictive value.
Validate model performance: Compare different treatments using validation data.
Avoid data leakage: Fit imputation rules only on training data, then apply them to validation and test data.
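The leakage rule in the last checklist item can be sketched with scikit-learn's SimpleImputer. The tiny arrays stand in for an already split training and test set; the values are invented so that the effect is easy to see.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data, already split into train and test.
X_train = np.array([[20.0], [np.nan], [30.0], [40.0]])
X_test = np.array([[np.nan], [1000.0]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses that training median; the test rows never influence it.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)   # median learned from training data
print(X_test_filled.ravel())
```

Had the imputer been fitted on train and test together, the extreme test value of 1000 would have shifted the median from 30 to 35, so information from the test set would have leaked into the training features.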

Common Mistakes to Avoid

Mistake	Why It Is Harmful	Better Approach
Deleting all rows with any missing value	Can remove too much useful data and reduce model reliability.	Check missing percentage and use imputation where appropriate.
Using mean for all numerical columns	Mean is distorted by outliers and may not represent skewed data.	Use median for skewed distributions and compare results.
Fitting imputation before the train-test split	Can leak information from test data into training.	Fit imputation on training data only.
Ignoring missingness patterns	Missing values may reveal important behaviour or risk.	Create missing indicators when absence itself is meaningful.
Using one method for all variables	Different variables have different meanings and distributions.	Select treatment based on data type, business logic, and model impact.

How Missing Value Handling Affects Models

Missing value treatment changes the data distribution and may affect model predictions. For example, mean imputation can reduce variance, while row deletion can change the sample population. Advanced imputation can preserve patterns better but may add complexity.

Therefore, missing value handling should not be treated as a mechanical step. It should be evaluated as part of the modelling process using validation performance and business logic.
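The variance-reducing effect of mean imputation mentioned above can be checked on a toy example. The numbers below are made up; the point is only the direction of the change.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two of six values missing.
values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])

# Sample variance of the observed values (pandas skips NaN by default).
var_before = values.var()

# Mean imputation places every filled value exactly at the centre of the data.
filled = values.fillna(values.mean())
var_after = filled.var()

print(var_before, var_after)
```

The mean stays at 30, but the filled points add no spread, so the sample variance drops from about 333 to 200. A model that relies on the spread of this feature sees a narrower distribution than the real one.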

Practical Insight: The best missing value strategy is not always the most complex one. The best strategy is the one that preserves useful information, reduces bias, and improves model performance on unseen data.

Key Takeaways

Missing values are common in real-world predictive analytics datasets.
Missing data can reduce accuracy, introduce bias, and cause algorithm errors.
Before treatment, always understand the reason and pattern of missingness.
Deletion is simple but can cause information loss when used carelessly.
Imputation preserves data by filling missing values using estimated values.
Mean, median, mode, forward fill, backward fill, KNN, and regression imputation are common methods.
Missing indicator variables can help models learn from the fact that a value is missing.
Imputation should be fitted only on training data to avoid data leakage.
