Linear Regression and Its Assumptions

Linear regression is one of the most important and widely used algorithms in predictive modelling. It is used to predict a continuous numerical outcome by learning a straight-line relationship between input features and the target variable.

Because linear regression is simple, interpretable, and mathematically clear, it is often used as a baseline model for regression problems and as a foundation for understanding more advanced machine learning techniques.

What is Linear Regression?

Linear regression is a supervised learning algorithm used for predicting numerical values. It assumes that the target variable can be explained as a linear combination of one or more input variables.

For example, a real estate company may use linear regression to predict house price using house area, number of rooms, location score, property age, and distance from city centre.

Core Idea: Linear regression tries to fit the best possible straight line, or linear equation, that explains the relationship between input features and a numerical target variable.

Simple Linear Regression

Simple linear regression uses one input variable to predict one continuous target variable. It tries to fit a straight line through the data points.

Y = β₀ + β₁X + ε

Y = target variable (observed value), β₀ = intercept, β₁ = slope coefficient, X = input feature, ε = error term. The model's prediction is Ŷ = β₀ + β₁X, which leaves out ε.
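As a minimal sketch of how β₀ and β₁ can be estimated for one feature, the snippet below uses the standard closed-form least-squares formulas. The function name and the area/price numbers are hypothetical and chosen so that the data lie exactly on a line.

```python
# Closed-form least-squares fit for one feature (illustrative data only).
def fit_simple_linear(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: area in sq. ft. and price in lakh rupees.
# The prices were generated as 5 + 0.045 * area, with no noise.
area = [600, 800, 1000, 1200, 1400]
price = [32, 41, 50, 59, 68]

b0, b1 = fit_simple_linear(area, price)
print(f"intercept={b0:.3f}, slope={b1:.4f}")
```

Because the data are noise-free, the fit recovers the intercept of 5 and the slope of 0.045 used to generate them. With real data the points scatter around the line and the residuals are not zero.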

[Figure: Visual Idea of Linear Regression, showing data points, the best-fit line, and residual errors.]

Multiple Linear Regression

Multiple linear regression uses more than one input variable to predict the target. This is more common in real-world predictive modelling because outcomes are usually influenced by many factors.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + βₙXₙ + ε

Each coefficient shows the expected change in Y when that feature changes by one unit, holding other variables constant.

For example, house price may be predicted using area, location score, number of bedrooms, property age, floor number, and distance from metro station.
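The sketch below fits a multiple linear regression with NumPy's least-squares solver. The six houses are made up, and the prices are generated from assumed coefficients (a hypothetical intercept of ₹5,00,000 plus illustrative effects for area, age, and location score) so that the recovered values can be checked.

```python
import numpy as np

# Hypothetical houses: area (sq. ft.), age (years), location score (1-10).
X = np.array([
    [ 800, 10, 6],
    [1000,  5, 7],
    [1200, 20, 5],
    [1500,  2, 9],
    [ 900, 15, 4],
    [1300,  8, 8],
], dtype=float)

# Assumed "true" coefficients used only to generate noise-free prices:
# intercept, area, age, location score.
true_beta = np.array([500_000.0, 4_000.0, -75_000.0, 250_000.0])

# Prepend a column of 1s so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])
y = X_design @ true_beta

# Least-squares estimate of [β0, β1, β2, β3].
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
for name, b in zip(["intercept", "area", "age", "location"], beta_hat):
    print(f"{name:>9}: {b:,.0f}")
```

Each printed coefficient is read as the change in predicted price for a one-unit change in that feature while the other features stay fixed.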

Key Terms in Linear Regression

Term	Meaning	Example Interpretation
Target Variable	The numerical outcome we want to predict.	House price, sales revenue, delivery time, customer spend.
Feature / Predictor	The input variable used to predict the target.	House area, number of rooms, customer income.
Intercept	The predicted value of Y when all input variables are zero.	Baseline prediction before adding feature effects.
Coefficient	The effect of one feature on the target, holding other features constant.	If area coefficient is 4,000, each extra sq. ft. adds ₹4,000 to predicted price.
Residual	Difference between actual value and predicted value.	If actual price is ₹80 lakh and predicted price is ₹75 lakh, residual is ₹5 lakh.
Error Term	Unexplained variation not captured by the model.	Market sentiment, negotiation effect, or unrecorded property quality.

How Linear Regression Learns

Linear regression finds the line or equation that minimizes the difference between actual values and predicted values. These differences are called residuals.

The most common method is Ordinary Least Squares, which chooses coefficients that minimize the sum of squared residuals.
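To make the "minimize the sum of squared residuals" idea concrete, here is a small sketch that solves the normal equations (XᵀX)β = Xᵀy on synthetic data and then checks that nudging the coefficients in any direction increases the sum of squared residuals. The data, seed, and shift values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed so the sketch is repeatable
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)   # synthetic data: a line plus noise

X = np.column_stack([np.ones_like(x), x])       # design matrix with an intercept column

def ssr(beta):
    """Sum of squared residuals for a given coefficient vector."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Normal equations: (X^T X) beta = X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Any other coefficient vector gives a larger sum of squared residuals.
for shift in ([0.5, 0.0], [0.0, -0.1], [-0.3, 0.05]):
    assert ssr(beta_ols + np.array(shift)) > ssr(beta_ols)

print("OLS coefficients:", beta_ols.round(3), "SSR:", round(ssr(beta_ols), 2))
```

In practice a solver such as `np.linalg.lstsq` is usually preferred over forming XᵀX directly, because it is numerically more stable when features are strongly correlated.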

Linear Regression Training Process

Start with Data → Estimate Coefficients → Make Predictions → Calculate Residuals → Minimize Error

Why Linear Regression is Useful

Highly Interpretable: Coefficients explain how each feature affects the predicted outcome.
Fast to Train: Linear regression is computationally efficient and works well as a baseline model.
Good for Regression Problems: It is useful when the target is continuous, such as price, revenue, cost, or demand.
Strong Baseline: It provides a simple benchmark before trying more complex models.

Linear Regression Assumptions

Linear regression works best when certain assumptions are reasonably satisfied. These assumptions help ensure that coefficients, predictions, and statistical interpretations are reliable.

Assumption	Meaning	How to Check	What to Do If Violated
Linearity	Relationship between features and target should be approximately linear.	Scatter plots, residual plots.	Use transformations, polynomial features, or non-linear models.
Independence of Errors	Residuals should not be correlated with each other.	Check time order, residual autocorrelation, Durbin-Watson test.	Use time-series methods or add lag features.
Homoscedasticity	Residual variance should be roughly constant across prediction levels.	Residuals vs fitted values plot.	Transform target, use weighted regression, or robust errors.
Normality of Residuals	Residuals should be approximately normally distributed for inference.	Histogram or Q-Q plot of residuals.	Transform variables, check outliers, or use robust methods.
No Strong Multicollinearity	Input features should not be highly correlated with each other.	Correlation matrix, VIF.	Remove redundant features, combine variables, or use regularization.
No Extreme Influential Outliers	A few extreme points should not dominate the fitted line.	Box plots, residual plots, leverage, Cook’s distance.	Investigate, cap, transform, remove if erroneous, or use robust regression.

Assumption 1: Linearity

Linear regression assumes that the relationship between each predictor and the target is approximately linear. This means a straight-line pattern should reasonably describe the relationship.

If the relationship is curved, linear regression may underfit the data and produce biased predictions.

Good Sign

Scatter plot shows a roughly straight-line pattern.
Residuals are randomly scattered around zero.
Feature effect is stable across the range.

Warning Sign

Scatter plot shows a curve or U-shape.
Residual plot shows a clear pattern.
Predictions are poor at low or high values.
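The warning signs above can also be checked numerically. As a rough sketch on synthetic curved data, the snippet below fits a straight line and then a line plus a squared term, and measures how strongly each set of residuals still correlates with a curvature term. The helper name and the choice of (x − mean)² as the curvature measure are assumptions for illustration, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 2.0 + 0.5 * x**2 + rng.normal(0, 1, size=x.size)   # synthetic curved relationship

def fit_and_residuals(columns):
    """Fit OLS on the given feature columns (plus intercept) and return the residuals."""
    X = np.column_stack([np.ones_like(x)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

res_linear = fit_and_residuals([x])             # straight line only
res_quadratic = fit_and_residuals([x, x**2])    # adds a squared term

# A U-shaped residual pattern shows up as correlation with the centred, squared feature.
curvature = (x - x.mean()) ** 2
corr_linear = float(np.corrcoef(res_linear, curvature)[0, 1])
corr_quadratic = float(np.corrcoef(res_quadratic, curvature)[0, 1])
print(f"straight-line fit: {corr_linear:.2f}, with squared term: {corr_quadratic:.2f}")
```

A value near zero after adding the squared term suggests the curve has been captured. A residual plot remains the more general check, since it can reveal patterns that a single correlation misses.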

Assumption 2: Independence of Errors

The residuals should be independent of each other. This is especially important when data is collected over time or when observations are grouped by customer, store, region, or machine.

For example, monthly sales errors may be correlated across time because sales in one month are related to sales in previous months. In such cases, an ordinary linear regression that treats every observation as independent may not be enough.
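One common check for correlated errors is the Durbin-Watson statistic, which compares each residual with the one before it. The sketch below computes it by hand on two hypothetical residual series that are assumed to be in time order.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic. Values near 2 suggest little first-order
    autocorrelation, values toward 0 suggest positive autocorrelation,
    and values toward 4 suggest negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals ordered by month.
trending = [-3, -2, -2, -1, 0, 1, 1, 2, 2, 3]        # errors drift slowly
alternating = [2, -2, 2, -2, 2, -2, 2, -2, 2, -2]    # errors flip sign every month

print(round(durbin_watson(trending), 2))     # 0.16
print(round(durbin_watson(alternating), 2))  # 3.6
```

The slowly drifting series scores far below 2, which is the pattern expected when one month's error carries over into the next.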

Assumption 3: Homoscedasticity

Homoscedasticity means that the spread of residuals should be approximately constant across all levels of predicted values. In simple words, the model should not make very small errors for low values and very large errors for high values.

If residual spread increases or decreases systematically, the problem is called heteroscedasticity.

Example: In house price prediction, errors may be much larger for luxury houses than for affordable houses. This creates unequal error variance and may require transformation or a different modelling approach.
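A rough numeric version of the residuals-vs-fitted plot is to compare the residual spread in the upper half of fitted values with the spread in the lower half. The sketch below does this on synthetic prices whose noise grows with price, then repeats it after a log transform. The `spread_ratio` helper and the data are assumptions for illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(2)
area = rng.uniform(500, 3000, size=200)
# Synthetic prices with multiplicative noise, so errors grow with the price level.
price = 4000 * area * np.exp(rng.normal(0, 0.2, size=area.size))

def spread_ratio(x, y):
    """Fit y on x by OLS, then divide the residual standard deviation in the top
    half of fitted values by that in the bottom half. Values far from 1 hint at
    heteroscedasticity."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    order = np.argsort(fitted)
    half = len(order) // 2
    return float(resid[order[half:]].std() / resid[order[:half]].std())

ratio_raw = spread_ratio(area, price)                   # price in rupees
ratio_log = spread_ratio(np.log(area), np.log(price))   # log-log model
print(f"raw target: {ratio_raw:.2f}, log target: {ratio_log:.2f}")
```

On this data the raw model shows clearly larger errors for expensive houses, while the log-transformed model brings the ratio much closer to 1. Real data may need a different transformation or weighted regression.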

Assumption 4: Normality of Residuals

Linear regression assumes residuals are approximately normally distributed when we want to make statistical inferences such as confidence intervals and hypothesis tests.

For prediction alone, slight non-normality is often less serious than strong non-linearity, leakage, outliers, or heteroscedasticity. However, extremely non-normal residuals may indicate missing patterns or poor model fit.
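Alongside a histogram or Q-Q plot, two simple summaries of residual shape are skewness and excess kurtosis, both of which are close to zero for normally distributed values. The sketch below computes them with NumPy on two synthetic residual samples. The function name and the samples are hypothetical.

```python
import numpy as np

def skew_and_excess_kurtosis(residuals):
    """Sample skewness and excess kurtosis; both are near 0 for normal data."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(3)
normal_like = rng.normal(0, 1, size=2000)              # synthetic well-behaved residuals
right_skewed = rng.exponential(1.0, size=2000) - 1.0   # synthetic skewed residuals

print("normal-like :", skew_and_excess_kurtosis(normal_like))
print("right-skewed:", skew_and_excess_kurtosis(right_skewed))
```

Large positive skewness, as in the second sample, often points to a target that would benefit from a transformation or to a few very large under-predictions.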

Assumption 5: No Strong Multicollinearity

Multicollinearity occurs when input features are highly correlated with each other. It can make coefficients unstable and difficult to interpret.

For example, house area and number of rooms may be highly correlated. If both are included, the model may struggle to assign separate effects clearly.

Practical Insight: Multicollinearity may not always destroy prediction accuracy, but it can seriously reduce coefficient interpretability.
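The variance inflation factor (VIF) measures how well each feature can be predicted from the other features: VIF = 1 / (1 − R²), where R² comes from regressing that feature on the rest. The sketch below computes it with NumPy on synthetic data in which the number of rooms is assumed to track area closely. The `vif` helper is written here for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (features only, no intercept)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress feature j on an intercept plus all the other features.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
area = rng.uniform(500, 3000, size=300)
rooms = area / 400 + rng.normal(0, 0.3, size=300)   # synthetic: rooms closely follow area
age = rng.uniform(0, 40, size=300)                  # synthetic: unrelated to the others

vifs = vif(np.column_stack([area, rooms, age]))
for name, value in zip(["area", "rooms", "age"], vifs):
    print(f"{name:>5}: VIF = {value:.1f}")
```

Area and rooms receive high VIF values while age stays close to 1. Commonly quoted rules of thumb flag values above roughly 5 or 10, but the cut-off is a judgment call.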

Assumption 6: No Extreme Influential Outliers

Outliers can strongly influence a linear regression line because the model minimizes squared errors. A few extreme observations can pull the line toward themselves and distort the coefficients.

Outliers should be investigated before treatment. They may be data errors, rare valid events, or important business signals.
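Cook's distance combines the size of a residual with the leverage of the point to estimate how much one observation changes the fit. The sketch below computes it from the hat matrix with NumPy on hypothetical data where one point sits far from the rest, and compares the slope with and without that point.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance per observation; X must already include the intercept column."""
    n, p = X.shape
    hat = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix; its diagonal is the leverage
    leverage = np.diag(hat)
    resid = y - hat @ y
    s2 = resid @ resid / (n - p)               # residual variance estimate
    return (resid**2 / (p * s2)) * leverage / (1.0 - leverage) ** 2

# Hypothetical data: nine points near y = 2x, plus one extreme point at x = 20.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1, 17.9, 15.0])
X = np.column_stack([np.ones_like(x), x])

d = cooks_distance(X, y)
slope_all = np.linalg.lstsq(X, y, rcond=None)[0][1]
slope_without = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)[0][1]

print("most influential index:", int(np.argmax(d)))
print(f"slope with the extreme point: {slope_all:.2f}, without it: {slope_without:.2f}")
```

The last observation has by far the largest Cook's distance, and dropping it moves the slope back toward 2. Whether to drop, cap, or keep such a point still depends on investigating what it represents.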

Common Diagnostics for Linear Regression

Diagnostic Tool	Used To Check	What to Look For
Scatter Plot	Linearity between feature and target.	Roughly straight relationship.
Residual Plot	Linearity and homoscedasticity.	Random scatter around zero with no clear pattern.
Histogram of Residuals	Residual normality.	Approximately bell-shaped distribution.
Q-Q Plot	Residual normality.	Points approximately following a straight diagonal line.
Correlation Matrix	Multicollinearity.	Very high correlations between predictors.
VIF	Multicollinearity severity.	High VIF values may indicate redundant predictors.

Model Evaluation Metrics for Linear Regression

Linear regression is evaluated using regression metrics that compare actual values with predicted values.

Metric	Meaning	Interpretation
MAE (Mean Absolute Error)	Average absolute difference between actual and predicted values.	Easy to understand in original units.
MSE (Mean Squared Error)	Average squared prediction error.	Penalizes large errors more strongly.
RMSE (Root Mean Squared Error)	Square root of MSE.	Error in original units, sensitive to large errors.
R² (Coefficient of Determination)	Proportion of variation in target explained by the model.	Higher is generally better, but on training data it never decreases when predictors are added, so it should not be read alone.
Adjusted R²	R² adjusted for number of predictors.	Useful when comparing models with different numbers of features.
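The metrics in the table can be computed in a few lines. The sketch below uses plain Python on five hypothetical prices (in lakh rupees) and assumes the model used two features, which only matters for adjusted R².

```python
import math

def regression_metrics(actual, predicted, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for one set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - (mse * n) / ss_total                       # 1 - SS_residual / SS_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

# Hypothetical actual and predicted prices in lakh rupees.
actual = [80, 65, 120, 95, 70]
predicted = [75, 68, 110, 99, 72]

metrics = regression_metrics(actual, predicted, n_features=2)
for name, value in metrics.items():
    print(f"{name:>11}: {value:.3f}")
```

Here the MAE is 4.8 lakh while the RMSE is about 5.55 lakh. The gap comes from the single 10 lakh miss, which squared-error metrics weigh more heavily.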

Example: House Price Prediction

Business Problem

A real estate company wants to predict house prices using area, number of bedrooms, property age, distance from city centre, and location score.

Feature	Possible Coefficient	Business Interpretation
Area	+4,000	Each additional sq. ft. is associated with ₹4,000 higher predicted price, holding other variables constant.
Property Age	-75,000	Each additional year of age is associated with ₹75,000 lower predicted price, holding other variables constant.
Location Score	+2,50,000	Each one-point increase in location score is associated with ₹2.5 lakh higher predicted price, holding other variables constant.
Distance from City Centre	-1,20,000	Each additional kilometre from the city centre is associated with ₹1.2 lakh lower predicted price, holding other variables constant.

These coefficients are useful because they make the model explainable to business users, not just predictive.
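As a small worked example of reading these coefficients, the sketch below compares two hypothetical houses using the coefficient values from the table. Since the intercept is the same for both houses, it cancels out of the comparison and is not needed.

```python
# Coefficients from the table above, in rupees per unit of each feature.
coefficients = {
    "area_sqft": 4_000,
    "property_age_years": -75_000,
    "location_score": 250_000,
    "distance_km": -120_000,
}

def predicted_price_difference(house_a, house_b):
    """Difference in predicted price (B minus A). The intercept cancels out,
    so only the feature differences and the coefficients matter."""
    return sum(coefficients[k] * (house_b[k] - house_a[k]) for k in coefficients)

# Two hypothetical houses for illustration.
house_a = {"area_sqft": 1000, "property_age_years": 10, "location_score": 6, "distance_km": 8}
house_b = {"area_sqft": 1200, "property_age_years": 4, "location_score": 7, "distance_km": 10}

diff = predicted_price_difference(house_a, house_b)
print(f"House B is predicted to cost ₹{diff:,} more than house A")  # ₹1,260,000
```

The total breaks down as +₹8,00,000 for 200 extra sq. ft., +₹4,50,000 for being six years newer, +₹2,50,000 for one extra location point, and −₹2,40,000 for two extra kilometres, which is ₹12.6 lakh in all.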

When Linear Regression Works Well

Good Use Cases

Target variable is continuous.
Relationships are approximately linear.
Interpretability is important.
Dataset is clean and well-prepared.
Business wants coefficient-level explanation.

Common Applications

House price prediction.
Sales forecasting baseline.
Demand estimation.
Marketing spend impact analysis.
Cost and revenue prediction.

When Linear Regression May Not Work Well

Weak Situations

Relationship is strongly non-linear.
Data has many extreme outliers.
Important feature interactions are missing.
Residuals show strong patterns.
Target variable is categorical.

Possible Alternatives

Polynomial regression.
Decision trees.
Random forest regression.
Gradient boosting regression.
Regularized regression such as Ridge or Lasso.

Common Mistakes in Linear Regression

Mistake	Why It Is Harmful	Better Approach
Using linear regression for categorical target	Linear regression predicts continuous values, not classes.	Use logistic regression or classification models for categorical targets.
Ignoring non-linearity	Model may underfit and produce biased predictions.	Use transformations, interaction terms, polynomial features, or non-linear models.
Ignoring outliers	Extreme points can heavily influence coefficients.	Investigate outliers and use robust methods if needed.
Interpreting correlation as causation	Coefficient relationships do not automatically prove cause and effect.	Use business logic, experiments, or causal methods for causal claims.
Not checking multicollinearity	Coefficients may become unstable and misleading.	Use correlation checks, VIF, feature selection, or regularization.

Best Practices for Linear Regression

Linear Regression Checklist

Use it for continuous targets: Linear regression is designed for numerical prediction.
Start with EDA: Check distributions, outliers, missing values, and relationships.
Check linearity: Use scatter plots and residual plots.
Inspect residuals: Residuals should not show strong patterns.
Check multicollinearity: Use correlation matrix or VIF.
Handle outliers carefully: Investigate before removing or capping.
Use train-validation-test split: Evaluate generalization, not memorization.
Interpret coefficients carefully: Coefficients show association, not automatic causation.
Compare with other models: Use linear regression as a baseline before trying complex models.
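The split-and-compare steps in the checklist can be sketched as follows. The data are synthetic, the 60/20/20 split is an assumed proportion, and the "always predict the training mean" baseline is one simple choice of reference point.

```python
import numpy as np

rng = np.random.default_rng(5)                      # fixed seed for repeatability
n = 200
X = rng.uniform(0, 10, size=(n, 2))                 # two synthetic features
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, size=n)

# Shuffle once, then split 60% / 20% / 20%.
idx = rng.permutation(n)
train, val, test = idx[:120], idx[120:160], idx[160:]

def add_intercept(A):
    """Prepend a column of 1s so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(A)), A])

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

# Baseline: always predict the mean of the training target.
baseline = y[train].mean()

for name, rows in [("train", train), ("validation", val), ("test", test)]:
    model_rmse = rmse(y[rows], add_intercept(X[rows]) @ beta)
    base_rmse = rmse(y[rows], np.full(len(rows), baseline))
    print(f"{name:>10}: model RMSE={model_rmse:.2f}, mean-baseline RMSE={base_rmse:.2f}")
```

Similar RMSE values across the three splits suggest the model generalizes. A training error far below the validation or test error would suggest memorization instead.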

Why Linear Regression Remains Important

Even though many advanced machine learning models exist, linear regression remains important because it is simple, fast, transparent, and easy to explain. It helps analysts understand the basic relationship between variables and provides a strong foundation for predictive modelling.

In many business situations, interpretability is as important as accuracy. Linear regression is especially useful when stakeholders want to know not only what the prediction is, but also why the model made that prediction.

Practical Insight: Linear regression is often the first model to build in a regression problem. Even if a more advanced model performs better later, linear regression provides an interpretable benchmark.

Key Takeaways

Linear regression predicts continuous numerical outcomes.
Simple linear regression uses one predictor; multiple linear regression uses many predictors.
Coefficients show the expected change in the target for a one-unit change in a feature, holding other features constant.
Residuals are the differences between actual and predicted values.
Major assumptions include linearity, independence of errors, homoscedasticity, normal residuals, no strong multicollinearity, and no extreme influential outliers.
Residual plots, scatter plots, Q-Q plots, correlation matrices, and VIF help diagnose problems.
Linear regression is interpretable, fast, and useful as a baseline model.
It should be used carefully when relationships are non-linear, outliers are extreme, or assumptions are strongly violated.
