Linear Regression and Its Assumptions

Linear regression is one of the most important and widely used algorithms in predictive modelling. It is used to predict a continuous numerical outcome by learning a straight-line relationship between input features and the target variable.

Because linear regression is simple, interpretable, and mathematically clear, it is often used as a baseline model for regression problems and as a foundation for understanding more advanced machine learning techniques.

What is Linear Regression?

Linear regression is a supervised learning algorithm used for predicting numerical values. It assumes that the target variable can be explained as a linear combination of one or more input variables.

For example, a real estate company may use linear regression to predict house price using house area, number of rooms, location score, property age, and distance from city centre.

Core Idea: Linear regression tries to fit the best possible straight line, or linear equation, that explains the relationship between input features and a numerical target variable.

Simple Linear Regression

Simple linear regression uses one input variable to predict one continuous target variable. It tries to fit a straight line through the data points.

Y = β₀ + β₁X + ε
Here Y = target variable, β₀ = intercept, β₁ = slope coefficient, X = input feature, ε = error term. The model's prediction is the part without the error term, β₀ + β₁X.
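As a minimal sketch of how β₀ and β₁ can be estimated for a single feature, the snippet below applies the standard least-squares formulas (slope = covariance of X and Y divided by variance of X). The data values and the names `area`, `price`, and `predict` are illustrative assumptions, not taken from a real dataset.

```python
# Closed-form least-squares fit for one feature (illustrative data only).
area = [1.0, 2.0, 3.0, 4.0, 5.0]      # X: hypothetical input feature
price = [3.0, 5.0, 7.0, 9.0, 11.0]    # Y: built as 1 + 2 * X so the fit is easy to verify

n = len(area)
mean_x = sum(area) / n
mean_y = sum(price) / n

# Slope = sum of cross-deviations / sum of squared deviations of X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(area, price))
sxx = sum((x - mean_x) ** 2 for x in area)
beta_1 = sxy / sxx

# Intercept makes the line pass through the point of means.
beta_0 = mean_y - beta_1 * mean_x

def predict(x):
    """Prediction from the fitted line (no error term)."""
    return beta_0 + beta_1 * x
```

Because the toy target was generated without noise, the fitted slope and intercept recover the values used to build it.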

Figure: Visual Idea of Linear Regression (data points, best-fit line, and residual errors).

Multiple Linear Regression

Multiple linear regression uses more than one input variable to predict the target. This is more common in real-world predictive modelling because outcomes are usually influenced by many factors.

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + … + βₙXₙ + ε
Each coefficient shows the expected change in Y when that feature changes by one unit, holding other variables constant.

For example, house price may be predicted using area, location score, number of bedrooms, property age, floor number, and distance from metro station.
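A rough sketch of fitting several coefficients at once, assuming NumPy is available. The feature values, the "true" coefficients, and the variable names are hypothetical and chosen only so that the result can be checked by eye.

```python
import numpy as np

# Hypothetical data: columns are area (sq. ft.) and property age (years).
X = np.array([
    [1000.0, 5.0],
    [1500.0, 10.0],
    [2000.0, 2.0],
    [1200.0, 20.0],
    [1800.0, 8.0],
])

# Noise-free target built from known coefficients so the fit can be checked.
true_intercept = 50.0
true_coefs = np.array([0.04, -0.75])
y = true_intercept + X @ true_coefs

# Prepend a column of ones so the first fitted value is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]
```

Each entry of `coefs` is read the same way as in the equation above: the change in the prediction for a one-unit change in that feature, with the other feature held fixed.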

Key Terms in Linear Regression

Term | Meaning | Example / Interpretation
Target Variable | The numerical outcome we want to predict. | House price, sales revenue, delivery time, customer spend.
Feature / Predictor | The input variable used to predict the target. | House area, number of rooms, customer income.
Intercept | The predicted value of Y when all input variables are zero. | Baseline prediction before adding feature effects.
Coefficient | The effect of one feature on the target, holding other features constant. | If area coefficient is 4,000, each extra sq. ft. adds ₹4,000 to predicted price.
Residual | Difference between actual value and predicted value. | If actual price is ₹80 lakh and predicted price is ₹75 lakh, residual is ₹5 lakh.
Error Term | Unexplained variation not captured by the model. | Market sentiment, negotiation effect, or unrecorded property quality.

How Linear Regression Learns

Linear regression finds the line or equation that minimizes the difference between actual values and predicted values. These differences are called residuals.

The most common method is Ordinary Least Squares, which chooses coefficients that minimize the sum of squared residuals.
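One way to see what "minimizing the sum of squared residuals" means is to solve the normal equations (XᵀX)β = Xᵀy and confirm that nudging the solution in any direction makes the sum larger. This is a sketch on synthetic data, assuming NumPy; the helper name `sum_squared_residuals` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)  # synthetic noisy line

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])

def sum_squared_residuals(beta):
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Ordinary Least Squares via the normal equations (X^T X) beta = X^T y.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Perturbing either coefficient can only increase the sum of squared residuals.
ssr_ols = sum_squared_residuals(beta_ols)
ssr_shifted = sum_squared_residuals(beta_ols + np.array([0.1, 0.0]))
ssr_tilted = sum_squared_residuals(beta_ols + np.array([0.0, -0.05]))
```

Solving the normal equations directly is fine for small, well-conditioned problems; a least-squares routine such as `np.linalg.lstsq` is generally the more numerically robust choice.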

Linear Regression Training Process

Start with Data → Estimate Coefficients → Make Predictions → Calculate Residuals → Minimize Error

Why Linear Regression is Useful

  • Highly Interpretable: Coefficients explain how each feature affects the predicted outcome.
  • Fast to Train: Linear regression is computationally efficient and works well as a baseline model.
  • Good for Regression Problems: It is useful when the target is continuous, such as price, revenue, cost, or demand.
  • Strong Baseline: It provides a simple benchmark before trying more complex models.

Linear Regression Assumptions

Linear regression works best when certain assumptions are reasonably satisfied. These assumptions help ensure that coefficients, predictions, and statistical interpretations are reliable.

Assumption | Meaning | How to Check | What to Do If Violated
Linearity | Relationship between features and target should be approximately linear. | Scatter plots, residual plots. | Use transformations, polynomial features, or non-linear models.
Independence of Errors | Residuals should not be correlated with each other. | Check time order, residual autocorrelation, Durbin-Watson test. | Use time-series methods or add lag features.
Homoscedasticity | Residual variance should be roughly constant across prediction levels. | Residuals vs fitted values plot. | Transform target, use weighted regression, or robust errors.
Normality of Residuals | Residuals should be approximately normally distributed for inference. | Histogram or Q-Q plot of residuals. | Transform variables, check outliers, or use robust methods.
No Strong Multicollinearity | Input features should not be highly correlated with each other. | Correlation matrix, VIF. | Remove redundant features, combine variables, or use regularization.
No Extreme Influential Outliers | A few extreme points should not dominate the fitted line. | Box plots, residual plots, leverage, Cook’s distance. | Investigate, cap, transform, remove if erroneous, or use robust regression.

Assumption 1: Linearity

Linear regression assumes that the relationship between each predictor and the target is approximately linear. This means a straight-line pattern should reasonably describe the relationship.

If the relationship is curved, linear regression may underfit the data and produce biased predictions.

Good Sign
  • Scatter plot shows a roughly straight-line pattern.
  • Residuals are randomly scattered around zero.
  • Feature effect is stable across the range.
Warning Sign
  • Scatter plot shows a curve or U-shape.
  • Residual plot shows a clear pattern.
  • Predictions are poor at low or high values.
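The warning signs above can also be checked numerically. The sketch below, assuming NumPy, fits a straight line to deliberately curved synthetic data and measures how strongly the residuals still track x². The helper `fit_and_residuals` and the use of a correlation as a curvature signal are illustrative choices, not a standard named test.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 61)
y = 1.0 + 0.5 * x + 2.0 * x**2   # curved (U-shaped) relationship, no noise

def fit_and_residuals(design, target):
    """Least-squares fit; returns coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta, target - design @ beta

# Straight-line model: intercept + x.
linear_design = np.column_stack([np.ones_like(x), x])
_, linear_resid = fit_and_residuals(linear_design, y)

# Same model with an added squared term (a polynomial feature).
quad_design = np.column_stack([np.ones_like(x), x, x**2])
_, quad_resid = fit_and_residuals(quad_design, y)

# Leftover curvature shows up as correlation between residuals and x^2.
curvature_corr = np.corrcoef(linear_resid, x**2)[0, 1]
```

In this constructed case the straight-line residuals are almost perfectly correlated with x², and adding the squared term removes the pattern. Real data will be noisier, so a residual plot remains the main check.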

Assumption 2: Independence of Errors

The residuals should be independent of each other. This is especially important when data is collected over time or when observations are grouped by customer, store, region, or machine.

For example, monthly sales errors may be correlated across time because sales in one month are related to sales in previous months. In such cases, a standard linear regression that ignores the time structure may not be enough.
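The Durbin-Watson statistic is one common summary of first-order autocorrelation in residuals: it is the sum of squared successive differences divided by the sum of squared residuals, and values near 2 suggest little autocorrelation. The sketch below assumes NumPy and uses synthetic residual series. The 0.8 carry-over factor is an arbitrary illustrative choice.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(42)
independent = rng.normal(size=500)

# Autocorrelated residuals: each value carries over 0.8 of the previous one.
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

dw_independent = durbin_watson(independent)   # expected to be close to 2
dw_correlated = durbin_watson(correlated)     # expected to be well below 2
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. The statistic only makes sense when the residuals are in a meaningful order, such as time.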

Assumption 3: Homoscedasticity

Homoscedasticity means that the spread of residuals should be approximately constant across all levels of predicted values. Put simply, the size of the model's errors should not depend on whether the predicted value is low or high.

If residual spread increases or decreases systematically, the problem is called heteroscedasticity.

Example: In house price prediction, errors may be much larger for luxury houses than for affordable houses. This creates unequal error variance and may require transformation or a different modelling approach.
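A quick, informal way to look for this pattern is to split the observations at the median fitted value and compare the residual variance in the two halves. The sketch below assumes NumPy and uses synthetic data in which the error spread is made to grow with house area. The split-in-half comparison is a simplification for illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(7)
area = np.linspace(500.0, 5000.0, 400)               # hypothetical house areas
noise = rng.normal(size=area.size) * (0.02 * area)   # error spread grows with area
price = 10.0 + 0.04 * area + noise                   # hypothetical prices

# Fit a straight line and compute residuals.
X = np.column_stack([np.ones_like(area), area])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
fitted = X @ beta
residuals = price - fitted

# Split observations at the median fitted value and compare residual spread.
order = np.argsort(fitted)
half = len(order) // 2
low_var = residuals[order[:half]].var()
high_var = residuals[order[half:]].var()
variance_ratio = high_var / low_var
```

A ratio far from 1 suggests unequal error variance. A residuals-versus-fitted plot shows the same thing visually and also reveals patterns that a single ratio can miss.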

Assumption 4: Normality of Residuals

Linear regression assumes residuals are approximately normally distributed when we want to make statistical inferences such as confidence intervals and hypothesis tests.

For prediction alone, slight non-normality is often less serious than strong non-linearity, leakage, outliers, or heteroscedasticity. However, extremely non-normal residuals may indicate missing patterns or poor model fit.
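A Q-Q plot compares sorted residuals with the quantiles expected under a normal distribution. The sketch below computes those point pairs and summarizes how straight the resulting line is with a correlation coefficient. It assumes NumPy plus the standard-library `statistics.NormalDist`; the function names and the plotting positions (i − 0.5)/n are illustrative choices.

```python
from statistics import NormalDist

import numpy as np

def qq_points(residuals):
    """Theoretical normal quantiles paired with the sorted residuals."""
    observed = np.sort(np.asarray(residuals, dtype=float))
    n = observed.size
    probs = (np.arange(1, n + 1) - 0.5) / n
    standard_normal = NormalDist()
    theoretical = np.array([standard_normal.inv_cdf(p) for p in probs])
    return theoretical, observed

def qq_correlation(residuals):
    """Correlation between the two axes of the Q-Q plot (1.0 = perfectly straight)."""
    theoretical, observed = qq_points(residuals)
    return float(np.corrcoef(theoretical, observed)[0, 1])

rng = np.random.default_rng(1)
normal_like = rng.normal(size=2000)
right_skewed = rng.exponential(size=2000) - 1.0   # centred near zero but strongly skewed

qq_normal = qq_correlation(normal_like)
qq_skewed = qq_correlation(right_skewed)
```

The points returned by `qq_points` can be passed to any plotting library to draw the Q-Q plot itself. The correlation is only a rough summary and should not replace looking at the plot.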

Assumption 5: No Strong Multicollinearity

Multicollinearity occurs when input features are highly correlated with each other. It can make coefficients unstable and difficult to interpret.

For example, house area and number of rooms may be highly correlated. If both are included, the model may struggle to assign separate effects clearly.

Practical Insight: Multicollinearity may not always destroy prediction accuracy, but it can seriously reduce coefficient interpretability.
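The Variance Inflation Factor (VIF) quantifies this: each feature is regressed on all the other features, and VIF = 1 / (1 − R²) of that regression. The sketch below assumes NumPy and uses synthetic data in which the number of rooms is constructed to follow area closely. The feature names and values are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(3)
area = rng.normal(1500.0, 400.0, size=300)
rooms = area / 500.0 + rng.normal(0.0, 0.2, size=300)  # rooms track area closely
age = rng.normal(15.0, 5.0, size=300)                  # unrelated to the others

vifs = variance_inflation_factors(np.column_stack([area, rooms, age]))
```

Here `area` and `rooms` receive high VIF values while `age` stays close to 1. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity, but the appropriate threshold depends on the application.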

Assumption 6: No Extreme Influential Outliers

Outliers can strongly influence a linear regression line because the model minimizes squared errors. A few extreme observations can pull the line toward themselves and distort the coefficients.

Outliers should be investigated before treatment. They may be data errors, rare valid events, or important business signals.
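Cook's distance combines the size of a point's residual with its leverage to estimate how much the fitted coefficients would change if that point were removed. The following is a sketch assuming NumPy; the data is synthetic, with one point deliberately placed far from the rest.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for an OLS fit; X must already include the intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T      # hat (projection) matrix
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)        # residual variance estimate
    return (residuals ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# Ten points exactly on the line y = 2x + 1 ...
x = np.arange(1.0, 11.0)
y = 2.0 * x + 1.0
# ... plus one point far from the other x values and far below that line.
x = np.append(x, 30.0)
y = np.append(y, 20.0)

X = np.column_stack([np.ones_like(x), x])
distances = cooks_distance(X, y)
most_influential = int(np.argmax(distances))
```

The added point has by far the largest Cook's distance. Flagging a point this way is a prompt to investigate it, not a reason to delete it automatically.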

Common Diagnostics for Linear Regression

Diagnostic Tool | Used To Check | What to Look For
Scatter Plot | Linearity between feature and target. | Roughly straight relationship.
Residual Plot | Linearity and homoscedasticity. | Random scatter around zero with no clear pattern.
Histogram of Residuals | Residual normality. | Approximately bell-shaped distribution.
Q-Q Plot | Residual normality. | Points approximately following a straight diagonal line.
Correlation Matrix | Multicollinearity. | Very high correlations between predictors.
VIF | Multicollinearity severity. | High VIF values may indicate redundant predictors.

Model Evaluation Metrics for Linear Regression

Linear regression is evaluated using regression metrics that compare actual values with predicted values.

Metric | Meaning | Interpretation
MAE (Mean Absolute Error) | Average absolute difference between actual and predicted values. | Easy to understand in original units.
MSE (Mean Squared Error) | Average squared prediction error. | Penalizes large errors more strongly.
RMSE (Root Mean Squared Error) | Square root of MSE. | Error in original units, sensitive to large errors.
R² (Coefficient of Determination) | Proportion of variation in target explained by the model. | Higher is generally better, but must be interpreted carefully.
Adjusted R² | R² adjusted for number of predictors. | Useful when comparing models with different numbers of features.
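These metrics are simple enough to compute by hand. The sketch below uses only the Python standard library; the function name `regression_metrics` and the sample prices are hypothetical. Adjusted R² uses the usual formula 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of predictors.

```python
import math

def regression_metrics(actual, predicted, n_features):
    """MAE, MSE, RMSE, R² and adjusted R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted R2": adj_r2}

actual = [80.0, 60.0, 100.0, 40.0, 70.0]      # hypothetical prices in ₹ lakh
predicted = [75.0, 62.0, 95.0, 45.0, 73.0]
metrics = regression_metrics(actual, predicted, n_features=2)
# MAE = 4.0, MSE = 17.6, R2 = 0.956, Adjusted R2 = 0.912
```

With only five observations the adjusted R² drops noticeably below R², which illustrates how the adjustment penalizes extra predictors when data is scarce.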

Example: House Price Prediction

Business Problem

A real estate company wants to predict house prices using area, number of bedrooms, property age, distance from city centre, and location score.

Feature | Possible Coefficient | Business Interpretation
Area | +4,000 | Each additional sq. ft. is associated with ₹4,000 higher predicted price, holding other variables constant.
Property Age | -75,000 | Each additional year of age is associated with ₹75,000 lower predicted price, holding other variables constant.
Location Score | +2,50,000 | Each one-point increase in location score is associated with ₹2.5 lakh higher predicted price, holding other variables constant.
Distance from City Centre | -1,20,000 | Each additional kilometre from the city centre is associated with ₹1.2 lakh lower predicted price, holding other variables constant.

These coefficients are useful because they make the model explainable to business users, not just predictive.
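To make the interpretation concrete, the sketch below uses the illustrative coefficients from the table above to compute how the predicted price differs between two otherwise identical houses. No intercept is needed for a difference. The dictionary keys, the function name, and the comparison scenario are assumptions made for the example.

```python
# Illustrative coefficients from the table (₹ per one-unit change in each feature).
coefficients = {
    "area_sqft": 4_000,
    "property_age_years": -75_000,
    "location_score": 250_000,
    "distance_km": -120_000,
}

def predicted_price_change(feature_changes):
    """Change in predicted price for the given feature changes, other features held constant."""
    return sum(coefficients[name] * delta for name, delta in feature_changes.items())

# Hypothetical comparison: 100 sq. ft. larger, 5 years older, 2 km closer to the centre.
change = predicted_price_change(
    {"area_sqft": 100, "property_age_years": 5, "distance_km": -2}
)
# 4,000 * 100 - 75,000 * 5 + 120,000 * 2 = ₹2,65,000 higher predicted price
```

This additive breakdown is what makes the model easy to explain: each feature's contribution can be stated separately and then summed.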

When Linear Regression Works Well

Good Use Cases
  • Target variable is continuous.
  • Relationships are approximately linear.
  • Interpretability is important.
  • Dataset is clean and well-prepared.
  • Business wants coefficient-level explanation.
Common Applications
  • House price prediction.
  • Sales forecasting baseline.
  • Demand estimation.
  • Marketing spend impact analysis.
  • Cost and revenue prediction.

When Linear Regression May Not Work Well

Weak Situations
  • Relationship is strongly non-linear.
  • Data has many extreme outliers.
  • Important feature interactions are missing.
  • Residuals show strong patterns.
  • Target variable is categorical.
Possible Alternatives
  • Polynomial regression.
  • Decision trees.
  • Random forest regression.
  • Gradient boosting regression.
  • Regularized regression such as Ridge or Lasso.

Common Mistakes in Linear Regression

Mistake | Why It Is Harmful | Better Approach
Using linear regression for categorical target | Linear regression predicts continuous values, not classes. | Use logistic regression or classification models for categorical targets.
Ignoring non-linearity | Model may underfit and produce biased predictions. | Use transformations, interaction terms, polynomial features, or non-linear models.
Ignoring outliers | Extreme points can heavily influence coefficients. | Investigate outliers and use robust methods if needed.
Interpreting correlation as causation | Coefficient relationships do not automatically prove cause and effect. | Use business logic, experiments, or causal methods for causal claims.
Not checking multicollinearity | Coefficients may become unstable and misleading. | Use correlation checks, VIF, feature selection, or regularization.

Best Practices for Linear Regression

Linear Regression Checklist

  • Use it for continuous targets: Linear regression is designed for numerical prediction.
  • Start with EDA: Check distributions, outliers, missing values, and relationships.
  • Check linearity: Use scatter plots and residual plots.
  • Inspect residuals: Residuals should not show strong patterns.
  • Check multicollinearity: Use correlation matrix or VIF.
  • Handle outliers carefully: Investigate before removing or capping.
  • Use train-validation-test split: Evaluate generalization, not memorization.
  • Interpret coefficients carefully: Coefficients show association, not automatic causation.
  • Compare with other models: Use linear regression as a baseline before trying complex models.
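The train-validation-test item in the checklist can be sketched as follows, assuming NumPy. The 60/20/20 proportions, the synthetic data, and the helper names `add_intercept` and `rmse` are illustrative assumptions rather than fixed recommendations.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
X = rng.uniform(0.0, 10.0, size=(n, 2))
y = 5.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)  # synthetic target

# Shuffle once, then cut into 60% train, 20% validation, 20% test.
indices = rng.permutation(n)
train_idx, val_idx, test_idx = np.split(indices, [n * 60 // 100, n * 80 // 100])

def add_intercept(features):
    return np.column_stack([np.ones(len(features)), features])

# Fit on the training rows only.
beta, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

def rmse(idx):
    """Root mean squared error of the fitted model on the given rows."""
    residuals = y[idx] - add_intercept(X[idx]) @ beta
    return float(np.sqrt(np.mean(residuals ** 2)))

train_rmse, val_rmse, test_rmse = rmse(train_idx), rmse(val_idx), rmse(test_idx)
```

The validation score is the one to use when comparing modelling choices, and the test score is best kept for a final check. A large gap between training and validation error suggests the model is not generalizing well.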

Why Linear Regression Remains Important

Even though many advanced machine learning models exist, linear regression remains important because it is simple, fast, transparent, and easy to explain. It helps analysts understand the basic relationship between variables and provides a strong foundation for predictive modelling.

In many business situations, interpretability is as important as accuracy. Linear regression is especially useful when stakeholders want to know not only what the prediction is, but also why the model made that prediction.

Practical Insight: Linear regression is often the first model to build in a regression problem. Even if a more advanced model performs better later, linear regression provides an interpretable benchmark.

Key Takeaways

  • Linear regression predicts continuous numerical outcomes.
  • Simple linear regression uses one predictor; multiple linear regression uses many predictors.
  • Coefficients show the expected change in the target for a one-unit change in a feature, holding other features constant.
  • Residuals are the differences between actual and predicted values.
  • Major assumptions include linearity, independence of errors, homoscedasticity, normal residuals, no strong multicollinearity, and no extreme influential outliers.
  • Residual plots, scatter plots, Q-Q plots, correlation matrices, and VIF help diagnose problems.
  • Linear regression is interpretable, fast, and useful as a baseline model.
  • It should be used carefully when relationships are non-linear, outliers are extreme, or assumptions are strongly violated.