The Linear Regression Model

Part III — Statistical Inference & Model Validation

Paulo Fagandini

Lisbon Accounting and Business School — Polytechnic University of Lisbon

Statistical Inference in Regression

Lecture Overview

Lecture 10 — Statistical Inference & Model Validation

  • Distributions of OLS estimators under normality
  • \(t\)-tests for individual coefficients
  • Confidence intervals for \(\beta_j\)
  • The \(F\)-test: overall model significance
  • The ANOVA table for regression
  • Point prediction and prediction intervals
  • Validating the classical assumptions:
    • Linearity, normality, homoscedasticity, independence, multicollinearity

Note

Reference: Newbold Ch. 11.6–11.8, Ch. 12.4–12.6.

Sampling Distributions of OLS Estimators

From Estimates to Inference

We have \(\hat{\beta}_j\) — point estimates. But how uncertain are they?

Under the classical assumptions A1–A6:

\[\hat{\beta}_j \sim N\!\left(\beta_j,\; \sigma^2_{\hat{\beta}_j}\right)\]

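The variance \(\sigma^2_{\hat{\beta}_j}\) depends on the error variance \(\sigma^2\), which is unknown. We estimate it with \(s^2 = \text{SSR}/(n-k-1)\), where SSR is the sum of squared residuals. The standardised estimator then follows a Student \(t\) distribution instead of a normal: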

\[T_j = \frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim t_{n-k-1}\]

where \(s_{\hat{\beta}_j}\) is the standard error of \(\hat{\beta}_j\) (reported by R in the Std. Error column).

Note

In SLR (\(k=1\)): \(s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{S_{XX}}}\), where \(s = \sqrt{\text{SSR}/(n-2)}\).
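The SLR standard error formula in the note can be checked numerically. The sketch below uses simulated data (the variables `x` and `y` and all parameter values are hypothetical, not the lecture's dataset) and compares the hand-computed value with the one reported by `summary()`.

```r
# Hypothetical simulated data (not the lecture's dataset)
set.seed(1)
n <- 30
x <- runif(n, 100, 500)
y <- 20 + 0.15 * x + rnorm(n, sd = 12)

fit <- lm(y ~ x)

# Residual standard error in SLR: s = sqrt(SSR / (n - 2))
ssr <- sum(residuals(fit)^2)
s   <- sqrt(ssr / (n - 2))

# S_XX = sum of squared deviations of x from its mean
s_xx <- sum((x - mean(x))^2)

# Standard error of the slope, by hand and from lm()
se_manual <- s / sqrt(s_xx)
se_lm     <- summary(fit)$coefficients["x", "Std. Error"]

c(manual = se_manual, lm = se_lm)
```

The two numbers should agree up to floating-point error, since `summary()` applies the same formula in its general matrix form.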

\(t\)-Tests for Individual Coefficients

The Individual Significance Test

Question: Is the variable \(X_j\) statistically significant — i.e. does it contribute to explaining \(Y\)?

Hypotheses:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0\]

Test statistic:

\[t_j = \frac{\hat{\beta}_j}{s_{\hat{\beta}_j}} \;\sim\; t_{n-k-1} \quad \text{under } H_0\]

Decision rule at significance level \(\alpha\):

Approach         Rule
Critical value   Reject \(H_0\) if \(|t_j| > t_{n-k-1,\,\alpha/2}\)
\(p\)-value      Reject \(H_0\) if \(p\text{-value} < \alpha\)

Reading the R Output

summary(mlr_model)

Call:
lm(formula = profit2 ~ revenue2 + employees)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.8947  -9.0708  -0.1219   6.0029  25.3186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.46125    9.51476   2.256   0.0324 *  
revenue2     0.12902    0.02048   6.299 9.66e-07 ***
employees    0.61262    0.11789   5.196 1.80e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.43 on 27 degrees of freedom
Multiple R-squared:  0.6543,    Adjusted R-squared:  0.6287 
F-statistic: 25.55 on 2 and 27 DF,  p-value: 5.923e-07

Interpreting the \(t\)-test Output

From the Coefficients table in summary():

Column               Meaning
Estimate             \(\hat{\beta}_j\): the OLS point estimate
Std. Error           \(s_{\hat{\beta}_j}\): standard error of the estimate
t value              \(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\): the test statistic
Pr(>|t|)             Two-sided \(p\)-value: probability of observing a \(|t_j|\) at least this large if \(H_0\) were true
Significance codes   *** \(p<0.001\), ** \(p<0.01\), * \(p<0.05\), . \(p<0.1\)

Important

If Pr(>|t|) \(< \alpha\) (e.g. 0.05): reject \(H_0\) — the variable is statistically significant at level \(\alpha\).

If not: we fail to reject \(H_0\). The data do not provide enough evidence that the variable contributes to the model after controlling for the others. This is not proof that \(\beta_j = 0\), but by the principle of parsimony we may consider removing the variable.
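As a check on how the t value and Pr(>|t|) columns are built, the following sketch recomputes them for revenue2 from the Estimate and Std. Error printed in the output above. The variable names are illustrative; because the inputs are rounded, the results match the printed values only approximately.

```r
# Values copied from the summary() output for revenue2
beta_hat <- 0.12902
se_beta  <- 0.02048
df_resid <- 27      # n - k - 1, as printed in the output
alpha    <- 0.05

t_stat  <- beta_hat / se_beta                    # "t value" column
t_crit  <- qt(1 - alpha / 2, df = df_resid)      # two-sided critical value
p_value <- 2 * pt(-abs(t_stat), df = df_resid)   # "Pr(>|t|)" column

# The two decision rules always agree
reject_by_critical_value <- abs(t_stat) > t_crit
reject_by_p_value        <- p_value < alpha

c(t = t_stat, critical = t_crit, p = p_value)
```

Both rules lead to rejecting \(H_0: \beta_{\text{revenue}} = 0\) at the 5% level.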

One-Sided Tests

Sometimes theory suggests a directional hypothesis:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j > 0 \quad \text{(or } H_1: \beta_j < 0\text{)}\]

The test statistic is the same \(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\).

For \(H_1: \beta_j > 0\): reject \(H_0\) if \(t_j > t_{n-k-1,\,\alpha}\)

For \(H_1: \beta_j < 0\): reject \(H_0\) if \(t_j < -t_{n-k-1,\,\alpha}\)

Note

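The \(p\)-value from R is always two-sided. For a one-sided test, divide it by 2, but only when the sign of \(\hat{\beta}_j\) is consistent with \(H_1\). If the sign contradicts \(H_1\), the one-sided \(p\)-value is \(1 - p/2\), so \(H_0\) is not rejected at any conventional level.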
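The halving rule can be verified directly with `pt()`. This sketch uses the t value printed for employees in the earlier output; the variable names are illustrative.

```r
# t value for employees, copied from the summary() output
t_stat   <- 5.196
df_resid <- 27

# Two-sided p-value, as reported by R
p_two_sided <- 2 * pt(-abs(t_stat), df = df_resid)

# H1: beta_j > 0 uses the upper tail
p_greater <- pt(t_stat, df = df_resid, lower.tail = FALSE)

# H1: beta_j < 0 uses the lower tail
p_less <- pt(t_stat, df = df_resid)

c(two_sided = p_two_sided, greater = p_greater, less = p_less)
```

Because the estimate is positive, `p_greater` equals half the two-sided value, while `p_less` equals one minus that half and is close to 1.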

Confidence Intervals for \(\beta_j\)

Construction

A \((1-\alpha) \times 100\%\) confidence interval for \(\beta_j\):

\[\hat{\beta}_j \;\pm\; t_{n-k-1,\,\alpha/2} \;\cdot\; s_{\hat{\beta}_j}\]

Important

Interpretation: If we repeated the sampling procedure many times, \((1-\alpha)\times 100\%\) of the intervals constructed this way would contain the true \(\beta_j\).

The interval gives a range of plausible values for \(\beta_j\) given the data.

Link to hypothesis testing: if \(0\) is not in the \(95\%\) confidence interval for \(\beta_j\), then \(H_0: \beta_j = 0\) is rejected at \(\alpha = 5\%\).

Confidence Intervals in R

confint(mlr_model, level = 0.95)
                 2.5 %     97.5 %
(Intercept) 1.93857748 40.9839143
revenue2    0.08699515  0.1710521
employees   0.37072535  0.8545246

Interpretation:

  • The 95% CI for \(\beta_{\text{revenue}}\) is approximately [0.087, 0.171].
  • Since 0 is not in this interval, we reject \(H_0: \beta_{\text{revenue}} = 0\) at 5% — revenue has a statistically significant positive effect on profit.
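The intervals returned by `confint()` can be rebuilt from the coefficient table with the construction formula. The sketch below copies the printed estimates and standard errors; small differences from the `confint()` output are due to rounding of those inputs.

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = 21.46125, revenue2 = 0.12902, employees = 0.61262)
se  <- c(intercept = 9.51476,  revenue2 = 0.02048, employees = 0.11789)

df_resid <- 27
level    <- 0.95

# Critical value t_{n-k-1, alpha/2}
t_crit <- qt(1 - (1 - level) / 2, df = df_resid)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci

# Link to the two-sided test: TRUE means H0: beta_j = 0 is rejected at 5%
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0
excludes_zero
```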

The \(F\)-Test: Overall Significance

Motivation

The \(t\)-tests assess individual regressors. But sometimes we want to ask:

Does the model as a whole explain a significant amount of variation in \(Y\)?

This is the overall significance test:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\]

\[H_1: \text{at least one } \beta_j \neq 0\]

Important

Under \(H_0\), none of the regressors help explain \(Y\) — the model is no better than simply predicting \(\bar{Y}\) for everyone.

The \(F\)-Statistic

\[F = \frac{\text{SSE}/k}{\text{SSR}/(n-k-1)} = \frac{\text{Explained MS}}{\text{Residual MS}} \;\sim\; F_{k,\, n-k-1} \quad \text{under } H_0\]

The ANOVA table for regression:

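Source                   SS    df            MS                \(F\)
Regression (Explained)   SSE   \(k\)         SSE/\(k\)         (SSE/\(k\)) / (SSR/\((n-k-1)\))
Residual                 SSR   \(n-k-1\)     SSR/\((n-k-1)\)
Total                    SST   \(n-1\)

Here SSE is the explained sum of squares, SSR the residual sum of squares, and SST = SSE + SSR.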

Decision: Reject \(H_0\) if \(F > F_{k,\,n-k-1,\,\alpha}\) or if \(p\text{-value} < \alpha\).
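To see how the ANOVA table and the decision rule fit together, the sketch below rebuilds the table from three numbers printed by `summary()` for the lecture's model (residual standard error, \(R^2\) and the residual degrees of freedom). The reconstruction relies on \(s^2 = \text{SSR}/(n-k-1)\) and \(R^2 = 1 - \text{SSR}/\text{SST}\); the inputs are rounded, so the results are approximate.

```r
# Values copied from the summary() output
s        <- 12.43     # residual standard error
r2       <- 0.6543    # multiple R-squared
k        <- 2         # number of regressors
df_resid <- 27        # n - k - 1
n        <- df_resid + k + 1

ssr <- s^2 * df_resid     # residual sum of squares
sst <- ssr / (1 - r2)     # total sum of squares
sse <- sst - ssr          # explained sum of squares

ms_explained <- sse / k
ms_residual  <- ssr / df_resid
f_stat       <- ms_explained / ms_residual

# Decision at alpha = 0.05: critical value and p-value
f_crit  <- qf(0.95, df1 = k, df2 = df_resid)
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

anova_table <- data.frame(
  Source = c("Regression (Explained)", "Residual", "Total"),
  SS     = c(sse, ssr, sst),
  df     = c(k, df_resid, n - 1),
  MS     = c(ms_explained, ms_residual, NA),
  F      = c(f_stat, NA, NA)
)
anova_table
```

The recomputed \(F\) is far above the critical value, so the overall null hypothesis is rejected.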

\(F\)-Test in R

The \(F\)-statistic and its \(p\)-value appear at the bottom of summary():

summary(mlr_model)

(Full output as shown under "Reading the R Output"; the final lines are repeated here.)

Residual standard error: 12.43 on 27 degrees of freedom
Multiple R-squared:  0.6543,    Adjusted R-squared:  0.6287 
F-statistic: 25.55 on 2 and 27 DF,  p-value: 5.923e-07

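The line F-statistic: ... on k and n-k-1 DF, p-value: ... gives the overall test. Here \(F = 25.55\) with \(k = 2\) and \(n-k-1 = 27\) degrees of freedom; the \(p\)-value of 5.923e-07 is far below any conventional \(\alpha\), so we reject \(H_0\).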

Relationship Between \(F\) and \(R^2\)

In MLR, the \(F\)-statistic can be written as:

\[F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\]

Implications:

  • For fixed \(n\) and \(k\), \(F\) increases with \(R^2\): a high \(R^2\) produces a large \(F\) (and a small \(p\)-value)
  • But a significant \(F\) does not mean all individual coefficients are significant
  • And a significant \(t_j\) does not imply overall model significance

Note

In SLR (\(k = 1\)): \(F = t^2\) — the \(F\)-test is equivalent to the two-sided \(t\)-test on the single slope.
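Both identities are easy to confirm numerically. The sketch below fits an SLR on simulated data (hypothetical values, not the lecture's dataset) and checks that the reported \(F\) equals both \(t^2\) and the \(R^2\)-based formula. The last line applies the same formula to the \(R^2\) printed for the lecture's MLR.

```r
# Hypothetical simulated SLR data
set.seed(3)
n <- 40
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
sm  <- summary(fit)

f_stat <- unname(sm$fstatistic["value"])      # F reported by summary()
t_stat <- sm$coefficients["x", "t value"]     # t for the single slope
r2     <- sm$r.squared
k      <- 1

# F rewritten in terms of R^2
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

c(F = f_stat, t_squared = t_stat^2, from_R2 = f_from_r2)

# Same formula with the lecture's reported values (R^2 = 0.6543, k = 2, n - k - 1 = 27)
f_lecture <- (0.6543 / 2) / ((1 - 0.6543) / 27)
f_lecture
```

The value of `f_lecture` reproduces the printed F-statistic of 25.55 up to rounding.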

Prediction

Point Prediction

Given new values \(X_1 = x_1^*, \ldots, X_k = x_k^*\), the point prediction is:

\[\hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^*\]

# Predict profit for revenue = 300, employees = 40
new_obs <- data.frame(revenue2 = 300, employees = 40)
predict(mlr_model, newdata = new_obs)
       1 
84.67332 

Important

Extrapolation warning: predictions are only reliable within the range of the observed data. Predicting far outside the sample range (e.g. revenue = 5000 when the data only covers 100–500) is unreliable.
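The point prediction is simply the inner product of the coefficient vector with the new observation. The sketch below reproduces the `predict()` result from the printed coefficients (approximately, since they are rounded) and adds a simple range check. The helper `is_extrapolation()` and the range 100 to 500 are illustrative assumptions taken from the example in the warning, not properties of the actual sample.

```r
# Coefficient estimates copied from the summary() output
beta_hat <- c(intercept = 21.46125, revenue2 = 0.12902, employees = 0.61262)

# New observation, with a leading 1 for the intercept
x_star <- c(1, 300, 40)

y_hat_star <- sum(beta_hat * x_star)
y_hat_star

# Hypothetical helper: flags values outside an observed range
is_extrapolation <- function(x, observed_range) {
  x < observed_range[1] | x > observed_range[2]
}

# Illustrative range from the warning's example, not the real sample range
revenue_range <- c(100, 500)
is_extrapolation(c(300, 5000), revenue_range)
```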

Prediction Intervals

A prediction interval accounts for two sources of uncertainty:

  1. Uncertainty about the true mean \(E[Y^*]\) (same as a confidence interval for the mean)
  2. Uncertainty about the individual outcome \(Y^*\) around that mean

\[\hat{Y}^* \;\pm\; t_{n-k-1,\,\alpha/2} \;\cdot\; s \cdot \sqrt{1 + \mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}\]

In R:

predict(mlr_model, newdata = new_obs,
        interval = "prediction", level = 0.95)
       fit      lwr      upr
1 84.67332 58.73202 110.6146

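The prediction interval is always wider than the confidence interval for the mean \(E[Y^*]\), which uses \(\sqrt{\mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}\) without the leading 1. The extra 1 under the square root accounts for individual variation around the mean.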
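The matrix formula for the prediction interval can be checked against `predict()`. Since the lecture's data are not reproduced here, this sketch uses simulated data with hypothetical variables `x1`, `x2` and `y`. It also computes the confidence interval for the mean, to show the difference in width.

```r
# Hypothetical simulated data (not the lecture's dataset)
set.seed(42)
n  <- 30
x1 <- runif(n, 100, 500)
x2 <- runif(n, 10, 80)
y  <- 20 + 0.13 * x1 + 0.6 * x2 + rnorm(n, sd = 12)
sim_data <- data.frame(y, x1, x2)

fit     <- lm(y ~ x1 + x2, data = sim_data)
new_obs <- data.frame(x1 = 300, x2 = 40)

X      <- model.matrix(fit)         # design matrix, including the intercept column
x_star <- c(1, 300, 40)             # new observation with leading 1
s      <- summary(fit)$sigma        # residual standard error
t_crit <- qt(0.975, df = df.residual(fit))

# Quadratic form x*' (X'X)^{-1} x*
h_star <- drop(t(x_star) %*% solve(crossprod(X)) %*% x_star)
y_hat  <- sum(coef(fit) * x_star)

pi_manual <- y_hat + c(-1, 1) * t_crit * s * sqrt(1 + h_star)  # prediction interval
ci_manual <- y_hat + c(-1, 1) * t_crit * s * sqrt(h_star)      # CI for the mean

pi_r <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
ci_r <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

rbind(prediction = pi_manual, confidence = ci_manual)
```

The manual bounds should coincide with the `lwr` and `upr` columns returned by `predict()`.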

Validating the Classical Assumptions

Why Validation Matters

Our inference (t-tests, F-test, confidence intervals) is only valid if the classical assumptions hold.

If they fail:

  • A2 violated (non-zero mean error, omitted variables): OLS is biased
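  • A3 violated (heteroscedasticity): OLS remains unbiased but is no longer efficient, and the usual standard errors are wrong → invalid \(t\) and \(F\) tests
  • A4 violated (autocorrelation): same consequences as heteroscedasticity (wrong standard errors, invalid tests)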
  • A6 violated (non-normality): \(t\) and \(F\) approximations may fail in small samples
  • Multicollinearity (not a classical assumption violation): inflated standard errors, unstable estimates

The tool for checking assumptions: residual analysis.

The Four Key Diagnostic Plots

par(mfrow = c(1, 4), mar = c(4, 4, 3, 1))
plot(mlr_model)

Residuals vs. Fitted: Checking Linearity and Homoscedasticity

What to look for:

  • Random scatter around zero → linearity OK
  • Constant vertical spread → homoscedasticity OK
  • Funnel or fan shape → heteroscedasticity (problem)
  • Curved pattern → non-linearity (problem)

Normality of Residuals: Q–Q Plot

What to look for:

  • Points near the diagonal → normality approximately satisfied
  • S-shaped curve → tails heavier or lighter than the normal distribution (non-normal)
  • Systematic departure → consider transformations of \(Y\)
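A numerical complement to the visual Q-Q check is the Shapiro-Wilk test on the residuals. The sketch below uses simulated data (hypothetical, with normal errors by construction) and base R only. It calls `qqnorm()` with `plot.it = FALSE` to obtain the plotted coordinates without opening a graphics device.

```r
# Hypothetical simulated data with normal errors
set.seed(7)
n <- 30
x <- runif(n, 100, 500)
y <- 20 + 0.15 * x + rnorm(n, sd = 12)

fit <- lm(y ~ x)
res <- rstandard(fit)     # standardised residuals, as used in the Q-Q plot

# Coordinates of the Q-Q plot; use plot.it = TRUE (the default) to draw it
qq <- qqnorm(res, plot.it = FALSE)

# Correlation between theoretical and sample quantiles: near 1 if points lie on a line
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(res)

c(qq_correlation = qq_cor, shapiro_p = sw$p.value)
```

A small Shapiro-Wilk \(p\)-value would indicate a departure from normality. In small samples the test has limited power, so it is best read alongside the plot.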

Detecting Heteroscedasticity

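Problem signs in the scale-location plot (\(\sqrt{|\text{standardised residuals}|}\) against fitted values, with a smoothed trend line):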

  • Rising line → variance increases with \(\hat{Y}\) (funnel-right)
  • Falling line → variance decreases with \(\hat{Y}\) (funnel-left)

Formal test: White’s test (tests whether the residual variance depends on the regressors, their squares and, in the full version, their cross-products).

Practical fix: Use robust (heteroscedasticity-consistent) standard errors or transform \(Y\) (e.g. \(\log Y\)).
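Both the formal test and the practical fix can be sketched in base R. The code below is a minimal illustration on simulated data (hypothetical variables, with heteroscedasticity built in), not a replacement for dedicated packages. It computes White's statistic as \(n R^2\) from an auxiliary regression of the squared residuals, and the HC0 (White) robust standard errors from the sandwich formula.

```r
# Hypothetical simulated data: error standard deviation grows with x1
set.seed(123)
n  <- 200
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.5 * x1)

fit <- lm(y ~ x1 + x2)
e   <- residuals(fit)
e2  <- e^2

# White's test: regress e^2 on levels, squares and the cross-product
aux <- lm(e2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))

white_stat <- n * summary(aux)$r.squared     # LM statistic n * R^2
white_df   <- length(coef(aux)) - 1          # number of auxiliary regressors
white_p    <- pchisq(white_stat, df = white_df, lower.tail = FALSE)

# HC0 robust standard errors: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
X        <- model.matrix(fit)
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)                 # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_robust  <- sqrt(diag(vcov_hc0))
se_classic <- summary(fit)$coefficients[, "Std. Error"]

list(white = c(stat = white_stat, df = white_df, p = white_p),
     se    = rbind(classic = se_classic, robust = se_robust))
```

Under \(H_0\) (homoscedasticity) the statistic is approximately \(\chi^2\) with `white_df` degrees of freedom, so a small `white_p` points to heteroscedasticity.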

The Classical Assumptions: Summary Table

Assumption              How to check                                   Symptom if violated                Consequence
A1 Linearity            Scatter plot, residuals vs. fitted             Curved pattern                     Biased estimates
A2 Zero mean            Residuals vs. fitted                           Systematic non-zero residuals      Biased \(\hat{\beta}_0\)
A3 Homoscedasticity     Scale-location plot, White’s test              Funnel shape                       Invalid SEs & tests
A4 No autocorrelation   Durbin-Watson statistic, residuals vs. order   Pattern in residuals               Invalid SEs & tests
A6 Normality            Q–Q plot, Shapiro-Wilk test                    S-curve in Q–Q                     Invalid small-sample tests
Multicollinearity       VIF                                            High pairwise \(r\), VIF \(> 5\)   Inflated SEs, unstable \(\hat{\beta}_j\)

Autocorrelation: The Durbin–Watson Test

Relevant mainly for time series data. Tests \(H_0\): no autocorrelation vs. \(H_1\): positive autocorrelation.

\[DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}\]

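Rule of thumb: \(DW\) lies between 0 and 4. \(DW \approx 2\) → no autocorrelation; \(DW < 1.5\) → concern (positive autocorrelation); values well above 2 point to negative autocorrelation.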

# install.packages("lmtest") if needed
library(lmtest)
dwtest(mlr_model)

    Durbin-Watson test

data:  mlr_model
DW = 1.6312, p-value = 0.1308
alternative hypothesis: true autocorrelation is greater than 0
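In the output above, \(DW = 1.6312\) with a \(p\)-value of 0.1308, so at the 5% level we do not reject the null hypothesis of no autocorrelation. The statistic itself is simple to compute from the residuals. The sketch below does so on simulated data (hypothetical, with independent errors) and checks its exact link to the lag-1 residual autocorrelation \(\hat{\rho}\), which is where the approximation \(DW \approx 2(1 - \hat{\rho})\) comes from.

```r
# Hypothetical simulated time-ordered data with independent errors
set.seed(2024)
n          <- 50
time_index <- seq_len(n)
y          <- 10 + 0.5 * time_index + rnorm(n, sd = 3)

fit <- lm(y ~ time_index)
e   <- unname(residuals(fit))

# Durbin-Watson statistic: sum of squared successive differences over SSR
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
rho_hat <- sum(e[-1] * e[-n]) / sum(e^2)

# Exact identity behind the approximation DW ~ 2 * (1 - rho_hat)
dw_identity <- 2 * (1 - rho_hat) - (e[1]^2 + e[n]^2) / sum(e^2)

c(DW = dw, approx = 2 * (1 - rho_hat), rho_hat = rho_hat)
```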

Multicollinearity and VIF

Multicollinearity: two or more regressors are highly correlated with each other.

Consequence: OLS is still unbiased and BLUE, but standard errors are inflated → \(t\)-statistics are small → we may fail to detect significant effects.

The Variance Inflation Factor (VIF) for regressor \(X_j\):

\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other regressors.

Note

Rule of thumb: VIF \(< 5\) is acceptable; values between 5 and 10 deserve a closer look; VIF \(> 10\) is a serious problem.

VIF in R

# install.packages("car") if needed
library(car)
vif(mlr_model)
 revenue2 employees 
 1.110802  1.110802 

Both VIF values are close to 1 → no multicollinearity problem in our example. This is expected since revenue2 and employees were generated independently.
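The VIF can also be computed without the car package, directly from its definition via auxiliary regressions. The sketch below uses simulated regressors (hypothetical `x1` and `x2`, mildly correlated by construction). The final lines invert the formula for the VIF printed above to recover the implied \(R_j^2\).

```r
# Hypothetical simulated regressors, mildly correlated
set.seed(99)
n  <- 30
x1 <- runif(n, 100, 500)
x2 <- 20 + 0.05 * x1 + rnorm(n, sd = 15)

# Auxiliary regressions: each regressor on the other one
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
r2_x2 <- summary(lm(x2 ~ x1))$r.squared

vif_x1 <- 1 / (1 - r2_x1)
vif_x2 <- 1 / (1 - r2_x2)

c(x1 = vif_x1, x2 = vif_x2)

# Implied auxiliary R^2 for the lecture's model, from the printed VIF
vif_lecture <- 1.110802
r2_lecture  <- 1 - 1 / vif_lecture
r2_lecture
```

With exactly two regressors both auxiliary \(R^2\) values equal the squared correlation between them, which is why the two VIFs in the lecture's output are identical.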

Putting It All Together: Model Validation Workflow

  1. Fit the model with lm() and check summary() for coefficient signs, magnitudes and significance.
  2. Plot residuals vs. fitted — check for linearity and homoscedasticity.
  3. Inspect the Q–Q plot — check for normality of residuals.
  4. Run Durbin-Watson (if data may have time ordering) — check for autocorrelation.
  5. Compute VIF (if \(k \geq 2\)) — check for multicollinearity.
  6. If violations are found: consider transformations (log \(Y\), log \(X\)), robust standard errors, or model re-specification (add/remove variables).

Important

Validation is not a single pass — it is an iterative dialogue between the model and the data.

Complete R Workflow

End-to-End: From Data to Validated Model

# 1. Fit the model (mydata is a placeholder for a data frame with columns Y, X1, X2)
model <- lm(Y ~ X1 + X2, data = mydata)

# 2. Inspect results
summary(model)
confint(model, level = 0.95)

# 3. Residual diagnostics
par(mfrow = c(2, 2))
plot(model)

# 4. Autocorrelation (if needed)
library(lmtest)
dwtest(model)

# 5. Multicollinearity (if k >= 2)
library(car)
vif(model)

# 6. Predict for new data (x1_new and x2_new are placeholders for the new regressor values)
predict(model,
        newdata = data.frame(X1 = x1_new, X2 = x2_new),
        interval = "prediction",
        level = 0.95)

Reading a Regression Table

A typical regression output table (as seen in R or SPSS):

              Estimate            Std. Error              \(t\) value   \(\Pr(>|t|)\)
(Intercept)   \(\hat{\beta}_0\)   \(s_{\hat{\beta}_0}\)   \(t_0\)       \(p_0\)
\(X_1\)       \(\hat{\beta}_1\)   \(s_{\hat{\beta}_1}\)   \(t_1\)       \(p_1\)
\(X_2\)       \(\hat{\beta}_2\)   \(s_{\hat{\beta}_2}\)   \(t_2\)       \(p_2\)

Metric                        What it tells you
Residual std. error (\(s\))   Typical size of residuals; in the same units as \(Y\)
\(R^2\)                       Proportion of variance in \(Y\) explained
Adjusted \(R^2\)              \(R^2\) penalised for number of regressors
\(F\)-statistic               Overall significance of the model

Course Summary: Linear Regression

The Linear Regression Journey (3 Lectures)

Lecture   Topic                    Key Concepts
8         Introduction & OLS       Model specification, Gauss–Markov assumptions, OLS formulas, coefficient interpretation
9         Fit & MLR                \(R^2\), \(\bar{R}^2\), Gauss–Markov theorem, Omitted Variable Bias, MLR ceteris paribus interpretation
10        Inference & Validation   \(t\)-tests, \(F\)-test, CIs, prediction intervals, residual diagnostics, VIF

The core message:

OLS gives us the best linear unbiased estimates — but only if the classical assumptions hold. Always validate your model before drawing conclusions.

Key Formulas at a Glance

Quantity              Formula
OLS slope (SLR)       \(\hat{\beta}_1 = S_{XY}/S_{XX}\)
OLS intercept (SLR)   \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)
Error variance        \(s^2 = \text{SSR}/(n-k-1)\)
\(R^2\)               \(1 - \text{SSR}/\text{SST}\)
Adjusted \(R^2\)      \(1 - (1-R^2)(n-1)/(n-k-1)\)
\(t\)-statistic       \(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\)
CI for \(\beta_j\)    \(\hat{\beta}_j \pm t_{n-k-1,\alpha/2} \cdot s_{\hat{\beta}_j}\)
\(F\)-statistic       \((\text{SSE}/k)\,/\,(\text{SSR}/(n-k-1))\)
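A few of these formulas can be checked in a handful of lines. The first part of the sketch below verifies the slope and intercept formulas on simulated SLR data (hypothetical values). The second part applies the adjusted \(R^2\) formula to the \(R^2\) reported for the lecture's model, with \(n = 30\) implied by the 27 residual degrees of freedom and \(k = 2\).

```r
# Part 1: SLR slope and intercept on hypothetical simulated data
set.seed(10)
n <- 25
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)

s_xy <- sum((x - mean(x)) * (y - mean(y)))
s_xx <- sum((x - mean(x))^2)

slope     <- s_xy / s_xx
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)
rbind(by_hand = c(intercept, slope), lm = unname(coef(fit)))

# Part 2: adjusted R^2 from the values reported in the lecture's summary() output
r2    <- 0.6543
n_obs <- 30
k     <- 2

adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / (n_obs - k - 1)
adj_r2
```

The result of the second part agrees with the printed Adjusted R-squared of 0.6287 up to rounding.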