The Linear Regression Model

Part III — Statistical Inference & Model Validation

Paulo Fagandini

Lisbon Accounting and Business School — Polytechnic University of Lisbon

Statistical Inference in Regression

Lecture Overview

Lecture 10 — Statistical Inference & Model Validation

Distributions of OLS estimators under normality
\(t\)-tests for individual coefficients
Confidence intervals for \(\beta_j\)
The \(F\)-test: overall model significance
The ANOVA table for regression
Point prediction and prediction intervals
Validating the classical assumptions:
- Linearity, normality, homoscedasticity, independence, multicollinearity

Note

Reference: Newbold Ch. 11.6–11.8, Ch. 12.4–12.6.

Sampling Distributions of OLS Estimators

From Estimates to Inference

We have \(\hat{\beta}_j\) — point estimates. But how uncertain are they?

Under the classical assumptions A1–A6:

\[\hat{\beta}_j \sim N\!\left(\beta_j,\; \sigma^2_{\hat{\beta}_j}\right)\]

Since \(\sigma^2\) is unknown, we replace it with \(s^2 = \text{SSR}/(n-k-1)\), where \(\text{SSR} = \sum_i e_i^2\) is the sum of squared residuals:

\[T_j = \frac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim t_{n-k-1}\]

where \(s_{\hat{\beta}_j}\) is the standard error of \(\hat{\beta}_j\) (reported by R in the Std. Error column).

Note

In SLR (\(k=1\)): \(s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{S_{XX}}}\), where \(s = \sqrt{\text{SSR}/(n-2)}\) and \(S_{XX} = \sum_i (x_i - \bar{x})^2\).
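As a rough sketch of where the Std. Error column comes from, the block below recomputes the standard errors from \(s^2 (\mathbf{X}'\mathbf{X})^{-1}\) and compares them with summary(). It assumes R's built-in mtcars data and a hypothetical model mpg ~ wt + hp, not the lecture's profit data.

```r
# Illustration only: built-in mtcars data, not the lecture's profit data
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)          # n x (k+1) design matrix, including the intercept column
n <- nrow(X)
k <- ncol(X) - 1                # number of regressors

ssr <- sum(residuals(fit)^2)    # sum of squared residuals (SSR in this lecture's notation)
s2  <- ssr / (n - k - 1)        # estimate of the error variance

# Estimated covariance matrix of the OLS estimators: s^2 (X'X)^{-1}
se_manual <- sqrt(diag(s2 * solve(crossprod(X))))

# Compare with the Std. Error column reported by summary()
se_r <- coef(summary(fit))[, "Std. Error"]
cbind(manual = se_manual, summary = se_r)
```

The two columns should agree up to numerical precision, since summary() uses the same formula internally (via a QR decomposition rather than an explicit inverse).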

\(t\)-Tests for Individual Coefficients

The Individual Significance Test

Question: Is the variable \(X_j\) statistically significant — i.e. does it contribute to explaining \(Y\)?

Hypotheses:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j \neq 0\]

Test statistic:

\[t_j = \frac{\hat{\beta}_j}{s_{\hat{\beta}_j}} \;\sim\; t_{n-k-1} \quad \text{under } H_0\]

Decision rule at significance level \(\alpha\):

Approach	Rule
Critical value	Reject \(H_0\) if \(|t_j| > t_{n-k-1,\,\alpha/2}\)
\(p\)-value	Reject \(H_0\) if \(p\text{-value} < \alpha\)

Reading the R Output

summary(mlr_model)


Call:
lm(formula = profit2 ~ revenue2 + employees)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.8947  -9.0708  -0.1219   6.0029  25.3186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.46125    9.51476   2.256   0.0324 *  
revenue2     0.12902    0.02048   6.299 9.66e-07 ***
employees    0.61262    0.11789   5.196 1.80e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.43 on 27 degrees of freedom
Multiple R-squared:  0.6543,    Adjusted R-squared:  0.6287 
F-statistic: 25.55 on 2 and 27 DF,  p-value: 5.923e-07

Interpreting the \(t\)-test Output

From the Coefficients table in summary():

Column	Meaning
`Estimate`	\(\hat{\beta}_j\) — the OLS point estimate
`Std. Error`	\(s_{\hat{\beta}_j}\) — standard error of the estimate
`t value`	\(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\) — the test statistic
`Pr(>|t|)`	Two-sided \(p\)-value: probability of observing a statistic at least as large as \(|t_j|\) in absolute value if \(H_0\) were true
Significance codes	`***` \(p<0.001\), `**` \(p<0.01\), `*` \(p<0.05\), `.` \(p<0.1\)

Important

If Pr(>|t|) \(< \alpha\) (e.g. 0.05): reject \(H_0\) — the variable is statistically significant at level \(\alpha\).

If not: we fail to reject \(H_0\) — the variable may not be contributing to the model after controlling for the others (principle of parsimony: consider removing it).
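The t value and Pr(>|t|) columns can be reproduced by hand. The sketch below uses the rounded Estimate and Std. Error for revenue2 from the summary() output above, so the results match R's only up to rounding.

```r
# Values copied from the revenue2 row of the summary() output above
beta_hat <- 0.12902
se       <- 0.02048
df       <- 27                       # n - k - 1, the residual degrees of freedom
alpha    <- 0.05

t_stat <- beta_hat / se              # test statistic under H0: beta_j = 0
t_crit <- qt(1 - alpha / 2, df)      # two-sided critical value
p_val  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

c(t = t_stat, critical = t_crit, p = p_val)
abs(t_stat) > t_crit                 # TRUE means reject H0 (critical-value approach)
p_val < alpha                        # TRUE means reject H0 (p-value approach)
```

Both approaches necessarily lead to the same decision; the \(p\)-value simply reports the smallest \(\alpha\) at which \(H_0\) would be rejected.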

One-Sided Tests

Sometimes theory suggests a directional hypothesis:

\[H_0: \beta_j = 0 \quad \text{vs} \quad H_1: \beta_j > 0 \quad \text{(or } H_1: \beta_j < 0\text{)}\]

The test statistic is the same \(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\).

For \(H_1: \beta_j > 0\): reject \(H_0\) if \(t_j > t_{n-k-1,\,\alpha}\)

For \(H_1: \beta_j < 0\): reject \(H_0\) if \(t_j < -t_{n-k-1,\,\alpha}\)

Note

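The \(p\)-value reported by summary() is always two-sided. For a one-sided test, divide it by 2, but only when the sign of \(\hat{\beta}_j\) is consistent with \(H_1\). If the sign contradicts \(H_1\), the one-sided \(p\)-value is \(1 - p/2\), which exceeds 0.5, so \(H_0\) is not rejected.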
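A small sketch of the one-sided calculation, using the rounded t value for employees from the summary() output (5.196 on 27 degrees of freedom):

```r
# Values copied from the employees row of the summary() output
t_stat <- 5.196
df     <- 27

p_two   <- 2 * pt(-abs(t_stat), df)            # two-sided p-value, as reported by R
p_right <- pt(t_stat, df, lower.tail = FALSE)  # one-sided p-value for H1: beta_j > 0
p_left  <- pt(t_stat, df)                      # one-sided p-value for H1: beta_j < 0

c(two_sided = p_two, right = p_right, left = p_left)
# Because t_stat > 0, p_right equals p_two / 2 and p_left equals 1 - p_two / 2
```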

Confidence Intervals for \(\beta_j\)

Construction

A \((1-\alpha) \times 100\%\) confidence interval for \(\beta_j\):

\[\hat{\beta}_j \;\pm\; t_{n-k-1,\,\alpha/2} \;\cdot\; s_{\hat{\beta}_j}\]

Important

Interpretation: If we repeated the sampling procedure many times, \((1-\alpha)\times 100\%\) of the intervals constructed this way would contain the true \(\beta_j\).

The interval gives a range of plausible values for \(\beta_j\) given the data.

Link to hypothesis testing: if \(0\) is not in the \(95\%\) confidence interval for \(\beta_j\), then \(H_0: \beta_j = 0\) is rejected at \(\alpha = 5\%\).

Confidence Intervals in R

confint(mlr_model, level = 0.95)

                 2.5 %     97.5 %
(Intercept) 1.93857748 40.9839143
revenue2    0.08699515  0.1710521
employees   0.37072535  0.8545246

Interpretation:

The 95% CI for \(\beta_{\text{revenue}}\) is approximately [0.087, 0.171].
Since 0 is not in this interval, we reject \(H_0: \beta_{\text{revenue}} = 0\) at 5%: holding employees fixed, revenue has a statistically significant positive effect on profit.
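The interval reported by confint() can be checked against the construction formula. This sketch plugs in the rounded Estimate and Std. Error for revenue2 from the summary() output, so it agrees with confint() only to about three decimals.

```r
# Values copied from the revenue2 row of the summary() output
beta_hat <- 0.12902
se       <- 0.02048
df       <- 27                             # n - k - 1
level    <- 0.95

alpha  <- 1 - level
t_crit <- qt(1 - alpha / 2, df)            # t_{n-k-1, alpha/2}
ci     <- beta_hat + c(-1, 1) * t_crit * se

round(ci, 3)                               # compare with the revenue2 row of confint()
ci[1] <= 0 && 0 <= ci[2]                   # FALSE: 0 lies outside the interval
```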

The \(F\)-Test: Overall Significance

Motivation

The \(t\)-tests assess individual regressors. But sometimes we want to ask:

Does the model as a whole explain a significant amount of variation in \(Y\)?

This is the overall significance test:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\] \[H_1: \text{at least one } \beta_j \neq 0\]

Important

Under \(H_0\), none of the regressors help explain \(Y\): the model is no better than simply predicting \(\bar{Y}\) for every observation.

The \(F\)-Statistic

\[F = \frac{\text{SSE}/k}{\text{SSR}/(n-k-1)} = \frac{\text{Explained MS}}{\text{Residual MS}} \;\sim\; F_{k,\, n-k-1} \quad \text{under } H_0\]

The ANOVA table for regression:

Source	SS	df	MS	\(F\)
Regression (Explained)	SSE	\(k\)	SSE/\(k\)	\(\dfrac{\text{SSE}/k}{\text{SSR}/(n-k-1)}\)
Residual	SSR	\(n-k-1\)	SSR/\((n-k-1)\)
Total	SST	\(n-1\)

Decision: Reject \(H_0\) if \(F > F_{k,\,n-k-1,\,\alpha}\) or if \(p\text{-value} < \alpha\).
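To make the ANOVA decomposition concrete, the sketch below builds the table by hand and checks the resulting \(F\) against summary(). It assumes R's built-in mtcars data and a hypothetical model mpg ~ wt + hp; the notation follows this lecture (SSE = explained, SSR = residual).

```r
# Illustration only: built-in mtcars data, not the lecture's profit data
fit <- lm(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
n <- length(y)
k <- 2                                   # number of regressors

sst <- sum((y - mean(y))^2)              # total sum of squares
ssr <- sum(residuals(fit)^2)             # residual sum of squares (SSR in this lecture)
sse <- sum((fitted(fit) - mean(y))^2)    # explained sum of squares (SSE in this lecture)

f_stat <- (sse / k) / (ssr / (n - k - 1))
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

anova_tab <- data.frame(
  Source = c("Regression", "Residual", "Total"),
  SS     = c(sse, ssr, sst),
  df     = c(k, n - k - 1, n - 1),
  MS     = c(sse / k, ssr / (n - k - 1), NA)
)
anova_tab
c(F = f_stat, p = p_val)
summary(fit)$fstatistic                  # value, numdf, dendf as computed by R
```

The identity SST = SSE + SSR holds because the model includes an intercept.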

\(F\)-Test in R

The \(F\)-statistic and its \(p\)-value appear at the bottom of summary():

summary(mlr_model)


Call:
lm(formula = profit2 ~ revenue2 + employees)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.8947  -9.0708  -0.1219   6.0029  25.3186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.46125    9.51476   2.256   0.0324 *  
revenue2     0.12902    0.02048   6.299 9.66e-07 ***
employees    0.61262    0.11789   5.196 1.80e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.43 on 27 degrees of freedom
Multiple R-squared:  0.6543,    Adjusted R-squared:  0.6287 
F-statistic: 25.55 on 2 and 27 DF,  p-value: 5.923e-07

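The line F-statistic: ... on k and n-k-1 DF, p-value: ... gives the overall test. Here \(F = 25.55\) on 2 and 27 DF with a \(p\)-value of about \(5.9 \times 10^{-7}\), so \(H_0\) is rejected at any conventional significance level.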

Relationship Between \(F\) and \(R^2\)

In MLR, the \(F\)-statistic can be written as:

\[F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\]

This makes clear that:

A high \(R^2\) tends to produce a large \(F\) (and small \(p\)-value)
But a significant \(F\) does not mean all individual coefficients are significant
And a significant \(t_j\) does not imply overall model significance

Note

In SLR (\(k = 1\)): \(F = t^2\) — the \(F\)-test is equivalent to the two-sided \(t\)-test on the single slope.
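Both relationships can be checked numerically. The first part of this sketch plugs the rounded \(R^2\) from the summary() output into the formula (with \(n = 30\) inferred from 27 residual degrees of freedom and \(k = 2\)); the second part uses a hypothetical SLR on R's built-in mtcars data to illustrate \(F = t^2\).

```r
# Part 1: F from R^2, using rounded values from the summary() output
r2 <- 0.6543
k  <- 2
n  <- 30                                   # 27 residual df + k + 1
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val     <- pf(f_from_r2, k, n - k - 1, lower.tail = FALSE)
c(F = f_from_r2, p = p_val)                # compare with the F-statistic line

# Part 2: F = t^2 in simple linear regression (built-in mtcars data, illustration only)
slr     <- lm(mpg ~ wt, data = mtcars)
t_slope <- coef(summary(slr))["wt", "t value"]
f_slr   <- unname(summary(slr)$fstatistic["value"])
c(t_squared = t_slope^2, F = f_slr)
```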

Prediction

Point Prediction

Given new values \(X_1 = x_1^*, \ldots, X_k = x_k^*\), the point prediction is:

\[\hat{Y}^* = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^*\]

# Predict profit for revenue = 300, employees = 40
new_obs <- data.frame(revenue2 = 300, employees = 40)
predict(mlr_model, newdata = new_obs)

       1 
84.67332

Important

Extrapolation warning: predictions are only reliable within the range of the observed data. Predicting far outside the sample range (e.g. revenue = 5000 when the data only covers 100–500) is unreliable.
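The point prediction is just the fitted equation evaluated at the new values. A quick check by hand, using the rounded coefficients from the summary() output (so the result differs slightly from predict() in the third decimal):

```r
# Coefficients copied (rounded) from the summary() output
b <- c(intercept = 21.46125, revenue2 = 0.12902, employees = 0.61262)

# New observation: leading 1 for the intercept, then revenue2 = 300, employees = 40
x_star <- c(1, 300, 40)

y_hat <- sum(b * x_star)   # beta0 + beta1 * 300 + beta2 * 40
y_hat                      # approximately 84.67
```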

Prediction Intervals

A prediction interval accounts for two sources of uncertainty:

Uncertainty about the true mean \(E[Y^*]\) (same as a confidence interval for the mean)
Uncertainty about the individual outcome \(Y^*\) around that mean

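\[\hat{Y}^* \;\pm\; t_{n-k-1,\,\alpha/2} \;\cdot\; s \cdot \sqrt{1 + \mathbf{x}^{*\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^*}\]

where \(\mathbf{x}^* = (1, x_1^*, \ldots, x_k^*)'\) is the new observation (with a leading 1 for the intercept), \(\mathbf{X}\) is the design matrix of the sample, and \(s\) is the residual standard error.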

In R:

predict(mlr_model, newdata = new_obs,
        interval = "prediction", level = 0.95)

       fit      lwr      upr
1 84.67332 58.73202 110.6146

The prediction interval is always wider than the confidence interval for the mean — it must additionally account for individual variation.
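The formula can be verified against predict(). This is a minimal sketch assuming R's built-in mtcars data, a hypothetical model mpg ~ wt + hp, and a made-up new observation inside the sample range; dropping the "1 +" under the square root gives the confidence interval for the mean.

```r
# Illustration only: built-in mtcars data and a hypothetical new observation
fit     <- lm(mpg ~ wt + hp, data = mtcars)
new_obs <- data.frame(wt = 3, hp = 120)

X      <- model.matrix(fit)
x_star <- c(1, new_obs$wt, new_obs$hp)     # leading 1 for the intercept
s      <- sigma(fit)                       # residual standard error
t_crit <- qt(0.975, df.residual(fit))      # 95% level

y_hat <- sum(coef(fit) * x_star)
h     <- drop(t(x_star) %*% solve(crossprod(X)) %*% x_star)   # x*'(X'X)^{-1}x*

pi_manual <- y_hat + c(-1, 1) * t_crit * s * sqrt(1 + h)      # prediction interval
ci_manual <- y_hat + c(-1, 1) * t_crit * s * sqrt(h)          # CI for the mean

rbind(prediction = pi_manual, mean = ci_manual)
predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
```

The extra 1 under the square root is the variance contribution of the individual error term, which is why the prediction interval does not shrink to zero width even in very large samples.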

Validating the Classical Assumptions

Why Validation Matters

Our inference (t-tests, F-test, confidence intervals) is only valid if the classical assumptions hold.

If they fail:

A2 violated (non-zero mean error, omitted variables): OLS is biased
A3 violated (heteroscedasticity): OLS standard errors are wrong → invalid \(t\) and \(F\) tests
A4 violated (autocorrelation): same problem as heteroscedasticity
A6 violated (non-normality): the \(t\) and \(F\) distributions hold only approximately, so tests and intervals may be unreliable in small samples
Multicollinearity (not a classical assumption violation): inflated standard errors, unstable estimates

The tool for checking assumptions: residual analysis.

The Four Key Diagnostic Plots

par(mfrow = c(1, 4), mar = c(4, 4, 3, 1))
plot(mlr_model)

Residuals vs. Fitted: Checking Linearity and Homoscedasticity

What to look for:

Random scatter around zero → linearity OK
Constant vertical spread → homoscedasticity OK
Funnel or fan shape → heteroscedasticity (problem)
Curved pattern → non-linearity (problem)

Normality of Residuals: Q–Q Plot

What to look for:

Points near the diagonal → normality approximately satisfied
S-shaped curve → tails heavier or lighter than the normal distribution (non-normal)
Systematic departure → consider transformations of \(Y\)
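The visual check can be complemented with the Shapiro-Wilk test. A minimal sketch, assuming R's built-in mtcars data and a hypothetical model mpg ~ wt + hp:

```r
# Illustration only: built-in mtcars data, not the lecture's profit data
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- rstandard(fit)                  # standardised residuals, as in R's Q-Q diagnostic plot

# Theoretical vs. sample quantiles, computed without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)
cor(qq$x, qq$y)                        # close to 1 when the points lie near a straight line

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(res)
sw$p.value                             # a small p-value is evidence against normality
```

With small samples the test has little power, so it is usually read together with the Q–Q plot rather than instead of it.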

Detecting Heteroscedasticity

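Problem signs in the scale-location plot (trend line of \(\sqrt{|\text{standardized residuals}|}\) against \(\hat{Y}\)):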

Rising line → variance increases with \(\hat{Y}\) (funnel-right)
Falling line → variance decreases with \(\hat{Y}\) (funnel-left)

Formal test: White’s test (tests whether residual variance depends on any regressor or its square).

Practical fix: Use robust (heteroscedasticity-consistent) standard errors or transform \(Y\) (e.g. \(\log Y\)).
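The idea behind White's test can be sketched with an auxiliary regression in base R. This is a simplified version without cross-product terms, matching the description above; it assumes R's built-in mtcars data and a hypothetical model mpg ~ wt + hp.

```r
# Illustration only: built-in mtcars data, not the lecture's profit data
fit <- lm(mpg ~ wt + hp, data = mtcars)
e2  <- residuals(fit)^2                       # squared residuals

# Auxiliary regression: do the squared residuals depend on the regressors or their squares?
aux <- lm(e2 ~ wt + hp + I(wt^2) + I(hp^2), data = mtcars)

n       <- nobs(fit)
q       <- length(coef(aux)) - 1              # number of auxiliary regressors
lm_stat <- n * summary(aux)$r.squared         # LM statistic: n * R^2 of the auxiliary regression
p_val   <- pchisq(lm_stat, df = q, lower.tail = FALSE)

c(LM = lm_stat, df = q, p = p_val)            # small p-value: evidence of heteroscedasticity
```

Under \(H_0\) (homoscedasticity) the statistic is approximately \(\chi^2_q\) in large samples, so with only 32 observations the result here is indicative at best.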

The Classical Assumptions: Summary Table

Assumption	How to check	Symptom if violated	Consequence
A1 Linearity	Scatter plot, residuals vs. fitted	Curved pattern	Biased estimates
A2 Zero mean	Residuals vs. fitted	Systematic non-zero residuals	Biased \(\hat{\beta}_0\)
A3 Homoscedasticity	Scale-location plot, White’s test	Funnel shape	Invalid SEs & tests
A4 No autocorrelation	Durbin-Watson statistic, residuals vs. order	Pattern in residuals	Invalid SEs & tests
A6 Normality	Q–Q plot, Shapiro-Wilk test	S-curve in Q–Q	Invalid small-sample tests
Multicollinearity (not an assumption)	Correlation matrix, VIF	High pairwise \(r\), VIF \(> 5\)	Inflated SEs, unstable \(\hat{\beta}_j\)

Autocorrelation: The Durbin–Watson Test

Relevant mainly for time series data. Tests \(H_0\): no autocorrelation vs. \(H_1\): positive autocorrelation.

\[DW = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}\]

Rule of thumb: \(DW\) lies between 0 and 4; \(DW \approx 2\) → no autocorrelation; \(DW < 1.5\) → concern about positive autocorrelation.

# install.packages("lmtest") if needed
library(lmtest)
dwtest(mlr_model)


    Durbin-Watson test

data:  mlr_model
DW = 1.6312, p-value = 0.1308
alternative hypothesis: true autocorrelation is greater than 0
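In the output above, \(DW = 1.6312\) with a \(p\)-value of 0.1308, so \(H_0\) of no autocorrelation is not rejected at 5%. To see how the statistic behaves when autocorrelation is present, the sketch below computes the DW formula above by hand on simulated data with AR(1) errors. The data, the seed, and the AR coefficient 0.7 are made up for illustration.

```r
set.seed(123)                          # simulated data, for illustration only
n <- 200
x <- rnorm(n)

# AR(1) errors: u_t = 0.7 * u_{t-1} + white noise
u <- as.numeric(stats::filter(rnorm(n), filter = 0.7, method = "recursive"))
y <- 1 + 2 * x + u

fit <- lm(y ~ x)
e   <- residuals(fit)

dw <- sum(diff(e)^2) / sum(e^2)        # Durbin-Watson statistic
r1 <- sum(e[-1] * e[-n]) / sum(e^2)    # first-order autocorrelation of the residuals

c(DW = dw, approx = 2 * (1 - r1))      # DW is approximately 2 * (1 - r1)
```

The approximation \(DW \approx 2(1 - r_1)\) explains the rule of thumb: no autocorrelation (\(r_1 \approx 0\)) gives a value near 2, and strong positive autocorrelation pushes it towards 0.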

Multicollinearity and VIF

Multicollinearity: two or more regressors are highly correlated with each other.

Consequence: OLS is still unbiased and BLUE, but standard errors are inflated → \(t\)-statistics are small → we may fail to detect significant effects.

The Variance Inflation Factor (VIF) for regressor \(X_j\):

\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other regressors.

Note

Rule of thumb: VIF \(< 5\) is acceptable; VIF \(> 10\) is a serious problem.

VIF in R

# install.packages("car") if needed
library(car)
vif(mlr_model)

 revenue2 employees 
 1.110802  1.110802

Both VIF values are close to 1 → no multicollinearity problem in our example. This is expected since revenue2 and employees were generated independently.
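The VIF can also be computed directly from its definition, without the car package. A minimal sketch, assuming three hypothetical regressors taken from R's built-in mtcars data:

```r
# Illustration only: three regressors from the built-in mtcars data
X <- mtcars[, c("wt", "hp", "disp")]

# For each regressor: regress it on the others and apply 1 / (1 - R_j^2)
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  aux    <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual

# Equivalent shortcut: the diagonal of the inverse correlation matrix of the regressors
diag(solve(cor(X)))
```

With exactly two regressors, as in the lecture's example, both auxiliary regressions share the same \(R^2\) (the squared correlation between the two), which is why the two VIF values in the output above are identical.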

Putting It All Together: Model Validation Workflow

Fit the model with lm() and check summary() for coefficient signs, magnitudes and significance.
Plot residuals vs. fitted — check for linearity and homoscedasticity.
Inspect the Q–Q plot — check for normality of residuals.
Run Durbin-Watson (if data may have time ordering) — check for autocorrelation.
Compute VIF (if \(k \geq 2\)) — check for multicollinearity.
If violations are found: consider transformations (log \(Y\), log \(X\)), robust standard errors, or model re-specification (add/remove variables).

Important

Validation is not a single pass — it is an iterative dialogue between the model and the data.

Complete R Workflow

End-to-End: From Data to Validated Model

# 1. Fit the model
model <- lm(Y ~ X1 + X2, data = mydata)

# 2. Inspect results
summary(model)
confint(model, level = 0.95)

# 3. Residual diagnostics
par(mfrow = c(2, 2))
plot(model)

# 4. Autocorrelation (if needed)
library(lmtest)
dwtest(model)

# 5. Multicollinearity (if k >= 2)
library(car)
vif(model)

# 6. Predict for new data
predict(model,
        newdata = data.frame(X1 = x1_new, X2 = x2_new),
        interval = "prediction",
        level = 0.95)

Reading a Regression Table

A typical regression output table (as seen in R or SPSS):

	Estimate	Std. Error	\(t\) value	\(\Pr(>|t|)\)
(Intercept)	\(\hat{\beta}_0\)	\(s_{\hat{\beta}_0}\)	\(t_0\)	\(p_0\)
\(X_1\)	\(\hat{\beta}_1\)	\(s_{\hat{\beta}_1}\)	\(t_1\)	\(p_1\)
\(X_2\)	\(\hat{\beta}_2\)	\(s_{\hat{\beta}_2}\)	\(t_2\)	\(p_2\)

Metric	What it tells you
Residual std. error (\(s\))	Typical size of residuals; in the same units as \(Y\)
\(R^2\)	Proportion of variance in \(Y\) explained
Adjusted \(R^2\)	\(R^2\) penalised for number of regressors
\(F\)-statistic	Overall significance of the model
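These summary metrics are linked by simple arithmetic. As a quick check, the sketch below recovers the adjusted \(R^2\) via \(1 - (1-R^2)(n-1)/(n-k-1)\), and the sums of squares, from the rounded values in the lecture's summary() output (with \(n = 30\) inferred from 27 residual degrees of freedom and \(k = 2\)).

```r
# Rounded values copied from the summary() output
r2 <- 0.6543      # Multiple R-squared
s  <- 12.43       # Residual standard error
k  <- 2           # number of regressors
n  <- 30          # 27 residual df + k + 1

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)   # adjusted R^2
ssr    <- s^2 * (n - k - 1)                      # residual sum of squares, from s^2 = SSR/(n-k-1)
sst    <- ssr / (1 - r2)                         # total sum of squares, from R^2 = 1 - SSR/SST

c(adj_r2 = adj_r2, SSR = ssr, SST = sst)         # adj_r2 is close to the reported 0.6287
```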

Course Summary: Linear Regression

The Linear Regression Journey (3 Lectures)

Lecture	Topic	Key Concepts
8	Introduction & OLS	Model specification, Gauss–Markov assumptions, OLS formulas, coefficient interpretation
9	Fit & MLR	\(R^2\), \(\bar{R}^2\), Gauss–Markov theorem, Omitted Variable Bias, MLR ceteris paribus interpretation
10	Inference & Validation	\(t\)-tests, \(F\)-test, CIs, prediction intervals, residual diagnostics, VIF

The core message:

OLS gives us the best linear unbiased estimates — but only if the classical assumptions hold. Always validate your model before drawing conclusions.

Key Formulas at a Glance

	Formula
OLS slope	\(\hat{\beta}_1 = S_{XY}/S_{XX}\)
OLS intercept	\(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\)
Error variance	\(s^2 = \text{SSR}/(n-k-1)\)
\(R^2\)	\(1 - \text{SSR}/\text{SST}\)
Adjusted \(R^2\)	\(1 - (1-R^2)(n-1)/(n-k-1)\)
\(t\)-statistic	\(t_j = \hat{\beta}_j / s_{\hat{\beta}_j}\)
CI for \(\beta_j\)	\(\hat{\beta}_j \pm t_{n-k-1,\alpha/2} \cdot s_{\hat{\beta}_j}\)
\(F\)-statistic	\((\text{SSE}/k)\,/\,(\text{SSR}/(n-k-1))\)