Goodness of Fit & Multiple Regression: Application in R

Hands-on practice with pre-loaded R data

Paulo Fagandini

Lisbon Accounting and Business School

Try it online

No installation needed

Scan the QR code or open the link below to try Jupyter in your browser.

https://jupyter.org

Goodness of Fit & Multiple Regression

Learning goals

Compute and interpret \(R^2\) from a fitted model.
Compare \(R^2\) and adjusted \(\bar{R}^2\) when adding regressors.
Fit a multiple linear regression model in R.
Interpret partial slopes (ceteris paribus).
See omitted variable bias in action by comparing a short and a long regression.
Use anova() to inspect the variance decomposition (SST, SSE, SSR).

Dataset

Pre-loaded R data

We will use airquality, available in base R.

It contains daily air quality measurements in New York, May–September 1973.

data(airquality)

?airquality   # run this in your session to read the documentation

Variables

Variable	Description	Unit
`Ozone`	Ozone concentration	ppb
`Solar.R`	Solar radiation	lang
`Wind`	Wind speed	mph
`Temp`	Maximum daily temperature	°F
`Month`	Month (5–9)	—
`Day`	Day of month	—

Our research question:

What predicts ozone concentration? We start with temperature alone, then add more variables.

Inspecting the data

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Summary statistics

summary(airquality)

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NAs    :37       NAs    :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0

A note on missing values

airquality has missing values (NA) in Ozone and Solar.R.

lm() handles them automatically by dropping incomplete rows — but it is good practice to check:

# How many complete cases do we have?
sum(complete.cases(airquality))

[1] 111

# Work with complete cases only
aq <- airquality[complete.cases(airquality), ]
nrow(aq)

[1] 111

Exercise 1

Explore the data

Create a scatter plot of Ozone (Y) against Temp (X).
Describe the direction and strength of the relationship.
Does a linear model seem reasonable?

Hint

ggplot(aq, aes(x = Temp, y = Ozone)) +
  geom_point() +
  theme_minimal()

Or simply:

plot(aq$Temp, aq$Ozone)

Plot

Code

ggplot(aq, aes(x = Temp, y = Ozone)) +
  geom_point(color = iscal_burgundy, size = 2.5, alpha = 0.7) +
  theme_minimal(base_size = 14) +
  theme(panel.grid.minor = element_blank()) +
  labs(
    title = "Ozone concentration vs. Temperature",
    x     = "Temperature (°F)",
    y     = "Ozone (ppb)"
  )

Exercise 2

Fit a simple linear regression

Fit the model:

\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \varepsilon_i\]

Tasks

Estimate the model with lm().
Display the full summary().
Extract the \(R^2\).

Hint

slr <- lm(Ozone ~ Temp, data = aq)
summary(slr)
summary(slr)$r.squared

Exercise 3

Interpret the output

Using the output from Exercise 2, answer:

What does the slope \(\hat{\beta}_1\) mean in this context?
What proportion of the variation in ozone is explained by temperature alone?
Is the slope statistically significant? How do you know?

Hint

coef(slr)
summary(slr)$r.squared
summary(slr)$coefficients   # shows t-statistics and p-values

Exercise 4

The variance decomposition

R can report SST, SSE, and SSR directly.

Tasks

Use anova(slr) and examine the output.
Identify which row corresponds to SSE and which to SSR.
Verify manually that \(R^2 = \text{SSE}/\text{SST}\).

Hint

anova(slr)

# SSE and SSR are in the "Sum Sq" column
a   <- anova(slr)
SSE <- a["Temp",    "Sum Sq"]
SSR <- a["Residuals", "Sum Sq"]
SST <- SSE + SSR

SSE / SST   # should equal summary(slr)$r.squared

Exercise 5

Residual diagnostics

Plot residuals against fitted values — does the spread look constant?
Create a Q-Q plot of the residuals — do they look approximately normal?
What concerns, if any, do you notice?

Hint

par(mfrow = c(2, 2))
plot(slr)
par(mfrow = c(1, 1))

Or manually:

plot(fitted(slr), resid(slr))
abline(h = 0, lty = 2)

qqnorm(resid(slr))
qqline(resid(slr))

Exercise 6

Add a second regressor

Now fit the multiple regression model including Wind:

\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \beta_2\,\text{Wind}_i + \varepsilon_i\]

Tasks

Fit the model with lm().
Compare \(R^2\) and \(\bar{R}^2\) with the SLR from Exercise 2.
How did the coefficient on Temp change? Why?

Hint

mlr1 <- lm(Ozone ~ Temp + Wind, data = aq)
summary(mlr1)

# Compare R² and adjusted R²
summary(slr)$r.squared;      summary(slr)$adj.r.squared
summary(mlr1)$r.squared;     summary(mlr1)$adj.r.squared

Exercise 7

Add a third regressor

Now include Solar.R as well:

\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \beta_2\,\text{Wind}_i + \beta_3\,\text{Solar.R}_i + \varepsilon_i\]

Tasks

Fit and summarise the model.
Build a table comparing \(R^2\) and \(\bar{R}^2\) across all three models.
Which model would you prefer, and why?

Hint

mlr2 <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(mlr2)

# Comparison table
data.frame(
  Model    = c("SLR (Temp)", "MLR (Temp+Wind)", "MLR (Temp+Wind+Solar.R)"),
  R2       = c(summary(slr)$r.squared,
               summary(mlr1)$r.squared,
               summary(mlr2)$r.squared),
  Adj_R2   = c(summary(slr)$adj.r.squared,
               summary(mlr1)$adj.r.squared,
               summary(mlr2)$adj.r.squared)
)

Exercise 8

Omitted variable bias in action

In Exercise 6 you noticed the coefficient on Temp changed when Wind was added.

Tasks

Extract \(\hat{\beta}_{\text{Temp}}\) from the SLR (Exercise 2) and from the MLR with Wind (Exercise 6).
Are they different? In which direction?
Check whether Temp and Wind are correlated — does the sign of the correlation explain the direction of the bias?

Hint

coef(slr)["Temp"]
coef(mlr1)["Temp"]

cor(aq$Temp, aq$Wind)

Think: what is the sign of \(\beta_{\text{Wind}}\) in the true model, and what is the sign of \(\text{Corr}(\text{Temp}, \text{Wind})\)? Use the OVB table from the lecture.

Exercise 9

Prediction with MLR

Using your full model from Exercise 7, predict ozone concentration for two new days:

Day	Temp	Wind	Solar.R
A	80	10	200
B	95	5	250

Tasks

Obtain point predictions.
Obtain 95% prediction intervals.
Which day is predicted to have higher ozone? Does it make sense?

Hint

new_days <- data.frame(Temp = c(80, 95), Wind = c(10, 5), Solar.R = c(200, 250))
predict(mlr2, new_days, interval = "prediction", level = 0.95)

Cheat sheet

Core commands

# Load and clean data
data(airquality)
aq <- airquality[complete.cases(airquality), ]

# Simple linear regression
slr  <- lm(Ozone ~ Temp, data = aq)
summary(slr)

# Multiple linear regression
mlr1 <- lm(Ozone ~ Temp + Wind, data = aq)
mlr2 <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(mlr2)

# Goodness of fit
summary(slr)$r.squared
summary(slr)$adj.r.squared

# Variance decomposition
anova(slr)

# Coefficients
coef(mlr2)

# Predictions
new_days <- data.frame(Temp = c(80, 95), Wind = c(10, 5), Solar.R = c(200, 250))
predict(mlr2, new_days, interval = "prediction")

Useful checks

# Residual plots
par(mfrow = c(2, 2))
plot(mlr2)
par(mfrow = c(1, 1))

# Manual residual vs fitted
plot(fitted(mlr2), resid(mlr2))
abline(h = 0, lty = 2)

# Normality check
qqnorm(resid(mlr2))
qqline(resid(mlr2))

# Correlation between regressors (multicollinearity check)
cor(aq[, c("Temp", "Wind", "Solar.R")])

# Compare R² across models
data.frame(
  Model  = c("SLR", "MLR (2 vars)", "MLR (3 vars)"),
  R2     = c(summary(slr)$r.squared,
             summary(mlr1)$r.squared,
             summary(mlr2)$r.squared),
  Adj_R2 = c(summary(slr)$adj.r.squared,
             summary(mlr1)$adj.r.squared,
             summary(mlr2)$adj.r.squared)
)

Homework (not graded)

The full airquality dataset also has Month as a variable.

Add Month to the MLR model (treat it as numeric for now) and check whether \(\bar{R}^2\) improves.
Look at the scatter plot of Ozone vs Temp coloured by Month. Do you notice any pattern?
We will discuss in the next lecture whether controlling for Month makes conceptual sense, and what to do when a regressor is categorical rather than continuous.

ggplot(airquality[complete.cases(airquality),], aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(size = 2.5) +
  theme_minimal() +
  labs(color = "Month")