Hands-on practice with pre-loaded R data
Lisbon Accounting and Business School
Scan the QR code or open the link below to try Jupyter in your browser.
anova() to inspect the variance decomposition (SST, SSE, SSR).We will use airquality, available in base R.
It contains daily air quality measurements in New York, May–September 1973.
| Variable | Description | Unit |
|---|---|---|
Ozone |
Ozone concentration | ppb |
Solar.R |
Solar radiation | lang |
Wind |
Wind speed | mph |
Temp |
Maximum daily temperature | °F |
Month |
Month (5–9) | — |
Day |
Day of month | — |
Our research question:
What predicts ozone concentration? We start with temperature alone, then add more variables.
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NAs :37 NAs :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
airquality has missing values (NA) in Ozone and Solar.R.
lm() handles them automatically by dropping incomplete rows — but it is good practice to check:
Ozone (Y) against Temp (X).Fit the model:
\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \varepsilon_i\]
lm().summary().Using the output from Exercise 2, answer:
R can report SST, SSE, and SSR directly.
anova(slr) and examine the output.Now fit the multiple regression model including Wind:
\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \beta_2\,\text{Wind}_i + \varepsilon_i\]
lm().Temp change? Why?Now include Solar.R as well:
\[\text{Ozone}_i = \beta_0 + \beta_1\,\text{Temp}_i + \beta_2\,\text{Wind}_i + \beta_3\,\text{Solar.R}_i + \varepsilon_i\]
mlr2 <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(mlr2)
# Comparison table
data.frame(
Model = c("SLR (Temp)", "MLR (Temp+Wind)", "MLR (Temp+Wind+Solar.R)"),
R2 = c(summary(slr)$r.squared,
summary(mlr1)$r.squared,
summary(mlr2)$r.squared),
Adj_R2 = c(summary(slr)$adj.r.squared,
summary(mlr1)$adj.r.squared,
summary(mlr2)$adj.r.squared)
)In Exercise 6 you noticed the coefficient on Temp changed when Wind was added.
Wind (Exercise 6).Temp and Wind are correlated — does the sign of the correlation explain the direction of the bias?Using your full model from Exercise 7, predict ozone concentration for two new days:
| Day | Temp | Wind | Solar.R |
|---|---|---|---|
| A | 80 | 10 | 200 |
| B | 95 | 5 | 250 |
# Load and clean data
data(airquality)
aq <- airquality[complete.cases(airquality), ]
# Simple linear regression
slr <- lm(Ozone ~ Temp, data = aq)
summary(slr)
# Multiple linear regression
mlr1 <- lm(Ozone ~ Temp + Wind, data = aq)
mlr2 <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(mlr2)
# Goodness of fit
summary(slr)$r.squared
summary(slr)$adj.r.squared
# Variance decomposition
anova(slr)
# Coefficients
coef(mlr2)
# Predictions
new_days <- data.frame(Temp = c(80, 95), Wind = c(10, 5), Solar.R = c(200, 250))
predict(mlr2, new_days, interval = "prediction")# Residual plots
par(mfrow = c(2, 2))
plot(mlr2)
par(mfrow = c(1, 1))
# Manual residual vs fitted
plot(fitted(mlr2), resid(mlr2))
abline(h = 0, lty = 2)
# Normality check
qqnorm(resid(mlr2))
qqline(resid(mlr2))
# Correlation between regressors (multicollinearity check)
cor(aq[, c("Temp", "Wind", "Solar.R")])
# Compare R² across models
data.frame(
Model = c("SLR", "MLR (2 vars)", "MLR (3 vars)"),
R2 = c(summary(slr)$r.squared,
summary(mlr1)$r.squared,
summary(mlr2)$r.squared),
Adj_R2 = c(summary(slr)$adj.r.squared,
summary(mlr1)$adj.r.squared,
summary(mlr2)$adj.r.squared)
)The full airquality dataset also has Month as a variable.
Month to the MLR model (treat it as numeric for now) and check whether \(\bar{R}^2\) improves.Ozone vs Temp coloured by Month. Do you notice any pattern?Month makes conceptual sense, and what to do when a regressor is categorical rather than continuous.Statistics II — Goodness of Fit & MLR