Statistics II – Linear Regression: Exercises
This set of exercises was prepared by the teaching faculty of Statistics II at the Lisbon Accounting and Business School (ISCAL–IPL). It covers the Linear Regression Model: simple and multiple linear regression, OLS estimation, goodness of fit, and inference.
The workbook contains two kinds of question. In multiple-choice questions, select the correct option and justify your choice. In True (T) or False (F) questions, state the truth value of each statement and justify your answer.
Exercises adapted from: Custódio, S.G., Morgado, A.J., Ferreira, T. & Delgado, S. (2023). Análise de Regressão Linear: Exercícios de Aplicação Geral e Económica. Sílabas & Desafios. All data in this workbook are hypothetical.
Exercises
Exercise 1
Consider the following information from a Simple Linear Regression study:
\[\hat{y} = 4.3 - 1.25x \qquad R^2 = 0.97\]
Which of the following statements is true?
Exercise 2
In a study examining whether grades in Mathematics (\(x\)) and Statistics (\(y\)) are linearly related, the grades of 10 randomly selected students were recorded. The data yielded the following fitted model and correlation coefficient:
\[\hat{y} = 3.9 + 0.72x \qquad R = 0.94\]
2.a. The linear correlation is strong and positive.
2.b. For each additional grade point obtained in Statistics, there is an average increase of 0.72 grade points in Mathematics.
2.c. For each additional grade point obtained in Mathematics, there is an average increase of 0.72 grade points in Statistics.
2.d. In the absence of a Mathematics grade (i.e., \(x = 0\)), the estimated average Statistics grade is 3.9.
2.e. In the absence of a Statistics grade (i.e., \(y = 0\)), the estimated average Mathematics grade is 3.9.
2.f. João and Maria obtained grades of 12 and 4, respectively, in Mathematics. Having missed the Statistics assessment, their estimated Statistics grades are 12.54 and 6.78, respectively.
2.g. Any prediction made in 2.f is considered a good estimate.
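The arithmetic in 2.f can be checked with a short Python sketch. The function name `predict_statistics` is an illustrative assumption, not part of the exercise; the coefficients are those of the fitted model above. The code only evaluates the fitted line. It does not assess whether a given Mathematics grade lies within the range of the observed data, which matters for 2.g.

```python
# Fitted simple regression from Exercise 2: y_hat = 3.9 + 0.72 * x
INTERCEPT = 3.9
SLOPE = 0.72


def predict_statistics(math_grade: float) -> float:
    """Return the estimated Statistics grade for a given Mathematics grade."""
    return INTERCEPT + SLOPE * math_grade


joao = predict_statistics(12)   # João's Mathematics grade is 12
maria = predict_statistics(4)   # Maria's Mathematics grade is 4
print(f"João: {joao:.2f}, Maria: {maria:.2f}")
```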
Exercise 3
Two scatter diagrams, (A) and (B), are presented. Identify which diagram best corresponds to the statement:
“The sample correlation coefficient is approximately \(-0.75\).”
Exercise 4
Consider the following fitted regression line, where \(y\) represents box-office revenue and \(x\) represents the production budget of a film (both in millions of euros):
\[\hat{y} = 2 + 1.5x \qquad R = 0.50\]
4.a. The intercept of the model is meaningful in the context of this problem.
4.b. \(\hat{\beta}_1\) is the estimated regression coefficient that supports the following statement: “For each additional million euros spent on a film’s production, its revenue increases on average by 1.5 million euros.”
4.c. The expected revenue of a film with a production budget of €5 million is €9.5 million, and this is a credible prediction.
Exercise 5
The Statistics test grade can be estimated from the effective grade obtained, using OLS estimation. The following fitted model was obtained:
\[\widehat{\text{Test}} = -1.068 + 1.101 \cdot \text{Grade} \qquad R = 0.9013\]
5.a. Approximately 90.13% of the total variability in the Statistics test grade is explained by the regression model.
5.b. The expected Statistics test grade of a student with an effective grade of 13.2 is approximately 13.5.
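Item 5.a depends on the difference between the correlation coefficient \(R\) and the coefficient of determination \(R^2\). In simple linear regression, \(R^2\) is the square of \(R\). The sketch below, with illustrative variable names, computes both quantities needed for this exercise.

```python
# Exercise 5 reports the correlation coefficient R, not R squared.
r = 0.9013

# The share of total variability explained by a simple linear
# regression is R squared.
r_squared = r ** 2
print(f"R^2 = {r_squared:.4f}, i.e. {r_squared:.2%} of the variability")

# Point prediction for an effective grade of 13.2 (item 5.b).
test_hat = -1.068 + 1.101 * 13.2
print(f"Predicted test grade: {test_hat:.2f}")
```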
Exercise 6
Assume a linear relationship between \(x\) and \(y\). A sample linear correlation coefficient of \(R = -0.30\) was obtained. Which of the following statements is true?
Exercise 7
A low value of the coefficient of determination, \(R^2\), means that:
Exercise 8
Data on household income (in units of 100 €) and years of schooling were recorded for 25 households. The following results were obtained:
\[\hat{y} = -6.065 + 1.369x \qquad R^2 = 0.594\]
8.i. The linear correlation coefficient is:
8.ii. On average, an additional year of schooling is associated with an increase of €1.369 in household income.
8.iii. A reliable estimate of household income when the years of schooling equals 25 can be determined, and that estimate is €2,817.
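Two details are easy to miss in this exercise: the correlation coefficient takes the sign of the slope, and income is measured in units of 100 €. The following sketch makes both explicit. Variable names are illustrative assumptions, and the code says nothing about whether 25 years of schooling falls inside the observed range of the data.

```python
import math

r_squared = 0.594
intercept = -6.065
slope = 1.369  # change in income (units of 100 EUR) per extra year of schooling

# In simple linear regression, R has the same sign as the slope.
r = math.copysign(math.sqrt(r_squared), slope)

# Convert from units of 100 EUR to euros explicitly.
slope_eur = slope * 100
income_hat_eur = (intercept + slope * 25) * 100

print(f"R = {r:.4f}")
print(f"Extra year of schooling: about {slope_eur:.1f} EUR")
print(f"Fitted income at 25 years of schooling: {income_hat_eur:.0f} EUR")
```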
Exercise 9
A Multiple Linear Regression Model (MLR) was estimated to analyse the relationship between student grades (in points, the dependent variable) and the average number of daily hours of study (hours) and daily complete meals (meals). A random sample of 45 students was used. The results are summarised below.
Regression Statistics
| Statistic | Value |
|---|---|
| Multiple R | 0.7859 |
| R Square | 0.6177 |
| Adjusted R Square | 0.5995 |
| Standard Error | 2.6578 |
| Observations | 45 |
ANOVA
| Source | df | SS | MS | F | Sig. F |
|---|---|---|---|---|---|
| Regression | 2 | 479.316 | 239.658 | 33.927 | 1.70×10⁻⁹ |
| Residual | 42 | 296.684 | 7.064 | | |
| Total | 44 | 776.000 | | | |
Coefficients
| Variable | Coefficient | Std. Error | t Stat | p-value |
|---|---|---|---|---|
| Intercept | 3.5733 | 1.0656 | 3.353 | 0.0017 |
| meals | 1.7632 | 0.2952 | 5.973 | 4.34×10⁻⁷ |
| hours | 2.5489 | 0.7279 | 3.501 | 0.0011 |
9.i. In this study, the sample type is classified as:
9.ii. From the OLS estimation, it can be concluded that the expected change in a student’s grade when they eat one additional complete meal per day is 1.76 points (holding all else constant).
9.iii. The change referred to in 9.ii is a:
9.iv. For any conventional significance level \(\alpha\), the variable meals significantly affects the dependent variable.
9.v. 78.6% of the variation in student grades is jointly explained by the two regressors.
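Most of the summary statistics in the output can be rebuilt from the ANOVA sums of squares, which helps when deciding which figure answers which question. The sketch below is a minimal illustration with assumed variable names. Small differences from the table come from rounding in the reported values.

```python
import math

n, k = 45, 2                        # observations and regressors
ss_reg, ss_res = 479.316, 296.684   # sums of squares from the ANOVA table
ss_total = ss_reg + ss_res

# Goodness of fit.
r_squared = ss_reg / ss_total
multiple_r = math.sqrt(r_squared)
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Global F statistic and the regression standard error.
ms_reg = ss_reg / k
ms_res = ss_res / (n - k - 1)
f_stat = ms_reg / ms_res
std_error = math.sqrt(ms_res)

# Individual t statistic: coefficient divided by its standard error.
t_meals = 1.7632 / 0.2952

print(f"R^2 = {r_squared:.4f}, Multiple R = {multiple_r:.4f}")
print(f"Adjusted R^2 = {adj_r_squared:.4f}")
print(f"F = {f_stat:.3f}, standard error = {std_error:.4f}")
print(f"t(meals) = {t_meals:.3f}")
```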
Exercise 10
A Multiple Linear Regression Model was estimated to analyse the relationship between consumption (\(Y\)), income (\(X_1\)), and interest rate (\(X_2\)). The following output was obtained.
Model Summary
| R | R Square | Adjusted R Square | Std. Error |
|---|---|---|---|
| 0.984 | 0.969 | 0.961 | 2.101 |
ANOVA
| Source | SS | df | MS | F | Sig. |
|---|---|---|---|---|---|
| Regression | 1092.879 | 2 | 546.439 | 123.829 | .000 |
| Residual | 35.303 | 8 | 4.413 | | |
| Total | 1128.182 | 10 | | | |
Coefficients
| Variable | B | Std. Error | Beta (Std.) | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 10.604 | 56.115 | — | 0.189 | .855 |
| Income (\(X_1\)) | 0.371 | 0.133 | 0.652 | 2.797 | .023 |
| Interest rate (\(X_2\)) | −164.996 | 112.724 | −0.341 | −1.464 | .181 |
10.i. The estimated model is: \(\hat{Y}_i = 10.604 + 0.371\,X_{1i} - 164.996\,X_{2i}\).
10.ii. Using an appropriate indicator, we can conclude that the quality of the estimated model is very good, since only 3.1% of the total variability in \(Y\) is not jointly explained by the regressors.
10.iii. Using an appropriate hypothesis test, we can conclude that the estimated regression model is globally significant.
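The share of variability left unexplained and the significance decisions can be read off the output as sketched below. The significance level of 0.05 is an assumption, since the exercise does not fix one. A reported "Sig." of .000 only indicates a p-value below 0.0005, so that bound is used in its place.

```python
ss_res, ss_total = 35.303, 1128.182

# Share of total variability NOT explained by the model, i.e. 1 - R^2.
unexplained_share = ss_res / ss_total

ALPHA = 0.05  # assumed significance level; not stated in the exercise

# Reported "Sig." values. ".000" is replaced by its upper bound 0.0005.
reported_sig = {
    "F-test (global)": 0.0005,
    "Income (X1)": 0.023,
    "Interest rate (X2)": 0.181,
}

# Reject the null hypothesis when the p-value is below ALPHA.
significant = {name: p < ALPHA for name, p in reported_sig.items()}

print(f"Unexplained share: {unexplained_share:.1%}")
for name, is_significant in significant.items():
    print(f"{name}: {'significant' if is_significant else 'not significant'}")
```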
Exercise 11
A Multiple Linear Regression Model was estimated to analyse the monthly internet expenditure (\(\text{Gastos\_int}\), in €) of 149 randomly selected individuals, using the following specification:
\[\text{Gastos\_int}_i = \beta_0 + \beta_1\,\text{rend}_i + \beta_2\,\text{contas}_i + \beta_3\,\text{age}_i + \varepsilon_i\]
where:
- \(\text{Gastos\_int}\): monthly internet expenditure (€);
- \(\text{rend}\): individual’s monthly income (€);
- \(\text{contas}\): number of active internet accounts;
- \(\text{age}\): individual’s age (years).
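The OLS estimation produced the following output, in which Income, Accounts, and Age correspond to \(\text{rend}\), \(\text{contas}\), and \(\text{age}\), respectively: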
ANOVA
| Source | df | SS | MS | F | Sig. F |
|---|---|---|---|---|---|
| Regression | 3 | 4,026,353.027 | 1,342,118.0 | 15,825,970.7 | 0 |
| Residual | 146 | 12.381 | 0.085 | | |
| Total | 149 | 4,026,365.408 | | | |
Coefficients
| Variable | Coefficient | Std. Error | t Stat | p-value |
|---|---|---|---|---|
| Intercept | −44.4746 | 0.0709 | −626.852 | 2.62×10⁻²⁵² |
| Income | 0.1600 | 0.0000235 | 6823.294 | ≈0 |
| Accounts | 0.5538 | 0.00835 | 66.292 | 7.15×10⁻¹¹¹ |
| Age | 0.0254 | 0.000859 | 29.595 | 1.46×10⁻⁶³ |
11.i. Interpreting the coefficient associated with income: if monthly income increases by €100, the estimated average internet expenditure increases by approximately €16, holding all else constant.
11.ii. For any conventional significance level, the regressor age is statistically significant.
11.iii. The \(p\)-value associated with the global significance test (F-test) allows us to conclude that:
11.iv. The point estimate of expected monthly internet expenditure for an individual earning €1,500, with 3 active accounts, aged 45 years, is:
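A point estimate from a multiple regression is obtained by plugging each regressor value into the fitted equation. The sketch below does this for the individual described in 11.iv and also scales the income coefficient as in 11.i. The dictionary layout and the function name `predict_expenditure` are illustrative assumptions; the coefficients are taken from the output table.

```python
# Estimated coefficients from the Exercise 11 output table.
coefficients = {
    "Intercept": -44.4746,
    "Income": 0.1600,
    "Accounts": 0.5538,
    "Age": 0.0254,
}


def predict_expenditure(income: float, accounts: float, age: float) -> float:
    """Return the fitted monthly internet expenditure in euros."""
    return (
        coefficients["Intercept"]
        + coefficients["Income"] * income
        + coefficients["Accounts"] * accounts
        + coefficients["Age"] * age
    )


# Individual earning 1,500 EUR, with 3 active accounts, aged 45 (item 11.iv).
estimate = predict_expenditure(income=1500, accounts=3, age=45)

# Effect of a 100 EUR rise in income, all else constant (item 11.i).
income_effect_100 = coefficients["Income"] * 100

print(f"Point estimate: {estimate:.2f} EUR")
print(f"Effect of +100 EUR income: {income_effect_100:.2f} EUR")
```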
Exercise 12
For respondents from the Lisbon region, a multiple linear regression model was estimated to predict gross income (in €) as a function of age, highest educational level attained, and the income level considered fair for the work performed (in €). The following output was obtained (partial):
| | R | R Square | Adjusted R Square |
|---|---|---|---|
| Model | 0.964 | 0.929 | 0.925 |
Coefficients
| Variable | B | Beta (Std.) | Sig. |
|---|---|---|---|
| (Constant) | −7.872 | — | — |
| Educational level | 5.532 | 0.104 | 0.579 |
| Fair income level | 0.688 | 0.952 | 0.000 |
| Age | 0.391 | 0.029 | 0.912 |
12.a. Write down the estimated OLS model.
\[\widehat{\text{GrossIncome}}_{\text{Lisbon}} = -7.872 + 5.532\,\text{EduLevel} + 0.688\,\text{FairIncome} + 0.391\,\text{Age}\]
12.b. Since the percentage of total data variability explained by the fitted model is approximately 96.4%, it can be concluded that the degree of linear association between the variables is high.
12.c. The independent variable with the greatest contribution to explaining the dependent variable is fair income level, as it presents the highest absolute value of the standardized beta coefficient.
12.d. Although the appropriate test confirms that the model is globally significant, two of the three regressors are individually non-significant.