Basic Concepts, Test Procedure, Errors, and p-value
Lisbon Accounting and Business School – Polytechnic University of Lisbon
2025-04-14
So far in this course, we have covered:
Now we ask a different question:
Given a claim about a population parameter, does the sample data support or contradict that claim?
In practice, decisions must be made under uncertainty:
Hypothesis testing provides a formal, rigorous framework for answering these questions using sample data.
The Key Shift
In estimation, we ask: “What is the value of the parameter?”
In hypothesis testing, we ask: “Is the parameter equal to (or greater than, or less than) a specific value?”
We already have the building blocks:
Every hypothesis test involves the same fundamental elements:
Null Hypothesis
The null hypothesis, denoted \(H_0\), represents the status quo — the current belief, the manufacturer’s specification, or the default assumption.
It always contains an equality sign (\(=\), \(\leq\), or \(\geq\)).
Examples:
Important: We never “accept” \(H_0\). We either reject it or fail to reject it.
Alternative Hypothesis
The alternative hypothesis, denoted \(H_1\) (or \(H_a\)), represents what we are trying to find evidence for. It is the “research hypothesis.”
It contains a strict inequality (\(\neq\), \(>\), or \(<\)).
The form of \(H_1\) determines the type of test:
| \(H_0\) | \(H_1\) | Type of test |
|---|---|---|
| \(\mu = \mu_0\) | \(\mu \neq \mu_0\) | Two-tailed (bilateral) |
| \(\mu \leq \mu_0\) | \(\mu > \mu_0\) | Right-tailed (unilateral right) |
| \(\mu \geq \mu_0\) | \(\mu < \mu_0\) | Left-tailed (unilateral left) |
A common source of confusion: which claim goes in \(H_0\) and which in \(H_1\)?
Rule of thumb:
Think of it as a trial: \(H_0\) is “innocent” (the default), and \(H_1\) is “guilty.” The burden of proof lies on \(H_1\).
Scenario: A food company states that each package contains, on average, 500 g of product. A consumer protection agency suspects the company is underfilling.
What are \(H_0\) and \(H_1\)?
\[H_0: \mu \geq 500 \quad \text{vs.} \quad H_1: \mu < 500\]
This is a left-tailed test.
Scenario: An engineer suspects that a machine is overfilling packages beyond the nominal 100 g.
\[H_0: \mu \leq 100 \quad \text{vs.} \quad H_1: \mu > 100\]
This is a right-tailed test.
Scenario: A quality control inspector wants to check whether the mean weight differs from the specification of 100 g (could be above or below).
\[H_0: \mu = 100 \quad \text{vs.} \quad H_1: \mu \neq 100\]
This is a two-tailed (bilateral) test.
Test Statistic
The test statistic is a random variable, computed from the sample, whose distribution is known under \(H_0\).
It measures how far the sample result is from what \(H_0\) predicts.
You already know these from interval estimation:
| Parameter | Conditions | Test Statistic | Distribution under \(H_0\) |
|---|---|---|---|
| \(\mu\) | \(\sigma\) known | \(Z_0 = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}\) | \(N(0,1)\) |
| \(\mu\) | \(\sigma\) unknown | \(T_0 = \frac{\bar{X}-\mu_0}{S'/\sqrt{n}}\) | \(t_{(n-1)}\) |
| \(\sigma^2\) | Normal pop. | \(Q_0 = \frac{(n-1)S'^2}{\sigma_0^2}\) | \(\chi^2_{(n-1)}\) |
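The three statistics in the table can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the input numbers are assumptions chosen for the example.

```python
import math

def z_statistic(x_bar, mu_0, sigma, n):
    """Z_0 for a mean when the population standard deviation sigma is known."""
    return (x_bar - mu_0) / (sigma / math.sqrt(n))

def t_statistic(x_bar, mu_0, s_corr, n):
    """T_0 for a mean when sigma is unknown; s_corr is the corrected sample std. dev. S'."""
    return (x_bar - mu_0) / (s_corr / math.sqrt(n))

def q_statistic(s_corr_sq, sigma_0_sq, n):
    """Q_0 for a variance; s_corr_sq is the corrected sample variance S'^2."""
    return (n - 1) * s_corr_sq / sigma_0_sq

# Illustrative inputs only (hypothetical values chosen for this sketch)
z_obs = z_statistic(x_bar=497, mu_0=500, sigma=8, n=36)
t_obs = t_statistic(x_bar=101, mu_0=100, s_corr=3, n=20)
q_obs = q_statistic(s_corr_sq=9, sigma_0_sq=6, n=20)
print(round(z_obs, 2), round(t_obs, 2), round(q_obs, 2))
```

Each function only computes the observed value; which distribution to compare it against still depends on the conditions listed in the table.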
Significance Level
The significance level \(\alpha\) is the maximum probability of rejecting \(H_0\) when \(H_0\) is actually true.
\[\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})\]
Common values: \(\alpha = 0.01\), \(\alpha = 0.05\), \(\alpha = 0.10\).
\(\alpha\) is chosen before collecting data and performing the test.
Critical Region
The critical region (or rejection region) is the set of values of the test statistic for which we reject \(H_0\).
Its boundary is determined by the significance level \(\alpha\) and the type of test.
The shaded areas represent the rejection region. If the observed test statistic falls in the shaded area, we reject \(H_0\).
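As a rough sketch of how the boundary depends on \(\alpha\) and on the type of test, the following Python snippet computes the critical values for a test statistic that is \(N(0,1)\) under \(H_0\). It uses `statistics.NormalDist` from the standard library; the function name and the labels `"left"`, `"right"` and `"two"` are naming choices made for this example only.

```python
from statistics import NormalDist

def critical_values(alpha, tail):
    """Boundary point(s) of the rejection region for a N(0,1) test statistic.

    tail is "left", "right" or "two" (labels chosen for this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)       # reject if z_obs <= -z_alpha
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)   # reject if z_obs >= z_alpha
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)     # alpha is split between both tails
        return (-z, z)
    raise ValueError("tail must be 'left', 'right' or 'two'")

for tail in ("left", "right", "two"):
    print(tail, [round(c, 3) for c in critical_values(0.05, tail)])
```

At \(\alpha = 0.05\) this gives the familiar table values \(\pm 1.645\) for one-tailed tests and \(\pm 1.96\) for the two-tailed test.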
Step-by-step procedure for a hypothesis test
When we make a decision based on sample data, we can make two kinds of errors:
| Decision | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Do not reject \(H_0\) | Correct decision | Type II Error (\(\beta\)) |
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Correct decision |
Type I Error
A Type I error occurs when we reject \(H_0\) even though \(H_0\) is true.
\[\alpha = P(\text{Type I Error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})\]
This is exactly the significance level \(\alpha\).
Analogy: Convicting an innocent person in a trial.
Type II Error
A Type II error occurs when we fail to reject \(H_0\) even though \(H_0\) is false (i.e., \(H_1\) is true).
\[\beta = P(\text{Type II Error}) = P(\text{do not reject } H_0 \mid H_1 \text{ is true})\]
Analogy: Acquitting a guilty person in a trial.
\(\beta\) depends on:
Power
The power of a test is the probability of correctly rejecting \(H_0\) when \(H_1\) is true:
\[\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})\]
A good test has high power (close to 1).
In practice:
For a fixed sample size \(n\):
The only way to reduce both \(\alpha\) and \(\beta\) simultaneously is to increase the sample size \(n\).
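The trade-off can be made concrete with a small numerical sketch. The snippet below computes \(\beta\) for a left-tailed \(Z\) test against one specific alternative mean. The values of \(\mu_0\), \(\mu_1\), \(\sigma\), and the two sample sizes are hypothetical and chosen only for illustration; the function name is likewise an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist

def power_left_tailed_z(mu_0, mu_1, sigma, n, alpha):
    """Power of the left-tailed Z test of H0: mu >= mu_0 when the true mean is mu_1."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    # We reject when Z_0 <= -z_alpha. If the true mean is mu_1,
    # Z_0 is Normal with mean (mu_1 - mu_0) / (sigma / sqrt(n)) and variance 1.
    shift = (mu_0 - mu_1) / (sigma / sqrt(n))
    return std_normal.cdf(-z_alpha + shift)

# Hypothetical setting: mu_0 = 500, true mean mu_1 = 497, sigma = 8
betas = {}
for n in (36, 100):
    for alpha in (0.05, 0.01):
        beta = 1 - power_left_tailed_z(500, 497, 8, n, alpha)
        betas[(n, alpha)] = beta
        print(f"n={n:3d}  alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

For fixed \(n\), lowering \(\alpha\) from 0.05 to 0.01 raises \(\beta\). Moving from \(n = 36\) to \(n = 100\) lowers \(\beta\) at both levels, and in this setting \(n = 100\) with \(\alpha = 0.01\) has a smaller \(\beta\) than \(n = 36\) with \(\alpha = 0.05\).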
Think of a hypothesis test as a fire alarm:
A bottling company fills bottles with a nominal volume of \(\mu_0 = 500\) mL. The filling process is known to have a standard deviation of \(\sigma = 8\) mL. A quality inspector takes a random sample of \(n = 36\) bottles and obtains a sample mean of \(\bar{x} = 497\) mL.
At the 5% significance level, is there evidence that the machine is underfilling?
Step 1: State the hypotheses.
The inspector suspects underfilling, i.e., the mean is less than 500.
\[H_0: \mu \geq 500 \quad \text{vs.} \quad H_1: \mu < 500\]
This is a left-tailed test.
Step 2: Significance level.
\[\alpha = 0.05\]
Step 3: Test statistic.
The population standard deviation \(\sigma = 8\) is known, so:
\[Z_0 = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \underset{\text{under } H_0}{\sim} N(0, 1)\]
Step 4: Critical region.
For a left-tailed test at \(\alpha = 0.05\):
\[\text{Rejection region: } \left]-\infty\, ;\, -z_{\alpha}\right] = \left]-\infty\, ;\, -1.645\right]\]
We reject \(H_0\) if \(z_{obs} \leq -1.645\).
Step 5: Compute the observed value.
\[z_{obs} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{497 - 500}{8 / \sqrt{36}} = \frac{-3}{8/6} = \frac{-3}{1.333} \approx -2.25\]
Step 6: Decision.
\[z_{obs} = -2.25 \leq -1.645\]
The observed value falls in the rejection region.
\(\Rightarrow\) We reject \(H_0\) at the 5% significance level.
Step 7: Conclusion.
At the 5% significance level, there is sufficient statistical evidence to conclude that the machine is underfilling the bottles (i.e., the mean volume is less than 500 mL).
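The steps above can be checked numerically. The following is a minimal Python sketch using only the standard library; the variable names are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist

# Data from the bottling example
mu_0, sigma, n, x_bar, alpha = 500, 8, 36, 497, 0.05

# Step 4: left-tailed critical value, -z_alpha
critical_value = NormalDist().inv_cdf(alpha)

# Step 5: observed value of the test statistic
z_obs = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 6: reject H0 if z_obs falls in ]-inf ; -z_alpha]
reject_h0 = z_obs <= critical_value

print(f"critical value = {critical_value:.3f}")
print(f"z_obs = {z_obs:.2f}")
print("reject H0" if reject_h0 else "do not reject H0")
```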
For each scenario, write \(H_0\) and \(H_1\), and state the type of test:
Note: In (c), the researcher does not have a directional suspicion — it could go either way — hence the two-tailed test.
What we covered so far
Before the break, we used the critical region approach:
But this only tells us “reject” or “do not reject” for a specific \(\alpha\).
What if someone asks: “Would you still reject \(H_0\) at \(\alpha = 0.01\)?”
The \(p\)-value answers this question for all possible significance levels at once.
\(p\)-value
The \(p\)-value is the probability, computed under \(H_0\), of observing a test statistic value as extreme as, or more extreme than, the value actually observed.
It is the smallest significance level at which we would reject \(H_0\).
Informally: the \(p\)-value measures how surprised we should be by the sample data, if \(H_0\) were true.
The formula depends on the type of test:
| Type of test | \(p\)-value |
|---|---|
| Left-tailed (\(H_1: \theta < \theta_0\)) | \(p = P(T \leq t_{obs} \mid H_0)\) |
| Right-tailed (\(H_1: \theta > \theta_0\)) | \(p = P(T \geq t_{obs} \mid H_0)\) |
| Two-tailed (\(H_1: \theta \neq \theta_0\)) | \(p = 2 \times P(T \geq |t_{obs}| \mid H_0)\) |
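Here \(T\) denotes the test statistic (\(Z_0\), \(T_0\), \(Q_0\), etc.) and \(t_{obs}\) is its observed value. The two-tailed formula as written assumes a distribution that is symmetric about zero under \(H_0\), such as \(N(0,1)\) or Student's \(t\); it does not apply in this form to \(\chi^2\)-based statistics such as \(Q_0\).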
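For a test statistic that is \(N(0,1)\) under \(H_0\), the three formulas can be sketched as a single Python function. The function name and the tail labels are assumptions of this example, and the input values are illustrative.

```python
from statistics import NormalDist

def p_value_z(z_obs, tail):
    """p-value for a test statistic that is N(0,1) under H0.

    tail is "left", "right" or "two" (labels chosen for this sketch).
    """
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z_obs)                   # P(Z <= z_obs)
    if tail == "right":
        return 1 - std_normal.cdf(z_obs)               # P(Z >= z_obs)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z_obs)))    # 2 * P(Z >= |z_obs|)
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Illustrative observed values
p_left = p_value_z(-2.25, "left")
p_two = p_value_z(2.10, "two")
print(f"left-tailed, z_obs = -2.25: p = {p_left:.4f}")
print(f"two-tailed,  z_obs =  2.10: p = {p_two:.4f}")
```

Small differences in the fourth decimal place relative to hand calculations are expected, because printed tables round \(\Phi\) to four decimals.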
\(p\)-value decision rule
Given a significance level \(\alpha\):
This is equivalent to the critical region approach, but more informative:
The shaded blue areas represent the \(p\)-value: the probability of obtaining a result as extreme as \(t_{obs}\), under \(H_0\).
Recall: \(H_0: \mu \geq 500\) vs. \(H_1: \mu < 500\), \(z_{obs} = -2.25\), left-tailed test.
\[p\text{-value} = P(Z \leq z_{obs} \mid H_0) = P(Z \leq -2.25)\]
\[= \Phi(-2.25) = 1 - \Phi(2.25) = 1 - 0.9878 = 0.0122\]
Interpretation: If \(H_0\) were true at its boundary value (\(\mu = 500\)), the probability of observing a sample mean as low as (or lower than) 497 would be only 1.22%.
Decision (at \(\alpha = 0.05\)): Since \(p = 0.0122 < 0.05 = \alpha\), we reject \(H_0\).
Decision (at \(\alpha = 0.01\)): Since \(p = 0.0122 > 0.01 = \alpha\), we do not reject \(H_0\).
A random sample of \(n = 20\) observations is drawn from a Normal population. The sample mean is \(\bar{x} = 101\) and the corrected sample standard deviation is \(s' = 3\).
Test \(H_0: \mu \leq 100\) vs. \(H_1: \mu > 100\).
Test statistic (\(\sigma\) unknown, Normal population):
\[T_0 = \frac{\bar{X} - \mu_0}{S'/\sqrt{n}} \underset{\text{under } H_0}{\sim} t_{(n-1)} = t_{(19)}\]
Observed value:
\[t_{obs} = \frac{101 - 100}{3/\sqrt{20}} = \frac{1}{3/4.472} = \frac{1}{0.6708} \approx 1.49\]
\(p\)-value (right-tailed test):
\[p = P(T > 1.49 \mid H_0)\]
Using the Student's \(t\) table (Table 7) with 19 degrees of freedom:
We look for the row \(\nu = 19\). We find that \(t_{0.10,(19)} = 1.328\) and \(t_{0.05,(19)} = 1.729\).
Since \(1.328 < 1.49 < 1.729\), we have:
\[0.05 < p < 0.10\]
Decision (at \(\alpha = 0.05\)): Since \(p > 0.05\), we do not reject \(H_0\).
Conclusion: At the 5% level, there is not sufficient evidence to conclude that \(\mu > 100\).
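The table only brackets the \(p\)-value, but it can also be approximated numerically. The sketch below integrates the Student's \(t\) density with composite Simpson's rule, using only the Python standard library. This is a substitute technique chosen for illustration, not the method used in the course; in practice a statistics library (for example `scipy.stats.t.sf`) would be the usual choice. The function names are assumptions of this example.

```python
from math import gamma, pi, sqrt

def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t_obs, nu, steps=2000):
    """Approximate P(T >= t_obs) using composite Simpson's rule (steps must be even)."""
    if t_obs < 0:
        # By symmetry, P(T >= -a) = 1 - P(T >= a)
        return 1 - t_upper_tail(-t_obs, nu, steps)
    h = t_obs / steps
    total = t_pdf(0, nu) + t_pdf(t_obs, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    # The density is symmetric, so P(T >= 0) = 0.5; subtract the area on [0, t_obs]
    return 0.5 - total * h / 3

t_obs = (101 - 100) / (3 / sqrt(20))
p = t_upper_tail(t_obs, nu=19)
print(f"t_obs = {t_obs:.2f}, p = {p:.4f}")
```

The result should fall inside the interval \(0.05 < p < 0.10\) obtained from the table.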
When using the Student's \(t\) or \(\chi^2\) tables, we typically cannot compute the exact \(p\)-value. Instead, we bracket it:
Strategy: Find the two table entries that “sandwich” \(t_{obs}\) (or \(q_{obs}\)).
Example: \(t_{obs} = 2.861\), \(\nu = 19\) (right-tailed).
From Table 7: \(t_{0.005,(19)} = 2.861\).
So \(p = 0.005\) (exactly, in this case).
Tip: If \(t_{obs}\) matches a table entry exactly, the \(p\)-value is the corresponding tail probability. Otherwise, state the interval.
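The bracketing strategy can be sketched in Python as a lookup over one row of the table. The dictionary below holds the usual upper-tail critical values for \(\nu = 19\); the entries for 0.025 and 0.01 are standard table values that do not appear in the text above, and it is an assumption that Table 7 prints them identically. The function name is hypothetical.

```python
# Upper-tail critical values t_{a,(19)} from a standard Student's t table
T_TABLE_NU_19 = {0.10: 1.328, 0.05: 1.729, 0.025: 2.093, 0.01: 2.539, 0.005: 2.861}

def bracket_p_value(t_obs, table_row):
    """Bracket the right-tailed p-value P(T >= t_obs) using one table row.

    Returns (lower, upper). Equal bounds mean an exact table match.
    None means the table gives no bound on that side.
    """
    lower, upper = None, None
    # Walk through the critical values from smallest to largest
    for tail_prob, crit in sorted(table_row.items(), key=lambda kv: kv[1]):
        if t_obs == crit:
            return (tail_prob, tail_prob)
        if crit < t_obs:
            upper = tail_prob   # t_obs is beyond this entry, so p < tail_prob
        else:
            lower = tail_prob   # t_obs is before this entry, so p > tail_prob
            break
    return (lower, upper)

print(bracket_p_value(1.49, T_TABLE_NU_19))
print(bracket_p_value(2.861, T_TABLE_NU_19))
print(bracket_p_value(3.2, T_TABLE_NU_19))
```

A result such as `(None, 0.005)` means the observed value lies beyond the last column of the table, so all that can be stated is \(p < 0.005\).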
A manufacturer specifies that the mean diameter of a component is \(\mu_0 = 25\) mm. The population variance is known: \(\sigma^2 = 4\). A random sample of \(n = 49\) components has \(\bar{x} = 25.6\) mm.
Test \(H_0: \mu = 25\) vs. \(H_1: \mu \neq 25\). Compute the \(p\)-value.
Test statistic (\(\sigma\) known):
\[Z_0 = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)\]
Observed value:
\[z_{obs} = \frac{25.6 - 25}{2/\sqrt{49}} = \frac{0.6}{2/7} = \frac{0.6}{0.2857} \approx 2.10\]
\(p\)-value (two-tailed):
\[p = 2 \times P(Z \geq |z_{obs}|) = 2 \times P(Z \geq 2.10)\]
\[= 2 \times [1 - \Phi(2.10)] = 2 \times (1 - 0.9821) = 2 \times 0.0179 = 0.0358\]
Decision: \(p = 0.0358 < 0.05\), so we reject \(H_0\) at \(\alpha = 0.05\). There is evidence that \(\mu \neq 25\) mm.
Both approaches always give the same conclusion for a given \(\alpha\):
| Approach | Reject \(H_0\) if… |
|---|---|
| Critical region | \(t_{obs}\) falls in the rejection region |
| \(p\)-value | \(p\text{-value} \leq \alpha\) |
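The equivalence of the two rules can be checked numerically. The sketch below applies both to a two-tailed \(Z\) test at several significance levels; the observed value \(z_{obs} = 2.10\) is used purely as an illustration, and the variable names are choices made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()
z_obs = 2.10  # illustrative observed value of a two-tailed Z test

p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))

decisions = []
for alpha in (0.10, 0.05, 0.01):
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    reject_by_region = abs(z_obs) >= z_crit   # critical region approach
    reject_by_p = p_value <= alpha            # p-value approach
    decisions.append((alpha, reject_by_region, reject_by_p))
    print(f"alpha={alpha:.2f}  z_crit={z_crit:.3f}  "
          f"region: {reject_by_region}  p-value: {reject_by_p}")
```

Both columns agree at every level: \(H_0\) is rejected at 10% and 5%, but not at 1%.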
In practice, many practitioners prefer the \(p\)-value because:
The \(p\)-value is NOT:
The \(p\)-value IS:
The probability of observing data as extreme as (or more extreme than) what was observed, assuming \(H_0\) is true.
While the decision rule (\(p \leq \alpha \Rightarrow\) reject) is clear-cut, some authors provide informal guidelines:
| \(p\)-value | Evidence against \(H_0\) |
|---|---|
| \(p > 0.10\) | No evidence |
| \(0.05 < p \leq 0.10\) | Weak (marginal) evidence |
| \(0.01 < p \leq 0.05\) | Moderate evidence |
| \(0.001 < p \leq 0.01\) | Strong evidence |
| \(p \leq 0.001\) | Very strong evidence |
Source: adapted from Newbold, Carlson & Thorne.
These are informal guidelines, not strict rules. The choice of \(\alpha\) remains a decision of the researcher.
There is a direct relationship between a two-tailed hypothesis test and a confidence interval:
CI–Test Equivalence
At significance level \(\alpha\), we reject \(H_0: \mu = \mu_0\) (two-tailed) if and only if \(\mu_0\) does not belong to the \((1-\alpha)\times 100\%\) confidence interval for \(\mu\).
This makes intuitive sense: if the hypothesized value \(\mu_0\) is outside the plausible range given by the CI, then the data contradicts \(H_0\).
Recall Example 5: \(\mu_0 = 25\), \(\sigma = 2\), \(n = 49\), \(\bar{x} = 25.6\).
The 95% confidence interval for \(\mu\) is:
\[\bar{x} \pm z_{0.025} \cdot \frac{\sigma}{\sqrt{n}} = 25.6 \pm 1.96 \times \frac{2}{7} = 25.6 \pm 0.56\]
\[\text{CI} = [25.04\, ;\, 26.16]\]
Since \(\mu_0 = 25 \notin [25.04\, ;\, 26.16]\), we reject \(H_0: \mu = 25\) at \(\alpha = 0.05\).
This is consistent with our earlier finding (\(p = 0.0358 < 0.05\)).
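The same check can be sketched in Python, computing the confidence interval and the two-tailed \(p\)-value side by side. Variable names are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma, n, x_bar, alpha = 25, 2, 49, 25.6, 0.05
std_normal = NormalDist()

# (1 - alpha) confidence interval for mu, sigma known
z_half = std_normal.inv_cdf(1 - alpha / 2)
margin = z_half * sigma / sqrt(n)
ci_low, ci_high = x_bar - margin, x_bar + margin
reject_by_ci = not (ci_low <= mu_0 <= ci_high)

# Two-tailed test of H0: mu = mu_0 at level alpha
z_obs = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 2 * (1 - std_normal.cdf(abs(z_obs)))
reject_by_p = p_value <= alpha

print(f"CI = [{ci_low:.2f} ; {ci_high:.2f}]  reject by CI: {reject_by_ci}")
print(f"p-value = {p_value:.4f}  reject by p-value: {reject_by_p}")
```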
A manufacturer claims that light bulbs last, on average, at least 1000 hours. A random sample of 25 bulbs from a Normal population gives \(\bar{x} = 980\) hours and \(s' = 40\) hours.
At \(\alpha = 0.05\), is there evidence against the manufacturer’s claim?
Step 1: \(H_0: \mu \geq 1000\) vs. \(H_1: \mu < 1000\) (left-tailed)
Step 2: \(\alpha = 0.05\)
Step 3: \(\sigma\) unknown, Normal population, so:
\[T_0 = \frac{\bar{X} - \mu_0}{S'/\sqrt{n}} \underset{\text{under } H_0}{\sim} t_{(24)}\]
Step 4: Left-tailed, \(\alpha = 0.05\), \(\nu = 24\)
\[\text{Rejection region: } \left]-\infty\,;\, -t_{0.05,(24)}\right] = \left]-\infty\,;\, -1.711\right]\]
(From Table 7: \(t_{0.05,(24)} = 1.711\))
Step 5:
\[t_{obs} = \frac{980 - 1000}{40/\sqrt{25}} = \frac{-20}{8} = -2.50\]
Step 6: \(t_{obs} = -2.50 < -1.711\) → falls in the rejection region.
Decision: Reject \(H_0\).
\[p = P(T \leq -2.50) = P(T \geq 2.50)\]
From Table 7, \(\nu = 24\): \(t_{0.01,(24)} = 2.492\) and \(t_{0.005,(24)} = 2.797\).
Since \(2.492 < 2.50 < 2.797\), we have \(0.005 < p < 0.01\).
Conclusion: At the 5% significance level, we conclude that the mean lifetime is less than 1000 hours. The evidence against the manufacturer's claim is strong (\(0.005 < p < 0.01\)).
A food processing company fills cereal boxes labeled as 500 g. The variance of the filling process is known to be \(\sigma^2 = 25 \text{ g}^2\). A random sample of \(n = 64\) boxes yields \(\bar{x} = 498\) g.
a) Test whether the mean weight differs from 500 g at \(\alpha = 0.05\).
b) Compute the \(p\)-value.
a) \(H_0: \mu = 500\) vs. \(H_1: \mu \neq 500\) (two-tailed)
\(\sigma = 5\) is known, \(n = 64\):
\[Z_0 = \frac{\bar{X} - 500}{\sigma/\sqrt{n}} \sim N(0,1) \qquad z_{obs} = \frac{498 - 500}{5/\sqrt{64}} = \frac{-2}{5/8} = \frac{-2}{0.625} = -3.20\]
Rejection region (two-tailed, \(\alpha = 0.05\)):
\[\left]-\infty;\, -z_{0.025}\right] \cup \left[z_{0.025};\, +\infty\right[ = \left]-\infty;\, -1.96\right] \cup \left[1.96;\, +\infty\right[\]
Since \(z_{obs} = -3.20 < -1.96\), we reject \(H_0\).
b) \(p\)-value (two-tailed):
\[p = 2 \times P(Z \geq |{-3.20}|) = 2 \times P(Z \geq 3.20)\]
\[= 2 \times [1 - \Phi(3.20)] = 2 \times (1 - 0.9993) = 2 \times 0.0007 = 0.0014\]
Strong evidence against \(H_0\): the \(p\)-value (\(0.0014\)) is smaller than every conventional \(\alpha\) (\(0.01\), \(0.05\), \(0.10\)).
From a random sample of \(n = 20\) observations drawn from a Normal population, one obtains \(\bar{x} = 101\) and \(s' = 3\).
Consider the test: \(H_0: \mu \leq 100\) vs. \(H_1: \mu > 100\).
a) For \(\alpha = 0.05\), determine the rejection region.
b) If, for another sample of the same size, \(t_{obs} = 2.861\), what is the \(p\)-value?
a) Right-tailed test, \(T_0 \sim t_{(19)}\):
\[\text{Rejection region: } \left[t_{0.05,(19)}\,;\, +\infty\right[ = \left[1.729\,;\, +\infty\right[\]
(Table 7: \(t_{0.05,(19)} = 1.729\))
b) For \(t_{obs} = 2.861\), right-tailed:
\[p = P(T > 2.861)\]
From Table 7, \(\nu = 19\): \(t_{0.005,(19)} = 2.861\).
So \(p = 0.005\).
Since \(p = 0.005\) is below every conventional significance level (\(0.01\), \(0.05\), \(0.10\)), we reject \(H_0\) at all of them.
Key takeaways — Lecture 6
We will apply the testing framework to specific cases:
Make sure to review the relevant sections in Newbold (Chapter 9).
These slides are a free adaptation of the course material for Estatística II by Prof. Teresa Ferreira and Prof. Sandra Custódio from the Lisbon Accounting and Business School — Polytechnic University of Lisbon.
Primary reference: Newbold, P., Carlson, W. & Thorne, B. — Statistics for Business and Economics, Global Edition.
Statistics II — Parametric Hypothesis Tests