In this unit, we investigate how statistics from samples vary, how we can model this variation, and how we can use sampling distributions to estimate unknown population parameters. This unit builds on the concept of randomness and prepares us for statistical inference.
A sampling distribution is the distribution of a statistic (like a sample mean or sample proportion) from all possible samples of the same size taken from the same population.
For example, if you repeatedly take samples of size 30 from a population and calculate the mean each time, the distribution of all those sample means is the sampling distribution of the sample mean for \( n = 30 \).
Each dot in a sampling distribution graph represents the value of the statistic from one sample.
Sampling distributions help us understand what values of the statistic are likely or unlikely to occur by chance.
Key takeaway: A statistic is a random variable with its own distribution—the sampling distribution—centered around the true parameter (if unbiased) and with predictable spread (if random).
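The idea above can be seen directly by simulation. The sketch below builds a hypothetical population (uniform values between 0 and 100, an assumption for illustration), repeatedly draws samples of size 30, and records each sample mean; the collection of those means approximates the sampling distribution and centers near the population mean.

```python
import random
import statistics

# Hypothetical population: 10,000 values uniform on [0, 100]
random.seed(1)
population = [random.uniform(0, 100) for _ in range(10_000)]

# Draw many samples of size 30 and record each sample mean
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

# The sample means cluster around the population mean (about 50)
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_means), 1))
```

Each entry in `sample_means` plays the role of one dot in a sampling distribution graph.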
Even when we sample from the same population, different samples produce different statistics. This is called sampling variability.
Different samples yield different values for statistics like sample means (\( \bar{x} \)) or sample proportions (\( \hat{p} \)).
This variation is predictable when we know the sampling distribution.
Key Idea: The statistics we use to estimate population parameters will vary from sample to sample. Understanding this variability helps us judge how trustworthy our estimates are.
Tip: Larger sample sizes generally lead to less variability in the statistic.
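One way to see the tip in action is to simulate sampling distributions at several sample sizes and compare their spreads. This sketch assumes a hypothetical normal population with \( \mu = 100 \), \( \sigma = 20 \); the standard deviation of the sample means shrinks roughly like \( \sigma / \sqrt{n} \).

```python
import random
import statistics

random.seed(2)
# Hypothetical population: normal with mean 100, sd 20
population = [random.gauss(100, 20) for _ in range(50_000)]

def spread_of_sample_means(n, reps=1_000):
    """Standard deviation of sample means for samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

# Spread shrinks as the sample size grows
for n in (4, 16, 64):
    print(n, round(spread_of_sample_means(n), 2))
```

Quadrupling the sample size roughly halves the variability of the statistic, consistent with the \( \sqrt{n} \) in the denominator of the spread formulas later in this unit.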
A point estimate is a single value (statistic) that we use to estimate a population parameter.
Unbiased Estimator: A statistic is unbiased if the mean of its sampling distribution equals the true parameter.
Biased Estimator: A statistic is biased if its sampling distribution is centered at a value different from the parameter.
Common unbiased estimators:
\( \bar{x} \) for population mean \( \mu \)
\( \hat{p} \) for population proportion \( p \)
Bias vs. Variability:
Bias refers to accuracy (centered or not).
Variability refers to consistency (spread of the statistic).
Example: If you're using a faulty measuring device, your measurements may all be off in the same direction; that's bias. If your device is accurate on average but inconsistent, that's high variability.
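The measuring-device example can be simulated. The numbers below (a true value of 50, a +2 systematic offset, noise levels of 0.1 and 5) are hypothetical choices for illustration: one device is biased but consistent, the other is unbiased but variable.

```python
import random
import statistics

random.seed(3)
true_value = 50.0

# Biased device: systematically reads 2 units high, very little noise
biased = [true_value + 2 + random.gauss(0, 0.1) for _ in range(1_000)]

# Unbiased but noisy device: centered correctly, large spread
noisy = [true_value + random.gauss(0, 5) for _ in range(1_000)]

print(round(statistics.mean(biased), 1))   # centered near 52, not 50: bias
print(round(statistics.mean(noisy), 1))    # centered near 50: unbiased
print(round(statistics.stdev(biased), 1))  # small spread: low variability
print(round(statistics.stdev(noisy), 1))   # large spread: high variability
```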
Before using any formulas or normal approximations, you must check the following conditions.
For a sample proportion \( \hat{p} \):
Random: The data must come from a random sample or randomized experiment.
10% Condition: When sampling without replacement, the sample size \( n \) must be no more than 10% of the population (\( n \le 0.10N \)).
Large Counts: \( np \ge 10 \) and \( n(1 - p) \ge 10 \)
For a sample mean \( \bar{x} \):
Random: The data must come from a random sample or randomized experiment.
10% Condition: \( n \le 0.10N \)
Normal/Large Sample: If the population is normal, any \( n \) is okay. If not, use \( n \ge 30 \) (Central Limit Theorem).
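The condition checks are simple inequalities, so they translate directly into code. The two helper functions below are illustrative (not from the original text); each assumes the Random condition has already been verified by how the data were collected.

```python
def check_proportion_conditions(n, p, N):
    """Check the 10% and Large Counts conditions for p-hat.

    Assumes the Random condition is verified separately.
    """
    ten_percent = n <= 0.10 * N
    large_counts = n * p >= 10 and n * (1 - p) >= 10
    return ten_percent and large_counts

def check_mean_conditions(n, N, population_normal=False):
    """Check the 10% and Normal/Large Sample conditions for x-bar."""
    ten_percent = n <= 0.10 * N
    normal_ok = population_normal or n >= 30
    return ten_percent and normal_ok

print(check_proportion_conditions(100, 0.60, 1_000_000))          # passes
print(check_proportion_conditions(100, 0.05, 1_000_000))          # fails: np = 5 < 10
print(check_mean_conditions(25, 10_000, population_normal=True))  # passes
print(check_mean_conditions(25, 10_000))                          # fails: n < 30, not normal
```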
The sampling distribution of a sample proportion \( \hat{p} \) describes how \( \hat{p} \) varies in repeated random sampling.
When conditions are met, the distribution of \( \hat{p} \) is approximately normal:
Shape: Approximately normal if \( np \ge 10 \) and \( n(1 - p) \ge 10 \)
Center: Mean \( \mu_{\hat{p}} = p \)
Spread: Standard deviation \( \sigma_{\hat{p}} = \sqrt{ \dfrac{p(1 - p)}{n} } \), only valid if sampling is random and \( n \le 10\%\text{ of the population} \)
Example: If 60% of a large population support a policy, and we take an SRS of 100 people, then:
\( \mu_{\hat{p}} = 0.60 \)
\( \sigma_{\hat{p}} = \sqrt{ \dfrac{0.6 \times 0.4}{100} } = 0.049 \)
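The arithmetic in the example above can be checked with a few lines of code:

```python
import math

# Example values from the text: p = 0.60, SRS of n = 100
p, n = 0.60, 100

mu_p_hat = p
sigma_p_hat = math.sqrt(p * (1 - p) / n)

print(mu_p_hat)               # 0.6
print(round(sigma_p_hat, 3))  # 0.049
```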
The sampling distribution of the sample mean \( \bar{x} \) describes how \( \bar{x} \) varies in repeated samples.
Center: \( \mu_{\bar{x}} = \mu \)
Spread: \( \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \), valid if \( n \le 10\%\text{ of the population} \)
Shape:
If population is normal, \( \bar{x} \) is normal for any \( n \)
If not normal, \( \bar{x} \) is approximately normal if \( n \ge 30 \) (Central Limit Theorem)
Example: If a population has mean \( \mu = 100 \) and standard deviation \( \sigma = 20 \), and we take samples of size 25:
\( \mu_{\bar{x}} = 100 \)
\( \sigma_{\bar{x}} = \dfrac{20}{\sqrt{25}} = 4 \)
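Again, the center and spread in this example follow directly from the formulas:

```python
import math

# Example values from the text: mu = 100, sigma = 20, n = 25
mu, sigma, n = 100, 20, 25

mu_x_bar = mu
sigma_x_bar = sigma / math.sqrt(n)

print(mu_x_bar)     # 100
print(sigma_x_bar)  # 4.0
```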
The Central Limit Theorem (CLT) is one of the most important ideas in statistics.
CLT Statement: When sampling from any population with mean \( \mu \) and standard deviation \( \sigma \), the sampling distribution of \( \bar{x} \) approaches a normal distribution as the sample size \( n \) increases, regardless of the shape of the population distribution. In practice, \( n \ge 30 \) is the usual rule of thumb for the approximation to be reasonable.
Implications:
Allows us to use normal probability models for inference even when the population is skewed.
Explains why sample means are more normally distributed than the individual values they're computed from.
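A simulation sketch of the CLT: starting from a strongly right-skewed population (an exponential distribution, chosen here as an assumption for illustration), the distribution of sample means becomes noticeably more symmetric as \( n \) grows. As a rough skewness proxy, the gap between mean and median (relative to the spread) shrinks toward zero.

```python
import random
import statistics

random.seed(4)
# Hypothetical strongly right-skewed population (exponential, mean 1)
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_mean_dist(n, reps=2_000):
    """Simulate the sampling distribution of x-bar for samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Skewness proxy: (mean - median) / stdev; closer to 0 means more symmetric
skews = {}
for n in (2, 30):
    means = sample_mean_dist(n)
    skews[n] = (statistics.mean(means) - statistics.median(means)) / statistics.stdev(means)
    print(n, round(skews[n], 2))
```

Even though the population itself is far from normal, the \( n = 30 \) sampling distribution is already close to symmetric.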
Tip: Always check the conditions before applying CLT: Random, 10%, and Normal/Large sample.
Simulate Sampling Distribution: Use randNorm(mean, sd, n) on your TI calculator to simulate samples from a normal distribution.
Find Probabilities and Percentiles: Use normalcdf to find probabilities from a normal distribution, and invNorm for percentiles.
Standardizing: To calculate a z-score for a statistic:
Standard deviation of sample proportion:
\( \sigma_{\hat{p}} = \sqrt{ \dfrac{p(1 - p)}{n} } \)
Standard deviation of sample mean:
\( \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \)
Z-score:
\( z = \dfrac{\text{sample statistic} - \text{parameter}}{\text{standard deviation of statistic}} \)
Problem: A population has mean \( \mu = 90 \) and standard deviation \( \sigma = 12 \). What is the probability that the sample mean of a random sample of size 36 is greater than 92?
Solution:
\( \mu_{\bar{x}} = 90 \)
\( \sigma_{\bar{x}} = \dfrac{12}{\sqrt{36}} = 2 \)
\( z = \dfrac{92 - 90}{2} = 1 \)
\( P(\bar{x} > 92) = P(z > 1) = 0.1587 \)
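The whole solution can be reproduced in code. The sketch below standardizes the sample mean and computes \( P(Z > z) \) from the standard normal CDF, written here via the error function (playing the role of normalcdf on a calculator).

```python
import math

# Problem values: population mean 90, sd 12, sample size 36
mu, sigma, n = 90, 12, 36
x_bar = 92

sigma_x_bar = sigma / math.sqrt(n)  # 12 / 6 = 2.0
z = (x_bar - mu) / sigma_x_bar      # (92 - 90) / 2 = 1.0

# P(Z > z): standard normal CDF expressed via the error function
p = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z)            # 1.0
print(round(p, 4))  # 0.1587
```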