In this unit, we investigate how statistics from samples vary, how we can model this variation, and how we can use sampling distributions to estimate unknown population parameters. This unit builds on the concept of randomness and prepares us for statistical inference.
A sampling distribution is the distribution of a statistic (like a sample mean or sample proportion) from all possible samples of the same size taken from the same population.
For example, if you repeatedly take samples of size 30 from a population and calculate the mean each time, the distribution of all those sample means is the sampling distribution of the sample mean for \( n = 30 \).
Each dot in a sampling distribution graph represents the value of the statistic from one sample.
Sampling distributions help us understand what values of the statistic are likely or unlikely to occur by chance.
Key takeaway: A statistic is a random variable with its own distribution—the sampling distribution—centered around the true parameter (if unbiased) and with predictable spread (if random).
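The idea above can be seen directly by simulation. The sketch below builds a hypothetical population (uniform values between 0 and 100, an assumption for illustration), repeatedly draws samples of size 30, and records each sample mean; the collection of those means approximates the sampling distribution and centers near the population mean.

```python
import random
import statistics

# Hypothetical population: 10,000 values uniform on [0, 100]
random.seed(1)
population = [random.uniform(0, 100) for _ in range(10_000)]

# Draw many samples of size 30 and record each sample mean
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

# The sample means cluster around the population mean (about 50)
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_means), 1))
```

Each entry in `sample_means` plays the role of one dot in a sampling distribution graph.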
Even when we sample from the same population, different samples produce different statistics. This is called sampling variability.
Different samples yield different values for statistics like sample means (\( \bar{x} \)) or sample proportions (\( \hat{p} \)).
This variation is predictable when we know the sampling distribution.
Key Idea: The statistics we use to estimate population parameters will vary from sample to sample. Understanding this variability helps us judge how trustworthy our estimates are.
Tip: Larger sample sizes generally lead to less variability in the statistic.
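One way to see the tip in action is to simulate sampling distributions at several sample sizes and compare their spreads. This sketch assumes a hypothetical normal population with \( \mu = 100 \), \( \sigma = 20 \); the standard deviation of the sample means shrinks roughly like \( \sigma / \sqrt{n} \).

```python
import random
import statistics

random.seed(2)
# Hypothetical population: normal with mean 100, sd 20
population = [random.gauss(100, 20) for _ in range(50_000)]

def spread_of_sample_means(n, reps=1_000):
    """Standard deviation of sample means for samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

# Spread shrinks as the sample size grows
for n in (4, 16, 64):
    print(n, round(spread_of_sample_means(n), 2))
```

Quadrupling the sample size roughly halves the variability of the statistic, consistent with the \( \sqrt{n} \) in the denominator of the spread formulas later in this unit.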
A point estimate is a single value (statistic) that we use to estimate a population parameter.
Unbiased Estimator: A statistic is unbiased if the mean of its sampling distribution equals the true parameter.
Biased Estimator: A statistic is biased if its sampling distribution is centered at a value different from the parameter.
Common unbiased estimators:
\( \bar{x} \) for population mean \( \mu \)
\( \hat{p} \) for population proportion \( p \)
Bias vs. Variability:
Bias refers to accuracy (centered or not).
Variability refers to consistency (spread of the statistic).
Example: If you're using a faulty measuring device, your measurements may all be off in the same direction; that's bias. If your device is accurate on average but inconsistent, that's high variability.
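The measuring-device example can be simulated. The numbers below (a true value of 50, a +2 systematic offset, noise levels of 0.1 and 5) are hypothetical choices for illustration: one device is biased but consistent, the other is unbiased but variable.

```python
import random
import statistics

random.seed(3)
true_value = 50.0

# Biased device: systematically reads 2 units high, very little noise
biased = [true_value + 2 + random.gauss(0, 0.1) for _ in range(1_000)]

# Unbiased but noisy device: centered correctly, large spread
noisy = [true_value + random.gauss(0, 5) for _ in range(1_000)]

print(round(statistics.mean(biased), 1))   # centered near 52, not 50: bias
print(round(statistics.mean(noisy), 1))    # centered near 50: unbiased
print(round(statistics.stdev(biased), 1))  # small spread: low variability
print(round(statistics.stdev(noisy), 1))   # large spread: high variability
```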
Before using any formulas or normal approximations, you must check the following conditions.
For a sample proportion \( \hat{p} \):
Random: The data must come from a random sample or randomized experiment.
10% Condition: When sampling without replacement, the sample size \( n \) must be no more than 10% of the population (\( n \le 0.10N \)).
Large Counts: \( np \ge 10 \) and \( n(1 - p) \ge 10 \)
For a sample mean \( \bar{x} \):
Random: The data must come from a random sample or randomized experiment.
10% Condition: \( n \le 0.10N \)
Normal/Large Sample: If the population is normal, any \( n \) is okay. If not, use \( n \ge 30 \) (Central Limit Theorem).
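The condition checks are simple inequalities, so they translate directly into code. The two helper functions below are illustrative (not from the original text); each assumes the Random condition has already been verified by how the data were collected.

```python
def check_proportion_conditions(n, p, N):
    """Check the 10% and Large Counts conditions for p-hat.

    Assumes the Random condition is verified separately.
    """
    ten_percent = n <= 0.10 * N
    large_counts = n * p >= 10 and n * (1 - p) >= 10
    return ten_percent and large_counts

def check_mean_conditions(n, N, population_normal=False):
    """Check the 10% and Normal/Large Sample conditions for x-bar."""
    ten_percent = n <= 0.10 * N
    normal_ok = population_normal or n >= 30
    return ten_percent and normal_ok

print(check_proportion_conditions(100, 0.60, 1_000_000))          # passes
print(check_proportion_conditions(100, 0.05, 1_000_000))          # fails: np = 5 < 10
print(check_mean_conditions(25, 10_000, population_normal=True))  # passes
print(check_mean_conditions(25, 10_000))                          # fails: n < 30, not normal
```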
The sampling distribution of a sample proportion \( \hat{p} \) describes how \( \hat{p} \) varies in repeated random sampling.
When conditions are met, the distribution of \( \hat{p} \) is approximately normal:
Shape: Approximately normal if \( np \ge 10 \) and \( n(1 - p) \ge 10 \)
Center: Mean \( \mu_{\hat{p}} = p \)
Spread: Standard deviation \( \sigma_{\hat{p}} = \sqrt{ \dfrac{p(1 - p)}{n} } \), only valid if sampling is random and \( n \le 10\%\text{ of the population} \)
Example: If 60% of a large population support a policy, and we take an SRS of 100 people, then:
\( \mu_{\hat{p}} = 0.60 \)
\( \sigma_{\hat{p}} = \sqrt{ \dfrac{0.6 \times 0.4}{100} } = 0.049 \)
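The arithmetic in the example above can be checked with a few lines of code:

```python
import math

# Example values from the text: p = 0.60, SRS of n = 100
p, n = 0.60, 100

mu_p_hat = p
sigma_p_hat = math.sqrt(p * (1 - p) / n)

print(mu_p_hat)               # 0.6
print(round(sigma_p_hat, 3))  # 0.049
```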
The sampling distribution of the sample mean \( \bar{x} \) describes how \( \bar{x} \) varies in repeated samples.
Center: \( \mu_{\bar{x}} = \mu \)
Spread: \( \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \), valid if \( n \le 10\%\text{ of the population} \)
Shape:
If population is normal, \( \bar{x} \) is normal for any \( n \)
If not normal, \( \bar{x} \) is approximately normal if \( n \ge 30 \) (Central Limit Theorem)
Example: If a population has mean \( \mu = 100 \) and standard deviation \( \sigma = 20 \), and we take samples of size 25:
\( \mu_{\bar{x}} = 100 \)
\( \sigma_{\bar{x}} = \dfrac{20}{\sqrt{25}} = 4 \)
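Again, the center and spread in this example follow directly from the formulas:

```python
import math

# Example values from the text: mu = 100, sigma = 20, n = 25
mu, sigma, n = 100, 20, 25

mu_x_bar = mu
sigma_x_bar = sigma / math.sqrt(n)

print(mu_x_bar)     # 100
print(sigma_x_bar)  # 4.0
```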
The Central Limit Theorem (CLT) is one of the most important ideas in statistics.
CLT Statement: When sampling from any population with mean \( \mu \) and standard deviation \( \sigma \), the sampling distribution of \( \bar{x} \) approaches a normal distribution as the sample size \( n \) increases, regardless of the shape of the population distribution. In practice, \( n \ge 30 \) is the usual rule of thumb for the approximation to be reasonable.
Implications:
Allows us to use normal probability models for inference even when the population is skewed.
Explains why sample means are more normally distributed than the individual values they're computed from.
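A simulation sketch of the CLT: starting from a strongly right-skewed population (an exponential distribution, chosen here as an assumption for illustration), the distribution of sample means becomes noticeably more symmetric as \( n \) grows. As a rough skewness proxy, the gap between mean and median (relative to the spread) shrinks toward zero.

```python
import random
import statistics

random.seed(4)
# Hypothetical strongly right-skewed population (exponential, mean 1)
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_mean_dist(n, reps=2_000):
    """Simulate the sampling distribution of x-bar for samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

# Skewness proxy: (mean - median) / stdev; closer to 0 means more symmetric
skews = {}
for n in (2, 30):
    means = sample_mean_dist(n)
    skews[n] = (statistics.mean(means) - statistics.median(means)) / statistics.stdev(means)
    print(n, round(skews[n], 2))
```

Even though the population itself is far from normal, the \( n = 30 \) sampling distribution is already close to symmetric.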
Tip: Always check the conditions before applying CLT: Random, 10%, and Normal/Large sample.
Simulate Sampling Distribution: Use randNorm(mean, sd, n) on your TI calculator to simulate samples from a normal distribution.
Find Probabilities and Percentiles: Use normalcdf to find probabilities from a normal distribution, and invNorm for percentiles.
Standardizing: To calculate a z-score for a statistic:
Standard deviation of sample proportion:
\( \sigma_{\hat{p}} = \sqrt{ \dfrac{p(1 - p)}{n} } \)
Standard deviation of sample mean:
\( \sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} \)
Z-score:
\( z = \dfrac{\text{sample statistic} - \text{parameter}}{\text{standard deviation of statistic}} \)
Problem: A population has mean \( \mu = 90 \) and standard deviation \( \sigma = 12 \). What is the probability that the sample mean of a random sample of size 36 is greater than 92?
Solution:
\( \mu_{\bar{x}} = 90 \)
\( \sigma_{\bar{x}} = \dfrac{12}{\sqrt{36}} = 2 \)
\( z = \dfrac{92 - 90}{2} = 1 \)
\( P(\bar{x} > 92) = P(z > 1) = 0.1587 \)
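The whole solution can be reproduced in code. The sketch below standardizes the sample mean and computes \( P(Z > z) \) from the standard normal CDF, written here via the error function (playing the role of normalcdf on a calculator).

```python
import math

# Problem values: population mean 90, sd 12, sample size 36
mu, sigma, n = 90, 12, 36
x_bar = 92

sigma_x_bar = sigma / math.sqrt(n)  # 12 / 6 = 2.0
z = (x_bar - mu) / sigma_x_bar      # (92 - 90) / 2 = 1.0

# P(Z > z): standard normal CDF expressed via the error function
p = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(z)            # 1.0
print(round(p, 4))  # 0.1587
```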