Unit 1: Exploring One-Variable Data

In this unit, we examine methods for displaying and summarizing one-variable data. Common representations include dot plots, histograms, and box plots. These visual tools help us understand the distribution of data in terms of center, shape, and spread.

Individuals and Variables

Individuals are the objects described by a set of data (e.g., people, animals, cars, schools).

A variable is any characteristic that can take different values from one individual to another (e.g., height, hair color, age).

Categorical vs. Quantitative Variables

Categorical Variable: Places individuals into groups or categories (e.g., eye color, car brand).

Quantitative Variable: Takes numerical values for which arithmetic operations make sense (e.g., age, weight).

Note: If averaging the values gives a meaningful result, the variable is probably quantitative. Numbers used only as labels (e.g., zip codes) are categorical.

Distribution of a Variable

The distribution of a variable tells us what values the variable takes and how often it takes them.

Example: 100 students were asked about their favorite subject:
Math: 35, Science: 25, English: 20, History: 20.
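
The counts above can be turned into a relative-frequency table, which is one common way to present the distribution of a categorical variable. The sketch below is only illustrative; the variable names (`counts`, `relative_freq`) are assumptions and not part of the original notes.

```python
# Counts taken from the favorite-subject example above.
counts = {"Math": 35, "Science": 25, "English": 20, "History": 20}

total = sum(counts.values())  # number of students surveyed

# Relative frequency = category count / total count.
relative_freq = {subject: n / total for subject, n in counts.items()}

for subject, n in counts.items():
    print(f"{subject:<8} count={n:>3}  proportion={relative_freq[subject]:.2f}")
```

The proportions sum to 1, which is a quick check that every individual was placed in exactly one category.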

Describing the Overall Pattern of a Distribution – Use SOCS

S – Shape: Describe the form of the distribution (e.g., symmetric, skewed left/right, bimodal).

O – Outliers: Identify any values that stand out from the pattern.

C – Center: Estimate the "typical" value (use mean or median).

S – Spread: Describe the variability of the data (e.g., the range from minimum to maximum, the IQR, or the standard deviation).

Note: Always describe the distribution in context of the question.

Example:

The distribution of tips is skewed to the right and unimodal (S: Shape). There is a gap from $15 to $20 and a possible outlier between $20 and $20.25 (O: Outliers). The distribution is centered at the median, which is between $2.50 and $5.00 (C: Center). There is a large spread and high variability, with a range of $22.50 (S: Spread).

Skewed Distributions

A distribution is skewed when it is not symmetric: instead of forming a bell-shaped curve like a normal distribution, the data stretch farther on one side, producing a longer tail.

Right-skewed (positively skewed): The tail is longer on the right side. Most values are clustered on the left, but some high outliers stretch the data to the right.

Left-skewed (negatively skewed): The tail is longer on the left side. Most values are higher, but a few low values pull the distribution to the left.

Outliers

An outlier is a value that falls far from the rest of the data. Use the 1.5 × IQR Rule to identify outliers. A value is an outlier if it is:

less than Q1 − 1.5 × IQR
OR
greater than Q3 + 1.5 × IQR

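Example (data: 2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52, with Q1 = 28, Q3 = 40, and IQR = 12):

Low outlier cutoff: \( Q_1 - 1.5 \times \text{IQR} = 28 - 1.5(12) = 10 \) ⇒ 2 is an outlier because it is below 10

High outlier cutoff: \( Q_3 + 1.5 \times \text{IQR} = 40 + 1.5(12) = 58 \) ⇒ no high outlier because the maximum, 52, is below 58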

Outliers fall outside the expected range and can distort summary statistics.
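
The rule can be sketched in a few lines of Python. The function names `outlier_fences` and `find_outliers` are hypothetical helpers introduced for illustration, and the quartiles are passed in by hand (Q1 = 28, Q3 = 40, matching the example data set 2, 14, 28, ..., 52 used in these notes).

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs given by the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Return the values that fall strictly outside the two cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
low, high = outlier_fences(28, 40)
outliers = find_outliers(data, 28, 40)

print(low, high)   # 10.0 58.0
print(outliers)    # [2]
```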

The Mean (\( \bar{x} \))

To find the mean of a set of observations, add all the values and divide by the number of observations.

If the \( n \) observations are \( x_1, x_2, \ldots, x_n \), the mean is:

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{or simply} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The Median (M)

The median M is the midpoint of the distribution.

Arrange all observations in order from smallest to largest.

If \( n \) is odd, the median is the middle value, located at position \( \frac{n+1}{2} \) in the ordered list.

If \( n \) is even, the median is the average of the two middle values, located at positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \) in the ordered list.
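
As a minimal sketch of the two definitions, the mean and median can be computed as follows. The sample lists are hypothetical values chosen only to show one odd-sized and one even-sized case.

```python
def mean(values):
    """Add all the values and divide by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Return the midpoint of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average the values at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [3, 1, 4, 1, 5]       # hypothetical data, n = 5
even_sample = [3, 1, 4, 1, 5, 9]   # hypothetical data, n = 6

print(mean(odd_sample), median(odd_sample))   # 2.8 3
print(median(even_sample))                    # 3.5
```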

The Five-Number Summary

The five-number summary consists of:

Minimum – Q1 – Median – Q3 – Maximum

It describes the spread and center of a data set in a simple way, moving from smallest to largest values.

The Quartiles (Q1 and Q3)

Arrange the data in increasing order and find the median.

Q1: Median of the lower half (values below the overall median).

Q3: Median of the upper half (values above the overall median).

Example: 2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52

Five-number summary: Min = 2, Q1 = 28, Med = 32, Q3 = 40, Max = 52
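
The quartile procedure above (median of the lower half, median of the upper half) can be sketched as shown below. The function name `five_number_summary` is an illustrative assumption. Statistical software often uses other quartile conventions, so library results may differ slightly from this hand method.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values before the median's position
    upper = ordered[half + n % 2:]    # values after the median's position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
summary = five_number_summary(data)
print(summary)  # (2, 28, 32, 40, 52)
```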

The Interquartile Range (IQR)

The IQR is the distance between the first and third quartiles:

IQR = Q3 − Q1

Example: \( \text{IQR} = 40 - 28 = 12 \)

Boxplot

A boxplot is a graph of the five-number summary, with outliers plotted individually.

  • A central box spans the quartiles.
  • A line inside the box marks the median.
  • Observations more than 1.5 × IQR outside the central box are plotted individually as outliers.
  • Lines (whiskers) extend from the box to the smallest and largest observations, not including outliers.
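
The pieces of a boxplot can be computed before drawing anything. The sketch below assumes the quartiles are already known (Q1 = 28 and Q3 = 40 for the example data set in these notes); `boxplot_parts` is a hypothetical helper name.

```python
def boxplot_parts(data, q1, q3):
    """Return (whisker_low, whisker_high, outliers) for a boxplot."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Values beyond the fences are plotted individually as outliers.
    outliers = [x for x in data if x < low_fence or x > high_fence]
    # Whiskers reach the smallest and largest values that are NOT outliers.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return min(inside), max(inside), outliers


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
whisker_low, whisker_high, outliers = boxplot_parts(data, 28, 40)

print(whisker_low, whisker_high)  # 14 52
print(outliers)                   # [2]
```

Here the lower whisker stops at 14 rather than 2, because 2 is flagged as an outlier and drawn as a separate point.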

The Standard Deviation (\( s \) or \( s_x \))

The standard deviation measures the typical distance of the observations from their mean. It is the square root of the variance, which is the sum of the squared deviations from the mean divided by \( n - 1 \). The formula for the standard deviation of \( n \) observations \( x_1, x_2, \ldots, x_n \) is:

\[ s = \sqrt{ \frac{ \sum (x_i - \bar{x})^2 }{ n - 1 } } \]

Calculation of the Standard Deviation

Consider a data set of \( n = 5 \) observations with a mean of 4.8, for which the squared deviations from the mean sum to 22.8.

So the standard deviation is: \[ s = \sqrt{ \frac{22.8}{5 - 1} } = \sqrt{ \frac{22.8}{4} } = \sqrt{5.7} \approx 2.387 \]
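
The same calculation can be traced step by step in Python. The data values are not listed in these notes, so the list below is a hypothetical set chosen only because it has the stated mean (4.8) and sum of squared deviations (22.8). It is not necessarily the original data.

```python
import math
import statistics

data = [2, 3, 5, 6, 8]  # hypothetical values: mean 4.8, squared deviations sum to 22.8

n = len(data)
x_bar = sum(data) / n                           # mean
squared_devs = [(x - x_bar) ** 2 for x in data]
sum_sq = sum(squared_devs)                      # numerator of the formula
s = math.sqrt(sum_sq / (n - 1))                 # divide by n - 1, then take the square root

print(round(x_bar, 1), round(sum_sq, 1), round(s, 3))  # 4.8 22.8 2.387

# The standard library's sample standard deviation uses the same n - 1 formula.
assert math.isclose(s, statistics.stdev(data))
```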

Types of Graphs

For Categorical Variables

  • Bar Graph: Displays frequencies or proportions in vertical or horizontal bars.
  • Pie Chart: Shows proportions as sectors of a circle (only when data forms a whole).
  • Segmented Bar Chart: Splits each bar into stacked segments showing the proportion in each category, which makes it easy to compare groups.

For Quantitative Variables

  • Dotplot: Dots stacked above values. Great for small data sets.
  • Stemplot: Preserves data values and sorts them visually.
  • Histogram: Groups data into intervals (bins). Great for large data sets.
  • Boxplot: Displays the five-number summary and outliers.
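
To illustrate how a stemplot preserves the data values, here is a small text-based sketch. It assumes non-negative whole numbers, with the tens digit as the stem and the ones digit as the leaf; the `stemplot` function is a hypothetical helper, not a standard library routine.

```python
from collections import defaultdict


def stemplot(values):
    """Return the lines of a text stemplot (stem = tens, leaf = ones)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
for line in stemplot(data):
    print(line)
# 0 | 2
# 1 | 4
# 2 | 8 9
# 3 | 0 2 3 4
# 4 | 0 2
# 5 | 2
```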

Types of Distributions

  • Symmetric: Left and right sides are mirror images.
  • Skewed Right (positive): Right tail is longer.
  • Skewed Left (negative): Left tail is longer.
  • Bimodal: Two peaks.
  • Uniform: All bars are roughly the same height.
  • Bell-Shaped: Symmetric with a single central mound (approximately normal).

Always describe using SOCS: Shape, Outliers, Center, Spread.

Z-Score (Standardized Score)

A z-score measures how many standard deviations a data point is from the mean.

Use this formula:

\[ z = \frac{x - \bar{x}}{s} \]

  • If \( z > 0 \), the value is above the mean.
  • If \( z < 0 \), the value is below the mean.
  • If \( |z| \geq 2 \), the value may be considered unusual.

Example: A student scores 85 on a test with mean 75 and standard deviation 5.

\[ z = \frac{85 - 75}{5} = \frac{10}{5} = 2 \]

→ This score is 2 standard deviations above the mean.
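
The formula translates directly into code. In this sketch, `z_score` and `is_unusual` are illustrative names, and the cutoff of 2 follows the rule of thumb stated above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def is_unusual(z, cutoff=2):
    """Flag values at least `cutoff` standard deviations from the mean."""
    return abs(z) >= cutoff


z = z_score(85, 75, 5)
print(z)              # 2.0
print(is_unusual(z))  # True
```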

The Empirical Rule (68–95–99.7 Rule)

In a normal distribution (symmetric and bell-shaped):

  • About 68% of values fall within 1 standard deviation of the mean
  • About 95% fall within 2 standard deviations
  • About 99.7% fall within 3 standard deviations
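
These percentages can be checked numerically with Python's `statistics.NormalDist` class (available in Python 3.8 and later). The helper name `proportion_within` is an assumption made for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

Because z-scores rescale any normal distribution to the standard normal, the same proportions apply for any mean and standard deviation.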

Choosing the Best Summary Statistic

  • If the distribution is roughly symmetric with no outliers: Use the mean and standard deviation.
  • If the distribution is skewed or has outliers: Use the median and IQR, which are less affected by extreme values.

💡 Tip: Always match the summary statistic to the shape of the distribution!
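
A quick comparison shows why the median is preferred when outliers are present. The sketch below uses the example data set from these notes (2, 14, 28, ..., 52), in which 2 is a low outlier, and measures how much each statistic moves when that value is removed. The variable names are illustrative.

```python
from statistics import mean, median

data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
without_outlier = [x for x in data if x != 2]  # drop the low outlier

mean_shift = abs(mean(without_outlier) - mean(data))
median_shift = abs(median(without_outlier) - median(data))

print(round(mean(data), 2), round(mean(without_outlier), 2))  # 30.55 33.4
print(median(data), median(without_outlier))                  # 32 32.5
print(round(mean_shift, 2), median_shift)                     # 2.85 0.5
```

Removing a single extreme value moves the mean by almost 3 units but the median by only 0.5.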