In this unit, we examine methods for displaying and summarizing one-variable data. Common representations include dot plots, histograms, and box plots. These visual tools help us understand the distribution of data in terms of center, shape, and spread.
Individuals are the objects described by a set of data (e.g., people, animals, cars, schools).
A variable is any characteristic of an individual that can take different values from one individual to another (e.g., height, hair color, age).
Categorical Variable: Places individuals into groups or categories (e.g., eye color, car brand).
Quantitative Variable: Takes numerical values for which arithmetic operations make sense (e.g., age, weight).
The distribution of a variable tells us what values the variable takes and how often it takes them.
S – Shape: Describe the form of the distribution (e.g., symmetric, skewed left/right, bimodal).
O – Outliers: Identify any values that stand out from the pattern.
C – Center: Estimate the "typical" value (use mean or median).
S – Spread: Describe the variability of the data (e.g., the range from minimum to maximum, the IQR, or the standard deviation).
Skewed distributions occur when the data is not symmetrical. Instead of having a bell-shaped curve like a normal distribution, the data stretches more on one side.
Right-skewed (positively skewed): The tail is longer on the right side. Most values are clustered on the left, but a few high values stretch the distribution to the right.
Left-skewed (negatively skewed): The tail is longer on the left side. Most values are higher, but a few low values pull the distribution to the left.
An outlier is a value that falls far from the rest of the data. Under the 1.5 × IQR Rule, a value is flagged as an outlier if it is:
less than Q1 − 1.5 × IQR
OR
greater than Q3 + 1.5 × IQR
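Example (data: 2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52, for which Q1 = 28, Q3 = 40, and IQR = 12):
Low outlier cutoff: \( Q_1 - 1.5 \times \text{IQR} = 28 - 1.5(12) = 10 \) ⇒ 2 is an outlier, because 2 < 10
High outlier cutoff: \( Q_3 + 1.5 \times \text{IQR} = 40 + 1.5(12) = 58 \) ⇒ no high outlier, because the maximum, 52, is below 58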
Outliers fall outside the expected range and can distort summary statistics.
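The rule can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function names `outlier_cutoffs` and `find_outliers` are made up for this example, and the quartiles Q1 = 28 and Q3 = 40 are taken as given from the eleven-value data set used in these notes.

```python
def outlier_cutoffs(q1, q3):
    """Return the (low, high) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall strictly outside the two cutoffs."""
    low, high = outlier_cutoffs(q1, q3)
    return [x for x in values if x < low or x > high]


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
low, high = outlier_cutoffs(28, 40)
print(low, high)                    # 10.0 58.0
print(find_outliers(data, 28, 40))  # [2]
```

A value exactly equal to a cutoff is not flagged here, since the rule is stated with "less than" and "greater than".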
To find the mean of a set of observations, add all the values and divide by the number of observations.
If the \( n \) observations are \( x_1, x_2, \ldots, x_n \), the mean is:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{or simply} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
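As a quick check of the formula, the mean can be computed directly in Python. The function name `mean` is chosen for this sketch only; the data set is the eleven-value example used elsewhere in these notes.

```python
def mean(values):
    """Add all the observations and divide by how many there are."""
    if len(values) == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
x_bar = mean(data)       # the sum is 336 and n is 11
print(round(x_bar, 2))   # 30.55
```

Python's standard library offers `statistics.mean` for the same purpose; the hand-written version is shown only to mirror the formula.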
The median M is the midpoint of the distribution.
Arrange all observations in order from smallest to largest.
If \( n \) is odd, the median is the middle value, found at position \( \frac{n+1}{2} \) in the ordered list.
If \( n \) is even, the median is the average of the two middle values, found at positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \) in the ordered list.
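The steps above translate almost line for line into code. The sketch below uses a hypothetical function name, `median_by_position`, and two small made-up lists purely to show the odd and even cases. Note that Python lists are indexed from 0, so position \( \frac{n+1}{2} \) corresponds to index `n // 2` when \( n \) is odd.

```python
def median_by_position(values):
    """Median via the ordered-list position rule."""
    data = sorted(values)            # step 1: order the observations
    n = len(data)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]             # odd n: the single middle value
    return (data[mid - 1] + data[mid]) / 2   # even n: average the two middle values


print(median_by_position([5, 1, 3]))      # 3
print(median_by_position([4, 1, 3, 2]))   # 2.5
```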
The five-number summary consists of:
Minimum – Q1 – Median – Q3 – Maximum
It describes the spread and center of a data set in a simple way, moving from smallest to largest values.
Arrange the data in increasing order and find the median.
Q1: Median of the lower half (values below the overall median).
Q3: Median of the upper half (values above the overall median).
Example: 2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52
Five-number summary: Min = 2, Q1 = 28, Med = 32, Q3 = 40, Max = 52
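The same procedure can be sketched in Python. The function name `five_number_summary` is an assumption of this example, and the quartiles are computed with the median-of-halves method described above (the overall median is left out of both halves when \( n \) is odd). Library routines may use other quartile conventions and can return slightly different values.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]              # values below the overall median
    upper = data[half + n % 2:]      # values above the overall median
    return data[0], median(lower), median(data), median(upper), data[-1]


data = [2, 14, 28, 29, 30, 32, 33, 34, 40, 42, 52]
print(five_number_summary(data))   # (2, 28, 32, 40, 52)
```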
The IQR is the distance between the first and third quartiles:
IQR = Q3 − Q1
Example: \( \text{IQR} = 40 - 28 = 12 \)
A boxplot is a graph of the five-number summary, with outliers plotted individually.
Example: [Figure: boxplot]
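The standard deviation of a set of observations measures the typical distance of the observations from their mean. It is the square root of the variance, which is the sum of the squared deviations from the mean divided by \( n - 1 \). The formula for the standard deviation of \( n \) observations \( x_1, x_2, \ldots, x_n \) is: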
\[ s = \sqrt{ \frac{ \sum (x_i - \bar{x})^2 }{ n - 1 } } \]
Consider a data set of 5 observations with a mean of 4.8, whose squared deviations from the mean sum to 22.8.
So the standard deviation is: \[ s = \sqrt{ \frac{22.8}{5 - 1} } = \sqrt{ \frac{22.8}{4} } = \sqrt{5.7} \approx 2.387 \]
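The calculation can be followed step by step in code. In this sketch the function name `sample_std` and the five values in `scores` are hypothetical, chosen only so the arithmetic is easy to follow; they are not the data set from the example above.

```python
from math import sqrt


def sample_std(values):
    """Sample standard deviation: sqrt of squared deviations summed over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / (n - 1))


scores = [2, 4, 4, 6, 9]    # hypothetical data: mean 5, squared deviations sum to 28
s = sample_std(scores)      # sqrt(28 / 4) = sqrt(7)
print(round(s, 3))          # 2.646
```

Python's `statistics.stdev` also divides by \( n - 1 \), so it should agree with this function on the same data.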
Always describe a distribution using SOCS: Shape, Outliers, Center, Spread.
A z-score measures how many standard deviations a data point is from the mean.
Use this formula:
\[ z = \frac{x - \bar{x}}{s} \]
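A short sketch of the formula in Python follows. The function name `z_score` and the observation x = 8 are assumptions made for illustration; the mean 4.8 and standard deviation 2.387 are the values from the standard deviation example in these notes.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


z = z_score(8, 4.8, 2.387)   # hypothetical observation x = 8
print(round(z, 2))           # 1.34
```

A positive z-score means the value is above the mean, a negative one means it is below, and a z-score of 0 means it equals the mean.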
In a normal distribution (symmetric and bell-shaped):