Unit 2: Exploring Two-Variable Data
In this unit, we will build on what we've learned by representing two-variable data, comparing distributions, describing relationships between variables, and using models to make predictions.
Explanatory vs. Response Variables
- Explanatory variable (x): Predicts or influences the response.
- Response variable (y): Measures the outcome of interest.
Example: Studying hours (x) vs. exam score (y)
→ Hours studied is explanatory; exam score is response.
Scatterplots
Used to display the relationship between two quantitative variables.
- Each point represents one individual.
- Put the explanatory variable on the x-axis and the response on the y-axis.
Describe overall pattern with FODS:
- Form: linear or nonlinear
- Outliers: any points that fall outside the trend
- Direction: positive or negative
- Strength: strong, moderate, weak
Example: [scatterplot of driver age vs. number of accidents]
This scatterplot shows a strong, negative, linear association between age of drivers and number of accidents. There don't appear to be any outliers in the data.
Correlation (r)
Measures the strength and direction of a linear relationship between two variables.
- \( -1 \leq r \leq 1 \)
- \( r > 0 \): positive association
- \( r < 0 \): negative association
- \( r = 0 \): no linear association
- \( r \) is unitless and not affected by changes in center or scale
💡 Tip: Correlation does not imply causation!
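As a quick check of these properties, r can be computed with NumPy on a small made-up dataset (the hours/scores values below are hypothetical, chosen only for illustration); rescaling x leaves r unchanged:

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; r is an off-diagonal entry
r = np.corrcoef(hours, scores)[0, 1]

# r is unitless: converting hours to minutes does not change it
r_minutes = np.corrcoef(hours * 60, scores)[0, 1]
```

Here r comes out close to 1, matching a strong positive linear association, and `r_minutes` equals `r` because correlation is unaffected by changes of scale.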
Least-Squares Regression Line (LSRL)
The line that minimizes the sum of squared residuals.
Equation form: \( \hat{y} = a + bx \)
- \( \hat{y} \): Predicted value
- \( a \): y-intercept → Predicted \( y \) when \( x = 0 \)
- \( b \): Slope → Predicted change in \( y \) for a 1-unit increase in \( x \)
Example: \( \hat{y} = 45 + 5x \)
Each additional hour studied predicts 5 more points. When \( x = 0 \), \( \hat{y} = 45 \).
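The closed-form least-squares estimates (slope \( b = S_{xy}/S_{xx} \), intercept \( a = \bar{y} - b\bar{x} \)) can be sketched in NumPy with the same hypothetical study-hours data:

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# Least-squares estimates: b = Sxy / Sxx, a = y-bar minus b times x-bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x  # predicted scores along the LSRL
```

These are the same values a calculator's linear regression command produces; no other line through the data has a smaller sum of squared residuals.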
Interpreting Computer Output
Example: [computer regression output table]
- Constant: y-intercept \( a \)
- x coefficient: slope \( b \)
- SE Coef: standard error of the coefficient
- t, p: used for significance testing (more on this in later units)
Calculating the Regression Line by Calculator
- Enter data into Lists (STAT → Edit → L1, L2).
- STAT → CALC → LinReg(a+bx) (matches the form \( \hat{y} = a + bx \))
- Store RegEQ to Y₁: VARS → Y-VARS → Function → Y₁
- View scatterplot: 2nd → Y= (STAT PLOT) → turn on Plot1; then ZOOM → 9 (ZoomStat)
Interpreting Slope and Y-Intercept
- Slope: For each additional [unit of x], the predicted [y] increases/decreases by [b].
- Y-intercept: When [x = 0], the predicted [y] is [a].
Example: \( \hat{y} = 45 + 5x \)
Slope: Each additional hour studied predicts 5 more points on the test.
Intercept: A student who studies 0 hours is predicted to score 45.
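The same example line can be written as a plain function to generate predictions (the equation \( \hat{y} = 45 + 5x \) is the hypothetical one from above, not a fitted result):

```python
# Plain-Python version of the example LSRL: y-hat = 45 + 5x
def predict_score(hours_studied):
    """Predicted exam score from the hypothetical fitted line."""
    return 45 + 5 * hours_studied
```

For instance, `predict_score(3)` returns 60: a student who studies 3 hours is predicted to score 60.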
Residuals
Residual = Actual y − Predicted y
\[
\text{Residual} = y - \hat{y}
\]
- Positive residual: prediction was too low
- Negative residual: prediction was too high
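A one-line computation with the example equation \( \hat{y} = 45 + 5x \) (hypothetical numbers) makes the sign convention concrete:

```python
# Residual = actual y minus predicted y, using the example line y-hat = 45 + 5x
hours = 4
actual_score = 72
predicted_score = 45 + 5 * hours           # the line predicts 65
residual = actual_score - predicted_score  # positive: the prediction was too low
```

The residual is +7, so this student scored 7 points higher than the line predicted.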
Residual Plots
A scatterplot of residuals vs. x (or predicted y)
- If the plot shows no pattern (random scatter), the linear model is appropriate.
- If you see curves, funnels, or structure → linear model is not appropriate.
Understanding \( \hat{y} \), \( a \), \( b \), \( r^2 \), and \( s \)
- \( \hat{y} \): The predicted value of y for a given x.
- \( a \): y-intercept → Starting value when \( x = 0 \)
- \( b \): Slope → For each 1-unit increase in \( x \), the predicted \( y \) changes by \( b \)
- \( r^2 \): Coefficient of determination → Proportion of variability in \( y \) explained by the LSRL
- \( s \): Standard deviation of residuals → Typical size of a prediction error, in units of \( y \)
Example: If \( r^2 = 0.82 \), then 82% of variation in test scores is explained by study hours. If \( s = 2.5 \), predictions are typically off by 2.5 points.
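Both quantities can be computed from the residuals of the hypothetical study-hours data used earlier (a sketch, not calculator output; note that \( s \) divides by \( n - 2 \) because two parameters are estimated):

```python
import numpy as np

# Same hypothetical study-hours data and LSRL as in the earlier examples
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# r^2 = 1 - SSE/SST: proportion of variability in y explained by the line
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# s: standard deviation of residuals, with n - 2 degrees of freedom
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
```

For this dataset \( r^2 \approx 0.99 \) (the line explains about 99% of the variation in scores) and \( s \approx 0.8 \) (predictions are typically off by under a point).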
Outliers vs. Leverage vs. Influential Points
- Outliers: Points with large residuals (far from regression line vertically)
- Leverage points: Points far from mean of \( x \), may not have large residuals
- Influential points: Points that, if removed, significantly change the LSRL
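The refit-without-the-point idea can be demonstrated directly. In this made-up dataset, four points lie exactly on \( y = 2x \) and one high-leverage point sits far from the mean of \( x \):

```python
import numpy as np

def lsrl_slope(x, y):
    """Slope of the least-squares regression line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Four points on y = 2x, plus one high-leverage point far from the mean of x
x = [1, 2, 3, 4, 20]
y = [2, 4, 6, 8, 10]

slope_with = lsrl_slope(x, y)               # dragged down by the leverage point
slope_without = lsrl_slope(x[:-1], y[:-1])  # exactly 2 without it
```

Removing the single point changes the slope from 0.32 to 2, so that point is influential, which is why checking plots before and after removing suspect points matters.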
Coefficient of Determination (\( r^2 \))
\( r^2 \) tells us the proportion of variability in y explained by the linear relationship with x.
Example: \( r^2 = 0.85 \) → 85% of the variation in y is explained by the model.
Cautions in Regression
- Correlation ≠ Causation: Just because two variables are related does not mean one causes the other.
- Extrapolation: Predicting beyond the domain of the data is risky and unreliable.
- Outliers and influential points: Always check plots before trusting models.
Calculator Tips and Hacks
- Use LinReg(a+bx) to get the equation, r, and r² (enable diagnostics: 2nd → 0 → DiagnosticOn)
- Store RegEQ directly into Y₁ to graph the line
- View predicted values: plug x into Y₁ using the table or by VARS → Y₁
- Use RESID list: 2nd → STAT → RESID to graph or analyze residuals