Unit 2: Exploring Two-Variable Data
In this unit, we will build on what we've learned by representing two-variable data, comparing distributions, describing relationships between variables, and using models to make predictions.
Explanatory vs. Response Variables
- Explanatory variable (x): Predicts or influences the response.
- Response variable (y): Measures the outcome of interest.
Example: Studying hours (x) vs. exam score (y)
→ Hours studied is explanatory; exam score is response.
Scatterplots
Used to display the relationship between two quantitative variables.
- Each point represents one individual.
- Put the explanatory variable on the x-axis and the response on the y-axis.
Describe overall pattern with FODS:
- Form: linear or nonlinear
- Outliers: any points that fall outside the trend
- Direction: positive or negative
- Strength: strong, moderate, weak
Example: [scatterplot of driver age vs. number of accidents]
This scatterplot shows a strong, negative, linear association between age of drivers and number of accidents. There don't appear to be any outliers in the data.
Correlation (r)
Measures the strength and direction of a linear relationship between two variables.
- \( -1 \leq r \leq 1 \)
- \( r > 0 \): positive association
- \( r < 0 \): negative association
- \( r = 0 \): no linear association
- \( r \) is unitless and not affected by changes in center or scale
💡 Tip: Correlation does not imply causation!
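As a quick check of these properties, r can be computed with NumPy on a small made-up dataset (the hours/scores values below are hypothetical, chosen only for illustration); rescaling x leaves r unchanged:

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
scores = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; r is an off-diagonal entry
r = np.corrcoef(hours, scores)[0, 1]

# r is unitless: converting hours to minutes does not change it
r_minutes = np.corrcoef(hours * 60, scores)[0, 1]
```

Here r comes out close to 1, matching a strong positive linear association, and `r_minutes` equals `r` because correlation is unaffected by changes of scale.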
Least-Squares Regression Line (LSRL)
The line that minimizes the sum of squared residuals.
Equation form: \( \hat{y} = a + bx \)
- \( \hat{y} \): Predicted value
- \( a \): y-intercept → Predicted \( y \) when \( x = 0 \)
- \( b \): Slope → Predicted change in \( y \) for a 1-unit increase in \( x \)
Example: \( \hat{y} = 45 + 5x \)
Each additional hour studied predicts 5 more points. When \( x = 0 \), \( \hat{y} = 45 \).
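The closed-form least-squares estimates (slope \( b = S_{xy}/S_{xx} \), intercept \( a = \bar{y} - b\bar{x} \)) can be sketched in NumPy with the same hypothetical study-hours data:

```python
import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# Least-squares estimates: b = Sxy / Sxx, a = y-bar minus b times x-bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x  # predicted scores along the LSRL
```

These are the same values a calculator's linear regression command produces; no other line through the data has a smaller sum of squared residuals.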
Interpreting Computer Output
Example: [computer regression output table]
- Constant: y-intercept \( a \)
- x coefficient: slope \( b \)
- SE Coef: standard error of the coefficient
- t, p: used for significance testing (more on this in later units)
Calculating the Regression Line by Calculator
- Enter data into Lists (STAT → Edit → L1, L2).
- STAT → CALC → LinReg(a+bx) (matches the form \( \hat{y} = a + bx \))
- Store RegEQ to Y₁: VARS → Y-VARS → Function → Y₁
- View scatterplot: 2nd → Y= (STAT PLOT) → turn on Plot1; then ZOOM → 9 (ZoomStat)
Interpreting Slope and Y-Intercept
- Slope: For each additional [unit of x], the predicted [y] increases/decreases by [b].
- Y-intercept: When [x = 0], the predicted [y] is [a].
Example: \( \hat{y} = 45 + 5x \)
Slope: Each additional hour studied predicts 5 more points on the test.
Intercept: A student who studies 0 hours is predicted to score 45.
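The same example line can be written as a plain function to generate predictions (the equation \( \hat{y} = 45 + 5x \) is the hypothetical one from above, not a fitted result):

```python
# Plain-Python version of the example LSRL: y-hat = 45 + 5x
def predict_score(hours_studied):
    """Predicted exam score from the hypothetical fitted line."""
    return 45 + 5 * hours_studied
```

For instance, `predict_score(3)` returns 60: a student who studies 3 hours is predicted to score 60.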
Residuals
Residual = Actual y − Predicted y
\[
\text{Residual} = y - \hat{y}
\]
- Positive residual: prediction was too low
- Negative residual: prediction was too high
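A one-line computation with the example equation \( \hat{y} = 45 + 5x \) (hypothetical numbers) makes the sign convention concrete:

```python
# Residual = actual y minus predicted y, using the example line y-hat = 45 + 5x
hours = 4
actual_score = 72
predicted_score = 45 + 5 * hours           # the line predicts 65
residual = actual_score - predicted_score  # positive: the prediction was too low
```

The residual is +7, so this student scored 7 points higher than the line predicted.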
Residual Plots
A scatterplot of residuals vs. x (or predicted y)
- If the plot shows no pattern (random scatter), the linear model is appropriate.
- If you see curves, funnels, or structure → linear model is not appropriate.
Understanding \( \hat{y} \), \( a \), \( b \), \( r^2 \), and \( s \)
- \( \hat{y} \): The predicted value of y for a given x.
- \( a \): y-intercept → Starting value when \( x = 0 \)
- \( b \): Slope → For each 1-unit increase in \( x \), the predicted \( y \) changes by \( b \)
- \( r^2 \): Coefficient of determination → Proportion of variability in \( y \) explained by the LSRL
- \( s \): Standard deviation of residuals → Typical size of a prediction error, in units of \( y \)
Example: If \( r^2 = 0.82 \), then 82% of variation in test scores is explained by study hours. If \( s = 2.5 \), predictions are typically off by 2.5 points.
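Both quantities can be computed from the residuals of the hypothetical study-hours data used earlier (a sketch, not calculator output; note that \( s \) divides by \( n - 2 \) because two parameters are estimated):

```python
import numpy as np

# Same hypothetical study-hours data and LSRL as in the earlier examples
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

# r^2 = 1 - SSE/SST: proportion of variability in y explained by the line
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# s: standard deviation of residuals, with n - 2 degrees of freedom
s = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))
```

For this dataset \( r^2 \approx 0.99 \) (the line explains about 99% of the variation in scores) and \( s \approx 0.8 \) (predictions are typically off by under a point).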
Outliers vs. Leverage vs. Influential Points
- Outliers: Points with large residuals (far from regression line vertically)
- Leverage points: Points far from mean of \( x \), may not have large residuals
- Influential points: Points that, if removed, significantly change the LSRL
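The refit-without-the-point idea can be demonstrated directly. In this made-up dataset, four points lie exactly on \( y = 2x \) and one high-leverage point sits far from the mean of \( x \):

```python
import numpy as np

def lsrl_slope(x, y):
    """Slope of the least-squares regression line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Four points on y = 2x, plus one high-leverage point far from the mean of x
x = [1, 2, 3, 4, 20]
y = [2, 4, 6, 8, 10]

slope_with = lsrl_slope(x, y)               # dragged down by the leverage point
slope_without = lsrl_slope(x[:-1], y[:-1])  # exactly 2 without it
```

Removing the single point changes the slope from 0.32 to 2, so that point is influential, which is why checking plots before and after removing suspect points matters.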
Coefficient of Determination (\( r^2 \))
\( r^2 \) tells us the proportion of variability in y explained by the linear relationship with x.
Example: \( r^2 = 0.85 \) → 85% of the variation in y is explained by the model.
Cautions in Regression
- Correlation ≠ Causation: Just because two variables are related does not mean one causes the other.
- Extrapolation: Predicting beyond the domain of the data is risky and unreliable.
- Outliers and influential points: Always check plots before trusting models.
Calculator Tips and Hacks
- Use LinReg(a+bx) to get the equation, r, and r² (enable diagnostics: 2nd → 0 → DiagnosticOn)
- Store RegEQ directly into Y₁ to graph the line
- View predicted values: plug x into Y₁ using the table or by VARS → Y₁
- Use RESID list: 2nd → STAT → RESID to graph or analyze residuals