Statistics: regression and correlation – Week 4 focus
Subject: Mathematics
Class: Grade 12
Term: 3rd Term
Week: 4
Theme: General lesson support
This week we study regression and correlation, two statistical tools for describing and modelling the relationship between two variables. They matter because they let us make predictions from observed data. In South Africa, regression and correlation can be used to analyze anything from the relationship between household income and access to education to the link between fertilizer use and crop yields. These skills support informed decision-making in many fields.
Correlation: Correlation measures the strength and direction of a linear relationship between two variables. The correlation coefficient, denoted by r, is a number between -1 and +1.

- r = +1: Perfect positive correlation. All data points lie exactly on a straight line with a positive slope, so as one variable increases, the other increases.
- r = -1: Perfect negative correlation. All data points lie exactly on a straight line with a negative slope, so as one variable increases, the other decreases.
- r = 0: No linear correlation. The variables are not linearly related (they may still be related in a non-linear way).
- 0 < r < 1: Positive correlation.
- -1 < r < 0: Negative correlation.

The closer r is to +1 or -1, the stronger the linear relationship.

Calculating the Correlation Coefficient (r): The formula for the Pearson product-moment correlation coefficient is:

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Where:

- xᵢ and yᵢ are the individual data points.
- x̄ and ȳ are the means of the x and y values, respectively.
- Σ represents the sum over all data points.
Example 1: A study investigates the relationship between the number of hours students spend studying per week (x) and their final exam scores (y).
The data is as follows:

| Student | Hours Studied (x) | Exam Score (y) |
|---|---|---|
| A | 5 | 60 |
| B | 10 | 75 |
| C | 15 | 85 |
| D | 20 | 90 |
| E | 25 | 95 |

Calculate the means: x̄ = (5 + 10 + 15 + 20 + 25) / 5 = 15; ȳ = (60 + 75 + 85 + 90 + 95) / 5 = 81

Calculate (xᵢ - x̄)(yᵢ - ȳ), (xᵢ - x̄)², and (yᵢ - ȳ)² for each student:

| Student | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² | (yᵢ - ȳ)² |
|---|---|---|---|---|---|
| A | -10 | -21 | 210 | 100 | 441 |
| B | -5 | -6 | 30 | 25 | 36 |
| C | 0 | 4 | 0 | 0 | 16 |
| D | 5 | 9 | 45 | 25 | 81 |
| E | 10 | 14 | 140 | 100 | 196 |

Sum the columns: Σ(xᵢ - x̄)(yᵢ - ȳ) = 425; Σ(xᵢ - x̄)² = 250; Σ(yᵢ - ȳ)² = 770

Calculate r: r = 425 / √(250 × 770) = 425 / √192500 ≈ 425 / 438.75 ≈ 0.97

Interpretation: The correlation coefficient is approximately 0.97, indicating a strong positive linear correlation between hours studied and exam scores.
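The hand calculation in Example 1 can be checked with a short Python sketch. This is only an illustration and not part of the lesson's required method; the function name `pearson_r` and the variable names `hours` and `scores` are assumptions chosen for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient using the deviation-sum formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    n = len(xs)
    x_bar = sum(xs) / n  # mean of x
    y_bar = sum(ys) / n  # mean of y
    # Sums of products and squares of the deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)


# Data from Example 1
hours = [5, 10, 15, 20, 25]
scores = [60, 75, 85, 90, 95]

r = pearson_r(hours, scores)
print(round(r, 2))  # 0.97
```

The three sums inside the function correspond to the column totals 425, 250 and 770 in the table.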
Regression: Regression analysis involves finding the equation of a line that best fits the data points in a scatter plot. This line is called the least-squares regression line.
The equation of the line is:

y = a + bx

Where:

- y is the dependent variable (the variable being predicted).
- x is the independent variable (the variable used to make the prediction).
- a is the y-intercept (the value of y when x = 0).
- b is the slope (the change in y for every one unit change in x).

Calculating the Equation of the Regression Line:

b = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / Σ(xᵢ - x̄)² = Cov(x,y) / Var(x)

a = ȳ - b x̄

Where:

- Cov(x,y) represents the covariance of x and y.
- Var(x) represents the variance of x.

Example 2: Using the data from Example 1:

We already calculated: Σ[(xᵢ - x̄)(yᵢ - ȳ)] = 425 and Σ(xᵢ - x̄)² = 250

Calculate b: b = 425 / 250 = 1.7

Calculate a: a = 81 - (1.7 × 15) = 81 - 25.5 = 55.5

Therefore, the equation of the regression line is: y = 55.5 + 1.7x

Interpretation:

- Slope (b = 1.7): For every additional hour of studying, the exam score is predicted to increase by 1.7 points.
- Y-intercept (a = 55.5): A student who studies 0 hours is predicted to score 55.5 on the exam. (This might not be realistic in practice but is the mathematical interpretation.)

Making Predictions (Interpolation and Extrapolation): We can use the regression equation to predict the value of y for a given value of x.
Interpolation: Making predictions within the range of the original x values.
Extrapolation: Making predictions outside the range of the original x values. Extrapolation should be done with caution, as the relationship may not hold true outside the observed range.
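As a rough sketch, the least-squares formulas for a and b and the interpolation/extrapolation distinction can be expressed in Python as follows. The helper names `fit_line` and `predict` are hypothetical, chosen only for this illustration, and the data is the study-hours data from Example 1.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b


def predict(a, b, x, xs):
    """Predict y at x and label the prediction by where x falls."""
    y_hat = a + b * x
    # Inside the observed x range: interpolation; outside: extrapolation
    kind = "interpolation" if min(xs) <= x <= max(xs) else "extrapolation"
    return y_hat, kind


# Data from Example 1
hours = [5, 10, 15, 20, 25]
scores = [60, 75, 85, 90, 95]

a, b = fit_line(hours, scores)
print(f"y = {a:.1f} + {b:.1f}x")  # y = 55.5 + 1.7x

for x in (12, 40):
    y_hat, kind = predict(a, b, x, hours)
    print(f"x = {x}: predicted y = {y_hat:.1f} ({kind})")
# x = 12: predicted y = 75.9 (interpolation)
# x = 40: predicted y = 123.5 (extrapolation)
```

Labelling each prediction makes it harder to overlook that a value such as x = 40 lies well outside the observed range of 5 to 25 hours, where the fitted line may no longer describe the data.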
Example 3: Using the regression equation from Example 2 (y = 55.5 + 1.7x):

- Predict the exam score for a student who studies 12 hours (interpolation): y = 55.5 + (1.7 × 12) = 55.5 + 20.4 = 75.9
- Predict the exam score for a student who studies 40 hours (extrapolation): y = 55.5 + (1.7 × 40) = 55.5 + 68 = 123.5

While mathematically correct, the prediction of 123.5 is unrealistic as the exam score cannot exceed 100. This highlights the dangers of extrapolation.

Correlation vs. Causation: It is crucial to understand that correlation does not imply causation. Just because two variables are correlated does not mean that one variable causes the other. There may be other factors involved, or the relationship may be coincidental.
Example: There might be a correlation between ice cream sales and crime rates. However, it's unlikely that eating ice cream causes crime. A more likely explanation is that both ice cream sales and crime rates increase during warmer months.

Guided Practice (With Solutions)
Question 1: The following data represents the number of unemployed individuals (in thousands) and the number of job vacancies (in thousands) in South Africa over a period of six months.

| Month | Unemployed (x) | Vacancies (y) |
|---|---|---|
| 1 | 2500 | 50 |
| 2 | 2450 | 55 |
| 3 | 2400 | 60 |
| 4 | 2350 | 65 |
| 5 | 2300 | 70 |
| 6 | 2250 | 75 |

Calculate the correlation coefficient (r) and interpret its meaning.