Statistics: regression and correlation – Week 5 focus
Subject: Mathematics
Class: Grade 12
Term: 3rd Term
Week: 5
Theme: General lesson support
This week we study regression and correlation, two tools for describing the relationship between two sets of data. They are used in many fields, from economics and environmental science to sports analysis and marketing. In South Africa, analysing trends in unemployment, agricultural output, crime statistics, or academic performance relies on both. Being able to analyse data, identify relationships, and make predictions is a valuable skill for future decision-makers and problem-solvers, and it helps you critically evaluate information presented in the media and make informed choices.
Correlation: Correlation measures the strength and direction of a linear relationship between two variables. The correlation coefficient, denoted by r, ranges from -1 to +1.
r = +1: Perfect positive correlation (all points lie exactly on a straight line with a positive slope).
r = -1: Perfect negative correlation (all points lie exactly on a straight line with a negative slope).
r = 0: No linear correlation.
0 < r < 1: Positive correlation. The closer r is to 1, the stronger the positive relationship.
-1 < r < 0: Negative correlation. The closer r is to -1, the stronger the negative relationship.
Important Formula: The formula for calculating the correlation coefficient r is: r = ∑((xᵢ - x̄)(yᵢ - ȳ)) / √(∑(xᵢ - x̄)² * ∑(yᵢ - ȳ)²)
Where: xᵢ and yᵢ are individual data points. x̄ is the mean of the x values. ȳ is the mean of the y values. ∑ denotes summation.
Example 1: A study was conducted to investigate the relationship between the amount of rainfall (in mm) and the yield of maize (in tons per hectare) in a specific region of KwaZulu-Natal.
The data is shown below:

| Rainfall (mm) | Maize Yield (tons/hectare) |
|---------------|----------------------------|
| 100 | 2 |
| 150 | 3 |
| 200 | 4 |
| 250 | 5 |
| 300 | 6 |

Calculate the correlation coefficient.
Solution: First, calculate the means:
x̄ = (100 + 150 + 200 + 250 + 300) / 5 = 200
ȳ = (2 + 3 + 4 + 5 + 6) / 5 = 4

Now, calculate the necessary sums:

| Rainfall (xᵢ) | Maize Yield (yᵢ) | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² | (yᵢ - ȳ)² |
|---------------|------------------|---------|---------|-------------------|-----------|-----------|
| 100 | 2 | -100 | -2 | 200 | 10000 | 4 |
| 150 | 3 | -50 | -1 | 50 | 2500 | 1 |
| 200 | 4 | 0 | 0 | 0 | 0 | 0 |
| 250 | 5 | 50 | 1 | 50 | 2500 | 1 |
| 300 | 6 | 100 | 2 | 200 | 10000 | 4 |
| | | | | ∑ = 500 | ∑ = 25000 | ∑ = 10 |

r = 500 / √(25000 * 10) = 500 / √250000 = 500 / 500 = 1

Therefore, r = 1, indicating a perfect positive correlation between rainfall and maize yield.
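The column sums in Example 1 can be checked with a few lines of Python. This is a minimal sketch that follows the formula for r given earlier; the function name `pearson_r` and the variable names are illustrative and not part of the lesson note.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r, built from the same three sums as the table."""
    n = len(xs)
    x_bar = sum(xs) / n  # mean of the x values
    y_bar = sum(ys) / n  # mean of the y values
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Data from Example 1
rainfall = [100, 150, 200, 250, 300]  # mm
maize_yield = [2, 3, 4, 5, 6]         # tons per hectare

r = pearson_r(rainfall, maize_yield)
print(round(r, 3))  # prints 1.0
```

From Python 3.10 onward, the standard library function `statistics.correlation` should give the same value, but writing the sums out by hand mirrors the table more closely.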
Regression: Regression analysis aims to find the equation of a line that best fits the data, allowing us to predict the value of one variable (the dependent variable, y) based on the value of another variable (the independent variable, x). This line is called the least squares regression line.

Equation of the Least Squares Regression Line: y = a + bx
Where:
y is the predicted value of the dependent variable.
x is the value of the independent variable.
a is the y-intercept (the value of y when x = 0).
b is the slope (the change in y for every unit change in x).
Formulas for calculating a and b:
b = ∑((xᵢ - x̄)(yᵢ - ȳ)) / ∑(xᵢ - x̄)²
a = ȳ - b * x̄
Note: b is also equal to r * (s_y / s_x), where s_y and s_x are the standard deviations of y and x, respectively.

Example 2: Using the same data from Example 1, determine the equation of the least squares regression line.
Solution: We already calculated the necessary values in Example 1:
∑((xᵢ - x̄)(yᵢ - ȳ)) = 500
∑(xᵢ - x̄)² = 25000
x̄ = 200
ȳ = 4

b = 500 / 25000 = 0.02
a = 4 - 0.02 * 200 = 4 - 4 = 0

Therefore, the equation of the least squares regression line is y = 0 + 0.02x, or simply y = 0.02x.

Interpretation: The slope (b = 0.02) indicates that for every 1 mm increase in rainfall, the maize yield is predicted to increase by 0.02 tons per hectare. The y-intercept (a = 0) means the line predicts a yield of 0 tons per hectare with no rainfall. That is plausible in this context, but x = 0 lies outside the observed range of 100 mm to 300 mm, so it is an extrapolation and should be treated with caution.
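As a rough check of Example 2, the formulas for b and a can be coded directly. This is a sketch only; `least_squares_line` is a hypothetical helper name chosen for illustration, not something defined in the lesson note.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the formulas for b and a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx        # slope
    a = y_bar - b * x_bar  # y-intercept
    return a, b

# Data from Example 1
rainfall = [100, 150, 200, 250, 300]  # mm
maize_yield = [2, 3, 4, 5, 6]         # tons per hectare

a, b = least_squares_line(rainfall, maize_yield)
print(f"y = {a:.2f} + {b:.2f}x")  # prints y = 0.00 + 0.02x
```

Returning a and b as a pair keeps the function reusable for any other pair of data lists of equal length.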
Example 3: A researcher wants to investigate the relationship between study hours per week (x) and the final exam score (y) for a group of Grade 12 learners in Gauteng.
The following data was collected:

| Study Hours (x) | Exam Score (y) |
|-----------------|----------------|
| 5 | 60 |
| 10 | 75 |
| 15 | 82 |
| 20 | 90 |
| 25 | 95 |

Calculate the correlation coefficient.
Determine the equation of the least squares regression line.
Predict the exam score for a student who studies 18 hours per week.
Solution: Correlation Coefficient (r):
x̄ = (5 + 10 + 15 + 20 + 25) / 5 = 15
ȳ = (60 + 75 + 82 + 90 + 95) / 5 = 80.4

| Study Hours (xᵢ) | Exam Score (yᵢ) | xᵢ - x̄ | yᵢ - ȳ | (xᵢ - x̄)(yᵢ - ȳ) | (xᵢ - x̄)² | (yᵢ - ȳ)² |
|------------------|-----------------|---------|---------|-------------------|-----------|-----------|
| 5 | 60 | -10 | -20.4 | 204 | 100 | 416.16 |
| 10 | 75 | -5 | -5.4 | 27 | 25 | 29.16 |
| 15 | 82 | 0 | 1.6 | 0 | 0 | 2.56 |
| 20 | 90 | 5 | 9.6 | 48 | 25 | 92.16 |
| 25 | 95 | 10 | 14.6 | 146 | 100 | 213.16 |
| | | | | ∑ = 425 | ∑ = 250 | ∑ = 753.2 |

r = 425 / √(250 * 753.2) = 425 / √188300 ≈ 425 / 433.94 ≈ 0.979

Therefore, r ≈ 0.979, indicating a very strong positive correlation between study hours and exam scores.

Least Squares Regression Line:
b = 425 / 250 = 1.7
a = 80.4 - 1.7 * 15 = 80.4 - 25.5 = 54.9

Therefore, the equation of the least squares regression line is: y = 54.9 + 1.7x

Prediction: For a student who studies 18 hours per week (x = 18):
y = 54.9 + 1.7 * 18 = 54.9 + 30.6 = 85.5

The predicted exam score is about 85.5. Since x = 18 lies within the observed range of 5 to 25 hours, this is an interpolation.
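All three parts of Example 3 can be recomputed from the raw data in one short script. This is a sketch under the same formulas as above; the variable names (`hours`, `scores`, `predicted`) are assumptions made for illustration.

```python
from math import sqrt

# Data from Example 3
hours = [5, 10, 15, 20, 25]    # study hours per week (x)
scores = [60, 75, 82, 90, 95]  # final exam score (y)

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n

# The three sums from the bottom row of the table
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)

r = s_xy / sqrt(s_xx * s_yy)  # correlation coefficient
b = s_xy / s_xx               # slope
a = y_bar - b * x_bar         # y-intercept
predicted = a + b * 18        # predicted score for 18 study hours

print(f"r = {r:.3f}")                             # prints r = 0.979
print(f"y = {a:.1f} + {b:.1f}x")                  # prints y = 54.9 + 1.7x
print(f"predicted score at x = 18: {predicted:.1f}")  # prints 85.5 after the label
```

Recomputing the sums from the raw data, rather than typing them in from a table, is a simple guard against an arithmetic slip in any one column.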