
1. Introduction

The Revised National Curriculum Statement calls for the application of mathematics in real-world contexts, and the use of mathematical modelling to analyse and describe our world. For example, the Data Handling and Probability Learning Outcome requires learners:

  • to collect and use data to establish statistical and probability models to solve related problems.
  • to represent bivariate data as a scatter plot and suggest intuitively whether a linear, quadratic or exponential function would best fit the data.

However, this is not so simple: when we use real-life contexts, the data are often inaccurate, “messy” or “noisy”. This may be due to measurement errors in scientific experiments, or because the relationship is statistical rather than exact (e.g. the relationship between people’s height and weight, or between a person’s years of education and income).

Let’s illustrate. Here is a table and scatterplot of an abstract relationship between two variables x and y.

   x      y
   0      5
   3      6,5
   6      8
   9      9,5
  12     11
  15     12,5
  18     14
  30      ?

Can you find the algebraic relationship between x and y from the table? Can you predict y(30)?

In the applet, click the sliders, or type values, to change the parameters a and b so that the line y = ax + b goes through all the points (to reset, click “init”).

This process of fitting a graph to given data is called curve fitting or regression.

You should find that the function y = 0,5x + 5 fits all the data pairs exactly, that is, the line passes through all seven points, so y(30) can be confidently predicted as 0,5 × 30 + 5 = 20.
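The exact fit can also be checked numerically. The following is a minimal sketch in Python (not part of the original unit); the names `model` and `residuals` are chosen here for illustration, and the decimal commas of the table are written as decimal points.

```python
# Data pairs from the first table (decimal commas written as decimal points).
xs = [0, 3, 6, 9, 12, 15, 18]
ys = [5, 6.5, 8, 9.5, 11, 12.5, 14]


def model(x, a=0.5, b=5):
    """Evaluate the line y = ax + b (illustrative helper)."""
    return a * x + b


# Residual = observed y minus the y predicted by the line.
residuals = [y - model(x) for x, y in zip(xs, ys)]

# The fit is exact if every residual is zero.
exact_fit = all(r == 0 for r in residuals)

# Extrapolate to x = 30.
prediction = model(30)

print(exact_fit, prediction)
```

If every residual is zero, the line passes through every point, which is what the sliders in the applet let you discover by trial and error.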

Now look at a real-world context: The data and scatterplot below were obtained in a science experiment measuring the length of a spring with different masses hanging on it (Hooke’s Law):

  Mass (x)    Length (y)
      0          5,05
      3          6,72
      6          8,40
      9          9,15
     12         10,50
     15         12,85
     18         13,65
     30          ?

Can you find the algebraic relationship between x and y from the table?
Can you predict y(30)?

Can you fit the line y = ax + b through all seven points?

Here it is impossible to fit a line exactly through all the points. Our problem is to find the best approximate model for the data – we call this the line of best fit or the regression line.
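One quick way to confirm that no single line passes through all the points is to compare the slopes between consecutive data pairs: on a straight line they would all be equal. The sketch below is an illustration in Python, not part of the original unit; the variable names and the tolerance `1e-9` are assumptions made here.

```python
# Spring data from the table (decimal commas written as decimal points).
xs = [0, 3, 6, 9, 12, 15, 18]
ys = [5.05, 6.72, 8.40, 9.15, 10.50, 12.85, 13.65]

points = list(zip(xs, ys))

# Slope of the segment joining each point to the next one.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]

# On a straight line all these slopes would agree (up to rounding).
collinear = max(slopes) - min(slopes) < 1e-9

print([round(s, 3) for s in slopes])
print(collinear)
```

The slopes differ noticeably from one segment to the next, so the best we can hope for is a line that comes close to all the points.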

You will probably agree that it is very difficult to decide visually whether one line is a better fit than another! In this unit we investigate criteria and methods to find the line of best fit: we will investigate it numerically, algebraically and graphically, and use the computational power of technology tools like Excel spreadsheets to help us. We will reflect on the “goodness of fit” (the strength of the relationship, or correlation), and apply our knowledge in a wide range of applications.

We will return to this problem later and show that the line of best fit is y = 0,4781x + 5,1714.
From this model, y(30) can be reasonably predicted as 0,4781 × 30 + 5,1714 ≈ 19,5.
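The quoted coefficients can be reproduced with a short computation. The sketch below is written in Python rather than Excel and assumes that the line of best fit is the ordinary least-squares line (the least-squares method is named in the outcomes of this unit); the function name `least_squares_line` is a hypothetical helper introduced here.

```python
# Spring data from the table (decimal commas written as decimal points).
xs = [0, 3, 6, 9, 12, 15, 18]
ys = [5.05, 6.72, 8.40, 9.15, 10.50, 12.85, 13.65]


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx           # slope
    b = mean_y - a * mean_x  # intercept: the line passes through the means
    return a, b


a, b = least_squares_line(xs, ys)
prediction = a * 30 + b

print(round(a, 4), round(b, 4), round(prediction, 1))
# prints 0.4781 5.1714 19.5
```

Under this assumption the computed slope and intercept agree with the values quoted above to four decimal places.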

Outcomes

After working through this unit you should be able to:

  • Use scatterplots to visualise the relationship between two variables.
  • Explain and apply the least-squares method (minimising the sum of squared errors) numerically and algebraically to find the curve of best fit.
  • Use technology tools like Excel Trendline to generate regression models and data.
  • Analyse regression data to choose the most appropriate approximate model for a situation.
  • Interpret the correlation coefficient for a dataset.
  • Use linearisation to model a real-life situation.
  • Use approximate models to predict unknown values (extrapolate and interpolate).