
Linear Regression

A scatter plot is a graph that represents a relationship between two variables in a dataset by plotting all the points onto the graph. If the relationship is from a linear model, or a model that is nearly linear, then we can use our knowledge of linear functions to draw conclusions.

Fig. 1 - Scatter Plot
Note: Not all scatter plots indicate a linear relationship.

Line of Best Fit

The line of best fit is the linear function that best fits our data, and we can use it to make predictions about the data. We can always "eyeball" a line that seems to fit, or we can use the formula for least squares regression to obtain a line of best fit.

Given a set of $(x, y)$ points where $x$ is the set of inputs, $y$ is the set of outputs, and $N$ is the number of points, we can calculate the slope using...

\begin{align*} m &= \frac{N\sum{(xy)} \:-\: \sum{x}\sum{y}}{N\sum{x^2} \:-\: (\sum{x})^2} \end{align*}

Using the slope, which is the value of $m$, we can calculate the $y$-intercept using...

\begin{align*} b &= \frac{\sum{y} \:-\: m\sum{x}}{N} \end{align*}

Finally, using the values of $m$ and $b$ we have acquired, we can create the line of best fit using $y = mx + b$.

Fig. 2 - Line of Best Fit
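
The slope and intercept formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `line_of_best_fit` and the sample data are made up for this example, and the code assumes the inputs are plain lists of numbers of equal length.

```python
def line_of_best_fit(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    # The four sums that appear in the slope and intercept formulas.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Denominator of the slope formula: N * sum(x^2) - (sum(x))^2.
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b


# Hypothetical sample data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = line_of_best_fit(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")
```

For this sample data the sums are $\sum{x} = 15$, $\sum{y} = 20$, $\sum{(xy)} = 66$, and $\sum{x^2} = 55$, which gives $m = 0.6$ and $b = 2.2$ up to floating-point rounding.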

Interpolation vs Extrapolation

Now that we have a linear function we can use to make predictions, we need to understand its limitations. Predicting a value with a function falls under one of two processes. Interpolation is predicting a value inside the domain and range of the data. Extrapolation is predicting a value outside the domain and range of the data. The distinction matters because of model breakdown, the point beyond which our model no longer works. Outside the domain and range of the data, we do not know how the data will behave, so extrapolated predictions are less reliable. At the end of the day, we are just making predictions.

Fig. 3 - Interpolation vs Extrapolation
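
One way to keep this distinction in view is to label each prediction before trusting it. The sketch below is only an illustration. The helper names `classify_prediction` and `predict` are hypothetical, the slope and intercept are example values rather than results from any dataset in this text, and as a simplification the check looks only at the inputs (the domain of the data), not the outputs.

```python
def classify_prediction(x_new, xs):
    """Label a prediction at x_new relative to the observed inputs xs."""
    # Simplifying assumption: only the x values (the domain) are checked.
    if min(xs) <= x_new <= max(xs):
        return "interpolation"
    return "extrapolation"


def predict(x_new, m, b):
    """Evaluate the line y = m*x + b at x_new."""
    return m * x_new + b


# Hypothetical observed inputs and example line parameters.
observed_xs = [1, 2, 3, 4, 5]
m_example, b_example = 0.6, 2.2

for x_new in (3.5, 12):
    label = classify_prediction(x_new, observed_xs)
    y_hat = predict(x_new, m_example, b_example)
    print(f"x = {x_new}: predicted y = {y_hat:.2f} ({label})")
```

Here the prediction at $x = 3.5$ is labeled interpolation because it falls between the smallest and largest observed inputs, while the prediction at $x = 12$ is labeled extrapolation and should be treated with more caution.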

Correlation Coefficient

Not all lines of best fit are created equal. Some data exhibit stronger linear trends than others, and their linear models are more accurate because the data are less scattered. We need a way to quantify how strong a linear trend is, and we can measure this using the correlation coefficient.

The correlation coefficient is a value, $r$, between $-1$ and $1$ which suggests the strength and direction of a linear relationship. If $r > 0$, then we have a positive (increasing) relationship and if $r < 0$, then we have a negative (decreasing) relationship. Also note, if $r$ is closer to $0$ then the data is more scattered and if $r$ is closer to $1$ or $-1$ then the data is less scattered. The less scattered the data is, the stronger the linear relationship is.

Fig. 4 - Correlation Coefficient

Given a set of $(x, y)$ points, where $x_i$ and $y_i$ are the data points, $\bar{x}$ is the mean of the set of $x$, and $\bar{y}$ is the mean of the set of $y$, we can calculate the correlation coefficient using...

\begin{align*} r &= \frac{\sum{(x_i \:-\: \bar{x})(y_i \:-\: \bar{y})}}{\sqrt{\sum{(x_i \:-\: \bar{x})^2}\sum{(y_i \:-\: \bar{y})^2}}} \end{align*}
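
The formula for $r$ translates directly into code. The following is a minimal sketch using only the Python standard library. The function name `correlation_coefficient` and the sample data are assumptions made for illustration, and the inputs are assumed to be equal-length lists of numbers.

```python
import math


def correlation_coefficient(xs, ys):
    """Return r for paired data, following the formula term by term."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Deviations of each point from the means.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when all x or all y values are equal")
    return numerator / denominator


# Hypothetical sample data, chosen only to illustrate the calculation.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_coefficient(xs, ys)
print(f"r = {r:.4f}")
```

For this sample data the numerator is $6$ and the denominator is $\sqrt{10 \cdot 6} = \sqrt{60}$, so $r \approx 0.77$, a fairly strong positive relationship. Points that lie exactly on an increasing line, such as $(1, 2)$, $(2, 4)$, $(3, 6)$, give $r = 1$.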