
The "Simplest" Regression

By Benjamin Chen · January 28, 2022 · 5 min read
Updated: January 22, 2024

Simple Linear Regression, composed of only one response variable and one predictor variable, is the simplest form of regression. Building a simple linear regression model is essentially fitting a straight line to a scatterplot to forecast values of Y (response variable) given X (predictor variable).

The fitted line generalizes the relationship between the X variable and the Y variable, so we can use it as a forecasting tool: given a value of X, the line gives us a corresponding value of Y.

Below, we can see how simple linear regression performs prediction: given X1, the line forecasts the value Y1; given X2, the line forecasts the value Y2.
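As a quick sketch of this idea (assuming NumPy and a small made-up dataset), we can fit a line to a scatterplot and read predictions off it:

```python
import numpy as np

# Hypothetical data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line (degree-1 polynomial); returns slope, intercept
slope, intercept = np.polyfit(x, y, 1)

# Given X1 and X2, the line forecasts Y1 and Y2
x1, x2 = 2.5, 4.5
print("Y1:", slope * x1 + intercept)
print("Y2:", slope * x2 + intercept)
```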

Equation of a Line


You might recall from your algebra class in high school that a line can be represented mathematically in the following form:

y = mx + b

where m is the slope and b is the y-intercept.


In linear regression, because we are also fitting a straight line, the line is denoted as:

ŷ = β0 + β1x

We can see that the linear regression formula is basically the same as the line equation, just with different notation: b is written as β0 and m is written as β1. The Y predicted by the line wears a "hat" (we call it y-hat, written ŷ).

Error Term


If we use a line to make predictions, we will inevitably make errors. Below, we have the red line as our linear model for predicting the Y value. We can observe a discrepancy between the predicted Y value and the actual (observed) Y value. This discrepancy is the error of prediction, denoted by e (that is, e = y − ŷ). The error term is also called the residual.

An error exists for every single observation in the dataset. Taken together, the errors tell us how well the red line fits the data, and this is exactly how we determine the best-fitting line: it should minimize the sum of squared errors, SSE = Σ(yᵢ − ŷᵢ)². Below, we can see that the sum of squared errors is the total area of the boxes. The line that minimizes SSE is also called the least-squares regression line (because it minimizes the area of the squares).
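As a minimal sketch (assuming NumPy and made-up data), the residuals and the SSE of a candidate line can be computed like this:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A candidate line y_hat = b0 + b1 * x (coefficients chosen only for illustration)
b0, b1 = 0.2, 1.9
y_hat = b0 + b1 * x

residuals = y - y_hat              # error for each observation
sse = np.sum(residuals ** 2)       # sum of squared errors
print("Residuals:", residuals)
print("SSE:", sse)
```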

So how exactly do we find the parameters of this least-squares regression line (what are its slope and y-intercept)? Thankfully, some mathematicians did the dirty work for us, and we can compute them with the formulas below:

b1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

b0 = ȳ − b1·x̄

For b1, we can even calculate it using the coefficient of correlation, b1 = r·(sy / sx), where sy and sx are the sample standard deviations of Y and X. Both formulas are viable.
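Here is a minimal sketch of those formulas in code (assuming NumPy; the data is made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: b1 = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

# Equivalent slope via the correlation coefficient: b1 = r * (s_y / s_x)
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * (y.std(ddof=1) / x.std(ddof=1))

print(b0, b1, b1_alt)  # b1 and b1_alt agree
```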

Predictions


Now that we know how to calculate the parameters of the best fitting line, we can deploy the model for prediction. Consider the following example of age vs the number of cats owned:

The graph gives us the equation of the best-fitting line:

ŷ = 0.2934 + 0.0629x

We can use this line to predict the number of cats owned, using age as the input. For instance, how many cats do you expect a cat owner to have by age 60? To find the answer, we plug the X value (age) into the equation and read off the predicted Y:

ŷ = 0.2934 + 0.0629 × 60 = 4.0674

We expect a cat owner to own 4.0674 cats by the age of 60. Easy!
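The same plug-in step in code (a trivial sketch reusing the coefficients read off the graph):

```python
# Fitted line from the example: y_hat = 0.2934 + 0.0629 * x
b0, b1 = 0.2934, 0.0629

age = 60
predicted_cats = b0 + b1 * age
print(predicted_cats)  # 4.0674
```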


Interpretations

You might have already noticed by now what the y-intercept (b0) and the slope (b1) stand for. But just in case, let's clarify the meaning of these two parameters and how to interpret them.


Interpreting b0:

Recall the least-squares regression line from the previous example.

If we plug in 0 as the X variable, the predicted Y value is simply b0:

ŷ = 0.2934 + 0.0629 × 0 = 0.2934

This means we can interpret b0 (0.2934) as "the number of cats you expect to own at 0 years old". Of course, it doesn't make much sense to be 0 years old; in fact, on most occasions it wouldn't make sense for the X variable to be 0, so this interpretation shouldn't be taken too seriously.


Interpreting b1:

The interpretation of b1 is rather straightforward: for every one-unit increase in X, the predicted Y value increases by b1. So in this example, our interpretation of b1 would be "for every one-year increase in age, the expected number of cats owned increases, on average, by 0.0629".
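To see this concretely, compare the predictions for two ages one year apart:

ŷ(31) − ŷ(30) = (0.2934 + 0.0629 × 31) − (0.2934 + 0.0629 × 30) = 0.0629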

Coefficient of Determination (R-square)


Okay, now let's come back to SSE for a moment. Remember that SSE tells us how much error the regression line makes. The problem with SSE is that it depends on the scale of the variable, so we can't use it to directly compare regression lines fitted to different variables and see which one fits better.


So statisticians came up with a better measure known as the coefficient of determination (R-square). Let’s visualize R-square using an example.

Above, we have a set of data points and two lines. The dotted line is our least-squares regression line and the solid line is a horizontal line placed at the mean value of the Y variable. We also see three terms here: SSE, SST and SSR. Let’s define SSE and SST first.


We already know what SSE is. The SSE (sum of squared errors) is determined by the differences between the least-squares regression line and the actual values. The SST (total sum of squares), on the other hand, is determined by the differences between the horizontal line (the Y mean) and the actual values, SST = Σ(yᵢ − ȳ)². So, what does SST actually mean?

Imagine that we don't fit any line to the data points. The simplest baseline model we could use to predict Y is a horizontal line at the Y mean: regardless of the input X, we always predict the Y mean. The SST is essentially the sum of squared errors for this horizontal line.


Now let's say we do fit a least-squares regression line to the data points. The regression line will surely do a much better job of prediction than just a horizontal line at the Y mean, which means the SSE will be smaller than the SST. Our goal is to measure how much smaller the SSE is than the SST (that is, how much we improve by using the regression line and the predictor variable X rather than just the sample mean to predict Y). This measure is the coefficient of determination (R-square):

R² = (SST − SSE) / SST = 1 − SSE / SST

R-square is unit-free with values between 0 and 1, inclusively. The higher the R-square, the better the fit (stronger linear association between X and Y).
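As a sketch (again assuming NumPy and made-up data), R-square falls out directly from SSE and SST:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit (np.polyfit returns slope, then intercept)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)      # error left after fitting the line
sst = np.sum((y - y.mean()) ** 2)   # error of the "always predict the mean" baseline

r_squared = 1 - sse / sst
print("R-square:", r_squared)
```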


We can interpret R-square in two ways: (say R-square is 0.80)

  1. 80% of the sample variability in Y is explained by its linear dependency on X

  2. By taking the linear dependence on X into account, the SSE is reduced by 80%

In the case of simple linear regression (only one predictor variable X), R-square and the coefficient of correlation are directly related: R² = r², so r = ±√R².

Say that R-square is again 0.8 and that the slope b1 is positive; then r = √0.8 ≈ 0.894.

If instead the slope b1 is negative, you should add a negative sign: r = −√0.8 ≈ −0.894.


SSR (Sum of Square Regression)

Awesome! Now we know the difference between SSE and SST. What about SSR? Looking back at the picture, the SSR (sum of squares due to regression) is the sum of squared differences between the predicted Y values (on the regression line) and the Y mean, SSR = Σ(ŷᵢ − ȳ)². The bigger the SSR, the greater the difference between the regression line and the horizontal line, which means a better fit.

The relationship between the three sums of squares can be summarized as:

SST = SSR + SSE

Thus, R-square can also be formulated as:

R² = SSR / SST

Here, we see again that the higher the SSR, the higher the R-square, indicating a better fit.
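A quick sketch (same assumed data as before) verifying the identity SST = SSR + SSE and the equivalent formula R² = SSR / SST:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)          # regression line vs. actual values
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression line vs. the Y mean
sst = np.sum((y - y.mean()) ** 2)       # Y mean vs. actual values

print(np.isclose(sst, ssr + sse))       # True: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)         # both equal R-square
```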

Conclusion


Yay! Congrats on reaching the end! This story introduced the foundation of regression: simple linear regression. We talked about how to find the best-fitting line (the least-squares regression line) and how to interpret it. We also went over the different sums of squares and the coefficient of determination, which quantify how well a line fits. Take some time to digest the contents of this story. In the next one, we will talk about inferences about the slope.
