
Investigating the Relationship between Two Numerical Variables

  • Benjamin Chen
  • January 20, 2022
  • 4 min read

Updated: January 22, 2024

Recall that regression is the study of relationships between variables. Before we move on to building regression models for prediction, we must master several concepts related to the association between variables. We'll focus primarily on the association between two numerical variables.


The focus is on numerical variables because even categorical variables must be encoded as numbers before a computer can run the analysis. We're starting with just two variables, one response and one predictor, because this is the minimum requirement for regression. In fact, a regression with only one predictor variable is called simple linear regression. In contrast, multiple linear regression, which we study later, gets the adjective "multiple" because it involves two or more predictor variables.

Association between Two Numerical Variables


We must first understand some concepts related to the association between two numerical variables. Consider a dataset that records 10 people's heights and weights. Let X be Height (cm) and Y be Weight (kg).

We can visualize the association between two numerical variables using a scatterplot. A scatterplot of these data shows that as height increases, weight tends to increase as well. This makes perfect sense, because taller people are typically heavier. We would describe the association between X and Y as a positive linear association.

  • The word “positive” refers to the positive slope: when X increases, Y also increases.

  • The word “linear” refers to the straight-line trend between the X and Y variables: a straight line drawn through the data points captures their association quite well.

Likewise, if X and Y display a negative slope in approximately a straight line, we say the variables have a negative linear association. If no obvious trend can be observed, the two variables likely have no relationship. Here is an example: IQ vs. height.


The data points are scattered all over the place, indicating no relationship. This also makes sense, because a taller person isn't necessarily smarter.
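If you want to reproduce plots like these yourself, here is a minimal sketch using simulated (made-up) numbers; the trend coefficients are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Made-up data for illustration only.
height = rng.uniform(150, 195, size=30)                 # cm
weight = 0.9 * height - 75 + rng.normal(0, 4, size=30)  # kg, positive linear trend
iq = rng.normal(100, 15, size=30)                       # unrelated to height

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(height, weight)
ax1.set(xlabel="Height (cm)", ylabel="Weight (kg)", title="Positive linear association")
ax2.scatter(height, iq)
ax2.set(xlabel="Height (cm)", ylabel="IQ", title="No association")
plt.tight_layout()
plt.show()
```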

Covariance


Next, I want to introduce a term called covariance. Covariance measures the degree of linear association: the stronger the linear association, the greater the magnitude of the covariance, and vice versa. Many people confuse “degree of linear association” with slope.


The degree of linear association refers to how closely the data points resemble a straight line.


The slope has nothing to do with covariance, except when it is 0, in which case the covariance is also 0 (no linear relationship). We can calculate the covariance using the following formula:
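For a sample of n paired observations, the sample covariance is

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y. (Some texts divide by n instead of n − 1; the idea is the same.) Each term in the sum is positive when $x_i$ and $y_i$ fall on the same side of their respective means, so data that consistently move together produce a large positive covariance.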

There is, however, one major problem with covariance: its value depends on the units used to measure the X and Y variables. Consider the previous weight (kg) vs. height (cm) example. If height were measured in meters instead, the values of X would shrink by a factor of 100, and the covariance would shrink by the same factor, even though the linear association between weight and height didn’t change. Do you see the problem yet?


You can’t directly compare covariances across different variables or units.
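Here is a quick numerical check, using made-up height and weight values purely for illustration:

```python
import numpy as np

# Made-up heights (cm) and weights (kg), for illustration only.
height_cm = np.array([150, 155, 160, 165, 170, 175, 180, 185, 190, 195])
weight_kg = np.array([50, 54, 57, 62, 66, 70, 75, 80, 84, 90])

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(X, Y).
# It uses the sample (n - 1) denominator by default.
print(np.cov(height_cm, weight_kg)[0, 1])        # ~202 (cm·kg)
print(np.cov(height_cm / 100, weight_kg)[0, 1])  # same data with height in m: ~2.02
```

The covariance shrinks by exactly the conversion factor, so its raw size tells you nothing about the strength of the association.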

Coefficient of Correlation


To fix this problem, statisticians came up with another statistic, the coefficient of correlation. It measures the strength of the linear association between two variables and is not affected by the variables’ units. In short, it’s a refined version of covariance. You can calculate the coefficient of correlation with the following formula:
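$$r = \frac{s_{xy}}{s_x s_y}$$

where $s_x$ and $s_y$ are the sample standard deviations of X and Y. Dividing the covariance by both standard deviations cancels out the units, which is exactly what fixes the problem above.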

The coefficient of correlation will always fall between -1 and 1.

  • If the data points lie perfectly on a straight line with a positive slope, then the coefficient of correlation is 1.

  • If the data points lie perfectly on a straight line with a negative slope, then the coefficient of correlation is -1.

Again, I would like to stress that the magnitude of the slope does not matter! Below, we have two sets of data points, both sitting perfectly on a straight line. Despite the different slopes, both coefficients of correlation (r) are equal to 1.
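A quick sanity check in code, with arbitrary made-up points:

```python
import numpy as np

x = np.arange(1, 11)

# Two perfect straight lines with very different slopes.
y_steep = 5.0 * x + 2
y_shallow = 0.1 * x + 2

# np.corrcoef returns the correlation matrix; entry [0, 1] is r.
print(np.corrcoef(x, y_steep)[0, 1])    # 1.0
print(np.corrcoef(x, y_shallow)[0, 1])  # 1.0 -- slope magnitude doesn't change r
```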

Let’s go through more examples of the coefficient of correlation.

Above, you can see that as the data points get closer to forming a straight, positively sloped line, r gradually increases toward 1. Pretty simple, right? Before we proceed, I do want to bring your attention to one particular case.

In the example above, the data points form a clear curved shape, yet r = 0. We can plainly see an association between the X and Y variables, so why is the coefficient of correlation 0? Remember that the coefficient of correlation (and covariance) measures only the strength of the LINEAR association. In this example, X and Y exhibit a non-linear relationship (not a line), which the coefficient of correlation fails to capture. The takeaway is that r = 0 does not mean the two variables are unrelated; they are just not linearly related, though they could be non-linearly related.
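You can verify this yourself: a parabola symmetric about 0 is a perfectly deterministic relationship whose r is (essentially) 0.

```python
import numpy as np

# A perfectly deterministic but non-linear relationship: y = x^2.
x = np.linspace(-5, 5, 101)   # symmetric around 0
y = x ** 2

# The positive and negative halves cancel in the covariance,
# so r is ~0 even though y is completely determined by x.
print(np.corrcoef(x, y)[0, 1])
```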

Correlation and Causation


Finally, you might have heard the expression “correlation does not imply causation.” Let’s see what this statement means by considering sunglasses sales vs. ice cream sales.

We can see that sunglasses sales and ice cream sales exhibit a strong positive linear correlation. But that doesn’t mean sunglasses sales are driving ice cream sales up. A more reasonable explanation is that rising temperatures simultaneously drive up the sales of both sunglasses and ice cream. This is a classic example of “correlation does not imply causation”: the two variables share a common cause rather than causing each other.

Conclusion


Perfect! Now that we’re more familiar with these concepts and terms, we can advance to actually building a regression model! See you in the next post, where we will build and analyze the most basic form of regression: simple linear regression.
