Sampling Distribution Clearly Explained!
- Benjamin Chen
- February 22, 2022
- 5 min read
Updated: January 22, 2024
In this story, we will cover the sampling distribution. Based on my previous teaching experience, many students misunderstand what a sampling distribution is and confuse it with a 'sample distribution'. It's SUPER important that you master the sampling distribution because upcoming chapters like confidence interval estimation and hypothesis testing are all built upon it. Anyways, let's begin!
Population Distribution
First, let's begin by defining the population distribution.
Assume the ages in a population of N observations (capital N denotes the population size) are normally distributed with a mean of 40 and a standard deviation of 5. The distribution would look like this:

The variable on the x-axis is age, which means the distribution is comprised of age values.
The population mean (μ) = 40
The population standard deviation (σ) = 5
Based on this population distribution, I can calculate the area under the curve for the probability of a randomly selected person (from the population) being within a range of age. For example, the probability that a randomly selected person is between ages 34–36 is the area shaded in yellow. The total area under the curve is 1 and you can find the shaded area using the Z-table after standardization. We covered these concepts in the previous story (Statistics 8: Standardization).
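As a quick check of the shaded-area idea, here is a minimal sketch that computes the same probability in Python instead of looking it up in a Z-table. It uses the standard library's NormalDist class; the variable names are only illustrative and are not from the original story.

```python
from statistics import NormalDist

# Population from the example: age ~ Normal(mean=40, sd=5)
population = NormalDist(mu=40, sigma=5)

# P(34 < age < 36) is the area under the curve between the two ages
prob = population.cdf(36) - population.cdf(34)

# Same area via standardization: z = (x - mu) / sigma
z_low = (34 - 40) / 5     # -1.2
z_high = (36 - 40) / 5    # -0.8
standard = NormalDist()   # standard normal: mean 0, sd 1
prob_z = standard.cdf(z_high) - standard.cdf(z_low)

print(round(prob, 4))     # roughly 0.097
print(round(prob_z, 4))   # matches the value above
```

Both routes give the same area, which is why a Z-table lookup after standardization works for any normal population.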

Sampling Distribution
Next, we have sampling distribution. Many people confuse sampling distribution with the distribution of a sample. Let’s take a look at what it really is.
Population and Sample

Most people already know the difference between a population and a sample. Because a population may be too large to collect data from every single entity, we often collect data from only a sample and use it to represent the whole population. Measures that describe the sample are called sample statistics. For example, if I randomly select 30 people from the population as my sample, I can calculate a sample mean age from those 30 people’s ages. This sample mean is a sample statistic, and I can use it as an estimate of the population mean. The sample mean is an unbiased estimator of the population mean.
This is what inferential statistics is all about. We use sample statistics as estimates for population parameters (measures that describe the population such as population mean).
Sampling Distribution
Now let’s assume I really did select 30 people out of the population as my sample and calculated the sample mean (denoted by X̄).
You can imagine that the sample mean might be slightly different each time I select a different sample of 30 people.
For example, in one sample, it might just happen that all 30 people in the sample are young people, leading to a smaller sample mean age. Whereas in another sample, all 30 people happen to be older, leading to a higher sample mean age.
Now, let’s say we extract many many many samples (each of size n) from the population. For each sample, we calculate the sample mean, so we end up with many many many sample means. Based on what we just discussed, these sample means all differ slightly from each other because each sample consists of different observations. We can plot the distribution of the many many many sample means that we just obtained, and the resulting distribution is what we call the sampling distribution (here, the sampling distribution of the sample mean).
Can you now see the difference between the population distribution and the sampling distribution? In our example, the population distribution consists of many age values, whereas the sampling distribution consists of many sample mean age values. The distribution would look somewhat like this:
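The repeated-sampling idea can be sketched as a small simulation. This is only an illustration under stated assumptions: the population is the normal age population from the example (mean 40, standard deviation 5), n is 30, and 5,000 repeated samples stand in for "many many many". The names draw_sample_mean and NUM_SAMPLES are made up for this sketch.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 40, 5      # population mean and standard deviation from the example
N_PER_SAMPLE = 30      # sample size n
NUM_SAMPLES = 5_000    # assumed stand-in for "many many many" samples

def draw_sample_mean():
    """Draw one random sample of n ages and return its sample mean."""
    sample = [random.gauss(MU, SIGMA) for _ in range(N_PER_SAMPLE)]
    return mean(sample)

# The collection of sample means is an empirical sampling distribution
sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

print(len(sample_means))                      # one sample mean per sample
print(min(sample_means), max(sample_means))   # the range of the sample means
```

Plotting a histogram of sample_means would give a picture of the sampling distribution: each value in the list is a sample mean age, not an individual person's age.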

The variable on the x-axis is the sample mean age, which means the distribution is comprised of many sample mean age values.
You can also see that the values on the x-axis are all fairly close to 40 (the population mean), because each sample mean is unlikely to be far from the true population mean.
What about the mean and standard deviation of the sampling distribution?
Mean and Standard Deviation of Sampling Distribution
The mean of the sampling distribution can also be described as the mean of the sample means. Let that sink in for a second. Likewise, the standard deviation of the sampling distribution is the standard deviation of the sample means. The standard deviation of the sampling distribution is also called the standard error. We’ll come back to this term later. In the usual notation, the mean of the sampling distribution is written μ_X̄ and its standard deviation (the standard error) is written σ_X̄:

We can derive these two values from the population mean (μ) and population standard deviation (σ). Their relationships are as follows:

The mean of the sampling distribution will be the same as the population mean
The standard deviation of the sampling distribution (or standard error) equals the population standard deviation divided by the square root of n (where n is the sample size of each of the many many many samples extracted)
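Written in symbols, the two relationships above look like this. The subscript notation is the standard textbook one, and the second formula assumes that the n observations in each sample are drawn independently:

```latex
\[
\mu_{\bar{X}} = \mu,
\qquad
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
\]
```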
So in our example, we can calculate the mean and standard deviation of the sampling distribution as follows (in green):
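Plugging in the example's values (μ = 40, σ = 5, and n = 30), the arithmetic works out roughly as:

```latex
\[
\mu_{\bar{X}} = \mu = 40,
\qquad
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{30}} \approx 0.913
\]
```

So while individual ages vary with a standard deviation of 5 years, the mean age of a sample of 30 people varies with a standard deviation of less than 1 year.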

You can see that the standard deviation of the sampling distribution depends on the sample size n, and this totally makes sense! Imagine that the sample size of each of the many many many samples you extracted is large. The larger the sample size, the closer each sample mean tends to be to the true population mean. This means the standard deviation of the sample means (the standard error) will be smaller, so the sampling distribution will be narrower.
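To see the effect of n in numbers, here is a minimal sketch that compares the formula σ/√n with the spread of simulated sample means. The sample sizes (10, 30, 100), the 2,000 repetitions, and the helper names are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 40, 5      # population mean and standard deviation from the example
NUM_SAMPLES = 2_000    # assumed number of repeated samples per sample size

def theoretical_se(sigma, n):
    """Standard error from the formula sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def simulated_se(n):
    """Standard deviation of many simulated sample means of size n."""
    means = [
        mean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(NUM_SAMPLES)
    ]
    return stdev(means)

# Map each sample size to (formula value, simulated value)
results = {n: (theoretical_se(SIGMA, n), simulated_se(n)) for n in (10, 30, 100)}

for n, (theory, simulated) in results.items():
    print(n, round(theory, 3), round(simulated, 3))
```

The formula column gives about 1.581, 0.913, and 0.5 for n = 10, 30, and 100. The simulated column should land close to those values and shrink in the same way as n grows.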
Meaning of Sampling Distribution
Now, let’s take a step back to really understand what the sampling distribution represents.
The sampling distribution consists of many many many sample means
We also said the sample mean is an unbiased estimator of the population mean
Given the two points above, every single sample mean that makes up this sampling distribution is an estimate of the population mean. So in a sense, you can view the entire sampling distribution as an estimate of the true population mean. The sampling distribution just expresses the estimate as a distribution.
By expressing the estimate as a distribution, we get an idea not only of the true population mean, but also of the amount of estimation error due to sampling. (The greater the standard deviation of the sampling distribution, the further a single sample mean is likely to be from the true population mean.)
Traits of Sampling Distribution
The normality of the sampling distribution depends on two things:
the shape of the population distribution
the sample size n
If the population distribution is normal, then the sampling distribution will also be normal regardless of the sample size.
If the population distribution is not normal, then the shape of the sampling distribution will depend on the sample size n.
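If the sample size is small (a common rule of thumb is fewer than 30), the sampling distribution may not be close to normal, especially when the population is strongly skewed.
If the sample size is large enough (a common rule of thumb is 30 or more), the sampling distribution will be approximately normal regardless of the shape of the population distribution, provided the population has a finite variance. This result is called the central limit theorem.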
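The effect of sample size on the shape can be sketched with a simulation. As an assumption for illustration, the population here is exponential (strongly right-skewed), which is not part of the original example. Skewness is used as a rough measure of how far a distribution is from the symmetric normal shape, and the helper names are made up for this sketch.

```python
import random
from statistics import mean, pstdev

random.seed(2)  # fixed seed so the run is repeatable

NUM_SAMPLES = 5_000  # assumed number of repeated samples

def skewness(values):
    """Skewness of the values; it is 0 for a symmetric shape such as the normal."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

def sample_means(n):
    """Means of repeated samples of size n from a right-skewed exponential population."""
    return [
        mean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(NUM_SAMPLES)
    ]

skew_n2 = skewness(sample_means(2))
skew_n30 = skewness(sample_means(30))

print(round(skew_n2, 2))    # clearly right-skewed for n = 2
print(round(skew_n30, 2))   # much closer to 0 for n = 30
```

With n = 2 the sample means are still clearly skewed. With n = 30 the skewness is much smaller but not exactly zero, which is why "approximately normal" is the safer wording.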
Conclusion
Hopefully, this story gave you a better idea of what sampling distribution is. This is quite a difficult topic to comprehend especially if this is your first time studying it. Once you understand sampling distribution, confidence interval estimation, our next topic, will become super easy! Good luck!


