
How to Make Inferences About Proportion Like a Pro

  • Benjamin Chen
  • March 29, 2022
  • 4 min read

Updated: January 22, 2024

In this story, we’ll be moving on to inference about proportions. Up until now, we’ve focused solely on inference about means. We’ve built the sampling distribution of the sample mean, estimated confidence intervals for the population mean, and run hypothesis tests on the population mean. Everything was about means. But there are many more statistical measures out there, and it’s time we move on to another one: proportions. This story will quickly skim through the sampling distribution, confidence interval estimation, and sample size determination for proportions. So if you haven't already mastered those concepts for the mean, I recommend you review them first.


Proportion


When we talk about the mean, the measure is usually used to describe numerical variables. Proportions, on the other hand, are usually used to describe categorical variables. Take gender as an example. If out of 100 people (population), 45 are male, then the population proportion of males would be 0.45. The notation for a population proportion is π.

Now I can also take a portion of the population as my sample. Say I take a sample of size 20, and 8 of those 20 people are male. Then my sample proportion would be 8/20 = 0.4. The notation for the sample proportion is p.

You should notice where we are going with this one. Just like the sample mean, different samples would have slightly different sample proportions because each sample is composed of different values. This means we can also construct a sampling distribution for sample proportion!
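
To see this variation in action, here is a minimal sketch that reuses the population from the example above (100 people, 45 of them male) and draws a few samples of size 20. The function name `sample_proportion` and the fixed seed are my own choices for illustration.

```python
import random


def sample_proportion(population, n, rng):
    """Draw n members without replacement and return the share that are male."""
    sample = rng.sample(population, n)
    return sum(1 for person in sample if person == "male") / n


# Population from the example above: 100 people, 45 of them male (pi = 0.45).
population = ["male"] * 45 + ["female"] * 55

rng = random.Random(0)  # fixed seed so the run is repeatable
proportions = [sample_proportion(population, 20, rng) for _ in range(5)]
print(proportions)  # five sample proportions, one per sample
```

Each run of `sample_proportion` picks a different set of 20 people, so the printed values will generally differ from one another and from the population proportion 0.45.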


Sampling Distribution for Sample Proportion


The mean and standard deviation of the sampling distribution for the sample proportion can be derived as follows:
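
In the notation of this post, with n as the sample size, the standard textbook results can be written as below. The labels \mu_p and \sigma_p are my own shorthand for the mean and standard deviation of the sampling distribution.

```latex
\mu_p = \pi, \qquad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}
```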

To check the normality of the sampling distribution for a sample proportion, we have to ensure three conditions are satisfied:

If you remember the sampling distribution for the sample mean, you’ll notice that the latter two conditions are new. They are there to ensure that the population proportion π is not too close to 0 or 1. A proportion only spans from 0 to 1. So if the population proportion π is too close to 0 (which puts the mean of the sampling distribution close to 0 as well), the sampling distribution may get cut off at 0 and thus not follow a normal distribution. The same logic applies when the population proportion π is too close to 1.
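
A rough sketch of how the two proportion-specific conditions could be checked in code. This assumes they take the common textbook form "n·π and n·(1 − π) must both reach some minimum count". The default threshold of 5 is an assumption on my part (some textbooks use 10), and the function name is hypothetical.

```python
def proportion_conditions_ok(n, pi, threshold=5):
    """Check that both expected counts, n*pi and n*(1 - pi), reach the threshold.

    threshold=5 is an assumed rule of thumb, not a value taken from this post.
    """
    expected_successes = n * pi
    expected_failures = n * (1 - pi)
    return expected_successes >= threshold and expected_failures >= threshold


print(proportion_conditions_ok(20, 0.45))  # expected counts are about 9 and 11
print(proportion_conditions_ok(20, 0.05))  # only about 1 expected success
```

The second call shows the cut-off problem described above: with π = 0.05 and n = 20, the distribution sits so close to 0 that the check fails.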


After we determine that the sampling distribution (of sample proportion) is normal, we can then apply the following formula to standardize the sampling distribution to a Z distribution.
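
Written out in this post's notation (p for the sample proportion, π for the population proportion, n for the sample size), the usual standardization is:

```latex
Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}
```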

This allows us to find the area under the curve (probability) using the Z-table, as well as perform confidence interval estimations and hypothesis testing on the population proportion. We’ll take a deeper look into these procedures soon.
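
As a small illustration of the "area under the curve" step, the sketch below standardizes a sample proportion and looks up the probability with Python's built-in `statistics.NormalDist` instead of a printed Z-table. The numbers (π = 0.45, n = 100, cut-off 0.40) and the function name are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_proportion_below(x, pi, n):
    """P(p <= x) under the normal approximation to the sampling distribution."""
    standard_error = sqrt(pi * (1 - pi) / n)
    z = (x - pi) / standard_error  # standardize to a Z score
    return NormalDist().cdf(z)     # area to the left of z


# Hypothetical question: if pi = 0.45 and n = 100, how likely is p <= 0.40?
print(round(prob_sample_proportion_below(0.40, 0.45, 100), 4))
```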


Confidence Interval Estimation


The purpose of performing confidence interval estimation is to estimate the population parameter, which in this case would be the population proportion π. This means we don’t know the population proportion π, which means we also wouldn’t be able to determine the standard deviation of the sampling distribution (for sample proportion).

We can estimate the standard deviation of the sampling distribution (for sample proportion) with its sample-based counterpart, the standard error, which uses the sample proportion p instead of the population proportion π.
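
In symbols, with n as the sample size, the substitution amounts to:

```latex
\sqrt{\frac{\pi(1-\pi)}{n}} \;\approx\; \sqrt{\frac{p(1-p)}{n}}
```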

This slight change means that the conditions to determine whether our sampling distribution (for sample proportion) is normal would also be slightly modified.

Because the population proportion π is unknown, we basically replace it with the sample proportion p. After we are able to determine that the sampling distribution (of sample proportion) is normal, we can apply the following formula to perform confidence interval estimation for the population proportion.
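
The standard form of this interval, in the notation used here, is shown below. Z_{\alpha/2} stands for the critical value of the standard normal distribution at the chosen confidence level (about 1.96 for 95%).

```latex
p \;\pm\; Z_{\alpha/2}\,\sqrt{\frac{p(1-p)}{n}}
```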

By now, you should already know why the formula takes this form. If you need a refresher, Statistics 10: Confidence Interval Estimation should be a good review.


There is one small characteristic of confidence interval estimation for population proportion that you have to be careful of. Remember when I said how proportion can only span from 0 to 1? This means that the proportion will never go below 0 or above 1. Therefore,

  • If the lower boundary of your confidence interval estimate goes below 0, we must replace the lower boundary with 0

  • If the upper boundary of your confidence interval estimate goes above 1, we must replace the upper boundary with 1
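
Putting the interval and the two boundary rules together, a minimal sketch could look like this. It assumes the usual normal-approximation interval p ± Z·sqrt(p(1 − p)/n). The function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a proportion, clipped to [0, 1]."""
    p = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    lower = max(0.0, p - margin)  # a proportion cannot go below 0
    upper = min(1.0, p + margin)  # a proportion cannot go above 1
    return lower, upper


print(proportion_confidence_interval(8, 20))  # the 8-out-of-20 sample
print(proportion_confidence_interval(1, 20))  # lower bound gets clipped to 0
```

The second call is only there to show the clipping. A sample with so few successes would usually fail the normality conditions in the first place, so the interval itself should not be trusted.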


Sample Size Determination


Speaking of confidence interval estimation, we mentioned in Statistics 11: Determining Sample Size that we can rearrange the confidence interval formula so that it is expressed in terms of the sample size. The right-hand side of the confidence interval formula (the term added to and subtracted from the point estimate) is the sampling error.
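
Using E for the sampling error and Z_{\alpha/2} for the critical value at the chosen confidence level, the standard expression is:

```latex
E = Z_{\alpha/2}\,\sqrt{\frac{\pi(1-\pi)}{n}}
```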

Note: You'll notice that the equation above uses π instead of p. This is, in fact, the original proper form. Only when π is unknown do we estimate it with p.


We can manipulate the sampling error equation such that it becomes:
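
Squaring both sides of E = Z_{\alpha/2}\sqrt{\pi(1-\pi)/n} and solving for n gives the usual sample size formula:

```latex
E^2 = Z_{\alpha/2}^{2}\,\frac{\pi(1-\pi)}{n}
\quad\Longrightarrow\quad
n = \frac{Z_{\alpha/2}^{2}\,\pi(1-\pi)}{E^{2}}
```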

We can use the equation above to find the sample size needed to keep the sampling error within E.


If π is unknown, we simply replace π with p.

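When even p is unknown (for example, before any sample has been taken), we use 0.5 in its place. We choose 0.5 because it is the safest estimate: the product π(1 − π) is largest at π = 0.5, so this choice gives the largest required sample size, one that is large enough whatever the true proportion turns out to be.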
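
Here is a short sketch of the sample size calculation, assuming the standard formula n = Z²·π(1 − π)/E² and rounding up to the next whole person. The function name, the default of 0.5, and the example inputs are my own choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(error, confidence=0.95, proportion=0.5):
    """Smallest n that keeps the sampling error within `error`.

    proportion defaults to 0.5, the conservative choice when neither
    pi nor p is known.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * proportion * (1 - proportion) / error ** 2
    return ceil(n)  # always round up so the error target is still met


print(sample_size_for_proportion(0.05))                  # no prior estimate
print(sample_size_for_proportion(0.05, proportion=0.4))  # using p = 0.4
```

Any proportion other than 0.5 produces a sample size that is the same or smaller, which is why 0.5 is the safe fallback.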


Conclusion

Let's wrap up this story here. In this story, we covered the sampling distribution, confidence interval estimation, and sample size determination for proportions. If you have read my previous posts, you should already be familiar with these concepts, which is why I didn't go deep into the details here. In the next story, we will cover hypothesis testing for proportions.



©2022 by Ben's Blog. Proudly created with Wix.com
