Statistical sampling is one of those concepts that, for some reason, gives a lot of people trouble. There is a book by Daniel Kahneman titled “Thinking, Fast and Slow” that I always recommend. One of the ideas Kahneman writes about is System 1 thinking (fast and intuitive) vs. System 2 thinking (slow and deliberate). I suspect that people who feel comfortable dealing with uncertainty and probabilistic models are good System 2 thinkers.
So, what is statistical sampling? To keep things simple, let’s start with an example.
Suppose that we work at a light bulb factory that makes millions of light bulbs per day, and that we are in charge of testing them. We could test for many things but, to keep things simple, let’s say that we are testing whether a light bulb can survive a drop from 3 feet onto concrete. I am not sure why anyone would want that, but it sounds like fun.
There is one sure way to know exactly what percentage of our light bulbs would pass the test: drop every single light bulb onto a concrete floor.
The problem with that approach is that we need the light bulbs for other things, such as selling them to make a profit.
Statistics gives us a way to test the light bulbs and have less broken glass to sweep.
Using statistical sampling, we could take a small sample of light bulbs and use them to represent the entire batch. We just need to follow some simple rules:
- We must choose the light bulbs in our sample at random
- Each light bulb must have an equal chance of being selected
- We must select a large enough sample size… (more on that in another post)
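To make the first two rules concrete, here is a minimal Python sketch of drawing a simple random sample. The batch of one million serial numbers, the sample size of 1,000, and the helper name `draw_sample` are all made up for illustration; the point is only that every bulb has the same chance of being picked and no bulb is picked twice.

```python
import random

def draw_sample(batch_ids, sample_size, seed=None):
    """Pick sample_size distinct bulbs, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(batch_ids, sample_size)  # sampling without replacement

# Hypothetical serial numbers for one batch of bulbs (an assumption).
batch_ids = range(1_000_000)

sample = draw_sample(batch_ids, 1000, seed=42)
print(len(sample), "bulbs selected for the drop test")
```

Picking "the first thousand off the line" would break the rules, because bulbs made later in the day would have no chance of being tested.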
So, if we sampled 1,000 light bulbs at random and 100 of them broke (meaning 900 survived), statistics allows us to say that 90% of all the light bulbs we produce can survive a fall of 3 feet onto a concrete floor.
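If you would like to see this at work, here is a small simulation sketch. Everything in it is an assumption made for illustration: a pretend batch of 100,000 bulbs, a "true" survival rate of 90% that we would never know in real life, and a hypothetical helper called `estimate_survival_rate`. We break only 1,000 bulbs and use the share that survived as our estimate for the whole batch.

```python
import random

def estimate_survival_rate(population, sample_size, seed=None):
    """Estimate the share of surviving bulbs from a simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size  # True counts as 1, False as 0

# Pretend batch: True means the bulb would survive the drop.
# The 90% true rate is an assumption made for this simulation only.
batch_rng = random.Random(0)
population = [batch_rng.random() < 0.90 for _ in range(100_000)]

estimate = estimate_survival_rate(population, 1000, seed=1)
print(f"Estimated survival rate: {estimate:.1%}")
```

Run it with different seeds and the estimate moves a little each time, but it stays close to the true rate of the batch.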
What scares people off is what follows: statistical processes are probabilistic by nature. So the result of our drop test is not really 90%; it is actually “around 90%”, in other words, a range of values called a “confidence interval”.
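For the curious, here is a rough sketch of how such a range could be computed for our drop test. It uses the normal-approximation (Wald) interval, which is just one common textbook choice and my assumption here, along with a 95% confidence level (z = 1.96). The helper name `wald_interval` is hypothetical.

```python
import math

def wald_interval(survived, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence (an assumption here).
    """
    p = survived / n                         # observed survival rate
    margin = z * math.sqrt(p * (1 - p) / n)  # margin of error
    return p - margin, p + margin

# 900 of 1,000 sampled bulbs survived the drop.
low, high = wald_interval(900, 1000)
print(f"{low:.3f} to {high:.3f}")  # prints "0.881 to 0.919"
```

So "around 90%" here means roughly 88% to 92%. A larger sample would narrow that range, which is why sample size matters.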
We are not going to get that deep into statistical sampling in this post, but I wanted to start things off with a very simple scenario that shows the benefit of statistical sampling.