Let's go over the numbers in our video one more time.
There are 20 households with 5 children and 100 households with 2 children. That's 300 children in total across 120 households, so on average there are 2.5 children per household.
If I ran a questionnaire and asked children at random on the street about their household, I might get a different result! If I ask 30 children at random, I can expect 10 of them to come from a household with 5 children (100 of the 300 children live in one) and 20 of them to come from a household with 2 children. Adding up their answers gives 10 × 5 + 20 × 2 = 90 children across 30 reported households, which would lead me to 3.0 children per household.
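To double-check the arithmetic, here is a minimal sketch in Python that computes both averages exactly from the numbers above. The variable names are my own and are only there for illustration.

```python
from fractions import Fraction

# Numbers from the example: {children per household: number of households}
household_counts = {5: 20, 2: 100}

total_households = sum(household_counts.values())
total_children = sum(size * n for size, n in household_counts.items())

# Average over households: every household counts exactly once.
per_household = Fraction(total_children, total_households)

# Average over children: every child reports the size of their household,
# so a household with `size` children shows up `size` times.
per_child = Fraction(
    sum(size * size * n for size, n in household_counts.items()),
    total_children,
)

print(total_children, total_households)  # 300 120
print(float(per_household))              # 2.5
print(float(per_child))                  # 3.0
```

The only difference between the two calculations is the weighting: the second one counts each household once per child instead of once in total.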
This may feel like a paradox. Why are these two numbers different?
The reason is that the two numbers measure different things. The first is an average over households, while the second is an average over children. Since a randomly chosen child is more likely to come from a large household (they have more children, after all), large households are overrepresented and the per-child number is skewed upward.
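If you'd rather see this happen than take my word for it, a small simulation can help. The sketch below assumes we can sample either a random child or a random household from the population in the example; the sample size and the seed are arbitrary choices of mine.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One entry per child, holding the size of that child's household.
child_view = [5] * (20 * 5) + [2] * (100 * 2)

# One entry per household, holding the size of that household.
household_view = [5] * 20 + [2] * 100

n = 100_000  # arbitrary sample size, large enough to smooth out noise

# Asking random children: large households are picked more often.
mean_by_child = sum(rng.choice(child_view) for _ in range(n)) / n

# Asking random households: every household is equally likely.
mean_by_household = sum(rng.choice(household_view) for _ in range(n)) / n

print(round(mean_by_child, 2), round(mean_by_household, 2))
```

The first number should land close to 3.0 and the second close to 2.5, which matches the two averages we calculated by hand.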
This is a simple example about households, but sampling biases like this one can appear when you're doing A/B tests as well. After all, we may be serving companies, and some companies are larger than others. If we're tracking the behavior of users, can we be sure that they give an accurate reflection of the companies?
This is why it's very important to be clear about what statistic is being calculated, and to be upfront about what it does, and does not, represent.