Kickstarter Stats 101: What's a "Normal Distribution"?

Before we perform any statistical test, we will want to know whether the data is normally distributed. This is very important, as you will soon find out, but it can also be a convoluted discussion.

So, again for the sake of simplicity and time, I am going to overgeneralize my definition, but I’ll provide a couple of different examples and ways to think about it so that you can build a more intuitive understanding.

Normal Distribution Example: Men’s Heights

Let’s say we recorded the heights of 100 different men aged 18 to 66 in the U.S. Then we added them all up and divided by 100 to find the mean (the average height). We would get a number right around 5-feet, 8-inches tall (173 cm), which we would interpret to mean that a typical man is about 5-feet, 8-inches tall.

There are still a great number of men who are around 5-feet, 6-inches tall, or around 5-feet, 10-inches tall, but not as many as there are men who are roughly 5-feet, 8-inches tall. And there are even fewer men who are shorter than 5-feet, 4-inches tall or taller than 6-feet tall.

The point is, the further we move away from the mean (average), the less frequently we find values that fit into those groups. We refer to how often values occur as their frequency. We frequently find men between 5-feet, 7-inches tall and 5-feet, 9-inches tall, but it is very infrequent to find men shorter than 5-feet tall or taller than 7-feet tall, although we still find them.
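If you like to see things in code, here is a minimal Python sketch of that idea. Keep in mind that the heights are simulated, not real measurements: I am assuming a mean of 68 inches (5-feet, 8-inches) and a made-up spread of 3 inches, and I use 10,000 simulated men instead of 100 so the pattern is easier to see. The name count_in_band is just something I made up for this illustration.

```python
import random

# Assumptions for illustration only: simulated heights, not real data.
MEAN_IN = 68.0  # 5 ft 8 in, the mean from the example above
SD_IN = 3.0     # hypothetical spread, chosen just for this sketch

rng = random.Random(42)  # fixed seed so the run is repeatable
heights = [rng.gauss(MEAN_IN, SD_IN) for _ in range(10_000)]

def count_in_band(values, center, near, far):
    """Count values whose distance from `center` is in [near, far)."""
    return sum(1 for v in values if near <= abs(v - center) < far)

# Equal-width bands, each one further from the mean than the last.
bands = [(0, 2), (2, 4), (4, 6), (6, 8)]
counts = [count_in_band(heights, MEAN_IN, near, far) for near, far in bands]

for (near, far), n in zip(bands, counts):
    print(f"{near}-{far} inches from the mean: {n} men")
```

Every band is the same width, yet the counts should drop as the bands move further away from the mean. That drop-off is the "less frequently" part of the explanation above.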

A Normal Distribution and Histograms

If we graphed all the heights we would get a graph that looks like the one below. This is called a histogram: it groups the heights into ranges and draws one bar per range, and the height of each bar shows how many men fall into that range. The pattern we see is referred to as a bell-shaped curve because it is sort of shaped like a bell. If the pattern in a histogram resembles a bell-shaped curve, then we say that the data is normally distributed. The mean should also be very close to the highest bar in the graph.
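For anyone who wants to try this at home, here is a rough Python sketch that builds a histogram out of plain text. Again, the heights are simulated under my own assumptions (a mean of 68 inches and a spread of 3 inches), and the function names bin_counts and text_histogram are made up for this example.

```python
import random
from collections import Counter

# Assumptions for illustration only: simulated heights, not real data.
rng = random.Random(0)  # fixed seed so the run is repeatable
heights = [rng.gauss(68.0, 3.0) for _ in range(1000)]

def bin_counts(values, bin_width=2):
    """Group values into bins and count how many fall in each bin.

    Each bin is labeled by its lower edge, e.g. 66 covers 66 up to 68.
    """
    return Counter(int(v // bin_width) * bin_width for v in values)

def text_histogram(counts, per_mark=10):
    """Return one text line per bin, with one '#' per `per_mark` values."""
    return [
        f"{start:>3} in | {'#' * (counts[start] // per_mark)}"
        for start in sorted(counts)
    ]

counts = bin_counts(heights)
print("\n".join(text_histogram(counts)))

# The tallest bar should sit right next to the mean of 68 inches.
tallest_bin = max(counts, key=counts.get)
print(f"Tallest bar starts at {tallest_bin} inches")
```

Turn your head sideways and the rows of # marks should bulge in the middle and thin out at both ends, which is the bell shape. The tallest bar should be one of the two bins that touch 68 inches.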

A very rough but usable explanation of what it means to be normally distributed is this: results derived from the data will be driven largely by the average of the data (what’s normal about the data), and will not be overwhelmed by a few abnormal values that could, all by themselves, skew the results. Abnormal values like this are referred to as “outliers”.

A Normal Distribution and Outliers

Here’s a quick example. Let’s say that as I am entering the men’s heights from above into my spreadsheet or calculator, I accidentally type in the value 567 instead of 5.67 (feet) for one of the numbers. This number is so large compared to the other ninety-nine numbers that it will drive all the results, and the other ninety-nine numbers will be much less meaningful. The average height of men will now come out to be roughly 11 feet, 3 inches. JUST FROM ONE OUTLIER!!
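You can check the arithmetic yourself with a few lines of Python. This is only a sketch, and it makes one simplifying assumption: every correctly entered height is exactly 5.67 feet, so the only thing that changes between the two lists is the single mistyped entry.

```python
from statistics import mean

# Assumption: for simplicity, every correct entry is exactly 5.67 feet.
correct = [5.67] * 100
with_typo = [5.67] * 99 + [567]  # one entry mistyped as 567

mean_correct = mean(correct)
mean_typo = mean(with_typo)

# Split the skewed mean into whole feet and leftover inches.
feet = int(mean_typo)
inches = (mean_typo - feet) * 12

print(f"mean without the typo: {mean_correct:.2f} feet")
print(f"mean with the typo:    {mean_typo:.2f} feet")
print(f"that is about {feet} feet, {inches:.0f} inches")
```

With the typo in place, the mean works out to (99 x 5.67 + 567) / 100, which is about 11.28 feet. One bad entry out of a hundred roughly doubles the average.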

The original data may have been “normally distributed”, but you can be sure that the new set of numbers (due specifically to the value of 567) won’t be.

Think of it this way: is it normal to see the value 567 next to ninety-nine other numbers with values right around 5 and 6? No, not really… Now, this is one simple (and somewhat exaggerated) example, just to make the point.

Many common statistical tests will only yield valid results if the data you use is (at least approximately) normally distributed. I won’t go into more detail about how to determine normality here, but if I have more time later I will explain how to test for normality and how to read other types of normality graphs and charts.

Please comment if you learned anything, if you want to correct anything, or if you have any questions!