First off, I compiled my data in August of 2014, and I used only projects that fell under the designation of “Tabletop Games.” At that time, 4,432 projects had been launched under that designation.
A programmer friend of mine built a “bot” to scrape the Kickstarter website, and the scraped data was then compiled into a spreadsheet. This data consisted mainly of information that anyone can see when viewing a Kickstarter project page, such as the funding goal, total number of backers, total funding raised, date launched, date ended, and so on.
With this data, I've tried to identify specific variables that correlate with either the success, or lack of success, of projects.
A Problem With Our Kickstarter Data
We were only able to collect data for 3,629 of those 4,432 projects. The remaining 803 projects were redirecting to another location, and at first we could not figure out why. After much deliberation, we identified the reason.
Those 803 projects were all failed projects that had been re-launched at a later date. The URLs therefore redirected to the later version of the campaign and gave no information about the original launch. Of these 803 projects, some failed and some succeeded in their subsequent launches. So we decided to use just the 3,629 projects that were readily available and perform our statistical analysis on these projects alone.
The Kickstarter Data is Skewed Positively
But if you are an intelligent person, a huge problem may be staring you in the face. The data is skewed! It shows a much higher proportion of successful projects than it realistically should, since 803 projects which failed are being excluded from the analysis.
Yes, this is a problem, but this is the data we have and this is what we will use. Even so, I think the data is still credible for two reasons:
- In general we will be performing statistical tests which take the average success rate of one group and compare it to that of another group. Since the 803 failed projects were projects of all types, their absence should affect all the groups within our analyses in a similar way, so it shouldn't really affect the results of our statistical tests.
- According to a sample size analysis (this is going to get super geeky, so put on your stats cap for just a moment), having 3,629 pieces of data from a set of 4,432 gives us a 99% confidence level with a confidence interval (that is, a margin of error) of just under 2%. (I'll go into more detail about these terms later in this blog.)
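For the curious, here is a rough sketch of how a sample size figure like the one above can be checked. It is not the calculator I originally used. It assumes the textbook margin-of-error formula for a proportion, the worst-case proportion p = 0.5, and a simple random sample (which our data is not, strictly speaking, since the missing projects all failed). The function name and its parameters are my own for illustration.

```python
import math

def margin_of_error(sample_size, population_size, z=2.576, p=0.5,
                    finite_correction=True):
    """Margin of error for a proportion, assuming a simple random sample.

    z=2.576 corresponds to a 99% confidence level; p=0.5 is the worst case.
    """
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    if finite_correction:
        # Shrinks the error when the sample is a large share of the population
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

with_fpc = margin_of_error(3629, 4432)
without_fpc = margin_of_error(3629, 4432, finite_correction=False)
print(f"With finite population correction: {with_fpc:.2%}")
print(f"Without correction:                {without_fpc:.2%}")
```

Under these assumptions the margin comes out at roughly 0.9% with the finite population correction and roughly 2.1% without it, so the exact figure depends on which version of the formula a given calculator applies.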
The Point of Statistics for our Kickstarter Analysis
But we have to remember the point of statistics. Statistics are not going to prove anything, and as such, statistics are not going to tell you what to do. They will simply tell us what kinds of trends are occurring, and what types of outcomes are more likely, or less likely, given a set of conditions. If we roll two six-sided dice, the probability that both dice land on the number 1 is 1/36, or a 2.8% chance. But if we actually roll the dice and they both land on 1, it’s not statistics' fault! Statistics can only tell you how likely that outcome is; it will never influence the outcome in any way. So even though the data is not complete, it will still give us an idea of what kinds of trends exist. The missing data was never going to change any outcomes anyway! And an idea of the trends is exactly what I was looking for.
An Important Note about the Data
Since we’re missing 803 failed projects from a total of 4,432 projects, every success rate computed on the 3,629 collected projects is inflated by the same factor of 4,432/3,629, or roughly 22% in relative terms, across the board. Since we are trying to find major differences in success rates between very specific variables, this inflation won’t really matter as long as it is fairly consistent across all the groups we’re interested in. With this in mind, note that any time a table says “Adjusted Data” it means that the success rates for all groups were re-adjusted so that they reflect success rates with respect to the total 4,432 projects instead of only the 3,629 projects.
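To make the adjustment concrete, here is a minimal sketch of the rescaling, assuming that every missing project failed and that the missing projects are spread proportionally across groups. Under those assumptions, an adjusted rate is just the observed rate multiplied by 3,629/4,432. The two group rates below are made-up numbers for illustration, not results from the data.

```python
COLLECTED = 3629  # projects we were able to scrape
TOTAL = 4432      # all Tabletop Games projects at the time

def adjusted_rate(observed_rate, collected=COLLECTED, total=TOTAL):
    """Rescale a success rate from the collected projects to the full set.

    Assumes all missing projects failed and are spread evenly across groups.
    """
    return observed_rate * collected / total

# Hypothetical observed success rates for two made-up groups
group_a, group_b = 0.60, 0.40
adj_a, adj_b = adjusted_rate(group_a), adjusted_rate(group_b)

print(f"Group A: observed {group_a:.1%}, adjusted {adj_a:.1%}")
print(f"Group B: observed {group_b:.1%}, adjusted {adj_b:.1%}")
print(f"Ratio A/B before: {group_a / group_b:.2f}, after: {adj_a / adj_b:.2f}")
```

Because both groups are scaled by the same factor, the ratio between them is unchanged, which is why a consistent inflation should not distort comparisons between groups.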