Preparing data is a challenge. One must “massage” the data so that its central “reality” becomes more and more evident, while at the same time not “massaging” it so much that it has been manipulated into telling you what you want to hear!
There are a number of ways of going about it, but first let me describe what I want to do with the data and the core problem I had with achieving this, so you will understand why I did what I did with the data.
My goal was to identify certain trends that correlate with the success or failure of projects. It’s hard to say with a very high level of confidence that one thing is causing another, but we can say with a very high level of confidence when one thing is correlated with another.
But to determine if two variables are correlated, one must be extra careful to isolate just those two variables, in order to remove anything else that could be influencing the results – these influencers are usually referred to as confounders. The one variable I am most concerned with in this blog is success rate, and the other… every variable that shows a correlation with it!
So here is my biggest problem: in order to use the test I think is most effective for this task, and identify whether one variable (for example, the weekday a project ends) is correlated with success rates, I need to break the projects into groups so that I can measure the variation across those groups and compare it to the variation across the differing weekdays. This way we can say definitively whether success rates differ based upon the weekday projects ended, or whether they simply differ randomly, by chance.
So the big question: what groups should this variable be broken down into? I could break projects into groups based upon the number of projects the project owner has created, or based upon their total funding goal, or any number of things. BUT these groups may have their own trends and their own correlation with the overall success rate of projects. If we did this, we wouldn’t be isolating the variable of interest from other influences at all!
Here was my solution: use a random number generator to assign every project a random number between 1 and 5. But we don’t just do this once, we do it multiple times – let’s say three different times. So now we have five different groups for each of three separate trials. Next we compile the success rates of projects that (1) ended on a given weekday and that (2) randomly fell into group 1 into one column. Then we compile the success rates of projects that ended on a given weekday and randomly fell into group 2 into the next column. We repeat this up to column 5, and then we redo the whole thing for the second trial, and again for the third trial.
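The random grouping step above can be sketched roughly as follows. The project ids, the number of projects, and the variable names are all hypothetical placeholders for illustration; the group and trial counts (5 and 3) come from the post.

```python
import random

random.seed(42)  # seed chosen arbitrarily so the assignments are reproducible

project_ids = range(1, 101)  # pretend we have 100 projects
NUM_GROUPS = 5               # groups 1 through 5, as described above
NUM_TRIALS = 3               # three independent random assignments

# For each trial, every project independently gets a group label 1-5.
assignments = {
    trial: {pid: random.randint(1, NUM_GROUPS) for pid in project_ids}
    for trial in range(1, NUM_TRIALS + 1)
}
```

From `assignments[trial]` you can then bucket each project's success outcome by (weekday, group) to build the five columns per trial described above.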
Next, and this is very important, we run a test (the same test we will use to identify whether success rates vary significantly over each weekday) across the five groups of randomly assigned projects, to ensure that the variation across the groups is controlled (or minimized). I did this for all the variables in my blog, and there is very little variation across all groups (usually a p-value right around 0.995 – meaning the random groups are statistically indistinguishable, exactly as intended… so yeah…)
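Assuming the test in question is a one-way ANOVA (the post doesn't name it, so this is my reading), the sanity check looks roughly like this: compute the F statistic across the five random groups and confirm it is unremarkable. The simulated success rate (40%) and project count (1000) are made-up numbers for illustration.

```python
import random
import statistics

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

random.seed(0)  # arbitrary seed for reproducibility

# Simulate 1000 project outcomes (1 = success) with a ~40% base rate,
# then scatter them into 5 random groups, as in the grouping step.
outcomes = [1 if random.random() < 0.40 else 0 for _ in range(1000)]
groups = [[] for _ in range(5)]
for outcome in outcomes:
    groups[random.randrange(5)].append(outcome)

# Because group membership is random, F should sit near 1 (no systematic
# variation between groups) - that's the control we want before the real test.
F = f_statistic(groups)
```

To turn F into the p-value mentioned above you'd evaluate it against the F distribution (e.g. with `scipy.stats.f.sf`), but the key point is simply that the random groups show no signal of their own.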
Now we run our test and identify whether success rates are correlated with our variable of interest: in this case, the weekday.
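The main test then applies the same statistic across weekdays instead of across random groups. The success rates below are invented for illustration (one value per random group, per weekday), with one weekday deliberately higher so the effect is visible; again I'm assuming a one-way ANOVA as the test.

```python
import statistics

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical success rates per weekday (one entry per random group).
weekday_rates = {
    "Mon": [0.42, 0.40, 0.44, 0.41, 0.43],
    "Tue": [0.39, 0.41, 0.40, 0.42, 0.38],
    "Fri": [0.55, 0.57, 0.54, 0.56, 0.58],  # noticeably higher than the rest
}

# A large F here says the between-weekday variation dwarfs the
# within-weekday (random-group) variation, i.e. weekday matters.
F = f_statistic(list(weekday_rates.values()))
```

If F is large relative to the F distribution's critical value, the weekday a project ends is correlated with its success rate; if not, the differences are consistent with chance.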
Please tell me what you think. What would you do differently, and what would you do the same? Do you foresee any issues with this method? And please feel free to ask any questions if you need clarification.