Error rate
It is two weeks before a general election. The TV polls say “Labour 48%, Conservative 51%”. In small print at the bottom it reads “Margin of error: +/-5%”. That small piece of text means the poll cannot tell us who is ahead. Labour could have as much as 53% of the vote and the Conservatives as little as 46%, and the same would be true in reverse if the poll showed a 51/48 Lab/Con split.
The error rate of an experiment whose results are based on statistics shrinks as the sample grows: it is roughly inversely proportional to the square root of the sample size. In English, that means the more children you test, the more accurate a result you get, with less chance of misleading results, although you have to test four times as many children to halve the error. Moreover, it also means that if the results of “corruption vs non-corruption” are fairly close, we cannot know for sure what the true result of the test is unless the percentage difference is more than twice the error rate. In a test with a dozen children, the error rate is extremely high because the sample is too small to reasonably represent the behaviour of the entire child population.
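To put rough numbers on this, here is a minimal sketch using the standard textbook approximation for the margin of error of a percentage at about 95% confidence. The function name `margin_of_error` and the worst-case assumption of a 50/50 split are mine, not something taken from any particular poll or study.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p measured on n people.

    z=1.96 corresponds to roughly 95% confidence; p=0.5 is the worst case.
    This assumes a simple random sample, which real studies rarely achieve.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)

# How the margin shrinks as the sample grows.
for n in (12, 100, 400, 1000, 10000):
    print(f"n={n:>5}: +/-{margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions a dozen children gives a margin of nearly 30 percentage points either way, and a poll quoting +/-5% corresponds to a sample of around 400 people.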
Spread
Corruption is not a black-and-white issue. For simplicity, let’s say corruption is the same as aggression for the next example. Pre-exposure aggression will be measured on a scale, and if you draw a graph with aggression on the horizontal axis and number of people on the vertical axis, you should see a bell curve, indicating that the majority of people have some baseline middling amount of aggression, tailing off in each direction, with a small number of outliers having very low or high general aggression levels in their personalities.
Why am I mentioning this? Well, many people seem to have strange ideas on how to define whether a test of video games on kids gives a meaningful result. We have seen recent “research” that counts how many children out of a small sample pick up a pencil that a psychiatrist deliberately drops after they have played a violent game, and simply adds up the results. If there is no bell curve, so to speak, the research is frankly not worth the paper it’s written on.
The reason is simple. The bell curve provides a clean and simple way to test the results that leaves little room for dispute: when you draw the post-exposure aggression (bell) curve over the pre-exposure curve, has its centre moved to the right by more than about twice the error rate? If so, you have a statistically significant result by the usual convention. If not, you don’t. This is important because it removes most of the subjectivity from interpreting the results.
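As a rough sketch of that comparison, the snippet below checks whether the average aggression score has moved to the right by more than twice the standard error of the difference. The function name `shift_is_significant`, the choice to treat the two sets of scores as independent samples, and the scores themselves are all illustrative assumptions of mine, not data from any study.

```python
import math
import statistics

def shift_is_significant(pre, post, threshold=2.0):
    """Return (shift, standard_error, significant) for two lists of scores.

    shift is mean(post) - mean(pre). The standard error treats the two
    lists as independent samples (an assumption). The result counts as
    significant only if the curve moved right by more than
    threshold * standard_error.
    """
    shift = statistics.mean(post) - statistics.mean(pre)
    standard_error = math.sqrt(
        statistics.variance(pre) / len(pre)
        + statistics.variance(post) / len(post)
    )
    return shift, standard_error, shift > threshold * standard_error

# Made-up aggression scores on an arbitrary scale, for illustration only.
pre_scores = [4, 5, 6, 5, 4, 6, 5, 5]
post_scores = [5, 6, 7, 6, 5, 7, 6, 6]

shift, se, significant = shift_is_significant(pre_scores, post_scores)
print(f"shift={shift:.2f}, standard error={se:.2f}, significant={significant}")

# The same one-point shift with only two children per group does not pass.
print(shift_is_significant([4, 6], [5, 7]))
```

The second call makes the sample-size point again: the same one-point shift that passes with eight children per group fails with two, because the standard error is so much larger.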
Repeatability
Newton is sitting under a tree and an apple falls on his head. He writes down the equation for gravity. 100 scientists come along with apples and let go of them at arm’s length, but they all float up to the sky.
This silly example highlights the dangers of doing a single experiment and taking the results as proof of correctness. And in fact, this example is not nearly as silly as it seems, because in the early 20th century Einstein showed that Newton’s account of gravity was incomplete: in general relativity, what we perceive as the force of gravity is a consequence of the curvature of space-time. The point of this is, if you are going to test a theory, like whether children are corrupted by video games, you need to do it multiple times, using different methods, with different people doing the test, and reach a consensus. This is analogous to the problem of the imperfect fair coin and the human error present in the way one person might flip it in the coin example earlier.
If an observer is asked to sit in a room full of children and write down on a report sheet, in her own terms, how aggressive she thinks each child is after a fixed period of observation, it is possible that she will report a significantly different set of numbers from another observer asked to rate the same children at the same time. The observer’s own perception of what aggression is distorts the measurement and makes the results unreliable. All the other factors I’ve covered, like the children’s ethnographic and psychiatric backgrounds, the way the samples are selected and so on, also limit the applicability of any single study. To mitigate this and reduce the impact of human judgment or error on the results, it is important to repeat the test in different ways and with different people conducting it. So far, a consensus has not been reached.
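One simple way to see how far two observers disagree is to compare their scores child by child. The sketch below is only an illustration: the helper `mean_absolute_disagreement` and the ratings are hypothetical, and proper studies would use an established inter-rater reliability measure instead.

```python
def mean_absolute_disagreement(ratings_a, ratings_b):
    """Average gap between two observers' scores for the same children."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both observers must rate the same children")
    if not ratings_a:
        raise ValueError("at least one rating is required")
    gaps = [abs(a - b) for a, b in zip(ratings_a, ratings_b)]
    return sum(gaps) / len(gaps)

# Invented ratings on a 1-10 scale for the same six children.
observer_a = [3, 5, 2, 7, 4, 6]
observer_b = [5, 6, 2, 9, 7, 6]

gap = mean_absolute_disagreement(observer_a, observer_b)
print(f"average disagreement: {gap:.2f} points on a 10-point scale")
```

If the average gap between observers is as large as the shift a study claims to have found, the study cannot tell the effect of the game apart from the effect of who was holding the clipboard.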