So far, I've taught inferential statistics every year, and never felt really satisfied with the outcome. Yes, my students can copy my example to obtain a test statistic and compare it to the critical value in the book. Yes, they can even say "thus we reject the null hypothesis." But rarely do they demonstrate true understanding. This year's attempt to teach inferential stats failed, as usual. Students complained so much about their lack of understanding (I love it when they do that) that I decided to give it another, serious, try.
So for today, I considered what makes inferential statistics hard to understand. I think the main difficulty is grasping that random variation alone can create differences between groups. So I started with an object students know behaves randomly, a coin, and focused the lesson on the concept of random variation.
The setup: a normal coin, which I flip 15 times. We record the number of heads and tails in a contingency table.
Then I "bless" it. I make a show of it, concentrating hard and blowing on the coin carefully in cupped hands.
Next I flip the coin another 30 times.
It turned out that before the blessing, the coin came up 5 heads and 10 tails. After the blessing it came up 13 heads and 17 tails. Oh my, my blessing made the coin come up heads more than twice as often! Students immediately complained that I should take into account the different number of flips in each condition (thank you, students). We calculated percentages: 33% heads without the blessing vs 43% heads with it.
Key Question: did my blessing work? Students were laughing at this, and suggesting wonderful things, like that the difference might be too small, and the sample too small, to be sure the results weren't merely due to chance. So how big should the difference be, for this sample, and how sure is "sure"? This led us into significance levels, and the need for statistical tests. We did a chi-square test online (vassarstats is great for this) and when we saw that the p-value was 0.75 we concluded that the difference in heads could easily be due to chance. We experimented with changing the data a bit: say, what if there were 29 heads and 1 tail in the "blessed" condition? Students agreed that would be more convincing, and voilà, the p-value was less than 0.001.
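For anyone who wants to reproduce the coin test without the website, here is a minimal sketch in plain Python. It is not what vassarstats runs internally. The function name `chi_square_2x2` is made up for this example, and it assumes Yates' continuity correction, because that is the variant that reproduces the 0.75 reported above (without the correction the same table gives roughly 0.52).

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). Illustrative helper, not a library API.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were unrelated
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Yates' continuity correction (assumed, see the note above)
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # With 1 degree of freedom the chi-square tail area is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Rows: before / after the blessing. Columns: heads / tails.
classroom = [[5, 10], [13, 17]]
what_if = [[5, 10], [29, 1]]

for name, table in (("classroom", classroom), ("what if", what_if)):
    stat, p = chi_square_2x2(table)
    print(f"{name}: chi-square = {stat:.3f}, p = {p:.5f}")
```

The classroom table gives a p-value of about 0.75, and the "what if" table gives one far below 0.001, in line with what we saw online.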
That's nominal data. I also wanted students to experiment themselves, and to obtain ordinal data to use with a Mann-Whitney U-test. So I asked: "Are you telepathic?"
Students paired up. One student in each pair thought (but didn't speak) of a word, either "BIG" or "small". The other person then said a number, any number. The number was tallied in one of two columns according to the word. At the end, I picked the data of one pair of students and calculated the medians. Oh my, the median for BIG numbers was 87.5, compared to just 15 for the small numbers. Students thought about this: could they be sure their classmates were telepathic? We did a Mann-Whitney U-test online (thanks again, vassarstats) and found a p-value of 0.009. Students were impressed. We concluded that we can be at least 95%* (or even 99%) sure that this pair of students were telepathic, except...
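The idea behind the Mann-Whitney U-test can be sketched in a few lines of plain Python. Big caveat: the numbers below are invented purely for illustration and are NOT the pair's actual data, which isn't reproduced in this post. The helper names are also made up. The sketch computes an exact two-sided p-value by trying every possible way of splitting the pooled numbers into two groups, which is only practical for small samples; an online calculator may well use tables or a normal approximation instead.

```python
from itertools import combinations

def u_statistic(xs, ys):
    """Number of (x, y) pairs in which x beats y; ties count as half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

def mann_whitney_exact(xs, ys):
    """Exact two-sided Mann-Whitney test by enumerating all regroupings."""
    pooled = list(xs) + list(ys)
    n_x = len(xs)
    center = n_x * len(ys) / 2          # U expected if labels mean nothing
    observed = u_statistic(xs, ys)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_x):
        chosen = set(idx)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count regroupings at least as lopsided as the one we observed
        if abs(u_statistic(group_x, group_y) - center) >= abs(observed - center):
            extreme += 1
    return observed, extreme / total

# HYPOTHETICAL numbers, made up for this example only
big_numbers = [70, 120, 45, 300, 88, 95]
small_numbers = [5, 12, 60, 3, 20, 9]

u, p = mann_whitney_exact(big_numbers, small_numbers)
print(f"U = {u}, exact two-sided p = {p:.4f}")
```

The p-value here is simply the share of all possible regroupings that look at least as lopsided as the real grouping, which is a nice way to show students where the number comes from.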
What if it wasn't random variation causing the difference in results? What if the difference came from a confounding variable within the experiment? Students were saying that maybe the girl thinking of the word somehow consciously or unconsciously signaled the word she was thinking of. Someone said humans have a hard time being truly unpredictable and random. So we arrived at the conclusion that more evidence is needed, and that statistical tests can only (at best) rule out that the difference is due to random variation, but that there can still be other threats to validity present.
Overall, I am very pleased with this lesson. I am happy I chose a coin, and even happier I chose ESP, something many students are naturally curious about and have already considered in a somewhat relevant manner. Many students told me later that they finally got the idea, that it made sense, that it was obvious that descriptive stats alone are insufficient to draw conclusions about data. They could even transfer their understanding to psychology, to explain how participants in an experiment might be randomly different from each other or even compared to themselves at an earlier point in time. I am particularly happy with the telepathy experiment. At first I thought that I should have made them flip a coin to decide what word to think about, to make it truly unpredictable. But because their choice of words wasn't perfectly random, we had that very good discussion about internal validity and confounding variables, which I think deepened students' understanding of the power and limitations of inferential statistical tests.
Some changes I'll make for next time: provide each pair of students with a computer so they can do the test themselves. Spend more time working with hypotheses and writing up the results of the inferential test. I want them to say "therefore the difference between conditions is significant and we should reject the null hypothesis", so we should have spent more time saying, and thinking about, this statement and what it means.
*Yes, I know that this is an incorrect interpretation of significance level. I know, and it hurts me to teach it this way. But seriously I think I must, at least to begin with, because students are simply not ready/able/given enough time to fully understand the concept of significance according to the frequentist approach to statistics. I comfort myself with the thoughts that hey, priorities gotta be made, and that perhaps, if looked at from a Bayesian perspective, what I'm teaching my students actually makes sense. It's a hard decision, though.
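To make the footnote's distinction concrete, here is a small sketch of what a frequentist p-value actually measures, using the coin data from above. It assumes a perfectly fair coin as the "chance only" model, which is a slightly different null hypothesis from the chi-square test's, so the number will not match 0.75 exactly. The function names are made up for this example. It computes the exact probability that a fair coin, flipped 15 and then 30 times, produces a gap in the proportion of heads at least as large as the one we saw.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_gap_at_least(heads_a, flips_a, heads_b, flips_b):
    """P(gap in heads proportion >= observed gap), assuming a fair coin.

    Proportions are compared via cross-multiplied integers to avoid
    floating-point rounding at the threshold.
    """
    observed = abs(heads_b * flips_a - heads_a * flips_b)
    total = 0.0
    for ha in range(flips_a + 1):
        for hb in range(flips_b + 1):
            if abs(hb * flips_a - ha * flips_b) >= observed:
                total += binom_pmf(ha, flips_a) * binom_pmf(hb, flips_b)
    return total

# 5 heads in 15 flips before the blessing, 13 heads in 30 flips after
p_chance = prob_gap_at_least(5, 15, 13, 30)
print(f"P(gap this large or larger | fair coin) = {p_chance:.3f}")
```

The result is a statement about the data given that only chance is at work, and it comes out well above any usual significance level. It is not the probability that the blessing did or didn't work, which is exactly the shortcut the "95% sure" wording takes.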