User:Georges/Multiple Testing -- a comment
A reply to a message suggesting that I didn't have a problem with ordinary p-values in the context of multiple tests:
Alas, I'm sorry if I gave the impression that I didn't have a problem. I do very much have a problem -- it's just that I don't have a solution!! Or at least not a clear and simple one. I've given this some thought over time and it's driven me to become a Bayesian at heart. I feel like a Catholic priest who, in his heart, has converted to Buddhism but has to keep his job. So I still practice Frequentism, outwardly anyway, and try to help students practice it as best I can.
What's a reasonable way to handle multiple testing in practice? Definitely not to rummage through tons of data and models, find a significant nugget and report a test with an unadjusted p-value as if it were a hypothesis formulated a priori. That leads to the Tower of Babel. (I apologize for using metaphors that might not be familiar -- admittedly a poor one in this case -- but they're easy and fun to google).
The opposite extreme, imposing a single family-wise alpha on all hypotheses in a study, is almost as egregious although much less likely to be seen in practice because a statistician who insisted on always doing this would have no clients.
So we need to distinguish somehow between reasonable prior hypotheses and nuggets we've discovered while rummaging.
One problem is that most clients don't come with a nicely organized list of prior hypotheses. They expect the statistician to do that. But the statistician can't really do it until she's played with the problem and done some analysis -- at which point she's compromised.
Is there a way out?
Suppose one had carefully formulated 100 prior hypotheses. How does being a prior hypothesis justify not adjusting for multiple hypothesis testing? It's okay not to adjust provided you plan to report the results for all 100 hypotheses, whether each is significant or not. That way, if only 5 of them end up significant -- at the 0.05 level of course -- the reader can readily judge that the researcher is not very astute at formulating plausible hypotheses, and that the 5 of 100 that ended up significant might well have done so by chance.
If you have a secret list of 100 prior hypotheses and only reveal the 5 that ended up significant then you are seriously misleading your reader.
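To put a rough number on 'by chance': here is a small sketch of the arithmetic, under two simplifying assumptions of mine that are not part of the argument above, namely that the 100 tests are independent and that every null hypothesis is true. The helper name prob_at_least is made up for the illustration.

```python
from math import comb

def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha).

    Under the assumptions stated above (independent tests, all nulls true),
    X is the number of tests that come out 'significant' purely by chance.
    """
    return sum(comb(n, j) * alpha**j * (1 - alpha)**(n - j)
               for j in range(k, n + 1))

n_tests = 100   # number of prior hypotheses
alpha = 0.05    # significance level used for each test

expected_false_positives = n_tests * alpha
p_at_least_one = prob_at_least(1, n_tests, alpha)
p_at_least_five = prob_at_least(5, n_tests, alpha)

print(f"Expected number significant by chance: {expected_false_positives:.1f}")
print(f"P(at least 1 significant by chance):   {p_at_least_one:.3f}")
print(f"P(at least 5 significant by chance):   {p_at_least_five:.3f}")
```

Under these assumptions, five significant results out of 100 is about what chance alone would deliver on average, which is why a reader who sees only the five and not the other 95 is being misled.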
So I propose a 'test' to determine whether a hypothesis is a reasonable prior hypothesis even if you didn't happen to have thought about it at the outset: Is it something so central to the research questions that you could convince someone that you would have planned to discuss it whether it turned out to be significant or not? In that case I think it's okay to treat it as a prior hypothesis and report its p-value without adjustment along, of course, with all the other hypotheses, significant or not, that you 'planned' to report.
If it's a 'discovered hypothesis' then you should report, in addition to the ordinary p-value, some adjusted p-value with a brief discussion of the reason for the adjustment. The appropriate method for adjustment is rarely clear but doing it is a signal to the reader that there is a problem. This is a nod to a Bayesian approach. You are giving your reader a choice of p-values that will depend on his attitude towards the hypothesis. Even among readers who know nothing about Frequentism or Bayesianism, one who thinks that the alternative hypothesis is very plausible might be happy to take the unadjusted p-value, but one who feels that it's far-fetched would be attracted to the adjusted p-value, which reflects a greater degree of skepticism about the hypothesis.
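As an illustration of reporting both numbers side by side, here is a minimal sketch using two common adjustments, Bonferroni and Holm. They are my choice for the example, not a recommendation: as said above, the appropriate method is rarely clear. The function names and the p-values are made up, and the 'family' is simply taken to be the list of tests passed in.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values in the same order as the
    raw p-values, and everything is capped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four 'discovered' hypotheses.
raw = [0.01, 0.04, 0.03, 0.005]

for p, b, h in zip(raw, bonferroni(raw), holm(raw)):
    print(f"unadjusted {p:.3f}   Bonferroni {b:.3f}   Holm {h:.3f}")
```

Printing the unadjusted and adjusted values in the same row leaves the choice to the reader, which is the point of the suggestion.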
Thanks for the motivation to try to write down these ideas.
All the best, Georges