Multiple Comparisons: FWE & FDR

The same opening remarks apply here. Without further ado, the familiar line: I didn't really want to write this, but I wrote it anyway (honestly, I almost gave up halfway through, orz). It is a simple, beginner-friendly explanation meant to make the ideas easy to grasp, without digging too deep. My knowledge is limited, so please point out any mistakes.

Introduction

To keep the ideas organized, this post covers the following points:
What are multiple comparisons, and why do we need FWE and FDR?
What is FWE?
What is FDR?
What are the similarities and differences between FWE and FDR?

What are multiple comparisons, and why do we need FWE and FDR?

Multiple comparisons come up whenever we have to judge significance across many statistical tests at once. Ordinarily, to call a single result significant we require that the probability of it arising by chance is below 0.05, i.e. a small probability; at a significance level of 0.05 the false-positive probability is at most 5%, so we can be reasonably confident in the result. If a single test uses a p-value threshold of 0.05, then with two comparisons the per-test threshold should really be 0.05 × (1/2), and so on: the more comparisons we run, the further the appropriate per-test threshold drops below 0.05, because thresholding every test at 0.05 lets the chance of at least one false positive grow as 1 − (1 − 0.05)^n (already about 64% for 20 tests). To keep false positives under control while still getting usable results, we have to apply a correction (a quick numerical sketch follows below).
FWE and FDR are two such ways of correcting p-values.
Multiple comparisons are not the same thing as post-hoc comparisons; the names differ, and so does the idea. The multiple-comparison problem is about judging significance after running many tests. Post-hoc comparisons are the follow-up analyses run after a multi-level test (e.g. an ANOVA) comes out significant, to pin down which levels actually differ.
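
The sketch below (plain Python; the numbers of tests are chosen arbitrarily for illustration) prints how the chance of at least one false positive inflates when every test is thresholded at an uncorrected 0.05, together with the Bonferroni-corrected per-test threshold that compensates for it:

```python
# How the family-wise false-positive rate inflates with the number of tests,
# and the Bonferroni-corrected per-test threshold that compensates for it.
alpha = 0.05

for n in (1, 2, 10, 100, 10000):
    # Chance of at least one false positive if every test uses the raw alpha.
    fwe_uncorrected = 1 - (1 - alpha) ** n
    # Bonferroni: per-test threshold that keeps the family-wise rate <= alpha.
    bonferroni_threshold = alpha / n
    print(f"n={n:>6}: P(>=1 false positive)={fwe_uncorrected:.3f}, "
          f"Bonferroni threshold={bonferroni_threshold:.1e}")
```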

So when do we actually need many tests, i.e. multiple comparisons? In brain science, fMRI and EEG data typically involve many voxels, channels, time points and frequency points, and analyses usually have to run a statistical test at each of them, which means many comparisons.

What is FWE?

Family-wise error (FWE) correction is a way of controlling errors when facing multiple comparisons. The most classic, standard version is the Bonferroni correction, computed exactly as described above: divide the single-test p-value threshold by the total number of tests, i.e. p/n (for a 0.05 threshold and n tests, 0.05/n per test). What FWE measures is the probability of making at least one false-positive error across the whole family of tests.
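
In practice a library call does the same thing; here is a minimal sketch with statsmodels (assuming it is installed; the p-values are made up for illustration):

```python
# Bonferroni FWE correction applied to a handful of (made-up) p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.020, 0.040, 0.300])

reject, p_corrected, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print("corrected p:", p_corrected)   # each raw p multiplied by n, capped at 1
print("significant:", reject)        # True where the corrected p stays below 0.05
```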

What is FDR?

My understanding is that FDR (false discovery rate) is, in practice, a relaxation of the FWE standard. In real applications the FWE criterion is too strict: after running tests on tens of thousands of voxels, hardly any result survives. And on the other hand, it is hard to claim that the voxels filtered out show no activation whatsoever. FDR instead emphasizes controlling the false discovery rate, i.e. the proportion of false positives among the results declared significant. Controlling the FDR does not guarantee that there are no false positives in the results; it guarantees that there are only a few. Setting the FDR level to 0.05 ensures that no more than 5% of the reported activations are false positives. That is an acceptable outcome, and not as harsh as FWE.
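
For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, the standard recipe for controlling FDR (statsmodels' multipletests(..., method="fdr_bh") makes the same decisions):

```python
# Benjamini-Hochberg step-up procedure for FDR control, written out in full.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of tests declared significant at FDR level q."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)                    # indices of p-values, ascending
    ranked = pvals[order]
    # BH criterion: p_(k) <= (k / m) * q, with 1-based rank k.
    below = ranked <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest rank meeting the criterion
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.020, 0.040, 0.300], q=0.05))
# -> the first four survive, versus only the first two under Bonferroni (0.05/5)
```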

What are the similarities and differences between FWE and FDR?

FWE: stricter. It keeps the chance of even a single false positive in the final results low, but at the price of missing some real effects (false negatives);
FDR: more lenient than FWE. It recovers more of the true effects (higher sensitivity), but the results may also contain an acceptable proportion of false positives, as the sketch below illustrates.
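
The sketch below illustrates that difference on simulated data (the effect size and counts are arbitrary): both corrections are applied to the same mixture of null and signal p-values, and the number of discoveries is counted for each.

```python
# FDR (Benjamini-Hochberg) versus Bonferroni on the same simulated p-values:
# 900 null tests plus 100 tests carrying a real effect.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
z_null = rng.standard_normal(900)             # no effect
z_signal = rng.standard_normal(100) + 4.0     # true effect: shifted mean
pvals = norm.sf(np.concatenate([z_null, z_signal]))   # one-sided p-values

for method in ("bonferroni", "fdr_bh"):
    reject, _, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, "discoveries:", int(reject.sum()))
# With real signal present, fdr_bh typically declares more tests significant
# than bonferroni, at the cost of a small proportion of false discoveries,
# while bonferroni misses more of the true effects.
```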

References:
http://mindhive.mit.edu/book/export/html/90

Part of the original text is pasted below (a couple of quick sketches illustrating ideas from the FAQ follow after the quote).

P threshold FAQ

Frequently Asked Questions - P-thresholds

  • PthresholdPapers
  • PthresholdHowTos
  • PthresholdLinks
  1. What is the multiple-comparison problem? What is familywise error correction (FWE)?
    To start, Nichols and Hayasaka (PthresholdPapers) provide an excellent introduction to the issue of FWE in neuroimaging in very readable fashion. You're encouraged to check it out.
    Many scientific fields have had to confront the problem of assessing statistical significance in the context of multiple tests. With a single statistical test, the standard conventionally dictates a statistic is significant if it is less than 5% likely to occur by chance - a p-threshold of 0.05. But in fields like DNA microarrays or neuroimaging, many thousands of tests are done at once. Each voxel in the brain constitutes a separate test, which usually means tens of thousands of tests for a given subject. If the conventional p-threshold of 0.05 is applied on a voxelwise basis, then, just by chance you're almost guaranteed to have many hundreds of false-positive voxels. In order to avoid any false positives, then, researchers generally correct their p-threshold to account for how many tests they're performing. This type of correction prevents Type I error across the whole family of tests you're doing - a familywise error correction, or FWE correction.
    The standard approach to FWE correction has been the Bonferroni correction - simply divide the desired p-threshold by the number of tests, and you'll maintain correct control over the FWE rate. In general, the Bonferroni correction is a pretty conservative correction, and it suffers from a fatal flaw with neuroimaging data. The Bonferroni correction demands that all the tests be independent from each other, and that demand is manifestly not fulfilled in neuroimaging data, where there is a complex, substantial and generally unknown structure of spatial correlations in the data. Essentially, the Bonferroni correction assumes there are more spatial 'degrees of freedom' than there really are; one voxel is not independent from the next, and so one only needs to correct for the 'true' number of independent tests you're doing. This effort, though, is tricky, and so a good deal of theory has been developed on ways around Bonferroni-type corrections that still control the FWE at a reasonable level.
  2. What is Gaussian random-field theory and how does it apply to FWE?
    Worsley et. al (PthresholdPapers) is one of the first papers to link random-field theory with neuroimaging data, and that link has been tremendously productive in the years since. Random-field theory (RFT) corrections attempt to control the FWE rate by assuming that the data follow certain specified patterns of spatial variance - that the distributions of statistics mimic a smoothly varying random field. RFT corrections work by calculating the smoothness of the data in a given statistic image and estimating how unlikely it is that voxels (or clusters or patterns) with particular statistic levels would appear by chance in data of that local smoothness. The big advantages of RFT corrections are that they adapt to the smoothness in the data - with highly correlated data, Bonferroni corrections are far too severe, but RFT corrections are much more liberal. RFT methods are also computationally extremely efficient.
    However, RFT corrections make many assumptions about the data which render the methods somewhat less palatable. Chief among these is the assumption that the data must have a minimum level of smoothness in order to fit the theory - at least 2-3 times the voxel size is recommended at minimum, and more is better. For those researchers unwilling to pay the cost in resolution that smoothing imposes, RFT methods are problematic. As well, RFT corrections are only available for statistics whose distributions in a random field have been laboriously calculated and derived - the common statistics fall in this category (F, t, minimum t, etc.), but ad hoc statistics can't be corrected in this manner. Finally, it's become clear (and Nichols and Hayasaka show in PthresholdPapers), that even with the assumptions minimally satisfied, RFT corrections tend to be too conservative.
    Random-field theory corrections are available by default in SPM; in SPM99 or earlier, choosing a "corrected" p-threshold means using an RFT correction, while in SPM2, choosing the "FWE" correction to your p-threshold uses these methods. I don't believe corrections of this sort are available in AFNI or BrainVoyager.
  3. What is false discovery rate (FDR)? How is it different from other types of multiple-comparison correction?
    RFT methods may have their flaws, but some researchers have pointed out a different problem with the whole concept of FWE correction. FWE correction in general controls the error rate for the whole family; it guarantees that there's only a 5% chance (for example) of any false positives appearing in the data. This type of correction simply doesn't fit the intuition of many neuroimaging researchers, because it suggests that every voxel activated is a true active voxel, and most researchers correctly assume there's enough noise in every stage of the process to make a few voxels here and there look active just by chance. Indeed, it's rarely of crucial interest in a particular study whether one particular voxel is necessarily truly or falsely positive - most researchers are willing to accept that some of their signal is actually noise - but that level of inference is precisely what FWE corrections attempt to license.
    Benjamini & Hochberg, faced with this conundrum, developed a new idea. Rather than controlling the FWE rate, what if you could control the amount of false-positive data you had? They developed a method to control the false discovery rate, or FDR. Genovese et. al (PthresholdPapers) recently imported this method specifically into neuroimaging. The idea in controlling the FDR is not to guarantee you have no false positives - it's to guarantee you only have a few. Setting the FDR control level to 0.05 will guarantee that no more than 5% of your active voxels are false positives. You don't know which ones they might be, and you don't even know if fully 5% are false positive. But no more than 5% are falsely active.
    The big advantage of FDR is that it adapts to the level of signal present in the data. With small signal, the correction is very liberal. With huge signal, it's relatively more severe. This adaptation renders it more sensitive than an RFT correction if there's any signal present in the data. It allows a much more liberal threshold to be set than RFT, at a cost that most researchers have already mentally paid - a few false positive voxels. It requires almost no computational effort, and doesn't require laborious derivations to be used with new statistics.
    FDR is not a perfect cure-all - it does require some assumptions about the level of spatial correlation in the data. At the outer bound, allowing any arbitrary correlation structure, it is only slightly more liberal than the equivalent RFT correction. But with looser assumptions, it's a great deal more liberal. Genovese et. al have argued that fMRI data in many situations fits a very loose set of assumptions, enabling a pretty liberal correction.
    The latest edition of every major neuroimaging program provides some methods for FDR control - SPM2 and BrainVoyager QX have it built-in, and AFNI's 3dFDR program does the same work. Tom Nichols has predicted FDR methods will essentially replace most FWE correction methods within a few years, and they are beginning to be widely used throughout neuroimaging literature.
  4. What is permutation testing? How is it different from other types of multiple-comparison correction?
    Permutation testing is a form of non-parametric testing, and Nichols and Holmes give an excellent introduction to the field in their paper (PthresholdPapers), a much better treatment than I can give it here. But here's the extreme nutshell version. Permutation tests are a sensitive way of controlling FWE that make almost no assumptions about the data, and are related to the stats/CS concept of 'bootstrapping.'
    The idea is this. You hope your experimental manipulation has had some effect on the data, and to the extent that it has, your design matrix is a model that explains the data pretty well, with large beta weights for the conditions of interest. But what if your design matrix had been different? What if you randomly re-labeled your trials, so that a trial that was actually an A trial in the real experiment was re-labeled as a B, and put into the design matrix as a B, and a B trial was re-labeled and modeled as a C trial, and a C as an A, and so forth. If your experiment had a big effect, the new, randomly mixed-up design matrix won't explain it well at all - if you re-ran your model using that matrix, you'd get much smaller beta weights. Of course, on the null hypothesis, there wasn't any effect at all due to your manipulation, which means the random design matrix should explain it just as well.
    And now that you've re-labeled your design matrix and re-run your stats, you mix up the design matrix again, differently and do the same thing. And then do it again. And again, until you've run through all the possible permutations of the design matrix (or at least a lot of them). You'll end up with a distribution of beta weights for that condition from possible design matrices. And now you go back and look at the beta weight from your real experiment. If it's at the extreme end of that distribution you've created - congrats! You've got a significant effect for that condition. The idea in permutation testing is you don't make any assumptions about what the statistic distribution could be - you go out and empirically determine it, from your own real data.
    But how does that help you with the multiple-comparison problem? One nice thing about permutation testing is that we aren't restricted to testing significance for stats with known distributions, like t or F. We can use these on any ad hoc statistic we like. So let's do it across the design matrices, using as our statistic the maximal T: the value of the maximum T-statistic in the whole image for that design matrix. We come up with a distribution, just like before, and we can find the t-statistic that corresponds to the 5% most extreme parts of the maximal T distribution. And now, the clever bit: we go back to our real experiment's statistical map, and threshold it at that 5% level from the maximal T. Hopefully the t-statistics from our real experiment are generally so much higher than those from the random design matrices as to mean a lot of voxels in our real experiment will have t-statistics above that level - and we don't need to correct their significance at all, because anything in that extreme part of the maximal T distribution is guaranteed to be among the most extreme possible t-statistics for any voxel for any design matrix.
    Permutation tests have the big advantage of making almost no (but not totally none - see Nichols and Holmes for details) assumptions about your data, which means they work particularly well with low degrees of freedom, where other methods' assumptions about the shape of their statistic's distribution can be violated. They also are extremely flexible - any true or ad hoc statistic can be tested, such as maximal T, or size of structure, or voxel's favorite color - anything. But they have a big disadvantage: computational cost. Running a permutation test involves re-estimating at least 20 models to be able to guarantee a 0.05 significance level, and so in SPM for individual data, that cost can be prohibitive. For other programs, the situation's not as bad, but it can still be pretty difficult to wait. Permutation tests are available at least in SPM99 with the SnPM toolbox, and in AFNI with the 3dMonteCarlo program. Not sure about BrainVoyager.
  5. When should I use different types of multiple-comparison correction?
    Nichols and Hayasaka's paper (PthresholdPapers) does an explicit review of various FWE correction methods (as well as FDR) on simulated and real data of a variety of smoothness levels and degrees of freedom, to judge how conservative or liberal different methods were. Their main findings are:
  • Random-field corrections are extremely conservative for all smoothnesses except the highest. This bias becomes stronger as the degrees of freedom go down, such that low-degree-of-freedom, low-smoothness images corrected with RFT methods show the worst underactivation. At the highest smoothness (8-12mm FWHM), they perform reasonably well for all df.
  • Permutation methods are almost exact for all degrees of freedom and for all smoothnesses. They become slightly better with data of high smoothness, but basically perform tremendously well under all conditions.
  • FDR is not strictly speaking intended to control FWE, but it does an excellent job doing so for low-smoothness data at all degrees of freedom. At high smoothnesses (6mm FWHM and greater), the correction becomes too conservative.
    Accordingly, the nutshell recommendations are as follows:
  • Random-field methods are good for highly-smoothed data only and are best for single-subject data. For researchers who need a good deal of smoothing to collect significant signal, or who aren't particularly interested in very fine resolution, RFT corrections are quite exact and easily implemented for single subjects. At low degrees of freedom (say, less than 20 df), RFT corrections are generally too conservative at any smoothness.
  • For unsmoothed (or low-smoothed), single-subject data, FDR corrections are the best. They have very high sensitivity while still providing good control of false positives, even with low degrees of freedom. Group data tend naturally to be smoother than single-subject data, due to the blurring imposed by anatomical variability, and so may not be ideal for FDR corrections.
  • Permutation tests are optimized for group data - they perform perfectly at very low degrees of freedom, where other methods' assumptions are invalidated, and they improve slightly with high-smoothness data, although they still do fine with unsmoothed. In group testing, the permutation is whether each subject's t-statistic signs are true or flipped - presumably, if the mean is zero, flipping the sign of the statistic won't make a difference, but if the mean is nonzero, that flipping will matter. As well, the relative speed of estimating group models in most programs helps counter the increased computational cost of permutation testing in general.
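
Two quick sketches to illustrate ideas from the quoted FAQ. First, the conservativeness of Bonferroni on spatially correlated data (items 1-2): a small simulation (NumPy/SciPy; the amount of smoothing and the number of simulated experiments are arbitrary) in which null data are smoothed, thresholded at the Bonferroni-corrected level, and the empirical family-wise error rate comes out well below the nominal 0.05.

```python
# Empirical FWE of the Bonferroni threshold on spatially correlated null data:
# it stays controlled, but ends up well below the nominal 0.05 (conservative).
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import norm

rng = np.random.default_rng(0)
n_voxels, n_experiments, alpha = 1000, 2000, 0.05
z_thresh = norm.isf(alpha / n_voxels)          # Bonferroni-corrected z threshold

false_alarms = 0
for _ in range(n_experiments):
    noise = rng.standard_normal(n_voxels)
    smooth = gaussian_filter1d(noise, sigma=3.0)   # induce spatial correlation
    smooth /= smooth.std()                         # rescale back to ~unit variance
    false_alarms += (smooth > z_thresh).any()      # any "voxel" crossing threshold?

print("empirical FWE:", false_alarms / n_experiments)   # typically well below 0.05
```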
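
Second, the maximal-statistic permutation logic of item 4, sketched for the simple one-sample, sign-flipping case mentioned in the group-data recommendation (the data, the effect size and the number of permutations are made up; a real analysis would permute or sign-flip the actual design/contrast of interest):

```python
# Maximal-T permutation (sign-flipping) for a one-sample, group-level test:
# flip the sign of each subject's map at random, record the maximum t-statistic
# across voxels per permutation, and threshold the observed map at the 95th
# percentile of that maximal-T distribution (an FWE-corrected 0.05 threshold).
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 12, 500
data = rng.standard_normal((n_subjects, n_voxels))
data[:, :20] += 2.0                       # pretend the first 20 voxels carry signal

def t_map(x):
    """One-sample t-statistic for every voxel: mean / standard error."""
    return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(x.shape[0]))

observed_t = t_map(data)

n_perm = 1000
max_t = np.empty(n_perm)
for i in range(n_perm):
    signs = rng.choice([-1.0, 1.0], size=(n_subjects, 1))   # flip whole subjects
    max_t[i] = t_map(data * signs).max()                    # maximal T this round

threshold = np.quantile(max_t, 0.95)
print("FWE threshold:", round(float(threshold), 2))
print("significant voxels:", int((observed_t > threshold).sum()))
```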
