Update: This post used an incorrect implementation of the bootstrap, so the conclusions don’t hold. See this correction

This surprised me. I decided to try out bootstrapping as a method of testing whether two sets of numbers are drawn from different distributions. I did this by generating pairs of samples of size m from two ex-Gaussian distributions which are identical except for a fixed difference, d.
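As a rough illustration of that sampling step (not the post's original code), an ex-Gaussian draw can be sketched as a normal variate plus an independent exponential variate. The parameter values `mu`, `sigma` and `tau`, the sample size, and the helper name `exgauss` are all assumptions made for the example.

```matlab
% Ex-Gaussian sketch: a normal variate plus an independent exponential variate.
m = 50;                        % sample size (assumed value)
d = 0.5;                       % fixed difference between the two distributions
mu = 0; sigma = 1; tau = 1;    % assumed, illustrative parameters

% Exponential draws via the inverse transform -tau*log(U), so no toolbox is needed
exgauss = @(n) mu + sigma * randn(1, n) - tau * log(rand(1, n));

s1 = exgauss(m);               % first sample
s2 = exgauss(m) + d;           % second sample: same distribution, shifted by d
```

Setting `d = 0` gives two samples from the same distribution, which is the case needed for estimating false alarms.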


All code is MATLAB. Sorry about that.

Then, for each pair of samples, I apply a series of tests of whether the distributions are different.
1. Standard t-test (0.05 significance level)
2. Is the mean(s1)
3. Bootstrapping using the mean as the test statistic (0.05 significance level)
4. Bootstrapping using the median as the test statistic (0.05 significance level)

I used Ione Fine’s pages on bootstrapping as a guide. The bootstrapping code is:

function H=bootstrap(s1,s2,samples,alpha,method)

for i=1:samples
    if method==1


H = CI(1)>0 | CI(2)<0;
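Since only part of the listing survives, here is a minimal sketch of one common variant, a percentile bootstrap on the difference between the two samples' statistics. It is not the post's original function: the resampling scheme, the index arithmetic for the interval, and the example data are assumptions, and only the final decision line mirrors the fragment above.

```matlab
% Percentile-bootstrap sketch (an assumed variant, not the post's original code).
s1 = randn(1, 50);             % illustrative sample 1
s2 = randn(1, 50) + 0.5;       % illustrative sample 2, shifted by 0.5
samples = 2000;                % number of bootstrap resamples
alpha = 0.05;                  % significance level
stat = @mean;                  % swap in @median for the median version

diffs = zeros(1, samples);
for i = 1:samples
    r1 = s1(randi(numel(s1), 1, numel(s1)));   % resample s1 with replacement
    r2 = s2(randi(numel(s2), 1, numel(s2)));   % resample s2 with replacement
    diffs(i) = stat(r1) - stat(r2);            % bootstrap difference in the statistic
end

% Percentile confidence interval read off the sorted bootstrap differences
sorted = sort(diffs);
lo = max(1, floor(samples * alpha / 2));
hi = min(samples, ceil(samples * (1 - alpha / 2)));
CI = [sorted(lo), sorted(hi)];

H = CI(1) > 0 | CI(2) < 0;     % true if the interval excludes zero
```

Reading the percentiles from a sorted vector keeps the sketch in base MATLAB; a toolbox percentile function would be an alternative.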

I do that 5000 times for each difference, d, and each sample size, m. Then I take the average answer from each test (where 1 is 'conclude the distributions are different' and 0 is 'don't conclude the distributions are different'). For the case where d > 0 this gives you a hit rate: the likelihood that the test will tell you there is a difference when there is a difference. For d = 0.5 you get a difference that most of the tests can detect the majority of the time, as long as the sample size is more than 50. For the case where d = 0, you can calculate the false alarm rate for each test (at each sample size).
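The repetition scheme described above can be sketched as a small Monte Carlo loop. The sample sizes, the ex-Gaussian parameters, and the decision rule `decide` (a bare comparison of sample means) are hypothetical stand-ins chosen to keep the example short; any of the four tests could be dropped in instead.

```matlab
% Monte Carlo sketch of hit and false-alarm rates for one decision rule.
reps = 5000;                   % repetitions per condition, as in the post
sizes = [10 20 50 100];        % sample sizes m (assumed values)
ds = [0 0.5];                  % differences d: row 1 -> false alarms, row 2 -> hits
exgauss = @(n) randn(1, n) - log(rand(1, n));   % assumed mu=0, sigma=1, tau=1
decide = @(s1, s2) mean(s1) < mean(s2);         % hypothetical stand-in test

S = zeros(numel(ds), numel(sizes));   % proportion of positive answers
for r = 1:numel(ds)
    for c = 1:numel(sizes)
        positives = 0;
        for k = 1:reps
            s1 = exgauss(sizes(c));
            s2 = exgauss(sizes(c)) + ds(r);
            positives = positives + decide(s1, s2);
        end
        S(r, c) = positives / reps;
    end
end
falseAlarmRate = S(1, :);      % d = 0
hitRate = S(2, :);             % d = 0.5
```

With this particular stand-in rule, the d = 0 row should sit near 0.5 at every sample size, because by symmetry either sample is equally likely to have the larger mean.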

From these you can calculate d-prime as a standard index of sensitivity and plot the result. Sttest, Smean, Sbootstrap and Sbootstrap2 are matrices which hold the likelihood of the four tests giving a positive answer for each sample size (columns) for two differences, 0 and 0.5 (the rows):

hold on
xlabel('Sample size')
ylabel('sensitivity - d prime')
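For reference, the d-prime calculation itself can be sketched as z(hit rate) minus z(false alarm rate), taken column by column. The rates in `S` below are made-up illustrative numbers, not results from the post, and the clipping bounds are an assumption added so that rates of exactly 0 or 1 do not produce infinite values.

```matlab
% d-prime sketch: z(hit rate) - z(false-alarm rate), one value per sample size.
% Rows of S: d = 0 (false alarms) and d = 0.5 (hits); columns: sample sizes.
S = [0.05 0.05 0.05;           % hypothetical false-alarm rates
     0.30 0.60 0.90];          % hypothetical hit rates

z = @(p) -sqrt(2) * erfcinv(2 * p);        % inverse normal CDF using base MATLAB
clip = @(p) min(max(p, 0.001), 0.999);     % assumed bounds, keep z finite

dprime = z(clip(S(2, :))) - z(clip(S(1, :)));
```

Note that a test with a false alarm rate of 0.5 contributes z(0.5) = 0, so its d-prime reduces to z(hit rate) alone.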

Here is the result:

What surprised me was:

  • The t-test is more sensitive than the bootstrap, if the mean is used as the test statistic
  • How much more sensitive the bootstrap is than the other tests if the median is used as the test statistic
  • How well the simple mean does. I suspect there's some nuance I'm missing here, such as an unacceptably high false positive rate for smaller differences

Update 28/11/12
-Fixed an inconsequential bug in the dprime calculation
-Closer inspection shows that the simple mean case gives a ~50% false alarm rate, but the high sensitivity offsets this. Suggests dprime isn't a wise summary statistic?