
Testing bootstrapping

Update: This post used an incorrect implementation of the bootstrap, so the conclusions don’t hold. See this correction.

This surprised me. I decided to try out bootstrapping as a method of testing whether two sets of numbers are drawn from different distributions. I did this by generating pairs of samples of size m from two ex-Gaussian distributions that are identical except for a fixed difference, d:

    % generate two samples of size m; s2 is shifted up by d
    s1=randn(1,m)+exp(randn(1,m));
    s2=randn(1,m)+exp(randn(1,m))+d;

All code is MATLAB. Sorry about that.

Then, for each pair of samples, I apply a series of tests of whether the distributions are different:
1. Standard t-test (0.05 significance level)
2. Simple comparison of the means: is mean(s1) < mean(s2)? (see the sketch after this list)
3. Bootstrapping using the mean as the test statistic (0.05 significance level)
4. Bootstrapping using the median as the test statistic (0.05 significance level)
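
A minimal sketch of test 2, assuming the decision rule is just the direction of the difference in sample means (an assumption, but one consistent with the ~50% false-alarm rate noted in the update below):

    % test 2 (sketch): call the distributions different whenever
    % the second sample's mean is higher than the first's
    H = mean(s1) < mean(s2);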

I used Ione Fine’s pages on bootstrapping as a guide. The bootstrapping code is:

function H=bootstrap(s1,s2,samples,alpha,method)
% Resample each group independently, build the bootstrap distribution
% of the difference in the chosen statistic, and reject if the
% percentile confidence interval excludes zero.

a=zeros(1,samples);   % preallocate the bootstrap differences
for i=1:samples

    % resample each group with replacement
    boot1=s1(ceil(rand(1,length(s1))*length(s1)));
    boot2=s2(ceil(rand(1,length(s2))*length(s2)));

    if method==1
        a(i)=mean(boot1)-mean(boot2);
    else
        a(i)=median(boot1)-median(boot2);
    end

end

% percentile confidence interval for the difference
CI=prctile(a,[100*alpha/2,100*(1-alpha/2)]);

% H=1: conclude the distributions are different
H = CI(1)>0 | CI(2)<0;
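
It is called with the two samples, the number of resamples, the significance level, and the choice of statistic; the post doesn't say how many resamples were used, so the 1000 below is just illustrative:

    H = bootstrap(s1, s2, 1000, 0.05, 1);   % method 1 = mean; anything else = median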

I do that 5000 times for each difference, d, and each sample size, m. Then I take the average answer from each test (where 1 is 'conclude the distributions are different' and 0 is 'don't conclude the distributions are different'). For the case where d > 0 this gives you a hit rate: the likelihood that the test will tell you there is a difference when there is one. For d = 0.5 you get a difference that most of the tests can detect the majority of the time, as long as the sample size is more than 50. For the case where d = 0, you can calculate the false alarm rate for each test (at each sample size).
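A minimal sketch of that simulation loop (the driver code isn't shown in the post, so the range of sample sizes, the 1000 bootstrap resamples, and the use of MATLAB's ttest2 are all assumptions):

% Sketch of the simulation driver. Fills the matrices used in the
% plotting code below: rows are d = 0 and d = 0.5, columns are
% the sample sizes in 'measures'.
measures=10:10:100;          % sample sizes, m (assumed range)
diffs=[0 0.5];               % differences, d
reps=5000;
Sttest=zeros(2,length(measures)); Smean=Sttest;
Sbootstrap=Sttest; Sbootstrap2=Sttest;
for di=1:2
    for mi=1:length(measures)
        m=measures(mi); d=diffs(di);
        for r=1:reps
            s1=randn(1,m)+exp(randn(1,m));
            s2=randn(1,m)+exp(randn(1,m))+d;
            Sttest(di,mi)=Sttest(di,mi)+ttest2(s1,s2);
            Smean(di,mi)=Smean(di,mi)+(mean(s1)<mean(s2));
            Sbootstrap(di,mi)=Sbootstrap(di,mi)+bootstrap(s1,s2,1000,0.05,1);
            Sbootstrap2(di,mi)=Sbootstrap2(di,mi)+bootstrap(s1,s2,1000,0.05,2);
        end
    end
end
% convert counts to proportions (hit and false alarm rates)
Sttest=Sttest/reps; Smean=Smean/reps;
Sbootstrap=Sbootstrap/reps; Sbootstrap2=Sbootstrap2/reps;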

From these you can calculate d-prime as a standard index of sensitivity (d' = norminv(hit rate) - norminv(false alarm rate), where norminv is the inverse of the standard normal CDF) and plot the result. Sttest, Smean, Sbootstrap and Sbootstrap2 are matrices which hold the likelihood of the four tests giving a positive answer at each sample size (columns) for the two differences, 0 and 0.5 (the rows):

figure(1);clf
% d' = norminv(hit rate) - norminv(false alarm rate) for each test
plot(measures,norminv(Sttest(2,:))-norminv(Sttest(1,:)),'k')
hold on
plot(measures,norminv(Smean(2,:))-norminv(Smean(1,:)),'r')
%plot(measures,norminv(Smedian(2,:))-norminv(Smedian(1,:)),'c--')
plot(measures,norminv(Sbootstrap(2,:))-norminv(Sbootstrap(1,:)),'m')
plot(measures,norminv(Sbootstrap2(2,:))-norminv(Sbootstrap2(1,:)),'g')
xlabel('Sample size')
ylabel('Sensitivity (d-prime)')
legend('T-test','mean','bstrap-mean','bstrap-median')

Here is the result:

[Figure: sensitivity (d-prime) as a function of sample size for the four tests]

What surprised me was:

  • The t-test is more sensitive than the bootstrap if the mean is used as the test statistic
  • How much more sensitive the bootstrap is than the other tests if the median is used as the test statistic
  • How well the simple mean does. I suspect there's some nuance I'm missing here, such as an unacceptably high false-positive rate for smaller differences

Update 28/11/12:
- Fixed an inconsequential bug in the d-prime calculation.
- Closer inspection shows that the simple mean case gives a ~50% false alarm rate, but the high sensitivity offsets this (with a false alarm rate of 0.5, norminv(0.5) = 0, so d-prime reduces to norminv(hit rate)). Suggests d-prime isn't a wise summary statistic?

3 replies on “Testing bootstrapping”

Three things.
First, the description of the test you apply to the sample means is cut off (maybe an HTML error?), so I can’t tell what that’s really doing.
Second, with an exponential distribution, the sampling distribution of the mean is already pretty Gaussian by a sample size of 30. (Plot it and see!) So it’s not surprising the t-test works pretty well.
Third, for reasons discussed here, one generally gets more accurate bootstrap confidence intervals by re-centering around the original estimate, rather than just taking the tails of the bootstrap values. (I suspect that sentence isn’t very clear, but see the link for what I mean.) If all you really want is a test for equality of means here, it’d be even faster and more efficient to simply permute the labels of the two groups from the sample measurements and get the distribution of differences, but that won’t give you a confidence interval for the difference.
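
A minimal sketch of the permutation test the commenter describes (the function name and structure here are illustrative, not from the post or the comment):

% Permutation test for a difference in means: pool the two groups,
% shuffle the group labels, and compare the observed difference to
% the distribution of shuffled differences.
function p = permtest(s1, s2, nperm)
obs = mean(s1) - mean(s2);
pooled = [s1 s2];
n1 = length(s1);
diffs = zeros(1, nperm);
for i = 1:nperm
    shuffled = pooled(randperm(length(pooled)));
    diffs(i) = mean(shuffled(1:n1)) - mean(shuffled(n1+1:end));
end
% two-sided p-value: fraction of shuffles at least as extreme
p = mean(abs(diffs) >= abs(obs));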
