Update: This post used an incorrect implementation of the bootstrap, so the conclusions don't hold. See this correction.
This surprised me. I decided to try out bootstrapping as a method of testing whether two sets of numbers are drawn from different distributions. I did this by generating pairs of samples of size m from two ex-Gaussian distributions which are identical except for a fixed difference, d.
All code is MATLAB. Sorry about that.
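For concreteness, here is one way the samples might be generated. An ex-Gaussian is a normal plus an independent exponential; the parameter values below (mu, sigma, tau) and the sample size are illustrative assumptions, not necessarily the values used in the original simulation:

m = 50;                       % sample size (illustrative)
mu = 0; sigma = 1; tau = 1;   % ex-Gaussian parameters (assumed)
d = 0.5;                      % fixed difference between the groups
s1 = (mu + d) + sigma*randn(1,m) + exprnd(tau,1,m);  % shifted group
s2 = mu + sigma*randn(1,m) + exprnd(tau,1,m);        % unshifted group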
Then, for each pair of samples I apply a series of different tests of whether the distributions are different:
1. Standard t-test (0.05 significance level)
2. Is mean(s1) greater than mean(s2)? (a simple comparison of the sample means, with no significance test)
3. Bootstrapping using the mean as the test statistic (0.05 significance level)
4. Bootstrapping using the median as the test statistic (0.05 significance level)
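The first two are straightforward. A sketch of how they might be applied to one pair of samples (ttest2 is MATLAB's two-sample t-test, whose default significance level is 0.05; the exact calls in the original simulation may differ):

Httest = ttest2(s1, s2);      % 1 if equal means rejected at the 0.05 level
Hmean = mean(s1) > mean(s2);  % naive comparison of means, no significance test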
I used Ione Fine’s pages on bootstrapping as a guide. The bootstrapping code is:
function H = bootstrap(s1, s2, samples, alpha, method)
% Percentile bootstrap: resample each group with replacement, take the
% difference in the test statistic, and report a difference if the
% (1-alpha) percentile interval of those differences excludes zero.
a = zeros(1, samples);                                % preallocate
for i = 1:samples
    boot1 = s1(ceil(rand(1,length(s1))*length(s1)));  % resample s1 with replacement
    boot2 = s2(ceil(rand(1,length(s2))*length(s2)));  % resample s2 with replacement
    if method == 1
        a(i) = mean(boot1) - mean(boot2);             % test statistic: mean
    else
        a(i) = median(boot1) - median(boot2);         % test statistic: median
    end
end
CI = prctile(a, [100*alpha/2, 100*(1-alpha/2)]);
H = CI(1) > 0 | CI(2) < 0;                            % 1 if the CI excludes zero
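The function would then be called with something like the following; the resample count of 1000 is an assumption, as the number used originally isn't given here:

Hboot_mean   = bootstrap(s1, s2, 1000, 0.05, 1);  % test statistic: mean
Hboot_median = bootstrap(s1, s2, 1000, 0.05, 2);  % test statistic: median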
I do that 5000 times for each difference, d, and each sample size, m. Then I take the average answer from each test (where 1 is 'conclude the distributions are different' and 0 is 'don't conclude the distributions are different'). For the case where d > 0 this gives you the hit rate: the likelihood that the test will report a difference when there really is one. For d = 0.5 you get a difference that most of the tests can detect the majority of the time, as long as the sample size is more than 50. For the case where d = 0, you can calculate the false alarm rate of each test (at each sample size).
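In case it helps, a minimal sketch of what that outer loop might look like for the t-test alone (the other three tests would fill Smean, Sbootstrap and Sbootstrap2 in the same way); the ex-Gaussian parameters and the range of sample sizes are illustrative assumptions:

reps = 5000;
measures = 10:10:200;  % sample sizes (assumed range)
ds = [0 0.5];          % row 1: d = 0 (false alarms); row 2: d = 0.5 (hits)
Sttest = zeros(length(ds), length(measures));
for j = 1:length(ds)
    for k = 1:length(measures)
        m = measures(k);
        h = zeros(1, reps);
        for r = 1:reps
            s1 = ds(j) + randn(1,m) + exprnd(1,1,m);  % shifted ex-Gaussian
            s2 = randn(1,m) + exprnd(1,1,m);          % unshifted ex-Gaussian
            h(r) = ttest2(s1, s2);                    % 1 if test says 'different'
        end
        Sttest(j,k) = mean(h);  % proportion of positive answers
    end
end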
From these you can calculate d-prime, a standard index of sensitivity (the hit and false alarm rates converted to z-scores, then subtracted), and plot the result. Sttest, Smean, Sbootstrap and Sbootstrap2 are matrices which hold the likelihood of the four tests giving a positive answer at each sample size (columns) for the two differences, 0 and 0.5 (the rows):
figure(1); clf
plot(measures, norminv(Sttest(2,:)) - norminv(Sttest(1,:)), 'k')
hold on
plot(measures, norminv(Smean(2,:)) - norminv(Smean(1,:)), 'r')
%plot(measures, norminv(Smedian(2,:)) - norminv(Smedian(1,:)), 'c--')
plot(measures, norminv(Sbootstrap(2,:)) - norminv(Sbootstrap(1,:)), 'm')
plot(measures, norminv(Sbootstrap2(2,:)) - norminv(Sbootstrap2(1,:)), 'g')
xlabel('Sample size')
ylabel('sensitivity - d prime')
legend('T-test', 'mean', 'bstrap-mean', 'bstrap-median')
Here is the result:
What surprised me was:
- The t-test is more sensitive than the bootstrap, if the mean is used as the test statistic
- How much more sensitive the bootstrap is than the other tests if the median is used as the test statistic
- How well the simple mean does. I suspect there's some nuance I'm missing here, such as an unacceptably high false-positive rate for smaller differences
- Update: Fixed an inconsequential bug in the d-prime calculation.
- Update: Closer inspection shows that the simple mean case gives a ~50% false alarm rate, but its high sensitivity offsets this. Perhaps d-prime isn't a wise summary statistic here.
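As a worked check on that last point: d-prime here is norminv(hit rate) - norminv(false alarm rate), and norminv(0.5) = 0, so a 50% false alarm rate contributes nothing to the score. A hit rate of, say, 0.9 still yields d' = norminv(0.9) ≈ 1.28, even though the test is wrong half the time when there is no difference.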