More Two Sample Tests

In this lecture we'll discuss the nonparametric Wilcoxon rank sum test as well as the paired sample t-test.

Wilcoxon Signed-Rank Test (one sample test)

First, let's briefly recall the one sample "Wilcoxon Signed-Rank Test". This is a nonparametric test used on data which is assumed to come from a symmetric distribution. Similar to the "Sign Test", this is a test on the median value. But since the distribution is assumed to be symmetric, the median is equal to the mean $\mu$.

  • H0: $\mu = \mu_0$

For the signed rank test, we sort the sample values by distance from $\mu_0$ and then add up the "ranks" of the values which are bigger than $\mu_0$. (This is similar to the "Sign Test", where you count the number of samples above the mean; except that now we also keep track of how far each value is from $\mu_0$.)

  • Statistic: $W^+ = \displaystyle\sum_{X_i > \mu_0} \operatorname{rank}|X_i - \mu_0|$

For small $n$, you can look up probability values in a table, but for large $n$ we expect $W^+$ to be approximately normal with


  • $E[W^+] = \dfrac{n(n+1)}{4}$

  • $SD[W^+] = \sqrt{\dfrac{n(n+1)(2n+1)}{24}}$
X <- sample.int(100,50,replace=TRUE)  # generate 50 random integers between 1 and 100
wilcox.test(X,mu=60)                  # Wilcox Signed Rank Test against mu=60
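
As a quick sanity check on the mean and standard deviation formulas above, we can simulate the null distribution of the signed-rank statistic (using continuous data so that there are no ties). Something along these lines should work:

# Simulate the signed-rank statistic W+ under H0 (data symmetric about mu0 = 0)
n    <- 50
sims <- replicate(5000, {
  D <- rnorm(n)                    # symmetric about 0, so H0 holds
  sum(rank(abs(D))[D > 0])         # the signed-rank statistic W+
})
c(mean(sims), n*(n+1)/4)                    # simulated vs theoretical mean
c(sd(sims),   sqrt(n*(n+1)*(2*n+1)/24))     # simulated vs theoretical SD
hist(sims, breaks = 40)                     # roughly normal for n this large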

DataCamp Workspaces has recently integrated AI code generation into workspaces. In order to show off this ability (and be lazy), I'll ask DataCamp to generate demonstration code for the Wilcoxon signed-rank test by hand below:

# Here's an example of how to perform a Wilcoxon Signed Rank Test by hand in R:

# First, let's generate some sample data
X <- sample.int(100,50,replace=TRUE)

# Test against mu = 60
mu <- 60

# Next, we'll calculate the difference from mu
D <- X - mu

# Then, we'll rank the absolute values of the differences
r <- rank(abs(D))

# We'll calculate the sum of the ranks for positive differences
Wp <- sum(r[D > 0])

# And the sum of the ranks for negative differences
Wn <- sum(r[D < 0])

# Finally, we'll calculate the test statistic
W <- min(Wp, Wn)

# We can calculate the p-value using the normal approximation to the null distribution of W
n <- length(D)
p <- 2 * pnorm( W, mean = (n * (n + 1)) / 4, sd = sqrt(n * (n + 1) * (2 * n + 1) / 24))

# Let's print out the results
cat("Wilcoxon Signed Rank Test\n")
cat("------------------------\n")
cat("Sample size: ", n, "\n")
cat("Test statistic: ", W, "\n")
cat("p-value: ", p, "\n")

That is pretty awesome!!!!

To be fair, the code above is not exactly the code that the DataCamp AI generated. I tweaked it a bit to match my first example more closely. Also, there was a mysterious error in the way it computed the p-value. The original computation generated by the AI was

p <- 2 * pnorm(-abs(W), mean = (n * (n + 1)) / 4, sd = sqrt(n * (n + 1) * (2 * n + 1) / 24))

The command

pnorm(-abs(W))

would work to compute two-sided p-values after standardizing (so that the mean is 0 and the standard deviation is 1), but it doesn't work for normal distributions that don't have mean 0. Anyway, since we already took W <- min(Wp,Wn), we know W will be to the left of the mean, so we can just call pnorm directly.

p <- 2 * pnorm( W , mean = (n * (n + 1)) / 4, sd = sqrt(n * (n + 1) * (2 * n + 1) / 24))
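
Just to double check the reasoning: if we standardize W first, the pnorm(-abs(..)) pattern gives the same answer (this reuses W and n from the by-hand code above).

z <- (W - n*(n+1)/4) / sqrt(n*(n+1)*(2*n+1)/24)   # standardize W
2 * pnorm(-abs(z))           # same p-value as above, since W sits below the mean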

Wilcoxon Rank-Sum Test (two sample test)

The two sample version of the Wilcoxon test is essentially computing p-values for a quantile-quantile plot. We begin with two data sets $X$ and $Y$, and we assume that they have the same shape and spread, differing only in position.

  • H0: $X$ and $Y$ have the same position (equal medians, $m_X = m_Y$)

For this test, we will combine the sample values $X_1,\dots,X_n$ and $Y_1,\dots,Y_m$ into a single set and sort them from smallest to largest, then add up the "ranks" of samples from each distribution.

If there are $n$ samples from $X$ and $m$ samples from $Y$, then the total number of samples is $n+m$, so the sum of all ranks is
$$1 + 2 + \cdots + (n+m) = \frac{(n+m)(n+m+1)}{2}.$$
If $X$ and $Y$ are distributed the same, then we expect the rank sums $W_X$ and $W_Y$ (the sums of the ranks of the $X$ samples and of the $Y$ samples, respectively) to match the proportions of samples from $X$ and $Y$.


  • $E[W_X] = \dfrac{n(n+m+1)}{2}$

  • $SD[W_X] = \sqrt{\dfrac{nm(n+m+1)}{12}}$

Once again, for small values of $n$ and $m$ you can look up corresponding p-values in tables; while for larger values (our textbook gives a specific cutoff for "large"), $W_X$ is approximately normal, allowing us to use a z-test (pnorm(..)).
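
As with the one sample test, a quick simulation under H0 (again with continuous data, so no ties) can be used to check the mean and standard deviation formulas above:

n <- 20; m <- 30
sims <- replicate(5000, {
  r <- rank(c(rnorm(n), rnorm(m)))   # both samples from the same distribution
  sum(r[1:n])                        # rank sum W_X of the first sample
})
c(mean(sims), n*(n+m+1)/2)                  # simulated vs theoretical mean
c(sd(sims),   sqrt(n*m*(n+m+1)/12))         # simulated vs theoretical SD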

Note. This reduces to the Signed-Rank Test if we let $X' = \{\,X_i - \mu_0 : X_i > \mu_0\,\}$ and $Y' = \{\,\mu_0 - X_i : X_i < \mu_0\,\}$, where $X'$ and $Y'$ are the samples above and below $\mu_0$, respectively. ("Folding at $\mu_0$" to compare the distribution of the samples below $\mu_0$ with those above.)
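
We can check this "folding" claim directly with the X from the signed-rank example above (dropping any samples exactly equal to $\mu_0$ so the two computations line up):

mu0   <- 60
Xnz   <- X[X != mu0]                      # drop values exactly at mu0
above <- abs(Xnz[Xnz > mu0] - mu0)        # distances of the samples above mu0
below <- abs(Xnz[Xnz < mu0] - mu0)        # distances of the samples below mu0
r     <- rank(c(above, below))            # rank the combined distances
sum(r[seq_along(above)])                  # rank sum of the "above" group ...
sum(rank(abs(Xnz - mu0))[Xnz > mu0])      # ... equals the signed-rank statistic W+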

X <- sample.int(100,20,replace=TRUE)  # generate 20 random integers between 1 and 100
Y <- sample.int(80, 30,replace=TRUE)  # generate 30 random integers between 1 and 80
wilcox.test(X,Y)                      # Wilcox Rank-Sum test of X vs Y

Compare this to a computation by hand (using the same data)

n <- length(X)         # get length of X
m <- length(Y)         # get length of Y

r <- rank(c(X,Y))      # make vector of ranks of X,Y values

Wx <- sum(r[1:n])      # sum up the ranks of X values
Wy <- (n+m)*(n+m+1) / 2 - Wx  # sum of the ranks of Y values (total rank sum minus Wx)

W  <- min(Wx,Wy)       # use the smaller rank sum (either one gives the same p-value)
mu <- ifelse(W == Wx, n, m) * (n+m+1) / 2   # mean of whichever rank sum we kept

p  <- 2*pnorm( -abs(W-mu), mean = 0, sd = sqrt(n*m*(n+m+1)/12) )

cat("Wilcoxon Rank-Sum Test\n")
cat("------------------------\n")
cat("Sample sizes: ", n, "&", m, "\n")
cat("Test statistic: ", W, "\n")
cat("p-value: ", p, "\n")

The code above gives a slightly different p-value because we are using the normal approximation to $W$, while R is doing something more precise (an exact computation when possible, and corrections for ties and continuity otherwise).

Since $n$ and $m$ are both "big", the normal approximation is good enough.
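
In fact, if we ask R to skip the exact computation and the continuity correction, the results should line up almost perfectly (any tiny remaining gap comes from R's adjustment for ties):

wilcox.test(X, Y, exact = FALSE, correct = FALSE)   # plain normal approximation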

Paired Sample t-Test

Frequently we wish to test for a difference of means in data which is not independent. For example, testing before/after effects where two measurements are made on the same person, one before a treatment and one after. Or testing opinion differences where each person is asked their opinion of two different items. In each of these cases, the fact that both measurements were made from the same source introduces possible correlation, violating the independence assumptions of the t-test.

Setup:
  • Input is an independent set of pairs of samples $(X_1,Y_1), (X_2,Y_2), \dots, (X_n,Y_n)$
  • Wish to test against H0: $\mu_X = \mu_Y$
    or more generally H0: $\mu_X - \mu_Y = \Delta_0$
Idea:
  • For the t-test before, we used the statistic $\bar{X} - \bar{Y}$   (difference of means)
  • Now we will use $\bar{D} = \frac{1}{n}\sum_{i=1}^{n}(X_i - Y_i)$   (mean of differences)
Plan:
  • Convert pairs to differences $D_i = X_i - Y_i$.
  • Do a single sample t-test on $D_1, \dots, D_n$.
X <- rnorm(10, 20, 8)     # generate some X samples
Y <- X + rnorm(10, 3, 4)  # Y ≈ X+3  (not independent)

curve(dnorm(x,20,8),from = 0, to = 40, ylab='')     # plot the distribution of X
curve(dnorm(x,23,sqrt(64+16)), add=TRUE, col='red') # plot the distribution of Y
legend('topleft', col = c('black','red'), 
	              lty = c( 1     ,  1  ),
	           legend = c('X',     'Y' ))

t.test(X, Y, paired=TRUE)

Note that paired two sample t-tests are really just one sample t-tests on the difference.

t.test(X-Y)

Paired Sample t-Test vs Two Sample t-Test

The paired sample t-test and two sample t-test are both tests on

  • H0: $\mu_X = \mu_Y$

The "mean difference" statistic used in the paired t-test is
   
identical to the "difference of means" statistic used in the two sample -test.
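
Numerically, with the X and Y from the paired example above, the two sides of this identity agree:

mean(X - Y)          # mean of the differences ...
mean(X) - mean(Y)    # ... equals the difference of the means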

Since the $(X_i, Y_i)$ pairs are not independent (although the individual $X$ samples are independent of each other, and similarly for $Y$), the variance of $\bar{D}$ involves a covariance term
$$\operatorname{Var}(\bar{D}) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{n} - \frac{2\operatorname{Cov}(X,Y)}{n}.$$

Recall that the first two terms above are the variance used in the two sample t-test if $X$ and $Y$ are each sampled $n$ times:
$$\operatorname{Var}(\bar{X} - \bar{Y}) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{n}.$$

This is the benefit of the paired t-test over the two sample t-test! Subtracting the covariance (which is typically positive for paired measurements) gives a much smaller standard error, so it will usually yield a smaller p-value. (Also, the two sample t-test was designed assuming that $X$ and $Y$ were independent, so it shouldn't really work in a non-independent setting anyway....)
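
Here is the comparison in numbers, using the X and Y from the paired example above (pulling out just the p-values with $p.value for a compact comparison):

var(X - Y)                            # = var(X) + var(Y) - 2*cov(X,Y)  (small, since cov is large)
var(X) + var(Y) - 2*cov(X, Y)         # same thing, term by term
var(X) + var(Y)                       # what the unpaired two sample t-test effectively uses
t.test(X, Y, paired = TRUE)$p.value   # paired test
t.test(X, Y)$p.value                  # unpaired test, usually a much larger p-value here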

Note that, in cases where $X$ and $Y$ are almost independent (i.e. $\operatorname{Cov}(X,Y)$ is small), it may still be possible for the two sample t-test to yield better p-values, since it has $2n-2$ degrees of freedom instead of $n-1$. This is especially the case if $n$ is small (once $n$ is reasonably large, there isn't a big difference between $n-1$ and $2n-2$ degrees of freedom...).
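
For a sense of scale on the degrees of freedom point, compare the t critical values (using the 10 pairs from the example above):

n <- length(X)            # 10 pairs in the example above
qt(0.975, df = n - 1)     # critical value for the paired test
qt(0.975, df = 2*n - 2)   # critical value for the two sample test (slightly smaller)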