Testing two means and contingency tables

In this chapter we focus on testing differences between groups. Previously we have calculated p-values and confidence intervals for one-sample means and proportions. In psychiatric research, we are more often interested in comparing means between groups. For instance, the post-treatment symptom levels between the treatment group and the control group.

In the STePS study, a primary comparison was the severity of social anxiety symptoms post-treatment between the self-guided and the therapist-guided treatment groups. As a first check, we can load the data and calculate get some descriptive statistics for the post-treatment LSAS-scores across the treatment groups.

Load packages and data

library(tidyverse)
library(here)
library(janitor)

df_data <- read_csv(here("data", "steps_clean.csv"))

Testing the mean difference of two independent groups

And check the mean post-treatment LSAS-score across the groups

df_data |>
  group_by(trt) |>
  summarise(
    n      = n(),
    mean   = mean(lsas_post, na.rm = TRUE),
    sd     = sd(lsas_post, na.rm = TRUE)
  )
# A tibble: 3 × 4
  trt                  n  mean    sd
  <chr>            <int> <dbl> <dbl>
1 self-guided         61  64.9  21.1
2 therapist-guided    60  57.4  23.2
3 waitlist            60  78.4  25.4

We see that the therapist-guided groups has the lowest treatment scores post-treatment. We can also do a simple check of differences in the mean values across the groups of interest.

mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = TRUE) -
  mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = TRUE)
[1] 7.560429

So the mean values of LSAS-scores is 7.6 point higher in the treatment group receiving self-guided treatment compared to the group receiving therapist-guided treatment. However, we still do not know how likely this difference is to have occurred by chance. For this we need a statistical test!

To test the difference between two means, we can use the Student’s independent groups t-test:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\text{pooled}}} \]

Where \(\bar{X_1}\) and \(\bar{X_2}\) are the means of the first and second group, and where the pooled standard error is:

\[ SE_{\text{pooled}} = \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

and the pooled variance is:

\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

Rather than comparing the mean value against some value of interest, as in the one-sample t-test, here we compare the means of two groups against each other. Usually this is done under the null hypothesis:

\[ H_{0}: \mu_{\text{therapist-guided}} = \mu_{\text{self-guided}} \]

and the alternative hypothesis:

\[ H_{A}: \mu_{\text{therapist-guided}} \neq \mu_{\text{self-guided}} \]

This may all seem complicated, but the t-value simply represents the mean difference, expressed in standard errors units. In other words, the number of standard errors between the means. Let’s use the formulas above to calculate a t-value.

# define the components of the t-test formula
x_bar_tg <- mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
x_bar_sg <- mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
s_2_tg <- var(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
s_2_sg <- var(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
n_tg <- sum(!is.na(df_data$lsas_post[df_data$trt == "therapist-guided"])) # here we only count the number of persons that provided data
n_sg <- sum(!is.na(df_data$lsas_post[df_data$trt == "self-guided"])) # here we only count the number of persons that provided data
sp2 <- ((n_tg - 1) * s_2_tg + (n_sg - 1) * s_2_sg) / (n_tg + n_sg - 2) # pooled variance
SE_pooled <- sqrt(sp2 * (1 / n_tg + 1 / n_sg)) # pooled standard error

# and put the together
t_value <- (x_bar_sg - x_bar_tg) / SE_pooled
t_value
[1] 1.799271

And use the t-distribution with \(n_1 + n_2 - 2\) degrees of freedom to get the p-value for this specific t-value

df <- n_tg + n_sg - 2
p_value <- 2 * (1 - pt(abs(t_value), df)) # multiplied by two to get the two-tailed p-value
p_value
[1] 0.07474269

Or use the t.test() function, and filter to remove the waitlist group.

t.test(lsas_post ~ trt,
  data = df_data[df_data$trt != "waitlist", ],
  var.equal = TRUE
)

    Two Sample t-test

data:  lsas_post by trt
t = 1.7993, df = 109, p-value = 0.07474
alternative hypothesis: true difference in means between group self-guided and group therapist-guided is not equal to 0
95 percent confidence interval:
 -0.7676792 15.8885369
sample estimates:
     mean in group self-guided mean in group therapist-guided 
                      64.91228                       57.35185 
NoteAssumptions of an Independent Groups t-test
  1. Independence of observations
    • The two groups must be independent (no participant in both groups, no repeated measures).
  2. Scale of measurement
    • The dependent variable should be continuous (interval or ratio).
    • The independent variable should be categorical with two groups.
  3. Normality
    • The dependent variable in each group should be approximately normally distributed.
    • More critical for small samples; larger samples benefit from the Central Limit Theorem.
  4. Homogeneity of variance
    • Both groups should have equal variances.
    • If violated, use Welch’s t-test instead (set the argument var.equal = FALSE in the t.test() function)
  5. Random sampling
    • Ideally, groups are randomly sampled or randomly assigned in an experiment.

We can also get the confidence interval for this mean difference. As always, the confidence interval is calculated by taking the estimate of interest, in this case the mean difference \(\pm\) our desired number of standard errors. For a confidence interval with large samples, the sampling distribution approaches the z-distribution (normal distribution with \(\mu = 0\) and \(\sigma = 1\)). We can then use the z-values covering the middle 95% of the z-distribution, which are \(\pm 1.96\). To check that the \(z_\alpha/2\) for an alpha of 0.05 is indeed \(\pm 1.96\) we could use the the function qnorm() that gives the quantile function of the normal distribution.

qnorm(1 - 0.05 / 2)
[1] 1.959964

The formula for the confidence interval of the mean difference using z-values is:

\[ (\bar{X}_1 - \bar{X}_2) \;\pm\; z_{\alpha/2} \cdot SE_{z} \]

where

\[ SE_{z} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

If we do not have a large sample, we can use the t-distribution with the formula:

\[ (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{\alpha/2,\,df} \cdot SE_{\text{pooled}} \]

Putting this together, we get the confidence interval for the mean difference.

# define the components of the formula
x_bar_tg <- mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
x_bar_sg <- mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
s_2_tg <- var(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
s_2_sg <- var(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
n_tg <- sum(!is.na(df_data$lsas_post[df_data$trt == "therapist-guided"])) # here we only count the number of persons that provided data
n_sg <- sum(!is.na(df_data$lsas_post[df_data$trt == "self-guided"])) # here we only count the number of persons that provided data
se_z <- sqrt((s_2_tg / n_tg) + (s_2_sg / n_sg)) # save the SE of the difference as a separate object for conveninence
sp2 <- ((n_tg - 1) * s_2_tg + (n_sg - 1) * s_2_sg) / (n_tg + n_sg - 2) # pooled variance
SE_pooled <- sqrt(sp2 * (1 / n_tg + 1 / n_sg)) # pooled standard error

# and put it together with the z-value formula

ucl <- (x_bar_sg - x_bar_tg) + 1.96 * se_z
lcl <- (x_bar_sg - x_bar_tg) - 1.96 * se_z
print(c(lcl, ucl))
[1] -0.6967551 15.8176128
# or the t-value formula
df <- n_sg + n_tg - 2
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df)
ucl <- (x_bar_sg - x_bar_tg) + t_crit * SE_pooled
lcl <- (x_bar_sg - x_bar_tg) - t_crit * SE_pooled
print(c(lcl, ucl))
[1] -0.7676792 15.8885369

The confidence interval from the t-distribution is also given by the t.test() function.

t.test(lsas_post ~ trt,
  data = df_data[df_data$trt != "waitlist", ],
  var.equal = TRUE
)

    Two Sample t-test

data:  lsas_post by trt
t = 1.7993, df = 109, p-value = 0.07474
alternative hypothesis: true difference in means between group self-guided and group therapist-guided is not equal to 0
95 percent confidence interval:
 -0.7676792 15.8885369
sample estimates:
     mean in group self-guided mean in group therapist-guided 
                      64.91228                       57.35185 

Testing dependent means

The independent group t-test assumes that the groups compared are independent, which is the case for the two treatment groups. Sometimes, however, we want to compare the means of dependent groups. This could be for instance the difference in means between two time-point of the same group. Then we can use a paired-sample t-test.

For a paired samples t-test, the statistic is:

\[ t = \frac{\bar{D}}{SE_D} \]

where:

  • \(\bar{D}\) = mean of the difference scores
  • \(SE_D\) = the standard error of the difference scores calculated as \(s_D/ \sqrt{n}\) = the standard error of the differences, with \(s_D\) = the standard deviation of the differences and \(n\) = the number of paired observations

Let’s use this to formula to test the difference in LSAS-scores from pre-to post-treatment.

diff_pre_post <- df_data$lsas_post - df_data$lsas_screen
mean_diff <- mean(diff_pre_post, na.rm = TRUE)
sd_diff <- sd(diff_pre_post, na.rm = TRUE)
n <- sum(!is.na(diff_pre_post))
se_diff <- sd_diff / sqrt(n)
t_value <- mean_diff / se_diff
t_value
[1] -11.07385

And get the two-tailed p-value from this as before

df <- n - 1
p_value <- 2 * (1 - pt(abs(t_value), df))
p_value
[1] 0

Or using the t.test() function with the the argument paired = TRUE.

t.test(df_data$lsas_post, df_data$lsas_screen, paired = TRUE)

    Paired t-test

data:  df_data$lsas_post and df_data$lsas_screen
t = -11.074, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -21.05556 -14.68409
sample estimates:
mean difference 
      -17.86982 

We can also get the confidence intervals of this difference, using a familiar formula for large samples

\[ \bar{D} \;\pm\; z_{\alpha/2} \cdot SE_D \]

For smaller samples, we need use the t-distribution and find the right t-value for our degrees of freedom and coverage interval.

\[ CI = \bar{D} \;\pm\; t_{\alpha/2, \, df} \cdot SE_D \]

And again, \(SE_D = {s_D}/ {\sqrt{n}}\)

diff_pre_post <- df_data$lsas_post - df_data$lsas_screen
mean_diff <- mean(diff_pre_post, na.rm = TRUE)
sd_diff <- sd(diff_pre_post, na.rm = TRUE)
n <- sum(!is.na(diff_pre_post))
se_diff <- (sd_diff / sqrt(n)) # saving the standard error of the difference as a separate object for convenience

# and putting it together
lcl <- mean_diff - 1.96 * se_diff
ucl <- mean_diff + 1.96 * se_diff
print(c(lcl, ucl))
[1] -21.03267 -14.70698

or manually using the t-values

df <- n - 1
alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df)
lcl <- mean_diff - t_crit * se_diff
ucl <- mean_diff + t_crit * se_diff
print(c(lcl, ucl))
[1] -21.05556 -14.68409

Or again, more conveniently using the t.test() function

t.test(df_data$lsas_post, df_data$lsas_screen, paired = TRUE)

    Paired t-test

data:  df_data$lsas_post and df_data$lsas_screen
t = -11.074, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -21.05556 -14.68409
sample estimates:
mean difference 
      -17.86982 

Contingency tables and the Chi-squared test

Say that we wanted to test the distribution of some categorical variable across two groups, the the methods presented above will not help. For this we need other tests.

Say that we wanted to know if the gender distribution was similar across in the self-guided as in the therapist-guided treatment group of the the STEpS study. The first step to such an investigation would be to create a contingency table showing the gender distribution across the groups.

# create a gender variable
df_data$gender <- rbinom(nrow(df_data), 1, 0.7)
df_data$gender <- ifelse(df_data$gender == 1, "Woman", "Man")
table(df_data$trt, df_data$gender) # number of person
                  
                   Man Woman
  self-guided       14    47
  therapist-guided  19    41
  waitlist          14    46
prop.table(table(df_data$trt, df_data$gender), margin = 1) # proportion in each treatment group
                  
                         Man     Woman
  self-guided      0.2295082 0.7704918
  therapist-guided 0.3166667 0.6833333
  waitlist         0.2333333 0.7666667

We see that there are some small differences between the groups, but we have no statistical test of these differences. Say that we wanted to test the null hypothesis that there is no association between gender and the treatment group you end up in. This hypothesis should hold, due to the randomization of the STEpS study.

One way to test if the proportion of men and women differs between the groups is perform a Chi-squared test (\(X^2-test\)).

The formula for Chi-squared test statistic is:

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

where

\[ E_{ij} = \frac{(\text{row total})_i \cdot (\text{column total})_j}{\text{grand total}} \]

This test compares the observed counts in each cell against what would be expected if values where distributed equally across the groups (i.e. across the cells of the contingency table). Coding this formula manually is a bit tricky, but here is the code for those interested.

## Build the contingency table
tbl <- table(trt = df_data$trt, gender = df_data$gender)

## Convert this table to matrix format for the following calculations
O <- as.matrix(tbl)
O
                  gender
trt                Man Woman
  self-guided       14    47
  therapist-guided  19    41
  waitlist          14    46
## Expected counts E_ij = (row total_i * col total_j) / grand total
row_tot <- rowSums(O)
col_tot <- colSums(O)
grand <- sum(O)
E <- outer(row_tot, col_tot) / grand # outer is a function to multiply arayys
E
                      Man    Woman
self-guided      15.83978 45.16022
therapist-guided 15.58011 44.41989
waitlist         15.58011 44.41989
## Chi-square statistic: sum_{i,j} (O_ij - E_ij)^2 / E_ij
chi_sq <- sum((O - E)^2 / E)

## Degrees of freedom: (r - 1)(c - 1)
r <- nrow(O)
c <- ncol(O)
df <- (r - 1) * (c - 1)

## p-value from chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value
[1] 0.4678827

In R, these calculations can be easily performed using the function chisq.test() that also provide tables for the observed and expected frequencies

chisq.test(df_data$trt, df_data$gender)

    Pearson's Chi-squared test

data:  df_data$trt and df_data$gender
X-squared = 1.5191, df = 2, p-value = 0.4679
chisq.test(df_data$trt, df_data$gender)$observed
                  df_data$gender
df_data$trt        Man Woman
  self-guided       14    47
  therapist-guided  19    41
  waitlist          14    46
chisq.test(df_data$trt, df_data$gender)$expected
                  df_data$gender
df_data$trt             Man    Woman
  self-guided      15.83978 45.16022
  therapist-guided 15.58011 44.41989
  waitlist         15.58011 44.41989

Our results tells us that the null hypothesis of no association cannot be rejected. In other words, our data would not be unexpected if there was no association between gender and treatment group in the underlying population (kind of a strange though experiment, as there are only treatment variables in treatment studies, not in the underlying population).

Caution

The Chi-squared test is only valid if the counts in each cell is >5. When cell counts are lower, Fisher’s exact test should be used instead. You can use the fisher.test() function

Other non-parametric tests

There are a number of other non-parametric tests that can be used if our data does not fulfill the assumptions for parametric tests, like t-tests and z-tests. We won’t go through the formulas for all these, but show you how they can be implemented in R.

Sign test

The sign test is used to test whether the median of paired differences equals a hypothesized value (often 0), using only the signs of differences, (i.e. + or -). This test can be used even when the sample is small, and the underlying population distribution is assumed to be non-normal.

The test is performed by counting how often a difference between pairs of values are positive. If there was no systematic change, this should occur 50% of the time. We then test whether our actual proportion of positive signs is likely to come from an underlying population where 50% of the signs are positive.

We could use it to test the nyll hypothesis that the median difference of LSAS-score at between pre-treatment and post-treatment is 0 (analog to our paired t-test above).

# Paired sign test: H0 median LSAS(pre - post) = 0
diff_pre_post <- df_data$lsas_post - df_data$lsas_screen # create differnece scores
n <- sum(diff_pre_post != 0, na.rm = TRUE) # exclude ties (exact zeros)
s <- sum(diff_pre_post[diff_pre_post != 0] > 0, na.rm = TRUE) # number of positive differences (again excluding ties)
binom.test(s, n, p = 0.5, alternative = "two.sided") # a test of the proportion

    Exact binomial test

data:  s and n
number of successes = 24, number of trials = 167, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.0943006 0.2062463
sample estimates:
probability of success 
             0.1437126 

This null hypothesis seems very unlikely.

Wilcoxon signed rank test

Tests if the distribution of paired differences is symmetric around 0 (often framed as median difference = 0), using magnitudes and signs.

Assumptions: Paired observations; differences are symmetrically distributed.

Pro’s: Unlike the t-test, it does not assume interval level data. It is also better powered that the sign test.

We can use this to test the null hypothesis that the median LSAS score is the same at post-treatment as at pre-treatment.

wilcox.test(df_data$lsas_post, df_data$lsas_screen,
  paired = TRUE,
  alternative = "two.sided",
  exact = FALSE
)

    Wilcoxon signed rank test with continuity correction

data:  df_data$lsas_post and df_data$lsas_screen
V = 1310, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Again, the null hypothesis seems very unlikely.

Wilcox rank-sum test

This test, aslo known as the Mann-Whitney U-test, test whether two independent samples come from the same distribution (often interpreted as a shift in location/medians).

Assumptions: Independent samples; similar shapes are helpful for a “median shift” interpretation.

Unlike the independent groups t-test, it does not assume interval level data. We can use this test to test the null hypothesis that LSAS scores at post-treatment have the same distribution in the self-guided and the therapist-guided treatment groups.

# Two-sample Wilcoxon rank-sum test (unpaired)

wilcox.test(lsas_post ~ trt,
  data = df_data,
  subset = trt %in% c("self-guided", "therapist-guided"),
  exact = FALSE
)

    Wilcoxon rank sum test with continuity correction

data:  lsas_post by trt
W = 1872, p-value = 0.04974
alternative hypothesis: true location shift is not equal to 0