library(tidyverse)
library(here)
library(janitor)
<- read_csv(here("data", "steps_clean.csv")) df_data
Testing two means and contingency tables
In this chapter we focus on testing differences between groups. Previously we have calculated p-values and confidence intervals for one-sample means and proportions. In psychiatric research, we are more often interested in comparing means between groups. For instance, the post-treatment symptom levels between the treatment group and the control group.
In the STePS study, a primary comparison was the severity of social anxiety symptoms post-treatment between the self-guided and the therapist-guided treatment groups. As a first check, we can load the data and calculate get some descriptive statistics for the post-treatment LSAS-scores across the treatment groups.
Load packages and data
Testing the mean difference of two independent groups
And check the mean post-treatment LSAS-score across the groups
|>
df_data group_by(trt) |>
summarise(
n = n(),
mean = mean(lsas_post, na.rm = TRUE),
sd = sd(lsas_post, na.rm = TRUE)
)
# A tibble: 3 × 4
trt n mean sd
<chr> <int> <dbl> <dbl>
1 self-guided 61 64.9 21.1
2 therapist-guided 60 57.4 23.2
3 waitlist 60 78.4 25.4
We see that the therapist-guided groups has the lowest treatment scores post-treatment. We can also do a simple check of differences in the mean values across the groups of interest.
mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = TRUE) -
mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = TRUE)
[1] 7.560429
So the mean values of LSAS-scores is 7.6 point higher in the treatment group receiving self-guided treatment compared to the group receiving therapist-guided treatment. However, we still do not know how likely this difference is to have occurred by chance. For this we need a statistical test!
To test the difference between two means, we can use the Student’s independent groups t-test:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_{\text{pooled}}} \]
Where \(\bar{X_1}\) and \(\bar{X_2}\) are the means of the first and second group, and where the pooled standard error is:
\[ SE_{\text{pooled}} = \sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]
and the pooled variance is:
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]
Rather than comparing the mean value against some value of interest, as in the one-sample t-test, here we compare the means of two groups against each other. Usually this is done under the null hypothesis:
\[ H_{0}: \mu_{\text{therapist-guided}} = \mu_{\text{self-guided}} \]
and the alternative hypothesis:
\[ H_{A}: \mu_{\text{therapist-guided}} \neq \mu_{\text{self-guided}} \]
This may all seem complicated, but the t-value simply represents the mean difference, expressed in standard errors units. In other words, the number of standard errors between the means. Let’s use the formulas above to calculate a t-value.
# define the components of the t-test formula
<- mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
x_bar_tg <- mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
x_bar_sg <- var(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
s_2_tg <- var(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
s_2_sg <- sum(!is.na(df_data$lsas_post[df_data$trt == "therapist-guided"])) # here we only count the number of persons that provided data
n_tg <- sum(!is.na(df_data$lsas_post[df_data$trt == "self-guided"])) # here we only count the number of persons that provided data
n_sg <- ((n_tg - 1) * s_2_tg + (n_sg - 1) * s_2_sg) / (n_tg + n_sg - 2) # pooled variance
sp2 <- sqrt(sp2 * (1 / n_tg + 1 / n_sg)) # pooled standard error
SE_pooled
# and put the together
<- (x_bar_sg - x_bar_tg) / SE_pooled
t_value t_value
[1] 1.799271
And use the t-distribution with \(n_1 + n_2 - 2\) degrees of freedom to get the p-value for this specific t-value
<- n_tg + n_sg - 2
df <- 2 * (1 - pt(abs(t_value), df)) # multiplied by two to get the two-tailed p-value
p_value p_value
[1] 0.07474269
Or use the t.test()
function, and filter to remove the waitlist group.
t.test(lsas_post ~ trt,
data = df_data[df_data$trt != "waitlist", ],
var.equal = TRUE
)
Two Sample t-test
data: lsas_post by trt
t = 1.7993, df = 109, p-value = 0.07474
alternative hypothesis: true difference in means between group self-guided and group therapist-guided is not equal to 0
95 percent confidence interval:
-0.7676792 15.8885369
sample estimates:
mean in group self-guided mean in group therapist-guided
64.91228 57.35185
- Independence of observations
- The two groups must be independent (no participant in both groups, no repeated measures).
- Scale of measurement
- The dependent variable should be continuous (interval or ratio).
- The independent variable should be categorical with two groups.
- Normality
- The dependent variable in each group should be approximately normally distributed.
- More critical for small samples; larger samples benefit from the Central Limit Theorem.
- Homogeneity of variance
- Both groups should have equal variances.
- If violated, use Welch’s t-test instead (set the argument var.equal = FALSE in the
t.test()
function)
- Random sampling
- Ideally, groups are randomly sampled or randomly assigned in an experiment.
We can also get the confidence interval for this mean difference. As always, the confidence interval is calculated by taking the estimate of interest, in this case the mean difference \(\pm\) our desired number of standard errors. For a confidence interval with large samples, the sampling distribution approaches the z-distribution (normal distribution with \(\mu = 0\) and \(\sigma = 1\)). We can then use the z-values covering the middle 95% of the z-distribution, which are \(\pm 1.96\). To check that the \(z_\alpha/2\) for an alpha of 0.05 is indeed \(\pm 1.96\) we could use the the function qnorm()
that gives the quantile function of the normal distribution.
qnorm(1 - 0.05 / 2)
[1] 1.959964
The formula for the confidence interval of the mean difference using z-values is:
\[ (\bar{X}_1 - \bar{X}_2) \;\pm\; z_{\alpha/2} \cdot SE_{z} \]
where
\[ SE_{z} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
If we do not have a large sample, we can use the t-distribution with the formula:
\[ (\bar{X}_1 - \bar{X}_2) \;\pm\; t_{\alpha/2,\,df} \cdot SE_{\text{pooled}} \]
Putting this together, we get the confidence interval for the mean difference.
# define the components of the formula
<- mean(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
x_bar_tg <- mean(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
x_bar_sg <- var(df_data$lsas_post[df_data$trt == "therapist-guided"], na.rm = T)
s_2_tg <- var(df_data$lsas_post[df_data$trt == "self-guided"], na.rm = T)
s_2_sg <- sum(!is.na(df_data$lsas_post[df_data$trt == "therapist-guided"])) # here we only count the number of persons that provided data
n_tg <- sum(!is.na(df_data$lsas_post[df_data$trt == "self-guided"])) # here we only count the number of persons that provided data
n_sg <- sqrt((s_2_tg / n_tg) + (s_2_sg / n_sg)) # save the SE of the difference as a separate object for conveninence
se_z <- ((n_tg - 1) * s_2_tg + (n_sg - 1) * s_2_sg) / (n_tg + n_sg - 2) # pooled variance
sp2 <- sqrt(sp2 * (1 / n_tg + 1 / n_sg)) # pooled standard error
SE_pooled
# and put it together with the z-value formula
<- (x_bar_sg - x_bar_tg) + 1.96 * se_z
ucl <- (x_bar_sg - x_bar_tg) - 1.96 * se_z
lcl print(c(lcl, ucl))
[1] -0.6967551 15.8176128
# or the t-value formula
<- n_sg + n_tg - 2
df <- 0.05
alpha <- qt(1 - alpha / 2, df)
t_crit <- (x_bar_sg - x_bar_tg) + t_crit * SE_pooled
ucl <- (x_bar_sg - x_bar_tg) - t_crit * SE_pooled
lcl print(c(lcl, ucl))
[1] -0.7676792 15.8885369
The confidence interval from the t-distribution is also given by the t.test()
function.
t.test(lsas_post ~ trt,
data = df_data[df_data$trt != "waitlist", ],
var.equal = TRUE
)
Two Sample t-test
data: lsas_post by trt
t = 1.7993, df = 109, p-value = 0.07474
alternative hypothesis: true difference in means between group self-guided and group therapist-guided is not equal to 0
95 percent confidence interval:
-0.7676792 15.8885369
sample estimates:
mean in group self-guided mean in group therapist-guided
64.91228 57.35185
Testing dependent means
The independent group t-test assumes that the groups compared are independent, which is the case for the two treatment groups. Sometimes, however, we want to compare the means of dependent groups. This could be for instance the difference in means between two time-point of the same group. Then we can use a paired-sample t-test.
For a paired samples t-test, the statistic is:
\[ t = \frac{\bar{D}}{SE_D} \]
where:
- \(\bar{D}\) = mean of the difference scores
- \(SE_D\) = the standard error of the difference scores calculated as \(s_D/ \sqrt{n}\) = the standard error of the differences, with \(s_D\) = the standard deviation of the differences and \(n\) = the number of paired observations
Let’s use this to formula to test the difference in LSAS-scores from pre-to post-treatment.
<- df_data$lsas_post - df_data$lsas_screen
diff_pre_post <- mean(diff_pre_post, na.rm = TRUE)
mean_diff <- sd(diff_pre_post, na.rm = TRUE)
sd_diff <- sum(!is.na(diff_pre_post))
n <- sd_diff / sqrt(n)
se_diff <- mean_diff / se_diff
t_value t_value
[1] -11.07385
And get the two-tailed p-value from this as before
<- n - 1
df <- 2 * (1 - pt(abs(t_value), df))
p_value p_value
[1] 0
Or using the t.test()
function with the the argument paired = TRUE
.
t.test(df_data$lsas_post, df_data$lsas_screen, paired = TRUE)
Paired t-test
data: df_data$lsas_post and df_data$lsas_screen
t = -11.074, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-21.05556 -14.68409
sample estimates:
mean difference
-17.86982
We can also get the confidence intervals of this difference, using a familiar formula for large samples
\[ \bar{D} \;\pm\; z_{\alpha/2} \cdot SE_D \]
For smaller samples, we need use the t-distribution and find the right t-value for our degrees of freedom and coverage interval.
\[ CI = \bar{D} \;\pm\; t_{\alpha/2, \, df} \cdot SE_D \]
And again, \(SE_D = {s_D}/ {\sqrt{n}}\)
<- df_data$lsas_post - df_data$lsas_screen
diff_pre_post <- mean(diff_pre_post, na.rm = TRUE)
mean_diff <- sd(diff_pre_post, na.rm = TRUE)
sd_diff <- sum(!is.na(diff_pre_post))
n <- (sd_diff / sqrt(n)) # saving the standard error of the difference as a separate object for convenience
se_diff
# and putting it together
<- mean_diff - 1.96 * se_diff
lcl <- mean_diff + 1.96 * se_diff
ucl print(c(lcl, ucl))
[1] -21.03267 -14.70698
or manually using the t-values
<- n - 1
df <- 0.05
alpha <- qt(1 - alpha / 2, df)
t_crit <- mean_diff - t_crit * se_diff
lcl <- mean_diff + t_crit * se_diff
ucl print(c(lcl, ucl))
[1] -21.05556 -14.68409
Or again, more conveniently using the t.test()
function
t.test(df_data$lsas_post, df_data$lsas_screen, paired = TRUE)
Paired t-test
data: df_data$lsas_post and df_data$lsas_screen
t = -11.074, df = 168, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-21.05556 -14.68409
sample estimates:
mean difference
-17.86982
Contingency tables and the Chi-squared test
Say that we wanted to test the distribution of some categorical variable across two groups, the the methods presented above will not help. For this we need other tests.
Say that we wanted to know if the gender distribution was similar across in the self-guided as in the therapist-guided treatment group of the the STEpS study. The first step to such an investigation would be to create a contingency table showing the gender distribution across the groups.
# create a gender variable
$gender <- rbinom(nrow(df_data), 1, 0.7)
df_data$gender <- ifelse(df_data$gender == 1, "Woman", "Man") df_data
table(df_data$trt, df_data$gender) # number of person
Man Woman
self-guided 14 47
therapist-guided 19 41
waitlist 14 46
prop.table(table(df_data$trt, df_data$gender), margin = 1) # proportion in each treatment group
Man Woman
self-guided 0.2295082 0.7704918
therapist-guided 0.3166667 0.6833333
waitlist 0.2333333 0.7666667
We see that there are some small differences between the groups, but we have no statistical test of these differences. Say that we wanted to test the null hypothesis that there is no association between gender and the treatment group you end up in. This hypothesis should hold, due to the randomization of the STEpS study.
One way to test if the proportion of men and women differs between the groups is perform a Chi-squared test (\(X^2-test\)).
The formula for Chi-squared test statistic is:
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
where
\[ E_{ij} = \frac{(\text{row total})_i \cdot (\text{column total})_j}{\text{grand total}} \]
This test compares the observed counts in each cell against what would be expected if values where distributed equally across the groups (i.e. across the cells of the contingency table). Coding this formula manually is a bit tricky, but here is the code for those interested.
## Build the contingency table
<- table(trt = df_data$trt, gender = df_data$gender)
tbl
## Convert this table to matrix format for the following calculations
<- as.matrix(tbl)
O O
gender
trt Man Woman
self-guided 14 47
therapist-guided 19 41
waitlist 14 46
## Expected counts E_ij = (row total_i * col total_j) / grand total
<- rowSums(O)
row_tot <- colSums(O)
col_tot <- sum(O)
grand <- outer(row_tot, col_tot) / grand # outer is a function to multiply arayys
E E
Man Woman
self-guided 15.83978 45.16022
therapist-guided 15.58011 44.41989
waitlist 15.58011 44.41989
## Chi-square statistic: sum_{i,j} (O_ij - E_ij)^2 / E_ij
<- sum((O - E)^2 / E)
chi_sq
## Degrees of freedom: (r - 1)(c - 1)
<- nrow(O)
r <- ncol(O)
c <- (r - 1) * (c - 1)
df
## p-value from chi-square distribution
<- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value p_value
[1] 0.4678827
In R, these calculations can be easily performed using the function chisq.test()
that also provide tables for the observed and expected frequencies
chisq.test(df_data$trt, df_data$gender)
Pearson's Chi-squared test
data: df_data$trt and df_data$gender
X-squared = 1.5191, df = 2, p-value = 0.4679
chisq.test(df_data$trt, df_data$gender)$observed
df_data$gender
df_data$trt Man Woman
self-guided 14 47
therapist-guided 19 41
waitlist 14 46
chisq.test(df_data$trt, df_data$gender)$expected
df_data$gender
df_data$trt Man Woman
self-guided 15.83978 45.16022
therapist-guided 15.58011 44.41989
waitlist 15.58011 44.41989
Our results tells us that the null hypothesis of no association cannot be rejected. In other words, our data would not be unexpected if there was no association between gender and treatment group in the underlying population (kind of a strange though experiment, as there are only treatment variables in treatment studies, not in the underlying population).
The Chi-squared test is only valid if the counts in each cell is >5. When cell counts are lower, Fisher’s exact test should be used instead. You can use the fisher.test()
function
Other non-parametric tests
There are a number of other non-parametric tests that can be used if our data does not fulfill the assumptions for parametric tests, like t-tests and z-tests. We won’t go through the formulas for all these, but show you how they can be implemented in R.
Sign test
The sign test is used to test whether the median of paired differences equals a hypothesized value (often 0), using only the signs of differences, (i.e. + or -). This test can be used even when the sample is small, and the underlying population distribution is assumed to be non-normal.
The test is performed by counting how often a difference between pairs of values are positive. If there was no systematic change, this should occur 50% of the time. We then test whether our actual proportion of positive signs is likely to come from an underlying population where 50% of the signs are positive.
We could use it to test the nyll hypothesis that the median difference of LSAS-score at between pre-treatment and post-treatment is 0 (analog to our paired t-test above).
# Paired sign test: H0 median LSAS(pre - post) = 0
<- df_data$lsas_post - df_data$lsas_screen # create differnece scores
diff_pre_post <- sum(diff_pre_post != 0, na.rm = TRUE) # exclude ties (exact zeros)
n <- sum(diff_pre_post[diff_pre_post != 0] > 0, na.rm = TRUE) # number of positive differences (again excluding ties)
s binom.test(s, n, p = 0.5, alternative = "two.sided") # a test of the proportion
Exact binomial test
data: s and n
number of successes = 24, number of trials = 167, p-value < 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.0943006 0.2062463
sample estimates:
probability of success
0.1437126
This null hypothesis seems very unlikely.
Wilcoxon signed rank test
Tests if the distribution of paired differences is symmetric around 0 (often framed as median difference = 0), using magnitudes and signs.
Assumptions: Paired observations; differences are symmetrically distributed.
Pro’s: Unlike the t-test, it does not assume interval level data. It is also better powered that the sign test.
We can use this to test the null hypothesis that the median LSAS score is the same at post-treatment as at pre-treatment.
wilcox.test(df_data$lsas_post, df_data$lsas_screen,
paired = TRUE,
alternative = "two.sided",
exact = FALSE
)
Wilcoxon signed rank test with continuity correction
data: df_data$lsas_post and df_data$lsas_screen
V = 1310, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
Again, the null hypothesis seems very unlikely.
Wilcox rank-sum test
This test, aslo known as the Mann-Whitney U-test, test whether two independent samples come from the same distribution (often interpreted as a shift in location/medians).
Assumptions: Independent samples; similar shapes are helpful for a “median shift” interpretation.
Unlike the independent groups t-test, it does not assume interval level data. We can use this test to test the null hypothesis that LSAS scores at post-treatment have the same distribution in the self-guided and the therapist-guided treatment groups.
# Two-sample Wilcoxon rank-sum test (unpaired)
wilcox.test(lsas_post ~ trt,
data = df_data,
subset = trt %in% c("self-guided", "therapist-guided"),
exact = FALSE
)
Wilcoxon rank sum test with continuity correction
data: lsas_post by trt
W = 1872, p-value = 0.04974
alternative hypothesis: true location shift is not equal to 0