Lab: P-values and confidence intervals

In this lab, we’ll practice calculating p-values and confidence intervals in R. This is based on the P-values and confidence intervals chapter.

If you are working in R Studio, first load packages and data:

Calculate the two-sided p-value for the null hypothesis that the mean PHQ-9 value at screening (df_clean$phq9_screen) in the underlying population is 9, and describe in words what this number mean using the t.test() function in R. If you are working in R Studio, you can type ?t.test to get a description of what the function does.

You can calculate it manually using the formula in the In this lab, we’ll practice calculating p-values and confidence intervals in R. This is based on the P-values and confidence intervals chapter, or use the t.test() function

t.test(df_clean$phq9_screen,
       mu=___,
       alternative = "two.sided")

The full solution is:

#use the t.test() function
t.test(df_clean$phq9_screen,
       mu=9,
       alternative = "two.sided")
       
       
#or calculate it manually
x_bar <- mean(df_clean$phq9_screen)
n <- nrow(df_clean)
se <- sd(df_clean$phq9_screen) / sqrt(n)
t_value <- (x_bar - 9) / se
(1 - pt(t_value, df = 180))*2

Calculate the 95% confidence interval for PHQ-9 at screening (df_clean$phq9_screen), and describe in words what the numbers mean using the ´t.test()` function.

Again use the t.test() function, or if you want a challenge, try to calculate calculate the confidence interval manually using the formula in the P-values and confidence intervals chapter.

t.test(df_clean$_____)

The full solution is:


t.test(df_clean$phq9_screen) 

#or manually 
x_bar <- mean(df_clean$phq9_screen) # getting the mean and saving as "x_bar"
n <- nrow(df_clean) # getting the n (same as the number of rows) and saving as "n"
se <- sd(df_clean$phq9_screen) / sqrt(n) # calculating the stadard error, and saving as "se"
z <- 1.96 # setting the z-value to 1.95

#calculating the upper confidence limit and save as "ucl"
ucl <- x_bar + z*se
#calculating the lower confidence limit, and save as "lcl"
lcl <- x_bar - z*se

print(c(lcl, ucl)) # printing the upper and lower confidence limits

Calculate the p-value for getting our observed proportion of men, $\hat{p}$, (df_clean$gender=="Man")if the the true population proportion, $p$, is 40% or more using the prop.test() function. Remember that you can type ?prop.test() in R Studio to get some info on the function.

The full solution is:

# Simulating a gender variable
n <- nrow(df_clean)
df_clean$gender <- rbinom(n, 1, 0.7)
df_clean$gender <- ifelse(df_clean$gender == 1, "Woman", "Man")

prop.test(x= sum(df_clean$gender=="Man"), 
          n= length(df_clean$gender), 
          p=0.4,
          alternative="less")

Calculate the confidence interval for getting our observed proportion of men, $\hat{p}$, using the prop.test() function

The full solution is:

# Simulating a gender variable
n <- nrow(df_clean)
df_clean$gender <- rbinom(n, 1, 0.7)
df_clean$gender <- ifelse(df_clean$gender == 1, "Woman", "Man")

#getting ci 
prop.test(x= sum(df_clean$gender=="Man"), 
          n= length(df_clean$gender))
          
# #or use the table function for easier syntax 
prop.test(table(df_clean$gender))

Reason about the the meaning and interpretation of the confidence intervals you have calculated in the context of how the actual STEpS study was performed. The study can be found at: https://www.nature.com/articles/s44184-024-00063-0

No hints here, but please discuss with us :)

No solution here, but please discuss with us :)

Calculate the 95% Wald confidence interval using the formula for the confidence interval and compare it to what you got using the prop.test() function

you can use the formula for the standard error of the proportion: \[ \mathrm{SE}(p) = \sqrt{\frac{p(1 - p)}{n}} \]

and combine with the formula for the z-scores

\[ z= \frac{p - \hat{p}}{SE} \]

p_hat <- mean(df_clean$gender=="___")
n = nrow(____)
se <- sqrt(p_hat*(1-p_hat)/n)
z= 1.96

#upper confidence limit
ucl <- p_hat + __*__

#lower confidence limit
lcl <- p_hat - __*s__

print(c(lcl,ucl))

The full solution is:

# manual Wald CI
p_hat <- mean(df_clean$gender=="Man")
n = nrow(df_clean)
se <- sqrt(p_hat*(1-p_hat)/n)
z= 1.96
#upper coinfidence limit
ucl <- p_hat + z*se
#lower confidence limit
lcl <- p_hat - z*se

print(c(lcl,ucl))

Modify the simulation code for the sampling distribution to determine what would happen to the p-value if the sample size was 10, 100 or 1000

n_samples <- 1e4 # the number of samples
smp_size <- ____ # the size of our samples
means <- rep(NA, n_samples) # an empty vector to contain our mean values

for (i in 1:n_samples) {
  x <- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
  means[i] <- mean(x)
}

mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean

The full solution is:

# for a sample size of 10
n_samples <- 1e4 # the number of samples
smp_size <- 10 # the size of our samples
means <- rep(NA, n_samples) # an empty vector to contain our mean values

for (i in 1:n_samples) {
  x <- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
  means[i] <- mean(x)
}

mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean

# for a sample size of 100

n_samples <- 1e4 # the number of samples
smp_size <- 100 # the size of our samples
means <- rep(NA, n_samples) # an empty vector to contain our mean values

for (i in 1:n_samples) {
  x <- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
  means[i] <- mean(x)
}

mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean


# for a sample size of 1000 

n_samples <- 1e4 # the number of samples
smp_size <- 1000 # the size of our samples
means <- rep(NA, n_samples) # an empty vector to contain our mean values

for (i in 1:n_samples) {
  x <- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
  means[i] <- mean(x)
}

mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean