Lab: P-values and confidence intervals
In this lab, we’ll practice calculating p-values and confidence intervals in R. This is based on the P-values and confidence intervals chapter.
If you are working in R Studio, first load packages and data:
Calculate the two-sided p-value for the null hypothesis that the mean PHQ-9 value at screening (df_clean$phq9_screen
) in the underlying population is 9, and describe in words what this number mean using the t.test()
function in R. If you are working in R Studio, you can type ?t.test
to get a description of what the function does.
You can calculate it manually using the formula in the In this lab, we’ll practice calculating p-values and confidence intervals in R. This is based on the P-values and confidence intervals chapter, or use the t.test()
function
t.test(df_clean$phq9_screen,
mu=___,
alternative = "two.sided")
The full solution is:
#use the t.test() function
t.test(df_clean$phq9_screen,
mu=9,
alternative = "two.sided")
#or calculate it manually
<- mean(df_clean$phq9_screen)
x_bar <- nrow(df_clean)
n <- sd(df_clean$phq9_screen) / sqrt(n)
se <- (x_bar - 9) / se
t_value 1 - pt(t_value, df = 180))*2
(
Calculate the 95% confidence interval for PHQ-9 at screening (df_clean$phq9_screen
), and describe in words what the numbers mean using the ´t.test()` function.
Again use the t.test()
function, or if you want a challenge, try to calculate calculate the confidence interval manually using the formula in the P-values and confidence intervals chapter.
t.test(df_clean$_____)
The full solution is:
t.test(df_clean$phq9_screen)
#or manually
<- mean(df_clean$phq9_screen) # getting the mean and saving as "x_bar"
x_bar <- nrow(df_clean) # getting the n (same as the number of rows) and saving as "n"
n <- sd(df_clean$phq9_screen) / sqrt(n) # calculating the stadard error, and saving as "se"
se <- 1.96 # setting the z-value to 1.95
z
#calculating the upper confidence limit and save as "ucl"
<- x_bar + z*se
ucl #calculating the lower confidence limit, and save as "lcl"
<- x_bar - z*se
lcl
print(c(lcl, ucl)) # printing the upper and lower confidence limits
Calculate the p-value for getting our observed proportion of men, \(\hat{p}\), (df_clean$gender=="Man"
)if the the true population proportion, \(p\), is 40% or more using the prop.test()
function. Remember that you can type ?prop.test()
in R Studio to get some info on the function.
prop.test(x= sum(df_clean$gender=="Man"),
n= length(____),
p=___,
alternative="less")
The full solution is:
# Simulating a gender variable
<- nrow(df_clean)
n $gender <- rbinom(n, 1, 0.7)
df_clean$gender <- ifelse(df_clean$gender == 1, "Woman", "Man")
df_clean
prop.test(x= sum(df_clean$gender=="Man"),
n= length(df_clean$gender),
p=0.4,
alternative="less")
Calculate the confidence interval for getting our observed proportion of men, \(\hat{p}\), using the prop.test()
function
prop.test(x= sum(df_clean$gender=="___"),
n= length(df_clean$gender))
#or use the table function for easier syntax
prop.test(table(df_clean$____))
The full solution is:
# Simulating a gender variable
<- nrow(df_clean)
n $gender <- rbinom(n, 1, 0.7)
df_clean$gender <- ifelse(df_clean$gender == 1, "Woman", "Man")
df_clean
#getting ci
prop.test(x= sum(df_clean$gender=="Man"),
n= length(df_clean$gender))
# #or use the table function for easier syntax
prop.test(table(df_clean$gender))
Reason about the the meaning and interpretation of the confidence intervals you have calculated in the context of how the actual STEpS study was performed. The study can be found at: https://www.nature.com/articles/s44184-024-00063-0
No hints here, but please discuss with us :)
No solution here, but please discuss with us :)
Calculate the 95% Wald confidence interval using the formula for the confidence interval and compare it to what you got using the prop.test()
function
you can use the formula for the standard error of the proportion: \[ \mathrm{SE}(p) = \sqrt{\frac{p(1 - p)}{n}} \]
and combine with the formula for the z-scores
\[ z= \frac{p - \hat{p}}{SE} \]
<- mean(df_clean$gender=="___")
p_hat = nrow(____)
n <- sqrt(p_hat*(1-p_hat)/n)
se = 1.96
z
#upper confidence limit
<- p_hat + __*__
ucl
#lower confidence limit
<- p_hat - __*s__
lcl
print(c(lcl,ucl))
The full solution is:
# manual Wald CI
<- mean(df_clean$gender=="Man")
p_hat = nrow(df_clean)
n <- sqrt(p_hat*(1-p_hat)/n)
se = 1.96
z#upper coinfidence limit
<- p_hat + z*se
ucl #lower confidence limit
<- p_hat - z*se
lcl
print(c(lcl,ucl))
Modify the simulation code for the sampling distribution to determine what would happen to the p-value if the sample size was 10, 100 or 1000
<- 1e4 # the number of samples
n_samples <- ____ # the size of our samples
smp_size <- rep(NA, n_samples) # an empty vector to contain our mean values
means
for (i in 1:n_samples) {
<- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
x <- mean(x)
means[i]
}
mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean
The full solution is:
# for a sample size of 10
<- 1e4 # the number of samples
n_samples <- 10 # the size of our samples
smp_size <- rep(NA, n_samples) # an empty vector to contain our mean values
means
for (i in 1:n_samples) {
<- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
x <- mean(x)
means[i]
}
mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean
# for a sample size of 100
<- 1e4 # the number of samples
n_samples <- 100 # the size of our samples
smp_size <- rep(NA, n_samples) # an empty vector to contain our mean values
means
for (i in 1:n_samples) {
<- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
x <- mean(x)
means[i]
}
mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean
# for a sample size of 1000
<- 1e4 # the number of samples
n_samples <- 1000 # the size of our samples
smp_size <- rep(NA, n_samples) # an empty vector to contain our mean values
means
for (i in 1:n_samples) {
<- rnorm(smp_size, mean = 82, sd = sd(df_clean$lsas_screen))
x <- mean(x)
means[i]
}
mean(means >= mean(df_clean$lsas_screen)) # proportion of simulated means that are larger than our observed mean