Lab: Sampling from a population

In this lab, we’ll practice calculating p-values and confidence intervals in R. This is based on the Sampling from a population chapter.

If you are working in RStudio, first load packages and data:

Calculate the standard error of the mean GAD-7 score at screening (df_clean$gad_screen), and describe in words what the number means.

The formula for the standard error of a mean using the sample variance $s^2$ is:

\[ SE = \sqrt{\frac{s^2}{n}} \]

You can save the different parts of the formula as objects in your R environment:

s2 <-  var(df_clean$gad_screen)
n <- length(df_clean$gad_screen)
se <- sqrt(_________)

The full solution is:

s2 <- var(df_clean$gad_screen) # sample variance (not sd, which would give s rather than s^2)
n <- length(df_clean$gad_screen)
se <- sqrt(s2/n)
se

This value estimates the standard deviation of the sampling distribution of the mean. In other words, it describes how much the sample mean would typically vary across repeated samples of the same size drawn from the same underlying population.
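To see this interpretation in action, here is a minimal sketch using simulated data rather than the STePS data. The population (`population`), its mean and standard deviation, the sample size of 50, and the 5,000 replications are all illustrative assumptions. The sketch compares the standard deviation of many sample means with the formula-based standard error from a single sample.

```r
# Hypothetical population of 100,000 simulated scores (not the STePS data)
set.seed(123)
population <- rnorm(100000, mean = 10, sd = 4)

n <- 50 # size of each sample

# Draw 5,000 samples of size n and store the mean of each one
sample_means <- replicate(5000, mean(sample(population, size = n)))

# Empirical standard error: the standard deviation of the sample means
empirical_se <- sd(sample_means)

# Formula-based standard error computed from one single sample
one_sample <- sample(population, size = n)
formula_se <- sqrt(var(one_sample) / n)

empirical_se
formula_se
```

The two numbers should be close but not identical, because the formula-based value relies on a single sample's variance as an estimate of the population variance.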

Calculate the standard error for the proportion of participants with a high income in the STePS study, and describe the meaning of this number in words.

The formula for the standard error of a proportion is:

\[ \mathrm{SE}(p) = \sqrt{\frac{p(1 - p)}{n}} \]

You can save the different parts of the formula as objects in your R environment:

p <- mean(df_clean$income=="High") # using the mean function to get the probability, please ask about how this works :)
n <- nrow(___) # the number of observations (since we have no NA values) 
se <- sqrt(___) # standard error 
se

The full solution is:

# Simulate some income data (sampled at random, so the result differs between runs)
n <- nrow(df_clean)
income_levels <- c("Low", "Medium", "High")
income_probs <- c(0.2, 0.6, 0.2) # Adjust probabilities as needed
df_clean$income <- sample(income_levels, size = n, replace = TRUE, prob = income_probs)

p <- mean(df_clean$income=="High") # using the mean function to get the probability 
n <- nrow(df_clean) # the number of observations (since we have no NA values) 
se <- sqrt(p * (1 - p) / n) # standard error 
se
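The `mean()` call above works because R treats `TRUE` as 1 and `FALSE` as 0, so the mean of a logical vector is the proportion of `TRUE` values. The following small sketch shows this on a made-up vector; `income` and `is_high` here are hypothetical toy objects, not the study data.

```r
# Toy example with five made-up income values
income <- c("High", "Low", "Medium", "High", "Low")

# Comparing with "High" gives a logical vector: TRUE FALSE FALSE TRUE FALSE
is_high <- income == "High"
is_high

# Counting the TRUE values and dividing by the length gives the proportion
prop_by_hand <- sum(is_high) / length(is_high)

# mean() performs the same calculation in one step
prop_by_mean <- mean(is_high)

prop_by_hand
prop_by_mean
```

Note that this shortcut assumes there are no missing values; with `NA` in the vector, `mean()` would return `NA` unless `na.rm = TRUE` is set.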

Reason about what would happen to the standard error if the sample size were increased to 1000, and explain why.

Recall the formulas for the standard error:
- Standard error of a mean: $SE = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$
- Standard error of a proportion: $SE_p = \sqrt{\frac{p(1-p)}{n}}$

Focus on what happens when $n$ gets larger.
Think about variability: more data means more stability in the estimate.

The standard error decreases as the sample size increases.

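For a mean, when $n = 1000$, the denominator $\sqrt{n}$ is larger than it is for a smaller sample, so the SE shrinks.
For a proportion, the same principle applies: $\sqrt{\frac{p(1-p)}{n}}$ becomes smaller when $n$ is larger.
In both cases the SE is proportional to $1/\sqrt{n}$, so quadrupling the sample size halves the standard error.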

Why: With a larger sample, the estimate varies less from one sample to the next, so it is more precise. This is why researchers prefer larger sample sizes when possible.
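As a rough numerical illustration, the sketch below computes both standard errors for a few sample sizes. The values `s = 5` and `p = 0.2` are assumed for illustration only and are not taken from the STePS data.

```r
# Assumed illustrative values (not estimates from the study)
s <- 5   # sample standard deviation
p <- 0.2 # sample proportion

# A few sample sizes to compare
n <- c(100, 250, 500, 1000)

# Standard error of a mean and of a proportion for each sample size
se_mean <- s / sqrt(n)
se_prop <- sqrt(p * (1 - p) / n)

data.frame(n, se_mean, se_prop)

# Going from n = 100 to n = 1000 divides the SE by sqrt(10), about 3.16
se_mean[1] / se_mean[4]
```

Both columns decrease as $n$ grows, but with diminishing returns: each further reduction in the standard error requires a proportionally larger increase in sample size.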