Lab: Tidy Data Manipulation

In this lab, you’ll practice data manipulation skills using the tidyverse. We’ll work with the STePS dataset to practice selecting, filtering, creating new variables, summarizing, and reshaping data.

Tip

While you can complete all the exercises in your browser, we recommend also practicing in RStudio. Using an editor like RStudio will help you build real-world skills for writing, running, and saving your R code.

Load the data

First, let’s load the packages and data.
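The loading chunk itself is not reproduced here. A minimal sketch, assuming the data live in a CSV file (the file name below is a placeholder, not the actual course path):

```r
# Load the tidyverse (includes dplyr, tidyr, and ggplot2)
library(tidyverse)

# Read the STePS data into df_clean; replace the placeholder
# file name with the path provided in your course materials
df_clean <- read_csv("steps_data.csv")

# A quick look at the structure
glimpse(df_clean)
```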

1 Selecting columns with select()

The select() function lets you choose which columns to keep in your dataset.
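As a warm-up, here is a minimal sketch on a toy data frame (not the STePS data) showing the main ways to use select():

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(
  id    = 1:3,
  trt   = c("waitlist", "self-guided", "therapist-guided"),
  score = c(72, 65, 80),
  notes = c("a", "b", "c")
)

toy |> select(id, score)         # keep columns by name
toy |> select(-notes)            # drop a column with a minus sign
toy |> select(starts_with("s"))  # helper: columns whose names start with "s"
```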

Exercise 1 (Select specific columns)  

Select only the participant id, treatment group (trt), and baseline LSAS score (lsas_screen) from the dataset.

The select() function takes column names separated by commas. You don’t need quotes around the column names.

Think about which three columns you need: the participant identifier, their treatment assignment, and their baseline LSAS score.

# Select specific columns
df_selected <- df_clean |>
  select(id, trt, lsas_screen) #<1>

# Check the result
head(df_selected)
1. Select just these three columns from the dataset

Exercise 2 (Use select helpers to grab multiple columns)  

Select the id, trt, and all columns that start with “phq9” using the starts_with() helper function.

Helper functions like starts_with() need a character string (text in quotes) as their argument.

Look at the column names in the problem: what pattern do all the PHQ-9 columns share? What text do they all start with?

# Select using helper functions
df_phq9 <- df_clean |>
  select(id, trt, starts_with("phq9")) #<1>

# How many columns?
ncol(df_phq9)
names(df_phq9)
1. starts_with() selects all columns beginning with "phq9"

You should have 6 columns total: id, trt, and 4 PHQ-9 measurements (screen, post, fu6, and fu12).

2 Filtering rows with filter()

The filter() function selects rows based on conditions.
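Before the exercises, a minimal sketch on a toy data frame showing how conditions work in filter():

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(grp = c("a", "a", "b"), x = c(10, 60, 70))

toy |> filter(grp == "a")           # == tests equality; a single = is assignment
toy |> filter(x >= 60)              # numeric comparison
toy |> filter(grp == "a", x >= 60)  # comma-separated conditions combine with AND
```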

Exercise 3 (Filter based on one condition)  

Filter the data to keep only participants in the “therapist-guided” treatment group.

Remember that filter() keeps rows where a condition is TRUE.

To test if a variable equals a specific value, use the double equals operator == (not a single =, which is used for assignment).

The trt variable holds the treatment allocation, coded as "waitlist", "self-guided", or "therapist-guided".

# Filter for one group
df_guided <- df_clean |>
  filter(trt == "therapist-guided") #<1>

# How many participants?
nrow(df_guided)
1. Keep only rows where treatment group equals "therapist-guided"

Exercise 4 (Filter with multiple conditions)  

Filter to keep only participants who:

  • are in the “therapist-guided” group, AND
  • have baseline LSAS scores of 60 or higher

When you have multiple conditions in filter(), separate them with commas. All conditions must be TRUE for a row to be kept (AND logic).

You need one condition for treatment group (equals “therapist-guided” - remember quotes for text!) and another for LSAS scores (greater than or equal to 60).

Use >= for “greater than or equal to”.

# Filter with multiple conditions
df_guided_severe <- df_clean |>
  filter(trt == "therapist-guided", lsas_screen >= 60) #<1>

nrow(df_guided_severe)
1. Both conditions must be true (AND logic); note the quotes around the text value

3 Creating new variables with mutate()

The mutate() function creates new columns or modifies existing ones.
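A minimal sketch on a toy data frame: mutate() adds columns, and a later column can reference one created earlier in the same call:

```r
library(dplyr)

# Toy before/after scores for illustration only
toy <- data.frame(pre = c(70, 55), post = c(40, 50))

toy |> mutate(
  change = post - pre,         # new column computed from existing ones
  pct    = 100 * change / pre  # later columns can use ones defined above
)
```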

Exercise 5 (Calculate change scores)  

Create a new variable called lsas_change that represents the change in LSAS scores from baseline to post-treatment (post - baseline).

A change score represents the difference between two time points.

Think about which measurement comes first (baseline) and which comes second (post-treatment). The change is calculated as: later measurement minus earlier measurement.

The baseline LSAS is lsas_screen and post-treatment is lsas_post.

# Create change score
df_with_change <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |>
  mutate(
    lsas_change = lsas_post - lsas_screen #<1>
  )

head(df_with_change)
1. Subtract baseline from post-treatment to get the change score

Negative values indicate improvement (lower anxiety at post-treatment).

Exercise 6 (Create categorical variables with case_when())  

Create a categorical variable for GAD-7 severity with these categories:

  • “Minimal” for scores 0-4
  • “Mild” for scores 5-9
  • “Moderate” for scores 10-14
  • “Severe” for scores 15 or higher

The case_when() function evaluates conditions in order from top to bottom.

The syntax is: condition ~ "result"

Think about the cutoffs: scores below 5 are “Minimal”, scores from 5 to 9 are “Mild”, etc.

Since case_when() checks conditions in order, if a score is 7, it won’t match < 5 but will match < 10, so it becomes “Mild”.
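This first-match-wins behavior is easy to verify on a small vector (case_when() also works outside mutate()):

```r
library(dplyr)

scores <- c(3, 7, 12, 18)

case_when(
  scores < 5   ~ "Minimal",
  scores < 10  ~ "Mild",      # 7 fails `< 5`, so this is its first match
  scores < 15  ~ "Moderate",
  scores >= 15 ~ "Severe"
)
# -> "Minimal" "Mild" "Moderate" "Severe"
```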

# Create categorical variable
df_with_severity <- df_clean |>
  select(id, gad_screen) |>
  mutate(
    gad_severity = case_when(
      gad_screen < 5 ~ "Minimal", #<1>
      gad_screen < 10 ~ "Mild", #<2>
      gad_screen < 15 ~ "Moderate", #<3>
      gad_screen >= 15 ~ "Severe" #<4>
    )
  )

# Check the categories
table(df_with_severity$gad_severity)
1. Scores 0-4 = Minimal
2. Scores 5-9 = Mild
3. Scores 10-14 = Moderate
4. Scores 15+ = Severe

4 Working with factors

Factors control the order of categorical variables in tables and plots.

Exercise 7 (Create ordered factors)  

The trt variable contains treatment assignments as text (“waitlist”, “self-guided”, “therapist-guided”). Create a properly ordered factor with the levels in this order: Waitlist, Self-guided, Therapist-guided.

The factor() function needs:

  • The variable to convert (trt)
  • levels: the values in the order you want them (as character strings: “waitlist”, “self-guided”, “therapist-guided”)

By default, factors are ordered alphabetically, which would put “self-guided” before “therapist-guided” and “waitlist” last. Specifying levels explicitly controls the order.
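You can see the default alphabetical ordering, and the effect of explicit levels, with a quick check:

```r
x <- c("waitlist", "self-guided", "therapist-guided")

# Default: levels are sorted alphabetically
levels(factor(x))
# -> "self-guided" "therapist-guided" "waitlist"

# Explicit levels control the order
levels(factor(x, levels = c("waitlist", "self-guided", "therapist-guided")))
# -> "waitlist" "self-guided" "therapist-guided"
```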

# Create ordered factor
df_with_factor <- df_clean |>
  select(id, trt) |>
  mutate(
    trt_factor = factor(
      trt,
      levels = c("waitlist", "self-guided", "therapist-guided") #<1>
    )
  )

# Check the result
table(df_with_factor$trt_factor)
levels(df_with_factor$trt_factor)
1. Specify levels in the desired order (not alphabetical)

5 Summarizing data with group_by() and summarize()

Calculate summary statistics by groups.
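A minimal sketch of the grouped-summary pattern on a toy data frame (one output row per group):

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(grp = c("a", "a", "b", "b"), x = c(1, 3, 10, NA))

toy |>
  group_by(grp) |>
  summarize(
    n      = n(),                    # rows per group
    mean_x = mean(x, na.rm = TRUE),  # na.rm = TRUE ignores missing values
    .groups = "drop"                 # return an ungrouped data frame
  )
```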

Exercise 8 (Calculate grouped statistics)  

Calculate the mean and standard deviation of baseline LSAS scores by treatment group.

The workflow is:

  1. Use group_by() to specify which variable defines the groups
  2. Use summarize() to calculate statistics for each group

Inside summarize():

  • n() counts the number of observations
  • mean() and sd() need the variable name and na.rm = TRUE to handle missing values

Which variable represents the baseline LSAS score?

# Group and summarize
group_stats <- df_clean |>
  group_by(trt) |> #<1>
  summarize(
    n = n(), #<2>
    mean_lsas = mean(lsas_screen, na.rm = TRUE), #<3>
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .groups = "drop"
  )

group_stats
1. Group by treatment assignment
2. Count participants in each treatment group
3. Calculate mean and SD for each treatment group

Exercise 9 (Use the .by syntax)  

Calculate the same statistics using the .by argument instead of group_by().

The .by argument is an alternative to group_by() that goes directly inside summarize().

Instead of piping to group_by() first, you can specify .by = as one of the arguments to summarize().

The value for .by is the name of the grouping variable (without quotes).

# Summarize with .by
group_stats_by <- df_clean |>
  summarize(
    n = n(),
    mean_lsas = mean(lsas_screen, na.rm = TRUE),
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .by = trt #<1>
  )

group_stats_by
1. .by is a cleaner alternative to group_by() for simple grouping by treatment

6 Reshaping data: Wide to Long format

Convert data from wide format (one row per participant) to long format (one row per measurement).
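A minimal sketch of the wide-to-long reshape on a toy data frame:

```r
library(tidyr)

# Wide: one row per id, one column per time point
wide <- data.frame(id = 1:2, score_pre = c(70, 55), score_post = c(40, 50))

wide |>
  pivot_longer(
    cols      = starts_with("score"),  # which columns to stack
    names_to  = "time_point",          # old column names go here
    values_to = "score"                # cell values go here
  )
# Result has 4 rows: one per id x time point combination
```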

Exercise 10 (Convert to long format with pivot_longer())  

Convert the LSAS measurements to long format. Select columns that start with “lsas”, then pivot them longer with:

  • Column names going to “time_point”
  • Values going to “lsas_score”

The pivot_longer() function needs three key arguments:

  • cols: Which columns to pivot (use a helper function)
  • names_to: What to call the new column that will contain the old column names (as a string)
  • values_to: What to call the new column that will contain the values (as a string)

The new column names should be descriptive: what information is in the old column names? What are the values measuring?

# Pivot to long format
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"), #<2>
    names_to = "time_point", #<3>
    values_to = "lsas_score" #<4>
  )

# Check the result
head(df_lsas_long, 12)
1. Select ID, treatment, and all LSAS columns
2. Pivot all columns starting with "lsas"
3. Old column names go to this new column
4. Values go to this new column

Each participant now has multiple rows, one for each time point.

Exercise 11 (Clean the long format data)  

The time_point column has values like “lsas_screen”, “lsas_post”, etc. Use separate() to split this into two columns: “measure” and “time”.

The separate() function splits one column into multiple columns.

Look at the values in time_point: they’re formatted like “lsas_screen”, “lsas_post”, etc.

What character separates the two pieces of information? That’s your separator.

You need to specify:

  • Which column to separate
  • Names for the new columns (in c())
  • What separator character to split on (as a string)

# Clean time variable
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point, #<2>
    into = c("measure", "time"), #<3>
    sep = "_" #<4>
  )

head(df_lsas_long)
1. Select ID, treatment, and LSAS columns
2. Column to separate
3. Names for the new columns
4. Separator character (underscore)

Now we have “measure” (always “lsas”) and “time” (screen, post, etc.) as separate columns.

7 Reshaping data: Long to Wide format

Convert summary data from long to wide format.
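A minimal sketch of the long-to-wide reshape on a toy summary table:

```r
library(tidyr)

# Long: one row per group x time combination
long <- data.frame(
  grp  = c("a", "a", "b", "b"),
  time = c("pre", "post", "pre", "post"),
  m    = c(70, 40, 68, 55)
)

long |>
  pivot_wider(
    names_from  = grp,  # values of grp become new column names
    values_from = m     # cells are filled from this column
  )
# Result: one row per time, with columns "a" and "b"
```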

Exercise 12 (Convert to wide format with pivot_wider())  

First, calculate mean LSAS scores by group and time. Then convert to wide format with groups as columns.

pivot_wider() is the opposite of pivot_longer(): it spreads data across columns.

You need to tell it:

  • names_from: Which column contains the values that will become new column names?
  • values_from: Which column contains the values to fill those new columns?

Think about what you want as columns in the wide format. The problem says “groups as columns” - which variable contains the treatment labels?

# Prepare long data with clean time labels
df_lsas_long <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point, 
    into = c("measure", "time"), 
    sep = "_"
  ) |>
  mutate(
    trt_factor = factor(
      trt, 
      levels = c("waitlist", "self-guided", "therapist-guided")
    )
  )

# Summarize
lsas_summary <- df_lsas_long |>
  summarize(
    mean_lsas = mean(lsas_score, na.rm = TRUE),
    .by = c(trt_factor, time)
  )

# Pivot wider
lsas_wide <- lsas_summary |>
  pivot_wider(
    names_from = trt_factor, #<2>
    values_from = mean_lsas #<3>
  )

lsas_wide
1. Select ID, treatment, and LSAS baseline and post
2. Column to spread into new column names
3. Column with values to fill the new columns

Now we have a table with time points as rows and treatment groups as columns.

8 Calculate remission proportions by treatment group

Use filtering, mutating, and summarizing to calculate proportions by group.

Exercise 13 (Remission proportions by treatment group)  

Chain multiple data manipulation steps to:

  1. Select id, trt, and the post-treatment LSAS score
  2. Create a binary variable indicating remission (LSAS < 30)
  3. Calculate the percentage in remission by treatment group

Work through this step-by-step:

  1. Select: You need participant ID, treatment assignment (trt), and post-treatment LSAS
  2. Mutate: Create a logical variable - is LSAS post-treatment less than 30?
  3. Summarize:
    • Count total participants with n()
    • Count those in remission by summing the TRUE/FALSE variable (TRUE = 1, FALSE = 0)
    • Calculate percentage: (count in remission / total count) × 100
    • Group by treatment assignment (trt)
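The counting trick in step 3 relies on R coercing TRUE to 1 and FALSE to 0 when summing:

```r
remission <- c(TRUE, FALSE, TRUE, NA)

sum(remission, na.rm = TRUE)   # -> 2 (number of TRUE values)
mean(remission, na.rm = TRUE)  # proportion of TRUE among non-missing values
```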

# Complete pipeline
remission_desc <- df_clean |>
  # Select relevant columns
  select(id, trt, lsas_post) |> #<1>
  # Create remission indicator
  mutate(
    remission = lsas_post < 30 #<2>
  ) |>
  # Calculate percentage by treatment group
  summarize(
    n = n(),
    n_remission = sum(remission, na.rm = TRUE), #<3>
    pct_remission = (n_remission / n) * 100,
    .by = trt
  )

remission_desc
1. Select only the needed variables (ID, treatment assignment, post-treatment LSAS)
2. Create a binary variable: TRUE if LSAS < 30 (remission)
3. Count and calculate the percentage in remission by treatment group

This pipeline shows how to combine selecting variables, creating new variables, and summarizing in a single workflow. LSAS scores below 30 indicate remission from social anxiety. You’ll see results for all three treatment groups.

9 Calculate and visualize missing data patterns

Exploring missing data patterns is an important part of data analysis. Let’s calculate and visualize missingness across time points.

Exercise 14 (Missing LSAS observations by time and group)  

Calculate the percentage of missing LSAS observations at each time point for each treatment group, then create a line plot to visualize the pattern.

Steps:

  1. Convert LSAS data to long format (all columns starting with “lsas”)
  2. Separate the time variable
  3. Create clean time labels and factor variable for plotting
  4. Filter out waitlist group at follow-up time points (not measured)
  5. Create a binary variable indicating if the observation is missing
  6. Calculate the percentage missing by time and group
  7. Plot the results using the ordered time factor

Breaking this down:

  1. Long format: Similar to Exercise 10 - select LSAS columns and pivot them longer
  2. Separate: Use separate() on the time_point column (like Exercise 11)
  3. Create time variables: The case_when() and factor() code is provided - it creates readable time labels and an ordered factor
  4. Filter: Exclude waitlist at fu6 and fu12 time points - they weren’t measured then. Use !(condition) for NOT, and %in% to check if time is in a vector of values. What’s the waitlist value in the trt variable?
  5. Missing indicator: Use is.na() to test if a value is missing. This returns TRUE/FALSE
  6. Summarize:
    • Count missing with sum(is_missing) (TRUE = 1, FALSE = 0)
    • Calculate percentage: (n_missing / n) × 100
    • Group by both treatment and time (use trt and time_factor)
  7. Plot: Use time_factor for the x-axis to get proper chronological ordering with readable labels

Which variable contains the LSAS scores? That’s what you check for missingness.
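As a quick reminder of how is.na() behaves on a vector:

```r
x <- c(72, NA, 65, NA)

is.na(x)       # -> FALSE TRUE FALSE TRUE
sum(is.na(x))  # -> 2 missing values
```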

# Convert to long format and create time variables
df_lsas_missing <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(time_point, into = c("measure", "time"), sep = "_") |> #<2>
  mutate(
    time_clean = case_when(
      time == "screen" ~ "Baseline",
      time == "v1" ~ "Week 1",
      time == "v2" ~ "Week 2",
      time == "v3" ~ "Week 3",
      time == "v4" ~ "Week 4",
      time == "v5" ~ "Week 5",
      time == "v6" ~ "Week 6",
      time == "v7" ~ "Week 7",
      time == "v8" ~ "Week 8",
      time == "post" ~ "Post-treatment",
      time == "fu6" ~ "6-month follow-up",
      time == "fu12" ~ "12-month follow-up"
    ),
    time_factor = factor( #<3>
      time_clean,
      levels = c(
        "Baseline", "Week 1", "Week 2", "Week 3", "Week 4",
        "Week 5", "Week 6", "Week 7", "Week 8",
        "Post-treatment", "6-month follow-up", "12-month follow-up"
      )
    )
  ) |>
  select(-measure) |>
  # Filter out waitlist at follow-up (not measured)
  filter(!(trt == "waitlist" & time %in% c("fu6", "fu12"))) #<4>

# Calculate missing percentages
missing <- df_lsas_missing |>
  mutate(
    is_missing = is.na(lsas_score) #<5>
  ) |>
  summarize(
    n = n(),
    n_missing = sum(is_missing), #<6>
    missing_percent = (n_missing / n) * 100, #<7>
    .by = c(trt, time_factor) #<8>
  )

# Plot
library(ggplot2)
ggplot(
  missing,
  aes(
    time_factor, #<9>
    missing_percent,
    group = trt,
    color = trt
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "LSAS Percent missing observations",
    x = "Time",
    y = "Missing (%)",
    color = "Treatment"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
1. Select ID, treatment assignment, and LSAS columns
2. Separate the time_point column to extract time
3. Create an ordered factor for proper chronological plotting with readable labels
4. Exclude waitlist at the 6- and 12-month follow-ups (not measured)
5. Create an indicator: TRUE if missing, FALSE if present
6. Sum the TRUE/FALSE values to count missing
7. Calculate the percentage missing
8. Group by both treatment assignment and time factor
9. Use time_factor for the x-axis to ensure proper time ordering with readable labels

The plot reveals missing data patterns across time and treatment groups, which is important for understanding dropout patterns.

10 Summary

In this lab, you practiced:

  1. Selecting columns with select() and helper functions
  2. Filtering rows with filter() and multiple conditions
  3. Creating variables with mutate() and case_when()
  4. Working with factors to control categorical variable ordering
  5. Summarizing data with group_by() and summarize()
  6. Reshaping data with pivot_longer() and pivot_wider()
  7. Calculating proportions including missing data patterns
  8. Visualizing patterns with ggplot2