Lab: Tidy Data Manipulation

In this lab, you’ll practice data manipulation skills using the tidyverse. We’ll work with the STePS dataset to practice selecting, filtering, creating new variables, summarizing, and reshaping data.

Tip

While you can complete all the exercises in your browser, we recommend also practicing in RStudio. Using an editor like RStudio will help you build real-world skills for writing, running, and saving your R code.

Load the data

First, let’s load the packages and data.
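The loading chunk itself is not reproduced here. A minimal sketch, assuming the data live in a CSV file (the file name below is a placeholder, not the actual course path):

```r
# Load the tidyverse (includes dplyr, tidyr, and ggplot2)
library(tidyverse)

# Read the STePS data into df_clean; replace the placeholder
# file name with the path provided in your course materials
df_clean <- read_csv("steps_data.csv")

# A quick look at the structure
glimpse(df_clean)
```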

1 Selecting columns with select()

The select() function lets you choose which columns to keep in your dataset.
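As a warm-up, here is a minimal sketch on a toy data frame (not the STePS data) showing the main ways to use select():

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(
  id    = 1:3,
  trt   = c("waitlist", "self-guided", "therapist-guided"),
  score = c(72, 65, 80),
  notes = c("a", "b", "c")
)

toy |> select(id, score)         # keep columns by name
toy |> select(-notes)            # drop a column with a minus sign
toy |> select(starts_with("s"))  # helper: columns whose names start with "s"
```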

Exercise 1 (Select specific columns)  

Select only the participant id, treatment group (trt), and baseline LSAS score (lsas_screen) from the dataset.

The select() function takes column names separated by commas. You don’t need quotes around the column names.

Think about which three columns you need: the participant identifier, their treatment assignment, and their baseline LSAS score.

# Select specific columns
df_selected <- df_clean |>
  select(id, trt, lsas_screen) #<1>

# Check the result
head(df_selected)
1. Select just these three columns from the dataset

Exercise 2 (Use select helpers to grab multiple columns)  

Select the id, trt, and all columns that start with “phq9” using the starts_with() helper function.

Helper functions like starts_with() need a character string (text in quotes) as their argument.

Look at the column names in the problem: what pattern do all the PHQ-9 columns share? What text do they all start with?

# Select using helper functions
df_phq9 <- df_clean |>
  select(id, trt, starts_with("phq9")) #<1>

# How many columns?
ncol(df_phq9)
names(df_phq9)
1. starts_with() selects all columns beginning with "phq9"

You should have 6 columns total: id, trt, and 4 PHQ-9 measurements (screen, post, fu6, and fu12).

2 Filtering rows with filter()

The filter() function selects rows based on conditions.
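Before the exercises, a minimal sketch on a toy data frame showing how conditions work in filter():

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(grp = c("a", "a", "b"), x = c(10, 60, 70))

toy |> filter(grp == "a")           # == tests equality; a single = is assignment
toy |> filter(x >= 60)              # numeric comparison
toy |> filter(grp == "a", x >= 60)  # comma-separated conditions combine with AND
```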

Exercise 3 (Filter based on one condition)  

Filter the data to keep only participants in the “therapist-guided” treatment group.

Remember that filter() keeps rows where a condition is TRUE.

To test if a variable equals a specific value, use the double equals operator == (not a single =, which is used for assignment).

The trt variable holds the treatment allocation, coded as "waitlist", "self-guided", or "therapist-guided".

# Filter for one group
df_guided <- df_clean |>
  filter(trt == "therapist-guided") #<1>

# How many participants?
nrow(df_guided)
1. Keep only rows where treatment group equals "therapist-guided"

Exercise 4 (Filter with multiple conditions)  

Filter to keep only participants who:

  • are in the “therapist-guided” group, AND
  • have baseline LSAS scores of 60 or higher

When you have multiple conditions in filter(), separate them with commas. All conditions must be TRUE for a row to be kept (AND logic).

You need one condition for treatment group (equals “therapist-guided” - remember quotes for text!) and another for LSAS scores (greater than or equal to 60).

Use >= for “greater than or equal to”.

# Filter with multiple conditions
df_guided_severe <- df_clean |>
  filter(trt == "therapist-guided", lsas_screen >= 60) #<1>

nrow(df_guided_severe)
1. Both conditions must be true (AND logic); note the quotes around the text value

3 Creating new variables with mutate()

The mutate() function creates new columns or modifies existing ones.
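A minimal sketch on a toy data frame: mutate() adds columns, and a later column can reference one created earlier in the same call:

```r
library(dplyr)

# Toy before/after scores for illustration only
toy <- data.frame(pre = c(70, 55), post = c(40, 50))

toy |> mutate(
  change = post - pre,         # new column computed from existing ones
  pct    = 100 * change / pre  # later columns can use ones defined above
)
```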

Exercise 5 (Calculate change scores)  

Create a new variable called lsas_change that represents the change in LSAS scores from baseline to post-treatment (post - baseline).

A change score represents the difference between two time points.

Think about which measurement comes first (baseline) and which comes second (post-treatment). The change is calculated as: later measurement minus earlier measurement.

The baseline LSAS is lsas_screen and post-treatment is lsas_post.

# Create change score
df_with_change <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |>
  mutate(
    lsas_change = lsas_post - lsas_screen #<1>
  )

head(df_with_change)
1. Subtract baseline from post-treatment to get the change score

Negative values indicate improvement (lower anxiety at post-treatment).

Exercise 6 (Create categorical variables with case_when())  

Create a categorical variable for GAD-7 severity with these categories:

  • “Minimal” for scores 0-4
  • “Mild” for scores 5-9
  • “Moderate” for scores 10-14
  • “Severe” for scores 15 or higher

The case_when() function evaluates conditions in order from top to bottom.

The syntax is: condition ~ "result"

Think about the cutoffs: scores below 5 are “Minimal”, scores from 5 to 9 are “Mild”, etc.

Since case_when() checks conditions in order, if a score is 7, it won’t match < 5 but will match < 10, so it becomes “Mild”.
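This first-match-wins behavior is easy to verify on a small vector (case_when() also works outside mutate()):

```r
library(dplyr)

scores <- c(3, 7, 12, 18)

case_when(
  scores < 5   ~ "Minimal",
  scores < 10  ~ "Mild",      # 7 fails `< 5`, so this is its first match
  scores < 15  ~ "Moderate",
  scores >= 15 ~ "Severe"
)
# -> "Minimal" "Mild" "Moderate" "Severe"
```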

# Create categorical variable
df_with_severity <- df_clean |>
  select(id, gad_screen) |>
  mutate(
    gad_severity = case_when(
      gad_screen < 5 ~ "Minimal", #<1>
      gad_screen < 10 ~ "Mild", #<2>
      gad_screen < 15 ~ "Moderate", #<3>
      gad_screen >= 15 ~ "Severe" #<4>
    )
  )

# Check the categories
table(df_with_severity$gad_severity)
1. Scores 0-4 = Minimal
2. Scores 5-9 = Mild
3. Scores 10-14 = Moderate
4. Scores 15+ = Severe

4 Working with factors

Factors control the order of categorical variables in tables and plots.

Exercise 7 (Create ordered factors)  

The trt variable contains treatment assignments as text (“waitlist”, “self-guided”, “therapist-guided”). Create a properly ordered factor with the levels in this order: Waitlist, Self-guided, Therapist-guided.

The factor() function needs:

  • The variable to convert (trt)
  • levels: the values in the order you want them (as character strings: “waitlist”, “self-guided”, “therapist-guided”)

By default, factors are ordered alphabetically, which would put “self-guided” before “therapist-guided” and “waitlist” last. Specifying levels explicitly controls the order.
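You can see the default alphabetical ordering, and the effect of explicit levels, with a quick check:

```r
x <- c("waitlist", "self-guided", "therapist-guided")

# Default: levels are sorted alphabetically
levels(factor(x))
# -> "self-guided" "therapist-guided" "waitlist"

# Explicit levels control the order
levels(factor(x, levels = c("waitlist", "self-guided", "therapist-guided")))
# -> "waitlist" "self-guided" "therapist-guided"
```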

# Create ordered factor
df_with_factor <- df_clean |>
  select(id, trt) |>
  mutate(
    trt_factor = factor(
      trt,
      levels = c("waitlist", "self-guided", "therapist-guided") #<1>
    )
  )

# Check the result
table(df_with_factor$trt_factor)
levels(df_with_factor$trt_factor)
1. Specify levels in the desired order (not alphabetical)

5 Summarizing data with group_by() and summarize()

Calculate summary statistics by groups.
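A minimal sketch of the grouped-summary pattern on a toy data frame (one output row per group):

```r
library(dplyr)

# A toy data frame for illustration only
toy <- data.frame(grp = c("a", "a", "b", "b"), x = c(1, 3, 10, NA))

toy |>
  group_by(grp) |>
  summarize(
    n      = n(),                    # rows per group
    mean_x = mean(x, na.rm = TRUE),  # na.rm = TRUE ignores missing values
    .groups = "drop"                 # return an ungrouped data frame
  )
```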

Exercise 8 (Calculate grouped statistics)  

Calculate the mean and standard deviation of baseline LSAS scores by treatment group.

The workflow is:

  1. Use group_by() to specify which variable defines the groups
  2. Use summarize() to calculate statistics for each group

Inside summarize():

  • n() counts the number of observations
  • mean() and sd() need the variable name and na.rm = TRUE to handle missing values

Which variable represents the baseline LSAS score?

# Group and summarize
group_stats <- df_clean |>
  group_by(trt) |> #<1>
  summarize(
    n = n(), #<2>
    mean_lsas = mean(lsas_screen, na.rm = TRUE), #<3>
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .groups = "drop"
  )

group_stats
1. Group by treatment assignment
2. Count participants in each treatment group
3. Calculate mean and SD for each treatment group

Exercise 9 (Use the .by syntax)  

Calculate the same statistics using the .by argument instead of group_by().

The .by argument is an alternative to group_by() that goes directly inside summarize().

Instead of piping to group_by() first, you can specify .by = as one of the arguments to summarize().

The value for .by is the name of the grouping variable (without quotes).

# Summarize with .by
group_stats_by <- df_clean |>
  summarize(
    n = n(),
    mean_lsas = mean(lsas_screen, na.rm = TRUE),
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .by = trt #<1>
  )

group_stats_by
1. .by is a cleaner alternative to group_by() for simple grouping by treatment

6 Reshaping data: Wide to Long format

Convert data from wide format (one row per participant) to long format (one row per measurement).
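A minimal sketch of the wide-to-long reshape on a toy data frame:

```r
library(tidyr)

# Wide: one row per id, one column per time point
wide <- data.frame(id = 1:2, score_pre = c(70, 55), score_post = c(40, 50))

wide |>
  pivot_longer(
    cols      = starts_with("score"),  # which columns to stack
    names_to  = "time_point",          # old column names go here
    values_to = "score"                # cell values go here
  )
# Result has 4 rows: one per id x time point combination
```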

Exercise 10 (Convert to long format with pivot_longer())  

Convert the LSAS measurements to long format. Select columns that start with “lsas”, then pivot them longer with:

  • Column names going to “time_point”
  • Values going to “lsas_score”

The pivot_longer() function needs three key arguments:

  • cols: Which columns to pivot (use a helper function)
  • names_to: What to call the new column that will contain the old column names (as a string)
  • values_to: What to call the new column that will contain the values (as a string)

The new column names should be descriptive: what information is in the old column names? What are the values measuring?

# Pivot to long format
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"), #<2>
    names_to = "time_point", #<3>
    values_to = "lsas_score" #<4>
  )

# Check the result
head(df_lsas_long, 12)
1. Select ID, treatment, and all LSAS columns
2. Pivot all columns starting with "lsas"
3. Old column names go to this new column
4. Values go to this new column

Each participant now has multiple rows, one for each time point.

Exercise 11 (Clean the long format data)  

The time_point column has values like “lsas_screen”, “lsas_post”, etc. Use separate() to split this into two columns: “measure” and “time”.

The separate() function splits one column into multiple columns.

Look at the values in time_point: they’re formatted like “lsas_screen”, “lsas_post”, etc.

What character separates the two pieces of information? That’s your separator.

You need to specify:

  • Which column to separate
  • Names for the new columns (in c())
  • What separator character to split on (as a string)

# Clean time variable
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point, #<2>
    into = c("measure", "time"), #<3>
    sep = "_" #<4>
  )

head(df_lsas_long)
1. Select ID, treatment, and LSAS columns
2. Column to separate
3. Names for the new columns
4. Separator character (underscore)

Now we have “measure” (always “lsas”) and “time” (screen, post, etc.) as separate columns.

7 Reshaping data: Long to Wide format

Convert summary data from long to wide format.
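A minimal sketch of the long-to-wide reshape on a toy summary table:

```r
library(tidyr)

# Long: one row per group x time combination
long <- data.frame(
  grp  = c("a", "a", "b", "b"),
  time = c("pre", "post", "pre", "post"),
  m    = c(70, 40, 68, 55)
)

long |>
  pivot_wider(
    names_from  = grp,  # values of grp become new column names
    values_from = m     # cells are filled from this column
  )
# Result: one row per time, with columns "a" and "b"
```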

Exercise 12 (Convert to wide format with pivot_wider())  

First, calculate mean LSAS scores by group and time. Then convert to wide format with groups as columns.

pivot_wider() is the opposite of pivot_longer(): it spreads data across columns.

You need to tell it:

  • names_from: Which column contains the values that will become new column names?
  • values_from: Which column contains the values to fill those new columns?

Think about what you want as columns in the wide format. The problem says “groups as columns” - which variable contains the treatment labels?

# Prepare long data with clean time labels
df_lsas_long <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point, 
    into = c("measure", "time"), 
    sep = "_"
  ) |>
  mutate(
    trt_factor = factor(
      trt, 
      levels = c("waitlist", "self-guided", "therapist-guided")
    )
  )

# Summarize
lsas_summary <- df_lsas_long |>
  summarize(
    mean_lsas = mean(lsas_score, na.rm = TRUE),
    .by = c(trt_factor, time)
  )

# Pivot wider
lsas_wide <- lsas_summary |>
  pivot_wider(
    names_from = trt_factor, #<2>
    values_from = mean_lsas #<3>
  )

lsas_wide
1. Select ID, treatment, and LSAS baseline and post
2. Column to spread into new column names
3. Column with values to fill the new columns

Now we have a table with time points as rows and treatment groups as columns.

8 Calculate remission proportions by treatment group

Use filtering, mutating, and summarizing to calculate proportions by group.

Exercise 13 (Remission proportions by treatment group)  

Chain multiple data manipulation steps to:

  1. Select id, trt, and the post-treatment LSAS score
  2. Create a binary variable indicating remission (LSAS < 30)
  3. Calculate the percentage in remission by treatment group

Work through this step-by-step:

  1. Select: You need participant ID, treatment assignment (trt), and post-treatment LSAS
  2. Mutate: Create a logical variable - is LSAS post-treatment less than 30?
  3. Summarize:
    • Count total participants with n()
    • Count those in remission by summing the TRUE/FALSE variable (TRUE = 1, FALSE = 0)
    • Calculate percentage: (count in remission / total count) × 100
    • Group by treatment assignment (trt)
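The counting trick in step 3 relies on R coercing TRUE to 1 and FALSE to 0 when summing:

```r
remission <- c(TRUE, FALSE, TRUE, NA)

sum(remission, na.rm = TRUE)   # -> 2 (number of TRUE values)
mean(remission, na.rm = TRUE)  # proportion of TRUE among non-missing values
```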

# Complete pipeline
remission_desc <- df_clean |>
  # Select relevant columns
  select(id, trt, lsas_post) |> #<1>
  # Create remission indicator
  mutate(
    remission = lsas_post < 30 #<2>
  ) |>
  # Calculate percentage by treatment group
  summarize(
    n = n(),
    n_remission = sum(remission, na.rm = TRUE), #<3>
    pct_remission = (n_remission / n) * 100,
    .by = trt
  )

remission_desc
1. Select only the needed variables (ID, treatment assignment, post-treatment LSAS)
2. Create a binary variable: TRUE if LSAS < 30 (remission)
3. Count and calculate the percentage in remission by treatment group

This pipeline shows how to combine selecting variables, creating new variables, and summarizing in a single workflow. LSAS scores below 30 indicate remission from social anxiety. You’ll see results for all three treatment groups.

9 Calculate and visualize missing data patterns

Exploring missing data patterns is an important part of data analysis. Let’s calculate and visualize missingness across time points.

Exercise 14 (Missing LSAS observations by time and group)  

Calculate the percentage of missing LSAS observations at each time point for each treatment group, then create a line plot to visualize the pattern.

Steps:

  1. Convert LSAS data to long format (all columns starting with “lsas”)
  2. Separate the time variable
  3. Create clean time labels and factor variable for plotting
  4. Filter out waitlist group at follow-up time points (not measured)
  5. Create a binary variable indicating if the observation is missing
  6. Calculate the percentage missing by time and group
  7. Plot the results using the ordered time factor

Breaking this down:

  1. Long format: Similar to Exercise 10 - select LSAS columns and pivot them longer
  2. Separate: Use separate() on the time_point column (like Exercise 11)
  3. Create time variables: The case_when() and factor() code is provided - it creates readable time labels and an ordered factor
  4. Filter: Exclude waitlist at fu6 and fu12 time points - they weren’t measured then. Use !(condition) for NOT, and %in% to check if time is in a vector of values. What’s the waitlist value in the trt variable?
  5. Missing indicator: Use is.na() to test if a value is missing. This returns TRUE/FALSE
  6. Summarize:
    • Count missing with sum(is_missing) (TRUE = 1, FALSE = 0)
    • Calculate percentage: (n_missing / n) × 100
    • Group by both treatment and time (use trt and time_factor)
  7. Plot: Use time_factor for the x-axis to get proper chronological ordering with readable labels

Which variable contains the LSAS scores? That’s what you check for missingness.
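As a quick reminder of how is.na() behaves on a vector:

```r
x <- c(72, NA, 65, NA)

is.na(x)       # -> FALSE TRUE FALSE TRUE
sum(is.na(x))  # -> 2 missing values
```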

# Convert to long format and create time variables
df_lsas_missing <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(time_point, into = c("measure", "time"), sep = "_") |> #<2>
  mutate(
    time_clean = case_when(
      time == "screen" ~ "Baseline",
      time == "v1" ~ "Week 1",
      time == "v2" ~ "Week 2",
      time == "v3" ~ "Week 3",
      time == "v4" ~ "Week 4",
      time == "v5" ~ "Week 5",
      time == "v6" ~ "Week 6",
      time == "v7" ~ "Week 7",
      time == "v8" ~ "Week 8",
      time == "post" ~ "Post-treatment",
      time == "fu6" ~ "6-month follow-up",
      time == "fu12" ~ "12-month follow-up"
    ),
    time_factor = factor( #<3>
      time_clean,
      levels = c(
        "Baseline", "Week 1", "Week 2", "Week 3", "Week 4",
        "Week 5", "Week 6", "Week 7", "Week 8",
        "Post-treatment", "6-month follow-up", "12-month follow-up"
      )
    )
  ) |>
  select(-measure) |>
  # Filter out waitlist at follow-up (not measured)
  filter(!(trt == "waitlist" & time %in% c("fu6", "fu12"))) #<4>

# Calculate missing percentages
missing <- df_lsas_missing |>
  mutate(
    is_missing = is.na(lsas_score) #<5>
  ) |>
  summarize(
    n = n(),
    n_missing = sum(is_missing), #<6>
    missing_percent = (n_missing / n) * 100, #<7>
    .by = c(trt, time_factor) #<8>
  )

# Plot
library(ggplot2)
ggplot(
  missing,
  aes(
    time_factor, #<9>
    missing_percent,
    group = trt,
    color = trt
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "LSAS Percent missing observations",
    x = "Time",
    y = "Missing (%)",
    color = "Treatment"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
1. Select ID, treatment assignment, and LSAS columns
2. Separate the time_point column to extract time
3. Create an ordered factor for proper chronological plotting with readable labels
4. Exclude waitlist at the 6- and 12-month follow-ups (not measured)
5. Create an indicator: TRUE if missing, FALSE if present
6. Sum the TRUE/FALSE values to count missing
7. Calculate the percentage missing
8. Group by both treatment assignment and time factor
9. Use time_factor for the x-axis to ensure proper time ordering with readable labels

The plot reveals missing data patterns across time and treatment groups, which is important for understanding dropout patterns.

10 Summary

In this lab, you practiced:

  1. Selecting columns with select() and helper functions
  2. Filtering rows with filter() and multiple conditions
  3. Creating variables with mutate() and case_when()
  4. Working with factors to control categorical variable ordering
  5. Summarizing data with group_by() and summarize()
  6. Reshaping data with pivot_longer() and pivot_wider()
  7. Calculating proportions including missing data patterns
  8. Visualizing patterns with ggplot2