Lab: Tidy Data Manipulation
In this lab, you’ll practice data manipulation skills using the tidyverse. We’ll work with the STePS dataset to practice selecting, filtering, creating new variables, summarizing, and reshaping data.
While you can complete all the exercises in your browser, we recommend also practicing in RStudio. Using an editor like RStudio will help you build real-world skills for writing, running, and saving your R code.
Load the data
First, let’s load the packages and data.
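If you're also working in RStudio, a minimal setup sketch might look like the following. The package call is standard, but the file name `steps_clean.csv` is purely hypothetical; substitute whatever path the lab materials provide.

```r
library(tidyverse)

# Hypothetical file name; substitute the path supplied with the lab materials
if (file.exists("steps_clean.csv")) {
  df_clean <- read_csv("steps_clean.csv")
}
```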
1 Selecting columns with `select()`
The `select()` function lets you choose which columns to keep in your dataset.
Exercise 1 (Select specific columns)
Select only the `id`, `trt` (treatment group), and baseline LSAS score (`lsas_screen`) columns from the dataset.
The `select()` function takes column names separated by commas. You don't need quotes around the column names.
Think about which three columns you need: the participant identifier, the treatment assignment, and the baseline LSAS score.
```r
# Select specific columns
df_selected <- df_clean |>
  select(id, trt, lsas_screen) #<1>

# Check the result
head(df_selected)
```
1. Select just these three columns from the dataset.
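Before the exercise, it can help to see `select()` on a small made-up tibble (the names here are invented, not the STePS columns):

```r
library(dplyr)

# A made-up tibble just for illustration (not the STePS data)
toy <- tibble(id = 1:3, trt = c("a", "b", "a"), x = 4:6, y = 7:9)

toy |> select(id, trt)  # keep columns by name
toy |> select(-y)       # drop a column with a minus sign
toy |> select(id:x)     # a range of adjacent columns
```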
Exercise 2 (Use select helpers to grab multiple columns)
Select the `id`, `trt`, and all columns that start with "phq9", using the `starts_with()` helper function.
Helper functions like `starts_with()` need a character string (text in quotes) as their argument.
Look at the column names in the problem: what pattern do all the PHQ-9 columns share? What text do they all start with?
```r
# Select using helper functions
df_phq9 <- df_clean |>
  select(id, trt, starts_with("phq9")) #<1>

# How many columns?
ncol(df_phq9)
names(df_phq9)
```
1. `starts_with()` selects all columns beginning with "phq9".
You should have 6 columns total: id, trt, and 4 PHQ-9 measurements (screen, post, fu6, and fu12).
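Beyond `starts_with()`, tidyselect offers other helpers such as `ends_with()` and `contains()`. A quick sketch on an invented tibble (not the STePS data):

```r
library(dplyr)

# Hypothetical tibble with patterned column names
toy <- tibble(
  id = 1:2,
  phq9_screen = c(10, 12),
  phq9_post   = c(6, 8),
  gad_screen  = c(7, 9)
)

toy |> select(ends_with("screen"))  # phq9_screen and gad_screen
toy |> select(contains("phq9"))     # every PHQ-9 column
```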
2 Filtering rows with `filter()`
The `filter()` function selects rows based on conditions.
Exercise 3 (Filter based on one condition)
Filter the data to keep only participants in the “therapist-guided” treatment group.
Remember that `filter()` keeps rows where a condition is TRUE. To test whether a variable equals a specific value, use the double equals operator `==` (not a single `=`, which is used for assignment).
The `trt` variable holds the treatment allocation, coded as "waitlist", "self-guided", and "therapist-guided".
```r
# Filter for one group
df_guided <- df_clean |>
  filter(trt == "therapist-guided") #<1>

# How many participants?
nrow(df_guided)
```
1. Keep only rows where the treatment group equals "therapist-guided".
Exercise 4 (Filter with multiple conditions)
Filter to keep only participants who:
- are in the “therapist-guided” group, AND
- have baseline LSAS scores of 60 or higher
When you have multiple conditions in `filter()`, separate them with commas. All conditions must be TRUE for a row to be kept (AND logic).
You need one condition for the treatment group (equals "therapist-guided"; remember quotes for text!) and another for LSAS scores. Use `>=` for "greater than or equal to".
```r
# Filter with multiple conditions
df_guided_severe <- df_clean |>
  filter(trt == "therapist-guided", lsas_screen >= 60) #<1>

nrow(df_guided_severe)
```
1. Both conditions must be TRUE (AND logic); note the quotes around the text value.
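Commas (or `&`) give AND logic; for OR logic use `|`, and use `%in%` to match any of several values. A sketch on a made-up tibble (not the STePS data):

```r
library(dplyr)

# Toy tibble for illustration
toy <- tibble(
  trt  = c("waitlist", "self-guided", "therapist-guided"),
  lsas = c(40, 65, 70)
)

# OR logic: keep rows where either condition is TRUE
toy |> filter(trt == "waitlist" | lsas >= 60)

# %in% tests membership in a set of values
toy |> filter(trt %in% c("self-guided", "therapist-guided"))
```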
3 Creating new variables with `mutate()`
The `mutate()` function creates new columns or modifies existing ones.
Exercise 5 (Calculate change scores)
Create a new variable called `lsas_change` that represents the change in LSAS scores from baseline to post-treatment (post minus baseline).
A change score is the difference between two time points: the later measurement minus the earlier one. The baseline LSAS is `lsas_screen` and the post-treatment LSAS is `lsas_post`.
```r
# Create change score
df_with_change <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |>
  mutate(
    lsas_change = lsas_post - lsas_screen #<1>
  )

head(df_with_change)
```
1. Subtract baseline from post-treatment to get the change.
Negative values indicate improvement (lower anxiety at post-treatment).
Exercise 6 (Create categorical variables with `case_when()`)
Create a categorical variable for GAD-7 severity with these categories:
- “Minimal” for scores 0-4
- “Mild” for scores 5-9
- “Moderate” for scores 10-14
- “Severe” for scores 15 or higher
The `case_when()` function evaluates conditions in order from top to bottom. The syntax is: `condition ~ "result"`.
Think about the cutoffs: scores below 5 are "Minimal", scores from 5 to 9 are "Mild", and so on. Since `case_when()` checks conditions in order, a score of 7 won't match `< 5` but will match `< 10`, so it becomes "Mild".
```r
# Create categorical variable
df_with_severity <- df_clean |>
  select(id, gad_screen) |>
  mutate(
    gad_severity = case_when(
      gad_screen < 5 ~ "Minimal",   #<1>
      gad_screen < 10 ~ "Mild",     #<2>
      gad_screen < 15 ~ "Moderate", #<3>
      gad_screen >= 15 ~ "Severe"   #<4>
    )
  )

# Check the categories
table(df_with_severity$gad_severity)
```
1. Scores 0-4 = Minimal
2. Scores 5-9 = Mild
3. Scores 10-14 = Moderate
4. Scores 15+ = Severe
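To see why the order of conditions matters, here is a small standalone sketch (the `.default` fallback assumes dplyr 1.1 or later):

```r
library(dplyr)

# case_when() uses the first condition that is TRUE for each element
scores <- c(3, 7, 12, 18)
severity <- case_when(
  scores < 5  ~ "Minimal",
  scores < 10 ~ "Mild",
  scores < 15 ~ "Moderate",
  .default = "Severe" # fallback for everything else (dplyr >= 1.1)
)
severity
# "Minimal" "Mild" "Moderate" "Severe"
```

If "Mild" were listed before "Minimal", a score of 3 would match `scores < 10` first and be mislabeled.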
4 Working with factors
Factors control the order of categorical variables in tables and plots.
Exercise 7 (Create ordered factors)
The `trt` variable contains treatment assignments as text ("waitlist", "self-guided", "therapist-guided"). Create a properly ordered factor with the levels in this order: waitlist, self-guided, therapist-guided.
The `factor()` function needs:
- the variable to convert (`trt`)
- `levels`: the values in the order you want them (as character strings: "waitlist", "self-guided", "therapist-guided")
By default, factor levels are ordered alphabetically, which would put "self-guided" before "therapist-guided" and "waitlist" last. Specifying `levels` explicitly controls the order.
```r
# Create ordered factor
df_with_factor <- df_clean |>
  select(id, trt) |>
  mutate(
    trt_factor = factor(
      trt,
      levels = c("waitlist", "self-guided", "therapist-guided") #<1>
    )
  )

# Check the result
table(df_with_factor$trt_factor)
levels(df_with_factor$trt_factor)
```
1. Specify the levels in the desired order (not alphabetical).
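A quick base R illustration of the alphabetical default versus explicit `levels`:

```r
# Toy vector of treatment labels
x <- c("waitlist", "therapist-guided", "self-guided")

levels(factor(x))
# "self-guided" "therapist-guided" "waitlist"

levels(factor(x, levels = c("waitlist", "self-guided", "therapist-guided")))
# "waitlist" "self-guided" "therapist-guided"
```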
5 Summarizing data with `group_by()` and `summarize()`
Calculate summary statistics by groups.
Exercise 8 (Calculate grouped statistics)
Calculate the mean and standard deviation of baseline LSAS scores by treatment group.
The workflow is:
- Use `group_by()` to specify which variable defines the groups
- Use `summarize()` to calculate statistics for each group
Inside `summarize()`:
- `n()` counts the number of observations
- `mean()` and `sd()` need the variable name and `na.rm = TRUE` to handle missing values
Which variable represents the baseline LSAS score?
```r
# Group and summarize
group_stats <- df_clean |>
  group_by(trt) |> #<1>
  summarize(
    n = n(), #<2>
    mean_lsas = mean(lsas_screen, na.rm = TRUE), #<3>
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .groups = "drop"
  )

group_stats
```
1. Group by treatment assignment
2. Count participants in each treatment group
3. Calculate the mean and SD for each treatment group
Exercise 9 (Use the `.by` syntax)
Calculate the same statistics using the `.by` argument instead of `group_by()`.
The `.by` argument is an alternative to `group_by()` that goes directly inside `summarize()`. Instead of piping to `group_by()` first, you specify `.by =` as one of the arguments to `summarize()`.
The value for `.by` is the name of the grouping variable (without quotes).
```r
# Summarize with .by
group_stats_by <- df_clean |>
  summarize(
    n = n(),
    mean_lsas = mean(lsas_screen, na.rm = TRUE),
    sd_lsas = sd(lsas_screen, na.rm = TRUE),
    .by = trt #<1>
  )

group_stats_by
```
1. `.by` is a cleaner alternative to `group_by()` for simple grouping by treatment.
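`.by` also accepts several grouping variables wrapped in `c()` (this needs dplyr 1.1 or later). A sketch on an invented tibble, not the STePS data:

```r
library(dplyr)

# Toy tibble with two grouping variables
toy <- tibble(
  trt   = rep(c("a", "b"), each = 4),
  time  = rep(c("pre", "post"), times = 4),
  score = 1:8
)

# One mean per trt-by-time combination
means <- toy |>
  summarize(mean_score = mean(score), .by = c(trt, time))
means
```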
6 Reshaping data: Wide to Long format
Convert data from wide format (one row per participant) to long format (one row per measurement).
Exercise 10 (Convert to long format with `pivot_longer()`)
Convert the LSAS measurements to long format. Select columns that start with “lsas”, then pivot them longer with:
- Column names going to “time_point”
- Values going to “lsas_score”
The `pivot_longer()` function needs three key arguments:
- `cols`: which columns to pivot (use a helper function)
- `names_to`: what to call the new column that will contain the old column names (as a string)
- `values_to`: what to call the new column that will contain the values (as a string)
The new column names should be descriptive: what information is in the old column names? What are the values measuring?
```r
# Pivot to long format
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"), #<2>
    names_to = "time_point", #<3>
    values_to = "lsas_score" #<4>
  )

# Check the result
head(df_lsas_long, 12)
```
1. Select ID, treatment, and all LSAS columns
2. Pivot all columns starting with "lsas"
3. The old column names go to this new column
4. The values go to this new column
Each participant now has multiple rows, one for each time point.
Exercise 11 (Clean the long format data)
The `time_point` column has values like "lsas_screen" and "lsas_post". Use `separate()` to split it into two columns: "measure" and "time".
The `separate()` function splits one column into multiple columns. Look at the values in `time_point`: they're formatted like "lsas_screen", "lsas_post", etc. What character separates the two pieces of information? That's your separator.
You need to specify:
- which column to separate
- names for the new columns (in `c()`)
- what separator character to split on (as a string)
```r
# Clean time variable
df_lsas_long <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point, #<2>
    into = c("measure", "time"), #<3>
    sep = "_" #<4>
  )

head(df_lsas_long)
```
1. Select ID, treatment, and LSAS columns
2. Column to separate
3. Names for the new columns
4. Separator character (underscore)
Now we have “measure” (always “lsas”) and “time” (screen, post, etc.) as separate columns.
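As an aside, recent versions of tidyr (1.3.0 and later) recommend `separate_wider_delim()` as the successor to `separate()`. The equivalent call on a toy column would be:

```r
library(tidyr)

# Toy column of combined names (not the STePS data)
toy <- tibble(time_point = c("lsas_screen", "lsas_post"))

out <- toy |>
  separate_wider_delim(time_point, delim = "_", names = c("measure", "time"))
out
```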
7 Reshaping data: Long to Wide format
Convert summary data from long to wide format.
Exercise 12 (Convert to wide format with `pivot_wider()`)
First, calculate mean LSAS scores by group and time. Then convert to wide format with groups as columns.
`pivot_wider()` is the opposite of `pivot_longer()`: it spreads data across columns. You need to tell it:
- `names_from`: which column contains the values that will become the new column names?
- `values_from`: which column contains the values to fill those new columns?
Think about what you want as columns in the wide format. The problem says "groups as columns"; which variable contains the treatment labels?
```r
# Prepare long data with clean time labels
df_lsas_long <- df_clean |>
  select(id, trt, lsas_screen, lsas_post) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(
    time_point,
    into = c("measure", "time"),
    sep = "_"
  ) |>
  mutate(
    trt_factor = factor(
      trt,
      levels = c("waitlist", "self-guided", "therapist-guided")
    )
  )

# Summarize
lsas_summary <- df_lsas_long |>
  summarize(
    mean_lsas = mean(lsas_score, na.rm = TRUE),
    .by = c(trt_factor, time)
  )

# Pivot wider
lsas_wide <- lsas_summary |>
  pivot_wider(
    names_from = trt_factor, #<2>
    values_from = mean_lsas #<3>
  )

lsas_wide
```
1. Select ID, treatment, and the LSAS baseline and post scores
2. Column to spread into new column names
3. Column with values to fill the new columns
Now we have a table with time points as rows and treatment groups as columns.
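`pivot_wider()` and `pivot_longer()` are inverses, which you can verify on a small invented summary table (not the STePS data):

```r
library(tidyr)

# Toy long-format summary table
long <- tibble(
  time = c("screen", "post", "screen", "post"),
  trt  = c("a", "a", "b", "b"),
  mean = c(80, 60, 81, 75)
)

wide <- long |>
  pivot_wider(names_from = trt, values_from = mean)
wide

# pivot_longer() undoes the reshape
wide |>
  pivot_longer(cols = c(a, b), names_to = "trt", values_to = "mean")
```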
8 Calculate remission proportions by treatment group
Use filtering, mutating, and summarizing to calculate proportions by group.
Exercise 13 (Remission proportions by treatment group)
Chain multiple data manipulation steps to:
- select id, trt, and the post-treatment LSAS score
- create a binary variable indicating remission (LSAS < 30)
- calculate the percentage in remission by treatment group
Work through this step by step:
- Select: you need the participant ID, treatment assignment (trt), and post-treatment LSAS
- Mutate: create a logical variable: is the post-treatment LSAS less than 30?
- Summarize:
  - count total participants with `n()`
  - count those in remission by summing the TRUE/FALSE variable (TRUE = 1, FALSE = 0)
  - calculate the percentage: (count in remission / total count) × 100
  - group by treatment assignment (trt)
```r
# Complete pipeline
remission_desc <- df_clean |>
  # Select relevant columns
  select(id, trt, lsas_post) |> #<1>
  # Create remission indicator
  mutate(
    remission = lsas_post < 30 #<2>
  ) |>
  # Calculate percentage by treatment group
  summarize(
    n = n(),
    n_remission = sum(remission, na.rm = TRUE), #<3>
    pct_remission = (n_remission / n) * 100,
    .by = trt
  )

remission_desc
```
1. Select only the needed variables (ID, treatment assignment, post-treatment LSAS)
2. Create a binary variable: TRUE if LSAS < 30 (remission)
3. Count and calculate the percentage in remission by treatment group
This pipeline shows how to combine selecting variables, creating new variables, and summarizing in a single workflow. LSAS scores below 30 indicate remission from social anxiety. You’ll see results for all three treatment groups.
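A common shorthand: because TRUE counts as 1 and FALSE as 0, `mean()` of a logical gives the proportion of TRUEs directly. A sketch on a toy tibble (note that `na.rm = TRUE` inside `mean()` drops NAs from the denominator too, which differs slightly from the `n()` denominator used in the solution):

```r
library(dplyr)

# Toy tibble (not the STePS data)
toy <- tibble(
  trt       = c("a", "a", "b", "b"),
  lsas_post = c(25, 40, 10, NA)
)

# mean() of a logical = proportion of TRUEs
remission_by_grp <- toy |>
  summarize(
    pct_remission = mean(lsas_post < 30, na.rm = TRUE) * 100,
    .by = trt
  )
remission_by_grp
```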
9 Calculate and visualize missing data patterns
Exploring missing data patterns is an important part of data analysis. Let’s calculate and visualize missingness across time points.
Exercise 14 (Missing LSAS observations by time and group)
Calculate the percentage of missing LSAS observations at each time point for each treatment group, then create a line plot to visualize the pattern.
Steps:
- Convert LSAS data to long format (all columns starting with “lsas”)
- Separate the time variable
- Create clean time labels and factor variable for plotting
- Filter out waitlist group at follow-up time points (not measured)
- Create a binary variable indicating if the observation is missing
- Calculate the percentage missing by time and group
- Plot the results using the ordered time factor
Breaking this down:
- Long format: similar to Exercise 10; select the LSAS columns and pivot them longer
- Separate: use `separate()` on the time_point column (like Exercise 11)
- Create time variables: the `case_when()` and `factor()` code is provided; it creates readable time labels and an ordered factor
- Filter: exclude the waitlist group at the fu6 and fu12 time points, since it wasn't measured then. Use `!(condition)` for NOT, and `%in%` to check whether time is in a vector of values. What's the waitlist value in the `trt` variable?
- Missing indicator: use `is.na()` to test whether a value is missing; this returns TRUE/FALSE
- Summarize:
  - count missing with `sum(is_missing)` (TRUE = 1, FALSE = 0)
  - calculate the percentage: (n_missing / n) × 100
  - group by both treatment and time (use `trt` and `time_factor`)
- Plot: use `time_factor` for the x-axis to get proper chronological ordering with readable labels
Which variable contains the LSAS scores? That's what you check for missingness.
```r
# Convert to long format and create time variables
df_lsas_missing <- df_clean |>
  select(id, trt, starts_with("lsas")) |> #<1>
  pivot_longer(
    cols = starts_with("lsas"),
    names_to = "time_point",
    values_to = "lsas_score"
  ) |>
  separate(time_point, into = c("measure", "time"), sep = "_") |> #<2>
  mutate(
    time_clean = case_when(
      time == "screen" ~ "Baseline",
      time == "v1" ~ "Week 1",
      time == "v2" ~ "Week 2",
      time == "v3" ~ "Week 3",
      time == "v4" ~ "Week 4",
      time == "v5" ~ "Week 5",
      time == "v6" ~ "Week 6",
      time == "v7" ~ "Week 7",
      time == "v8" ~ "Week 8",
      time == "post" ~ "Post-treatment",
      time == "fu6" ~ "6-month follow-up",
      time == "fu12" ~ "12-month follow-up"
    ),
    time_factor = factor( #<3>
      time_clean,
      levels = c(
        "Baseline", "Week 1", "Week 2", "Week 3", "Week 4",
        "Week 5", "Week 6", "Week 7", "Week 8",
        "Post-treatment", "6-month follow-up", "12-month follow-up"
      )
    )
  ) |>
  select(-measure) |>
  # Filter out waitlist at follow-up (not measured)
  filter(!(trt == "waitlist" & time %in% c("fu6", "fu12"))) #<4>

# Calculate missing percentages
missing <- df_lsas_missing |>
  mutate(
    is_missing = is.na(lsas_score) #<5>
  ) |>
  summarize(
    n = n(),
    n_missing = sum(is_missing), #<6>
    missing_percent = (n_missing / n) * 100, #<7>
    .by = c(trt, time_factor) #<8>
  )

# Plot
library(ggplot2)
ggplot(
  missing,
  aes(
    time_factor, #<9>
    missing_percent,
    group = trt,
    color = trt
  )
) +
  geom_line() +
  geom_point() +
  labs(
    title = "LSAS Percent missing observations",
    x = "Time",
    y = "Missing (%)",
    color = "Treatment"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
1. Select ID, treatment assignment, and the LSAS columns
2. Separate the time_point column to extract time
3. Create an ordered factor for proper chronological plotting with readable labels
4. Exclude waitlist at the 6- and 12-month follow-ups (not measured)
5. Create an indicator: TRUE if missing, FALSE if present
6. Sum the TRUE/FALSE values to count missing observations
7. Calculate the percentage missing
8. Group by both treatment assignment and the time factor
9. Use time_factor for the x-axis to ensure proper time ordering with readable labels
The plot reveals missing data patterns across time and treatment groups, which is important for understanding dropout patterns.
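Incidentally, the same TRUE/FALSE trick combines with `across()` to compute missingness for many columns at once; a sketch on an invented tibble (not the STePS data):

```r
library(dplyr)

# Toy tibble with some missing values
toy <- tibble(
  lsas_screen = c(50, NA, 60),
  lsas_post   = c(NA, NA, 45)
)

# across() applies the same summary to every matching column
miss_pct <- toy |>
  summarize(across(starts_with("lsas"), ~ mean(is.na(.x)) * 100))
miss_pct
```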
10 Summary
In this lab, you practiced:
- Selecting columns with `select()` and helper functions
- Filtering rows with `filter()` and multiple conditions
- Creating variables with `mutate()` and `case_when()`
- Working with factors to control the ordering of categorical variables
- Summarizing data with `group_by()` and `summarize()`
- Reshaping data with `pivot_longer()` and `pivot_wider()`
- Calculating proportions, including missing data patterns
- Visualizing patterns with ggplot2