Lab: Classification

In this lab, we’ll explore classification metrics using the PHQ-9 scale. We’ll practice calculating sensitivity, specificity, positive predictive value, and negative predictive value. The companion chapter Classification explains the concepts in more detail, so use it as a reference if you get stuck.

Tip

While you can complete all the exercises in your browser, we recommend also practicing in RStudio. Using an editor like RStudio will help you build real-world skills for writing, running, and saving your R code.

1 Load and prepare the data

First, let’s load the data and create the binary outcome variable we’ll use for classification.

Exercise 1 (Create a binary variable of PHQ-9 at post-treatment)  

For the exercises in this chapter, you will work with PHQ-9 instead of LSAS. Create a binary outcome variable phq9_post_bin that is “Low” if phq9_post is less than 10, and “High” otherwise.

The cutoff for PHQ-9 is 10. Use if_else() to create the binary variable.
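If you want to see how if_else() applies the cutoff before touching the real data, here is a minimal sketch on a made-up vector of scores (toy_scores is invented for illustration and is not part of the dataset):

library(dplyr)

# Made-up screening scores, purely for illustration
toy_scores <- c(4, 12, 9, 17)

# Values below the cutoff of 10 become "Low", everything else "High"
if_else(toy_scores < 10, "Low", "High")
#> [1] "Low"  "High" "Low"  "High"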


library(dplyr)
library(readr)
library(here)

# load data
df_clean <- read_csv(here("data", "steps_clean.csv"))

# create binary outcome variable
df_clean <- df_clean |>
  mutate(
    phq9_post_bin = if_else(phq9_post < 10, "Low", "High"),
    phq9_post_bin = factor(phq9_post_bin, levels = c("Low", "High"))
  )

# Check the distribution
table(df_clean$phq9_post_bin)
  1. PHQ-9 cutoff of 10 separates low from high depression symptoms

The binary variable is now ready for classification analysis.
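If you prefer staying within dplyr, count() gives the same information as table(); this is just an optional alternative, not part of the exercise:

df_clean |>
  count(phq9_post_bin)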

2 Fit a logistic regression model

Now let’s fit a logistic regression model to predict the binary outcome. We will use a few functions from the tidymodels package, although such models can also be fit with other packages. Since regression models are not the focus of this course, we will not go into the details of fitting them here.

Exercise 2 (Fit a logistic regression model (PHQ-9))  

Fit a logistic regression model to predict phq9_post_bin using phq9_screen and group. Create a confusion matrix for the predictions.

The tidymodels workflow is:
1. Specify the model type: logistic_reg()
2. Set the engine: set_engine("glm")
3. Set the mode: set_mode("classification")
4. Fit the model: fit(formula, data)
5. Make predictions: predict(model, new_data)
6. Create confusion matrix: conf_mat(data, truth, estimate)

The one piece not spelled out above is the formula, which typically takes the form outcome ~ predictors.

library(tidymodels)

# Fit logistic regression model
phq9_fit <- logistic_reg() |> #<1>
  set_engine("glm") |> #<2>
  set_mode("classification") |> #<3>
  fit(phq9_post_bin ~ phq9_screen + group, data = df_clean) #<4>

# Make predictions
phq9_pred <- predict(phq9_fit, new_data = df_clean) |> #<5>
  bind_cols(df_clean)

# Create confusion matrix
phq9_conf_mat <- conf_mat(phq9_pred, truth = phq9_post_bin, estimate = .pred_class) #<6>

# Display confusion matrix
phq9_conf_mat
  1. Specify logistic regression model
  2. Use GLM engine
  3. Set to classification mode
  4. Fit model with predictors
  5. Make predictions on the data
  6. Create confusion matrix

The confusion matrix shows how well our model predicts the binary outcome.
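Before extracting individual cells in the next section, it helps to know how the table is laid out: yardstick stores the counts in the $table element with Prediction in the rows and Truth in the columns, using the factor level order we set earlier (Low first, then High). The exercises below treat "Low" (the first level) as the positive class, which matches yardstick's default. A quick way to check the orientation yourself:

# The underlying counts: rows are Prediction, columns are Truth
phq9_conf_mat$table

# Confirm the dimension names and level order
dimnames(phq9_conf_mat$table)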

3 Calculate classification metrics manually

Now let’s calculate the key classification metrics by hand: sensitivity, specificity, positive predictive value, and negative predictive value. All you need are the four numbers from the confusion matrix!

If you need to refresh your memory about which numbers are needed for which metric, look in the companion chapter Classification.
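For reference, here are the four metrics written in terms of the confusion-matrix counts (TP = true positives, FP = false positives, FN = false negatives, TN = true negatives):

  Sensitivity = TP / (TP + FN)
  Specificity = TN / (TN + FP)
  PPV = TP / (TP + FP)
  NPV = TN / (TN + FN)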

Exercise 3 (Calculate sensitivity)  

We begin by extracting all values from the confusion matrix. Then we will use the relevant values for each metric.

Calculate the sensitivity (True Positive Rate) for the PHQ-9 predictions. Extract the values from the confusion matrix first.

To calculate sensitivity you need the true positives and false negatives.


# Extract values from confusion matrix
true_pos <- phq9_conf_mat$table[1, 1]  # True positives: predicted Low, truth Low
false_pos <- phq9_conf_mat$table[1, 2] # False positives: predicted Low, truth High
false_neg <- phq9_conf_mat$table[2, 1] # False negatives: predicted High, truth Low
true_neg <- phq9_conf_mat$table[2, 2]  # True negatives: predicted High, truth High

# Calculate sensitivity (True Positive Rate)
sensitivity <- true_pos / (true_pos + false_neg) #<1>

# Display result
sensitivity
  1. Sensitivity: proportion of actual positives correctly identified

Sensitivity tells us how well our model identifies true positive cases.
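If you want to verify the manual calculation, yardstick also provides a sens() metric function. Assuming the phq9_pred data frame from Exercise 2 is still in your environment, this should return the same value, since yardstick treats the first factor level ("Low") as the positive class by default:

# Cross-check the manual result with yardstick's built-in metric
sens(phq9_pred, truth = phq9_post_bin, estimate = .pred_class)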

Exercise 4 (Calculate specificity)  

Calculate the specificity (True Negative Rate) for the PHQ-9 predictions.

To calculate specificity you need the true negatives and false positives.


# Calculate specificity (True Negative Rate)
specificity <- true_neg / (true_neg + false_pos) #<1>

# Display result
specificity
  1. Specificity: proportion of actual negatives correctly identified

Specificity tells us how well our model identifies true negative cases.
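As a cross-check, yardstick's spec() should agree with the manual calculation (again assuming phq9_pred from Exercise 2 is available):

# Built-in specificity as a cross-check
spec(phq9_pred, truth = phq9_post_bin, estimate = .pred_class)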

Exercise 5 (Calculate positive predictive value)  

Calculate the positive predictive value (PPV) for the PHQ-9 predictions.

To calculate PPV you need the true positives and false positives.


# Calculate positive predictive value (Precision)
ppv <- true_pos / (true_pos + false_pos) #<1>

# Display result
ppv
  1. PPV: proportion of predicted positives that are actually positive

PPV tells us how reliable our positive predictions are.
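The corresponding yardstick function here is ppv(); it should match the manual value if phq9_pred from Exercise 2 is still loaded:

# Built-in PPV as a cross-check
ppv(phq9_pred, truth = phq9_post_bin, estimate = .pred_class)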

Exercise 6 (Calculate negative predictive value)  

Calculate the negative predictive value (NPV) for the PHQ-9 predictions.

To calculate NPV you need the true negatives and false negatives.


# Calculate negative predictive value
npv <- true_neg / (true_neg + false_neg) #<1>

# Display result
npv
  1. NPV: proportion of predicted negatives that are actually negative

NPV tells us how reliable our negative predictions are.
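To close the loop, yardstick's npv() should reproduce the manual value, and calling summary() on the confusion matrix returns all of these metrics (and a few more) in one tibble, which is a handy way to check your hand calculations at once:

# Built-in NPV as a cross-check
npv(phq9_pred, truth = phq9_post_bin, estimate = .pred_class)

# All standard metrics from the confusion matrix in one go
summary(phq9_conf_mat)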

4 Summary

In this lab, you learned:

  1. Binary outcome creation: How to create binary outcomes from continuous variables using meaningful cutoffs
  2. Logistic regression: How to fit classification models using tidymodels
  3. Confusion matrices: How to create and interpret confusion matrices
  4. Classification metrics: How to calculate sensitivity, specificity, PPV, and NPV using the confusion matrix