Import and clean data

In this chapter, we will:

  • import a dataset into R,
  • check the structure and data types of its columns,
  • clean column names,
  • ensure missing values are coded as NA,
  • fix column types and create a treatment indicator variable, and
  • save the cleaned data.

Introduction to the dataset

Note

This dataset comes in the widely used .csv file format. If you need to import other file formats, such as .xlsx (Excel) or .sav (SPSS), you can use the readxl and haven packages, respectively.
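As a hedged sketch (the file names below are hypothetical, not part of this course's data), importing those formats looks much like importing a .csv:

```r
# Hypothetical file names -- adjust to your own project.
library(readxl) # for Excel files
df_excel <- read_excel("data/mydata.xlsx")

library(haven)  # for SPSS, Stata, and SAS files
df_spss <- read_sav("data/mydata.sav")
```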

STePS-study

This is a dataset from the STePS study, an RCT (randomized controlled trial) comparing guided and unguided internet-delivered psychodynamic therapy for social anxiety disorder. The study has been published online.

In true open science fashion, the data is openly available online from the Open Science Framework.

Load packages

We will be using the tidyverse package quite a lot in this course. Tidyverse is a collection of useful packages for data manipulation and visualization, such as dplyr and ggplot2. We will also use the here package, which helps us manage file paths in a simple way. Finally, the janitor package has a great function for cleaning column names.

library(tidyverse)
library(here)
library(janitor)

If you are using R for the first time, you have to run the following code in the console to install the packages (one time only!). You can then load them using the library() function.

if (!require(tidyverse)) install.packages("tidyverse")
if (!require(here)) install.packages("here")
if (!require(janitor)) install.packages("janitor")
Tip: R packages

There is a rich ecosystem of packages for R, which add functionality to the base R language. You can find specialised packages for many different tasks. It’s important to keep in mind, however, that these are developed open source, and it is good practice to always read the documentation so that you understand how to use the package correctly.
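For example, you can read a package's documentation directly from R:

```r
?read_csv                   # help page for a single function
help(package = "janitor")   # index of everything in a package
vignette(package = "dplyr") # list a package's longer tutorials (vignettes)
```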

Importing data into R

Let’s start working with the STePS dataset. We use the read_csv() function, along with the here() function, to import the data. Within the here() function, we first specify the folder, then the filename. This is helpful for keeping the file paths clean, and makes it easier to share your code with others since the file paths are relative to the project folder.

We also check the number of rows and columns in the dataset after importing it.

df_rawdata <- read_csv(here("data", "steps_raw.csv"))

row_check <- nrow(df_rawdata) # 181
col_check <- ncol(df_rawdata) # 37

Check data structure

After importing the data, we can check the structure of the dataset using the glimpse() function. This function provides a quick overview of the dataset, including the number of rows and columns, as well as the data types of each column. Does it look as expected?

Your object df_rawdata should contain 181 rows and 37 columns. We can see that the first column is named ID, which is the unique identifier for each participant. The second column is named Group, which indicates the group assignment: unguided treatment, guided treatment, or waitlist. The rest seem to be various questionnaires and scales.

glimpse(df_rawdata)
Rows: 181
Columns: 37
$ ID               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Group            <dbl> 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 2, 1, 1, 2, 0,…
$ `LSAS Screening` <dbl> 63, 71, 98, 63, 74, 81, 67, 76, 88, 73, 86, 78, 97, 7…
$ GAD_screen       <dbl> 7, 17, 18, 8, 14, 11, 5, 8, 14, 5, 15, 16, 17, 13, 10…
$ `PHQ-9 screen`   <dbl> 6, 13, 19, 4, 18, 8, 9, 8, 14, 3, 5, 11, 12, 18, 10, …
$ BBQ_screen       <dbl> 60, 66, 4, 50, 22, 23, 47, 52, 31, 46, 67, 24, 57, 40…
$ SCS_screen       <dbl> 25, 16, 22, 35, 29, 30, 20, 34, 21, 26, 35, 21, 32, 3…
$ DMRSODF_screen   <dbl> 4.9178, 5.0727, 4.5074, 5.3810, 4.8444, 5.0899, 4.692…
$ `DERS-16_screen` <dbl> 44, 73, 65, 45, 46, 49, 57, 38, 67, 45, 55, 56, 71, 3…
$ `PID-5_screen`   <dbl> 25, 20, 48, 17, 24, 20, 24, 26, 39, 23, 24, 26, 23, 1…
$ LSAS_V1          <chr> "72", "missing", "81", "44", "39", "65", "68", "69", …
$ LSAS_V2          <chr> "64", "missing", "89", "33", "115", "64", NA, "70", "…
$ LSAS_V3          <chr> "72", "missing", "73", "36", "missing", "63", NA, "71…
$ LSAS_V4          <chr> "61", "missing", "94", "44", "missing", "60", NA, "51…
$ LSAS_V5          <chr> "61", NA, "93", "21", "missing", "55", NA, "55", NA, …
$ LSAS_V6          <chr> "46", NA, "88", "20", "missing", "46", NA, "56", "93"…
$ LSAS_V7          <dbl> 55, NA, NA, 18, NA, 45, NA, 64, NA, 54, 101, NA, 89, …
$ LSAS_V8          <dbl> 49, NA, NA, 17, NA, 44, NA, 52, NA, 61, 84, NA, 89, N…
$ LSAS_POST        <dbl> 50, NA, 77, 22, NA, 52, 75, 45, 79, 64, 80, 89, 83, 4…
$ GAD_POST         <dbl> 4, NA, 19, 6, NA, 9, 4, 3, 10, 7, 15, 7, 11, 11, 11, …
$ `PHQ-9_POST`     <dbl> 3, NA, 22, 4, NA, 6, 11, 2, 14, 4, 8, 7, 8, 8, 13, 4,…
$ BBQ_POST         <chr> "76", NA, "68", "57", "missing", "14", "46", "70", "3…
$ SCS_POST         <dbl> 34, NA, 34, 34, NA, 30, 19, 34, 23, 23, 31, 37, 34, 3…
$ DMRSODF_POST     <dbl> 5.0776, NA, 4.2809, 5.2069, NA, 5.1758, 4.9000, 5.023…
$ `DERS-16_POST`   <dbl> 36, NA, 78, 38, NA, 54, 61, 26, 56, 51, 52, 25, 37, 3…
$ LSAS_FU6         <dbl> 33, NA, NA, 14, 6, 60, 64, 49, NA, 45, 85, 74, NA, 58…
$ GAD_FU6          <dbl> 0, NA, NA, 0, NA, 14, 6, 4, NA, 5, 16, NA, NA, 11, 8,…
$ PHQ9_FU6         <dbl> 3, NA, NA, 2, NA, 6, 8, 3, NA, 6, 22, NA, NA, 6, 5, 1…
$ BBQ_FU6          <dbl> 77, NA, NA, 68, NA, 9, 56, 64, NA, 48, 36, NA, NA, 39…
$ SCS_FU6          <chr> "28", NA, NA, "41", "missing", "24", "29", "33", "mis…
$ DERS_FU6         <dbl> 35, NA, NA, 36, NA, 72, 61, 26, NA, 37, 67, NA, NA, 3…
$ LSAS_FU12        <dbl> 27, NA, NA, 16, 39, 75, 66, 43, NA, 51, 79, NA, NA, 5…
$ GAD_FU12         <dbl> 5, NA, NA, 0, 11, 7, 6, 4, NA, 4, 14, NA, NA, 8, 19, …
$ `PHQ-9_FU12`     <dbl> 5, NA, NA, 2, 19, 9, 14, 3, NA, 5, 15, NA, NA, 8, 18,…
$ BBQ_FU12         <dbl> 76, NA, NA, 62, 12, 22, 52, 54, NA, 47, 25, NA, NA, 3…
$ SCS_FU12         <dbl> 38, NA, NA, 40, 28, 33, 26, 34, NA, 21, 30, NA, NA, 3…
$ DERS16_FU12      <dbl> 35, NA, NA, 32, 46, 65, 42, 27, NA, 53, 52, NA, NA, 3…

Other options for checking the basic structure of the data include head(), names(), ncol(), and nrow().

  • head(df_rawdata) shows the first few rows of the dataset.
  • names(df_rawdata) shows the names of the columns.
  • ncol(df_rawdata) shows the number of columns in the dataset.
  • nrow(df_rawdata) shows the number of rows in the dataset.

Types of data

It’s good practice to check and verify the type of data in each column after importing the data. For example, if you have a variable that should be numeric, but is instead character, you need to fix this before you can do any analyses. You will learn more about manipulating data types in the Tidy data manipulation chapter.

The most common types of columns you will work with are:

  • chr: character strings
  • int: integers
  • dbl: numeric values
  • lgl: logical (TRUE/FALSE)
  • fct: factors
  • date: dates
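As a small self-made example (not part of the STePS data), here is a tibble with one column of each type:

```r
library(tibble)
library(dplyr)

toy <- tibble(
  chr  = c("a", "b"),                            # character
  int  = c(1L, 2L),                              # integer
  dbl  = c(1.5, 2.5),                            # double (numeric)
  lgl  = c(TRUE, FALSE),                         # logical
  fct  = factor(c("low", "high")),               # factor
  date = as.Date(c("2024-01-01", "2024-01-02"))  # date
)
glimpse(toy)
```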

Our dataset is fairly clean, but there are problems with some variables. You can see from the glimpse() call above that some variables for LSAS and other questionnaires are coded as <chr> even though they should only have numbers. We will address these issues below.

Data cleaning

From the output above, we can see some potential issues with the raw data:

  • Some column names contain spaces and other problematic characters (such as -).
  • There are missing values in several columns.
  • The measurement time-points are named inconsistently (e.g., Screening vs. screen).
  • The data types of some columns are not as expected (e.g., character instead of numeric).

Let’s get started with cleaning the data. Before we proceed, we will create a copy of the raw data and leave df_rawdata unchanged. This is a good practice to avoid losing the original data and to keep track of the changes we make.

Important

Make a habit of keeping the raw data you import from data collection platforms (e.g., BASS, REDCap) unchanged. This way, your life will be easier when you need to do new exports or updates. You can always go back to the original data if needed.

df_data <- df_rawdata

Clean column names

We will use the clean_names() function from the janitor package to clean the column names. This function replaces spaces and other problematic characters with underscores, and converts all names to lowercase. This makes it easier to work with the data later on.

What’s wrong with these column names? And what is the difference after clean_names()?

names(df_data)
 [1] "ID"             "Group"          "LSAS Screening" "GAD_screen"    
 [5] "PHQ-9 screen"   "BBQ_screen"     "SCS_screen"     "DMRSODF_screen"
 [9] "DERS-16_screen" "PID-5_screen"   "LSAS_V1"        "LSAS_V2"       
[13] "LSAS_V3"        "LSAS_V4"        "LSAS_V5"        "LSAS_V6"       
[17] "LSAS_V7"        "LSAS_V8"        "LSAS_POST"      "GAD_POST"      
[21] "PHQ-9_POST"     "BBQ_POST"       "SCS_POST"       "DMRSODF_POST"  
[25] "DERS-16_POST"   "LSAS_FU6"       "GAD_FU6"        "PHQ9_FU6"      
[29] "BBQ_FU6"        "SCS_FU6"        "DERS_FU6"       "LSAS_FU12"     
[33] "GAD_FU12"       "PHQ-9_FU12"     "BBQ_FU12"       "SCS_FU12"      
[37] "DERS16_FU12"   
df_data <- df_data |>
  clean_names()
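To see what clean_names() does in isolation, here is a stand-alone example with made-up column names mirroring the ones above:

```r
library(tibble)
library(janitor)

toy <- tibble("LSAS Screening" = 1, "PHQ-9 screen" = 2, "DERS-16_POST" = 3)
toy |> clean_names() |> names()
#> [1] "lsas_screening" "phq_9_screen"   "ders_16_post"
```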

We have sorted out some of the problems, but we still have inconsistent names for time-points (screening and screen), and some questionnaires seem to have been given inconsistent names as well (ders_16, ders16, and ders).

df_data <- df_data |>
  rename_with(~ .x |>
    str_replace_all("screening", "screen") |>
    str_replace_all("ders_16|ders16", "ders") |>
    str_replace_all("phq_9", "phq9"))
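In the pattern "ders_16|ders16", the | means "or". A stand-alone illustration with stringr, using made-up column names:

```r
library(stringr)

str_replace_all(
  c("ders_16_post", "ders16_fu12", "ders_fu6"),
  "ders_16|ders16",
  "ders"
)
#> [1] "ders_post"  "ders_fu12"  "ders_fu6"
```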
Tip: This is rarely a linear process

This type of data cleaning is often a result of going back and forth between running code and checking the results. Sometimes you will come all the way to the analyses before finding that DERS-16 is missing a time-point! This is why it’s helpful to design your project in a way that allows you to easily go back and forth between steps.

Ensure missing values are coded as NA

If we are lucky, all the missing values are already coded as NA and R will recognize them as missing. More often, however, the missing values are coded as "", -99, or other placeholder values. This can cause problems later, so we need to ensure that all missing values are coded as NA.

Such labels for missing values are often also the reason why some columns are not recognized as numeric.

Let’s check the data structure again. It looks like some columns that should be numeric are instead <chr>, which means they are character strings.

glimpse(df_data)
Rows: 181
Columns: 37
$ id             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ group          <dbl> 2, 1, 0, 2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 2, 1, 1, 2, 0, 1…
$ lsas_screen    <dbl> 63, 71, 98, 63, 74, 81, 67, 76, 88, 73, 86, 78, 97, 72,…
$ gad_screen     <dbl> 7, 17, 18, 8, 14, 11, 5, 8, 14, 5, 15, 16, 17, 13, 10, …
$ phq9_screen    <dbl> 6, 13, 19, 4, 18, 8, 9, 8, 14, 3, 5, 11, 12, 18, 10, 4,…
$ bbq_screen     <dbl> 60, 66, 4, 50, 22, 23, 47, 52, 31, 46, 67, 24, 57, 40, …
$ scs_screen     <dbl> 25, 16, 22, 35, 29, 30, 20, 34, 21, 26, 35, 21, 32, 33,…
$ dmrsodf_screen <dbl> 4.9178, 5.0727, 4.5074, 5.3810, 4.8444, 5.0899, 4.6923,…
$ ders_screen    <dbl> 44, 73, 65, 45, 46, 49, 57, 38, 67, 45, 55, 56, 71, 36,…
$ pid_5_screen   <dbl> 25, 20, 48, 17, 24, 20, 24, 26, 39, 23, 24, 26, 23, 15,…
$ lsas_v1        <chr> "72", "missing", "81", "44", "39", "65", "68", "69", NA…
$ lsas_v2        <chr> "64", "missing", "89", "33", "115", "64", NA, "70", "97…
$ lsas_v3        <chr> "72", "missing", "73", "36", "missing", "63", NA, "71",…
$ lsas_v4        <chr> "61", "missing", "94", "44", "missing", "60", NA, "51",…
$ lsas_v5        <chr> "61", NA, "93", "21", "missing", "55", NA, "55", NA, "6…
$ lsas_v6        <chr> "46", NA, "88", "20", "missing", "46", NA, "56", "93", …
$ lsas_v7        <dbl> 55, NA, NA, 18, NA, 45, NA, 64, NA, 54, 101, NA, 89, 43…
$ lsas_v8        <dbl> 49, NA, NA, 17, NA, 44, NA, 52, NA, 61, 84, NA, 89, NA,…
$ lsas_post      <dbl> 50, NA, 77, 22, NA, 52, 75, 45, 79, 64, 80, 89, 83, 45,…
$ gad_post       <dbl> 4, NA, 19, 6, NA, 9, 4, 3, 10, 7, 15, 7, 11, 11, 11, 1,…
$ phq9_post      <dbl> 3, NA, 22, 4, NA, 6, 11, 2, 14, 4, 8, 7, 8, 8, 13, 4, 1…
$ bbq_post       <chr> "76", NA, "68", "57", "missing", "14", "46", "70", "36"…
$ scs_post       <dbl> 34, NA, 34, 34, NA, 30, 19, 34, 23, 23, 31, 37, 34, 34,…
$ dmrsodf_post   <dbl> 5.0776, NA, 4.2809, 5.2069, NA, 5.1758, 4.9000, 5.0235,…
$ ders_post      <dbl> 36, NA, 78, 38, NA, 54, 61, 26, 56, 51, 52, 25, 37, 33,…
$ lsas_fu6       <dbl> 33, NA, NA, 14, 6, 60, 64, 49, NA, 45, 85, 74, NA, 58, …
$ gad_fu6        <dbl> 0, NA, NA, 0, NA, 14, 6, 4, NA, 5, 16, NA, NA, 11, 8, 8…
$ phq9_fu6       <dbl> 3, NA, NA, 2, NA, 6, 8, 3, NA, 6, 22, NA, NA, 6, 5, 12,…
$ bbq_fu6        <dbl> 77, NA, NA, 68, NA, 9, 56, 64, NA, 48, 36, NA, NA, 39, …
$ scs_fu6        <chr> "28", NA, NA, "41", "missing", "24", "29", "33", "missi…
$ ders_fu6       <dbl> 35, NA, NA, 36, NA, 72, 61, 26, NA, 37, 67, NA, NA, 36,…
$ lsas_fu12      <dbl> 27, NA, NA, 16, 39, 75, 66, 43, NA, 51, 79, NA, NA, 55,…
$ gad_fu12       <dbl> 5, NA, NA, 0, 11, 7, 6, 4, NA, 4, 14, NA, NA, 8, 19, 12…
$ phq9_fu12      <dbl> 5, NA, NA, 2, 19, 9, 14, 3, NA, 5, 15, NA, NA, 8, 18, 3…
$ bbq_fu12       <dbl> 76, NA, NA, 62, 12, 22, 52, 54, NA, 47, 25, NA, NA, 33,…
$ scs_fu12       <dbl> 38, NA, NA, 40, 28, 33, 26, 34, NA, 21, 30, NA, NA, 32,…
$ ders_fu12      <dbl> 35, NA, NA, 32, 46, 65, 42, 27, NA, 53, 52, NA, NA, 36,…

Let’s fix this by replacing the problematic values with NA. We will use the mutate() and across() functions from the dplyr package to apply the na_if() function to all character columns. This will replace any occurrence of “missing” with NA.

df_data <- df_data |>
  mutate(across(where(is.character), ~ na_if(., "missing")))
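na_if(x, y) returns x with every element equal to y replaced by NA. A minimal example:

```r
library(dplyr)

v <- c("12", "missing", "34")
na_if(v, "missing")
#> [1] "12" NA   "34"
```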

Fix column types

The NA values are now correctly coded, but we still have some columns that are not numeric despite only including numbers. We can use the mutate() function again to convert the columns to the correct data type. In this step, we also ensure that the id and group columns are factors.

# which columns should be numeric?
num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")

df_data <- df_data |>
  mutate(
    across(starts_with(num_cols), as.numeric),
    id = factor(id),
    group = factor(group)
  )
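as.numeric() converts character strings such as "61" to numbers; any remaining non-numeric text would become NA (with a warning), which is why we replaced "missing" with NA first. A small base-R illustration:

```r
v <- c("61", NA, "93")
as.numeric(v)
#> [1] 61 NA 93
```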

Create a treatment indicator variable

Our group variable is coded as 0, 1, and 2. This is not very informative, and we run the risk of misinterpreting the results if we don’t have clear labels for the three groups. Let’s create a new variable trt with better labels.

df_data <- df_data |>
  mutate(
    trt = factor(
      group,
      levels = c(0, 1, 2),
      labels = c("waitlist", "self-guided", "therapist-guided")
    )
  )
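factor() maps each value listed in levels to the corresponding entry in labels. A stand-alone sketch with made-up group codes:

```r
g <- c(0, 1, 2, 2, 1)
trt <- factor(
  g,
  levels = c(0, 1, 2),
  labels = c("waitlist", "self-guided", "therapist-guided")
)
table(trt)  # counts: waitlist 1, self-guided 2, therapist-guided 2
```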

Save the cleaned data

Now that we have cleaned the data, we can save it as a new CSV file. We will use the write_csv() function from the readr package to do this. The cleaned data will be saved in the data folder with the name steps_clean.csv.

write_csv(df_data, here("data", "steps_clean.csv"))
Tip: Saving data

When working with data in your research projects, it’s a good idea to save your data in formats that can be reused later. In this course we are using an open dataset, but for your own projects you may want to save your data on a KI server or in another secure location.

Try to keep your file names readable and understandable for both machines and humans.

  • Avoid spaces or special characters in file names
  • Use underscores _ or hyphens - to separate words
  • Use lowercase letters
  • Use date stamps when relevant (e.g., bass_export_2024_01_15.csv)
  • Use descriptive names (e.g., steps_baseline.csv)
  • Be consistent in naming