Lab: Import and clean data

This lab is linked to the Import and clean data chapter. We will look at some common steps involved in importing and cleaning data, such as fixing column names, missing values, and column types.

Tip

Three underscores (___) in the code below indicate where you need to fill in the missing parts. We use the webr engine to run the code in your browser, but you can also run it in RStudio. The feedback you get from these web exercises comes from the gradethis package. It’s great but not perfect.

Load packages and data

First, let’s load the packages and data. Do you experience any issues with loading the packages in RStudio? If so, you may need to install them first using the install.packages() function in the console.

1 Inspect data using glimpse()

We will check the basic structure of our raw data by using the glimpse() function.

If you are doing this lab in RStudio, you can try calling view() to open the data in a spreadsheet-like view.

Exercise 1 (Inspect data)  

Use the glimpse() function to check the structure of the STePS data. What do you notice about the data? Are there any potential issues with the data structure?

Use the function on your raw data object.

glimpse(df_rawdata)

glimpse(df_rawdata)

2 Clean column names

Exercise 2 (Clean column names, part 1)  

Use the clean_names() function to clean the column names in df_rawdata. Check the names before and after cleaning. What changes do you notice? Are there any column names that are still problematic?

Use the clean_names() function from the janitor package.

names(df_rawdata) # Clean the column names df_rawdata |> clean_names() |> names()

names(df_rawdata)

# Clean the column names
df_rawdata |> 
  clean_names() |> 
  names()

Exercise 3 (Clean column names, part 2)  

The clean_names() is very helpful, but it didn’t sort all of the problems. We still have some inconsistent column names for screening, DERS-16, and PHQ-9.

Use the rename_with() function to fix the column names of PHQ-9 in df_rawdata. Check the names before and after cleaning.

Use str_replace_all() to standardize the PHQ-9 column names, similar to DERS-16.

df_rawdata |> clean_names() |> rename_with(~ .x |> str_replace_all("screening", "screen") |> str_replace_all("ders_16|ders16", "ders") |> str_replace_all("phq_9|phq9", "phq9") ) |> names()

df_rawdata |> 
  clean_names() |> 
  rename_with(~ .x |> 
    str_replace_all("screening", "screen") |> 
    str_replace_all("ders_16|ders16", "ders") |> 
    str_replace_all("phq_9|phq9", "phq9")
  ) |> 
  names()

3 Fix missing values

Exercise 4 (Check missing values)  

Check which columns have weird values for missing values. Here, we use the unique() function to check which values are present in the character columns. Often when columns we think should be numeric have odd values for missing values, they end up as character columns.

The structure for selecting by column type is: is.type, such as is.numeric. Do the same but for character type.

# Check unique values in a few columns to see "missing" values df_rawdata |> select(where(is.character)) |> map(unique)

# Check unique values in a few columns to see "missing" values
df_rawdata |> 
  select(where(is.character)) |> 
  map(unique)

Exercise 5 (Fix missing values)  

Now that we have seen the problem, use the mutate() and across() functions to replace these values with NA. Make sure to replace all occurrences of “missing” in the character columns.

You should replace the “missing” values with NA.

# Fix the missing values df_rawdata |> clean_names() |> mutate(across(where(is.character), ~na_if(., "missing"))) |> glimpse()

# Fix the missing values
df_rawdata |> 
  clean_names() |> 
  mutate(across(where(is.character), ~na_if(., "missing"))) |> 
  glimpse()

4 Fix column types

Exercise 6 (Fix column types)  

Let’s imagine we have many columns that should be numeric. Instead of repeating the same function many times, we can use the across() function.

Use the object we created for the numeric columns.

# which columns should be numeric? num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5") # Steps from previous exercises df_rawdata |> clean_names() |> mutate(across(where(is.character), ~na_if(., "missing"))) |> # apply the across function to the numeric columns mutate( across(starts_with(num_cols), as.numeric) ) |> glimpse()

# which columns should be numeric?
num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")

# Steps from previous exercises
df_rawdata |> 
  clean_names() |> 
  mutate(across(where(is.character), ~na_if(., "missing"))) |> 
  # apply the across function to the numeric columns
  mutate(
    across(starts_with(num_cols), as.numeric)
  ) |> 
  glimpse()

5 Save cleaned data

Exercise 7 (Save cleaned data)  

Finally, save the cleaned data to a new CSV file called steps_clean.csv in the /data folder. Use the write_csv() function from the readr package along with here(), just like we did in the beginning of this lab when we loaded the data.

The format for using here() is: here("folder", "file").

num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5") # Create the cleaned dataset df_clean <- df_rawdata |> clean_names() |> rename_with(~ .x |> str_replace_all("screening", "screen") |> str_replace_all("ders_16|ders16", "ders") |> str_replace_all("phq_9|phq9", "phq9")) |> mutate(across(where(is.character), ~na_if(., "missing"))) |> mutate( across(starts_with(num_cols), as.numeric) ) # Save the cleaned data write_csv(df_clean, here("data", "steps_clean.csv"))

num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")

# Create the cleaned dataset
df_clean <- df_rawdata |> 
  clean_names() |> 
  rename_with(~ .x |> 
    str_replace_all("screening", "screen") |> 
    str_replace_all("ders_16|ders16", "ders") |> 
    str_replace_all("phq_9|phq9", "phq9")) |>   
  mutate(across(where(is.character), ~na_if(., "missing"))) |> 
  mutate(
    across(starts_with(num_cols), as.numeric)
  )

# Save the cleaned data
write_csv(df_clean, here("data", "steps_clean.csv"))

6 Summary

In this lab, we looked at common steps in importing and cleaning data. We learned how to:

  • Inspect the data using glimpse()
  • Clean column names using clean_names()
  • Fix missing values using mutate() and across()
  • Fix column types using mutate() and across()
  • Save the cleaned data using write_csv()