Lab: Import and clean data
This lab is linked to the Import and clean data chapter. We will look at some common steps involved in importing and cleaning data, such as fixing column names, missing values, and column types.
Three underscores (___
) in the code below indicate where you need to fill in the missing parts. We use the webr
engine to run the code in your browser, but you can also run it in RStudio. The feedback you get from these web exercises comes from the gradethis
package. It’s great but not perfect.
Load packages and data
First, let’s load the packages and data. Do you experience any issues with loading the packages in RStudio? If so, you may need to install them first using the install.packages()
function in the console.
1 Inspect data using glimpse()
We will check the basic structure of our raw data by using the glimpse()
function.
If you are doing this lab in RStudio, you can try calling view()
to open the data in a spreadsheet-like view.
Exercise 1 (Inspect data)
Use the glimpse()
function to check the structure of the STePS data. What do you notice about the data? Are there any potential issues with the data structure?
Use the function on your raw data object.
glimpse(df_rawdata)
glimpse(df_rawdata)
2 Clean column names
Exercise 2 (Clean column names, part 1)
Use the clean_names()
function to clean the column names in df_rawdata
. Check the names before and after cleaning. What changes do you notice? Are there any column names that are still problematic?
Use the clean_names()
function from the janitor package.
names(df_rawdata)
# Clean the column names
df_rawdata |>
clean_names() |>
names()
names(df_rawdata)
# Clean the column names
|>
df_rawdata clean_names() |>
names()
Exercise 3 (Clean column names, part 2)
The clean_names()
is very helpful, but it didn’t sort all of the problems. We still have some inconsistent column names for screening, DERS-16, and PHQ-9.
Use the rename_with()
function to fix the column names of PHQ-9 in df_rawdata
. Check the names before and after cleaning.
Use str_replace_all()
to standardize the PHQ-9 column names, similar to DERS-16.
df_rawdata |>
clean_names() |>
rename_with(~ .x |>
str_replace_all("screening", "screen") |>
str_replace_all("ders_16|ders16", "ders") |>
str_replace_all("phq_9|phq9", "phq9")
) |>
names()
|>
df_rawdata clean_names() |>
rename_with(~ .x |>
str_replace_all("screening", "screen") |>
str_replace_all("ders_16|ders16", "ders") |>
str_replace_all("phq_9|phq9", "phq9")
|>
) names()
3 Fix missing values
Exercise 4 (Check missing values)
Check which columns have weird values for missing values. Here, we use the unique()
function to check which values are present in the character columns. Often when columns we think should be numeric have odd values for missing values, they end up as character columns.
The structure for selecting by column type is: is.type
, such as is.numeric
. Do the same but for character type.
# Check unique values in a few columns to see "missing" values
df_rawdata |>
select(where(is.character)) |>
map(unique)
# Check unique values in a few columns to see "missing" values
|>
df_rawdata select(where(is.character)) |>
map(unique)
Exercise 5 (Fix missing values)
Now that we have seen the problem, use the mutate()
and across()
functions to replace these values with NA
. Make sure to replace all occurrences of “missing” in the character columns.
You should replace the “missing” values with NA
.
# Fix the missing values
df_rawdata |>
clean_names() |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
glimpse()
# Fix the missing values
|>
df_rawdata clean_names() |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
glimpse()
4 Fix column types
Exercise 6 (Fix column types)
Let’s imagine we have many columns that should be numeric. Instead of repeating the same function many times, we can use the across()
function.
Use the object we created for the numeric columns.
# which columns should be numeric?
num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")
# Steps from previous exercises
df_rawdata |>
clean_names() |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
# apply the across function to the numeric columns
mutate(
across(starts_with(num_cols), as.numeric)
) |>
glimpse()
# which columns should be numeric?
<- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")
num_cols
# Steps from previous exercises
|>
df_rawdata clean_names() |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
# apply the across function to the numeric columns
mutate(
across(starts_with(num_cols), as.numeric)
|>
) glimpse()
5 Save cleaned data
Exercise 7 (Save cleaned data)
Finally, save the cleaned data to a new CSV file called steps_clean.csv
in the /data
folder. Use the write_csv()
function from the readr package along with here()
, just like we did in the beginning of this lab when we loaded the data.
The format for using here()
is: here("folder", "file")
.
num_cols <- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")
# Create the cleaned dataset
df_clean <- df_rawdata |>
clean_names() |>
rename_with(~ .x |>
str_replace_all("screening", "screen") |>
str_replace_all("ders_16|ders16", "ders") |>
str_replace_all("phq_9|phq9", "phq9")) |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
mutate(
across(starts_with(num_cols), as.numeric)
)
# Save the cleaned data
write_csv(df_clean, here("data", "steps_clean.csv"))
<- c("lsas", "gad", "phq9", "bbq", "scs", "dmrsodf", "ders", "pid_5")
num_cols
# Create the cleaned dataset
<- df_rawdata |>
df_clean clean_names() |>
rename_with(~ .x |>
str_replace_all("screening", "screen") |>
str_replace_all("ders_16|ders16", "ders") |>
str_replace_all("phq_9|phq9", "phq9")) |>
mutate(across(where(is.character), ~na_if(., "missing"))) |>
mutate(
across(starts_with(num_cols), as.numeric)
)
# Save the cleaned data
write_csv(df_clean, here("data", "steps_clean.csv"))
6 Summary
In this lab, we looked at common steps in importing and cleaning data. We learned how to:
- Inspect the data using
glimpse()
- Clean column names using
clean_names()
- Fix missing values using
mutate()
andacross()
- Fix column types using
mutate()
andacross()
- Save the cleaned data using
write_csv()