Lab: Descriptive statistics

Use the functions described in the Import and clean data lab to get a quick overview of the dataset called df_clean and give a brief summary of it.

Have a look at Import and clean data

The full solution is:

# glimpse() comes from the dplyr package, assumed to be loaded
# as in the Import and clean data lab
library(dplyr)

glimpse(df_clean)
head(df_clean)
ncol(df_clean)
nrow(df_clean)

Visualize the distribution of the PHQ-9 scale and the PID-5 scale and provide the mean, median, and mode.

Have a look at Import and clean data

The full solution is:

#PHQ-9
hist(df_clean$phq9_screen)
mean(df_clean$phq9_screen)
median(df_clean$phq9_screen)
get_mode(df_clean$phq9_screen)

#PID-5
hist(df_clean$pid_5_screen)
mean(df_clean$pid_5_screen)
median(df_clean$pid_5_screen)
get_mode(df_clean$pid_5_screen)
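
Note that base R has no built-in function for the statistical mode, so get_mode() is assumed to be the helper defined in the Import and clean data lab. If you need to recreate it, a minimal sketch is:

get_mode <- function(x){
  ux <- unique(x)                        # the distinct values
  ux[which.max(tabulate(match(x, ux)))]  # the most frequent one (first, on ties)
}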

How do the centrality measures differ, and why?

Think about mean, median, and mode and how they are influenced by the shape of the distribution or outliers.

The mean, median, and mode differ because they describe different aspects of the data:

  • The mean is sensitive to extreme values and gives the arithmetic average.
  • The median is the middle value, unaffected by outliers, and reflects the central point in skewed distributions.
  • The mode identifies the most frequent value and is useful for categorical or discrete variables.

They differ because of how they respond to skewness and outliers in the data.
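
A small made-up example illustrates the point (the numbers are illustrative, not taken from df_clean):

# a handful of low scores plus one extreme value
x <- c(2, 3, 3, 4, 30)
mean(x)     # 8.4 -- pulled upwards by the outlier
median(x)   # 3   -- unaffected by the outlier
get_mode(x) # 3   -- the most frequent value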

Reason about the pros and cons of the different centrality measures for these scales.

Consider which measure gives the most typical picture and when one measure might mislead.

  • Mean
    • Pros: Uses all data values, good for symmetric distributions.
    • Cons: Highly affected by skewness and outliers.
  • Median
    • Pros: Robust to outliers, better for skewed distributions.
    • Cons: Ignores the magnitude of extreme values, less informative for symmetric data.
  • Mode
    • Pros: Useful for categorical data and identifying the most common response.
    • Cons: Can be unstable if multiple modes exist or if the distribution is flat.

For these psychological scales, the median is often more informative if the distributions are skewed, while the mean is more common when the data are approximately normal. The mode can be informative but is less often used for continuous scales.

Calculate some spread measures for the LSAS-SR scale. What do they tell you, and which ones do you think are most useful for describing the spread of the values? Motivate your answer briefly!

Have a look at Descriptive statistics

The full solution is:

sd(df_clean$lsas_screen)
var(df_clean$lsas_screen)
range(df_clean$lsas_screen)
IQR(df_clean$lsas_screen)
hist(df_clean$lsas_screen)

where the histogram helps you determine which of the spread measures is most useful.
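
For instance, if the distribution is skewed or contains outliers, the SD and variance get inflated while the IQR stays largely stable. A quick illustration with simulated data (the values are arbitrary):

# 99 roughly normal scores plus one extreme outlier
set.seed(1)
x <- c(rnorm(99, mean = 50, sd = 10), 200)
sd(x)   # inflated by the single outlier
IQR(x)  # barely affected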

Calculate the counts, proportions, and percentages for the simulated income categories and visualize the distribution.

To calculate the proportions, you will need to divide the counts by the total number of rows in the dataset, given by nrow().

The full solution is:

# counts
table(df_clean$income)

# proportions
table(df_clean$income)/nrow(df_clean)

# percentages
table(df_clean$income)/nrow(df_clean)*100 

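For the visualization, a bar plot of the counts is a natural choice for a categorical variable such as income:

# bar plot of the counts per income category
barplot(table(df_clean$income),
        xlab = "Income",
        ylab = "Count")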

Visualize the joint distribution of GAD-7 and PHQ-9 as numeric variables and describe what you see. Which plot is the most useful for this purpose?

One useful way to visualize joint distributions of numeric variables is to create a scatter plot. See Descriptive statistics

The full solution is:

plot(df_clean$gad_screen, df_clean$phq9_screen)
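
To make the plot easier to read, you can label the axes; xlab and ylab are standard arguments to plot():

plot(df_clean$gad_screen, df_clean$phq9_screen,
     xlab = "GAD-7",
     ylab = "PHQ-9")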

Visualize the distribution of LSAS scores by income level and describe what you see. Which plot is the most useful for this purpose?

One useful way to visualize the joint distribution of a numeric variable and a categorical variable is to create a grouped boxplot. See Descriptive statistics

The full solution is:

boxplot(df_clean$lsas_screen ~ df_clean$income,
        ylab = "LSAS-SR",
        xlab = "Income")
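
When describing the plot, compare the medians and box heights across income levels; points beyond the whiskers flag potential outliers within each group.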

Create a table using the tableone package to show descriptive statistics stratified by high vs low depression levels. Briefly interpret what you see.

The full solution is:

# load the tableone package (install.packages("tableone") if needed)
library(tableone)

# define the variables you want
vars <- c(
  "lsas_screen",
  "gad_screen",
  "phq9_screen",
  "bbq_screen",
  "scs_screen",
  "dmrsodf_screen",
  "ders_screen",
  "pid_5_screen"
)

CreateTableOne(vars = vars, data = df_clean, strata = "phq_cat", test = FALSE)
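
For continuous variables, CreateTableOne() reports the mean (SD) per stratum, so you can compare the high and low depression groups column by column; scales with clearly different means across the phq_cat strata are the ones that track depression severity in this sample.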

Calculate the population variance and standard deviation of LSAS. How and why do these differ from the ones given by the functions sd() and var()?

See Descriptive statistics

You can use the following code to create functions for the population variance and SD:

#creating a function to calculate the population variance 
pop_var <- function(x){
  1/length(x)*sum((x-mean(x))^2)
}

# and the population sd
pop_sd <- function(x){
  sqrt(1/length(x)*sum((x-mean(x))^2))
}

The full solution is:

#creating a function to calculate the population variance 
pop_var <- function(x){
  1/length(x)*sum((x-mean(x))^2)
}

# and the population sd
pop_sd <- function(x){
  sqrt(1/length(x)*sum((x-mean(x))^2))
}

# variance
pop_var(df_clean$lsas_screen) # population
var(df_clean$lsas_screen) #sample

# SD
pop_sd(df_clean$lsas_screen) # population 
sd(df_clean$lsas_screen) # sample
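
Note that var() and sd() use the sample formulas, which divide by n - 1 (Bessel's correction) rather than n. The sample estimates are therefore slightly larger: sample variance = n/(n - 1) × population variance. With a large sample the correction factor is close to 1 and the difference is negligible.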

Use only the first ten participants and compare the population and the sample variance and standard deviation of LSAS-SR. What do you find, and how do the results compare to those from the previous exercise?

You can use the following code to get the first 10 rows of your data

df_clean_10 <- df_clean[1:10,]

The full solution is:

#creating a function to calculate the population variance 
pop_var <- function(x){
  1/length(x)*sum((x-mean(x))^2)
}

# and the population sd
pop_sd <- function(x){
  sqrt(1/length(x)*sum((x-mean(x))^2))
}

#create dataset with only the first 10 participants
df_clean_10 <- df_clean[1:10,]

# variance
pop_var(df_clean_10$lsas_screen)
var(df_clean_10$lsas_screen)

# SD
pop_sd(df_clean_10$lsas_screen)
sd(df_clean_10$lsas_screen)
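
With only n = 10 observations, the sample variance is 10/9 (about 11%) larger than the population variance, so the difference between the two estimators is much more noticeable than in the previous exercise, where the larger sample made the correction factor n/(n - 1) close to 1.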