Probability rules

In this lab, we will look at how we can work with probability rules in R.

Load packages and data

library(tidyverse)
library(here)
library(knitr)

d_bl <- read_rds(here("data", "steps_baseline.rds"))

Probability rules

Let’s look at how we can implement probability rules in R and apply them to our dataset. In the Import and clean data lab, we created some categorical variables at baseline. These were just simulated, so we will not try to make sense of the results.

d_bl |>
  select(where(is.character)) |>
  glimpse()
Rows: 181
Columns: 5
$ gender    <chr> "Woman", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Ma…
$ education <chr> "Primary", "University", "Secondary", "Secondary", "Universi…
$ income    <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "Low", "Lo…
$ gad_cat   <chr> "Low anxiety", "High anxiety", "High anxiety", "Low anxiety"…
$ phq_cat   <chr> "Low depression", "High depression", "High depression", "Low…

For this lab, we will use the “education” and “income” variables. First, we will create summaries of these variables.

edu_summary <- d_bl |>
  group_by(education) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1)
  )

kable(edu_summary)
Table 1: Education levels

| education  |  n | proportion | percent |
|:-----------|---:|-----------:|--------:|
| Primary    | 67 |  0.3701657 |    37.0 |
| Secondary  | 79 |  0.4364641 |    43.6 |
| University | 35 |  0.1933702 |    19.3 |

income_summary <- d_bl |>
  group_by(income) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1)
  )

kable(income_summary)
Table 2: Income levels

| income |   n | proportion | percent |
|:-------|----:|-----------:|--------:|
| High   |  41 |  0.2265193 |    22.7 |
| Low    |  37 |  0.2044199 |    20.4 |
| Medium | 103 |  0.5690608 |    56.9 |

Check that the sum of all probabilities is 1

As a quick check, we use the education variable, which has three levels: “Primary”, “Secondary”, and “University”. Every participant falls into exactly one of these levels, so the three proportions must sum to 1.

The proportion with the “Primary” level is 0.3701657, the proportion with the “Secondary” level is 0.4364641, and the proportion with the “University” level is 0.1933702. These add up to 1. All good!
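A minimal sketch of this check in code, assuming only the counts shown in Table 1 (the vector names here are illustrative and not part of the lab script). With the lab data loaded, `sum(edu_summary$proportion)` would give the same total.

```r
# Counts copied from Table 1 (181 participants in total)
edu_counts <- c(Primary = 67, Secondary = 79, University = 35)
edu_props <- edu_counts / sum(edu_counts)

# Sum of the three proportions
total <- sum(edu_props)

# all.equal() compares with a small numerical tolerance, which is
# safer than == when adding floating-point numbers
sums_to_one <- isTRUE(all.equal(total, 1))
sums_to_one
```

Comparing with a tolerance rather than `==` is a design choice: sums of fractions are not always represented exactly in floating-point arithmetic.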

Complement rule

The probability of an event not occurring is 1 minus the probability that it will occur.

Let’s check this for the “Secondary” level.

\[ P(\text{not Secondary}) = 1 - P(\text{Secondary}) \]

# probability of Secondary education
p_secondary <- edu_summary |>
  filter(education == "Secondary") |>
  pull(proportion)

# complement rule
p_not_secondary <- 1 - p_secondary
  • P(Secondary education) = 0.4365
  • P(not Secondary education) = 0.5635
  • Check: P(Secondary education) + P(not Secondary education) = 1
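One way to convince ourselves that the rule holds is to compute the complement a second way, by counting everyone who is not in the “Secondary” level. The sketch below assumes the counts from Table 1; the names ending in `_rule` and `_count` are illustrative.

```r
# Counts copied from Table 1
edu_counts <- c(Primary = 67, Secondary = 79, University = 35)
n_total <- sum(edu_counts)

# Complement rule: 1 minus the probability of the event
p_secondary <- edu_counts[["Secondary"]] / n_total
p_not_secondary_rule <- 1 - p_secondary

# Direct count: everyone with Primary or University education
not_secondary <- names(edu_counts) != "Secondary"
p_not_secondary_count <- sum(edu_counts[not_secondary]) / n_total

# Both routes should agree (up to floating-point tolerance)
all.equal(p_not_secondary_rule, p_not_secondary_count)
```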

Addition rule

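The addition rule gives the probability that event A or event B (or both) occurs. We subtract the overlap so that cases where both events occur are not counted twice: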

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Let’s implement this using our dataset. We’ll look at the probability of having “Primary” education or being in the “Low” income group (or both).

# get probabilities
p_primary <- edu_summary |>
  filter(education == "Primary") |>
  pull(proportion)

p_low_income <- income_summary |>
  filter(income == "Low") |>
  pull(proportion)

# calculate probability of both Primary education AND Low income
p_both <- d_bl |>
  filter(education == "Primary" & income == "Low") |>
  nrow() / nrow(d_bl)

# addition rule
p_either <- p_primary + p_low_income - p_both
  • P(Primary education) = 0.3702
  • P(Low income) = 0.2044
  • P(Both) = 0.0552
  • P(Either) = 0.5193
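The addition rule can be cross-checked by counting the participants who meet at least one of the two conditions. This is a sketch in base R, assuming the joint counts that appear in Table 3 of this lab; `joint_counts` and `in_either` are illustrative names.

```r
# Joint counts of education (rows) by income (columns), copied from Table 3
joint_counts <- matrix(
  c(18, 10, 39,
    15, 20, 44,
     8,  7, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    education = c("Primary", "Secondary", "University"),
    income = c("High", "Low", "Medium")
  )
)
n_total <- sum(joint_counts)

# Addition rule: P(A) + P(B) - P(A and B)
p_primary <- sum(joint_counts["Primary", ]) / n_total
p_low_income <- sum(joint_counts[, "Low"]) / n_total
p_both <- joint_counts["Primary", "Low"] / n_total
p_either_rule <- p_primary + p_low_income - p_both

# Direct count: cells in the Primary row or the Low column, each counted once
in_either <- outer(
  rownames(joint_counts) == "Primary",
  colnames(joint_counts) == "Low",
  `|`
)
p_either_count <- sum(joint_counts[in_either]) / n_total

all.equal(p_either_rule, p_either_count)
```

Without the subtraction, the 10 participants with both “Primary” education and “Low” income would be counted twice.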

Multiplication rule

For independent events, the probability of both events occurring is the product of their individual probabilities:

\[P(A \cap B) = P(A) \times P(B)\]

For dependent events, we need to account for the conditional probability:

\[P(A \cap B) = P(A) \times P(B|A)\]

Let’s check if education and income are independent by comparing the observed joint probability with the product of marginal probabilities. We will first use group_by() and summarise() to create Table 3.

edu_income_table <- d_bl |>
  group_by(education, income) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1),
    .groups = "drop"
  )

kable(edu_income_table)
Table 3: Joint probabilities of education and income

| education  | income |  n | proportion | percent |
|:-----------|:-------|---:|-----------:|--------:|
| Primary    | High   | 18 |  0.0994475 |     9.9 |
| Primary    | Low    | 10 |  0.0552486 |     5.5 |
| Primary    | Medium | 39 |  0.2154696 |    21.5 |
| Secondary  | High   | 15 |  0.0828729 |     8.3 |
| Secondary  | Low    | 20 |  0.1104972 |    11.0 |
| Secondary  | Medium | 44 |  0.2430939 |    24.3 |
| University | High   |  8 |  0.0441989 |     4.4 |
| University | Low    |  7 |  0.0386740 |     3.9 |
| University | Medium | 20 |  0.1104972 |    11.0 |

# Check independence for Primary education and Low income
p_primary_indep <- p_primary * p_low_income
p_primary_dep <- p_both
  • Expected if independent: P(Primary) × P(Low income) = 0.075669
  • Observed: P(Primary ∩ Low income) = 0.055249
  • Difference (expected minus observed) = 0.020421

Tip

Keep in mind that this is simulated data, so the numbers may not represent the real world. Nonetheless, if we observed a result like this, what would we conclude?
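The same comparison can be made for every combination of education and income at once. The sketch below is one possible way to do this in base R, assuming the counts from Table 3; it is not part of the original lab code, and `observed`, `expected`, and `difference` are illustrative names.

```r
# Joint counts of education (rows) by income (columns), copied from Table 3
joint_counts <- matrix(
  c(18, 10, 39,
    15, 20, 44,
     8,  7, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    education = c("Primary", "Secondary", "University"),
    income = c("High", "Low", "Medium")
  )
)

# Observed joint probabilities
observed <- joint_counts / sum(joint_counts)

# Joint probabilities expected under independence:
# the product of the row and column marginal probabilities
expected <- outer(rowSums(observed), colSums(observed))

# Expected minus observed, one value per cell
difference <- expected - observed
round(difference, 4)
```

Because both tables sum to 1, the differences across all nine cells sum to zero; what matters is how large the individual deviations are.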

Conditional probability

The conditional probability of B given A is the probability that event B occurs given that event A has occurred:

\[P(B|A) = \frac{P(A \cap B)}{P(A)}\]

Let’s calculate the probability of having “Low” income given that someone has “Primary” education:

p_low_given_primary <- p_both / p_primary
  • P(Low income | Primary education) = 0.1493
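As a worked check using the counts from Tables 1 and 3, plugging this conditional probability into the multiplication rule for dependent events recovers the observed joint probability:

\[P(\text{Primary} \cap \text{Low}) = P(\text{Primary}) \times P(\text{Low} \mid \text{Primary}) = \frac{67}{181} \times \frac{10}{67} = \frac{10}{181} \approx 0.0552\]

If education and income were independent, P(Low income | Primary education) would equal P(Low income) = 0.2044; here it is 0.1493.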