Probability rules

In this lab, we will look at how we can work with probability rules in R.

Load packages and data

library(tidyverse)
library(here)
library(knitr)

d_bl <- read_rds(here("data", "steps_baseline.rds"))

Probability rules

Let’s look at how we can implement probability rules in R and apply them to our dataset. In the Import and clean data lab, we created some categorical variables at baseline. These were just simulated, so we will not try to make sense of the results.

d_bl |>
  select(where(is.character)) |>
  glimpse()

Rows: 181
Columns: 5
$ gender    <chr> "Woman", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Ma…
$ education <chr> "Primary", "University", "Secondary", "Secondary", "Universi…
$ income    <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "Low", "Lo…
$ gad_cat   <chr> "Low anxiety", "High anxiety", "High anxiety", "Low anxiety"…
$ phq_cat   <chr> "Low depression", "High depression", "High depression", "Low…

For this lab, we will use the “education” and “income” variables. First, we will create summaries of these variables.

edu_summary <- d_bl |>
  group_by(education) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1)
  )

kable(edu_summary)

Table 1: Education levels

education	n	proportion	percent
Primary	67	0.3701657	37.0
Secondary	79	0.4364641	43.6
University	35	0.1933702	19.3

income_summary <- d_bl |>
  group_by(income) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1)
  )

kable(income_summary)

Table 2: Income levels

income	n	proportion	percent
High	41	0.2265193	22.7
Low	37	0.2044199	20.4
Medium	103	0.5690608	56.9

Check that the sum of all probabilities is 1

As a quick check, take the education variable, which has three levels: “Primary”, “Secondary”, and “University”. Computing the proportion of each level gives the following:

The proportion with the “Primary” level is 0.3701657, the proportion with the “Secondary” level is 0.4364641, and the proportion with the “University” level is 0.1933702. Because the three levels are mutually exclusive and together cover all 181 participants, the proportions must add up to 1, and they do. All good!
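Proportions are floating-point numbers, so comparing their sum to 1 with `==` can fail because of tiny rounding errors. The following is a minimal sketch of a more robust check. It assumes we type in the counts from Table 1 by hand instead of reading `d_bl`, and the names `edu_n`, `edu_p`, `total`, and `sums_to_one` are ours, not part of the lab.

```r
# Counts typed in from Table 1 (assumption: used instead of d_bl
# so the example runs without the data file)
edu_n <- c(Primary = 67, Secondary = 79, University = 35)

# Convert counts to proportions
edu_p <- edu_n / sum(edu_n)

# Sum of all probabilities; all.equal() tolerates floating-point error
total <- sum(edu_p)
sums_to_one <- isTRUE(all.equal(total, 1))
sums_to_one
```

With real data, the same idea applies to `sum(edu_summary$proportion)`.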

Complement rule

The probability of an event not occurring is 1 minus the probability that it will occur.

Let’s check this for the “Secondary” level.

\[ P(\text{not Secondary}) = 1 - P(\text{Secondary}) \]

# probability of Secondary education
p_secondary <- edu_summary |>
  filter(education == "Secondary") |>
  pull(proportion)

# complement rule
p_not_secondary <- 1 - p_secondary

P(Secondary education) = 0.4365
P(not Secondary education) = 0.5635
Check complement rule, sum = 1
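One way to convince ourselves that the complement rule holds is to count the "not Secondary" participants directly and compare. This is a sketch under the assumption that we reuse the counts from Table 1; the variable names are illustrative only.

```r
# Counts typed in from Table 1 (assumption: stands in for d_bl)
edu_n <- c(Primary = 67, Secondary = 79, University = 35)
n_total <- sum(edu_n)

# Complement rule: 1 - P(Secondary)
p_not_secondary_rule <- 1 - edu_n[["Secondary"]] / n_total

# Direct count: everyone whose level is not "Secondary"
p_not_secondary_count <- sum(edu_n[names(edu_n) != "Secondary"]) / n_total

# The two routes should agree up to floating-point error
rule_matches_count <- isTRUE(all.equal(p_not_secondary_rule, p_not_secondary_count))
rule_matches_count
```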

Addition rule

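The addition rule gives the probability that event A or event B (or both) occurs. We subtract the overlap so that cases where both events occur are not counted twice: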

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Let’s implement this using our dataset. We’ll look at the probability of having either “Primary” education OR being in the “Low” income group.

# get probabilities
p_primary <- edu_summary |>
  filter(education == "Primary") |>
  pull(proportion)

p_low_income <- income_summary |>
  filter(income == "Low") |>
  pull(proportion)

# calculate probability of both Primary education AND Low income
# (parentheses make explicit that the rows are counted before dividing)
p_both <- (d_bl |>
  filter(education == "Primary" & income == "Low") |>
  nrow()) / nrow(d_bl)

# addition rule
p_either <- p_primary + p_low_income - p_both

P(Primary education) = 0.3702
P(Low income) = 0.2044
P(Both) = 0.0552
P(Either) = 0.5193
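The addition rule can be cross-checked by counting the participants who fall in the union directly. The sketch below assumes we type in the joint counts reported in Table 3 (further down in this lab) rather than using `d_bl`; the object names `joint_n`, `in_union`, `p_either_rule`, and `p_either_count` are ours.

```r
# Joint counts typed in from Table 3 (rows: education, columns: income)
joint_n <- matrix(
  c(18, 10, 39,
    15, 20, 44,
     8,  7, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    education = c("Primary", "Secondary", "University"),
    income = c("High", "Low", "Medium")
  )
)
n_total <- sum(joint_n)

# Addition rule: P(A) + P(B) - P(A and B)
p_primary <- sum(joint_n["Primary", ]) / n_total
p_low <- sum(joint_n[, "Low"]) / n_total
p_both <- joint_n["Primary", "Low"] / n_total
p_either_rule <- p_primary + p_low - p_both

# Direct count: a cell is in the union if its row is Primary OR its column is Low
in_union <- outer(rownames(joint_n) == "Primary", colnames(joint_n) == "Low", `|`)
p_either_count <- sum(joint_n[in_union]) / n_total

isTRUE(all.equal(p_either_rule, p_either_count))
```

Both routes count 94 of the 181 participants, which matches P(Either) = 0.5193.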

Multiplication rule

For independent events, the probability of both events occurring is the product of their individual probabilities:

\[P(A \cap B) = P(A) \times P(B)\]

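More generally, whether or not the events are independent, the joint probability is the probability of A times the conditional probability of B given A. When A and B are independent, P(B|A) = P(B), and this reduces to the product above: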

\[P(A \cap B) = P(A) \times P(B|A)\]

Let’s check if education and income are independent by comparing the observed joint probability with the product of marginal probabilities. We will first use group_by() and summarise() to create Table 3.

edu_income_table <- d_bl |>
  group_by(education, income) |>
  summarise(
    n = n(),
    proportion = n / nrow(d_bl),
    percent = round(proportion * 100, 1),
    .groups = "drop"
  )

kable(edu_income_table)

Table 3: Joint probabilities of education and income

education	income	n	proportion	percent
Primary	High	18	0.0994475	9.9
Primary	Low	10	0.0552486	5.5
Primary	Medium	39	0.2154696	21.5
Secondary	High	15	0.0828729	8.3
Secondary	Low	20	0.1104972	11.0
Secondary	Medium	44	0.2430939	24.3
University	High	8	0.0441989	4.4
University	Low	7	0.0386740	3.9
University	Medium	20	0.1104972	11.0

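# Check independence for Primary education and Low income
# joint probability we would expect if education and income were independent
p_primary_indep <- p_primary * p_low_income
# joint probability actually observed in the data
p_primary_dep <- p_both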

If independent: P(Primary ∩ Low income) = 0.075669
Observed: P(Primary ∩ Low income) = 0.055249
Difference = 0.020421
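The same comparison can be made for every combination of education and income at once. This is a minimal sketch, assuming we type in the counts from Table 3 by hand; `joint_p`, `expected_p`, and `diff_p` are illustrative names, and `outer()` builds the table of products of the marginal probabilities.

```r
# Joint counts typed in from Table 3 (rows: education, columns: income)
joint_n <- matrix(
  c(18, 10, 39,
    15, 20, 44,
     8,  7, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    education = c("Primary", "Secondary", "University"),
    income = c("High", "Low", "Medium")
  )
)

# Observed joint probabilities
joint_p <- joint_n / sum(joint_n)

# Marginal probabilities of education (rows) and income (columns)
p_edu <- rowSums(joint_p)
p_inc <- colSums(joint_p)

# Joint probabilities expected if education and income were independent
expected_p <- outer(p_edu, p_inc)

# Expected minus observed, as in the output above; values near 0 are
# consistent with independence for that cell
diff_p <- expected_p - joint_p
round(diff_p, 4)
```

The "Primary"/"Low" cell of `diff_p` corresponds to the difference printed above. The differences in each row and each column sum to zero, because both tables share the same marginal probabilities.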

Tip

Keep in mind that this is simulated data, so the numbers may not represent the real world. Nonetheless, if we observed a result like this, what would we conclude?

Conditional probability

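Conditional probability is the probability of event B occurring given that event A has occurred. We restrict attention to the cases where A occurred and ask what share of them also show B: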

\[P(B|A) = \frac{P(A \cap B)}{P(A)}\]

Let’s calculate the probability of having “Low” income given that someone has “Primary” education:

p_low_given_primary <- p_both / p_primary

P(Low income | Primary education) = 0.1493
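All conditional probabilities of income given education can be obtained in one step by dividing each row of the joint table by its row total. The sketch below assumes the counts from Table 3 typed in by hand and uses base R's `prop.table()` with `margin = 1`; the names `cond_p` and `p_both_from_rule` are ours.

```r
# Joint counts typed in from Table 3 (rows: education, columns: income)
joint_n <- matrix(
  c(18, 10, 39,
    15, 20, 44,
     8,  7, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    education = c("Primary", "Secondary", "University"),
    income = c("High", "Low", "Medium")
  )
)

# Row-wise proportions: each row holds P(income | education) and sums to 1
cond_p <- prop.table(joint_n, margin = 1)
p_low_given_primary <- cond_p["Primary", "Low"]

# Multiplication rule: P(Primary) * P(Low | Primary) recovers P(Primary and Low)
p_primary <- sum(joint_n["Primary", ]) / sum(joint_n)
p_both_from_rule <- p_primary * p_low_given_primary

round(cond_p, 4)
```

Comparing P(Low income | Primary education) = 0.1493 with the marginal P(Low income) = 0.2044 is another way to look at the independence question from the previous section: under independence the two would be equal.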