library(tidyverse)
library(here)
library(knitr)
<- read_rds(here("data", "steps_baseline.rds")) d_bl
Probability rules
In this lab, we will look at how we can work with probability rules in R.
Load packages and data
Probability rules
Let’s look at how we can implement probability rules in R and apply them to our dataset. In the Import and clean data lab, we created some categorical variables at baseline. These were just simulated, so we will not try to make sense of the results.
|>
d_bl select(where(is.character)) |>
glimpse()
Rows: 181
Columns: 5
$ gender <chr> "Woman", "Man", "Man", "Man", "Woman", "Woman", "Woman", "Ma…
$ education <chr> "Primary", "University", "Secondary", "Secondary", "Universi…
$ income <chr> "Medium", "Medium", "Medium", "Medium", "Medium", "Low", "Lo…
$ gad_cat <chr> "Low anxiety", "High anxiety", "High anxiety", "Low anxiety"…
$ phq_cat <chr> "Low depression", "High depression", "High depression", "Low…
For this lab, we will use the “education” and “income” variables. First, we will create summaries of these variables.
<- d_bl |>
edu_summary group_by(education) |>
summarise(
n = n(),
proportion = n / nrow(d_bl),
percent = round(proportion * 100, 1)
)
kable(edu_summary)
education | n | proportion | percent |
---|---|---|---|
Primary | 67 | 0.3701657 | 37.0 |
Secondary | 79 | 0.4364641 | 43.6 |
University | 35 | 0.1933702 | 19.3 |
<- d_bl |>
income_summary group_by(income) |>
summarise(
n = n(),
proportion = n / nrow(d_bl),
percent = round(proportion * 100, 1)
)
kable(income_summary)
income | n | proportion | percent |
---|---|---|---|
High | 41 | 0.2265193 | 22.7 |
Low | 37 | 0.2044199 | 20.4 |
Medium | 103 | 0.5690608 | 56.9 |
Check that the sum of all probabilities is 1
We do a quick check of the education variable, which has three levels: “Primary”, “Secondary”, and “University”. When we count the proportion of each level, we get the following:
The proportion with the “Primary” level is 0.3701657, the proportion with the “Secondary” level is 0.4364641, and the proportion with the “University” level is 0.1933702. These add up to 1. All good!
Complement rule
The probability of an event not occurring is 1 minus the probability that it will occur.
Let’s check this for the “Secondary” level.
\[ P(\text{not Secondary}) = 1 - P(\text{Secondary}) \]
# probability of Secondary education
<- edu_summary |>
p_secondary filter(education == "Secondary") |>
pull(proportion)
# complement rule
<- 1 - p_secondary p_not_secondary
- P(Secondary education) = 0.4365
- P(not Secondary education) = 0.5635
- Check complement rule, sum = 1
Addition rule
The probability that event A or event B occurs (or both).
\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]
Let’s implement this using our dataset. We’ll look at the probability of having either “Primary” education OR being in the “Low” income group.
# get probabilities
<- edu_summary |>
p_primary filter(education == "Primary") |>
pull(proportion)
<- income_summary |>
p_low_income filter(income == "Low") |>
pull(proportion)
# calculate probability of both Primary education AND Low income
<- d_bl |>
p_both filter(education == "Primary" & income == "Low") |>
nrow() / nrow(d_bl)
# addition rule
<- p_primary + p_low_income - p_both p_either
- P(Primary education) = 0.3702
- P(Low income) = 0.2044
- P(Both) = 0.0552
- P(Either) = 0.5193
Multiplication rule
For independent events, the probability of both events occurring is the product of their individual probabilities:
\[P(A \cap B) = P(A) \times P(B)\]
For dependent events, we need to account for the conditional probability:
\[P(A \cap B) = P(A) \times P(B|A)\]
Let’s check if education and income are independent by comparing the observed joint probability with the product of marginal probabilities. We will first use group_by()
and summarise()
to create Table 3.
<- d_bl |>
edu_income_table group_by(education, income) |>
summarise(
n = n(),
proportion = n / nrow(d_bl),
percent = round(proportion * 100, 1),
.groups = "drop"
)
kable(edu_income_table)
education | income | n | proportion | percent |
---|---|---|---|---|
Primary | High | 18 | 0.0994475 | 9.9 |
Primary | Low | 10 | 0.0552486 | 5.5 |
Primary | Medium | 39 | 0.2154696 | 21.5 |
Secondary | High | 15 | 0.0828729 | 8.3 |
Secondary | Low | 20 | 0.1104972 | 11.0 |
Secondary | Medium | 44 | 0.2430939 | 24.3 |
University | High | 8 | 0.0441989 | 4.4 |
University | Low | 7 | 0.0386740 | 3.9 |
University | Medium | 20 | 0.1104972 | 11.0 |
# Check independence for Primary education and Low income
<- p_primary * p_low_income
p_primary_indep <- p_both p_primary_dep
- If independent: P(Primary ∩ Low income) = 0.075669
- Observed: P(Primary ∩ Low income) = 0.055249
- Difference = 0.020421
Keep in mind that this is simulated data, so the numbers may not represent the real world. Nonetheless, if we observed a result like this, what would we conclude?
Conditional probability
The probability of event B occurring given that event A has occurred:
\[P(B|A) = \frac{P(A \cap B)}{P(A)}\]
Let’s calculate the probability of having “Low” income given that someone has “Primary” education:
<- p_both / p_primary p_low_given_primary
- P(Low income | Primary education) = 0.1493