Session 14: One-way ANOVA

So far, the statistical modeling we have learned about has focused on comparing two groups. But what if you have more than two groups you would like to compare? This is where analysis of variance, or ANOVA, comes in.

Start by watching this video from zedstatistics (another nice resource from a different part of the world). You can skip the last few minutes of the video on how to use Excel (we would never!) to run an ANOVA.

We will explore some ANOVA concepts using three data sets that can be generated using this code:

library(tidyverse)

d1 <- tibble(values = c(2.95, 5.12, 5.68, 5.43, 5.12, 
                        2.81, 1.62, 2.51, 1.94, 3.92, 
                        5.72, 6.81, 4.88, 4.67, 
                        1.81, 4.32, 3.09, 3.02), 
             group = c(rep(c("A", "B"), each = 5), 
                       rep(c("C", "D"), each = 4)))

d2 <- tibble(values = c(4.76, 2.76, 4.31, 7.48, 4.99, 
                        3.29, 4.71, 1.07, 1.17, 
                        5.72, 4.31, 4.23, 7.81, 
                        4.27, 3.93, -0.08, 3.32, 3.86), 
             group = c(rep("A", 5),
                       rep(c("B", "C"), each = 4),
                       rep("D", 5)))

d3 <- tibble(values = c(5.86, 3.91, 5.2, 3.52, 
                        4.84, 3.03, 3.27, 2.92, 
                        5.36, 4.06, 4.21, 4.58, 2.69, 
                        4.23, 4.77, 4.19, 3.01, 2.27),
             group = c(rep(c("A", "B"), each = 4), 
                       rep(c("C", "D"), each = 5)))
  1. Copy the code and run it on your machine. Create a plot for each tibble (or one plot with three facets, if you prefer) showing the mean and standard deviation of each group, using geom_pointrange(). Looking at these plots, see if you can guess which data set will have the smallest and which the largest F-statistic.
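In case you want a nudge, here is a minimal sketch for d1 (the same pattern, or bind_rows() plus facet_wrap(), covers the other two tibbles; d1 is repeated so the chunk runs on its own):

```r
library(tidyverse)

# d1 repeated from above so this chunk runs on its own
d1 <- tibble(values = c(2.95, 5.12, 5.68, 5.43, 5.12, 
                        2.81, 1.62, 2.51, 1.94, 3.92, 
                        5.72, 6.81, 4.88, 4.67, 
                        1.81, 4.32, 3.09, 3.02), 
             group = c(rep(c("A", "B"), each = 5), 
                       rep(c("C", "D"), each = 4)))

# mean and sd per group, then a point for the mean with a line spanning +/- 1 sd
d1_summary <- d1 |> 
  group_by(group) |> 
  summarise(m = mean(values), s = sd(values))

p <- ggplot(d1_summary, aes(x = group, y = m, ymin = m - s, ymax = m + s)) +
  geom_pointrange()
```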
  1. Calculating the F-statistic took a fair amount of work, but now that we have gone through the different parts, we can put it all in a function and never have to write that code again. Complete the function below by filling in the blanks. Note that since TSS = SSW + SSB, we actually don’t need to calculate all three of these.
f_stat <- function(data, values_col, group_col){
  tss <- data |> 
    mutate(m = mean({{values_col}}), 
           dev = ____, 
           sq = ____) |> 
    summarise(sum(sq)) |> 
    pull()
  
  ssw <- data |> 
    group_by({{group_col}}) |> 
    mutate(m = ____, 
           dev = ____, 
           sq = ____) |> 
    summarise(ssq = ____) |> 
    summarise(sum(ssq)) |> 
    pull()
  
  ssb <- tss - ____
  
  n_gr <- data |> 
    group_by(____) |> 
    n_groups()
  
  n_obs <- data |> 
    count() |> 
    pull()
  
  tibble(f = (____ / (n_gr - 1)) / (____ / (n_obs - n_gr)))
}
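If you get stuck (or want to check your answer afterwards), here is one possible way to fill in the blanks: the deviations from the relevant mean are squared and summed, and since TSS = SSW + SSB, SSB falls out as the difference.

```r
library(dplyr)
library(tibble)

# one possible completion of the skeleton above
f_stat <- function(data, values_col, group_col){
  # total sum of squares: squared deviations from the grand mean
  tss <- data |> 
    mutate(m = mean({{values_col}}), 
           dev = {{values_col}} - m, 
           sq = dev^2) |> 
    summarise(sum(sq)) |> 
    pull()
  
  # within-group sum of squares: squared deviations from each group mean
  ssw <- data |> 
    group_by({{group_col}}) |> 
    mutate(m = mean({{values_col}}), 
           dev = {{values_col}} - m, 
           sq = dev^2) |> 
    summarise(ssq = sum(sq)) |> 
    summarise(sum(ssq)) |> 
    pull()
  
  # since TSS = SSW + SSB, the between-group sum of squares is the difference
  ssb <- tss - ssw
  
  n_gr <- data |> 
    group_by({{group_col}}) |> 
    n_groups()
  
  n_obs <- data |> 
    count() |> 
    pull()
  
  tibble(f = (ssb / (n_gr - 1)) / (ssw / (n_obs - n_gr)))
}
```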
  1. In the zedstatistics video, Justin skips over the part where an F-statistic is turned into a p-value. But that is not an issue for us, because we have a statistical inference superpower that lets us calculate p-values in almost any situation. Yes, we are referring to permutation. Before we get to that, however, spend some time thinking about how you would state the null hypothesis for an ANOVA test and what the null distribution represents in this case. Then calculate the F-statistic for each permuted data set:
fs <- perms |>
  map(\(i) f_stat(i, values, shuffled)) |> 
  list_rbind()
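The perms object is not created in this excerpt; one way to build it is to make many copies of the data with the group labels reshuffled (d1 is repeated here so the chunk runs on its own; the shuffled labels go in a column called shuffled, matching the f_stat() call above):

```r
library(dplyr)
library(purrr)
library(tibble)

# d1 as defined at the top of this session
d1 <- tibble(values = c(2.95, 5.12, 5.68, 5.43, 5.12, 
                        2.81, 1.62, 2.51, 1.94, 3.92, 
                        5.72, 6.81, 4.88, 4.67, 
                        1.81, 4.32, 3.09, 3.02), 
             group = c(rep(c("A", "B"), each = 5), 
                       rep(c("C", "D"), each = 4)))

# 1000 copies of d1, each with the group labels randomly reshuffled;
# under the null hypothesis the labels are exchangeable
perms <- map(1:1000, \(i) d1 |> mutate(shuffled = sample(group)))
```

The permutation p-value is then the proportion of permuted F-statistics at least as large as the observed one, e.g. mean(fs$f >= f_stat(d1, values, group)$f).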
  1. Use the same permutation method to investigate whether the mean flipper length of Adelie penguins differs significantly depending on which island they come from.
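A possible starting point, assuming the palmerpenguins package is installed (the missing flipper measurements need dropping before you shuffle the island labels):

```r
library(dplyr)
library(palmerpenguins)

# Adelie penguins only, keeping the two columns we need
adelie <- penguins |> 
  filter(species == "Adelie", !is.na(flipper_length_mm)) |> 
  select(flipper_length_mm, island)

# from here, shuffle `island` many times and compare the permuted
# F-statistics to the observed one, just as above
```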
  1. Recently, we introduced the idea of effect size, a measure of the strength or magnitude of a relationship. When comparing two group means, the most commonly used effect size metric is Cohen’s d. In the ANOVA case, the go-to effect size is eta-squared (\(\eta^2\)), the proportion of the total variation explained by group membership. Check out this short video.
eta_sq <- function(data, values_col, group_col){
  tss <- ____
  
  ssw <- ____
  
  ssb <- ____ - ____
  
  tibble(eta_sq = ____ / tss)
}
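One way to fill in these blanks, reusing the sums of squares from f_stat(); since \(\eta^2\) = SSB / TSS, only TSS and SSW need to be computed:

```r
library(dplyr)
library(tibble)

# one possible completion of the skeleton above
eta_sq <- function(data, values_col, group_col){
  # total sum of squares
  tss <- data |> 
    mutate(dev = {{values_col}} - mean({{values_col}}), 
           sq = dev^2) |> 
    summarise(sum(sq)) |> 
    pull()
  
  # within-group sum of squares
  ssw <- data |> 
    group_by({{group_col}}) |> 
    mutate(dev = {{values_col}} - mean({{values_col}}), 
           sq = dev^2) |> 
    summarise(ssq = sum(sq)) |> 
    summarise(sum(ssq)) |> 
    pull()
  
  ssb <- tss - ssw
  
  tibble(eta_sq = ssb / tss)
}
```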
  1. Now that we know what effect size metric to use when working with ANOVAs, we can calculate statistical power. We help you with the code for this. First, here is a way to simulate data with specified values for total sample size, \(\eta^2\), and number of groups. Note that the rnorm_exact() function is just a helper; the main work is done by sim_eta().
rnorm_exact <- function(n, mean, sd) {
  # rescale a random normal sample so its sample mean and sd are exactly `mean` and `sd`
  mean + sd * as.vector(scale(rnorm(n)))
}

sim_eta <- function(total_n, eta_sq, n_grps){
  # fix the total sum of squares at 100 and split it according to eta_sq
  ssqw <- 100 * (1 - eta_sq)
  ssqb <- 100 * eta_sq
  # draw group means whose spread reflects the between-group sum of squares
  sample_means <- rnorm_exact(n = n_grps, 
                              mean = 0, 
                              sd = sqrt(ssqb / (total_n * ((n_grps - 1) / n_grps))))
  
  # for each group mean, draw that group's observations with the within-group spread
  map(sample_means, \(i){
    d <- rnorm(n = total_n / n_grps, 
               mean = i, 
               sd = sqrt(ssqw / (total_n - n_grps)))
    
    tibble(values = d)
  }) |> 
    list_rbind(names_to = "group") |> 
    mutate(group = as.factor(group))
}

This code may look a bit intimidating, but the concept is actually quite simple: we create a number of samples equal to n_grps, and set the sample means and standard deviations to values such that the true population \(\eta^2\) equals the eta_sq argument of the function.
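To see it in action, a short usage sketch (the two helpers are repeated here so the chunk runs on its own):

```r
library(dplyr)
library(purrr)
library(tibble)

# helpers repeated from above
rnorm_exact <- function(n, mean, sd) {
  mean + sd * as.vector(scale(rnorm(n)))
}

sim_eta <- function(total_n, eta_sq, n_grps){
  ssqw <- 100 * (1 - eta_sq)
  ssqb <- 100 * eta_sq
  sample_means <- rnorm_exact(n = n_grps, 
                              mean = 0, 
                              sd = sqrt(ssqb / (total_n * ((n_grps - 1) / n_grps))))
  map(sample_means, \(i){
    tibble(values = rnorm(n = total_n / n_grps, 
                          mean = i, 
                          sd = sqrt(ssqw / (total_n - n_grps))))
  }) |> 
    list_rbind(names_to = "group") |> 
    mutate(group = as.factor(group))
}

set.seed(14)
# 60 observations split over 3 groups, with a population eta-squared of 0.1
d <- sim_eta(total_n = 60, eta_sq = 0.1, n_grps = 3)
count(d, group)   # three groups of 20 rows each
```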

Great, now that we have a collection of data sets, the next step is to analyze them all using an appropriate statistical model. We show you how this can be done below. Here we create a tibble storing \(\eta^2\) and p-values for each of the 10,000 data sets analyzed. We use the aov() function to run the ANOVAs (read more about the function here). As mentioned last week, we do not rely on permutation for significance testing in this case because it would take too long to run. Further, we use the eta_squared() function from the rstatix package to calculate \(\eta^2\). The rstatix package can be worth looking at more closely as it “provides a simple and intuitive pipe-friendly framework, coherent with the ‘tidyverse’ design philosophy, for performing basic statistical tests”. Read more here.

library(rstatix)
sims_ep <- sims |> 
  map(\(i) {
    fit <- aov(values ~ group, data = i)
    tibble(eta = eta_squared(fit), 
           p = summary(fit)[[1]][["Pr(>F)"]][1])
  }) |> 
  list_rbind()
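The code above stops just short of the power number: once sims_ep exists, power is simply the rejection rate, mean(sims_ep$p < 0.05). As a self-contained miniature of the whole idea (plain rnorm() data with one shifted group mean standing in for sim_eta(), and only 200 simulations to keep it quick):

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(42)

# 200 simulated data sets: three groups of 20, one group mean shifted by 0.8
sims_small <- map(1:200, \(i) {
  tibble(values = c(rnorm(40, mean = 0), rnorm(20, mean = 0.8)),
         group = rep(c("A", "B", "C"), each = 20))
})

# p-value from a one-way ANOVA for each simulated data set
p_vals <- map_dbl(sims_small, \(d) {
  fit <- aov(values ~ group, data = d)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# estimated power: the proportion of simulations rejecting the null at alpha = .05
mean(p_vals < 0.05)
```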
  1. When we extracted the p-values that went into the sims_ep object, we used this ugly line of code: p = summary(fit)[[1]][["Pr(>F)"]][1]. Oftentimes, the output from functions that run statistical models in R is stored in a way that makes it a bit tricky to extract what you want. Luckily, there is a package called broom that helps you clean up model output. Run an ANOVA using the aov() function and use the tidy() function as shown here to pull out the model output in a tidy format.
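A sketch of what that might look like (d1 from the top of the session is repeated so the chunk runs on its own; tidy() turns the ANOVA table into a tibble with the F-statistic and p-value as ordinary columns):

```r
library(broom)
library(tibble)

# d1 as defined at the top of this session
d1 <- tibble(values = c(2.95, 5.12, 5.68, 5.43, 5.12, 
                        2.81, 1.62, 2.51, 1.94, 3.92, 
                        5.72, 6.81, 4.88, 4.67, 
                        1.81, 4.32, 3.09, 3.02), 
             group = c(rep(c("A", "B"), each = 5), 
                       rep(c("C", "D"), each = 4)))

fit <- aov(values ~ group, data = d1)

# one row per term (group, Residuals), with columns such as statistic and p.value
tidy(fit)
```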