Session 11: Hypothesis testing, p-values & Permutation
This week we will introduce the idea of hypothesis testing, p-values, null distributions, and how to compare groups.
Before we get to the exercises, we believe it makes sense to cover, in one fell swoop, a lot of the theory you need for the rest of the materials on this page to make sense. Watch Video 1 (Hypothesis Testing and The Null Hypothesis, Clearly Explained!!!), Video 2 (p-values: What they are and how to interpret them) and Video 3 (Using Bootstrapping to Calculate p-values!!!).
Last week we discussed sampling distributions and how we can generate them through simulation and, more importantly, bootstrapping. A sampling distribution is a collection of values, typically statistics, that represents what we would expect to see if we went out into the world and collected many, many samples and calculated the statistic of interest for each one. In practice, we can almost never collect many, many samples, so we rely on computational techniques to approximate these distributions. Bootstrapping, a method that creates “fake” samples by sampling observations with replacement, does an impressive job of mimicking repeated sampling. By using bootstrapping to generate a sampling distribution, we can estimate quantities such as the standard error of the mean and construct confidence intervals.
Now, as described in this week’s videos, we will use a slightly modified version of bootstrapping to generate null distributions. In this setting, a null distribution can be viewed as a sampling distribution representing what we would expect to observe if we collected many, many samples under a true null scenario. The specific null scenario depends on the context of the analysis, but in its simplest form the null may specify a particular mean value against which we compare our observed mean. To produce a null distribution, we first shift our sample so that its mean equals the null value, and then apply bootstrapping to this shifted data to generate the distribution.
This null distribution represents the range of sample means we would expect to see if the null hypothesis were true. The null hypothesis is the formal statement of the assumed null scenario, for example that the population mean equals a particular value. We can then carry out statistical inference by asking what is the probability of observing a mean as extreme as the one in my sample if the null hypothesis is true as represented by the null distribution.
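To make the shift-then-bootstrap idea concrete, here is a minimal sketch using simulated data (deliberately not the penguins, so it does not give away the exercise below). The data, the null value, and all object names are illustrative assumptions:

```r
library(dplyr)
library(purrr)

set.seed(1)
observed   <- tibble(value = rnorm(30, mean = 0.5, sd = 1))  # pretend this is our sample
null_value <- 0                                              # H0: the population mean is 0

# Step 1: shift the sample so its mean equals the null value
shifted <- observed |>
  mutate(value = value - mean(value) + null_value)

# Step 2: bootstrap the shifted data to build the null distribution of the mean
null_means <- map_dbl(1:10000, \(i) mean(sample(shifted$value, replace = TRUE)))

# Step 3: two-sided p-value -- how often is a bootstrap mean at least as far
# from the null value as the observed mean is?
obs_mean <- mean(observed$value)
p_value  <- mean(abs(null_means - null_value) >= abs(obs_mean - null_value))
```

The same three steps apply whatever the null value is; only `null_value` changes.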
- Use your bootstrapping skills to test the hypothesis that the true bill length in Adelie penguins is zero millimeters. Clearly state the null hypothesis in terms of the population mean and decide how your sample would need to be adjusted to reflect this assumption. Generate a null distribution from the adjusted data, calculate the probability of observing a sample mean as extreme as the one in your original data, and state your conclusion in context.
What is the smallest non-zero p-value you can obtain if you run 10,000 iterations? In statistics, we rarely want to say that a probability is exactly zero. Consider a better way to report a p-value when your calculation produces a result of zero.
Run a test investigating the hypothesis that the true bill length in Adelie penguins is 38.5 millimeters and report the resulting p-value.
A little while ago, when we talked about covariance and correlation, we introduced the idea of chance correlations: a distribution of correlation coefficients generated by repeatedly sampling pairs of variables for which the true correlation is zero. You can think of this as going out into the real world and taking weight measurements from one sample of individuals and height measurements from a completely separate group. You can store these vectors side by side in a data frame, but any relationship you might find, for example calculating the correlation between weights in group 1 and heights in group 2, will of course be due to random chance alone (hence the term chance correlation). If we repeat this process of collecting unrelated data, we can create a null distribution, which, as we already know, can be used to calculate p-values.
- Instead of collecting large amounts of unrelated data, we can use simulation to approximate the process. Fill in the blanks in the code provided below to generate a correlation null distribution (a distribution of chance correlations). Plot the result.
sim_null_corr <- map(1:10000, \(i) tibble(x = rnorm(20, 0, 1),
                                          y = rnorm(20, 0, 1))) |>
  list_rbind(names_to = _____)

sim_null_corr |>
  group_by(_____) |>
  summarise(null_corr = cor(_____, _____))
In the code, you can change the mean and standard deviation of the populations from which `x` and `y` are sampled. Try changing these values and observe how the null distribution is affected.
How large, or how extreme, must a correlation coefficient be to be considered statistically significant if the sample size is 10, 100, or 1000? Hint: use the `quantile()` function, as we did last week.
Generating null distributions via simulation is a powerful tool for statistical inference, but what if you wanted to use a sample of observed values to create a correlation null distribution? We have already seen how this can be done when investigating whether the mean of a sample variable is different from zero (or any other value of interest). In that case, the mean of the sample was shifted, and a null distribution was generated using bootstrapping. However, it is not immediately clear how this same method could be used for something like the correlation. Simply shifting the means of the variables we want to compare will not work, since the correlation is not affected by the means. Instead, we want to repeatedly break the association between our variables of interest. As we discussed when talking about covariance and correlation, correlation can be viewed as a measure of how much two variables “rank together,” meaning that the order of the observations matters. By randomly shuffling the order of one or both of the variables, the association is broken, and if we repeat this process many times, calculating the correlation each time, we can generate a null distribution.
- Let’s use the method described above to generate a null distribution of the correlation between bill length and body mass in Chinstrap penguins. Study the code below, fill in the blanks, and run it. Then plot the resulting distribution.
chinstrap_no_missing <- penguins |>
  filter(species == _____,
         !is.na(_____),
         !is.na(_____))

break_assoc <- function(data, variable_to_shuffle){
  data |>
    mutate(shuffled = sample({{variable_to_shuffle}}, replace = FALSE))
}
iterations <- map(1:10000, \(i) break_assoc(_____, _____)) |>
  list_rbind(names_to = _____)

iterations |>
  group_by(_____) |>
  summarise(null_corr = cor(_____, _____))

When we bootstrapped, we sampled with replacement (`replace = TRUE`), but in this case we do not. Explain why.
Here we only shuffle one variable. Would the results change if we shuffled both `bill_length_mm` and `body_mass_g`? Why/why not?
Add the observed correlation to your plot of the distribution using `geom_vline()`. If this were a test for which you wanted to report the p-value, what would you report?
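As a sketch of how shuffled correlations turn into a p-value, here is a self-contained toy version in base R. The data and object names are placeholders, and the "+ 1" in the final line is one common convention for avoiding a reported p-value of exactly zero (tying back to the question earlier about the smallest non-zero p-value):

```r
set.seed(1)
x <- rnorm(50)
y <- 0.3 * x + rnorm(50)   # toy paired data with a real association
obs_corr <- cor(x, y)

# Null distribution: shuffle one variable to break the association,
# recomputing the correlation each time
null_corr <- replicate(10000, cor(sample(x), y))

# Two-sided p-value; the +1 counts the observed statistic itself,
# so the result can never be exactly zero
p_value <- (sum(abs(null_corr) >= abs(obs_corr)) + 1) / (length(null_corr) + 1)
```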
Congratulations! You just performed, perhaps without realizing it, your first permutation test. Permutation is another powerful and useful tool for statistical inference. The most common use case for permutation tests is comparing two groups, often by their means. There is a fantastic online visual explanation of how this works. Go through the webpage, we are confident you will enjoy it!
- Last week we saw that the confidence intervals for mean bill length in Chinstrap and Gentoo penguins overlapped by a small amount. Josh pointed out that even when confidence intervals overlap slightly, it is still possible for the mean difference to be statistically significant. It is now time to test this using permutation. This will follow the exact same idea as the alpaca example you just went through. We have already provided all the tools you need to do this yourself. Hint: the `break_assoc()` function will be helpful here, and when processing the iterations you will need to group by more than one variable. You will also want to use `summarize()` twice, once to calculate the group means and once to calculate the mean difference using the `diff()` function. Reach out to us if you get stuck; it is important that you complete the exercise.
Plot the null distribution and add the observed (initial) mean difference to the plot.
Calculate the p-value. What conclusions do you draw?
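The overall shape of a two-group permutation test can be sketched as follows, on simulated data so as not to solve the penguin exercise for you. This version shuffles the group labels directly rather than going through `break_assoc()`; the two routes are equivalent, since both break the association between group and response. All names here are illustrative:

```r
library(dplyr)
library(purrr)

set.seed(1)
dat <- tibble(y = c(rnorm(20, 0, 1), rnorm(20, 0.8, 1)),
              group = rep(c("A", "B"), each = 20))

# Test statistic: difference in group means
mean_diff <- function(d) {
  d |>
    group_by(group) |>
    summarise(m = mean(y)) |>
    summarise(d = diff(m)) |>
    pull(d)
}

obs_diff <- mean_diff(dat)

# Null distribution: shuffle the labels, recompute the statistic
# (1000 reps here to keep the sketch fast; use more in practice)
null_diff <- map_dbl(1:1000, \(i) mean_diff(mutate(dat, group = sample(group))))

p_value <- mean(abs(null_diff) >= abs(obs_diff))
```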
- Permutation tests are great and we argue that you should use them a lot when analyzing data. It is therefore good to have access to helper functions that facilitate this kind of work, and the `infer` package provides exactly that. Read this article (focus on the 2-sample t-test part, but if you want you can replicate your exercise 1 analysis by following the instructions at the top of the page) and run versions of the analysis you did for exercise 4 using the methods described.
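For orientation before you read the article, here is roughly what the `infer` workflow looks like for a two-group mean comparison. This is a sketch on simulated data; the article is the authoritative reference for the details:

```r
library(infer)
library(dplyr)

set.seed(1)
dat <- tibble(y = c(rnorm(20, 0, 1), rnorm(20, 0.8, 1)),
              group = rep(c("A", "B"), each = 20))

# Observed statistic
obs_diff <- dat |>
  specify(y ~ group) |>
  calculate(stat = "diff in means", order = c("A", "B"))

# Permutation null distribution
null_dist <- dat |>
  specify(y ~ group) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("A", "B"))

# p-value from the null distribution
null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")
```

`visualize(null_dist)` combined with `shade_p_value()` gives you the corresponding plot.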
The `shade_p_value()` and `get_p_value()` functions let you specify a `direction =` argument. Run `?shade_p_value()` and find in the help page what options you have regarding direction. Try the different options. How does the output relate to our discussion about greater than and more extreme? Josh does a good job explaining the difference between one-sided and two-sided p-values in this Video. There is no need to watch the whole video, just the last part starting at 20:12.
The `t_test()` function from the `infer` package lets you perform a theory-based t-test. Use the internet to learn about t-tests. What is a t-statistic and how is it calculated? What is a t-distribution and how does it relate to the distributions we have created using permutation? What are some pros and cons of t-tests vs permutation tests?
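As a starting point for your reading, here is one way to see where a t-statistic comes from: the mean difference divided by its estimated standard error, which combines both group variances and sample sizes (this is the Welch version, which base R's `t.test()` uses by default). The simulated data is illustrative:

```r
set.seed(1)
a <- rnorm(20, mean = 0, sd = 1)
b <- rnorm(20, mean = 0.8, sd = 1)

# Welch t-statistic: mean difference scaled by its standard error,
# which pools the per-group variances weighted by sample size
se     <- sqrt(var(a) / length(a) + var(b) / length(b))
t_stat <- (mean(a) - mean(b)) / se

# Base R's t.test() performs a Welch t-test by default,
# so its reported statistic should match t_stat
welch <- t.test(a, b)
```

Note that, unlike the raw mean difference we used as a permutation statistic, `t_stat` already carries the variance and sample-size information in its denominator.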
- The equation used to calculate a t-statistic includes both group standard deviations and sample sizes, but when comparing groups via a permutation test (in the examples we have looked at) the test statistic is just the mean difference. Since the results in terms of statistical significance are often very similar when comparing t-tests and permutation tests, it seems reasonable to assume that permutation tests are also sensitive to variance and sample size. Use the code below to simulate data and apply your permutation testing skills to generate a null distribution. Repeat the process and show how the standard deviations and sample sizes of the simulated data affect the null distribution (bonus points if you use iteration to try different values, find a way to summarize the most important information from the distributions, and visualize the results).
sim_data <- tibble(y = c(rnorm(20, 0, 1), rnorm(20, 0.5, 1)),
                   x = rep(c("A", "B"), each = 20))

- The output from the `t_test()` function includes the columns `estimate`, `lower_ci` and `upper_ci`. The values under these column names represent the mean difference and the confidence interval for the mean difference, respectively. Revisit your analysis from above, comparing mean bill length in Chinstrap and Gentoo penguins, this time using the `t_test()` function. Study the p-value and confidence interval. How do these compare in terms of statistical significance?
- Code up a bootstrapping method that calculates the confidence interval for the mean difference. You can reuse the `boot()` function from last week, but you will want to add a `group_by()` before `slice_sample()`. The key is to sample with replacement within each of the groups you want to compare. Run 10,000 iterations, and for each iteration and group calculate the mean. Then compute the difference in group means for all iterations, and finally use `quantile()` to find the appropriate percentiles. Compare your bootstrap confidence interval with the output from the `t_test()` function to make sure your code works as it should.
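The within-group resampling step is the part that usually trips people up, so here is a minimal sketch of the idea on simulated data. It does not reuse your `boot()` function; the function and data names here are placeholders, and it runs only 1,000 iterations to stay fast (the exercise asks for 10,000):

```r
library(dplyr)
library(purrr)

set.seed(1)
dat <- tibble(y = c(rnorm(20, 0, 1), rnorm(20, 0.8, 1)),
              group = rep(c("A", "B"), each = 20))

# One bootstrap replicate: resample with replacement *within* each group,
# then take the difference in group means
boot_diff <- function(d) {
  d |>
    group_by(group) |>
    slice_sample(prop = 1, replace = TRUE) |>
    summarise(m = mean(y)) |>
    summarise(d = diff(m)) |>
    pull(d)
}

diffs <- map_dbl(1:1000, \(i) boot_diff(dat))

# Percentile 95% confidence interval for the mean difference
ci <- quantile(diffs, probs = c(0.025, 0.975))
```

The `group_by()` before `slice_sample()` is what keeps the two group sizes fixed while resampling, which is exactly the point of the hint in the exercise.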