Session 9: Covariance & Correlation

This week we will continue our statistics adventure by learning about covariance (video 1) and Pearson’s correlation (video 2). These concepts build directly on variance and standard deviation, helping us quantify the relationship between two variables.

As before, after watching the videos, we’ll cement the ideas through a set of exercises in R. We’ll use our Thursday group meeting to discuss results and interpretations.

Let’s begin by watching Video 1 (Covariance, Clearly Explained!!!)

Up to this point, we have been considering population estimates from a single variable, e.g. the expression level of gene X mentioned in Josh’s video. Such estimates are helpful: by knowing how a particular variable is distributed, we can determine the probability of sampling particular values from the population. But we’re often interested in identifying the relationships between different variables: how does rate of plant growth vary with respect to watering frequency and volume? Covariance tells us the extent to which two variables align in how they vary around their respective means.
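Before the exercises, here is a quick, self-contained sketch of what the sign of a covariance tells you. The variable names (and the plant-growth framing) are invented for illustration:

```r
# Toy illustration: covariance is positive when two variables tend to sit on
# the same side of their means together, negative when they sit on opposite sides.
set.seed(1)
watering <- rnorm(100, mean = 10, sd = 2)         # e.g. weekly watering volume
growth   <-  0.5 * watering + rnorm(100, sd = 1)  # growth tracks watering
wilting  <- -0.5 * watering + rnorm(100, sd = 1)  # wilting runs opposite

cov(watering, growth)   # positive: the variables vary together
cov(watering, wilting)  # negative: the variables vary in opposition
```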

  1. Let’s explore this by first generating two random variables \(x_1\) and \(x_2\), where each row in the dataset is an observation.

library(tidyverse)

# Draw two independent random variables, one observation per row
generate_data <- function(n1, mu1, sd1, n2, mu2, sd2) {
  tibble(
    x1 = rnorm(n1, mu1, sd1),
    x2 = rnorm(n2, mu2, sd2)
  )
}
  1. Fill in this function to calculate the covariance between two variables. How does it differ from the calculation of variance? If you wanted to use this function to calculate the variance of, say, x1, what might you input for the function’s var1 and var2 arguments?
find_cov <- function(var1,var2) {
  diffs_var1 <- __________
  diffs_var2 <- __________
  cov_var1_2 <- sum(______ * ______) 
  cov_var1_2 / (_____ - 1)
}
  1. Let’s create a new tibble where the second variable is related to the first but with an added bit of noise. Modify generate_data() so that, instead of drawing x2 as a separate random sample, it is created from x1 plus a bit of random noise: x2 = x1 + rnorm(n = n1, mean = 0, sd = 1). (Using n1 keeps the noise vector the same length as x1.)
cov_scale <- function(dat, scale = 1, mean_shift = 0) {
  # summarize() already returns a tibble, so no extra wrapping is needed
  dat |> summarize(
    scale = scale,
    mean_shift = mean_shift,
    cov_xy = find_cov(dat[[1]], dat[[2]] * scale + mean_shift)
  )
}
# df is the tibble you built above with your modified generate_data()
cov_scaling <- map(seq(1, 100, 10),
                   \(x) cov_scale(df, scale = x)) |>
  list_rbind()
cov_meanshift <- map(seq(1, 100, 10),
                     \(x) cov_scale(df, mean_shift = x)) |>
  list_rbind()
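A property worth anticipating before you run the experiment above: the sample covariance is linear in each variable and unaffected by mean shifts, i.e. cov(x, a*y + b) = a*cov(x, y) exactly. A quick sketch with base R’s cov():

```r
# Sample covariance scales with multiplicative rescaling of either variable
# and ignores additive shifts; this holds exactly, not just approximately.
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100)

all.equal(cov(x, 3 * y + 7), 3 * cov(x, y))  # TRUE: scaled by 3, shift ignored
all.equal(cov(x, y + 100),       cov(x, y))  # TRUE: a pure mean shift
```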

Correlation

  1. Let’s continue the StatQuest by watching Video 2 (Pearson’s Correlation, Clearly Explained!!!).
  • Josh explained how correlation offers a simple measure of the linear relationship between two variables, always falling between -1 and 1. To get an intuition for detecting linear correlation in data, let’s start with a fun correlation guessing game. Run the following code block and set my_guess to your estimate of the data’s correlation. Try re-running the code a few times until you feel confident estimating correlation.
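One possible version of the guessing game (a sketch; the exact block you were given may differ). It hides a random target correlation, plots the data, and lets you compare my_guess with the true sample correlation:

```r
library(tidyverse)

# Correlation guessing game (illustrative sketch): re-run, study the plot,
# set my_guess, then reveal the actual sample correlation.
true_r <- runif(1, min = -1, max = 1)              # hidden target correlation
x <- rnorm(100)
y <- true_r * x + sqrt(1 - true_r^2) * rnorm(100)  # cor(x, y) will be near true_r

ggplot(tibble(x, y), aes(x, y)) + geom_point()

my_guess <- 0  # <- replace with your estimate after viewing the plot
cor(x, y)      # the answer; compare it with my_guess
```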
  1. An important distinction to parse out is the difference between covariance and correlation. We saw above how scaling the data influenced the covariance between two variables, and how, once the variables were normalized, the covariance always fell between -1 and 1. You have just calculated correlation! Pearson’s correlation is simply the normalized covariance, or, said another way, the covariance between two z-scored variables.
  • To belabor this point further, take the function cov_scale and replace your normalized find_cov() function with cor(), R’s built-in correlation function. Compare the values with those you calculated using the standardized find_cov().
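This equivalence can also be checked directly with base R. scale() z-scores a vector (the [, 1] drops the matrix dimension it returns):

```r
set.seed(7)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Pearson's correlation equals the covariance of the z-scored variables
r_builtin <- cor(x, y)
r_zscored <- cov(scale(x)[, 1], scale(y)[, 1])
all.equal(r_builtin, r_zscored)  # TRUE
```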
  1. Finally, Josh highlighted the importance not just of correlation as an estimate of the relationship between two variables, but of the degree of confidence we have in that relationship. He showed how, for any two data points (where the standard deviation of both variables is greater than 0), the correlation will always equal 1 or -1.
  • Try this yourself: pass two vectors of length 2 into cor. For example, you might choose c(1,2) and c(-5,20).

  • As sample size increases, so does our confidence in the estimated correlation, even when that correlation is 0. We will explore this idea of confidence now.

  • Study the function below. Can you explain what each part of the function is doing? Copy the code into RStudio, set n, the number of observations, to 5, and run it, saving the output to an object. You don’t need to specify nruns; it defaults to 1000.

generate_corr_hist <- function(n, nruns = 1000) {
  # Repeat nruns times: draw two independent samples of size n,
  # then record the chance correlation between them
  chance_corrs <- map_dbl(1:nruns, \(i) {
    df <- generate_data(n1 = n, mu1 = 0, sd1 = 5,
                        n2 = n, mu2 = 0, sd2 = 5)
    cor(df[[1]], df[[2]])
  })

  tibble(corrs = chance_corrs)
}
  • Use ggplot to draw a histogram of the values, and a density plot with geom_density(). You can set the smoothing bandwidth with bw = to make the curve look smooth, as you did last week.
  • Now calculate the probability of finding a correlation value >= 0.5 using methods similar to those you used last week. Then calculate the probability of finding a correlation more extreme than 0.5 or -0.5. Make a note of these probabilities.
  • Now set n to 30 and rerun this process. You may wish to create a function to calculate the probability.
  • Set n to 100 and rerun it one final time. What do you notice about the shape of the histogram and density plots? How are the probability values changing and why is this occurring?
  • As a bonus exercise, create a plot that overlays all of the density plots onto one graph.