Session 8: Variance & Standard Deviation

This week we begin our statistics adventure by learning about population and estimated parameters (Video 1), variance and standard deviation (Video 2), and why dividing by N underestimates the variance (Video 3).

You will first study the theory through YouTube videos (this week all three are from StatQuest). Then you will work through exercises that use the R language to demonstrate the topics in a more “hands-on” way, which we hope will cement the ideas presented in the videos. We have designed the exercises to help you think statistically and to generate discussion points. We will use the Wednesday group meetings to discuss the exercises and how the conclusions drawn from them relate to the theory presented in the videos.

Let’s jump in and get started with Video 1 (Population and Estimated Parameters, Clearly Explained!!!).

  1. Probably the most important concept introduced in Video 1 is that, when we estimate population parameters from samples, the sample size affects how confident we can be in the accuracy of the estimates. To illustrate this point, take a look at the code below. This code chunk draws five random samples, each of size 5, from a normally distributed population with population parameters mean = 20 and sd = 10 (the values Josh uses in the video).
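library(tidyverse)  # provides map(), list_rbind(), and tibble()

# Draw five samples of size 5 from a normal population (mean = 20, sd = 10)
five_samples <- map(1:5, \(i) tibble(values = rnorm(n = 5,
                                                    mean = 20,
                                                    sd = 10))) |>
  list_rbind(names_to = "sample")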
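One way to see the effect of sample size is to compute the estimates separately for each sample and compare them with the population values (mean = 20, sd = 10). The following is a minimal sketch, not part of the original exercise: the seed value and the names sample_estimates, est_mean, and est_sd are assumptions chosen for illustration.

```r
library(tidyverse)

set.seed(8)  # arbitrary seed (an assumption), only to make the sketch reproducible

# Same sampling scheme as above: five samples of size 5
five_samples <- map(1:5, \(i) tibble(values = rnorm(n = 5,
                                                    mean = 20,
                                                    sd = 10))) |>
  list_rbind(names_to = "sample")

# One row per sample, holding that sample's estimated mean and sd
sample_estimates <- five_samples |>
  group_by(sample) |>
  summarize(n = n(),
            est_mean = mean(values),
            est_sd = sd(values))

sample_estimates
```

With only five observations per sample, the estimates typically vary noticeably from one sample to the next. Re-running the sketch with a larger n in rnorm() is a simple way to explore how that spread changes.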
  1. In Video 1, Josh showed how we can use a population distribution (histogram) to calculate probabilities and statistics. The example in the video used a population of 240 billion liver cells. That is a lot of observations, and even modern computers struggle a bit when working with so many values. In practice, we often rely on the assumption that a sample of around 10,000 observations yields estimates close to the true population parameters.

    Generate a large sample (n = 10,000) using the same population parameters as in Exercise 1. Name the object samp_10k. Do you need to use map() in this case? Why/why not?

We will now move on to Video 2 (Calculating the Mean, Variance and Standard Deviation, Clearly Explained!!!).

  1. What does the fun_x() function in the code chunk below calculate? Explain what each line of the code is doing.
fun_x <- function(x){
  dev <- x - mean(x)
  dev_sq <- dev ^ 2
  s_dev_sq <- sum(dev_sq)
  s_dev_sq / length(x)
}
  1. Using the tibble() function, create a tibble with two columns, x and y, each containing six observations with no duplicated values, following the instructions below.
tibble(x = c(1, -1, -3, 3, _____, _____))

We will now go a bit deeper into the theory of variance, focusing in particular on why we divide by n − 1 when estimating the population variance. Watch Video 3 (Why Dividing by N Underestimates the Variance). In this video, Josh does a great job explaining why using the sample mean leads to an underestimation of the variance, but the explanation is somewhat theoretical. We can use a simple simulation to demonstrate empirically that dividing by n − 1 performs better than dividing by n (and other alternatives we might want to try).
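As a compact companion to the argument in the video, the identity below sketches why measuring deviations from the sample mean makes the sum of squares too small on average. It assumes independent observations $x_1, \dots, x_n$ from a population with mean $\mu$ and variance $\sigma^2$, with $\bar{x}$ denoting the sample mean. This notation is introduced here and is not taken from the video.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar{x} - \mu)^2 \\
\mathbb{E}\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n - 1)\sigma^2
\end{align*}
```

The second line uses $\mathbb{E}[(x_i - \mu)^2] = \sigma^2$ and $\mathbb{E}[(\bar{x} - \mu)^2] = \sigma^2 / n$. Dividing the sum of squares by $n$ therefore gives $\frac{n - 1}{n}\sigma^2$ on average, whereas dividing by $n - 1$ gives $\sigma^2$.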

  1. Study the function in the code chunk below and explain what each part does. Fill in the blanks so that the samples_df object contains 10,000 samples, each of size 10, drawn from a population with a mean of 5 and a variance of 4. Next, complete the code so that the five columns generated in summarize() use values of a ranging from −2 to 2. What does the output tell us? Finally, visualize the results.
var_a <- function(x, a){
  ssq <- sum((x - mean(x)) ^ 2)
  ssq / (length(x) + a)
}

samples_df <- map(1:_____, \(i) tibble(values = rnorm(n = _____, 
                                                      mean = _____, 
                                                      sd = _____))) |> 
  list_rbind(names_to = "sample")

samples_df |>
  group_by(sample) |>
  summarize(n_minus_2 = var_a(values, a = _____),
            n_minus_1 = var_a(values, a = _____),
            n = var_a(values, a = _____),
            n_plus_1 = var_a(values, a = _____),
            n_plus_2 = var_a(values, a = _____)) |>
  summarize(across(2:6, mean))
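Before running the full simulation, it can help to check by hand how a shifts the denominator in var_a(). The sketch below applies the function to a tiny vector and compares the result with base R's var(), which divides by n − 1. The vector toy_x is a hypothetical example made up for illustration, not course data.

```r
# Same definition as in the exercise above
var_a <- function(x, a){
  ssq <- sum((x - mean(x)) ^ 2)
  ssq / (length(x) + a)
}

toy_x <- c(2, 4, 6)   # hypothetical toy data: mean 4, sum of squared deviations 8

var_a(toy_x, a = -1)  # 8 / (3 - 1) = 4
var_a(toy_x, a = 0)   # 8 / 3, about 2.67
var_a(toy_x, a = 1)   # 8 / (3 + 1) = 2

var(toy_x)            # base R divides by n - 1, so this is also 4
```

The sum of squares is identical in every call, so only the denominator changes the result: a larger a gives a smaller estimate.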