Session 13: Project work I

After five weeks of statistics training (already!), we would like to switch gears a bit. Instead of introducing new topics, we would like to ask you to apply what you have learned so far by conducting analyses on data you are familiar with.

We are aware that many of you have already worked with real Marcus data for some time and have likely implemented statistical methods that go beyond what we have covered so far in the training. This week’s assignment is therefore not necessarily intended to help you make direct progress on your project work. Rather, it should be viewed as an opportunity to practice the specific methods we have introduced, with a focus on simulation, bootstrapping, and permutation. We believe that these methods not only provide a general statistical framework that supports robust workflows and reliable results, but also help strengthen statistical reasoning. In turn, this can deepen your analytic understanding and help you identify appropriate methods for future data analysis challenges.

We advise you to work with your own project data as much as possible. However, if some of the exercises described below cannot be carried out using your data, for example because your dataset contains too few variables, lacks suitable variation, or does not include the types of variables required for a particular method, please feel free to use datasets available in R packages. Examples of such datasets include sleep (effects of sleep medication) and PlantGrowth (group comparisons).

Use Quarto to organize your analyses, combining code, results, and reflections in a single reproducible document. Please prepare a clear final report with figures and key conclusions, and be ready to present your work next Wednesday.

  1. For a numeric variable in your data, calculate its mean and standard deviation, and then use simulation to estimate a 95% confidence interval for the mean.
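     A minimal sketch of this simulation in R, using placeholder data (replace `x` with a numeric column from your own dataset):

     ```r
     set.seed(42)
     x <- rnorm(100, mean = 50, sd = 10)  # placeholder numeric variable

     m <- mean(x)
     s <- sd(x)
     n <- length(x)

     # Parametric simulation: repeatedly draw samples of size n from a normal
     # distribution with the observed mean and SD, and record each sample mean.
     sim_means <- replicate(10000, mean(rnorm(n, mean = m, sd = s)))

     # 95% confidence interval from the 2.5th and 97.5th percentiles
     ci <- quantile(sim_means, c(0.025, 0.975))
     ci
     ```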
  1. Many classical parametric statistical tests (e.g., t-tests) make certain assumptions about the data. One key assumption is that observations are drawn from a particular distribution, commonly a normal distribution. When data violate this assumption, inferences drawn from the test statistic may be invalid. As you inspect your dataset, perhaps plotting histograms to gauge the normality of continuous variables, you may observe a left- or right-tailed skew. How much skewness is acceptable (i.e., "normalish") before the assumption is violated? (Other aspects of normality, such as tail density, are also relevant.) This parametric assumption is one reason this training has encouraged non-parametric resampling techniques (bootstrapping and permutation), which place no such requirement on the data.
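     Since base R has no built-in skewness function, here is a minimal sketch of sample skewness. (A common rule of thumb, not a hard threshold, is that an absolute skewness below roughly 1 is still "normalish".)

     ```r
     # Sample skewness: mean of cubed standardized values
     skewness <- function(x) {
       z <- (x - mean(x)) / sd(x)
       mean(z^3)
     }

     set.seed(1)
     sym  <- rnorm(1000)  # roughly symmetric -> skewness near 0
     skew <- rexp(1000)   # right-skewed -> clearly positive skewness
     skewness(sym)
     skewness(skew)
     ```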
  1. Select two numeric variables in your data and calculate their covariance.
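     For example, comparing R's built-in `cov()` against the textbook definition, using placeholder data:

     ```r
     set.seed(7)
     x <- rnorm(50)
     y <- 2 * x + rnorm(50)  # placeholder variables with a positive association

     # Built-in sample covariance
     cov(x, y)

     # Manual check: sum of cross-deviations divided by n - 1
     sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
     ```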
  1. Select a binary grouping variable and two numeric variables in your data. If your grouping variable has three or more levels, filter your data down to two of them. Likewise, if you don't have a grouping variable, you can discretize a numeric variable (e.g., a median split, with values at or below the median assigned to Group 1 and values above it to Group 2).
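     A sketch of the median-split fallback, assuming a hypothetical data frame `d` with a numeric column `score`:

     ```r
     set.seed(3)
     d <- data.frame(score = rnorm(40), outcome = rnorm(40))  # placeholder data

     # Values at or below the median go to Group 1, values above to Group 2
     d$group <- ifelse(d$score <= median(d$score), "Group 1", "Group 2")
     table(d$group)
     ```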
  1. Select a binary grouping variable in your data and a numeric outcome variable. Calculate the difference in means between the two groups. Next, perform a permutation test by randomly shuffling the group labels many times and recalculating the mean difference for each shuffled dataset.
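     A minimal sketch of this permutation test, using simulated placeholder data:

     ```r
     set.seed(11)
     d <- data.frame(
       group = rep(c("A", "B"), each = 30),
       y     = c(rnorm(30, mean = 0), rnorm(30, mean = 0.5))
     )

     # Observed difference in group means
     obs_diff <- mean(d$y[d$group == "A"]) - mean(d$y[d$group == "B"])

     # Shuffle the group labels many times, recomputing the mean difference
     perm_diffs <- replicate(10000, {
       g <- sample(d$group)
       mean(d$y[g == "A"]) - mean(d$y[g == "B"])
     })

     # Two-sided p-value: proportion of shuffled differences at least as extreme
     p_val <- mean(abs(perm_diffs) >= abs(obs_diff))
     p_val
     ```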
  1. To highlight the flexibility of resampling techniques in computing different test statistics, we'd like to explore a test of differences in proportions. The classical counterpart is a one- or two-sample z-test of proportions. Choose a binary grouping variable and a binary response variable (both taking values of 0 and 1), calculate the observed difference in proportions between the two groups, and build a permutation distribution for it. If you don't have variables of this type in your data, either discretize a continuous numeric variable or switch to an R dataset of your choosing.
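     One way to sketch a permutation test for a difference in proportions, with placeholder 0/1 data (the mean of a 0/1 variable is a proportion):

     ```r
     set.seed(5)
     d <- data.frame(
       group = rep(c(0, 1), each = 50),
       resp  = c(rbinom(50, 1, 0.4), rbinom(50, 1, 0.6))  # placeholder responses
     )

     # Observed difference in proportions between the two groups
     obs <- mean(d$resp[d$group == 1]) - mean(d$resp[d$group == 0])

     # Permutation distribution: shuffle group labels, recompute the difference
     perm <- replicate(10000, {
       g <- sample(d$group)
       mean(d$resp[g == 1]) - mean(d$resp[g == 0])
     })

     # Two-sided p-value
     p_val <- mean(abs(perm) >= abs(obs))
     p_val
     ```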
  1. Select the same binary grouping variable and a numeric outcome variable in your data, and calculate Cohen's d as a standardized measure of the difference between the two groups.
  1. Using the same binary grouping variable and numeric outcome as in the previous exercise, generate a bootstrapped sampling distribution of Cohen’s d by repeatedly resampling within each group and recalculating the effect size for each iteration. Examine the resulting distribution and calculate the median as well as the 25th and 75th percentiles, which represent the interquartile range of plausible effect sizes.
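     A sketch of this bootstrap, resampling within each group and using the pooled-standard-deviation version of Cohen's d (placeholder data):

     ```r
     set.seed(9)
     g1 <- rnorm(40, mean = 0)    # placeholder outcome, group 1
     g2 <- rnorm(40, mean = 0.6)  # placeholder outcome, group 2

     # Cohen's d with a pooled standard deviation
     cohens_d <- function(a, b) {
       sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
                    (length(a) + length(b) - 2))
       (mean(a) - mean(b)) / sp
     }

     # Resample within each group, recomputing the effect size each iteration
     boot_d <- replicate(5000, {
       cohens_d(sample(g1, replace = TRUE), sample(g2, replace = TRUE))
     })

     # Median and interquartile range of plausible effect sizes
     quantile(boot_d, c(0.25, 0.50, 0.75))
     ```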
  1. Let’s revisit the idea of the winner’s curse we discussed last week.