# An introduction to ANOVA

| September 22nd, 2014

## Working memory experiment

Throughout this chapter, we'll use data from the working memory experiment, which investigates the relationship between the number of training days and a change in IQ. There are four independent groups, each of which trained for a different period of time: 8, 12, 17, or 19 days. The independent variable is the number of training days and the dependent variable is the IQ gain.

Have a look at the wm data, which are already loaded in your workspace. Typing head(wm) in the console will display the first 6 rows. You should see the following variables:

• subject: a unique number identifying each subject
• condition (or just cond for short): number of training days (i.e. grouping condition)
• iq: change in IQ score

### Instructions

• After loading and looking at the dataset, provide some summary statistics for every individual group using the describeBy() function from the psych R package. Provide the data frame wm as the first argument and "condition" as the second argument.
• Use the boxplot() function to create a boxplot of iq as a function of cond. Title the boxplot "Boxplot", then label the x- and y-axes "Group (cond)" and "IQ", respectively. Here is an example of how to use the boxplot function:
boxplot(dependent_var ~ independent_var,
main = "Title", xlab = "Label here",
ylab = "Another label here")

library(psych) # Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days"))  # Summary statistics by group # Boxplot of iq versus cond  # Summary statistics by group describeBy(wm, wm$cond) # Boxplot of iq versus cond boxplot(wm$iq ~ wm$cond, main = "Boxplot", xlab = "Group (cond)", ylab = "IQ")  test_function("describeBy", args = c("x", "group"), not_called_msg = "Please use the <code>describeBy()</code> function to make a statistic summary. Type <code>?describeBy</code> in the console to get the help file.", incorrect_msg = c("Use the data frame <code>wm</code> in the <code>describeBy()</code> function.", "Did you set the grouping variable to condition, so did you use <code>wm$cond</code>? ")) test_function("boxplot", not_called_msg = "Please use the <code>boxplot()</code> function to generate a boxplot of IQ versus condition. Type <code>?boxplot</code> in the console to get the help file. Use the method for class <code>formula</code>.") test_student_typed("wm$iq ~ wm$c", not_typed_msg = "Did you use a formula as input argument? The formula should look like <code>x ~ group</code> where <code>x</code> is the dependent variable and <code>group</code> is the grouping variable.") test_function("boxplot", args = c("main", "xlab", "ylab"), incorrect_msg = c("Did you set the title of the <code>boxplot()</code> function correctly?", "Did you set the label of the x-axis of the <code>boxplot()</code> function correctly?", "Did you correctly set the y-label of the boxplot() function?")) success_msg("Well done! Look at the four trained groups in your boxplot. You should notice that the IQ increases as the amount of training sessions increases. Continue to the next exercise.") 

The first argument to boxplot() should be wm$iq ~ wm$cond, which says to plot iq as a function of cond.

## Generate density plot of the F-distribution

The test statistic associated with ANOVA is the F-test (or F-ratio). Recall that when carrying out a t-test, you computed an observed t-value, then compared that with a critical value derived from the relevant t-distribution. That t-distribution came from a family of t-distributions, each of which was defined entirely by its degrees of freedom.

ANOVA uses the same principle, but instead an observed F-value is computed and compared to the relevant F-distribution. That F-distribution comes from a family of F-distributions, each of which is defined by two numbers (i.e. degrees of freedom), which we'll refer to as df1 and df2. The F-distribution has a different shape than the t-distribution and in this exercise, you'll generate a few density plots of the F-distribution to help visualize this.

### Instructions

• Use the seq() function to create a variable x containing 200 numbers between 0 and 2. See ?seq if you need help with this.
• Use df() to evaluate the densities for seven different t-distributions at these values of x. The t-distributions should have the following degrees of freedom (df1,df2): (1,1), (3,1), (6,1), (3,3), (6,3), (3,6) and (6,6). For example, dt(x, 5, 5) would evaluate the t-distribution with degrees of freedom (5,5) at all 200 values of x.
• Create line plots of these densities, using different colors for each. Use plot() to create the first (with argument type = "l" for "line"), then use lines() to create the rest, so that the lines get added to the original plot instead of overwriting it.
• Add a legend to the topright corner of your plot with the title "F distributions". In the script, you only have to fill in the first two arguments: "topright" and title = "___".
library(psych) # Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days"))  # Create the vector x x <- seq(from = ___, to = ___, length = ___) # Evaluate the densities y_1 <- df(x, 1, 1) y_2 <- ___ y_3 <- ___ y_4 <- ___ y_5 <- ___ y_6 <- ___ y_7 <- ___ # Plot the densities plot(x, y_1, col = 1, type = "l") lines(x, ___, col = 2) lines(x, ___, col = 3) lines(x, ___, col = 4) lines(x, ___, col = 5) lines(x, ___, col = 6) lines(x, ___, col = 7) # Add the legend legend(___, title = ___, c("df = (1,1)", "df = (3,1)", "df = (6,1)", "df = (3,3)", "df = (6,3)", "df = (3,6)", "df = (6,6)"), col = c(1, 2, 3, 4, 5, 6, 7), lty = 1) # Create the vector x x <- seq(from = 0, to = 2, length = 200) # Evaluate the densities y_1 <- df(x, 1, 1) y_2 <- df(x, 3, 1) y_3 <- df(x, 6, 1) y_4 <- df(x, 3, 3) y_5 <- df(x, 6, 3) y_6 <- df(x, 3, 6) y_7 <- df(x, 6, 6) # Plot the densities plot(x, y_1, col = 1, "l") lines(x, y_2, col = 2) lines(x, y_3, col = 3) lines(x, y_4, col = 4) lines(x, y_5, col = 5) lines(x, y_6, col = 6) lines(x, y_7, col = 7) # Add the legend legend("topright", title = "F-distributions", c("df = (1,1)", "df = (3,1)", "df = (6,1)", "df = (3,3)", "df = (6,3)", "df = (3,6)", "df = (6,6)"), col = c(1, 2, 3, 4, 5, 6, 7), lty = 1)  test_object("x", incorrect_msg = "Did you correctly define vector x?") test_object("y_1", incorrect_msg = "Did you correctly define y_1?") test_object("y_2", incorrect_msg = "Did you correctly define y_2?") test_object("y_3", incorrect_msg = "Did you correctly define y_3?") test_object("y_4", incorrect_msg = "Did you correctly define y_4?") test_object("y_5", incorrect_msg = "Did you correctly define y_5?") test_object("y_6", incorrect_msg = "Did you correctly define y_6?") test_object("y_7", incorrect_msg = "Did you correctly define y_7?") test_function("plot", not_called_msg = "Don't change anything about the plot() function.") test_function("lines", not_called_msg = "Don't delete the lines() functions in the script.") test_function("legend", not_called_msg = "Don't delete the legend() function!") test_error() success_msg("Great work!") 
• You just have to fill in the given values in the text and use the command df(x, df1, df2) with values for (df1,df2) equal to (1,1), (3,1), etc. to generate the densities.
• Check the legend() function documentation by entering ?legend.

## Between group sum of squares

To calculate the F-value, you need to calculate the ratio between the variance between groups and the variance within groups. Furthermore, to calculate the variance (i.e. mean of squares), you first have to calculate the sum of squares.

Let's start with the between group sum of squares. The formula for the calculation of the between group sum of squares is

\begin{aligned} ss_a & = n \sum(y_j - y_t)^2 \end{aligned}

where $$y_j$$ are the group means, $$y_t$$ is the grand mean, and $$n$$ is the number of items in each group.

Now, remember that the working memory experiment investigates the relationship between the change in IQ and the number of training sessions. Calculate the between group sum of squares for the data from this experiment. wm is still loaded in your workspace.

### Instructions

• Determine the number of subjects in each group and store the result in n. If you don't know the number of subjects in each group, you can always print the data to the console or use more creative means of figuring it out.
• Use tapply() to compute the group means and save the result to y_j. tapply() allows you to perform an operation on iq once for each level of cond. Consequently, it can calculate each group mean. The first argument should contain the data column of data for which you want to calculate the means and the second argument should contain the column containing information on which group each subject belongs to.
• Compute the grand mean and assign the result to y_t. This is just the mean of all IQ gains in the data.
• You now have all the ingredients to calculate the between group sum of squares by applying the formula. Save this in the variable ss_a
library(psych) # Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days"))  # Define number of subjects in each group n <- ___ # Calculate group means y_j <- tapply(___, ___, mean) # Calculate the grand mean y_t <- ___ # Calculate the sum of squares ss_a <- ___ # Define number of subjects in each group n <- 20 # Calculate group means y_j <- tapply(wm$iq, wm$cond, mean) # Calculate the grand mean y_t <- mean(wm$iq) # Calculate the sum of squares ss_a <- n*sum((y_j - y_t)^2)  test_predefined_objects("wm") test_object("n", undefined_msg = "It seems that you did not define <code>n</code>. Remember that <code>n</code> is the number of subjects in each group.", incorrect_msg = "The value of <code>n</code> does not seem to be correct. The number of subjects in each group is 20.") test_function("tapply", args = c("X", "INDEX", "FUN"), eval = c(T, T, F), not_called_msg = "Please use the <code>tapply()</code> function to calculate the group means. Type <code>?tapply</code> in the console to get the help file.", incorrect_msg = c("The first argument in the function <code>tapply()</code> should contain the data of which you want to calculate the means, so in this case <code>wm$iq</code>.", "The second argument in the function <code>tapply()</code> should contain the data that you need to sort the different groups, so in this case <code>wmcond</code>.", "The last argument in the function <code>tapply()</code> should contain the function you want to apply.")) test_object("y_j", undefined_msg = "Make sure to create a vector <code>y_j</code> that contains the group means. Use the function <code>tapply()</code> to do this. If you do not know how the function <code>tapply()</code> works, type <code>?tapply</code> in the console to get the help file.") test_object("y_t", undefined_msg = "Make sure to create a variable <code>y_t</code>. This variable should contain the grand mean of the IQ scores.", incorrect_msg = "Looks like <code>y_t</code> is not the correct value. Did you calculate the mean of the proper data? Did you calculate the mean of the IQ scores?") test_object("ss_a", undefined_msg = "Make sure to define the variable <code>ss_a</code>. This variable should contain the between group sum of squares.", incorrect_msg = "Looks like <code>ss_a</code> is not the correct value. Check the formula again, it should be a simple copy paste.") success_msg("Well done! You can use this result later to calculate the F-ratio. Continue to the next exercise, where you will calculate the within groups sum of squares.")  • To calculate the grand mean, use the R function mean(). • The sum in the formula can easily be calculated by using the command sum(), which sums up all elements of a vector. ## Within groups sum of squares To calculate the F-value, you also need the variance within groups. Similar to the last exercise, we'll start by computing the within groups sum of squares, which is equal to the following: \begin{aligned} ss_{s/a} & = \sum(Y_{ij} - y_j)^2 \end{aligned} where $$Y_{ij}$$ are the individual scores and $$y_j$$ are the group means. Now you are going to apply this formula yourself! ### Instructions • Create a separate vector of IQ gains (wmiq) for each training group (wm$condition) using the subset() function. The first one has been done for you. Define them in order from the least amount of training to the most amount of training. • Now you can easily subtract each subject's IQ gain by its corresponding group mean. You already calculated the group means in the previous exercise, so use this result. The vector containing the four group means is called y_j and will need to be subsetting using [ ] notation to extract each mean. • Create a new vector, s_t, to combine s_1, s_2, s_3, and s_4 back into a single vector. • Calculate the within groups sum of squares. You just need to square the previous result and sum up the elements of the vector using the sum() function. library(psych) # Read data into a dataframe that is called wm wm = read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days")) y_j = tapply(wm$iq, wm$cond, mean) # Create a separate vector of IQ gains for each training group y_i1 <- subset(wm$iq, wm$cond == "8 days") y_i2 <- subset(wm$iq, wm$cond == ___) y_i3 <- subset(wm$iq, wm$cond == ___) y_i4 <- subset(wm$iq, wm$cond == ___) # Subtract group means from the individual values s_1 <- y_i1 - y_j[1] s_2 <- ___ - ___ s_3 <- ___ - ___ s_4 <- ___ - ___ # Put everything back together into one vector s_t <- c(___, ___, ___, ___) # Calculate the sum of squares using s_t ss_sa <- ___ # Create a separate vector of IQ gains for each training group y_i1 <- subset(wm$iq, wm$cond == "8 days") y_i2 <- subset(wm$iq, wm$cond == "12 days") y_i3 <- subset(wm$iq, wm$cond == "17 days") y_i4 <- subset(wm$iq, wmcond == "19 days") # Subtract group means from the individual values s_1 <- y_i1 - y_j[1] s_2 <- y_i2 - y_j[2] s_3 <- y_i3 - y_j[3] s_4 <- y_i4 - y_j[4] # Put everything back together into one vector s_t <- c(s_1, s_2, s_3, s_4) # Calculate the sum of squares using s_t ss_sa <- sum(s_t^2)  # Create a separate vector of IQ gains for each training group test_object("y_i1", incorrect_msg = "Did you correctly define y_i1 using subset()?") test_object("y_i2", incorrect_msg = "Did you correctly define y_i2 using subset()?") test_object("y_i3", incorrect_msg = "Did you correctly define y_i3 using subset()?") test_object("y_i4", incorrect_msg = "Did you correctly define y_i4 using subset()?") # Subtract group means from the individual values test_object("s_1", incorrect_msg = "Did you correctly define s_1?") test_object("s_2", incorrect_msg = "Did you correctly define s_2?") test_object("s_3", incorrect_msg = "Did you correctly define s_3?") test_object("s_4", incorrect_msg = "Did you correctly define s_4?") # Put everything back together into one vector test_object("s_t", incorrect_msg = "Did you correctly define s_t?") # Calculate the sum of squares using s_t test_object("ss_sa", incorrect_msg = "Did you correctly define ss_sa?") test_error() success_msg("Great work!")  • For the subsetting, you just need to specify the respective training levels: "8 days", "12 days", "17 days", "19 days". • To get the values of a vector, write the name of the vector and the number of the element between brackets. For example, y_j[3] gives the third element of the vector named y_j. ## Calculating the F-ratio In this exercise, you'll calculate the F-ratio. No need to worry! You already did a lot of preparation in the previous exercises. The F-ratio is calculated as follows: \begin{aligned} F & = \frac{ms_a}{ms_{s/a}} \end{aligned} where \begin{aligned} ms_a & = \frac{ss_a}{df_a}, \\ ms_{s/a} & = \frac{ss_{s/a}}{df_{s/a}} \end{aligned} and $$df$$ stands for the degrees of freedom. In this case, we have \begin{align} df_a &= a - 1 \\ df_{s/a} &= a (n - 1). \end{align} The values $$a$$ and $$n$$ represent the number of groups and the number of subjects in each group, respectively. You will find all relevant variables from the previous exercises loaded in your workspace, such as ss_a and ss_sa. You can always type ls() in the console to see what's there. Calculate the F-ratio by implementing the above formulas step-by-step. ### Instructions • Assign to a the number of different groups in the experiment. • Assign to n the number of subjects in each group. • Assign to df_a and df_sa the respective degrees of freedom. Remember to use the * sign when performing multiplication. For example, 10 * 5 multiplies 10 and 5. • Assign to ms_a and ms_da the between and within groups mean squares, respectively. • Calculate the F-ratio and assign the result to f_rat. ## Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wmcondition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days")) n <- 20 # Calculate group means y_j <- tapply(wm$iq, wm$cond, mean) # Calculate the grand mean y_t <- mean(wm$iq) # Calculate the sum of squares ss_a <- n*sum((y_j-y_t)^2) ## Create four subsets of the four groups, containing the IQ results # Make the subset for the group cond = "8 days" y_i1 <- subset(wm$iq, wm$cond == "8 days") # Make the subset for the group cond = "12 days" y_i2 <- subset(wm$iq, wm$cond == "12 days") # Make the subset for the group cond = "17 days" y_i3 <- subset(wm$iq, wm$cond == "17 days") # Make the subset for the group cond = "19 days" y_i4 <- subset(wm$iq, wm$cond == "19 days") ## Subtract the individual values by their group means # You have already calculated the group means in the previous exercise so use this result, the vector that contains these group means was called y_j s_1 <- y_i1 - y_j[1] s_2 <- y_i2 - y_j[2] # Do it without the vector y_j, so calculate the group means again. s_3 <- y_i3 - mean(y_i3) s_4 <- y_i4 - mean(y_i4) #Put everything back into one vector s_t <- c(s_1,s_2,s_3,s_4) #Calculate the sum of squares by using the vector s_t ss_sa <- sum(s_t^2) # Number of groups a <- ___ # Number of subjects in each group n <- ___ # Define degrees of freedom df_a <- ___ df_sa <- ___ # Calculate mean squares using ss_a and ss_sa ms_a <- ___ / ___ ms_sa <- ___ / ___ # Calculate the F-ratio f_rat <- ___ / ___ # Number of groups a <- 4 # Number of subjects in each group n <- 20 # Define degrees of freedom df_a <- a - 1 df_sa <- a * (n - 1) # Calculate mean squares using ss_a and ss_sa ms_a <- ss_a / df_a ms_sa <- ss_sa / df_sa # Calculate the F-ratio f_rat <- ms_a / ms_sa  test_predefined_objects("wm") test_object("a", undefined_msg = "It seems that you did not define <code>a</code>. Remember that <code>a</code> is the number of groups.", incorrect_msg = "The value of <code>a</code> does not seem to be correct. The number of groups is 4.") test_object("n", undefined_msg = "It seems that you did not define <code>n</code>. Remember that <code>n</code> is the number of subjects in each group.", incorrect_msg = "The value of <code>n</code> does not seem to be correct. The number of subjects in each group is 20.") test_object("df_a", undefined_msg = "It seems that you did not define <code>df_a</code>. Remember that <code>df_a</code> is the degrees of freedom between the groups.", incorrect_msg = "The value of <code>df_a</code> does not seem to be correct. Did you use the proper formula?") test_object("df_sa", undefined_msg = "It seems that you did not define <code>df_sa</code>. Remember that <code>df_sa</code> is the degrees of freedom within the groups.", incorrect_msg = "The value of <code>df_sa</code> does not seem to be correct. Did you use the proper formula?") test_object("ms_a", undefined_msg = "It seems that you did not define <code>ms_a</code>, this variable represents the between group mean squares.", incorrect_msg = "The value of <code>ms_a</code> does not seem to be correct. Did you use the correct sum of squares and the correct degrees of freedom?") test_object("ms_sa", undefined_msg = "It seems that you did not define <code>ms_sa</code>, this variable represents the within groups' mean squares.", incorrect_msg = "The value of <code>ms_sa</code> does not seem to be correct. Did you use the correct sum of squares and the correct degrees of freedom?") test_object("f_rat", undefined_msg = "It seems that you did not define <code>f_rat</code>, this variable represents the F-ratio.", incorrect_msg = "The value of <code>f_rat</code> does not seem to be correct. Did you use the correct formula?") success_msg("Well done. Print the F-value in the console. Do you think it's significant? We'll address this in the exercises ahead.") 

After assigning the right values to $$a$$ and $$n$$, just follow along with the formulas.

## A faster way: ANOVA in R

Normally, you do not have to do all calculations yourself to get the F-value and to see whether or not the null hypothesis (i.e. that all groups are equal) should be rejected. R's aov() function does the heavy lifting for you!

### Instructions

• Apply aov() to the working memory data. The only argument should be a formula containing the dependent variable iq and independent variable condition. For example, aov(dependent_var ~ independent_var).
• Make a summary table of the result with summary(). The only argument is the result from the first instruction.
# Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days"))  ## wm is already loaded # Apply the aov function anova_wm <- ___ # Look at the summary table of the result  ## wm is already loaded # Apply the aov function anova_wm <- aov(wm$iq ~ wm$condition) # Look at the summary table of the result summary(anova_wm)  test_predefined_objects("wm") test_correct(test_object("anova_wm"), test_function("aov", args = c("formula"), not_called_msg = "Please use the <code>aov()</code> function to calculate the ANOVA model. Type <code>?aov</code> in the console to get the help file.", incorrect_msg = "Use the proper formula in the <code>aov()</code> function. Remember that a formula has the form <code>x ~ y</code>. If you do not know what to fill in for <code>x</code> and <code>y</code>, you could take a look at the hint.")) test_output_contains("summary(anova_wm)", incorrect_msg = "Use the <code>summary()</code> function to print the main results of the ANOVA model to the console.") success_msg("Notice that you get the same F-value as in the previous exercise, only much faster and easier. The summary table also contains the mean squares, the sums of squares, and the degrees of freedom. You can also compare these values with your previous results: they should be the same. In short, the summary table contains everything that was used to calculate the F-value, or F-ratio.") 

The aov() function should take a single argument: wm$iq ~ wm$condition

## Levene's test

The assumptions of ANOVA are relatively simple. Similar to an independent t-test, we have a continuous dependent variable, which we assume to be normally distributed. Furthermore, we assume homogeneity of variance, which can be tested with Levene's test.

It's good practice to check both assumptions before you do ANOVA, but here we'll focus on the latter. If the assumptions don't hold, then the ANOVA results won't be valid.

### Instructions

• Perform Levene's test for the working memory data wm. Use leveneTest from the car package with the dependent and independent variables as first and second arguments, respectively.
• If you don't specify additional arguments, the deviation scores are calculated by comparing each score to its group median. This is the default behavior, even though they are typically calculated by comparing each score to its group mean. If you want to use means and not medians, add an argument center = mean. Do this now and compare the results to the first test.
library(car) # Read data into a dataframe that is called wm wm <- read.table(url("http://s3.amazonaws.com/assets.datacamp.com/course/Conway/Lecture_Data/L15-16_WMtraining_Example.txt"), header = T) wm$condition <- factor(wm$cond, levels = c("8 days","12 days", "17 days", "19 days"))  # wm is already loaded # Levene's test # Levene's test with center = mean  # wm is already loaded # Levene's test leveneTest(wm$iq, wm$cond) # Levene's test with center = mean leveneTest(wm$iq, wm$cond, center = mean)  test_predefined_objects("wm") test_function("leveneTest", not_called_msg = "Did you use the function <code>leveneTest()</code> to check the homogeneity of variance assumption? If you do not know how to use the function, type <code>?leveneTest</code> in the console to get the help page.") test_output_contains("leveneTest(wm$iq, wm$cond)", incorrect_msg = "Did you use the function <code>leveneTest()</code> to check the homogeneity of variance assumption? If you do not know how to use the function, type <code>?leveneTest</code> in the console to get the help page. Also, check if you have used the correct input arguments. Remember that the dependent variable is contained in <code>wm$iq</code> and that the independent variable is contained in <code>wm$cond</code>.") test_output_contains("leveneTest(wm$iq, wm$cond, center = mean )", incorrect_msg = "Did you use the function <code>leveneTest()</code> and did you add the change for the default, namely <code>center = mean</code>?") success_msg("Well done. Have a close look at the ouput. Does the homogeneity of the variance assumption hold?") 

The dependent variable is iq and the independent variable is condition.