An introduction to Correlation Coefficients

| September 22nd, 2014

Manual computation of correlation coefficients (1)

In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients step by step in R. The general formula for calculating the correlation coefficient between two variables is $$r = \frac{cov(A,B)}{s_A \cdot s_B},$$ where $$cov(A,B)$$ is the covariance between $$A$$ and $$B$$, while $$s_A$$ and $$s_B$$ are the standard deviations.

Instructions

• The vectors A and B have already been loaded into your workspace. Begin by calculating the covariance $$cov(A,B)$$ between these vectors. Save all necessary intermediate values in new variables to simplify the computations.
• First you have to compute the differences of each vector element with the mean and save them in the new variables diff_A and diff_B. Use the mean() function to do this.
• Then, calculate the covariance between A and B using your new variables. Save your result in the variable cov.
A <- c(1,2,3) B <- c(3,6,7) # The vectors A and B have already been loaded # Take a quick peek at both vectors # Save the differences of each vector element with the mean in a new variable diff_A <- ___ - mean(___) diff_B <- ___ - mean(___) # Do the summation of the elements of the vectors and divide by N-1 in order to acquire the covariance between the two vectors cov <- sum(___*___)/ ___ # The vectors A and B have already been loaded # Take a quick peek at both vectors A B # Save the differences of each vector element with the mean in a new variable diff_A <- A - mean(A) diff_B <- B - mean(B) # Do the summation of the elements of the vectors and divide by N-1 in order to acquire the covariance between the two vectors cov <- sum(diff_A*diff_B)/(3-1)  test_output_contains("A", incorrect_msg = "Did you print out the vector A?") test_output_contains("B", incorrect_msg = "Did you print out the vector B?") test_object("diff_A", undefined_msg = "It seems that the variable <code>diff_A</code> is not yet assigned or contains a syntax error. Please fix this and try again.", incorrect_msg = "It looks like you calculated <code>diff_A</code> incorrectly. Remember that you have to calculate the difference of each element of vector <code>A</code> with the vector mean.") test_object("diff_B", undefined_msg = "It seems that the variable <code>diff_B</code> is not yet assigned or contains a syntax error. Please fix this and try again.", incorrect_msg = "It looks like you calculated <code>diff_B</code> incorrectly. Remember that you have to calculate the difference of each element of vector <code>B</code> with the vector mean.") test_object("cov", undefined_msg = "It seems that the variable <code>cov</code> is not yet assigned or contains a syntax error. Please fix this and try again.", incorrect_msg = "It looks like you defined the variable <code>cov</code> incorrectly. Have a look at the formula again.") success_msg("Well done!") 

Remember that the covariance can be expressed as $$covariance = \frac{\sum \limits_{k=1}^N (x_i - \overline{x}).(y_i - \overline{y})}{N-1},$$ where $$N$$ is the length of the vectors $$x$$ and $$y$$.

You can use the function length() to determine $$N$$.

Manual computation of correlation coefficients (2)

In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients step by step in R. The general formula for calculating the correlation coefficient between two variables is $$r = \frac{cov(A,B)}{s_A \cdot s_B},$$ where $$cov(A,B)$$ is the covariance between $$A$$ and $$B$$, while $$s_A$$ and $$s_B$$ are the standard deviations.

You have reached the half way point! Calculate the sample standard deviations $$s_A$$ and $$s_B$$ of the vectors A and B. Your workspace still contains these two vectors and the results of the previous exercise.

Instructions

• Define the variables sq_diff_A and sq_diff_B as the squares of the differences of each vector element with the mean.
• Use your new variables to calculate sd_A and sd_B. These two variables should contain the sample standard deviations of A and B respectively.
A <- c(1,2,3) B <- c(3,6,7) diff_A <- A - mean(A) diff_B <- B - mean(B) # Your workspace still contains the results of the previous exercise # Square the differences that were found in the previous step sq_diff_A <- ___ sq_diff_B <- ___ # Take the sum of the elements, divide them by N-1 and consequently take the square root to acquire the sample standard deviations sd_A <- ___(sum(___)/(___)) sd_B <- ___(sum(___)/(___)) # Your workspace still contains the results of the previous exercise # Square the differences that were found in the previous step sq_diff_A <- diff_A^2 sq_diff_B <- diff_B^2 # Take the sum of the elements, divide them by N-1 and consequently take the square root to acquire the sample standard deviations sd_A <- sqrt(sum(sq_diff_A)/(3-1)) sd_B <- sqrt(sum(sq_diff_B)/(3-1))  test_predefined_objects(c("A", "B", "diff_A", "diff_B")) test_object("sq_diff_A", undefined_msg = "It seems like the variable <code>sq_diff_A</code> is not yet assigned or contains a syntax error. Please fix this and try again.", incorrect_msg= "You should square the differences <code>diff_A</code> to get <code>sq_diff_A</code>.") test_object("sq_diff_B", undefined_msg = "It seems like the variable <code>sq_diff_B</code> is not yet assigned or contains a syntax error. Please fix this and try again.", incorrect_msg = "You should square the differences <code>diff_B</code> to get <code>sq_diff_B</code>.") test_function("sqrt", not_called_msg = "Use the <code>sqrt()</code> function to take the square root to acquire the sample standard deviations. Type <code>?sqrt</code> in the console to get the help file.") test_object("sd_A", incorrect_msg = "Make sure to calculate the sample standard deviation <code>sd_A</code> correctly. The formula is given in the hint, if you don't know what to do. You have to use sqrt() and sum().") success_msg("Fantastic! Just a few more calculations to go.") 

Remember that the sample standard deviation can be expressed as $$\sqrt{\frac{\sum \limits_{k=1}^N (x_i - \overline{x})^2}{N-1}}$$ .

Manual computation of correlation coefficients (3)

In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients step-by-step in R. The general formula for calculating the correlation coefficient between two variables is $$r = \frac{cov(A,B)}{s_A \cdot s_B},$$ where $$cov(A,B)$$ is the covariance between $$A$$ and $$B$$, while $sA$ and $sB$ are the standard deviations.

Instructions

• Almost at the finish line! Combine the results of the previous steps to reap the fruits of your hard labor. Assign the correct value to correlation.
• Check your result by using the cor() command!
A <- c(1,2,3) B <- c(3,6,7) diff_A <- A - mean(A) diff_B <- B - mean(B) sq_diff_A <- diff_A^2 sq_diff_B <- diff_B^2 cov <- (diff_A[1] * diff_B[1]+ diff_A[2] * diff_B[2]+diff_A[3] * diff_B[3])/(3-1) sd_A <- sqrt(sum(sq_diff_A)/(3-1)) sd_B <- sqrt(sum(sq_diff_B)/(3-1)) # Your workspace still contains the results of the previous exercise # Combine all the pieces of the puzzle correlation <- ___ correlation # Check the validity of your result with the cor() command  # Your workspace still contains the results of the previous exercise # Combine all the pieces of the puzzle correlation <- cov/(sd_A*sd_B) correlation # Check the validity of your result with the cor() command cor(A,B)   correlation_correct = cov/(sd_A*sd_B) if ( !exists("correlation") ) { DM.result = list(FALSE, "It seems that the variable <code>correlation</code> is not yet assigned or contains a syntax error. Please try again.") } else if ( ! identical(correlation_correct, correlation) ){ DM.result = list(FALSE, "It looks like the variable <code>correlation</code> was defined incorrectly. Take a good look at the formula and try again.") } else if (! (output_contains("correlation"))) { DM.result = list(FALSE, "Did you look at the output of the <code>correlation</code> variable?") } else if (! (student_typed("cor(A, B)") | student_typed("cor(B, A)"))) { DM.result = list(FALSE, "Did you check the correlation of <code>A</code> and <code>B</code> using the <code>cor()</code> command? If you don't know how this function works, you can always type <code>?cor</code> in the console to get the help file.") } else { DM.result = list(TRUE,"Good job. Now onto the next video...") } rm(correlation_correct) 

This should be easy given that the results from the previous exercises are still available in your workspace! Recall that the covariance between A and B was saved in the variable cov. Furthermore, sd_A and sd_B contain the sample standard deviations of A and B respectively.

Creating scatterplots

The data here comes from a correlational study that investigated predictors of physical endurance with respect to healthy adults. The outcome variable of the study is physical endurance, whereas potential predictor variables are defined as the age of the individuals in the study and the number of years that they have actively been engaged in exercise/sports.

In econometric software, the data can be appropriately analyzed by means of summary statistics and scatterplots. Use the describe() and plot() commands in order to perform these actions in R.

The starting point would be to read the datafile into R. This line of code is commonly used, therefore you are urged to memorize the function.

Instructions

• Read the datafile into R via read.table() from http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt. Include the header.
• Then describe the data by providing summary statistics. Use the function describe().
• Create scatterplots of age against activeyears, endurance against activeyears and endurance against age.
library(psych) # Read data from a URL into a dataframe called PE (physical endurance) PE <- ___ # Summary statistics ___ # Scatter plots plot(___~___) plot(___~___) plot(___~___) # Read data from a URL into a dataframe called PE (physical endurance) PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header = TRUE) # Summary statistics describe(PE) # Scatter plots plot(PE$age ~ PE$activeyears) plot(PE$endurance ~ PE$activeyears) plot(PE$endurance ~ PE$age)  if ( !exists("PE") ) { DM.result = list(FALSE, "It seems that the variable <code>PE</code> is not yet assigned. Recall that you have to use <code>read.table()</code> to do this. You can always type <code>?read.table</code> in the console to get the help file. ") } else if ((function_has_arguments("read.table","file","http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt")==0) & (student_typed("read.table(url(\"http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt\"")==0)) { DM.result = list(FALSE, "Did you load in the required data using the <code>read.table()</code> function? If you don't know how this function works, you can always type ?read.table in the console to get the help file.") } else if ((function_has_arguments("read.table","header","T")==0) & (function_has_arguments("read.table","header","TRUE")==0)) { DM.result = list(FALSE, "Did you include the header in your dataframe <code>PE</code>? You can include a header by setting the argument <code>header</code> in the <code>read.table()</code> function to <code>TRUE</code>.") } else if ( !(isTRUE(try(all.equal(PE, read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=T)) )))){ DM.result = list(FALSE, "The dataframe <code>PE</code> seems to contain incorrect values? Are you sure that you used the <code>read.table()</code> correctly?") } else if (! (output_contains("describe(PE)"))) { DM.result = list(FALSE, "Did you look at the summary statistics?") } else if ((function_has_arguments("plot","x","PE$age ~ PE$activeyears")==0) & (function_has_arguments("plot","x","PE$activeyears ~ PE$age"))==0) { DM.result = list(FALSE, "Did you create a scatterplot of <code>activeyears</code> against <code>age</code>?") } else if ((function_has_arguments("plot","x","PE$endurance ~ PE$activeyears")==0) & (function_has_arguments("plot","x","PE$activeyears ~ PE$endurance"))==0) { DM.result = list(FALSE, "Did you create a scatterplot of <code>endurance</code> against <code>activeyears</code>?") } else if ((function_has_arguments("plot","x","PE$endurance ~ PE$age")==0) & (function_has_arguments("plot","x","PE$age ~ PE$endurance"))==0) { DM.result = list(FALSE, "Did you create a scatterplot of <code>endurance</code> against <code>age</code>?") } else { DM.result = list(TRUE,"Good job! The scatterplots enable you to have a good initial view on the pairwise relationships between the variables.") } 

You can use the output of the summary statistics to see the different variable names that you can use in the scatterplots.

Correlation matrix

You can measure the strength of a linear relationship by looking at the correlation coefficient. The correlation coefficient can be directly computed in R by means of the cor() command.

The cor.test() function can be utilized to see whether the correlation coefficient, and thus the relationship between the two variables, is significantly different from zero or not.

Instructions

• Compute the pairwise correlations in matrix form. Leave out the first variable pid as this is just an identification variable and does not provide you with additional insight. Make sure to round off the correlation coefficients to two decimal places!
• See which pairwise correlations are statistically significant. Note that the results link back to the scatterplots of the previous exercise!
PE <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=T) library(psych) # PE is already loaded in # Correlation Analysis round(cor(___), ___) # Do some correlation tests. If the null hypothesis of no correlation can be rejected on a significance level of 5%, then the relationship between variables is significantly different from zero at the 95% confidence level cor.test(___, ___) cor.test(___, ___) cor.test(___, ___) # PE is already loaded in # Correlation Analysis round(cor(PE[2:4]), 2) # Do some correlation tests. If the null hypothesis of no correlation can be rejected on a significance level of 5%, then the relationship between variables is significantly different from zero. cor.test(PE$age, PE$activeyears) cor.test(PE$endurance, PE$activeyears) cor.test(PE$endurance, PE$age)  if (! (output_contains("round(cor(PE[2:4]),2)"))) { DM.result = list(FALSE, "Make sure that you entered the arguments for the <code>round()</code> and <code>cor()</code> function correctly.") } else if (! (student_typed("cor.test(PE$age, PE$activeyears)") | student_typed("cor.test(PE$activeyears, PE$age)"))) { DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>age</code> and <code>activeyears?</code>") } else if (! (student_typed("cor.test(PE$endurance, PE$activeyears)") | student_typed("cor.test(PE$activeyears,PE$endurance)"))) { DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>endurance</code> and <code>activeyears?</code>") } else if (! (student_typed("cor.test(PE$endurance, PE$age)") | student_typed("cor.test(PE$age, PE$endurance)"))) { DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>endurance</code> and <code>age</code>?") } else { DM.result = list(TRUE,"Well done! Notice that the correlation matrix is always symmetric. Time for a few multiple choice questions to test your understanding.") } 
• Omit pid by only using the last three columns of the data matrix by means of the command line PE[2:4].
• The pairwise correlations can be computed in matrix form by using the cor() command. The argument should be the dataset containing all the variables where you want to calculate the pairwise correlations from.

Non-representative data samples

In the video, Professor Conway points out that you should be careful with interpreting correlation coefficients when analyzing non-representative samples. In this exercise, you will learn how to divide a dataset into subsets and see to what extent this can alter the correlation coefficients.

For this exercise the impact dataset is used. This dataset investigates the effects of a sports-related concussion and involves the survey data from both a control group and a group of athletes who suffered from a concussion.

Instructions

• Describe the entire dataset impact in a first step. Use the R function describe().
• Then calculate the correlation coefficient between the variables vismem2 (visual memory after impact) and vermem2 (verbal memory after impact), round off to 2 decimal figures and save it in the variable entirecorr.
• Describe the entire dataset while dividing it directly in two subsets by condition: control and concussed. Use the describeBy() command.
• Now create the two subsets in R containing the data of the control group and the concussed group. You can use subset().
• Finally, calculate and save the same correlation coefficients as in the first step for each subset independently in the variables controlcorr and concussedcorr. Display all correlation coefficients together by using the cbind() command. Save this result in the variable correlations.
impact <- read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.03.txt", header = T) library(psych) # The impact dataset is already loaded in # Summary statistics entire dataset # Calculate correlation coefficient entirecorr <- ___ # Summary statistics subsets describeBy(___, ___) # Create 2 subsets: control and concussed control <- subset(___, ___) concussed <- subset(___, ___) # Calculate correlation coefficients for each subset controlcorr <- ___ concussedcorr <- ___ # Display all values at the same time correlations <- cbind(___, ___, ___) correlations # The impact dataset is already loaded in # Summary statistics entire dataset describe(impact) # Calculate correlation coefficient entirecorr <- round(cor(impact$vismem2,impact$vermem2),2) # Summary statistics subsets describeBy(impact,impact$condition) # Create 2 subsets: control and concussed control <- subset(impact,impact[,2] == "control") concussed <- subset(impact,impact[,2] == "concussed") # Calculate correlation coefficients for each subset controlcorr <- round(cor(control$vismem2,control$vermem2),2) concussedcorr <- round(cor(concussed$vismem2,concussed$vermem2),2) # Display all values at the same time correlations <- cbind(entirecorr,controlcorr,concussedcorr) correlations  entirecorr_correct = round(cor(impact$vismem2,impact$vermem2),2) control_correct = subset(impact,impact[,2] == "control") concussed_correct = subset(impact,impact[,2] == "concussed") controlcorr_correct = round(cor(control_correct$vismem2,control_correct$vermem2),2) concussedcorr_correct = round(cor(concussed_correct$vismem2,concussed_correct$vermem2),2) show_correct = cbind(entirecorr_correct,controlcorr_correct,concussedcorr_correct) if (! (output_contains("describe(impact)"))) { DM.result = list(FALSE, "Did you look at the summary statistics of the dataset using the <code>describe()</code> function? If you don't know how this function works, you can always type <code>?describe</code> in the console to get the help file.") } else if ( !exists("entirecorr") ) { DM.result = list(FALSE, "It seems that the variable <code>entirecorr</code> is not yet defined.") } else if ( ! identical(entirecorr_correct, entirecorr)){ DM.result = list(FALSE, "It seems that you did not define the variable <code>entirecorr</code> correctly. Remember you already did a similar calculation in a past exercise. Don't forget to round off your result to 2 decimal figures. You can use the <code>round()</code> function to do this. Recall that you can always type <code>?round</code> in the console to get the help file.") } else if (! (output_contains("describeBy(impact, impact$condition)"))) { DM.result = list(FALSE, "Did you look at the summary statistics of the data divided into subsets of the variable <code>condition</code>? Use the <code>describeBy()</code> function to aid you in this and type <code>?describeBy</code> in the console if you don't know how this function works.") } else if ( !exists("control") ) { DM.result = list(FALSE, "It seems that the subset <code>control</code> is not yet defined.") } else if ( !exists("concussed") ) { DM.result = list(FALSE, "It seems that the subset <code>concussed</code> is not yet defined.") } else if ( ! identical(control_correct, control)){ DM.result = list(FALSE, "Make sure to define the subset <code>control</code> correctly. Recall that this subset should contain the <code>impact</code> data, but only the data with <code>condition</code> equal to <code>control</code>.") } else if ( ! identical(concussed_correct, concussed)){ DM.result = list(FALSE, "Make sure to define the subset <code>concussed</code> correctly. Recall that this subset should contain the <code>impact</code> data, but only the data with <code>condition</code> equal to <code>concussed</code>.") } else if ( !exists("controlcorr") ) { DM.result = list(FALSE, "It seems that the variable <code>controlcorr</code> is not yet defined.") } else if ( !exists("concussedcorr") ) { DM.result = list(FALSE, "It seems that the variable <code>concussedcorr</code> is not yet defined") } else if ( ! identical(controlcorr_correct, controlcorr)){ DM.result = list(FALSE, "Make sure to define the variable <code>controlcorr</code> correctly. This should be the correlation between the variables <code>vismem2</code> and <code>vermem2</code> of the subset <code>control</code>, round off to 2 decimal figures.") } else if ( ! identical(concussedcorr_correct, concussedcorr)){ DM.result = list(FALSE, "Make sure to define the subset <code>concussedcorr</code> correctly. This should be the correlation between the variables <code>vismem2</code> and <code>vermem2</code> of the subset <code>concussed</code>, round off to 2 decimal figures.") } else if ( !exists("correlations") ) { DM.result = list(FALSE, "It seems that the variable <code>correlations</code> is not yet defined.") } else if (!(student_typed("correlations = cbind(entirecorr, controlcorr, concussedcorr)") | student_typed("correlations = cbind(entirecorr, cocussedcorr, controlcorr)") | student_typed("correlations = cbind(controlcorr, concussedcorr, entirecorr)") | student_typed("correlations = cbind(controlcorr, entirecorr, concussedcorr)") | student_typed("correlations = cbind(concussedcorr, controlcorr, entirecorr)") | student_typed("correlations = cbind(concussed, entirecorr, controlcorr)"))) { DM.result = list(FALSE, "Display all correlation coefficients together via <code>correlations</code>.") } else if (! (output_contains("correlations"))) { DM.result = list(FALSE, "Did you look at the output of the variable <code>correlations</code>?") } else { DM.result = list(TRUE, "Great Job!") } rm(entirecorr_correct, control_correct, concussed_correct, controlcorr_correct, concussedcorr_correct) 
• Recall that you can calculate the correlation coefficient with the cor() command.
• Create a subset of a larger dataset by the following command: subset(impact,impact[,2]==" ")`