Manual computation of correlation coefficients (1)
In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients step by step in R. The general formula for calculating the correlation coefficient between two variables is $$ r = \frac{cov(A,B)}{s_A \cdot s_B}, $$ where \(cov(A,B)\) is the covariance between \(A\) and \(B\), while \(s_A\) and \(s_B\) are the standard deviations.
Instructions
 The vectors
A
andB
have already been loaded into your workspace. Begin by calculating the covariance \(cov(A,B)\) between these vectors. Save all necessary intermediate values in new variables to simplify the computations.  First you have to compute the differences of each vector element with the mean and save them in the new variables
diff_A
anddiff_B
. Use themean()
function to do this.  Then, calculate the covariance between
A
andB
using your new variables. Save your result in the variablecov
.
A < c(1,2,3)
B < c(3,6,7)
# The vectors A and B have already been loaded
# Take a quick peek at both vectors
# Save the differences of each vector element with the mean in a new variable
diff_A < ___  mean(___)
diff_B < ___  mean(___)
# Do the summation of the elements of the vectors and divide by N1 in order to acquire the covariance between the two vectors
cov < sum(___*___)/ ___
# The vectors A and B have already been loaded
# Take a quick peek at both vectors
A
B
# Save the differences of each vector element with the mean in a new variable
diff_A < A  mean(A)
diff_B < B  mean(B)
# Do the summation of the elements of the vectors and divide by N1 in order to acquire the covariance between the two vectors
cov < sum(diff_A*diff_B)/(31)
test_output_contains("A", incorrect_msg = "Did you print out the vector `A`?")
test_output_contains("B", incorrect_msg = "Did you print out the vector `B`?")
test_object("diff_A",
undefined_msg = "It seems that the variable <code>diff_A</code> is not yet assigned or contains a syntax error. Please fix this and try again.",
incorrect_msg = "It looks like you calculated <code>diff_A</code> incorrectly. Remember that you have to calculate the difference of each element of vector <code>A</code> with the vector mean.")
test_object("diff_B",
undefined_msg = "It seems that the variable <code>diff_B</code> is not yet assigned or contains a syntax error. Please fix this and try again.",
incorrect_msg = "It looks like you calculated <code>diff_B</code> incorrectly. Remember that you have to calculate the difference of each element of vector <code>B</code> with the vector mean.")
test_object("cov",
undefined_msg = "It seems that the variable <code>cov</code> is not yet assigned or contains a syntax error. Please fix this and try again.",
incorrect_msg = "It looks like you defined the variable <code>cov</code> incorrectly. Have a look at the formula again.")
success_msg("Well done!")
Remember that the covariance can be expressed as $$covariance = \frac{\sum \limits_{k=1}^N (x_i  \overline{x}).(y_i  \overline{y})}{N1},$$ where \(N\) is the length of the vectors \(x\) and \(y\).
You can use the function length()
to determine \(N\).
Manual computation of correlation coefficients (2)
In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients step by step in R. The general formula for calculating the correlation coefficient between two variables is $$ r = \frac{cov(A,B)}{s_A \cdot s_B}, $$ where \(cov(A,B)\) is the covariance between \(A\) and \(B\), while \(s_A\) and \(s_B\) are the standard deviations.
You have reached the half way point! Calculate the sample standard deviations \(s_A\) and \(s_B\) of the vectors A
and B
. Your workspace still contains these two vectors and the results of the previous exercise.
Instructions
 Define the variables
sq_diff_A
andsq_diff_B
as the squares of the differences of each vector element with the mean.  Use your new variables to calculate
sd_A
andsd_B
. These two variables should contain the sample standard deviations ofA
andB
respectively.
A < c(1,2,3)
B < c(3,6,7)
diff_A < A  mean(A)
diff_B < B  mean(B)
# Your workspace still contains the results of the previous exercise
# Square the differences that were found in the previous step
sq_diff_A < ___
sq_diff_B < ___
# Take the sum of the elements, divide them by N1 and consequently take the square root to acquire the sample standard deviations
sd_A < ___(sum(___)/(___))
sd_B < ___(sum(___)/(___))
# Your workspace still contains the results of the previous exercise
# Square the differences that were found in the previous step
sq_diff_A < diff_A^2
sq_diff_B < diff_B^2
# Take the sum of the elements, divide them by N1 and consequently take the square root to acquire the sample standard deviations
sd_A < sqrt(sum(sq_diff_A)/(31))
sd_B < sqrt(sum(sq_diff_B)/(31))
test_predefined_objects(c("A", "B", "diff_A", "diff_B"))
test_object("sq_diff_A",
undefined_msg = "It seems like the variable <code>sq_diff_A</code> is not yet assigned or contains a syntax error. Please fix this and try again.",
incorrect_msg= "You should square the differences <code>diff_A</code> to get <code>sq_diff_A</code>.")
test_object("sq_diff_B",
undefined_msg = "It seems like the variable <code>sq_diff_B</code> is not yet assigned or contains a syntax error. Please fix this and try again.",
incorrect_msg = "You should square the differences <code>diff_B</code> to get <code>sq_diff_B</code>.")
test_function("sqrt", not_called_msg = "Use the <code>sqrt()</code> function to take the square root to acquire the sample standard deviations. Type <code>?sqrt</code> in the console to get the help file.")
test_object("sd_A",
incorrect_msg = "Make sure to calculate the sample standard deviation <code>sd_A</code> correctly. The formula is given in the hint, if you don't know what to do. You have to use `sqrt()` and `sum()`.")
success_msg("Fantastic! Just a few more calculations to go.")
Remember that the sample standard deviation can be expressed as $$\sqrt{\frac{\sum \limits_{k=1}^N (x_i  \overline{x})^2}{N1}}$$ .
Manual computation of correlation coefficients (3)
In order to get acquainted with the concept of correlation coefficients, it is a good idea to manually calculate the coefficients stepbystep in R. The general formula for calculating the correlation coefficient between two variables is $$ r = \frac{cov(A,B)}{s_A \cdot s_B}, $$ where \(cov(A,B)\) is the covariance between \(A\) and \(B\), while $sA$ and $sB$ are the standard deviations.
Instructions
 Almost at the finish line! Combine the results of the previous steps to reap the fruits of your hard labor. Assign the correct value to
correlation
. 
Check your result by using the
cor()
command!
A < c(1,2,3)
B < c(3,6,7)
diff_A < A  mean(A)
diff_B < B  mean(B)
sq_diff_A < diff_A^2
sq_diff_B < diff_B^2
cov < (diff_A[1] * diff_B[1]+ diff_A[2] * diff_B[2]+diff_A[3] * diff_B[3])/(31)
sd_A < sqrt(sum(sq_diff_A)/(31))
sd_B < sqrt(sum(sq_diff_B)/(31))
# Your workspace still contains the results of the previous exercise
# Combine all the pieces of the puzzle
correlation < ___
correlation
# Check the validity of your result with the cor() command
# Your workspace still contains the results of the previous exercise
# Combine all the pieces of the puzzle
correlation < cov/(sd_A*sd_B)
correlation
# Check the validity of your result with the cor() command
cor(A,B)
correlation_correct = cov/(sd_A*sd_B)
if ( !exists("correlation") ) {
DM.result = list(FALSE, "It seems that the variable <code>correlation</code> is not yet assigned or contains a syntax error. Please try again.")
} else if ( ! identical(correlation_correct, correlation) ){
DM.result = list(FALSE, "It looks like the variable <code>correlation</code> was defined incorrectly. Take a good look at the formula and try again.")
} else if (! (output_contains("correlation"))) {
DM.result = list(FALSE, "Did you look at the output of the <code>correlation</code> variable?")
} else if (! (student_typed("cor(A, B)")  student_typed("cor(B, A)"))) {
DM.result = list(FALSE, "Did you check the correlation of <code>A</code> and <code>B</code> using the <code>cor()</code> command? If you don't know how this function works, you can always type <code>?cor</code> in the console to get the help file.")
} else {
DM.result = list(TRUE,"Good job. Now onto the next video...")
}
rm(correlation_correct)
This should be easy given that the results from the previous exercises are still available in your workspace! Recall that the covariance between A
and B
was saved in the variable cov
. Furthermore, sd_A
and sd_B
contain the sample standard deviations of A
and B
respectively.
Creating scatterplots
The data here comes from a correlational study that investigated predictors of physical endurance with respect to healthy adults. The outcome variable of the study is physical endurance, whereas potential predictor variables are defined as the age of the individuals in the study and the number of years that they have actively been engaged in exercise/sports.
In econometric software, the data can be appropriately analyzed by means of summary statistics and scatterplots. Use the describe()
and plot()
commands in order to perform these actions in R.
The starting point would be to read the datafile into R. This line of code is commonly used, therefore you are urged to memorize the function.
Instructions
 Read the datafile into R via
read.table()
from http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt. Include the header.  Then describe the data by providing summary statistics. Use the function
describe()
.  Create scatterplots of
age
againstactiveyears
,endurance
againstactiveyears
andendurance
againstage
.
library(psych)
# Read data from a URL into a dataframe called PE (physical endurance)
PE < ___
# Summary statistics
___
# Scatter plots
plot(___~___)
plot(___~___)
plot(___~___)
# Read data from a URL into a dataframe called PE (physical endurance)
PE < read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header = TRUE)
# Summary statistics
describe(PE)
# Scatter plots
plot(PE$age ~ PE$activeyears)
plot(PE$endurance ~ PE$activeyears)
plot(PE$endurance ~ PE$age)
if ( !exists("PE") ) {
DM.result = list(FALSE, "It seems that the variable <code>PE</code> is not yet assigned. Recall that you have to use <code>read.table()</code> to do this. You can always type <code>?read.table</code> in the console to get the help file. ")
} else if ((function_has_arguments("read.table","file","http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt")==0) & (student_typed("read.table(url(\"http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt\"")==0)) {
DM.result = list(FALSE, "Did you load in the required data using the <code>read.table()</code> function? If you don't know how this function works, you can always type `?read.table in the console to get the help file.")
} else if ((function_has_arguments("read.table","header","T")==0) & (function_has_arguments("read.table","header","TRUE")==0)) {
DM.result = list(FALSE, "Did you include the header in your dataframe <code>PE</code>? You can include a header by setting the argument <code>header</code> in the <code>read.table()</code> function to <code>TRUE</code>.")
} else if ( !(isTRUE(try(all.equal(PE, read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=T)) )))){
DM.result = list(FALSE, "The dataframe <code>PE</code> seems to contain incorrect values? Are you sure that you used the <code>read.table()</code> correctly?")
} else if (! (output_contains("describe(PE)"))) {
DM.result = list(FALSE, "Did you look at the summary statistics?")
} else if ((function_has_arguments("plot","x","PE$age ~ PE$activeyears")==0) & (function_has_arguments("plot","x","PE$activeyears ~ PE$age"))==0) {
DM.result = list(FALSE, "Did you create a scatterplot of <code>activeyears</code> against <code>age</code>?")
} else if ((function_has_arguments("plot","x","PE$endurance ~ PE$activeyears")==0) & (function_has_arguments("plot","x","PE$activeyears ~ PE$endurance"))==0) {
DM.result = list(FALSE, "Did you create a scatterplot of <code>endurance</code> against <code>activeyears</code>?")
} else if ((function_has_arguments("plot","x","PE$endurance ~ PE$age")==0) & (function_has_arguments("plot","x","PE$age ~ PE$endurance"))==0) {
DM.result = list(FALSE, "Did you create a scatterplot of <code>endurance</code> against <code>age</code>?")
} else {
DM.result = list(TRUE,"Good job! The scatterplots enable you to have a good initial view on the pairwise relationships between the variables.")
}
You can use the output of the summary statistics to see the different variable names that you can use in the scatterplots.
Correlation matrix
You can measure the strength of a linear relationship by looking at the correlation coefficient. The correlation coefficient can be directly computed in R by means of the cor()
command.
The cor.test()
function can be utilized to see whether the correlation coefficient, and thus the relationship between the two variables, is significantly different from zero or not.
Instructions
 Compute the pairwise correlations in matrix form. Leave out the first variable
pid
as this is just an identification variable and does not provide you with additional insight. Make sure to round off the correlation coefficients to two decimal places!  See which pairwise correlations are statistically significant. Note that the results link back to the scatterplots of the previous exercise!
PE < read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.04.txt", header=T)
library(psych)
# PE is already loaded in
# Correlation Analysis
round(cor(___), ___)
# Do some correlation tests. If the null hypothesis of no correlation can be rejected on a significance level of 5%, then the relationship between variables is significantly different from zero at the 95% confidence level
cor.test(___, ___)
cor.test(___, ___)
cor.test(___, ___)
# PE is already loaded in
# Correlation Analysis
round(cor(PE[2:4]), 2)
# Do some correlation tests. If the null hypothesis of no correlation can be rejected on a significance level of 5%, then the relationship between variables is significantly different from zero.
cor.test(PE$age, PE$activeyears)
cor.test(PE$endurance, PE$activeyears)
cor.test(PE$endurance, PE$age)
if (! (output_contains("round(cor(PE[2:4]),2)"))) {
DM.result = list(FALSE, "Make sure that you entered the arguments for the <code>round()</code> and <code>cor()</code> function correctly.")
} else if (! (student_typed("cor.test(PE$age, PE$activeyears)")  student_typed("cor.test(PE$activeyears, PE$age)"))) {
DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>age</code> and <code>activeyears?</code>")
} else if (! (student_typed("cor.test(PE$endurance, PE$activeyears)")  student_typed("cor.test(PE$activeyears,PE$endurance)"))) {
DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>endurance</code> and <code>activeyears?</code>")
} else if (! (student_typed("cor.test(PE$endurance, PE$age)")  student_typed("cor.test(PE$age, PE$endurance)"))) {
DM.result = list(FALSE, "Did you test the significance of the correlation coefficient of <code>endurance</code> and <code>age</code>?")
} else {
DM.result = list(TRUE,"Well done! Notice that the correlation matrix is always symmetric. Time for a few multiple choice questions to test your understanding.")
}
 Omit
pid
by only using the last three columns of the data matrix by means of the command linePE[2:4]
.  The pairwise correlations can be computed in matrix form by using the
cor()
command. The argument should be the dataset containing all the variables where you want to calculate the pairwise correlations from.
Nonrepresentative data samples
In the video, Professor Conway points out that you should be careful with interpreting correlation coefficients when analyzing nonrepresentative samples. In this exercise, you will learn how to divide a dataset into subsets and see to what extent this can alter the correlation coefficients.
For this exercise the impact
dataset is used. This dataset investigates the effects of a sportsrelated concussion and involves the survey data from both a control group and a group of athletes who suffered from a concussion.
Instructions

Describe the entire dataset
impact
in a first step. Use the R functiondescribe()
.  Then calculate the correlation coefficient between the variables
vismem2
(visual memory after impact) andvermem2
(verbal memory after impact), round off to 2 decimal figures and save it in the variableentirecorr
.  Describe the entire dataset while dividing it directly in two subsets by
condition
: control and concussed. Use thedescribeBy()
command.  Now create the two subsets in R containing the data of the
control
group and theconcussed
group. You can usesubset()
.  Finally, calculate and save the same correlation coefficients as in the first step for each subset independently in the variables
controlcorr
andconcussedcorr
. Display all correlation coefficients together by using thecbind()
command. Save this result in the variablecorrelations
.
impact < read.table("http://assets.datacamp.com/course/Conway/Lab_Data/Stats1.13.Lab.03.txt", header = T)
library(psych)
# The impact dataset is already loaded in
# Summary statistics entire dataset
# Calculate correlation coefficient
entirecorr < ___
# Summary statistics subsets
describeBy(___, ___)
# Create 2 subsets: control and concussed
control < subset(___, ___)
concussed < subset(___, ___)
# Calculate correlation coefficients for each subset
controlcorr < ___
concussedcorr < ___
# Display all values at the same time
correlations < cbind(___, ___, ___)
correlations
# The impact dataset is already loaded in
# Summary statistics entire dataset
describe(impact)
# Calculate correlation coefficient
entirecorr < round(cor(impact$vismem2,impact$vermem2),2)
# Summary statistics subsets
describeBy(impact,impact$condition)
# Create 2 subsets: control and concussed
control < subset(impact,impact[,2] == "control")
concussed < subset(impact,impact[,2] == "concussed")
# Calculate correlation coefficients for each subset
controlcorr < round(cor(control$vismem2,control$vermem2),2)
concussedcorr < round(cor(concussed$vismem2,concussed$vermem2),2)
# Display all values at the same time
correlations < cbind(entirecorr,controlcorr,concussedcorr)
correlations
entirecorr_correct = round(cor(impact$vismem2,impact$vermem2),2)
control_correct = subset(impact,impact[,2] == "control")
concussed_correct = subset(impact,impact[,2] == "concussed")
controlcorr_correct = round(cor(control_correct$vismem2,control_correct$vermem2),2)
concussedcorr_correct = round(cor(concussed_correct$vismem2,concussed_correct$vermem2),2)
show_correct = cbind(entirecorr_correct,controlcorr_correct,concussedcorr_correct)
if (! (output_contains("describe(impact)"))) {
DM.result = list(FALSE, "Did you look at the summary statistics of the dataset using the <code>describe()</code> function? If you don't know how this function works, you can always type <code>?describe</code> in the console to get the help file.")
} else if ( !exists("entirecorr") ) {
DM.result = list(FALSE, "It seems that the variable <code>entirecorr</code> is not yet defined.")
} else if ( ! identical(entirecorr_correct, entirecorr)){
DM.result = list(FALSE, "It seems that you did not define the variable <code>entirecorr</code> correctly. Remember you already did a similar calculation in a past exercise. Don't forget to round off your result to 2 decimal figures. You can use the <code>round()</code> function to do this. Recall that you can always type <code>?round</code> in the console to get the help file.")
} else if (! (output_contains("describeBy(impact, impact$condition)"))) {
DM.result = list(FALSE, "Did you look at the summary statistics of the data divided into subsets of the variable <code>condition</code>? Use the <code>describeBy()</code> function to aid you in this and type <code>?describeBy</code> in the console if you don't know how this function works.")
} else if ( !exists("control") ) {
DM.result = list(FALSE, "It seems that the subset <code>control</code> is not yet defined.")
} else if ( !exists("concussed") ) {
DM.result = list(FALSE, "It seems that the subset <code>concussed</code> is not yet defined.")
} else if ( ! identical(control_correct, control)){
DM.result = list(FALSE, "Make sure to define the subset <code>control</code> correctly. Recall that this subset should contain the <code>impact</code> data, but only the data with <code>condition</code> equal to <code>control</code>.")
} else if ( ! identical(concussed_correct, concussed)){
DM.result = list(FALSE, "Make sure to define the subset <code>concussed</code> correctly. Recall that this subset should contain the <code>impact</code> data, but only the data with <code>condition</code> equal to <code>concussed</code>.")
} else if ( !exists("controlcorr") ) {
DM.result = list(FALSE, "It seems that the variable <code>controlcorr</code> is not yet defined.")
} else if ( !exists("concussedcorr") ) {
DM.result = list(FALSE, "It seems that the variable <code>concussedcorr</code> is not yet defined")
} else if ( ! identical(controlcorr_correct, controlcorr)){
DM.result = list(FALSE, "Make sure to define the variable <code>controlcorr</code> correctly. This should be the correlation between the variables <code>vismem2</code> and <code>vermem2</code> of the subset <code>control</code>, round off to 2 decimal figures.")
} else if ( ! identical(concussedcorr_correct, concussedcorr)){
DM.result = list(FALSE, "Make sure to define the subset <code>concussedcorr</code> correctly. This should be the correlation between the variables <code>vismem2</code> and <code>vermem2</code> of the subset <code>concussed</code>, round off to 2 decimal figures.")
} else if ( !exists("correlations") ) {
DM.result = list(FALSE, "It seems that the variable <code>correlations</code> is not yet defined.")
} else if (!(student_typed("correlations = cbind(entirecorr, controlcorr, concussedcorr)")  student_typed("correlations = cbind(entirecorr, cocussedcorr, controlcorr)")  student_typed("correlations = cbind(controlcorr, concussedcorr, entirecorr)")  student_typed("correlations = cbind(controlcorr, entirecorr, concussedcorr)")  student_typed("correlations = cbind(concussedcorr, controlcorr, entirecorr)")  student_typed("correlations = cbind(concussed, entirecorr, controlcorr)"))) {
DM.result = list(FALSE, "Display all correlation coefficients together via <code>correlations</code>.")
} else if (! (output_contains("correlations"))) {
DM.result = list(FALSE, "Did you look at the output of the variable <code>correlations</code>?")
} else {
DM.result = list(TRUE, "Great Job!")
}
rm(entirecorr_correct, control_correct, concussed_correct, controlcorr_correct, concussedcorr_correct)
 Recall that you can calculate the correlation coefficient with the
cor()
command.  Create a subset of a larger dataset by the following command:
subset(impact,impact[,2]==" ")