# Assessing Classical Test Assumptions in R

In classical parametric procedures we often assume normality and constant variance for the model error term. Methods of exploring these assumptions in an ANOVA/ANCOVA/MANOVA framework are discussed here. Regression diagnostics are covered under multiple linear regression.

## Outliers

Since outliers can severly affect normality and homogeneity of variance, methods for detecting disparate observerations are described first.

The **aq.plot()** function in the mvoutlier package allows you to identfy multivariate outliers by plotting the ordered squared robust Mahalanobis distances of the observations against the empirical distribution function of the MD^{2}_{i}. Input consists of a matrix or data frame. The function produces 4 graphs and returns a boolean vector identifying the outliers.

```
# Detect Outliers in the MTCARS Data
library(mvoutlier)
outliers <-
aq.plot(mtcars[c("mpg","disp","hp","drat","wt","qsec")])
outliers # show list of outliers
```

## Univariate Normality

You can evaluate the normality of a variable using a **Q-Q plot**.

```
# Q-Q Plot for variable MPG
attach(mtcars)
qqnorm(mpg)
qqline(mpg)
```

Significant departures from the line suggest violations of normality.

You can also perform a Shapiro-Wilk test of normality with the **shapiro.test(***x***)** function, where *x* is a numeric vector. Additional functions for testing normality are available in nortest package.

## Multivariate Normality

MANOVA assumes **multivariate normality**. The function **mshapiro.test( )** in the mvnormtest package produces the Shapiro-Wilk test for multivariate normality. Input must be a numeric matrix.

```
# Test Multivariate Normality
mshapiro.test(M)
```

If we have *p* x 1 multivariate normal random vector then the squared Mahalanobis distance between ** x** and μ is going to be chi-square distributed with

*p*degrees of freedom. We can use this fact to construct a

**Q-Q plot**to assess multivariate normality.

```
# Graphical Assessment of Multivariate Normality
x <- as.matrix(mydata) # n x p numeric matrix
center <- colMeans(x) # centroid
n <- nrow(x); p <- ncol(x); cov <- cov(x);
d <-
mahalanobis(x,center,cov) # distances
qqplot(qchisq(ppoints(n),df=p),d,
main="QQ Plot Assessing Multivariate Normality",
ylab="Mahalanobis D2")
abline(a=0,b=1)
```

## Homogeneity of Variances

The **bartlett.test( )** function provides a parametric K-sample test of the equality of variances. The **fligner.test( )** function provides a non-parametric test of the same. In the following examples **y** is a numeric variable and **G** is the grouping variable.

```
# Bartlett Test of Homogeneity of Variances
bartlett.test(y~G, data=mydata)
# Figner-Killeen Test of Homogeneity of Variances
fligner.test(y~G, data=mydata)
```

The **hovPlot( )** function in the **HH** package provides a graphic test of homogeneity of variances based on Brown-Forsyth. In the following example, **y** is numeric and **G** is a grouping factor. Note that G **must** be of type factor.

```
# Homogeneity of Variance Plot
library(HH)
hov(y~G, data=mydata)
hovPlot(y~G,data=mydata)
```

## Homogeneity of Covariance Matrices

MANOVA and LDF assume homogeneity of variance-covariance matrices. The assumption is usually tested with **Box's M**. Unfortunately the test is very sensitive to violations of normality, leading to rejection in most typical cases. **Box's M** is available via the **boxM** function in the biotools package.

## To Practice

Take the Introduction to Statistics in R course to grow your statistical skills futher.