Load house price data from https://github.com/rashida048/Datasets/blob/master/home_data.csv
house <- read.csv("https://raw.githubusercontent.com/rashida048/Datasets/master/home_data.csv")
head(house) # display the first few rows of the data frame
dim(house) # get the dimensions (rows, columns) of the data frame
Summary of price column from house data frame
Get a quick numeric glimpse of a set of data using the R summary() function.
- 'Min.' = 'minimum' value of data
- '1st Qu.' = '1st quartile' of data (1/4 of data is below this value, 3/4 above)
- 'Median' = 2nd quartile of data (1/2 of data is below this value, 1/2 above)
- 'Mean' = average value of data
- '3rd Qu.' = '3rd quartile' of data (3/4 of data is below, 1/4 is above)
- 'Max.' = 'maximum' value of data
summary(house$price)
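To see where these six numbers come from, here is a minimal sketch that rebuilds them from min(), quantile(), median(), mean(), and max(). The vector `prices` is a small made-up stand-in for house$price (an assumption, so the block runs without downloading anything), and `by_hand` is a name introduced only for this illustration.

```r
# Hypothetical stand-in for house$price (values invented for illustration)
prices <- c(221900, 538000, 180000, 604000, 510000, 1225000, 257500, 291850)

# Rebuild the six numbers reported by summary() from their definitions
by_hand <- c(
  "Min."    = min(prices),
  "1st Qu." = unname(quantile(prices, 0.25)),  # 1/4 of the data lies below this value
  "Median"  = median(prices),                  # 2nd quartile
  "Mean"    = mean(prices),
  "3rd Qu." = unname(quantile(prices, 0.75)),  # 3/4 of the data lies below this value
  "Max."    = max(prices)
)

print(by_hand)
print(summary(prices))  # same numbers, apart from the rounding applied when printing
```

Replacing `prices` with house$price should reproduce the summary shown above.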
Plot house prices in 1-D
Include vertical line marking the mean house price (color = red, line width = 3)
plot(house$price, rep(0,length(house$price)))
abline(v=mean(house$price), col='red', lwd=3)
Better plot
... the 1-D plot piles every point onto a single line, so maybe it is better to spread out the data (plot price vs. index)
Include horizontal line marking the mean price (color = red, line width = 3)
plot(house$price)
abline(h=mean(house$price), col='red', lwd=3)
Histograms.
A better way to visualize the data is to construct a histogram!
A histogram is a cheap, non-parametric estimate of the underlying density function.
hist(house$price)
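One way to see the "density estimate" interpretation: on the density scale, each bar's height is (count in bin) / (n × bin width), so the bar areas add up to 1. The sketch below checks this. It uses simulated right-skewed values in `prices` as a stand-in for house$price (an assumption, not the real data).

```r
set.seed(1)
# Simulated right-skewed "prices" (an assumption standing in for house$price)
prices <- rlnorm(500, meanlog = 13, sdlog = 0.5)

h <- hist(prices, plot = FALSE)   # compute the histogram without drawing it
widths <- diff(h$breaks)          # width of each bin
n <- length(prices)

# Bar height on the density scale: fraction of the sample in the bin, per unit width
dens_by_hand <- h$counts / (n * widths)

print(all.equal(dens_by_hand, h$density))  # matches what hist() stores in $density
print(sum(h$density * widths))             # total area under the bars
```

Calling hist(prices, probability = TRUE) draws the bars on this density scale instead of raw counts.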
Better Histograms.
The default rule for choosing the number of bins in hist() is "Sturges' Rule", which is terrible for large or skewed samples: it picks too few bins and oversmooths the data.
As a graduate of Rice University (and former student of Prof. David Scott) I suggest that you use "Scott's Rule".
hist(house$price, breaks="scott")
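To make the difference concrete, here is a rough sketch of the two rules. Sturges' rule uses ceiling(log2(n) + 1) bins regardless of how spread out the data are, while Scott's rule picks a bin width proportional to sd × n^(-1/3) (R uses the constant 3.5). The data in `prices` are simulated (an assumption standing in for house$price); `k_sturges`, `h_scott`, and `k_scott` are names made up for this example.

```r
set.seed(1)
# Simulated right-skewed "prices" (an assumption standing in for house$price)
prices <- rlnorm(5000, meanlog = 13, sdlog = 0.5)
n <- length(prices)

# Sturges' rule: the bin count depends only on the sample size
k_sturges <- ceiling(log2(n) + 1)

# Scott's rule: the bin width depends on the spread of the data and on n
h_scott <- 3.5 * sd(prices) * n^(-1/3)
k_scott <- ceiling(diff(range(prices)) / h_scott)

print(c(sturges = k_sturges, scott = k_scott))

# Compare with the helper functions R itself uses for breaks = "Sturges" / "scott"
print(c(grDevices::nclass.Sturges(prices), grDevices::nclass.scott(prices)))
```

Note that hist() treats these bin counts as suggestions: the actual breakpoints are rounded to "pretty" values, so the plotted number of bars can differ slightly.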
Kernel-based Density Estimate
These days, we have computers doing the dirty work... so we can do better than histograms! We can make a smooth, kernel-based density estimate for our sample distribution using density(...)
A kernel-based density estimate is made by replacing each sample point with a small bump (a kernel: a probability distribution scaled to have total area 1/n) and then adding up all of the bumps, so the total area is 1. (Variations allow weighting some sample points more heavily, or changing the shape or width of each bump.)
hist(house$price, breaks="scott", probability=TRUE) # plot histogram for comparison
lines(density(house$price), col='red', lwd=3) # kernel-based density estimate
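The "add up one bump per sample point" recipe can be sketched directly. The block below assumes Gaussian bumps (the default kernel in density()) and the default bandwidth from bw.nrd0(), and uses a simulated sample `x` rather than house$price. The names `grid` and `kde_by_hand` are introduced only for this illustration.

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)  # simulated sample (stand-in for real data)
n <- length(x)
bw <- bw.nrd0(x)                     # default bandwidth rule used by density()

# Evaluation grid covering the data plus 3 bandwidths on each side
grid <- seq(min(x) - 3 * bw, max(x) + 3 * bw, length.out = 512)

# Each sample point contributes a Gaussian bump of area 1/n centered at that point;
# the estimate at g is the sum of all bumps evaluated at g
kde_by_hand <- sapply(grid, function(g) sum(dnorm(g, mean = x, sd = bw)) / n)

# Compare with density() on the same grid and bandwidth
d <- density(x, bw = bw, from = min(grid), to = max(grid), n = 512)
print(max(abs(kde_by_hand - d$y)))            # small; density() uses a fast approximation

# Riemann-sum check that the total area is close to 1
print(sum(kde_by_hand) * (grid[2] - grid[1]))
```

The direct sum costs on the order of n × (grid size) evaluations, which is why density() relies on a binned approximation for speed; for plotting purposes the two curves are indistinguishable.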
'Spread' of data is measured by (sample) variance and standard deviation
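Recall: Sample variance is
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ is the sample mean and $n$ is the sample size.
While (discrete) population variance is
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$
where $\mu$ is the population mean and $N$ is the population size.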
In R, the commands 'var' and 'sd' compute the sample variance and the sample standard deviation (both divide by n - 1, not n).
The sample variance formula is an 'unbiased estimator' for the population variance (we'll discuss this in a later chapter).
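As a quick check of what 'var' and 'sd' compute, the sketch below evaluates the sample variance by hand (dividing by n - 1) and compares it with the built-in functions. The vector `prices` is a small made-up stand-in for house$price (an assumption); `s2_by_hand` and `pop_var` are illustrative names.

```r
# Hypothetical stand-in for house$price (values invented for illustration)
prices <- c(221900, 538000, 180000, 604000, 510000, 1225000, 257500, 291850)
n <- length(prices)
xbar <- mean(prices)

# Sample variance: sum of squared deviations divided by n - 1
s2_by_hand <- sum((prices - xbar)^2) / (n - 1)

# Population formula applied to the same values: divide by n instead
pop_var <- sum((prices - xbar)^2) / n

print(c(by_hand = s2_by_hand, var = var(prices)))       # the two agree
print(c(sqrt_var = sqrt(s2_by_hand), sd = sd(prices)))  # sd() is the square root of var()
print(pop_var)                                          # equals var(prices) * (n - 1) / n
```

For a data set as large as the house data the factor (n - 1) / n is very close to 1, so the two formulas give nearly the same number.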