
Load house price data from https://github.com/rashida048/Datasets/blob/master/home_data.csv

(see datacamp histogram tutorial)

house <- read.csv("https://raw.githubusercontent.com/rashida048/Datasets/master/home_data.csv")

head(house)      # display the first few lines of data frame

dim(house)       # get the dimensions of data frame

Summary of price column from house data frame

Get a quick numeric summary of a data set with R's summary(...) function.

  • 'Min.' = 'minimum' value of data
  • '1st Qu.' = '1st quartile' of data (1/4 of data is below this value, 3/4 above)
  • 'Median' = 2nd quartile of data (1/2 of data is below this value, 1/2 above)
  • 'Mean' = average value of data
  • '3rd Qu.' = '3rd quartile' of data (3/4 of data is below, 1/4 is above)
  • 'Max.' = 'maximum' value of data

summary(house$price)
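
To see where the six numbers come from, here is a minimal sketch on a made-up vector (the values in `prices` are hypothetical and are not taken from home_data.csv). It compares summary(...) with the same quantities computed one at a time using quantile(...) and mean(...).

```r
# Hypothetical toy prices (NOT from home_data.csv), chosen so the quartiles are easy to read off
prices <- c(100, 200, 300, 400, 1000)

s <- summary(prices)
print(s)

# The same numbers, computed individually
five_num <- quantile(prices, probs = c(0, 0.25, 0.5, 0.75, 1))  # min, quartiles, max
avg <- mean(prices)                                             # pulled upward by the single large value

print(five_num)
print(avg)
```

In this toy vector the mean is larger than the median because of the single large value, which is the same pattern to look for in right-skewed data such as prices.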

Plot house prices in 1-D

Include vertical line marking the mean house price (color = red, line width = 3)

plot(house$price, rep(0,length(house$price)))

abline(v=mean(house$price), col='red', lwd=3)

Better plot

... maybe it is better to spread out the data (plot vs index)

Include horizontal line marking the mean price (color = red, line width = 3)

plot(house$price)

abline(h=mean(house$price), col='red', lwd=3)

Histograms.

A better way to visualize the data is to construct a histogram!

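A histogram is a cheap, non-parametric estimate of the underlying density function. (By default hist() plots counts; with probability=TRUE the bars are rescaled so that their total area is 1.)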

hist(house$price)
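
The link between a histogram and a density can be checked numerically. The sketch below uses a small made-up vector `x` and hand-picked breaks (both are assumptions for illustration, not part of the house data) and rescales the bin counts so that the bars have total area 1.

```r
# Hypothetical toy data (NOT the house prices)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)

# Compute the bins without drawing anything
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

n <- length(x)
width <- diff(h$breaks)              # bin widths (all equal to 2 here)
by_hand <- h$counts / (n * width)    # count / (n * width) = bar height on the density scale

print(h$counts)
print(by_hand)
print(h$density)                     # R's own density-scale heights
print(sum(by_hand * width))          # total area of the bars
```

The hand-computed heights should agree with h$density, which is what hist(..., probability=TRUE) draws.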

Better Histograms.

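By default, hist() chooses the number of bins with "Sturges' Rule", ceiling(log2(n) + 1), which depends only on the sample size n. It tends to use far too few bins for large or skewed samples such as house prices.

As a graduate of Rice University (and former student of Prof. David Scott) I suggest that you use "Scott's Rule", which picks the bin width from the sample standard deviation and n.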

hist(house$price, breaks="scott")
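
To compare the two rules side by side, the sketch below computes both bin counts by hand and checks them against R's helper functions nclass.Sturges() and nclass.scott() from grDevices. The sample `x` is simulated skewed data standing in for prices (an assumption for illustration). Note that hist() treats the resulting number as a suggestion and may adjust it to get "pretty" break points.

```r
set.seed(1)                                    # reproducible toy sample
x <- rexp(5000)                                # hypothetical right-skewed data, NOT the house prices
n <- length(x)

# Sturges: number of bins depends only on n
k_sturges <- ceiling(log2(n) + 1)

# Scott: bin width from the sample sd (R rounds Scott's constant, about 3.49, to 3.5)
h_scott <- 3.5 * sd(x) * n^(-1/3)
k_scott <- ceiling(diff(range(x)) / h_scott)

print(c(sturges = k_sturges, scott = k_scott))
print(c(sturges = grDevices::nclass.Sturges(x), scott = grDevices::nclass.scott(x)))
```

For a sample of this size and shape, Scott's rule asks for noticeably more bins than Sturges' rule, which is why the second histogram shows more detail.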

Kernel-based Density Estimate

These days, we have computers doing the dirty work... so we can do better than histograms! We can make a smooth, kernel-based density estimate for our sample distribution using density(...)

A kernel-based density estimate is made by replacing each sample point with a small bump (the kernel, for example a Gaussian curve) of total area 1/n and then adding up all of the bumps, so the result has total area 1. (Variations allow weighting some sample points more or changing the shape of each bump.)

hist(house$price, breaks="scott", probability=TRUE)  # plot histogram for comparison

lines(density(house$price), col='red', lwd=3)        # kernel-based density estimate
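
The "add up one small distribution per point" description can be sketched by hand. The example below assumes a Gaussian kernel and uses bw.nrd0(), the default bandwidth rule of density(); the sample `x` and the evaluation grid are made up for illustration. It is a plain sum of kernels, whereas density() uses a faster binned approximation, so the two curves should be close but not identical.

```r
# Hypothetical toy sample (NOT the house prices)
x <- c(2, 3, 3.5, 6, 9)
n <- length(x)
bw <- bw.nrd0(x)                       # default bandwidth rule used by density()

# Evaluation grid extending 4 bandwidths beyond the data on each side
grid <- seq(min(x) - 4 * bw, max(x) + 4 * bw, length.out = 2001)

# One Gaussian curve per sample point, each scaled to area 1/n, then summed
kde_by_hand <- sapply(grid, function(g) sum(dnorm(g, mean = x, sd = bw)) / n)

# Riemann-sum check that the total area is close to 1
area <- sum(kde_by_hand) * (grid[2] - grid[1])
print(area)
```

To look at the result, plot(grid, kde_by_hand, type="l") draws the hand-built curve and lines(density(x), lty=2) overlays R's estimate for comparison.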

'Spread' of data is measured by (sample) variance and standard deviation

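Recall: Sample variance is

   $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

While (discrete) population variance is

   $$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 $$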

In R, the functions var() and sd() compute the sample variance and sample standard deviation (both use the n - 1 denominator).

The sample variance formula is an 'unbiased estimator' for variance (we'll discuss this in a later chapter).
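
As a quick check of the two denominators, the sketch below computes both variances by hand on a small made-up vector (the values in `x` are hypothetical) and compares them with var() and sd().

```r
# Hypothetical toy data (NOT the house prices)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

ss <- sum((x - mean(x))^2)     # sum of squared deviations from the mean

sample_var <- ss / (n - 1)     # sample variance: divide by n - 1
pop_var    <- ss / n           # population-style variance: divide by n (treats x as the whole population)

print(c(by_hand = sample_var, var = var(x)))
print(c(by_hand = sqrt(sample_var), sd = sd(x)))
print(c(pop_var = pop_var, rescaled_var = var(x) * (n - 1) / n))
```

R has no separate built-in for the divide-by-n version; rescaling var(x) by (n - 1)/n, as in the last line, is one simple way to get it.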