Introduction to Statistical Learning with R
A Lesson in Statistical Learning
This post follows along with Springer’s An Introduction to Statistical Learning with R, with some changes to the book’s R code to make it more reproducible and to introduce good R practices.
We will mainly focus on statistical learning in this lesson, but what is statistical learning? Essentially, it is what I have described in other posts where I referred to Machine Learning. Statistical Learning can be used interchangeably with Machine Learning.
Let us pretend we have a dataset with predictors for weather. Dew point, temperature, and wind speed could all be input variables (aka independent variables, predictors, features). These independent variables are often denoted as $X_n$. For example, $X_1 = Temperature$ and $X_2 = Dewpoint$, etc. These are used to predict whether it will be sunny or rainy, also known as the output variable (aka dependent variable or response). Normally the output variable is denoted with $Y$.
For now, $Y$ will be a quantitative response with $p$ different predictors. We will use $X$ in our formula to represent all of the predictors, so $X = (X_1, X_2, … , X_p)$. Our formula becomes…
$$ \begin{equation} Y = f(X) + \epsilon \end{equation} $$
where $f$ is a fixed but unknown function of $X_1$, $X_2$, … , $X_p$ and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero.
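To make the formula concrete, here is a minimal simulation sketch in R, assuming a made-up $f$ and a single predictor (all the values here are invented for illustration):

```r
### A quick simulation of Y = f(X) + epsilon with a made-up f
set.seed(42)
n <- 100
x <- runif(n, 0, 10)                   # a single predictor X
f <- function(x) 2 + 3 * x             # the "true" f, normally unknown to us
epsilon <- rnorm(n, mean = 0, sd = 1)  # random error, independent of X, mean zero
y <- f(x) + epsilon                    # the observed response Y
round(mean(epsilon), 2)                # close to 0, as the definition requires
```

In real problems we only ever observe `x` and `y`; the whole point of statistical learning is to recover something like `f` from them.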
Taking a look at some advertising sample data
In the following sections I will use some [$R$](https://en.wikipedia.org/wiki/R_(programming_language)) to show an example of the statistical learning concepts we are discussing, as well as some basic $R$ programming skills.
### Load the libraries
### If you have never installed an R package before, then this will cause an error
### To install the following packages you can either do them one at a time with:
### install.packages("DBI")
### Or you can do several at a time using the `c()` function (concatenate)
### install.packages(c("DBI", "RPostgres",...))
library(DBI)
library(RPostgres)
library(getPass)
library(ggplot2)
library(gridExtra)
library(ggthemes)
### In this block I am connecting to my personal Postgres Database
### If you subscribe to my blog you can get access to this data, too, and follow along more easily
### Using `readline()` I can enter my personal information every time without needing to create a new argument
### `getPass()` lets me hide sensitive data
dbname = readline("Database: ")
host = readline("Host: ")
port = 5432
user = readline("Username: ")
pwd = getPass("Password: ")
### The function to connect to the Postgres instance
con <- dbConnect(Postgres(),
                 dbname = dbname,
                 host = host,
                 port = port,
                 user = user,
                 password = pwd)
### Using SQL to pull the information I need
adv <- dbGetQuery(con, "SELECT * FROM machine_learning.advertising")
### View the first couple of rows with the `head()` function
head(adv)
| index | tv | radio | newspaper | sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
| 6 | 8.7 | 48.9 | 75.0 | 7.2 |
### R has a really fantastic plotting package (dare I say better than Python's Matplotlib...) called ggplot
### ggplot will be the basis of many of our visualizations
### Since I will be repeating a lot of typing with similar functionality, I decided to create a function
### that will repeat the same scatterplot making as many times as I use the function
### I want to define the x input (the independent variable), the y input (dependent variable) and
### the dataset that I will use
scatter_plot_sales <- function(x, y, data) {
  # ggplot requires that the variables x, y be passed as arguments
  # but the arguments don't technically exist in our local function
  # yet, since we have not defined our dataset.
  # This would cause R to throw an error, because there is no 'sales' variable etc...
  # With `enquo()` (from the rlang package, which ggplot2 depends on)
  # we capture the parameter of the function as an expression.
  # That means we do not evaluate it; we just take the expression, leave it on hold, and use it later on.
  x <- rlang::enquo(x)
  y <- rlang::enquo(y)
  # The "!!" in the following line takes those x, y expressions from earlier
  # and unquotes them for ggplot to evaluate against the data
  plot <- ggplot(data = data, aes(x = !!x, y = !!y)) +
    # Add the points as shape=1 (donuts) in red with a stroke width of 1.5
    geom_point(shape = 1, color = "red", stroke = 1.5) +
    # Add the linear model line and remove the standard error band (se)
    geom_smooth(method = lm, se = FALSE, size = 2)
  return(plot)
}
tv <- scatter_plot_sales(tv, sales, adv)
radio <- scatter_plot_sales(radio, sales, adv)
newspaper <- scatter_plot_sales(newspaper, sales, adv)
### Make sure the graph is wide and not all scrunched together
options(repr.plot.width=15, repr.plot.height=6)
### Grid arrange allows us to plot several graphs in the same row
grid.arrange(tv, radio, newspaper, nrow=1)
Estimating f
There are two main reasons why we are trying to estimate $f$:
- Prediction: trying to estimate future values based on input variables
- Inference: we are not looking to predict $Y$, but instead to understand the relationship between $X$ and $Y$
Prediction
In most cases, $X$ (the input variables) is widely available, but $Y$ (the response) is not easily obtainable. In this setting our error term $\epsilon$ averages to zero. I was initially confused about why $\epsilon$ averages to zero, since I have not really done stats modeling in a while. The reason is that when we are just predicting, we have no actual value to compare against, so whatever the output, the error averages to zero. In our previous model, $\epsilon$ was the distance of each point from our linear model line; think of the red dots’ distance from the blue line in the charts above. We can now predict $Y$ using the following function.
$$ \begin{equation} \hat{Y} = \hat{f}({X}) \end{equation} $$
$\hat{f}$ is our prediction for the function of $f$, and $\hat{Y}$ represents the resulting prediction for Y. $\hat{f}$ is normally treated as a black box when we are trying to predict because we are focused on getting an accurate prediction for $Y$.
Reducible and Irreducible Error
Let us assume that $X_1, … , X_p$ are the characteristics of certain weather measurements, and $Y$ is the chance of precipitation. If we are planning a day at the beach we are very interested in using $X$ to predict $Y$, but how do we measure the accuracy of $\hat{Y}$ as a prediction for $Y$? This depends on two quantities, reducible error and irreducible error.
Reducible Error
Mostly, $\hat{f}$ will not be a perfect estimate for $f$, and this imperfection will in turn create some error. We consider this error reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate technique to estimate $f$.
Irreducible Error
Let’s consider that even if it were possible to form a perfect estimate for $f$ so that $\hat{Y} = f(X)$, our prediction would still have some error in it (think back to the charts above where $Y$ and $f(X)$ are known, but there is still some error when trying to get $f(X)$ to equal $Y$). This is because $Y$ is also a function of $\epsilon$, which by definition cannot be predicted using $X$. This means that the variability associated with $\epsilon$ affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate $f$ we will not be able to reduce the error introduced by $\epsilon$.
The reason the irreducible error always exists is that $\epsilon$ may contain unmeasured variables that would be useful in predicting the response. Since they are not contained in $f$, their effect ends up in $\epsilon$.
Consider the following formula, $E(Y-\hat{Y})^2$, which represents the average (or expected) squared difference between the actual and predicted values. Given that we know $Y = f(X) + \epsilon$ and $\hat{Y} = \hat{f}(X)$, we can substitute these two equations to show the reducible and irreducible error.
$$ \begin{equation} E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 \end{equation} $$
$$ \begin{equation} \hspace{3.3cm} = \underbrace{[f(X) - \hat{f}(X)]^2} _ {\text{reducible error}} + \underbrace{Var(\epsilon)} _ {\text{irreducible error}} \end{equation} $$
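A small simulation can make the decomposition concrete: when the model class matches the true $f$, the reducible error shrinks toward zero and the MSE approaches $Var(\epsilon)$. The data below is simulated, so $Var(\epsilon)$ is known by construction:

```r
### Simulated data where f is truly linear and Var(epsilon) = 4
set.seed(1)
n <- 10000
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 2)   # epsilon ~ N(0, 4)
fit <- lm(y ~ x)                    # f-hat: a near-perfect estimate of f here
mse <- mean((y - predict(fit))^2)   # reducible error ~ 0, so mse ~ Var(epsilon) = 4
round(mse, 2)
```

No matter how much data we add or how well `lm()` fits, the error cannot drop below the variance of `epsilon`.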
Inference
Where in prediction we were solely focused on the prediction of $Y$, in inference we are curious about how $Y$ is affected as $X_1, … , X_p$ change. Our goal is to estimate $f$, but we are not necessarily interested in making predictions for $Y$. We are more focused on the relationship between $X$ and $Y$.
We are essentially interested in answering 3 questions:

1. Which predictors are associated with the response?
   - Often, only a few predictors are significantly associated with $Y$, and identifying those few important predictors is, obviously, very important.
2. What is the relationship between the response and each predictor?
   - Some predictors have positive relationships, some have negative relationships, and some are interwoven with other predictors.
3. Can you develop a linear equation from $Y$ and the predictors, or is a more complex model required?
   - Sometimes models need to be more complex than a simple linear equation.
A good example is the charts we created earlier: we may be more interested in which advertising medium increases sales (i.e. focusing on our predictors) than in predicting our sales.
How do we Estimate $f$?
Let $x_{ij}$ represent the value of the $j$th predictor, or input, for observation $i$, where $i = 1,2,…,n$ and $j = 1,2,…,p$. Correspondingly, let $y_i$ represent the training data response variable for the $i$th observation. Then our training data consist of:
$$ \begin{equation} \{(x_1, y_1), (x_2, y_2), … , (x_n, y_n)\} \quad \text{where} \quad x_i = (x_{i1}, x_{i2}, … , x_{ip})^T \end{equation} $$
The ultimate goal is to apply statistical learning to the training data to estimate the unknown function $f$.
Parametric Methods
The parametric method relies on estimating a set of parameters that best fit $f$, i.e. it will look at how your inputs affect $f$. If your inputs (predictors) for $\hat{Y}$ (predicted fuel burn) are engine size and miles traveled, it will assign a coefficient value to each $\beta_p$. For example:
$$ \begin{equation} \beta_0 = 10 \quad (\text{i.e. your intercept}) \end{equation} $$
$$ \begin{equation} \beta_1 = .5 \quad (\text{multiply this by engine size}) \end{equation} $$
$$ \begin{equation} \beta_2 = .75 \quad (\text{multiply this by miles traveled}) \end{equation} $$
$$ \begin{equation} \hat{Y} = 10 + .5X_1 + .75X_2 \end{equation} $$
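As a sanity check, the formula above can be computed directly in R (the engine size and miles traveled values below are made up):

```r
### Plugging the example coefficients into the formula by hand
beta_0 <- 10                            # intercept
beta_1 <- 0.5                           # per unit of engine size
beta_2 <- 0.75                          # per unit of miles traveled
engine_size    <- 2                     # X1 (made-up value)
miles_traveled <- 12                    # X2 (made-up value)
y_hat <- beta_0 + beta_1 * engine_size + beta_2 * miles_traveled
y_hat                                   # 10 + 1 + 9 = 20
```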
An issue we will run into with the parametric approach is that it will almost never match the true unknown form of $f$. We can try to minimize this mismatch by selecting more flexible models that estimate many more parameters, but then we run into the issue of overfitting the data.
When we are using a parametric model approach, we take a two-step approach:

1. We need to take a guess or make an assumption about the functional form or shape of $f$. For example, if the model is very simple (think of plotting engine size versus miles per gallon) we can assume that the model is most likely linear, i.e.
$$ \begin{equation} f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + … + \beta_pX_p \end{equation} $$
Once we have assumed that $f$ is linear, we only need to estimate the $p + 1$ coefficients, eliminating the need to estimate an arbitrary p-dimensional function.
2. Once the model is selected, we need a procedure to train the model. Since in step 1 we assumed a linear model, we need to find $\beta$ (i.e. the coefficients), better known as the parameters $\beta_0, \beta_1, … , \beta_p$.
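A minimal sketch of this training step uses `lm()`, which estimates the parameters by least squares. The data here is simulated with known true coefficients, purely for illustration:

```r
### lm() recovers the betas from data generated with known coefficients
set.seed(7)
engine_size    <- runif(50, 1, 5)       # made-up predictor X1
miles_traveled <- runif(50, 5, 50)      # made-up predictor X2
y <- 10 + 0.5 * engine_size + 0.75 * miles_traveled + rnorm(50, sd = 0.5)
fit <- lm(y ~ engine_size + miles_traveled)
round(coef(fit), 2)   # estimates should land near 10, 0.5, and 0.75
```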
Non-Parametric Methods
Non-parametric methods make no assumptions about the functional form of $f$; they focus instead on fitting the data points closely while trying not to be too rough or wiggly. Because no assumptions are made about the parameters, we have the ability to fit a wider range of shapes for $f$. Non-parametric methods do need a significant amount of data, since they fit the data points without relying on a small set of parameters.
Prediction Accuracy vs Model Interpretability TradeOff
In the figure above, there are some model type examples that show the trade-off between interpretability and flexibility. So the question is: when would I want interpretability and when would I want flexibility? Let us think back to our two reasons for estimating $f$ in the first place, prediction and inference. If we are trying to predict, we may not care how $f$ is estimated or about the inputs, so we can rely on a more flexible model. If we are worried about inference, i.e. understanding the inputs and how they affect $\hat{Y}$, we want to create a more interpretable model.
Supervised vs. Unsupervised Learning
Since supervised vs unsupervised learning is covered quite extensively in other posts, this will be an abbreviated refresher. The quickest way to remember is that supervised learning has labels (the outputs are known). Consider our linear regression model where the miles per gallon are known: since we have the mpg labels to test against, our model is supervised, as we can drop in and look at the true outputs. In unsupervised learning, the labels are not known. If we did not have mpg available, we could still see how engine size and miles traveled compare to each other, and we might infer that the bigger the engine, the less distance the vehicle can travel.
Regression vs Classification
Again, since regression vs classification is covered quite extensively in other posts, this will be an abbreviated refresher. Essentially, you will use regression when your output variables are numerical (double, float, decimal, or any other quantitative measure), like in our mpg model we have been discussing. Otherwise, if your output has classes or objects (i.e. Honda, Toyota, Nissan), then you will need to build a classification model. In the post linked above there are examples of how you can turn categorical variables into quantitative ones with Ranking and One-Hot Encoding, but this is outside the scope of this post.
Measuring Model Accuracy
It is important in statistical learning to determine which model has the best fit, given the goal you are trying to achieve. Keep in mind, some models may be more accurate than others, yet you cannot draw inferences from them because they are highly complex. No one model is inherently better than any other given model. It is up to the Data Scientist to determine what is best for their circumstance.
Quality of Fit
When determining how well our model performs, we want to see how closely the data fits the model. When building regression models, the usual quality measurement is the Mean Squared Error (MSE), where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation. Remember that $\hat{f}(x_i)$ will be our predicted value of $Y$ at point $i$ and $y_i$ is our actual value of $Y$. We square all of the differences and then divide by $n$ (i.e. the total number of observations).
$$ \begin{equation} MSE=\frac{1}{n}\sum_{i=1}^{n}\big(y_i-\hat{f}(x_i)\big)^2 \end{equation} $$
Ideally, we would want MSE to be as small as possible. The issue is, if the value is too small, there is a chance that we have overfit our data. This is a good reason why we split out our train (normally 80% of the data) and test (the remaining 20%) sets.
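The split-and-evaluate workflow can be sketched as follows; a simulated dataset stands in for real data here, and the variable names are my own:

```r
### Sketch of an 80/20 train/test split with a test-set MSE
set.seed(123)
n  <- 200
x  <- runif(n, 0, 10)
df <- data.frame(x = x, y = 1 + 2 * x + rnorm(n))
train_idx <- sample(n, size = 0.8 * n)      # 80% of the rows for training
train <- df[train_idx, ]
test  <- df[-train_idx, ]                   # the held-out 20%
fit   <- lm(y ~ x, data = train)
preds <- predict(fit, newdata = test)
test_mse <- mean((test$y - preds)^2)        # the MSE formula, on held-out data
round(test_mse, 2)
```

Evaluating MSE on the held-out test set, rather than on the training data, is what reveals overfitting.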
BiasVariance TradeOff
In order to achieve the lowest possible test error we need to build a model with the lowest possible variance and the lowest possible bias. If we consider the expected test error as the sum of three parts, the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$, the formula can be viewed as such:
$$ \begin{equation} E\big(y_0-\hat{f}(x_0)\big)^2=Var\big(\hat{f}(x_0)\big)+\big[Bias\big(\hat{f}(x_0)\big)\big]^2+Var(\epsilon) \end{equation} $$
Variance
Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training set. If we use different data, then we would expect our $\hat{f}$ to change, since the data is different, even if ever so slightly. Substituting one data point in a million-point set has the potential to change $\hat{f}$. Our goal in building a model is to make sure that it is robust enough that $\hat{f}$ does not change very much between training data sets. If a method has high variance, then small changes in the training data will drastically change $\hat{f}$. Generally, the more flexible the model, the higher the variance.
Bias
Bias refers to the error that is introduced by approximating a reallife problem, which may be extremely complicated, by a much simpler model. For example, a linear regression model is one of the simplest models that can be used in machine learning, and it is unlikely that any real world problem can easily be solved by a straight line. Yet, we tried to fit our car model to a regression model. In this case, we introduced a high bias.
In general, as we simplify our model the variance becomes lower but the bias increases, and vice-versa as we increase the complexity of our model. Granted, the trade-off may not be proportionately inverse; see the figure above.
Classification
In the previous section we mentioned classification vs regression, and since we have discussed regression at length, let’s talk a little about classification. In classification, we look to assign a class to a prediction, not a value. Let $\hat{y}_i$ be the predicted class label for the $i$th observation using $\hat{f}$, and let $I(y_i \neq \hat{y}_i)$ be an indicator variable that equals $1$ if $y_i \neq \hat{y}_i$ and $0$ if $y_i = \hat{y}_i$. We can then compute the error rate as follows:
$$ \begin{equation} \frac{1}{n}\sum_{i=1}^{n}I(y_i \neq \hat{y}_i) \end{equation} $$
If we get the prediction correct the indicator is $0$, otherwise it is $1$, which is why $\big[y_i \neq \hat{y}_i=1\big]$ and $\big[y_i = \hat{y}_i=0\big]$.
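In R, the error rate formula collapses to a one-liner: the mean of a logical vector is the proportion of `TRUE`s. The label vectors below are made up for illustration:

```r
### The error rate is just the proportion of mismatched labels
y     <- c("rainy", "sunny", "sunny", "rainy", "sunny")  # actual classes
y_hat <- c("rainy", "sunny", "rainy", "rainy", "sunny")  # predicted classes
mean(y != y_hat)   # I(y_i != y_hat_i) averaged over n; here 1/5 = 0.2
```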
Bayes Classifier
We can minimize the error rate by assigning each observation to the most likely class given its predictor values. This is a conditional probability: we assign the predictor vector $x_0$ to the class $j$ for which the following is largest:
$$ \begin{equation} Pr(Y=j\mid X=x_0) \end{equation} $$
This very simple classifier is called the Bayes Classifier. In a two-class problem there are only two classes, class 1 and class 2, and we predict class 1 when:
$$ \begin{equation} Pr(Y=1\mid X=x_0) > 0.5 \end{equation} $$
The Bayes Classifier produces the lowest possible error rate called the Bayes Error Rate. The overall error rate is given by:
$$ \begin{equation} 1-E\big(\max_{j}Pr(Y=j\mid X)\big) \end{equation} $$
K-Nearest Neighbors
In theory, we would always like to predict qualitative responses using the Bayes classifier, but we do not know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible; it is an unattainable gold standard. Many approaches attempt to estimate the conditional distribution of $Y$ given $X$ and then classify a given observation to the class with the highest estimated probability. This is what the K-nearest neighbors (KNN) classifier does. Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$, and then estimates:
$$ \begin{equation} Pr(Y=j\mid X=x_0)=\frac{1}{K}\sum_{i \in N_0}I(y_i=j) \end{equation} $$
For example, if we choose $K=3$, then the KNN classifier will look at the 3 closest neighbors of each point and assign the majority class to the point. So, if 2 of the 3 neighbors are blue and 1 is red, the model will assign blue to the point.
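A minimal sketch of this in R uses the `knn()` function from the `class` package (install it with `install.packages("class")` if needed). The training points and labels below are made up so that two clearly separated clusters exist:

```r
### KNN with K=3: each test point takes the majority class of its 3 nearest neighbors
library(class)
train_x <- data.frame(x1 = c(1, 2, 3, 10, 11, 12),
                      x2 = c(1, 2, 3, 10, 11, 12))
labels  <- factor(c("blue", "blue", "blue", "red", "red", "red"))
test_x  <- data.frame(x1 = c(2, 11), x2 = c(2, 11))
knn(train_x, test_x, cl = labels, k = 3)   # first point lands in the blue cluster, second in the red
```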
Summary and Examples
For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
(a) The sample size $n$ is extremely large, and the number of predictors $p$ is small.
- We would expect a flexible model to perform better, because we could fit the large amount of data more closely than an inflexible method could.

(b) The number of predictors $p$ is extremely large, and the number of observations $n$ is small.
- We would expect an inflexible method to perform better; since the sample size is so small, we would risk overfitting with a flexible method.

(c) The relationship between the predictors and response is highly non-linear.
- We would expect a flexible method to perform better, since the data is not easily fit by a linear model.

(d) The variance of the error terms, i.e. $\sigma^2 = Var(\epsilon)$, is extremely high.
- We would expect an inflexible method to perform better, because a flexible method would try to fit the noise introduced by the high variance.
Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
- We would build a regression model with the predictors ($p$) being profit, number of employees, and industry (which can be one-hot encoded), and a sample size ($n$) of 500. We want to know which factors affect our label (output variable) of CEO salary, which means we are interested in inference.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
- We would build a classification model since our output is binary (success or failure), with our predictors ($p$) being price, marketing budget, competition price, and the ten other variables, and our sample size ($n$) being 20 (the similar products). We are interested in prediction, because all we care about is the product’s chance of success.

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
- We would build a regression model since we want a quantitative output, with our predictors ($p$) being the % change in the US, British, and German markets, a sample size ($n$) of 52 (weeks in a year), and prediction as the goal.
### Import `college` dataset
college <- dbGetQuery(con, "SELECT * FROM machine_learning.college")
### Convert the `college` column to rownames then remove the `college` column
rownames(college) <- college[,1]
college <- college[,-1]
### Convert the `private` column to a factor
college$private <- as.factor(college$private)
### Since we pulled `college` from a Postgres DB, the class of each integer
### is integer64 and we need to convert it to int for R to understand
### loop over each of the column names
for (i in colnames(college)){
  ### if the column has a class of integer64 then...
  if (class(college[,i]) == "integer64") {
    ### convert it to a standard integer
    college[,i] <- as.integer(college[,i])
  }
  ### otherwise leave the column alone
}
### Take a look at the data
head(college)
| idx | private | apps | accept | enroll | top10perc | top25perc | f_undergrad | p_undergrad | outstate | room_board | books | personal | phd | terminal | sf_ratio | perc_alumni | expend | grad_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
| Albertson College | Yes | 587 | 479 | 158 | 38 | 62 | 678 | 41 | 13500 | 3335 | 500 | 675 | 67 | 73 | 9.4 | 11 | 9727 | 55 |
### Check the summary statistics for `college`
summary(college)
private apps accept enroll top10perc
No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
Median : 1558 Median : 1110 Median : 434 Median :23.00
Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
Max. :48094 Max. :26330 Max. :6392 Max. :96.00
top25perc f_undergrad p_undergrad outstate
Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
room_board books personal phd
Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
Median :4200 Median : 500.0 Median :1200 Median : 75.00
Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
terminal sf_ratio perc_alumni expend
Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
Median : 82.0 Median :13.60 Median :21.00 Median : 8377
Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
grad_rate
Min. : 10.00
1st Qu.: 53.00
Median : 65.00
Mean : 65.46
3rd Qu.: 78.00
Max. :118.00
### Create a pairs plot for the first 10 columns of `college`
pairs(college[,1:10], col="green", col.axis="white")
### Create a boxplot of Out of State vs Private
ggplot(data = college, mapping = aes(x = private, y = outstate)) +
geom_boxplot() +
theme_economist()
### Create a new column in the `college` dataset that determines the Elite
### status of the university
### Create a vector of "No" with one entry per row of `college`
Elite <- rep("No", nrow(college))
### If more than 50% of a college's students came from the top 10% of their
### high school class, mark it "Yes"
Elite[college$top10perc > 50] <- "Yes"
### Convert it to a factor
Elite <- as.factor(Elite)
### Add it to the `college` dataframe
college$Elite <- Elite
summary(college$Elite)
ggplot(data = college, mapping = aes(x = Elite, y = outstate)) +
geom_boxplot() +
theme_economist()
par(mfrow = c(2,2))
hist(college$top10perc, col="green", col.axis="white", col.main="white")
hist(college$top25perc, col="green", col.axis="white", col.main="white")
hist(college$expend, col="green", col.axis="white", col.main="white")
hist(college$accept/college$apps, col="green", col.axis="white", col.main="white")