## Logistic Regression in R Tutorial

**Discover all about logistic regression: how it differs from linear regression, how to fit and evaluate these models in R with the glm() function, and more!**

Logistic regression is yet another technique borrowed by machine learning from the field of statistics. It's a powerful statistical way of modeling a binomial outcome with one or more explanatory variables. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.

This R tutorial will guide you through a simple execution of logistic regression:

- You'll first explore the theory behind logistic regression: you'll learn more about the differences with linear regression and what the logistic regression model looks like. You'll also discover multinomial and ordinal logistic regression.
- Next, you'll tackle logistic regression in R: you'll not only explore a data set, but you'll also fit the logistic regression models using the powerful `glm()` function in R, evaluate the results and address overfitting.

**Tip**: if you're interested in taking your skills with linear regression to the next level, consider also DataCamp's Multiple and Logistic Regression course!

### Regression Analysis: Introduction

As the name already indicates, logistic regression is a regression analysis technique. Regression analysis is a set of statistical processes that you can use to estimate the relationships among variables. More specifically, you use this set of techniques to model and analyze the relationship between a dependent variable and one or more independent variables. Regression analysis helps you to understand how the typical value of the dependent variable changes when one of the independent variables is adjusted and others are held fixed.

There are various regression techniques. You can distinguish them by looking at three aspects: the number of independent variables, the type of the dependent variable and the shape of the regression line.

#### Linear Regression

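Linear regression is one of the most widely known modeling techniques. It allows you, in short, to use a linear relationship to predict the (average) numerical value of a response $Y$ for a given value of a predictor $X$.

As a consequence, the linear regression model is $Y = \beta_0 + \beta_1 X$, which assumes that the response $Y$ is quantitative. In many situations, however, the response is qualitative (categorical).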

Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, the methods that are often used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification.

Linear regression is not well suited to predicting probabilities. If you use linear regression to model a binary response variable, for example, the resulting model may not restrict the predicted Y values to the range between 0 and 1. Here's where logistic regression comes into play: it gives you a probability score that reflects the probability of the occurrence of the event.
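To see the problem concretely, here is a minimal sketch on simulated data. The data-generating process, the seed and the variable names are assumptions made for illustration and are not part of the tutorial's dataset. It fits a straight line and a logistic model to the same 0/1 response and compares the range of the fitted values.

```r
# Simulated binary response; all names and parameters are illustrative assumptions.
set.seed(1)
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))

# Straight-line fit to the 0/1 response
lin_fit <- lm(y ~ x)

# Logistic regression fit to the same data
log_fit <- glm(y ~ x, family = binomial)

# Fitted values from lm() are not constrained to [0, 1]
range(fitted(lin_fit))

# Fitted values from glm() are probabilities, so they stay within [0, 1]
range(fitted(log_fit))
```

At the extremes of `x`, the straight line drifts below 0 and above 1, while the logistic fit flattens out.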

#### Logistic Regression

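Logistic regression is a classification technique that you can use to predict a qualitative response. More specifically, logistic regression models the probability that the response $Y$ belongs to a particular category.

That means that, if you are trying to do gender classification, where the response is `male` or `female`, you'll use logistic regression models to estimate the probability that an observation belongs to a particular category.

For example, the probability of `female` given a predictor $X$ can be written as:

$$\Pr(\text{gender} = \text{female} \mid X)$$

The values of this probability, abbreviated as $p(X)$, range between $0$ and $1$. Then, for any given value of $X$, a prediction can be made for the response.

Given $X$ as the explanatory variable and $Y$ as the response variable, how should you model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$? The linear regression model represents these probabilities as:

$$p(X) = \beta_0 + \beta_1 X$$

The problem with this approach is that, any time a straight line is fit to a binary response that is coded as $0$ or $1$, you can in principle always predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others.

To avoid this problem, you can use the logistic function to model $p(X)$, which gives outputs between $0$ and $1$ for all values of $X$:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

The logistic function will always produce an S-shaped curve, so regardless of the value of $X$, you will obtain a sensible prediction.

The above equation can also be reframed as:

$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$

The quantity $p(X) / (1 - p(X))$ is called the odds, and can take on any value between $0$ and $\infty$. Values of the odds close to $0$ and $\infty$ indicate very low and very high probabilities, respectively.

By taking the logarithm of both sides from the equation above, you get:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

The left-hand side is called the **logit** (or log-odds). In a logistic regression model, increasing *X* by one unit changes the logit by $\beta_1$. The amount that $p(X)$ changes due to a one-unit change in *X* will depend on the current value of *X*. But regardless of the value of *X*, if $\beta_1$ is positive then increasing *X* will be associated with increasing $p(X)$, and if $\beta_1$ is negative then increasing *X* will be associated with decreasing $p(X)$.

The coefficients $\beta_0$ and $\beta_1$ are unknown and must be estimated from the available training data. For logistic regression, the usual method is maximum likelihood.

You seek estimates for $\beta_0$ and $\beta_1$ such that plugging them into the model for $p(X)$ yields a number close to $1$ for all observations in the category coded as $1$, and a number close to $0$ for all others.

This intuition can be formalized using a mathematical equation called a likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)$$

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function.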
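The maximum likelihood idea can be sketched directly in R. The example below is only an illustration on simulated data: the "true" coefficients, the sample size and the helper name `neg_log_lik` are assumptions. It checks that the logit is linear in the predictor, then maximizes the likelihood numerically with `optim()` and compares the result with `glm()`, which maximizes the same likelihood.

```r
# Simulated data; the true coefficients below are assumptions for illustration.
set.seed(42)
n <- 500
x <- rnorm(n)
beta_true <- c(-0.5, 1.5)
p <- plogis(beta_true[1] + beta_true[2] * x)  # logistic function
y <- rbinom(n, size = 1, prob = p)

# Odds and logit follow directly from p(X); the logit is linear in x
odds  <- p / (1 - p)
logit <- log(odds)
all.equal(logit, beta_true[1] + beta_true[2] * x)

# Negative log-likelihood of the logistic model (hypothetical helper).
# plogis(..., log.p = TRUE) computes log p and log(1 - p) in a stable way.
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

# Maximizing the likelihood is the same as minimizing its negative log
mle <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# glm() maximizes the same likelihood
fit <- glm(y ~ x, family = binomial)

rbind(optim = mle$par, glm = unname(coef(fit)))
```

The two rows should agree closely. Working with the log of the likelihood turns the product into a sum, which is easier to optimize numerically.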

#### Multinomial Logistic Regression

So far, this tutorial has only focused on binomial logistic regression, since you were classifying instances as male or female. The multinomial logistic regression model is a simple extension of the binomial logistic regression model, which you use when the response variable has more than two nominal (unordered) categories.

In multinomial logistic regression, the response variable is dummy coded into multiple 1/0 variables. There is a variable for all categories but one, so if there are M categories, there will be M - 1 dummy variables. The remaining category serves as the reference category.

The multinomial logistic regression then estimates a separate binary logistic regression model for each of those dummy variables. The result is M - 1 binary logistic regression models, each comparing one category with the reference category.
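The dummy coding described above can be illustrated with base R. This is a minimal sketch: the `fruit` variable is hypothetical, and `model.matrix()` is used here only to show what the 1/0 coding of a factor with M = 3 categories looks like.

```r
# A hypothetical nominal variable with M = 3 categories (illustrative only)
fruit <- factor(c("apple", "banana", "cherry", "banana", "apple"))
M <- nlevels(fruit)

# Treatment (dummy) coding: the first level, "apple", is the reference category.
# Dropping the intercept column leaves one 1/0 column per remaining category.
dummies <- model.matrix(~ fruit)[, -1, drop = FALSE]
dummies

# There is one dummy variable fewer than there are categories
ncol(dummies) == M - 1
```

Rows that belong to the reference category have 0 in every dummy column.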

#### Ordinal Logistic Regression

Next to multinomial logistic regression, you also have ordinal logistic regression, which is another extension of binomial logistic regression. Ordinal regression is used to predict a dependent variable with multiple 'ordered' categories from one or more independent variables. You already see this coming back in the name of this type of logistic regression, since "ordinal" means "order of the categories".

In other words, it models the relationship between a dependent variable with multiple ordered levels and one or more independent variables.

For example, suppose you are doing customer interviews to evaluate satisfaction with a newly released product. You are tasked with asking each respondent a question whose answer falls on an ordered scale.
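In R, an ordered response is usually stored as an ordered factor. The sketch below uses made-up satisfaction levels and answers (both are assumptions for illustration) to show how the ordering is declared and what it enables.

```r
# Hypothetical satisfaction levels, chosen only for illustration
levels_ordered <- c("low", "medium", "high")
answers <- c("medium", "high", "low", "high", "medium")

# ordered = TRUE tells R that low < medium < high
satisfaction <- factor(answers, levels = levels_ordered, ordered = TRUE)

is.ordered(satisfaction)

# Comparisons respect the declared order
satisfaction >= "medium"

# Counts are reported in level order, not alphabetical order
table(satisfaction)
```

A response coded this way can then be passed to an ordinal model; one option is `polr()` from the `MASS` package, which is not covered in this section.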

### Logistic Regression in R with `glm`

In this section, you'll study an example of binary logistic regression. The `ISLR` package provides the data set, and the `glm()` function, which is generally used to fit generalized linear models, fits the logistic regression model.
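Before turning to the tutorial's data, here is the general shape of a `glm()` call for logistic regression. This is a stand-in sketch on R's built-in `mtcars` data, not the dataset used in this tutorial; the choice of `am` as response and `wt` as predictor is an assumption made purely for illustration.

```r
# Stand-in example on the built-in mtcars data (not the tutorial's dataset):
# model the probability of a manual transmission (am = 1) from car weight (wt).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Coefficients are on the logit (log-odds) scale
coef(fit)

# type = "response" returns predicted probabilities instead of log-odds
probs <- predict(fit, type = "response")
head(probs)
```

Setting `family = binomial` is what makes `glm()` fit a logistic regression rather than another generalized linear model.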

#### Loading Data

The first thing to do is to install and load the `ISLR` package, which has all the datasets you're going to use.

`source("requirements.R")`

```
knitr::opts_chunk$set(
dpi = 150,
fig.width = 6,
fig.height = 4,
message = FALSE,
warning = FALSE)
```

`require(ISLR)`

For this tutorial, you're going to work with the Smarket dataset within RStudio. The dataset shows daily percentage returns for the S&P 500 stock index between 2001 and 2005.

#### Exploring Data

Let's explore it for a bit. `names()` is useful for seeing what's in the data frame, `head()` gives a glimpse of the first few rows, and `summary()` is also useful.

`names(Smarket)`

`head(Smarket)`

`summary(Smarket)`

The `summary()` function gives you a simple summary of each of the variables in the `Smarket` data frame. You can see that there are a number of lags, volume, today's return, and direction. You will use `Direction` as the response variable, as it shows whether the market went up or down since the previous day.
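Since `Direction` will be the response, it helps to check how it is stored. The sketch below assumes the `ISLR` package is installed and does nothing otherwise; the name `smarket` is a local copy introduced for illustration.

```r
# Assumes the ISLR package is installed; the block does nothing otherwise.
if (requireNamespace("ISLR", quietly = TRUE)) {
  smarket <- ISLR::Smarket

  # Column types: Direction should be a factor, the other columns numeric
  str(smarket)

  # How many days the market went down versus up
  print(table(smarket$Direction))

  # How R codes the factor: the level coded 1 is the one glm() treats as "success"
  print(contrasts(smarket$Direction))
}
```

Knowing which level is coded as 1 matters later, because the fitted probabilities refer to that level.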

#### Visualizing Data

Data visualization is perhaps the fastest and most useful way to summarize and learn more about your data. You'll start by exploring the numeric variables individually.

Histograms provide a bar chart of a numeric variable split into bins with the height showing the number of instances that fall into each bin. They are useful to get an indication of the distribution of an attribute.

```
# One histogram per numeric column (columns 1 to 8; column 9, Direction, is a factor)
par(mfrow = c(1, 8))
for (i in 1:8) {
  hist(Smarket[, i], main = names(Smarket)[i])
}
```

It's extremely hard to see, but most of the variables show a Gaussian or double Gaussian distribution.
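If the single-row layout is too cramped, one option is to spread the plots over a grid. The helper below is a hypothetical sketch (the name `plot_numeric_hists` and its `n_col` argument are assumptions); it draws one histogram per numeric column and restores the previous plot layout when it finishes.

```r
# Hypothetical helper: draw one histogram per numeric column on a grid.
plot_numeric_hists <- function(df, n_col = 4) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  stopifnot(length(numeric_cols) > 0)
  n_row <- ceiling(length(numeric_cols) / n_col)

  old_par <- par(mfrow = c(n_row, n_col), mar = c(4, 4, 2, 1))
  on.exit(par(old_par))  # restore the previous layout afterwards

  for (col in numeric_cols) {
    hist(df[[col]], main = col, xlab = col)
  }
  invisible(length(numeric_cols))  # number of histograms drawn
}

# With the tutorial's data (assumes the ISLR package is installed):
if (requireNamespace("ISLR", quietly = TRUE)) {
  plot_numeric_hists(ISLR::Smarket)  # 8 numeric columns on a 2 x 4 grid
}
```

The helper skips non-numeric columns such as `Direction`, so the column indices do not need to be hard-coded.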

You can look at the distribution of the data a different way using box and whisker plots. The box captures the middle 50% of the data, the line shows the median and the whiskers of the plots show the reasonable extent of data. Any dots outside the whiskers are good candidates for outliers.

```
# One box plot per numeric column (columns 1 to 8)
par(mfrow = c(1, 8))
for (i in 1:8) {
  boxplot(Smarket[, i], main = names(Smarket)[i])
}
```

You can see that the `Lag`s and `Today` all have a similar range. Otherwise, there's no sign of any outliers.

Missing data can have a big impact on modeling. Thus, you can use a missing plot to get a quick idea of the amount of missing data in the dataset. The x-axis shows attributes and the y-axis shows instances. Horizontal lines indicate missing data for an instance, vertical blocks represent missing data for an attribute.
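A numeric check can complement the missing plot. The helper below is a hypothetical base R sketch, not the plotting function itself: it counts missing values per column and is demonstrated on a small made-up data frame.

```r
# Hypothetical helper: number and share of missing values per column.
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  data.frame(
    variable    = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * as.numeric(n_missing) / nrow(df), 1),
    row.names   = NULL
  )
}

# Small made-up data frame to show the output shape
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)
missing_summary(toy)
```

With the tutorial's data loaded, the same call would be `missing_summary(Smarket)`.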