Course Notes: Generalized Linear Models in Python

Generalized Linear Models in Python

Introduction to GLMs

Going beyond linear regression

Review of Linear Models, Variables, and Formulas

Linear Models

Linear models are a fundamental tool in statistical analysis and machine learning. They describe the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as predictors or features) using a linear equation.

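The general form of a linear model is:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$$

where $y$ is the response, $x_1, \dots, x_p$ are the predictors, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is the random error term.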

# Mock to run code

# Import necessary libraries
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols

# Create a mock dataset
np.random.seed(0)
my_data = pd.DataFrame({
    'X': np.random.rand(100),
    'y': 2.5 * np.random.rand(100) + 1.5
})

# Fit the linear model
model = ols(formula='y ~ X', data=my_data).fit()

# Display the summary of the model
model.summary()

# Generalized Linear Model

import statsmodels.api as sm
from statsmodels.formula.api import glm

# Mock to run code
model = glm(formula='y ~ X',
            data=my_data,
            family=sm.families.Gaussian()).fit()

# Display the summary of the model
model.summary()

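Regression Function Equation

For the Gaussian model with the identity link fitted above, the regression function expresses the mean of the response as a linear function of the predictor:

$$E[y \mid X] = \mu = \beta_0 + \beta_1 X$$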

# Exercises

import statsmodels.api as sm
from statsmodels.formula.api import ols, glm

# Mock to run code
# Create a mock dataset
np.random.seed(0)
salary = pd.DataFrame({
    'Experience': np.random.rand(100) * 10,  # Random experience between 0 and 10 years
    'Salary': 50000 + 2000 * np.random.rand(100) * 10  # Random salary between 50k and 70k
})

# Fit a linear model
model_lm = ols(formula='Salary ~ Experience', data=salary).fit()

# View model coefficients
model_lm.params

# Fit a GLM
model_glm = glm(formula='Salary ~ Experience', data=salary, family=sm.families.Gaussian()).fit()

# View model coefficients
model_glm.params

How to build a GLM?

Components of the Generalized Linear Model (GLM)

A Generalized Linear Model (GLM) is a flexible generalization of ordinary linear regression that allows the response variable to follow a distribution other than the normal distribution. A GLM has three components:

Random Component:
- Specifies the probability distribution of the response variable (Y). Common distributions include:
  - Gaussian (Normal)
  - Binomial
  - Poisson
  - Gamma
Systematic Component:
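- Specifies the linear predictor, which is a linear combination of the explanatory variables (X). It is represented as: $\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
- Here, $\eta$ is the linear predictor, $\beta_0$ is the intercept, and $\beta_1, \dots, \beta_p$ are the coefficients of the explanatory variables $x_1, \dots, x_p$.
Link Function:
- Connects the mean of the response variable, $\mu = E[Y]$, to the linear predictor $\eta$ through $g(\mu) = \eta$. The link function is chosen based on the distribution of the response variable. Common link functions include:
  - Identity link: $g(\mu) = \mu$
  - Logit link: $g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)$
  - Log link: $g(\mu) = \log(\mu)$
  - Inverse link: $g(\mu) = \frac{1}{\mu}$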
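
The four link functions listed above can be sketched in plain NumPy. Each link g maps a mean μ to the linear predictor η, and its inverse maps η back to μ. This is a minimal illustration and not how statsmodels implements links internally. The dictionary `links` and the sample values in `mu` are assumptions made for the example.

```python
import numpy as np

# Each entry holds (link g, inverse link g^{-1}); names are illustrative
links = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: np.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + np.exp(-eta))),
    "log": (np.log, np.exp),
    "inverse": (lambda mu: 1 / mu, lambda eta: 1 / eta),
}

# Sample means inside (0, 1), so every link above is defined for them
mu = np.array([0.1, 0.5, 0.9])

round_trip_ok = {}
for name, (g, g_inv) in links.items():
    eta = g(mu)                      # mean -> linear predictor scale
    recovered = g_inv(eta)           # linear predictor -> mean scale
    round_trip_ok[name] = bool(np.allclose(recovered, mu))
    print(name, eta, round_trip_ok[name])
```

Applying the inverse link to the linear predictor is what turns fitted coefficients into predictions on the scale of the response.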

Example of a GLM

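For a Gaussian distribution with an identity link function, the GLM is equivalent to ordinary linear regression:

$$\mu = E[Y] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

For a binomial distribution with a logit link function, the GLM is used for logistic regression:

$$\log\left(\frac{\mu}{1 - \mu}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

where $\mu = E[Y]$ is the probability that $Y = 1$.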

In summary, the GLM framework allows for a wide range of models by specifying different combinations of the random component, systematic component, and link function.

import statsmodels.api as sm
import pandas as pd

# Mock data
crab = pd.DataFrame({
    'width': [2.8, 3.1, 3.3, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.7],
    'y': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
})

# Define model formula
formula = 'y ~ width'

# Define probability distribution for the response variable for 
# the linear (LM) and logistic (GLM) model
family_LM = sm.families.Gaussian()
family_GLM = sm.families.Binomial()

# Define and fit a linear regression model
model_LM = sm.GLM.from_formula(formula, data=crab, family=family_LM).fit()
model_LM.summary()

# Define and fit a logistic regression model
model_GLM = sm.GLM.from_formula(formula, data=crab, family=family_GLM).fit()
model_GLM.summary()

# Mock test set
test = pd.DataFrame({
    'width': [3.0, 3.4, 3.8, 4.2, 4.6]
})

# Compute estimated probabilities for linear model: pred_lm
pred_lm = model_LM.predict(test)

# Compute estimated probabilities for GLM model: pred_glm
pred_glm = model_GLM.predict(test)

# Create dataframe of predictions for linear and GLM model: predictions
predictions = pd.DataFrame({'Pred_LM': pred_lm, 'Pred_GLM': pred_glm})

# Concatenate test sample and predictions and view the results
all_data = pd.concat([test, predictions], axis=1)
all_data

How to fit a GLM in Python?

Explanation of `family` Arguments and Functions in GLM

In Generalized Linear Models (GLMs), the `family` argument specifies the probability distribution of the response variable and the link function that relates the linear predictor to the mean of that distribution. The statsmodels library in Python provides several families that can be used with GLMs. Here are some common families and their uses:

Gaussian Family:

Usage: Used for continuous response variables that are normally distributed.
Link Function: Identity link (default), which means the linear predictor is directly the mean of the distribution.

Example:

family_Gaussian = sm.families.Gaussian()

# Describing the model

from statsmodels.formula.api import glm
import statsmodels.api as sm
import pandas as pd

# Mock data
crab = pd.DataFrame({
    'width': [2.8, 3.1, 3.3, 3.5, 3.7, 3.9, 4.1, 4.3, 4.5, 4.7],
    'y': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
})

# Define model formula
formula = 'y ~ width'

# Define probability distribution for the response variable for the logistic (GLM) model
family = sm.families.Binomial()

# Define and fit a logistic regression model
model = glm(formula, data=crab, family=family).fit()
model.summary()
