Maximum Likelihood Estimation For Simple Linear Regression Model with R
Introduction
The "Simple Linear Regression - Placement Data" dataset from Kaggle is intended for educational use, particularly for practicing linear regression analysis. It comprises 1000 observations and includes the two variables used here: "cgpa" (Cumulative Grade Point Average), the independent variable, and "placement exam marks", the dependent variable. The goal is to explore the relationship between students' academic performance, as measured by CGPA, and their placement exam marks. Knowing how strongly grades are associated with placement exam performance can inform educational strategies and student guidance.
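For reference, the model named in the title can be written out explicitly. This is the standard formulation of simple linear regression, assuming independent, normally distributed errors with constant variance. The symbols \beta_0, \beta_1 and \sigma^2 are notation introduced here; x_i stands for cgpa and y_i for the placement exam marks of student i.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n

\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2
```

Maximum likelihood estimation chooses the values of \beta_0, \beta_1 and \sigma^2 that maximize the log-likelihood \ell.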
Data Loading and Libraries
The analysis begins with loading the necessary libraries.
# Load necessary libraries
# ggplot2: A data visualization package for creating complex and multi-layered graphics.
library(ggplot2)
# dplyr: A package for data manipulation and transformation with a focus on simplicity and speed.
library(dplyr)
# knitr: A package for dynamic report generation in R, integrating code and outputs into documents.
library(knitr)
# kableExtra: An extension of the knitr package for creating well-formatted HTML and PDF tables.
library(kableExtra)
# stats4: Provides tools for statistical calculations, including maximum likelihood estimation (MLE).
library(stats4)
# outliers: A package for detecting outliers in datasets.
library(outliers)
# mvoutlier: Provides methods for detecting multivariate outliers.
library(mvoutlier)
# tidyverse: A collection of R packages designed for data science, including dplyr, ggplot2, tidyr, readr, etc.
library(tidyverse)
# lmtest: A package for testing linear regression models, including tests for heteroscedasticity and autocorrelation.
library(lmtest)
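# MASS: A package that contains functions and datasets to support the book "Modern Applied Statistics with S".
# Note: loading MASS after dplyr masks dplyr::select(); call dplyr::select() explicitly if it is needed.
library(MASS)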
# sandwich: Provides robust covariance matrix estimators for linear models.
library(sandwich)
# car: Companion to Applied Regression package, includes functions for regression diagnostics.
library(car) # for Durbin-Watson test
# nlme: Fits and compares Gaussian linear and nonlinear mixed-effects models.
library(nlme) # for gls function to handle autocorrelation
The following libraries are used: ggplot2, dplyr, stats4, knitr, kableExtra, outliers, mvoutlier, tidyverse, lmtest, MASS, sandwich, car, nlme.
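Because the analysis depends on many packages, it can help to check which ones are installed before the library() calls run. The following is a minimal sketch, not part of the original analysis; the names pkgs and missing_pkgs are introduced here for illustration.

```r
# Packages used in this analysis (taken from the library() calls above)
pkgs <- c("ggplot2", "dplyr", "knitr", "kableExtra", "stats4", "outliers",
          "mvoutlier", "tidyverse", "lmtest", "MASS", "sandwich", "car", "nlme")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Report what still needs to be installed, if anything
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any names reported can then be passed to install.packages().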
The dataset is then read from a CSV file:
# Read the dataset
data <- read.csv('placement.csv')
head(data)
Processing of Outliers
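The outlier screen in this section is the 1.5 × IQR rule: a value is flagged when it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A minimal worked example on a made-up vector (not taken from the placement data) shows the arithmetic; the names x, spread, lower and upper are illustrative only.

```r
# Made-up values with one obvious extreme point
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Quartiles (R's default quantile type 7)
q1 <- quantile(x, 0.25, names = FALSE)  # 3
q3 <- quantile(x, 0.75, names = FALSE)  # 7

# Interquartile range and the two fences
spread <- q3 - q1                        # 4
lower <- q1 - 1.5 * spread               # -3
upper <- q3 + 1.5 * spread               # 13

# Values outside the fences are flagged as outliers
flagged <- x[x < lower | x > upper]
flagged                                  # 100
```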
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 <- quantile(data$cgpa, 0.25)
Q3 <- quantile(data$cgpa, 0.75)
Q1_marks <- quantile(data$placement_exam_marks, 0.25)
Q3_marks <- quantile(data$placement_exam_marks, 0.75)
# Calculate IQR
IQR <- Q3 - Q1
IQR_marks <- Q3_marks - Q1_marks
# Define the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
lower_bound_marks <- Q1_marks - 1.5 * IQR_marks
upper_bound_marks <- Q3_marks + 1.5 * IQR_marks
# Identify outliers (subset() evaluates column names inside data, so the data$ prefix is not needed)
outliers <- subset(data, cgpa < lower_bound | cgpa > upper_bound)
outliers_marks <- subset(data, placement_exam_marks < lower_bound_marks | placement_exam_marks > upper_bound_marks)
# View outliers
outliers
outliers_marks
Removing of Outliers
# Remove outliers
# Each variable is filtered separately, so the two results can contain different rows.
cleaned_data <- subset(data, cgpa >= lower_bound & cgpa <= upper_bound)
cleaned_data_marks <- subset(data, placement_exam_marks >= lower_bound_marks & placement_exam_marks <= upper_bound_marks)
# View the first few rows of the cleaned dataset
head(cleaned_data)
head(cleaned_data_marks)
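The two filters above are applied independently, so cleaned_data and cleaned_data_marks may keep different rows. If a single dataset free of outliers in both variables is wanted, for example before fitting the regression, one option is to combine the conditions. The following is a minimal sketch on made-up toy data; within_fences, toy and cleaned_both are hypothetical names and not part of the original analysis.

```r
# Hypothetical helper: TRUE where x lies inside the k * IQR fences
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  !is.na(x) & x >= q[1] - k * spread & x <= q[2] + k * spread
}

# Toy data standing in for the placement dataset (values are made up)
toy <- data.frame(
  cgpa = c(6.5, 6.8, 7.0, 7.1, 7.2, 7.4, 7.5, 7.7, 4.0, 7.3),
  placement_exam_marks = c(30, 35, 28, 40, 33, 36, 31, 95, 34, 32)
)

# Keep only rows that pass the screen for both variables
keep <- within_fences(toy$cgpa) & within_fences(toy$placement_exam_marks)
cleaned_both <- toy[keep, ]
nrow(cleaned_both)
```

On the real data, the same call would use data$cgpa and data$placement_exam_marks in place of the toy columns.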
Plots of CGPA with and without Outliers
# Boxplot before removing outliers
ggplot(data, aes(y = cgpa)) +
  geom_boxplot() +
  ggtitle("Boxplot of CGPA with Outliers")
# Boxplot after removing outliers
ggplot(cleaned_data, aes(y = cgpa)) +
  geom_boxplot() +
  ggtitle("Boxplot of CGPA without Outliers")
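The points a boxplot draws beyond its whiskers can also be listed numerically. Base R's boxplot.stats() returns them in its out component. It computes the box from hinges rather than from quantile(), so on some data its fences may differ slightly from the manual Q1 and Q3 calculation. The following is a minimal sketch on made-up values; toy_cgpa and flagged_cgpa are hypothetical names.

```r
# Made-up CGPA values with one low and one high extreme point
toy_cgpa <- c(6.5, 6.8, 7.0, 7.1, 7.2, 7.4, 7.5, 7.7, 4.0, 9.9)

# boxplot.stats() reports the values lying beyond the whiskers
flagged_cgpa <- boxplot.stats(toy_cgpa)$out
flagged_cgpa

# Number of points that would appear as individual dots in the boxplot
length(flagged_cgpa)
```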
Plots of Placement Exam Marks with and without Outliers