This report presents the analysis and implementation process for building a logistic regression model to predict heart disease based on the "Heart Attack Data Set." This dataset includes medical variables commonly associated with cardiovascular health, such as age, cholesterol levels, gender, and electrocardiogram results. The objective of this report is to detail each step taken in preprocessing, exploratory data analysis, feature selection, model training, and evaluation, while also justifying key decisions and the suitability of logistic regression for this classification task.
import pandas as pd
data = pd.read_excel('Heart Attack Data Set spreadsheet.xlsx')
The dataset was first loaded and inspected using the Pandas library, which provided a quick overview of its size, shape, and presence of missing or duplicate values. The dataset contained no missing values, and no encoding was required, as all categorical variables were already presented in binary form. Basic statistics for each variable were obtained using data.describe() to understand the range, mean, and distribution of each numerical feature.
print("Shape:\n", "\n", data.shape)
print("\n")
print("Size:\n", "\n", data.size)
print("\n")
print("Info:\n", "\n")
data.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print("\n")
print("Describe:\n", "\n", data.describe())
print("\n")
print("Null values:\n", "\n", data.isnull().sum())
print("\n")
print("Duplicate values:\n", "\n", data.duplicated().sum())
print("\n")
Correlation Analysis
To identify relationships between variables, a correlation heatmap was generated with Seaborn and Matplotlib. A heatmap was chosen because it is easy to read and makes it possible to “identify highly correlated or inversely correlated variables at a glance” (Bothma, 2024). The strongest positive correlation was observed between chest pain and target (the variable indicating the presence of heart disease). This association is consistent with clinical findings: cardiovascular conditions are diagnosed in more than half of patients presenting with chest pain in the emergency department, and 21.43% of patients admitted for chest pain to a Pretoria hospital reported some form of cardiovascular disease (Geyser & Smith, 2016). The strongest negative correlation was between ‘oldpeak’ (ST depression induced by exercise) and slope (slope of the peak exercise ST segment), suggesting an inverse relationship between exercise-induced ST depression and slope readings.
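The strongest pairs can also be read off the correlation matrix directly rather than by eye. The sketch below is a minimal illustration on a small made-up DataFrame: the column names mirror the dataset, but the values and the helper name `strongest_pairs` are hypothetical. It keeps only the upper triangle of the matrix so that each pair appears once, then sorts the result.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df: pd.DataFrame) -> pd.Series:
    """Return the Pearson correlation of each distinct column pair, sorted descending."""
    corr = df.corr()
    # Keep only entries strictly above the diagonal so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False)

# Made-up values for illustration only; not taken from the Heart Attack Data Set
toy = pd.DataFrame({
    'chest pain': [0, 1, 2, 3, 0, 2],
    'oldpeak':    [2.5, 1.0, 0.2, 0.0, 3.0, 0.5],
    'slope':      [0, 1, 2, 2, 0, 2],
    'target':     [0, 1, 1, 1, 0, 1],
})

pairs = strongest_pairs(toy)
print("Most positive pair:", pairs.index[0], round(pairs.iloc[0], 2))
print("Most negative pair:", pairs.index[-1], round(pairs.iloc[-1], 2))
```

Applied to the real data, the same helper would take `data` in place of `toy`.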
import matplotlib.pyplot as plt
import seaborn as sns
correlation_matrix = data.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
Distribution Analysis
Age and Cholesterol Levels
A Seaborn scatter plot was used to examine the relationship between age and serum cholesterol, alongside a Seaborn lmplot, which overlays a fitted regression line. Scatter plots are well suited to showing how two continuous variables are distributed and related, and can reveal underlying patterns or trends.
Heart Disease Distribution by Gender
A Seaborn countplot was used to display the distribution of heart disease cases by gender, to examine whether heart disease prevalence differs between genders. countplot displays the number of observations in each category of a categorical variable (Tutorialspoint, 2024). It works like a histogram but is applied to categorical rather than numerical variables. Because it shares its core API and configuration options with barplot(), counts can easily be compared across nested variables.
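A minimal sketch of the gender plot described above is given here. It runs on a few made-up rows so that it is self-contained: the column names 'sex' and 'target' follow those used elsewhere in this report, while the values are hypothetical and the meaning of each 'sex' code is not assumed. A crosstab is printed first because it tabulates the same counts that the plot draws.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows for illustration only; not taken from the Heart Attack Data Set
toy = pd.DataFrame({
    'sex':    [0, 0, 0, 1, 1, 1, 1, 1],
    'target': [1, 1, 0, 1, 0, 0, 1, 0],
})

# Tabulate the counts the plot will draw: rows are sex, columns are target
counts = pd.crosstab(toy['sex'], toy['target'])
print(counts)

# One group of bars per sex value, split by heart disease status
ax = sns.countplot(data=toy, x='sex', hue='target')
ax.set_title('Distribution of Heart Disease by Gender')
ax.set_xlabel('Sex (as coded in the dataset)')
ax.set_ylabel('Count')
plt.show()
```

With the real data, `toy` would be replaced by `data`.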
Heart Disease Distribution by Resting Electrocardiogram (ECG) Results
A second Seaborn countplot was used to explore the distribution of heart disease across resting ECG result categories.
import matplotlib.pyplot as plt
import seaborn as sns
# Plot for 'age' and 'serum cholesterol'
sns.scatterplot(data = data, x = 'age', y = 'serum cholesterol', hue='age')
plt.title('Scatter Plot Serum Cholesterol vs. Age')
plt.xlabel('Age')
plt.ylabel('Serum Cholesterol')
plt.show()
print("\n")
# No hue here: hue='age' would fit a separate regression line for every distinct age
sns.lmplot(data = data, x = 'age', y = 'serum cholesterol')
plt.title('Lmplot Serum Cholesterol vs. Age')
plt.xlabel('Age')
plt.ylabel('Serum Cholesterol')
plt.show()
print("\n")
# Plot the distribution of those with heart disease by resting electrocardiogram
sns.countplot(x = 'target', hue = 'resting electrocardiogram results', data = data)
plt.title('Distribution of Heart Disease by Resting Electrocardiogram')
plt.xlabel('Heart Disease (0: No, 1: Yes)')
plt.ylabel('Count')
plt.show()
Data Preprocessing and Feature Selection
Scaling the Data
The features were standardized using StandardScaler from sklearn.preprocessing, which rescales each variable to a mean of zero and a standard deviation of one. This step was necessary because the logistic regression model used within RFE did not converge within the default number of iterations on the first attempt. Thus, in addition to raising the maximum number of iterations, standardisation placed all features on a common scale so that no feature dominated the model simply because of its units (avcontentteam, 2023).
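As a small worked example of what this scaling does, the sketch below compares StandardScaler with the formula z = (x - mean) / std applied column by column. The input values are made up (an age-like and a cholesterol-like column) and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales; for illustration only
X = np.array([[40.0, 180.0],
              [50.0, 240.0],
              [60.0, 300.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# StandardScaler applies z = (x - mean) / std per column, using the
# population standard deviation (ddof=0, which is NumPy's default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)
print("Matches manual formula:", np.allclose(X_scaled, X_manual))
print("Column means:", X_scaled.mean(axis=0))
print("Column stds: ", X_scaled.std(axis=0))
```

After scaling, both columns hold the same values even though the raw units differ by an order of magnitude, which is what helps the solver converge.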
Feature Selection with Recursive Feature Elimination (RFE)
Recursive Feature Elimination (RFE) was applied to the dataset to select the most relevant features for the logistic regression model. The logistic regression estimator wrapped by RFE was allowed up to 1000 iterations (max_iter=1000), and RFE was configured to retain five features, repeatedly refitting the model and discarding the weakest feature until that number remained. RFE was chosen because it suits models such as logistic regression that expose coefficients: at each step it removes the feature with the smallest coefficient magnitude. Trimming the feature set in this way helps to limit overfitting and improve generalizability (Brownlee, 2020).
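The elimination order can be inspected through RFE's `ranking_` attribute. The sketch below is a hedged illustration on synthetic data from make_classification, not on the Heart Attack Data Set; the feature count of 13 and the `_demo` names are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Heart Attack Data Set
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# step=1 (the default) drops one feature per round until 5 remain
rfe_demo = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=1,
)
rfe_demo.fit(X_demo, y_demo)

# support_ marks the kept features; ranking_ is 1 for kept features and
# grows the earlier a feature was eliminated
for idx, (kept, rank) in enumerate(zip(rfe_demo.support_, rfe_demo.ranking_)):
    print(f"feature {idx}: kept={kept}, rank={rank}")
```

Features with the highest rank were discarded first, so the ranking gives a rough ordering of the features that did not make the final set.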
Train-Test Split
After RFE, the target variable was removed from the feature set, and the data was split into 70% training and 30% held-out data, with the held-out portion split in half to give 15% validation and 15% test sets. This follows the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement for best practice in the reporting of prediction models (Moons et al., 2015). Evaluating on data the model has not seen during training makes overfitting detectable and gives a more robust estimate of model performance.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler # Import for scaling
# Separate the features (X) and target variable (y)
X = data.drop('target', axis=1)
y = data['target']
# Scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Fit and transform the data
# Create a logistic regression model
model = LogisticRegression(max_iter=1000)
# Initialize RFE with the scaled data and model
rfe = RFE(estimator=model, n_features_to_select=5)
# Fit RFE to the scaled data
rfe.fit(X_scaled, y)
# Get the selected features
selected_features = X.columns[rfe.support_]
# Print the selected features
print("Selected Features:", selected_features)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
features = ['sex', 'chest pain', 'maximum heart rate', 'oldpeak', 'no. of major vessels colored by fluoroscopy']
X = data[features]
y = data['target']
# Split into train, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42) # 70% train, 30% for validation and test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42) # Split the 30% into 15% validation and 15% test
# Create and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Make predictions on the validation set
y_val_pred = model.predict(X_val)
# Evaluate the model on the validation set
print("Validation Set Performance:")
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print("Classification Report:\n", classification_report(y_val, y_val_pred))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_val_pred))
# Testing on the test set
y_test_pred = model.predict(X_test)
# Evaluate the model on the test set
print("\nTest Set Performance:")
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print("Classification Report:\n", classification_report(y_test, y_test_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))
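One variation worth noting: the code above fits the final model on unscaled features, whereas RFE was run on scaled ones. The sketch below shows one way scaling could be folded into the model so that the scaler is fitted on the training split only. It uses synthetic data from make_classification in place of the spreadsheet, and the Pipeline arrangement is an assumption for illustration, not the method used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five selected features; not the real dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Same 70/15/15 split as in the report
X_train, X_temp, y_train, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# The pipeline learns the scaling parameters from X_train only and reuses
# them when predicting on the validation and test sets
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

val_acc = accuracy_score(y_val, pipeline.predict(X_val))
test_acc = accuracy_score(y_test, pipeline.predict(X_test))
print("Validation accuracy:", val_acc)
print("Test accuracy:", test_acc)
```

Keeping the scaler inside the pipeline means no statistics from the validation or test rows influence the fitted model.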