Alzheimer's Disease Diagnosis Prediction (Classification)

Project Overview

This project aims to predict the diagnosis of Alzheimer's Disease using various machine learning methods. Alzheimer's Disease is a progressive neurological disorder that leads to memory loss, cognitive decline, and, ultimately, the inability to perform daily activities. Early and accurate diagnosis is critical for managing the disease and improving the quality of life for affected individuals.

In this project, we will utilize the Alzheimer's Disease Dataset from Kaggle. The dataset contains detailed information about patients, including demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, and symptoms.

Project Goals

The primary goal of this project is to develop a machine learning model that can accurately predict whether a patient is diagnosed with Alzheimer's Disease based on the available features in the dataset. We will explore various machine learning algorithms to determine the most effective model for this prediction task.

Dataset Overview

Attributes	Description
PatientID	A unique identifier assigned to each patient (4751 to 6900).
Age	The age of the patients ranges from 60 to 90 years.
Gender	Gender of the patients, where 0 represents Male and 1 represents Female.
Ethnicity	The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other.
EducationLevel	The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor's, 3: Higher.
BMI	Body Mass Index of the patients, ranging from 15 to 40.
Smoking	Smoking status, where 0 indicates No and 1 indicates Yes.
AlcoholConsumption	Weekly alcohol consumption in units, ranging from 0 to 20.
PhysicalActivity	Weekly physical activity in hours, ranging from 0 to 10.
DietQuality	Diet quality score, ranging from 0 to 10.
SleepQuality	Sleep quality score, ranging from 4 to 10.
FamilyHistoryAlzheimers	Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
CardiovascularDisease	Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes.
Diabetes	Presence of diabetes, where 0 indicates No and 1 indicates Yes.
Depression	Presence of depression, where 0 indicates No and 1 indicates Yes.
HeadInjury	History of head injury, where 0 indicates No and 1 indicates Yes.
Hypertension	Presence of hypertension, where 0 indicates No and 1 indicates Yes.
SystolicBP	Systolic blood pressure, ranging from 90 to 180 mmHg.
DiastolicBP	Diastolic blood pressure, ranging from 60 to 120 mmHg.
CholesterolTotal	Total cholesterol levels, ranging from 150 to 300 mg/dL.
CholesterolLDL	Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL.
CholesterolHDL	High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL.
CholesterolTriglycerides	Triglycerides levels, ranging from 50 to 400 mg/dL.
MMSE	Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment.
FunctionalAssessment	Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment.
MemoryComplaints	Presence of memory complaints, where 0 indicates No and 1 indicates Yes.
BehavioralProblems	Presence of behavioral problems, where 0 indicates No and 1 indicates Yes.
ADL	Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment.
Confusion	Presence of confusion, where 0 indicates No and 1 indicates Yes.
Disorientation	Presence of disorientation, where 0 indicates No and 1 indicates Yes.
PersonalityChanges	Presence of personality changes, where 0 indicates No and 1 indicates Yes.
DifficultyCompletingTasks	Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes.
Forgetfulness	Presence of forgetfulness, where 0 indicates No and 1 indicates Yes.
Diagnosis	Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes.
DoctorInCharge	This column contains confidential information about the doctor in charge, with "XXXConfid" as the value for all patients.
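
Before relying on the ranges documented above, it can help to confirm that the loaded data actually respects them. The following is a minimal sketch, assuming a pandas DataFrame with the same column names. The helper `out_of_range_counts`, the `DOCUMENTED_RANGES` dictionary (a small subset of the table), and the toy rows are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical subset of the documented ranges: column -> (min, max).
DOCUMENTED_RANGES = {
    "Age": (60, 90),
    "BMI": (15, 40),
    "MMSE": (0, 30),
    "SystolicBP": (90, 180),
}


def out_of_range_counts(df, ranges):
    """Count, per column, the rows that fall outside the documented range."""
    counts = {}
    for column, (low, high) in ranges.items():
        if column in df.columns:
            # between() is inclusive on both ends by default.
            counts[column] = int((~df[column].between(low, high)).sum())
    return counts


# Toy rows for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({
    "Age": [60, 75, 95],
    "BMI": [22.5, 41.0, 30.1],
    "MMSE": [12.0, 29.5, 0.0],
    "SystolicBP": [120, 150, 180],
})

print(out_of_range_counts(toy, DOCUMENTED_RANGES))
```

On the real data, the same call could be run against the loaded DataFrame. Any nonzero count would point to either a data issue or an imprecise range in the description.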

Methodology

The project will follow these steps:

1. Data Exploration and Preprocessing: Initial analysis of the dataset, including handling missing values, encoding categorical variables, and feature scaling.
2. Feature Selection: Identifying the most relevant features for predicting Alzheimer's Disease.
3. Model Selection: Evaluating multiple machine learning models (e.g., Logistic Regression, Decision Trees, Random Forest, Support Vector Machines) and selecting the best-performing model based on key metrics such as accuracy, precision, recall, and F1-score.
4. Model Evaluation: Using cross-validation techniques to assess the model's performance and ensure it generalizes well to unseen data.
5. Conclusion: Summarizing the findings and discussing potential implications for early diagnosis of Alzheimer's Disease.
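
To make the steps above concrete, here is a rough sketch of how scaling, a single candidate model, and cross-validated scoring might fit together in scikit-learn. It runs on synthetic data from `make_classification`, not on the Alzheimer's dataset. The choice of Logistic Regression, five folds, and the F1 metric are assumptions for illustration, not the final setup of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and Diagnosis labels.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Keeping the scaler inside the pipeline means it is refit on each
# training fold, so no information leaks from the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds preserve the class balance in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, scoring="f1", cv=cv)

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

Swapping `LogisticRegression` for another estimator while keeping the same `cv` object gives scores that are comparable across models, which is the basis of the model selection step.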

Through this project, we aim to contribute to the ongoing efforts in leveraging machine learning for healthcare applications, particularly in the early detection and management of Alzheimer's Disease.

Let's import the libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.naive_bayes import GaussianNB

from numpy import mean, std

Let's load the dataset.

alzheimer = pd.read_csv('alzheimers_disease_data.csv')

Let's look at the shape of the dataset and its first few rows.

alzheimer.shape

alzheimer.head()

Let's check whether the dataset contains any null values.

# Count nulls per column once, then keep only the columns that have any.
null_counts = alzheimer.isnull().sum()
null_counts[null_counts > 0]

As we can see, there are no null values. Now let's examine the types of attributes.

alzheimer.info()

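Several attributes are coded as small integers: most take only the values 0 or 1, while Ethnicity and EducationLevel take values from 0 to 3. They are currently stored as int64, so I will convert them to the category type.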

categorical_columns = [
    'Gender', 'Ethnicity', 'EducationLevel', 'Smoking',
    'FamilyHistoryAlzheimers', 'CardiovascularDisease',
    'Diabetes', 'Depression', 'HeadInjury',
    'Hypertension', 'MemoryComplaints',
    'BehavioralProblems', 'Confusion',
    'Disorientation', 'PersonalityChanges',
    'DifficultyCompletingTasks', 'Forgetfulness'
]

alzheimer[categorical_columns] = alzheimer[categorical_columns].astype('category')
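
To see what this conversion does, here is a small sketch on a toy DataFrame. The rows are made up for illustration, and the example uses the dict form of `astype`, which should have the same effect on the listed columns.

```python
import pandas as pd

# Toy frame for illustration only; values are not from the real dataset.
toy = pd.DataFrame({
    "Smoking": [0, 1, 0, 1],
    "Ethnicity": [0, 3, 2, 0],
    "Age": [61, 74, 88, 90],
})

# Convert only the coded columns; Age stays numeric.
toy = toy.astype({"Smoking": "category", "Ethnicity": "category"})

print(toy.dtypes)
# The categories are the distinct codes that actually occur in the column.
print(toy["Ethnicity"].cat.categories.tolist())
```

After the conversion, pandas treats these columns as labels and not as quantities, so summaries such as `describe()` skip them by default unless categorical columns are explicitly included.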

Let's see how many different values the DoctorInCharge attribute can take.