Alzheimer's Disease Diagnosis Prediction





Project Overview

This project aims to predict the diagnosis of Alzheimer's Disease using various machine learning methods. Alzheimer's Disease is a progressive neurological disorder that leads to memory loss, cognitive decline, and, ultimately, the inability to perform daily activities. Early and accurate diagnosis is critical for managing the disease and improving the quality of life for affected individuals.

In this project, we will utilize the Alzheimer's Disease Dataset from Kaggle. The dataset contains detailed information about patients, including demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, and symptoms.

Project Goals

The primary goal of this project is to develop a machine learning model that can accurately predict whether a patient is diagnosed with Alzheimer's Disease based on the available features in the dataset. We will explore various machine learning algorithms to determine the most effective model for this prediction task.





Dataset Overview

| Attribute | Description |
| --- | --- |
| PatientID | A unique identifier assigned to each patient (4751 to 6900). |
| Age | The age of the patients ranges from 60 to 90 years. |
| Gender | Gender of the patients, where 0 represents Male and 1 represents Female. |
| Ethnicity | The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other. |
| EducationLevel | The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor's, 3: Higher. |
| BMI | Body Mass Index of the patients, ranging from 15 to 40. |
| Smoking | Smoking status, where 0 indicates No and 1 indicates Yes. |
| AlcoholConsumption | Weekly alcohol consumption in units, ranging from 0 to 20. |
| PhysicalActivity | Weekly physical activity in hours, ranging from 0 to 10. |
| DietQuality | Diet quality score, ranging from 0 to 10. |
| SleepQuality | Sleep quality score, ranging from 4 to 10. |
| FamilyHistoryAlzheimers | Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes. |
| CardiovascularDisease | Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes. |
| Diabetes | Presence of diabetes, where 0 indicates No and 1 indicates Yes. |
| Depression | Presence of depression, where 0 indicates No and 1 indicates Yes. |
| HeadInjury | History of head injury, where 0 indicates No and 1 indicates Yes. |
| Hypertension | Presence of hypertension, where 0 indicates No and 1 indicates Yes. |
| SystolicBP | Systolic blood pressure, ranging from 90 to 180 mmHg. |
| DiastolicBP | Diastolic blood pressure, ranging from 60 to 120 mmHg. |
| CholesterolTotal | Total cholesterol levels, ranging from 150 to 300 mg/dL. |
| CholesterolLDL | Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL. |
| CholesterolHDL | High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL. |
| CholesterolTriglycerides | Triglycerides levels, ranging from 50 to 400 mg/dL. |
| MMSE | Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment. |
| FunctionalAssessment | Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| MemoryComplaints | Presence of memory complaints, where 0 indicates No and 1 indicates Yes. |
| BehavioralProblems | Presence of behavioral problems, where 0 indicates No and 1 indicates Yes. |
| ADL | Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| Confusion | Presence of confusion, where 0 indicates No and 1 indicates Yes. |
| Disorientation | Presence of disorientation, where 0 indicates No and 1 indicates Yes. |
| PersonalityChanges | Presence of personality changes, where 0 indicates No and 1 indicates Yes. |
| DifficultyCompletingTasks | Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes. |
| Forgetfulness | Presence of forgetfulness, where 0 indicates No and 1 indicates Yes. |
| Diagnosis | Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes. |
| DoctorInCharge | This column contains confidential information about the doctor in charge, with "XXXConfid" as the value for all patients. |
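
The documented ranges in the table can double as a sanity check on the loaded data. The snippet below is a minimal sketch of that idea, not part of the original notebook: the helper `out_of_range_counts`, the `DOCUMENTED_RANGES` dictionary (a hand-copied subset of the table), and the toy rows are all illustrative assumptions.

```python
import pandas as pd

# A subset of the ranges documented in the attribute table (copied by hand).
DOCUMENTED_RANGES = {
    "Age": (60, 90),
    "BMI": (15, 40),
    "MMSE": (0, 30),
    "SystolicBP": (90, 180),
}


def out_of_range_counts(df, ranges):
    """Count, per column, the rows that fall outside the documented range.

    Bounds are inclusive. Missing values are counted as out of range,
    because Series.between returns False for NaN.
    """
    counts = {}
    for column, (low, high) in ranges.items():
        if column in df.columns:
            counts[column] = int((~df[column].between(low, high)).sum())
    return counts


# Hypothetical toy rows for illustration; these are NOT real patient records.
toy = pd.DataFrame({
    "Age": [60, 75, 95],
    "BMI": [22.5, 41.0, 30.1],
    "MMSE": [29, 12, 0],
    "SystolicBP": [120, 135, 178],
})

violations = out_of_range_counts(toy, DOCUMENTED_RANGES)
print(violations)
```

On the real data, the same helper could be called with the loaded DataFrame in place of `toy`; a non-zero count would point to either a data issue or an imprecise range in the documentation.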

Methodology

The project will follow these steps:

  1. Data Exploration and Preprocessing: Initial analysis of the dataset, including handling missing values, encoding categorical variables, and feature scaling.
  2. Feature Selection: Identifying the most relevant features for predicting Alzheimer's Disease.
  3. Model Selection: Evaluating multiple machine learning models (e.g., Logistic Regression, Decision Trees, Random Forest, and Support Vector Machines) and selecting the best-performing model based on key metrics such as accuracy, precision, recall, and F1-score.
  4. Model Evaluation: Using cross-validation techniques to assess the model's performance and ensure it generalizes well to unseen data.
  5. Conclusion: Summarizing the findings and discussing potential implications for early diagnosis of Alzheimer's Disease.

Through this project, we aim to contribute to the ongoing efforts in leveraging machine learning for healthcare applications, particularly in the early detection and management of Alzheimer's Disease.
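
The model selection and evaluation steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Alzheimer's dataset, compares just two of the candidate models, and scores them with F1 under 5-fold stratified cross-validation. The names `candidates`, `cv_f1`, and `best_name` are made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 10 features, binary target.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which avoids leaking statistics from the validation fold.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated F1 score per candidate model.
cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_f1[name] = scores.mean()

best_name = max(cv_f1, key=cv_f1.get)
print(cv_f1)
print("Best by mean F1:", best_name)
```

Which model wins on synthetic data says nothing about the real dataset; the point is the shape of the loop, where every candidate is scored with the same folds and the same metric.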

Let's import the libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.naive_bayes import GaussianNB

from numpy import mean, std

Let's load the dataset.

alzheimer = pd.read_csv('alzheimers_disease_data.csv')

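Let's look at the shape of the dataset and its first few rows.

# Print the shape explicitly: a notebook cell only displays its last expression.
print(alzheimer.shape)
alzheimer.head()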

Let's check whether the dataset contains any null values.

null_counts = alzheimer.isnull().sum()  # number of nulls per column
null_counts[null_counts > 0]            # keep only columns that have nulls

As we can see, there are no null values. Now let's examine the types of attributes.

alzheimer.info()
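
Beyond `info()`, counting distinct values per column is one way to spot integer columns that are really categorical codes. The sketch below shows the idea on a hypothetical toy frame; the `MAX_LEVELS` threshold and the toy values are assumptions chosen for illustration, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few of the dataset's columns.
toy = pd.DataFrame({
    "Smoking": [0, 1, 0, 1, 1],
    "Ethnicity": [0, 3, 2, 1, 0],
    "Age": [61, 72, 88, 79, 65],
    "BMI": [22.1, 30.4, 27.8, 35.0, 18.9],
})

MAX_LEVELS = 4  # illustrative threshold: at most this many distinct codes

# Look only at integer columns, then count distinct values in each.
int_cols = toy.select_dtypes(include=np.integer).columns
unique_counts = toy[int_cols].nunique()

# Integer columns with few distinct values are candidates for 'category'.
candidate_categoricals = unique_counts[unique_counts <= MAX_LEVELS].index.tolist()
print(candidate_categoricals)
```

A low count is only a hint: the final list still needs a check against the attribute descriptions, since a numeric score could also take few distinct values in a small sample.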

Several attributes are binary flags (0 or 1) or coded categories (Ethnicity and EducationLevel take the values 0 to 3), but they are stored as int64. I will convert them to the category type.

categorical_columns = [
    'Gender', 'Ethnicity', 'EducationLevel', 'Smoking',
    'FamilyHistoryAlzheimers', 'CardiovascularDisease',
    'Diabetes', 'Depression', 'HeadInjury',
    'Hypertension', 'MemoryComplaints',
    'BehavioralProblems', 'Confusion',
    'Disorientation', 'PersonalityChanges',
    'DifficultyCompletingTasks', 'Forgetfulness'
]

alzheimer[categorical_columns] = alzheimer[categorical_columns].astype('category')
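
To confirm that a conversion like this took effect, the dtypes and category levels can be inspected afterwards. The following is a small sketch on a hypothetical toy frame; it uses the dictionary form of `DataFrame.astype`, which is an alternative way of writing the conversion, and the names `toy` and `toy_categorical` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two coded columns and one numeric column.
toy = pd.DataFrame({
    "Gender": [0, 1, 1, 0],
    "EducationLevel": [0, 2, 3, 1],
    "Age": [64, 71, 88, 90],
})

toy_categorical = ["Gender", "EducationLevel"]

# Convert only the listed columns; the other columns keep their dtype.
toy = toy.astype({column: "category" for column in toy_categorical})

print(toy.dtypes)
# The category levels are the distinct codes found in the column, sorted.
print(list(toy["EducationLevel"].cat.categories))
```

Note that the levels are still the integer codes; the conversion changes how pandas treats the column (for example in `describe()` and plotting), not the stored values.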

Let's see how many different values the DoctorInCharge attribute can take.