Alzheimer's Disease Diagnosis Prediction
Project Overview
This project aims to predict the diagnosis of Alzheimer's Disease using various machine learning methods. Alzheimer's Disease is a progressive neurological disorder that leads to memory loss, cognitive decline, and, ultimately, the inability to perform daily activities. Early and accurate diagnosis is critical for managing the disease and improving the quality of life for affected individuals.
This project uses the Alzheimer's Disease Dataset from Kaggle. The dataset contains detailed information about patients, including demographic details, lifestyle factors, medical history, clinical measurements, cognitive and functional assessments, and symptoms.
Project Goals
The primary goal of this project is to develop a machine learning model that can accurately predict whether a patient is diagnosed with Alzheimer's Disease based on the available features in the dataset. We will explore various machine learning algorithms to determine the most effective model for this prediction task.
Dataset Overview
| Attributes | Description |
|---|---|
| PatientID | A unique identifier assigned to each patient (4751 to 6900). |
| Age | The age of the patients ranges from 60 to 90 years. |
| Gender | Gender of the patients, where 0 represents Male and 1 represents Female. |
| Ethnicity | The ethnicity of the patients, coded as follows: 0: Caucasian, 1: African American, 2: Asian, 3: Other. |
| EducationLevel | The education level of the patients, coded as follows: 0: None, 1: High School, 2: Bachelor's, 3: Higher. |
| BMI | Body Mass Index of the patients, ranging from 15 to 40. |
| Smoking | Smoking status, where 0 indicates No and 1 indicates Yes. |
| AlcoholConsumption | Weekly alcohol consumption in units, ranging from 0 to 20. |
| PhysicalActivity | Weekly physical activity in hours, ranging from 0 to 10. |
| DietQuality | Diet quality score, ranging from 0 to 10. |
| SleepQuality | Sleep quality score, ranging from 4 to 10. |
| FamilyHistoryAlzheimers | Family history of Alzheimer's Disease, where 0 indicates No and 1 indicates Yes. |
| CardiovascularDisease | Presence of cardiovascular disease, where 0 indicates No and 1 indicates Yes. |
| Diabetes | Presence of diabetes, where 0 indicates No and 1 indicates Yes. |
| Depression | Presence of depression, where 0 indicates No and 1 indicates Yes. |
| HeadInjury | History of head injury, where 0 indicates No and 1 indicates Yes. |
| Hypertension | Presence of hypertension, where 0 indicates No and 1 indicates Yes. |
| SystolicBP | Systolic blood pressure, ranging from 90 to 180 mmHg. |
| DiastolicBP | Diastolic blood pressure, ranging from 60 to 120 mmHg. |
| CholesterolTotal | Total cholesterol levels, ranging from 150 to 300 mg/dL. |
| CholesterolLDL | Low-density lipoprotein cholesterol levels, ranging from 50 to 200 mg/dL. |
| CholesterolHDL | High-density lipoprotein cholesterol levels, ranging from 20 to 100 mg/dL. |
| CholesterolTriglycerides | Triglycerides levels, ranging from 50 to 400 mg/dL. |
| MMSE | Mini-Mental State Examination score, ranging from 0 to 30. Lower scores indicate cognitive impairment. |
| FunctionalAssessment | Functional assessment score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| MemoryComplaints | Presence of memory complaints, where 0 indicates No and 1 indicates Yes. |
| BehavioralProblems | Presence of behavioral problems, where 0 indicates No and 1 indicates Yes. |
| ADL | Activities of Daily Living score, ranging from 0 to 10. Lower scores indicate greater impairment. |
| Confusion | Presence of confusion, where 0 indicates No and 1 indicates Yes. |
| Disorientation | Presence of disorientation, where 0 indicates No and 1 indicates Yes. |
| PersonalityChanges | Presence of personality changes, where 0 indicates No and 1 indicates Yes. |
| DifficultyCompletingTasks | Presence of difficulty completing tasks, where 0 indicates No and 1 indicates Yes. |
| Forgetfulness | Presence of forgetfulness, where 0 indicates No and 1 indicates Yes. |
| Diagnosis | Diagnosis status for Alzheimer's Disease, where 0 indicates No and 1 indicates Yes. |
| DoctorInCharge | This column contains confidential information about the doctor in charge, with "XXXConfid" as the value for all patients. |
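The ranges documented in the table can be checked against the loaded data. The following is a minimal sketch, not part of the original notebook: the helper name `out_of_range_counts`, the `DOCUMENTED_RANGES` dictionary, and the toy rows are assumptions made for illustration, and only a subset of the numeric columns is listed.

```python
import pandas as pd

# (min, max) ranges copied from the attribute table; only a subset is listed.
DOCUMENTED_RANGES = {
    "Age": (60, 90),
    "BMI": (15, 40),
    "AlcoholConsumption": (0, 20),
    "PhysicalActivity": (0, 10),
    "DietQuality": (0, 10),
    "SleepQuality": (4, 10),
    "SystolicBP": (90, 180),
    "DiastolicBP": (60, 120),
    "MMSE": (0, 30),
}


def out_of_range_counts(df, ranges=DOCUMENTED_RANGES):
    """Count values outside the documented range for each listed column found in df."""
    counts = {}
    for column, (low, high) in ranges.items():
        if column in df.columns:
            # between() is inclusive on both ends; NaN is treated as out of range here.
            counts[column] = int((~df[column].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Toy rows (hypothetical values, not taken from the dataset).
toy = pd.DataFrame(
    {
        "Age": [65, 91, 72],
        "MMSE": [28.5, 12.0, 31.0],
        "BMI": [22.1, 30.4, 39.9],
    }
)
print(out_of_range_counts(toy))
```

Once the dataset is loaded, the same helper could be called on the full DataFrame. A nonzero count would suggest either a data problem or a range in the table that is only approximate.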
Methodology
The project will follow these steps:
- Data Exploration and Preprocessing: Initial analysis of the dataset, including handling missing values, encoding categorical variables, and feature scaling.
- Feature Selection: Identifying the most relevant features for predicting Alzheimer's Disease.
- Model Selection: Evaluating multiple machine learning models (e.g., Logistic Regression, Decision Trees, Random Forest, and Support Vector Machines) and selecting the best-performing model based on key metrics such as accuracy, precision, recall, and F1-score.
- Model Evaluation: Using cross-validation techniques to assess the model's performance and ensure it generalizes well to unseen data.
- Conclusion: Summarizing the findings and discussing potential implications for early diagnosis of Alzheimer's Disease.
Through this project, we aim to contribute to the ongoing efforts in leveraging machine learning for healthcare applications, particularly in the early detection and management of Alzheimer's Disease.
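The model selection and evaluation steps listed above can be sketched as follows. This is an illustrative outline under stated assumptions, not the notebook's actual pipeline: it uses synthetic data from `make_classification` in place of the real features and `Diagnosis` labels, compares only two candidate models, and the names `candidates` and `summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and Diagnosis labels (assumption).
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    weights=[0.65, 0.35],
    random_state=0,
)

# Scaling lives inside the pipeline so it is fit on the training folds only.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class balance similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1"]

summary = {}
for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array per metric under the key "test_<metric>".
    summary[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in summary.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
```

Averaging each metric over the folds gives one row per model, so candidates can be compared on recall or F1-score as well as accuracy. For a diagnosis task, this matters because missed cases and false alarms carry different costs.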
Let's import the libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.naive_bayes import GaussianNB
from numpy import mean, std

Let's load the dataset.
alzheimer = pd.read_csv('alzheimers_disease_data.csv')

Let's look at the shape of the dataset.
alzheimer.shape

alzheimer.head()

Let's check whether the dataset contains any null values.
alzheimer.isnull().sum()[alzheimer.isnull().sum() > 0]

As we can see, there are no null values. Now let's examine the types of attributes.
alzheimer.info()

Several attributes hold category codes (mostly 0 or 1, with Ethnicity and EducationLevel coded 0 to 3) but are stored as int64. I will convert them to the category type.
categorical_columns = [
    'Gender', 'Ethnicity', 'EducationLevel', 'Smoking',
    'FamilyHistoryAlzheimers', 'CardiovascularDisease',
    'Diabetes', 'Depression', 'HeadInjury',
    'Hypertension', 'MemoryComplaints',
    'BehavioralProblems', 'Confusion',
    'Disorientation', 'PersonalityChanges',
    'DifficultyCompletingTasks', 'Forgetfulness'
]
alzheimer[categorical_columns] = alzheimer[categorical_columns].astype('category')

Let's see how many different values the DoctorInCharge attribute can take.