Detecting Tuberculosis in X-Rays
📖 Background
Tuberculosis (TB) is one of the most common and deadly respiratory diseases in the world, causing about 1.25 million deaths in 2023. Doctors often use chest X-rays to help detect TB. However, reviewing many X-rays by hand can be slow and difficult.
In this challenge, you will build a simple machine learning model that can help classify chest X-ray images into two groups:
- Healthy lungs
- Lungs affected by TB
This is not about building a “perfect” model. The focus should be on how you describe your process, decisions, and learnings.
🩻 The Data
You are given a small subset of the Sakha-TB dataset:
- Training data: 150 healthy + 150 TB images (300 total)
- Test data: 50 healthy + 50 TB images (100 total)
These images are in the data.zip file at the root of the notebook. Once extracted, they are in the data/chestxrays folder, which is divided into test and train folders, each containing healthy and tb folders with the images inside.
💪 Challenge
You will train a model to classify chest X-rays. Your report should cover these questions:
Preprocessing
What steps did you take to make the images easier for a model to understand?
Some ideas to think about:
- Did you resize the images to the same size?
- Did you convert them to grayscale or normalize the pixel values?
- Did you try any simple image transformations (e.g., small rotations)?
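As a rough illustration of two of these ideas, the sketch below uses base R only. It treats a small random matrix as a stand-in for an 8-bit grayscale image (an assumption for illustration, not real X-ray data), rescales the pixel values to the range 0 to 1, and mirrors the image left to right. The helper names `rescale_pixels` and `flip_horizontal` are hypothetical.

```r
# Stand-in for an 8-bit grayscale image: a 4 x 6 matrix of values in 0..255.
# (Hypothetical data; a real X-ray would be read from disk.)
set.seed(42)
img <- matrix(sample(0:255, 24, replace = TRUE), nrow = 4, ncol = 6)

# Rescale 8-bit intensities to the [0, 1] range.
rescale_pixels <- function(x, max_value = 255) {
  x / max_value
}

# Mirror the image left to right by reversing the column order.
flip_horizontal <- function(x) {
  x[, ncol(x):1, drop = FALSE]
}

img_scaled <- rescale_pixels(img)
img_flipped <- flip_horizontal(img_scaled)

print(range(img_scaled))  # all values now lie between 0 and 1
print(dim(img_flipped))   # the flip keeps the image dimensions
```

Scaling inputs to a small, consistent range tends to make training more stable, and simple transformations such as flips or small rotations give the model slightly different views of the same 300 training images.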
Modeling
Try at least two models and compare them.
- One can be a simple model you build yourself (like a small CNN).
- Another can be a pre-trained model (like ResNet or MobileNet).
Explain what you tried and what differences you observed.
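The two kinds of model could be sketched roughly as follows. This assumes the `keras` R package with a working TensorFlow backend; the function names `build_small_cnn` and `build_transfer_model`, the layer sizes, and the choice of MobileNetV2 are illustrative assumptions, not a recommended architecture.

```r
# Assumes the 'keras' R package and a TensorFlow backend are installed.
# Functions are called with the keras:: prefix, so defining them does not
# require the package to be attached.

# A small CNN trained from scratch on single-channel (grayscale) images.
build_small_cnn <- function(input_shape = c(224, 224, 1)) {
  model <- keras::keras_model_sequential()
  model <- keras::layer_conv_2d(model, filters = 16, kernel_size = c(3, 3),
                                activation = "relu", input_shape = input_shape)
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))
  model <- keras::layer_conv_2d(model, filters = 32, kernel_size = c(3, 3),
                                activation = "relu")
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))
  model <- keras::layer_flatten(model)
  model <- keras::layer_dense(model, units = 32, activation = "relu")
  # One sigmoid unit: the output is read as the probability of the TB class.
  model <- keras::layer_dense(model, units = 1, activation = "sigmoid")
  keras::compile(model, optimizer = "adam",
                 loss = "binary_crossentropy", metrics = "accuracy")
  model
}

# A pre-trained MobileNetV2 base with frozen weights and a new binary head.
# ImageNet weights expect three-channel input, hence the default shape.
build_transfer_model <- function(input_shape = c(224, 224, 3)) {
  base <- keras::application_mobilenet_v2(
    input_shape = input_shape, include_top = FALSE, weights = "imagenet"
  )
  keras::freeze_weights(base)  # keep the pre-trained features fixed

  inputs <- keras::layer_input(shape = input_shape)
  features <- base(inputs)
  pooled <- keras::layer_global_average_pooling_2d(features)
  outputs <- keras::layer_dense(pooled, units = 1, activation = "sigmoid")

  model <- keras::keras_model(inputs = inputs, outputs = outputs)
  keras::compile(model, optimizer = "adam",
                 loss = "binary_crossentropy", metrics = "accuracy")
  model
}
```

Note that the pre-trained model needs three-channel images, so a data pipeline that loads grayscale images would have to load them in RGB mode instead, and the pixel scaling MobileNetV2 was trained with may differ from a plain division by 255. Differences like these are worth mentioning when you compare the two models.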
Evaluation
Evaluate your models on the test set. Report the following metrics in plain words:
- Sensitivity (Recall for TB): What proportion of actual TB cases your model correctly finds.
- Specificity: What proportion of actual healthy cases your model correctly identifies.
- Positive Predictive Value (PPV): When your model says “TB”, how often it’s right.
- Negative Predictive Value (NPV): When your model says “Healthy”, how often it’s right.
👉 Tip: You don’t need to get the “best” numbers. Focus on explaining what the metrics mean and what you learned.
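The four metrics all come from the same confusion-matrix counts. A minimal base R sketch is shown below, assuming labels coded as 1 for TB and 0 for healthy; the function name `diagnostic_metrics` and the example predictions are made up for illustration and are not results from any real model.

```r
# Compute sensitivity, specificity, PPV and NPV from true and predicted labels.
# Labels are assumed to be coded 1 = TB (positive) and 0 = healthy (negative).
diagnostic_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # TB correctly flagged
  tn <- sum(actual == 0 & predicted == 0)  # healthy correctly cleared
  fp <- sum(actual == 0 & predicted == 1)  # healthy flagged as TB
  fn <- sum(actual == 1 & predicted == 0)  # TB missed
  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn)
  )
}

# Made-up example with the same class balance as the test set (50 + 50).
actual <- c(rep(1, 50), rep(0, 50))
predicted <- c(rep(1, 45), rep(0, 5),    # 45 TB cases found, 5 missed
               rep(1, 10), rep(0, 40))   # 10 false alarms, 40 correct
metrics <- diagnostic_metrics(actual, predicted)
print(round(metrics, 3))
```

In this made-up case the model finds 45 of 50 TB cases (sensitivity 0.9) and clears 40 of 50 healthy cases (specificity 0.8). PPV and NPV also depend on how common TB is in the data, so values measured on a balanced 50/50 test set will not carry over directly to a population where TB is rare.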
[Optional] ROC Curve
If you want, you can also draw a Receiver Operating Characteristic (ROC) curve to visualize how your model's performance changes with different thresholds.
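One way to see what the ROC curve contains is to compute it by hand. The base R sketch below assumes the model outputs a TB probability per image and that labels are coded 1 for TB and 0 for healthy; `roc_points`, `auc_trapezoid`, and the example scores are hypothetical and chosen only for illustration.

```r
# For each candidate threshold, classify as TB when score >= threshold and
# record the true positive rate and false positive rate.
roc_points <- function(actual, scores) {
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) {
    sum(scores >= t & actual == 1) / sum(actual == 1)
  })
  fpr <- sapply(thresholds, function(t) {
    sum(scores >= t & actual == 0) / sum(actual == 0)
  })
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the curve using the trapezoid rule.
auc_trapezoid <- function(roc) {
  n <- nrow(roc)
  sum(diff(roc$fpr) * (roc$tpr[-n] + roc$tpr[-1]) / 2)
}

# Made-up scores for 4 TB images and 4 healthy images.
actual <- c(1, 1, 1, 1, 0, 0, 0, 0)
scores <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)

roc <- roc_points(actual, scores)
auc <- auc_trapezoid(roc)
print(roc)
print(auc)
```

The curve itself can then be drawn with `plot(roc$fpr, roc$tpr, type = "l")`. Each row of `roc` is one threshold, so the table also shows directly how sensitivity (the `tpr` column) trades off against false alarms (the `fpr` column).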
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
- Check that all the cells run without error.
# Load the necessary library
library(utils)
# Specify the path to the Zip file
zip_path <- "data.zip"
# Extract files from the Zip archive
sub_dir <- "data/chestxrays"
if (!file.exists(sub_dir)) {
  unzip(zip_path)
}

# =========================
# FIXED CODE BLOCK
# =========================
# 0. Ensure Python and Keras/TensorFlow are available
# ---------------------------------------------------
# This block checks for Python and installs TensorFlow/Keras if needed.
# Avoid using install_keras(method = "conda") due to conda environment creation errors.
# Instead, use method = "auto" or "virtualenv" (recommended for most systems).
if (!reticulate::py_available(initialize = TRUE)) {
  cat("Python not found. Installing TensorFlow/Keras Python packages...\n")
  # reticulate::install_miniconda() # Commented out to avoid conda errors
  keras::install_keras(method = "auto") # Use "auto" or "virtualenv" instead of "conda"
}
# Load the necessary libraries
library(keras)
library(tensorflow)
## 1. Define Paths and Parameters
# Make sure 'sub_dir' is defined before using it
# sub_dir <- "path/to/your/data" # Uncomment and set this to your data directory
train_dir <- file.path(sub_dir, "train")
test_dir <- file.path(sub_dir, "test")
# Define image dimensions and batch size
IMG_WIDTH <- 224
IMG_HEIGHT <- 224
BATCH_SIZE <- 32
## 2. Create an Image Data Generator for the Training Set
train_datagen <- image_data_generator(
  rescale = 1/255,
  rotation_range = 20,
  width_shift_range = 0.1,
  height_shift_range = 0.1,
  shear_range = 0.1,
  zoom_range = 0.1,
  horizontal_flip = TRUE
)
## 3. Create an Image Data Generator for the Test Set
test_datagen <- image_data_generator(
  rescale = 1/255
)
## 4. Create the Data Flow from Directories
cat("Loading and preprocessing training images...\n")
train_generator <- flow_images_from_directory(
  directory = train_dir,
  generator = train_datagen,
  target_size = c(IMG_HEIGHT, IMG_WIDTH), # Keras expects (height, width)
  color_mode = "grayscale",
  batch_size = BATCH_SIZE,
  class_mode = "binary"
)
cat("Loading and preprocessing test images...\n")
test_generator <- flow_images_from_directory(
  directory = test_dir,
  generator = test_datagen,
  target_size = c(IMG_HEIGHT, IMG_WIDTH), # Keras expects (height, width)
  color_mode = "grayscale",
  batch_size = BATCH_SIZE,
  class_mode = "binary",
  shuffle = FALSE # Keep file order so predictions line up with true labels
)
## 5. Verify the Setup
cat("\nClasses were automatically identified:\n")
print(train_generator$class_indices)