
Detecting Tuberculosis in X-Rays

📖 Background

Tuberculosis (TB) is one of the most common and deadly respiratory diseases in the world, causing about 1.25 million deaths in 2023. Doctors often use chest X-rays to help detect TB. However, reviewing many X-rays by hand is slow and difficult.

In this challenge, you will build a simple machine learning model that can help classify chest X-ray images into two groups:

  • Healthy lungs
  • Lungs affected by TB

This is not about building a “perfect” model. The focus should be on how you describe your process, decisions, and learnings.

🩻 The Data

 

You are given a small subset of the Sakha-TB dataset:

  • Training data: 150 healthy + 150 TB images (300 total)
  • Test data: 50 healthy + 50 TB images (100 total)

These images are in the data.zip file at the root of the notebook. Once extracted, they are in the data/chestxrays folder, which is divided into test and train subfolders. Each of these contains a healthy folder and a tb folder with the images inside.
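
A quick way to confirm that extraction produced the layout described above is to count the files in each folder. The following is a minimal sketch in base R. The helper name count_images and the list of file extensions are assumptions made for illustration, not part of the challenge materials.

```r
# Count image files in each split/class folder under a dataset root.
# Assumption: images use one of the extensions matched by `pattern`.
count_images <- function(root, pattern = "\\.(png|jpe?g)$") {
  splits  <- c("train", "test")
  classes <- c("healthy", "tb")
  counts <- sapply(splits, function(s) {
    sapply(classes, function(cl) {
      # list.files() returns character(0) for a missing folder, so the count is 0
      length(list.files(file.path(root, s, cl),
                        pattern = pattern, ignore.case = TRUE))
    })
  })
  counts  # matrix: rows = classes, columns = splits
}

# Example call (pass the folder created by the unzip step):
# count_images("data/chestxrays")
```

If the archive matches the description, the train column should show 150 images per class and the test column 50 per class. A zero in any cell usually means the path or the extension pattern needs adjusting.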

💪 Challenge

You will train a model to classify chest X-rays. Your report should cover these questions:

  1. Preprocessing
    What steps did you take to make the images easier for a model to understand?
    Some ideas to think about:

    • Did you resize the images to the same size?
    • Did you convert them to grayscale or normalize the pixel values?
    • Did you try any simple image transformations (e.g., small rotations)?
  2. Modeling
    Try at least two models and compare them.

    • One can be a simple model you build yourself (like a small CNN).
    • Another can be a pre-trained model (like ResNet or MobileNet).
      Explain what you tried and what differences you observed.
  3. Evaluation
    Evaluate your models on the test set. Report the following metrics in plain words:

    • Sensitivity (Recall for TB): How many TB cases your model correctly finds.
    • Specificity: How many healthy cases your model correctly identifies.
    • Positive Predictive Value (PPV): When your model says “TB”, how often it’s right.
    • Negative Predictive Value (NPV): When your model says “Healthy”, how often it’s right.

    👉 Tip: You don’t need to get the “best” numbers. Focus on explaining what the metrics mean and what you learned.

  4. [Optional] ROC Curve
    If you want, you can also draw a Receiver Operating Characteristic (ROC) curve to visualize how your model's performance changes with different thresholds.
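
The four metrics in step 3 and the ROC curve in step 4 can all be computed from the true labels and the model's predicted scores. The sketch below uses base R only. It assumes the labels are coded 1 = TB (positive) and 0 = healthy (negative), and that higher scores mean "more likely TB". The function names and the toy data are hypothetical and are not results from the dataset.

```r
# Sensitivity, specificity, PPV and NPV from true and predicted labels.
# Assumed coding: 1 = TB (positive class), 0 = healthy (negative class).
diagnostic_metrics <- function(actual, predicted) {
  tp <- sum(actual == 1 & predicted == 1)  # TB correctly flagged
  tn <- sum(actual == 0 & predicted == 0)  # healthy correctly cleared
  fp <- sum(actual == 0 & predicted == 1)  # healthy flagged as TB
  fn <- sum(actual == 1 & predicted == 0)  # TB missed
  c(
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn)
  )
}

# ROC points: sweep the decision threshold over every distinct score.
roc_points <- function(actual, scores) {
  thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) sum(scores >= t & actual == 1) / sum(actual == 1))
  fpr <- sapply(thresholds, function(t) sum(scores >= t & actual == 0) / sum(actual == 0))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Area under the ROC curve using the trapezoidal rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

# Toy example with made-up labels and scores.
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0)
scores    <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)
predicted <- as.integer(scores >= 0.5)  # 0.5 is an arbitrary threshold

metrics <- diagnostic_metrics(actual, predicted)
roc     <- roc_points(actual, scores)
auc     <- auc_trapezoid(roc)

# To draw the curve:
# plot(roc$fpr, roc$tpr, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
```

Lowering the threshold below 0.5 trades specificity for sensitivity, which is the trade-off the ROC curve makes visible. For a screening use case, it is worth stating in the report which of the two errors (missed TB or false alarm) you consider more costly.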

✅ Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
  • Check that all the cells run without error.
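
# Load the necessary library (utils provides unzip())
library(utils)

# Path to the zip archive at the root of the notebook
zip_path <- "data.zip"

# Folder the images are expected in after extraction
sub_dir <- "data/chestxrays"

# Extract the archive only if the data folder does not exist yet
if (!dir.exists(sub_dir)) {
  unzip(zip_path)
}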

## 0. Ensure Python and Keras/TensorFlow are available

# Check for a usable Python and install the TensorFlow/Keras Python packages if needed.
# install_keras(method = "conda") is avoided because of conda environment creation errors;
# method = "auto" or "virtualenv" is used instead.

if (!reticulate::py_available(initialize = TRUE)) {
  cat("Python not found. Installing TensorFlow/Keras Python packages...\n")
  # reticulate::install_miniconda() is intentionally not called, to avoid conda errors
  keras::install_keras(method = "auto")
}

# Load the necessary libraries
library(keras)
library(tensorflow)

## 1. Define Paths and Parameters

# Make sure 'sub_dir' is defined before using it
# sub_dir <- "path/to/your/data" # Uncomment and set this to your data directory

train_dir <- file.path(sub_dir, "train")
test_dir  <- file.path(sub_dir, "test")

# Define image dimensions and batch size
IMG_WIDTH <- 224
IMG_HEIGHT <- 224
BATCH_SIZE <- 32

## 2. Create an Image Data Generator for the Training Set

train_datagen <- image_data_generator(
  rescale = 1/255, 
  rotation_range = 20,
  width_shift_range = 0.1,
  height_shift_range = 0.1,
  shear_range = 0.1,
  zoom_range = 0.1,
  horizontal_flip = TRUE
)

## 3. Create an Image Data Generator for the Test Set

test_datagen <- image_data_generator(
  rescale = 1/255
)

## 4. Create the Data Flow from Directories

cat("Loading and preprocessing training images...\n")
train_generator <- flow_images_from_directory(
  directory = train_dir,
  generator = train_datagen,
  target_size = c(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
  color_mode = "grayscale", 
  batch_size = BATCH_SIZE,
  class_mode = "binary" 
)

cat("Loading and preprocessing test images...\n")
test_generator <- flow_images_from_directory(
  directory = test_dir,
  generator = test_datagen,
  target_size = c(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
  color_mode = "grayscale",
  batch_size = BATCH_SIZE,
  class_mode = "binary",
  shuffle = FALSE 
)

## 5. Verify the Setup

cat("\nClasses were automatically identified:\n")
print(train_generator$class_indices)