Fast-and-Frugal Decision Trees in R with FFTrees

An introductory tutorial to fast-and-frugal decision trees in R with the FFTrees package.
Dec 2017  · 12 min read

Imagine that one day an administrator named Heidi at a local hospital comes to you, a freelance data scientist, with a task. Every day, 30 people come to the hospital's emergency room complaining of chest pain. Some of these patients are having heart attacks and should be immediately sent to a coronary care bed, while others are in fact not having heart attacks and should be sent to a regular hospital bed. Unfortunately, it is not immediately obvious which patients are having heart attacks and which are not. It would be prohibitive in terms of time and cost to give every patient a definitive test for a heart attack before assigning them to one of the two beds. Instead, doctors need to make fast decisions for each patient based on a limited amount of uncertain information, from basic demographic information (such as age and sex) to various medical tests that can be completed in a relatively short period of time.

Heidi would like your help in designing an algorithm that the emergency room doctors can use to quickly decide where to send each patient with chest pain. Importantly, she tells you that the doctors will only use an algorithm that they can quickly understand and apply with minimal effort. Ideally, the algorithm shouldn't even require a calculator to use! For this reason, complex algorithms such as random forests, and even regression, are not viable.

What algorithm suits this task and how can you create it?

The algorithm you will use to solve this problem is a Fast-and-Frugal Decision Tree (Martignon et al., 2003). A fast-and-frugal tree is an extremely simple decision tree that anyone can easily understand, learn, and use to make fast decisions with minimal effort.

In this tutorial, you will cover the basics of using the `FFTrees` R package (Phillips et al., 2017) to create fast-and-frugal trees. You will use data from real emergency room patients to create several fast-and-frugal trees, and then visualise how well they make decisions for both training and test data.

Install the `FFTrees` Package

You can install the `FFTrees` package from CRAN using the `install.packages()` function:

```r
# Install FFTrees from CRAN
install.packages("FFTrees")
```

Once you have installed the package, you can load it using the `library()` function:

```r
# Load the package
library("FFTrees")
```

Explore the Heart Disease Data

The `FFTrees` package contains two datasets that you'll use in this tutorial: one called `heart.train` that you will use to create (aka. train) fast-and-frugal trees, and another called `heart.test` that you will use to test their prediction performance.

Let's look at the first few rows of the `heart.train` dataframe:

```r
# Print the first few rows of the training dataframe
head(heart.train)
```

```
##     age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope
## 94   44   0 np      108  141   0      normal     175     0     0.6  flat
## 78   51   0 np      140  308   0 hypertrophy     142     0     1.5    up
## 167  52   1 np      138  223   0      normal     169     0     0.0    up
## 17   48   1 aa      110  229   0      normal     168     0     1.0  down
## 141  59   1 aa      140  221   0      normal     164     1     0.0    up
## 145  58   1 np      105  240   0 hypertrophy     154     1     0.6  flat
##     ca   thal diagnosis
## 94   0 normal         0
## 78   1 normal         0
## 167  1 normal         0
## 17   0     rd         1
## 141  0 normal         0
## 145  0     rd         0
```

As you can see, this dataframe contains data from several patients, each categorised by demographic features such as age and sex, as well as the results of medical tests and measures such as their cholesterol level (`chol`) and their type of chest pain (`cp`). The key variable you want to predict is `diagnosis`, which is 1 for patients who are truly having heart attacks, and 0 for those who are not. The goal of your fast-and-frugal tree will be to find a few key variables in the data that can quickly and accurately predict diagnosis.
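Before building any trees, it can help to check the base rate of the criterion with base R's `table()` function. A minimal sketch; the expected counts follow from the training data described later in this tutorial (84 stable patients and 66 heart-attack patients):

```r
# Count the two diagnosis classes in the training data
table(heart.train$diagnosis)
##
##  0  1
## 84 66
```

Knowing that the two classes are fairly balanced (44% vs. 56%) is useful context when you later interpret accuracy statistics.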

Create an `FFTrees` Object

In order to create fast-and-frugal trees from the `heart.train` dataframe, you'll use the `FFTrees()` function. This will return an object of the `FFTrees` class that you'll assign to `heart_FFT`. The main arguments to `FFTrees()` are

• `data`: a dataframe used to create the trees.
• `formula`: a formula which specifies which variables you want to predict with the trees.

In addition, you will specify three optional arguments:

• `data.test`: data used to test the prediction accuracy of the trees
• `main`: a title.
• `decision.labels`: verbal labels for the final decisions. In this case, you'll call the tree "ER Decisions" and use 'Stable' and 'H Attack' for the two bed types.

```r
# Create an FFTrees object called `heart_FFT`
heart_FFT <- FFTrees(formula = diagnosis ~ .,                  # The variable we are predicting
                     data = heart.train,                       # Training data
                     data.test = heart.test,                   # Testing data
                     main = "ER Decisions",                    # Main label
                     decision.labels = c("Stable", "H Attack")) # Labels for decisions
```

Now that you've created the object `heart_FFT`, you can print it to obtain basic summary statistics:

```r
# Print basic information about the FFT
heart_FFT
```

```
## ER Decisions
## FFT #1 predicts diagnosis using 3 cues: {thal,cp,ca}
##
## [1] If thal = {rd,fd}, predict H Attack.
## [2] If cp != {a}, predict Stable.
## [3] If ca <= 0, predict Stable, otherwise, predict H Attack.
##
##                    train   test
## cases       :n    150.00 153.00
## speed       :mcu    1.74   1.73
## frugality   :pci    0.88   0.88
## accuracy    :acc    0.80   0.82
## weighted    :wacc   0.80   0.82
## sensitivity :sens   0.82   0.88
## specificity :spec   0.79   0.76
##
## pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4
```

Here you see that the tree uses three cues, or features: `thal`, `cp`, and `ca`. These can be summarised as follows:

[1] If `thal` is either `rd` or `fd`, decide Heart Attack.
[2] If `cp` is not equal to `a`, decide Stable.
[3] If `ca` <= 0, decide Stable; otherwise, decide Heart Attack.

Importantly, the tree uses these three cues sequentially. That is, as soon as a decision is made for a patient, then no additional information is considered.
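If you'd like this verbal description as plain text, for example to paste into a report for the doctors, the package exports an `inwords()` function that returns a tree's decision rules as sentences. A minimal sketch, assuming the `heart_FFT` object created above (the exact output format can vary across package versions):

```r
# Describe the tree's decision rules in words
inwords(heart_FFT)
```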

When you see the tree visually, this will become very clear!

From this output, you can also see many summary performance statistics for the tree in both the training and test datasets, from the speed of the tree to its accuracy. For example, the value speed = 1.74 for the training data means that the tree uses, on average, 1.74 pieces of information to classify a case. Additionally, the value accuracy = 0.82 for the test data means that the tree correctly classified 82% of the test cases.

Plotting Fast-and-Frugal Trees

One of the best features of a fast-and-frugal tree is that it's so easy to understand visually. To visualise a tree, along with summary statistics, you can use the generic plotting function `plot()`:

```r
# Visualise the tree applied to the test data heart.test
plot(heart_FFT,
     data = "test")
```

This plot tells you lots of great information about the data and the fast-and-frugal tree.

• On the top row, you can see that there were 150 patients (cases) in the training data, of whom 66 (44%) were truly having heart attacks and 84 (56%) were not.
• In the middle row, you see exactly how the tree makes decisions for each of the patients using easy-to-understand icon arrays (Galesic et al., 2009). For example, you see that 63 patients suspected of having heart attacks were (virtually) sent to the CCU after the first question, of whom 16 were not having heart attacks (false alarms) and 47 were (hits).
• In the bottom row of the plot, you can see aggregate summary statistics for the tree: a 2 x 2 confusion matrix (see this Wikipedia page) summarising how well the tree was able to classify patients, levels indicating overall summary statistics, and an ROC curve comparing the accuracy of the tree to other algorithms such as logistic regression (LR) and random forests (RF). Here, the fast-and-frugal tree is represented by the green circle "1": it had a higher sensitivity than logistic regression and random forests, but at the cost of a lower specificity.
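You can also visualise how accurate each individual cue would have been on its own. A sketch using the `what = "cues"` argument to `plot()`, which (assuming your installed version of the package supports it) shows the marginal accuracy of every cue in ROC space:

```r
# Show the marginal accuracy of each cue in ROC space
plot(heart_FFT,
     what = "cues",
     data = "train")
```

This is a quick way to see why the algorithm picked `thal`, `cp`, and `ca` over the other available cues.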

Creating and Testing a Custom Tree

One of the best things about fast-and-frugal trees is that, because they are so simple, you can easily describe your own fast-and-frugal tree 'in words', and then apply that tree to data using `FFTrees()`. This can be very useful for comparing different fast-and-frugal trees. For example, after presenting the tree in Figure 1 to Heidi, she may come back to you with the following: "The tree you presented looks good, but a member of our staff thinks that a better rule would be to use the cues cholesterol, age, and slope. Can you apply this tree to the data and compare how well it works to the tree you presented?"

Heidi's tree

[1] If cholesterol > 300, decide Heart Attack.
[2] If age < 50, decide Stable.
[3] If slope is either up or flat, decide Heart Attack; otherwise, decide Stable.

Yes, you easily can! The `my.tree` argument in the `FFTrees()` function allows you to easily create and test any tree you can type 'in words'. In the code chunk below, you'll create Heidi's tree using the `my.tree` argument.

```r
# Create Heidi's custom FFT
custom_FFT <- FFTrees(formula = diagnosis ~ .,
                      data = heart.train,
                      data.test = heart.test,
                      main = "Heidi's Tree",
                      decision.labels = c("Stable", "Attack"),
                      my.tree = "If chol > 300, predict Attack.
                                 If age < 50, predict Stable.
                                 If slope = {up, flat}, predict Attack. Otherwise, predict Stable.")

# Plot Heidi's tree and accuracy statistics
plot(custom_FFT,
     data = "test")
```

As you can see, Heidi's tree is much, much worse than the tree the internal algorithm came up with. While the tree generated by FFTrees in Figure 1 had an overall accuracy of 82%, Heidi's tree is only 54% accurate!

Moreover, you can see that very few patients (only 21) are classified as having a heart attack after the first node based on their cholesterol level, and of those, only 12 / 21 (57%) were really having heart attacks. In contrast, for the tree created by `FFTrees`, a full 72 patients are classified after the first node based on their value of `thal`, and of these, 75% were truly having heart attacks. Thus, you have strong evidence that the tree created by `FFTrees` is faster and more accurate than Heidi's tree.

There is much more you can do with this package! For example, you can use trees to predict classes (and their probabilities) for new datasets, and create trees that minimise different classification error costs (for example, when the cost of a miss is much higher than the cost of a false-alarm).
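For example, a sketch of making predictions for new data with the generic `predict()` function (argument names such as `type` may vary slightly across package versions):

```r
# Predict classes for the patients in heart.test
predict(heart_FFT, newdata = heart.test)

# Predict class probabilities instead of classes
predict(heart_FFT, newdata = heart.test, type = "prob")
```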

For more information on these features and more, check out the package vignette by running `FFTrees.guide()`.

Summary

Fast-and-frugal decision trees are great options when you need a simple, transparent decision algorithm that can easily be communicated and applied, either by a person or a computer. In this tutorial, you have covered the basic steps of creating and visualising a fast-and-frugal decision tree from medical data using the `FFTrees` package.

Although fast-and-frugal trees are great for medical decisions (Green & Mehr, 1997), they can be created from any dataset with a binary criterion, from predicting whether or not a bank will fail (Neth et al., 2014), to predicting a judge's bailing decisions (Dhami & Ayton, 2001). Have fun!

Do you want to know more about Nathaniel, creator of the FFTrees package? Check out his website or send him an e-mail at [email protected]!

References

• Dhami, Mandeep K, and Peter Ayton. 2001. "Bailing and Jailing the Fast and Frugal Way." Journal of Behavioral Decision Making 14 (2). Wiley Online Library: 141 - 168.
• Galesic, Mirta, Rocio Garcia-Retamero, and Gerd Gigerenzer. 2009. "Using Icon Arrays to Communicate Medical Risks: Overcoming Low Numeracy." Health Psychology 28 (2). American Psychological Association: 210.
• Green, Lee, and David R Mehr. 1997. "What Alters Physicians' Decisions to Admit to the Coronary Care Unit". Journal of Family Practice 45 (3). [New York, Appleton-Century-Crofts]: 219 - 226.
• Martignon, Laura, Oliver Vitouch, Masanori Takezawa, and Malcolm R Forster. 2003. "Naive and yet Enlightened: From Natural Frequencies to Fast and Frugal Decision Trees." Thinking: Psychological Perspectives on Reasoning, Judgment and Decision Making. John Wiley & Sons, Ltd, 189 - 211.
• Neth, Hansjörg, Björn Meder, Amit Kothiyal, and Gerd Gigerenzer. 2014. "Homo Heuristicus in the Financial World: From Risk Management to Managing Uncertainty." Journal of Risk Management in Financial Institutions 7 (2). Henry Stewart Publications: 134 - 144.
• Phillips, Nathaniel D, Hansjörg Neth, Jan K Woike, and Wolfgang Gaissmaier. 2017. "FFTrees: A Toolbox to Create, Visualize, and Evaluate Fast-and-Frugal Decision Trees." Judgment and Decision Making 12 (4). Society for Judgment & Decision Making: 344.
