Level Difficulty in Candy Crush Saga
    1. Candy Crush Saga

    Candy Crush Saga is a hit mobile game developed by King (part of Activision Blizzard) that is played by millions of people all around the world. The game is structured as a series of levels where players need to match similar candies together to (hopefully) clear the level and keep progressing on the level map.

    Candy Crush has more than 3000 levels, and new ones are added every week. That is a lot of levels! And with that many levels, it's important to get level difficulty just right. Too easy and the game gets boring; too hard and players become frustrated and quit playing.

    In this project, we will see how we can use data collected from players to estimate level difficulty. Let's start by loading in the packages we're going to need.

    # This sets the size of plots to a good default.
    options(repr.plot.width = 5, repr.plot.height = 4)
    
    # Loading in packages
    library(readr)
    library(dplyr)
    library(ggplot2)

    2. The data set

    The dataset we will use contains one week of data from a sample of players who played Candy Crush back in 2014. The data is also from a single episode, that is, a set of 15 levels. It has the following columns:

    • player_id: a unique player id
    • dt: the date
    • level: the level number within the episode, from 1 to 15.
    • num_attempts: number of level attempts for the player on that level and date.
    • num_success: number of level attempts that resulted in a success/win for the player on that level and date.

    The granularity of the dataset is player, date, and level. That is, there is a row for every player, day, and level recording the total number of attempts and how many of those resulted in a win.
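
    The stated granularity can be checked directly. The sketch below is only an illustration: it uses a small made-up table named toy (not the real dataset) with the same columns and counts how many (player, date, level) combinations occur more than once.

    ```r
    library(dplyr)

    # Hypothetical toy rows in the same shape as the dataset (values are made up)
    toy <- data.frame(
      player_id    = c("a", "a", "b", "b"),
      dt           = as.Date(c("2014-01-01", "2014-01-02", "2014-01-01", "2014-01-01")),
      level        = c(1, 1, 1, 2),
      num_attempts = c(3, 1, 2, 4),
      num_success  = c(0, 1, 1, 1)
    )

    # Count rows per (player, date, level); a count above 1 would break the granularity
    duplicates <- toy %>%
      count(player_id, dt, level) %>%
      filter(n > 1)

    # Number of duplicated combinations (0 means one row per player, date, and level)
    nrow(duplicates)

    # Sanity check: wins can never exceed attempts on any row
    all(toy$num_success <= toy$num_attempts)
    ```

    The same two checks should work on the real data once it is loaded, assuming the column names listed above.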

    Now, let's load in the dataset and take a look at the first few rows.

    # Reading in the data
    data <- read_csv("datasets/candy_crush.csv")
    
    # Printing out the first six rows
    head(data, 6)

    3. Checking the data set

    Now that we have loaded the dataset, let's count how many players we have in the sample and how many days' worth of data we have.

    # Count and display the number of unique players
    print("Number of players:")
    length(unique(data$player_id))
    
    # Display the date range of the data
    print("Period for which we have data:")
    range(data$dt)

    4. Computing level difficulty

    Within each Candy Crush episode, there is a mix of easier and tougher levels. Luck and individual skill make the number of attempts required to pass a level different from player to player. The assumption is that difficult levels require more attempts on average than easier ones. That is, the harder a level is, the lower the probability of passing that level in a single attempt.

    A simple approach is to model each attempt as a Bernoulli trial: a binary outcome (you either win or lose) characterized by a single parameter p_win, the probability of winning the level in a single attempt. This probability can be estimated for each level as:

    p_{win} = \frac{\sum wins}{\sum attempts}

    For example, let's say a level has been played 10 times and 2 of those attempts ended in a win. Then the probability of winning in a single attempt would be p_win = 2 / 10 = 20%.
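
    Because the dataset stores one row per player, date, and level, the estimate pools all rows for a level before dividing. The sketch below uses two made-up rows that add up to the 10 attempts and 2 wins of the example above; the per-row average is shown only for contrast and is not the estimate used in this project.

    ```r
    # Hypothetical rows for one level: 10 attempts and 2 wins in total
    num_attempts <- c(8, 2)
    num_success  <- c(1, 1)

    # Pooled estimate: total wins divided by total attempts
    p_win_pooled <- sum(num_success) / sum(num_attempts)

    # For contrast: averaging per-row win rates weights every row equally,
    # no matter how many attempts it contains
    p_win_row_mean <- mean(num_success / num_attempts)

    p_win_pooled    # 0.2
    p_win_row_mean  # 0.3125
    ```

    Pooling weights every attempt equally, so a player who tried a level many times on one day contributes more evidence than a player who tried it once.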

    Now, let's compute the difficulty p_win separately for each of the 15 levels.

    # Calculating level difficulty
    difficulty <- data %>%
        group_by(level) %>%
        summarise(attempts = sum(num_attempts), wins = sum(num_success)) %>%
        mutate(p_win = wins / attempts)
    
    
    # Printing out the level difficulty
    difficulty

    5. Plotting difficulty profile

    Great! We now have the difficulty for all 15 levels in the episode. Keep in mind that, as we measure difficulty as the probability of passing a level in a single attempt, a lower value (a smaller probability of winning the level) implies a higher level difficulty.
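
    One way to make these probabilities more tangible is to convert them into an expected number of attempts. This is a sketch under an extra assumption that is not stated in the project: if attempts are independent and every attempt has the same win probability, the number of attempts up to and including the first win follows a geometric distribution with mean 1 / p_win. The probabilities below are made up for illustration.

    ```r
    # Hypothetical single-attempt win probabilities for three levels
    p_win <- c(0.5, 0.2, 0.05)

    # Mean of a geometric distribution (counting the winning attempt): 1 / p
    expected_attempts <- 1 / p_win

    expected_attempts  # 2 5 20
    ```

    Real players improve with practice and may use boosters, so this figure is best read as a rough guide rather than a prediction.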

    Now that we have the difficulty of the episode, we should plot it. Let's plot a line graph with the levels on the X-axis and the difficulty (p_win) on the Y-axis. We call this plot the difficulty profile of the episode.

    # Plotting the level difficulty profile
    difficulty %>%
      ggplot(aes(x = level, y = p_win)) + 
        geom_line() + 
        scale_x_continuous(breaks = 1:15) +
        scale_y_continuous(labels = scales::percent)

    6. Spotting hard levels

    What constitutes a hard level is subjective. However, to keep things simple, we could define a threshold of difficulty, say 10%, and label levels with p_win < 10% as hard. It's relatively easy to spot these hard levels on the plot, but we can make the plot easier to read by explicitly highlighting the hard levels.

    # Adding points and a dashed line
    difficulty %>%
      ggplot(aes(x = level, y = p_win)) + 
        geom_line() + geom_point() +
        scale_x_continuous(breaks = 1:15) +
        scale_y_continuous(labels = scales::percent) +
        geom_hline(yintercept = 0.1, linetype = 'dashed')
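
    The hard levels can also be listed in a table instead of being read off the plot. The sketch below is an illustration only: toy_difficulty and hard_threshold are hypothetical names, and the values are made up rather than taken from the real data.

    ```r
    library(dplyr)

    # Hypothetical difficulty table with the same level and p_win columns
    toy_difficulty <- data.frame(
      level = 1:5,
      p_win = c(0.45, 0.08, 0.22, 0.05, 0.10)
    )

    # Threshold from the text: levels with p_win below 10% count as hard
    hard_threshold <- 0.10

    # Flag each level, then keep only the hard ones
    hard_levels <- toy_difficulty %>%
      mutate(is_hard = p_win < hard_threshold) %>%
      filter(is_hard)

    hard_levels$level  # 2 4
    ```

    The comparison is strict, so a level sitting exactly at 10% (level 5 in the toy table) is not flagged.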

    7. Computing uncertainty

    As data scientists, we should always report some measure of the uncertainty of any numbers we provide. After all, another sample collected tomorrow might give us slightly different values for the difficulties. Here we will simply use the standard error as a measure of uncertainty, where n is the number of attempts on the level:

    \sigma_{error} \approx \frac{\sigma_{sample}}{\sqrt{n}}

    \sigma_{sample} = \sqrt{p_{win} (1 - p_{win})}

    \sigma_{error} \approx \sqrt{\frac{p_{win}(1 - p_{win})}{n}}

    # Computing the standard error of p_win for each level
    difficulty <- difficulty %>%
        mutate(error = sqrt(p_win * (1 - p_win) / attempts))
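
    To get a feel for the formula, the sketch below plugs in the earlier example of 2 wins in 10 attempts and then repeats the calculation for the same win rate with 1000 attempts. Both inputs are illustrative numbers, not values from the dataset, and standard_error is a hypothetical helper name.

    ```r
    # Standard error of an estimated win probability p based on n attempts
    standard_error <- function(p, n) {
      sqrt(p * (1 - p) / n)
    }

    # Same win rate (20%), different amounts of data
    se_small <- standard_error(0.2, 10)
    se_large <- standard_error(0.2, 1000)

    round(se_small, 4)  # 0.1265
    round(se_large, 4)  # 0.0126
    ```

    With 100 times as many attempts, the standard error shrinks by a factor of 10, which is the square-root dependence on n in the formula.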

    8. Showing uncertainty

    Now that we have a measure of uncertainty for each level's difficulty estimate, let's use error bars to show this uncertainty in the plot. We will set the length of the error bars to one standard error. The upper limit and the lower limit of each error bar should then be p_win + σ_error and p_win - σ_error, respectively.