
A Brief History of Session Management

Learn more about how DataCamp provides sessions that are as smooth, fast and robust as if you were working on your own system.

From DataCamp's start, students could experiment with real code right from inside their browser. Apart from a standard web browser, you don't need to install anything. In addition, the sessions are stateful: if you go to the learning interface and submit an R script that creates two vectors:

a <- 1:5
b <- 5:1

You can experiment with the data in the console afterwards:

> a + b
[1] 6 6 6 6 6

In stateless systems, every submission would be executed from a fresh start, and the variable a wouldn't be defined in your next interaction with the server:

> a + b
Error: object 'a' not found

Providing sessions that are as smooth, fast and robust as if you were working on your own system has been an ongoing challenge at DataCamp. This blog post explains the different steps we went through to arrive at where we are right now.

When DataCamp started, it was an R-only platform, and R sessions were established with Rserve, an open-source project that lets you connect over websockets to a server that has R installed. It worked, but there were several issues.

First, it used websockets, something not every legacy browser supported, which is especially relevant as DataCamp is also used in the world of corporate training. Second, the websockets relied on a very stable Internet connection, which is still not a given today. Third, making sure that users didn't corrupt the server was a hassle, because the R process wasn't sandboxed. We needed AppArmor (and the R wrapper around it, RAppArmor) to limit access to system files and make sure that nobody used up too much CPU or RAM. Finally, maintenance was a pain: several Rserve instances were running, and configuring Rserve or updating an R package meant connecting to every Rserve instance. There are certainly tools for this, but when you're experimenting with a new feature on one server, it can be hard to trace back the exact steps to replicate that behavior on other servers. In addition, you had to make a snapshot of the instance so that your latest changes would also be available whenever your cluster automatically scaled up.

All in all, after iterating and tweaking the entire stack, Rserve did the job just fine throughout DataCamp's first years. But that was about to change.

Around August 2015, DataCamp's growing course offering comprised over 20 courses covering a wide range of topics, but it was still an R-only platform. We offered courses on statistics, data manipulation, data visualization, reporting, you name it, but only in R. It was clear from the start, however, that after gaining a foothold in the R world, DataCamp would have to expand its horizon to other programming languages. Python was the logical next step: it's an open-source, general-purpose scripting language with a huge ecosystem for data science. Packages such as pandas and scikit-learn, and beefed-up interactive shells such as IPython, have made Python a great tool for analytics.

DataCamp had to go Python, but how? There are Rserve-like projects out there that allow you to connect to a Python server. Some of these projects even support stateful code execution: a hard requirement, as we want to keep the experience highly interactive, mimicking your everyday job as a data scientist. However, maintenance would only increase: there would now be two clusters of servers that every browser should connect to in a stable, secure and maintainable way that also scales effortlessly. What if DataCamp were to move into the wonderful world of SQL, Spark, Hadoop, Julia or the like? Build a new XServe solution for every one of these? That didn't seem like a good approach going forward. Luckily, we discovered the new kid on the block: Docker.

Docker is a fairly recent containerization framework that builds upon resource isolation features of the Linux kernel. Docker makes it easy to run lightweight virtual machines on any host machine, be it Linux, Mac or Windows. Basically, you write a Dockerfile: a simple text file with a complete specification of what you want your virtual environment to contain. Next, you build an image from this Dockerfile. Finally, you spin up this image, giving you a Docker container. Docker excels at being lightweight (you can have an Ubuntu container running from an image that is only 127 MB), fast (you can start a container in less than 2 seconds) and secure (containers are sandboxed, so it's hard to corrupt the host machine). Most importantly, though, because the Dockerfile is the single source of truth for the Docker image and the containers resulting from it, Docker is a dream to maintain.
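The Dockerfile-image-container workflow maps directly onto standard Docker commands. As a sketch, with illustrative names (the Dockerfile contents and the repl-r image name here are ours for demonstration, not DataCamp's actual setup):

# A minimal Dockerfile: Ubuntu plus base R (contents are illustrative)
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y r-base

# Build an image called repl-r from the Dockerfile in the current directory
docker build -t repl-r .

# Spin up a container from that image and drop into an R session
docker run -it repl-r R --no-save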

At DataCamp, this setup translates to spinning up a Docker container for every user. This container runs either R or Python, depending on the programming language the user needs. This approach is clearly different from the "XServe approach", and we get several things for free. Sandboxing provides a separate working environment for every user, and those environments don't interfere with each other. In addition, scalability and maintainability become much easier: updating the Dockerfile and distributing the updated image does the trick.

To orchestrate all of this goodness, we developed a NodeJS service that handles requests coming from the learning interface on the client side. Whenever a user requests a new session, the service spins up a Docker container from the right language's image, and when the user wants to execute code, the service writes the commands to the REPL, listens for the output, and sends back the results.

In simplified form, this piece of JavaScript handles creating a Docker container, executing a command and reading the output:

// child_process ships with NodeJS
var child = require('child_process');

// create a REPL: run R inside a container started from the repl-r image,
// keeping stdin open (-i) so we can pipe commands into it
var repl = child.spawn('docker', ['run', '-i', 'repl-r', 'R', '--no-save']);

// write a command to the REPL (the trailing newline triggers evaluation)
repl.stdin.write('x <- 5; print(x)\n');

// get output from the REPL
repl.stdout.on('data', (data) => {
  console.log(data.toString());
});

From an architectural point of view, this new approach is a great improvement compared to the previous setup. If you want to support a new language, you basically write a new Dockerfile, add some configs to the NodeJS service and you're up and running.

This new system worked smoothly for over a year, as our user base grew from 250,000 to 600,000 and our course library grew to over 50 interactive courses in two languages on a wide range of topics. All R courses used the same R image, and all Python courses used the same Python image. Another problem started to emerge, though: package management. One course needed 7 packages related to data visualization, while another course needed 9 packages for data modeling. However, the data visualization course didn't need any of those modeling packages. The images were getting bloated with all sorts of package installs, only a fraction of which were actually required for any specific course. In addition, updating a package had its risks: other content on the platform could depend on a specific package version, so updating a package to support one course could end up breaking another one.

To solve this problem, we decided to take Dockerization to the next level: every course now has its own Docker image. The Dockerfile for such a course image looks something like this:

FROM basic-r:123

RUN R -e "install.packages(c('tutorial', 'RDocumentation'))"

With the FROM instruction, you specify which image you want to start from. To provide a good basis, we created a minimal Docker image for each language that contains R or Python on an Ubuntu image, and little more. That minimal base image can then be extended with course-specific package installs. In the example above, we're installing the R packages tutorial and RDocumentation, two packages on CRAN that were developed by DataCamp. This way, we keep the course image as minimal as possible: you start from the basic necessities and install exactly the packages the course needs, and nothing else.
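Building and distributing such a course image is then just the standard Docker workflow. A sketch, with an illustrative image name and tag (the actual naming and registry setup at DataCamp may differ):

# build the course image from its Dockerfile
docker build -t course-intro-to-r:1 .

# push it to a registry so that session servers can pull it
docker push course-intro-to-r:1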

Putting the course-specific images to work gave us a much better separation of concerns. To put it in numbers: before, our flagship Introduction to R course had access to more than 4000 R packages:

> nrow(installed.packages())
[1] 4457

Now, with the more minimal course image, the number of packages has been reduced by a factor of 80+:

> nrow(installed.packages())
[1] 53

Let's go through our list of requirements for session management once more.

Secure? Yep, still using Docker!

Scalable? Yes! Whenever a new session server spins up, it pulls all the necessary images and is ready for action.

Maintainable? Definitely: you have full control over which versions of which packages are available for which courses. It's perfectly possible for one Python course to run Jinja version 2.5 while another Python course runs Jinja version 2.8.
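As a sketch of that last point (the basic-python image name and tag are illustrative, in the spirit of the basic-r example above), two course Dockerfiles can simply pin different versions side by side:

# Course A's Dockerfile (hypothetical)
FROM basic-python:45
RUN pip install Jinja2==2.5

# Course B's Dockerfile (hypothetical)
FROM basic-python:45
RUN pip install Jinja2==2.8

Because each course builds its own image from the shared base, the two version pins never interact.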

Providing a highly interactive educational platform that looks and feels like working on your local computer has been an ongoing challenge at DataCamp, but it's definitely worth it. Every solution we've implemented has brought significant improvements over the previous one, and no doubt new issues will arise with the current system. That's an inherent part of writing software in a fast-moving startup: we have to keep iterating, improving and innovating to make our environment scale and to optimize both maintainability and user experience.