
Version Control For Data Science

Discover how to overcome the steep learning curve of version control for data science, while also taking into account best practices and recommendations.
Dec 2017  · 8 min read

Keeping track of changes that you or your collaborators make to data and software is a critical part of any project, whether it's research, data science or software engineering. Being able to reference or retrieve a specific version of the entire project aids in reproducibility for you leading up to publication, when responding to reviewer comments, and when providing supporting information for reviewers, editors, and readers.

The best tools for tracking changes are the version control systems that are used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file, when, and by whom, and synchronize changes to a central server so that multiple contributors can manage changes to the same set of files.

While these tools make tracking changes easier, they can have a steep learning curve. To overcome it, this post offers two sets of recommendations: a systematic manual approach for managing changes, and version control in its full glory. You can use the first while working towards the second, or simply jump straight into version control.

Best Practices for Version Control

Whichever approach you end up choosing, there are some general best practices that you should take into account:

  1. Back up (almost) everything created by a human being as soon as it is created. This includes scripts and programs of all kinds, software packages that your project depends on, and documentation. A few exceptions to this rule are discussed below.

  2. Keep changes small. Each change should not be so large as to make the change tracking irrelevant. For example, a single change such as "Revise script file" that adds or changes several hundred lines is likely too large, as it will not allow changes to different components of an analysis to be investigated separately. Similarly, changes should not be broken up into pieces that are too small. As a rule of thumb, a good size for a single change is a group of edits that you could imagine wanting to undo in one step at some point in the future (see the Git sketch after this list).

  3. Share changes frequently. Everyone working on the project should share and incorporate changes from others on a regular basis. Do not allow individual investigators' versions of the project repository to drift apart, as the effort required to merge differences goes up faster than the size of the difference. This is particularly important for the manual versioning procedure described below, which does not provide any assistance for merging simultaneous, possibly conflicting, changes.

  4. Create, maintain, and use a checklist for saving and sharing changes to the project. The list should include writing log messages that clearly explain any changes, the size and content of individual changes, style guidelines for code, updating to-do lists, and bans on committing half-done work or broken code.

  5. Store each project in a folder that is mirrored off the researcher's working machine using a system such as Dropbox or a remote repository such as GitHub. Synchronize that folder at least daily. It may take a few minutes, but that time is repaid the moment a laptop is stolen or its hard drive fails.
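
To make points 2 to 4 concrete, a day-to-day cycle with Git might look like the sketch below. The file name and commit message are placeholders, not part of any particular project; the point is that each commit records one small, clearly described change and that changes are shared frequently.

git pull                              # first incorporate your collaborators' changes
git add clean_data.py                 # stage one small, related set of edits
git commit -m "Handle missing values in age column"
git push                              # share the change the same day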

How to Approach Manual Versioning

The first suggested approach, in which everything is done by hand, has two additional parts.

First, add a file called CHANGELOG.txt to the project's docs subfolder, and make dated notes about changes to the project in this file in reverse chronological order (that is, most recent first). This file is the equivalent of a lab notebook, and should contain entries like those shown below.

## 2016-04-08

* Switched to cubic interpolation as default.
* Moved question about family's TB history to end of questionnaire.

## 2016-04-06

* Added option for cubic interpolation.
* Removed question about staph exposure (can be inferred from blood test results).

Second, copy the entire project whenever a significant change has been made (that is, one that materially affects the results), and store that copy in the synchronized area, in a sub-folder whose name reflects the date. This approach results in projects being organized as shown below:

.
|-- project_name
|   -- current
|       -- ...project content as described earlier...
|   -- 2016-03-01
|       -- ...content of 'current' on Mar 1, 2016
|   -- 2016-02-19
|       -- ...content of 'current' on Feb 19, 2016

Here, the project_name folder is mapped to external storage (such as Dropbox), current is where work is being done, and other folders within project_name are old versions.
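
The dated snapshots themselves can be made with a single copy command. The sketch below assumes you are inside the project_name folder on a Unix-like system:

cp -r current "$(date +%Y-%m-%d)"     # e.g. copies current/ into 2016-04-08/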

Pros and Cons of Manual Versioning

You'll often hear "Data is cheap, time is expensive". Copying everything as the approach above suggests may seem wasteful, since many files won't have changed, but consider: a terabyte hard drive costs about $50 retail, which means that 50 GB of storage costs less than $5. Provided large data files are kept out of the backed-up area (discussed in more detail below), this approach costs less than the time it would take to select files by hand for copying.

This manual procedure satisfies the requirements outlined above without needing any new tools. If multiple researchers are working on the same project, though, they will need to coordinate so that only a single person is working on specific files at any time. In particular, they may wish to create one changelog file per contributor, and to merge those files whenever a backup copy is made.

Version Control Systems

What the manual process described above requires most is self-discipline. The tools that underpin our second approach, the one we use in our own projects, don't just accelerate the manual process: they also automate some steps while enforcing others, and thereby require less self-discipline for more reliable results.

It's hard to know what tool is most widely used in research today, but the one that's most talked about is undoubtedly Git. This is largely because of GitHub, a popular hosting site that combines the technical infrastructure for collaboration via Git with a modern web interface. GitHub is free for public and open source projects and for users in academia and nonprofits. GitLab is a well-regarded alternative that some prefer, because the GitLab platform itself is free and open source. Bitbucket provides free hosting for both Git and Mercurial repositories, but does not have nearly as many scientific users.
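
If you want to see what version control in its full glory looks like in practice, a minimal workflow for putting an existing project folder under Git and publishing it to a hosting site such as GitHub might look like the sketch below; the repository URL is a placeholder, not a real project.

git init                                   # start tracking the project folder
git add .                                  # stage its current contents
git commit -m "Initial version of project"
git remote add origin https://github.com/your-username/project_name.git
git push -u origin master                  # publish the history to the remote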

What Not to Put Under Version Control

File Sizes and Formats

The benefits of version control systems don't apply equally to all file types. In particular, these systems can be more or less rewarding depending on file size and format.

  • First, file comparison in version control systems is optimized for plain text files, such as source code. Usually, the ability to see so-called "diffs" is one of the great joys of version control. Unfortunately, while Microsoft Office files (like the .docx files used by Word) or other binary files, such as PDFs, can be stored in a version control system, it is not possible to pinpoint specific changes from one version to the next. Tabular data, such as CSV files, can also be put in such a system, but changing the order of the rows or columns will create a big change, even if the data itself has not changed.

  • Second, raw data should not change, and therefore should not require version tracking. Keeping intermediate data files and other results under version control is also not necessary if you can regenerate them from raw data and software. However, if data and results are small, it's recommended to version them for ease of access by collaborators and for comparison across versions.

  • Third, today's version control systems are not designed to handle megabyte-sized files, never mind gigabytes, so large data or results files should not be included; one way to keep them out is shown in the sketch after this list. (As a benchmark for "large", the limit for an individual file on GitHub is 100 MB.) Some emerging hybrid systems such as Git LFS put textual notes under version control, while storing the large data itself on a remote server, but these are not yet mature enough for us to recommend.
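
A common way to keep such files out of a Git repository is a .gitignore file in the project root. The entries below are illustrative only; adapt the paths to your own project layout:

# .gitignore (illustrative entries)
data/raw/            # raw data: backed up elsewhere, never versioned
results/             # intermediate results that can be regenerated
*.docx               # binary documents that do not diff well
*.pdf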

Inadvertent Sharing

Another case in which the benefits of version control systems don't really play to your advantage is the case of "inadvertent sharing". Researchers dealing with data subject to legal restrictions that prohibit sharing (such as medical data) should be careful not to put data in public version control systems. Some institutions may provide access to private systems, so it is worth checking with your IT department.

Additionally, be sure not to unintentionally place security credentials, such as passwords and private keys, in a version control system, where they may be accessed by others.
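
The same .gitignore mechanism helps here: keep credentials in a file that is never committed, and stop tracking any secret that has already slipped in. The file name below is a placeholder, and note that removing a file from tracking does not erase it from earlier commits.

echo ".env" >> .gitignore          # keep the credentials file out of the repository
git rm --cached .env               # stop tracking it if it was already committed
git commit -m "Stop tracking credentials file"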

If you'd like to give all of this a try, please check out our free introduction to Git for Data Science.

Acknowledgments

This post is taken from "Good enough practices in scientific computing" by Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal, https://doi.org/10.1371/journal.pcbi.1005510.
