
Version Control For Data Science

Discover how to overcome the steep learning curve of version control for data science, while also taking into account best practices and recommendations.
Dec 2017  · 8 min read

Keeping track of changes that you or your collaborators make to data and software is a critical part of any project, whether it's research, data science or software engineering. Being able to reference or retrieve a specific version of the entire project aids in reproducibility for you leading up to publication, when responding to reviewer comments, and when providing supporting information for reviewers, editors, and readers.

The best tools for tracking changes are the version control systems that are used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file when and by whom, and synchronize changes to a central server so that multiple contributors can manage changes to the same set of files.

While these tools make tracking changes easier, they can have a steep learning curve. To overcome it, this post offers two sets of recommendations: a systematic manual approach for managing changes, and version control in its full glory. You can use the first while working towards the second, or jump straight into version control.

Best Practices for Version Control

Whichever approach you end up choosing, there are some general best practices to take into account:

  1. Back up (almost) everything created by a human being as soon as it is created. This includes scripts and programs of all kinds, software packages that your project depends on, and documentation. A few exceptions to this rule are discussed below.

  2. Keep changes small. Each change should not be so large as to make the change tracking irrelevant. For example, a single change such as "Revise script file" that adds or changes several hundred lines is likely too large, as it will not allow changes to different components of an analysis to be investigated separately. Similarly, changes should not be broken up into pieces that are too small. As a rule of thumb, a good size for a single change is a group of edits that you could imagine wanting to undo in one step at some point in the future.

  3. Share changes frequently. Everyone working on the project should share and incorporate changes from others on a regular basis. Do not allow individual investigators' versions of the project repository to drift apart, as the effort required to merge differences goes up faster than the size of the difference. This is particularly important for the manual versioning procedure described below, which does not provide any assistance for merging simultaneous, possibly conflicting, changes.

  4. Create, maintain, and use a checklist for saving and sharing changes to the project. The list should include writing log messages that clearly explain any changes, the size and content of individual changes, style guidelines for code, updating to-do lists, and bans on committing half-done work or broken code. A sketch of what such a routine might look like follows this list.

  5. Store each project in a folder that is mirrored off the researcher's working machine using a system such as Dropbox or a remote repository such as GitHub. Synchronize that folder at least daily. It may take a few minutes, but that time is repaid the moment a laptop is stolen or its hard drive fails.
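To make that checklist concrete, here is a minimal sketch of a "save and share" routine using Git, assuming the project is already a repository with a shared remote; the file names and commit message are purely illustrative.

# Incorporate collaborators' changes before committing your own
git pull
# Review what has actually changed in your working copy
git status
# Stage one small, related group of edits (file names are illustrative)
git add scripts/clean_data.py docs/analysis_notes.md
# Write a log message that clearly explains the change
git commit -m "Switch default interpolation to cubic"
# Share the change while it is still small
git push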

How to Approach Manual Versioning

The first suggested approach, in which everything is done by hand, has two additional parts.

First, add a file called CHANGELOG.txt to the project's docs subfolder, and make dated notes about changes to the project in this file in reverse chronological order (that is, most recent first). This file is the equivalent of a lab notebook, and should contain entries like those shown below.

## 2016-04-08

* Switched to cubic interpolation as default.
* Moved question about family's TB history to end of questionnaire.

## 2016-04-06

* Added option for cubic interpolation.
* Removed question about staph exposure (can be inferred from blood test results).

Second, copy the entire project whenever a significant change has been made (that is, one that materially affects the results), and store that copy in the synchronized area, in a sub-folder whose name reflects the date. This approach results in projects being organized as shown below:

.
|-- project_name
|   |-- current
|   |   |-- ...project content as described earlier...
|   |-- 2016-03-01
|   |   |-- ...content of 'current' on Mar 1, 2016
|   |-- 2016-02-19
|   |   |-- ...content of 'current' on Feb 19, 2016

Here, the project_name folder is mapped to external storage (such as Dropbox), current is where work is being done, and other folders within project_name are old versions.
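If you prefer to script this snapshot step rather than copy folders by hand, a single shell command is enough; this sketch assumes the folder layout shown above.

# Copy 'current' into a dated sub-folder of the project
# (run from the folder that contains project_name)
cd project_name
cp -r current "$(date +%Y-%m-%d)"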

Pros and Cons of Manual Versioning

You'll often hear "Data is Cheap, Time is Expensive". Copying everything in the way the approach above suggests may seem wasteful, since many files won't have changed, but consider: a terabyte hard drive costs about $50 retail, which means that 50 GByte costs less than $5. Provided large data files are kept out of the backed-up area (this will be discussed in further detail below), this approach costs less than the time it would take to select files by hand for copying.

This manual procedure satisfies the requirements outlined above without needing any new tools. If multiple researchers are working on the same project, though, they will need to coordinate so that only a single person is working on specific files at any time. In particular, they may wish to create one changelog file per contributor, and to merge those files whenever a backup copy is made.

Version Control Systems

What the manual process described above requires most is self-discipline. The tools that underpin our second approach (the one we use in our own projects) don't just accelerate the manual process: they also automate some steps while enforcing others, and thereby require less self-discipline for more reliable results.

It's hard to know what tool is most widely used in research today, but the one that's most talked about is undoubtedly Git. This is largely because of GitHub, a popular hosting site that combines the technical infrastructure for collaboration via Git with a modern web interface. GitHub is free for public and open source projects and for users in academia and nonprofits. GitLab is a well-regarded alternative that some prefer, because the GitLab platform itself is free and open source. Bitbucket provides free hosting for both Git and Mercurial repositories, but does not have nearly as many scientific users.
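As a first taste of that workflow, here is a minimal sketch of putting an existing project under Git and connecting it to a hosting service such as GitHub; the URL below is a placeholder, and the paths and branch name may differ on your setup.

# Turn an existing project folder into a Git repository
cd project_name
git init
# Add files selectively; large data files are best left out (see the next section)
git add README.md scripts/ docs/
git commit -m "Initial commit of project files"
# Connect the repository to a hosting service (placeholder URL) and push
git remote add origin https://github.com/your-username/project_name.git
git push -u origin main   # the default branch may be called 'master' on older setups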

What Not to Put Under Version Control

File Sizes and Formats

The benefits of version control systems don't apply equally to all file types. In particular, these systems can be more or less rewarding depending on file size and format.

  • First, file comparison in version control systems is optimized for plain text files, such as source code. The ability to see these so-called "diffs" is one of the great joys of version control. Unfortunately, while Microsoft Office files (like the .docx files used by Word) or other binary files, such as PDFs, can be stored in a version control system, it is not possible to pinpoint specific changes from one version to the next. Tabular data, such as CSV files, can also be put in such a system, but changing the order of the rows or columns will create a large diff, even if the data itself has not changed.

  • Second, raw data should not change, and therefore should not require version tracking. Keeping intermediate data files and other results under version control is also not necessary if you can regenerate them from raw data and software. However, if data and results are small, it's recommended to version them for ease of access by collaborators and for comparison across versions.

  • Third, today's version control systems are not designed to handle megabyte-sized files, never mind gigabytes, so large data or results files should not be included. (As a benchmark for "large", the limit for an individual file on GitHub is 100 MB.) Some emerging hybrid systems such as Git LFS put textual notes under version control while storing the large data itself on a remote server, but these are not yet mature enough for us to recommend.
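One common way to keep large or regenerable files out of a Git repository is a .gitignore file in the project root; the folder and file names below are only an assumption about how a project might be laid out.

# Append illustrative patterns for large or regenerable files to .gitignore
cat >> .gitignore <<'EOF'
data/raw/
data/intermediate/
results/*.hdf5
*.zip
EOF
# The ignore rules themselves are small text files and belong under version control
git add .gitignore
git commit -m "Ignore large data and results files"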

Inadvertent Sharing

The benefits of version control systems also don't apply in the case of "inadvertent sharing". Researchers dealing with data subject to legal restrictions that prohibit sharing (such as medical data) should be careful not to put data in public version control systems. Some institutions may provide access to private systems, so it is worth checking with your IT department.

Additionally, be sure not to unintentionally place security credentials, such as passwords and private keys, in a version control system where they may be accessed by others.
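A simple safeguard, sketched below under the assumption that credentials live in a file named .env, is to ignore that file and to stop tracking it if it was ever added by mistake; anything already pushed remains in the repository's history and should be treated as compromised.

# Keep a credentials file (assumed here to be named .env) out of the repository
echo ".env" >> .gitignore
# If it was accidentally added, stop tracking it while keeping the local copy;
# credentials already pushed remain in the history and should be rotated
git rm --cached .env
git commit -m "Stop tracking credentials file"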

If you'd like to give all of this a try, please check out our free introduction to Git for Data Science.

Acknowledgments

This post is taken from "Good enough practices in scientific computing" by Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal, https://doi.org/10.1371/journal.pcbi.1005510.
