Git is a widely used version control tool for managing software development projects, data science workflows, and even documentation repositories. However, standard Git handles large files poorly. Each modified version of a large file is stored in the repository history, which leads to repository bloat and slower workflows because all of those versions must be transferred when cloning or fetching. Managing large files directly in Git becomes inefficient for the following reasons:
- Large repository size: Storing large files directly in a Git repository increases its overall size, making cloning and fetching operations slow, especially when working on a remote repository that may require uploading/downloading files.
- Inefficient storage and versioning: Every time a large file is modified, Git stores a new version in its history, leading to rapid repository bloat.
- Performance issues: With a large repository, Git operations (e.g., cloning, pulling, pushing) become significantly slower and require more storage.
In this guide, we will take an in-depth look at Git Large File Storage (Git LFS), a Git extension for handling large files in your repository. It lets us store changes to large files more efficiently, without carrying every version of them in every clone. You should use it whenever you expect to have large binary files in your repository. I’ll explain in more detail what it is, how it works, when to use it, and how to set it up effectively.
If you’re new to Git, check out our guide on How to Learn Git and Introduction to Git Course.
What is Git LFS?
Git Large File Storage (Git LFS) is an extension for Git that improves the handling of large files.
It changes how Git handles fetching and cloning by lazily fetching large-file data from a remote store, and it adds some file management functionality on top of Git.
In every other way, the user experience is seamless and the same as normal Git.
How Git LFS differs from standard Git tracking
In traditional Git, we store the entirety of the repository’s history in the .git directory. This includes every version of each file that has changed over time. Additionally, the entire repository is downloaded when we call git fetch or git clone so that we have all files readily available. When we use Git LFS, it changes a few things.
First, Git LFS replaces large files with lightweight pointers to remote storage, so the repository itself no longer has to hold every version of every large file. Second, large files are downloaded only when we check out a branch that needs them, meaning we only download large files when we are ready to work on them.
Lastly, it provides tools for managing your local storage, such as git lfs prune, which removes old versions of files from the local cache to maintain a clean working environment.
How Does Git LFS Work?
Git LFS follows a pointer-based storage mechanism. As mentioned before, when you specify a file for LFS tracking, Git LFS replaces it with a small pointer file in the repository and keeps the real content in a local cache.
When you push commits, the cached objects are uploaded to a remote LFS store (e.g., GitHub, GitLab, or Bitbucket LFS servers). When you check out a commit that references different versions of those files, Git LFS downloads them into your local cache and gives you the most recent working copy.
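To make the pointer mechanism concrete, here is a minimal sketch that builds the text of a pointer file by hand. Git LFS does this for you; the helper name make_pointer and the demo file are hypothetical, and the sketch assumes the standard pointer layout of a version line, a SHA-256 oid line, and a size line.

```shell
#!/usr/bin/env bash
# Sketch only: print the text of a Git LFS pointer for a given file.
# make_pointer is a hypothetical helper, not part of Git LFS.
set -eu

make_pointer() {
  local file="$1"
  local oid size
  # SHA-256 of the file contents (sha256sum on Linux, shasum on macOS).
  if command -v sha256sum >/dev/null 2>&1; then
    oid=$(sha256sum "$file" | awk '{print $1}')
  else
    oid=$(shasum -a 256 "$file" | awk '{print $1}')
  fi
  # Size of the real file in bytes.
  size=$(wc -c < "$file" | tr -d ' ')
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}

# Hypothetical demo file standing in for a large binary.
demo_dir=$(mktemp -d)
printf 'hello\n' > "$demo_dir/largefile.csv"
make_pointer "$demo_dir/largefile.csv"
```

The three printed lines are all that Git itself versions for an LFS-tracked file; the real content sits in the local cache under .git/lfs/objects and on the remote LFS store.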
If you want insight into one of the most popular remote repositories, GitHub, check out this Introduction to GitHub course.
There are many benefits to using Git LFS. It keeps repository sizes small and manageable, and it improves performance for teams working with large files. By avoiding unnecessary duplication of large files, it also keeps your local repository size under control.
Setting Up Git LFS
Before using Git LFS, there are a few steps to get set up. It is a relatively simple process that involves installing the extension. If you have existing repositories, you will need to migrate them over. If you are making a new repository, you can simply initialize Git LFS.
Installing Git LFS
To use Git LFS, you must install it on your system. The simplest way is to go to the git-lfs website and download the installer for your platform. Once you have done so, run git lfs install one time in your command console to finish initializing Git LFS.
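One quick way to confirm the installation is to ask Git for the extension's version. The sketch below wraps that check in a small function; lfs_status is a hypothetical helper name, and the only Git LFS command it relies on is git lfs version.

```shell
#!/usr/bin/env bash
# Sketch only: report whether the git-lfs extension is reachable from Git.
# lfs_status is a hypothetical helper name.
lfs_status() {
  if git lfs version >/dev/null 2>&1; then
    echo "installed"
  else
    echo "missing"
  fi
}

if [ "$(lfs_status)" = "installed" ]; then
  git lfs version   # prints the installed Git LFS version string
else
  echo "Git LFS not found: install it, then run 'git lfs install' once."
fi
```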
Utilizing Git LFS
Once installed, you will need to make sure Git LFS is configured for each repository. Different Git hosting services (e.g., GitHub vs. Bitbucket) have slightly different steps, so it is best to follow the recommended steps for your particular host. At a high level, if you initialize a new repository, you can run git lfs install in that repository to initialize the hooks and then add files to track. If you have an existing repository, you can use git lfs migrate and then follow the .git directory cleanup procedures:
git reflog expire --expire-unreachable=now --all
git gc --prune=now
Using Git LFS in a Project
Let's go through some of the usage of Git LFS in a project. We’ll cover the steps of tracking, committing, cloning, and pulling large files from a Git LFS repository.
Tracking Large Files (git lfs track)
To start tracking specific file types, we can use the git lfs track <pattern> command. For instance, if I wanted to track CSV files, I would write the command as: git lfs track "*.csv". This also adds an entry to our .gitattributes file so that Git uses Git LFS for files matching that pattern. We should commit and push the .gitattributes file first so that collaborators pick up the same tracking rules.
git add .gitattributes
git commit -m "Adding LFS .gitattributes"
git push origin main
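For reference, the entry that git lfs track "*.csv" writes is a single line of attributes. The sketch below reproduces that line by hand in a throwaway directory (demo_repo and data.csv are hypothetical names) and, if Git is available, asks Git which filter it would apply to a matching file.

```shell
#!/usr/bin/env bash
# Sketch only: write the .gitattributes entry for "*.csv" by hand.
# In a real project, `git lfs track "*.csv"` adds this line for you.
set -eu

demo_repo=$(mktemp -d)   # hypothetical throwaway directory
cd "$demo_repo"
printf '*.csv filter=lfs diff=lfs merge=lfs -text\n' >> .gitattributes
cat .gitattributes

# If Git is available, check which filter applies to a hypothetical file name.
if command -v git >/dev/null 2>&1; then
  git init -q .
  git check-attr filter -- data.csv
fi
```

The filter=lfs attribute is what routes matching files through Git LFS on add and checkout, while -text tells Git not to treat them as text for line-ending conversion.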
Adding and Committing Large Files
After adding tracking, you can add and commit the file as usual:
git add largefile.csv
git commit -m "Adding large file with Git LFS"
git push origin main
Behind the scenes, Git LFS stores the file separately and replaces it with a pointer in the repository. It should be a fairly seamless experience.
Cloning Repositories with Git LFS (git lfs clone)
If you have a recent version of Git LFS (2.3.0 or later), the plain git clone command should automatically work for both your LFS and non-LFS files, and the separate git lfs clone command is deprecated. On older setups, git lfs clone was the faster option: it clones the Git history first, reads the pointer information for large files, and then downloads only the versions needed for the working copy. For more information on cloning in Git, check the following tutorial on Git Clone Branch.
Pulling Large Files (git lfs pull)
If you would like to fetch and check out the large LFS files for your current branch, you can use the command git lfs pull. This is separate from a simple git pull, which updates the regular Git files and LFS pointers; git lfs pull makes sure the actual large-file content is downloaded as well.
Here is an example workflow:
git checkout main # check out your main branch
git pull # pull latest git files from the remote, for this branch
git lfs pull # pull latest git lfs files from the remote, for this branch
Best Practices for Using Git LFS
While Git LFS is a great solution, it comes with its own set of problems and is not designed to be a universal solution. Follow these best practices to make the most of it:
- Use Git LFS only for large binary files, not for code or small text files.
- Regularly prune unnecessary files using:
git lfs prune
- Avoid tracking entire directories, which can lead to performance issues and bloat.
- Ensure all collaborators have Git LFS installed to avoid missing files.
If you follow these general guidelines, then Git LFS will be a great tool for you!
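When a collaborator clones without Git LFS installed, tracked files show up as small pointer text files instead of real content. The sketch below is one way to spot that situation; is_lfs_pointer is a hypothetical helper, the demo files and the oid value are placeholders, and the check assumes pointer files begin with the standard Git LFS version line.

```shell
#!/usr/bin/env bash
# Sketch only: detect files that are still Git LFS pointers.
# is_lfs_pointer is a hypothetical helper, not a Git LFS command.
set -eu

is_lfs_pointer() {
  # The version line is 42 bytes long, so reading 42 bytes is enough.
  head -c 42 "$1" | grep -qF 'version https://git-lfs.github.com/spec/v1'
}

# Hypothetical demo files: one pointer (placeholder oid) and one real CSV.
demo_dir=$(mktemp -d)
printf 'version https://git-lfs.github.com/spec/v1\n' > "$demo_dir/pointer.csv"
printf 'oid sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef\n' >> "$demo_dir/pointer.csv"
printf 'size 6\n' >> "$demo_dir/pointer.csv"
printf 'id,value\n1,2\n' > "$demo_dir/real.csv"

for f in "$demo_dir"/*.csv; do
  if is_lfs_pointer "$f"; then
    echo "$(basename "$f"): pointer only, run 'git lfs pull'"
  else
    echo "$(basename "$f"): real content"
  fi
done
```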
Common Issues and How to Fix Them
As Git LFS is an extension of Git, it comes with its own host of problems. This is especially true since it relies on remote repositories which have their own issues. Here are some common problems you might run into when using Git LFS and their remedies.
Git LFS Authentication Problems
There are numerous ways in which you can run into authentication problems with Git LFS. To make sure you have the right credentials on your end, check the following:
- Use the correct SSH configuration with
git config lfs.url ssh://<git-url>
- Check that you have read/write access to the parent branch as Git LFS often tracks to the parent
- Try making a clean fetch/clone from the remote repository entirely
If you continue to run into issues and you are not the administrator, check with whoever owns the project to see if they can help you with permissions.
Large File Storage Limits and Quotas
Services can have quotas on Git LFS storage. For instance, GitHub limits individual LFS files to 2 GB for Free and Pro users and 5 GB for Enterprise Cloud users, in addition to account-level storage and bandwidth quotas. Make sure you are not uploading files larger than the limit of your service to avoid errors.
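To catch oversized files before a push fails, you can scan the working tree for anything above your host's per-file limit. This is a minimal sketch using find; find_large_files is a hypothetical helper, and the 1000-byte threshold is only for the demo, so substitute your service's actual limit in bytes.

```shell
#!/usr/bin/env bash
# Sketch only: list files larger than a byte threshold, skipping .git.
# find_large_files is a hypothetical helper name.
set -eu

find_large_files() {
  local dir="$1"
  local limit_bytes="$2"
  # -size +Nc matches files strictly larger than N bytes.
  find "$dir" -type f ! -path '*/.git/*' -size +"${limit_bytes}"c
}

# Hypothetical demo directory with one "large" and one small file.
demo_dir=$(mktemp -d)
head -c 2048 /dev/zero > "$demo_dir/big.bin"
printf 'small\n' > "$demo_dir/small.txt"

# Demo threshold of 1000 bytes; a real check would use the host's limit.
find_large_files "$demo_dir" 1000
```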
Migrating Existing Repositories to Use Git LFS
If a repository already contains large files, migrate them using:
git lfs migrate import --include="*.filetype"
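This retroactively replaces large files with LFS pointers by rewriting the repository history. Because commit hashes change, coordinate with collaborators before pushing the rewritten history.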
Conclusion
Git LFS is an essential tool for managing large files efficiently while keeping repositories performant and lightweight. By leveraging Git LFS, developers can properly manage their larger repository files. As long as you follow best practices to minimize bloat, LFS will be a great help to your file tracking.
Git LFS FAQs
What happens if I clone a repository with Git LFS files?
Running git clone with a recent Git LFS installation automatically fetches the repository, the LFS pointers, and the large files needed for the checked-out branch. If you are on an older setup, or you end up with only pointer files, run git lfs pull to download the actual large files. Alternatively, the older git lfs clone command does this in one step, although it is now deprecated.
How can I check which files are tracked by Git LFS?
Run git lfs track to see tracked file patterns, and git lfs ls-files to list the actual files stored using LFS.
What if I run out of Git LFS storage on GitHub or another hosting service?
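Services like GitHub impose storage limits. Note that git lfs prune only frees space in your local cache. To stay within a hosted quota, you can remove unneeded LFS objects on the hosting service, purchase more space, or use an alternative external storage solution.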
Can I stop using Git LFS after enabling it?
Yes, but you need to migrate your files back to standard Git using git lfs migrate export --include="*.filetype" to remove LFS tracking.