The number one issue I’ve seen when people start using Git is dealing with submodules in existing projects. Recently I’ve been considering moving everything to subtrees, but I don’t see that as a direct replacement. In this post I explain why.
Why use Submodules or Subtrees?
Every organisation has code that is shared between projects, and submodules and subtrees prevent us from duplicating code across those projects, avoiding the many problems that arise if we have multiple versions of the same code.
Subtrees vs Submodules
The simplest way to think of subtrees and submodules is that a subtree is a copy of a repository that is pulled into a parent repository while a submodule is a pointer to a specific commit in another repository.
This difference means that it is trivial to push updates back to a submodule, because we’re just pushing commits back to the original repository that is pointed to, but more complex to push updates back to a subtree, because the parent repository has no knowledge of the origin of the contents of the subtree.
It also means that subtrees are much easier for other people to come and pull, as they are just part of the parent repository.
So an ultra-dumbed-down ELI5 comparison of submodules to subtrees could be:
- Submodules are easier to push but harder to pull – This is because they are pointers to the original repository
- Subtrees are easier to pull but harder to push – This is because they are copies of the original repository
I will elaborate on this, so pardon the simplification.
A brief overview of git submodules
Adding a submodule
If I wanted to add a submodule to an existing git repository I’d run something like this:
    $ git submodule add https://github.com/mowen/awesomelib lib/awesomelib
    Cloning into 'lib/awesomelib'...
    remote: Counting objects: 11, done.
    remote: Compressing objects: 100% (10/10), done.
    remote: Total 11 (delta 0), reused 11 (delta 0)
    Unpacking objects: 100% (11/11), done.
    Checking connectivity... done.
If I then ran git status I’d see this:
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes to be committed:
      (use "git reset HEAD <file>..." to unstage)

        new file:   .gitmodules
        new file:   lib/awesomelib
A .gitmodules file has been created, and its contents will be:
    [submodule "lib/awesomelib"]
        path = lib/awesomelib
        url = https://github.com/mowen/awesomelib
So the three key consequences of the submodule add are:
- A .gitmodules file has been added in the root of the repository, containing the path and URL for the added submodule.
- The lib/awesomelib folder now contains a full clone of the https://github.com/mowen/awesomelib repository, with one key difference…
- The .git folder for the submodule repository has been placed inside the parent repository, at .git/modules/lib/awesomelib. The location lib/awesomelib/.git contains a file with a single line, gitdir: ../../.git/modules/lib/awesomelib, pointing to the real .git folder (the nested repository’s alternative to a full-blown .git folder).
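A minimal sketch for checking these three consequences yourself. It assumes a throwaway local repository standing in for https://github.com/mowen/awesomelib; the temporary directory, the README file and the demo identity are made up for the illustration. The protocol.file.allow=always setting is only there because recent Git versions refuse to clone submodules from local paths by default, and would not be needed with a real URL.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical local stand-in for the awesomelib repository.
git init -q awesomelib
echo "hello" > awesomelib/README
git -C awesomelib add README
git -C awesomelib commit -q -m "Initial commit"

# Parent repository with the submodule added under lib/awesomelib.
git init -q parentrepo
cd parentrepo
git commit -q --allow-empty -m "Initial commit"
git -c protocol.file.allow=always submodule --quiet add "$work/awesomelib" lib/awesomelib

# 1. .gitmodules records the path and URL of the submodule.
cat .gitmodules

# 2. lib/awesomelib/.git is a one-line file, not a directory...
cat lib/awesomelib/.git

# 3. ...and the real repository data lives inside the parent's .git folder.
ls -d .git/modules/lib/awesomelib

# The parent's index stores the submodule as a single entry with mode 160000:
# a pointer to one commit rather than the files themselves.
entry=$(git ls-files --stage lib/awesomelib)
echo "$entry"
```

The mode 160000 entry is why the parent only ever sees the submodule as one line: it tracks a commit ID, not the contents of the folder.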
Both the advantage and the disadvantage of submodules is that they can and should be treated as repositories of their own. They need to be committed to separately, and can be branched separately. The lib/awesomelib directory in the example above should be treated as nothing more than a pointer to a particular SHA-1 in another repository.
You may already be able to see some of the issues that can occur if you ignore the fact that the submodule needs to be kept up to date:
- Changes to the parent could be committed and pushed without having committed and pushed the changes to the submodule.
- If a collaborator has modified and pushed changes to a submodule but you haven’t run git submodule update to update the submodule on your machine to their latest version, you may run git add -A and downgrade to your out-of-date version.
Pulling from a submodule
This is just a case of:
- Changing directory to the submodule repository
- Pulling from the remote
- Moving up again to the root of the parent repository
- Committing the pointer to the new HEAD commit of the submodule
If the submodule has moved on from the commit last recorded in the parent, it will be listed as modified, and the new pointer can be included in the next commit to the parent repository.
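The four steps above can be sketched as follows. This is an illustration under assumptions rather than a transcript: a local repository stands in for the awesomelib remote, the file contents and commit messages are invented, and protocol.file.allow=always is only needed because the demo clones the submodule from a local path.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical local stand-in for the awesomelib remote, on a branch named master.
git init -q awesomelib
git -C awesomelib symbolic-ref HEAD refs/heads/master
echo "v1" > awesomelib/README
git -C awesomelib add README
git -C awesomelib commit -q -m "v1"

# Parent repository that already uses the submodule.
git init -q parentrepo
cd parentrepo
git -c protocol.file.allow=always submodule --quiet add "$work/awesomelib" lib/awesomelib
git commit -q -m "Add awesomelib submodule"

# Meanwhile, a new commit is published in the library.
echo "v2" > "$work/awesomelib/README"
git -C "$work/awesomelib" commit -q -am "v2"

cd lib/awesomelib                 # 1. change directory to the submodule
git pull -q origin master         # 2. pull from the remote
cd ../..                          # 3. move back up to the root of the parent
git status --short                #    the submodule shows up as modified
git add lib/awesomelib            # 4. commit the pointer to the new HEAD
git commit -q -m "Update awesomelib to latest master"

recorded=$(git rev-parse HEAD:lib/awesomelib)
upstream=$(git -C "$work/awesomelib" rev-parse master)
echo "parent now records $recorded"
```

Note that the parent commit contains nothing but the new commit ID for lib/awesomelib; none of the library's files are stored in the parent.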
Pushing to a submodule
The only difference between making changes to code within a submodule directory and a regular directory is that we must commit and push to the submodule repository before then moving up a directory and committing the pointer to the new submodule commit and pushing that to the remote of the parent repository.
This needs a more detailed example, which I’ll start by adding a file to the submodule folder:
    $ cd lib/awesomelib
    $ touch hello.txt
    $ git status
    HEAD detached at 2c81f4f
    Untracked files:
      (use "git add <file>..." to include in what will be committed)

        hello.txt

    nothing added to commit but untracked files present (use "git add" to track)
When the contents of a submodule folder have been modified they appear as a single line if we run git status in the parent repository:
    $ cd ../..
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)
      (commit or discard the untracked or modified content in submodules)

        modified:   lib/awesomelib (untracked content)

    no changes added to commit (use "git add" and/or "git commit -a")
This output from git status can be confusing, because it looks like only a single file has changed, when in fact there could be massive changes within the submodule directory.
If I see a modified submodule directory and I haven’t modified it myself, I tend to run git submodule update to ensure that the checked out code for the submodule is the version it’s expected to be.
If you don’t do that, you are likely to end up committing the incorrect version of the submodule that is present in your working copy.
As the changes in this example are deliberate, we should commit them by changing directory to lib/awesomelib and committing there, ready to be pushed:
    $ cd lib/awesomelib
    $ git add -A
    $ git status
    HEAD detached at 2c81f4f
    Changes to be committed:
      (use "git reset HEAD <file>..." to unstage)

        new file:   hello.txt

    $ git commit -m "Test file."
    [detached HEAD 6498362] Test file.
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 hello.txt
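Ignore the “detached HEAD” here. It isn’t ideal, because a commit made on a detached HEAD doesn’t belong to any branch (normally you would check out a branch such as master in the submodule before committing), but it isn’t relevant to the ordering this example is about.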
So I’ve created a new commit in the submodule, but I haven’t yet pushed. If I move up a directory, I will then be back in the parent repository, and I will see that the submodule has a new commit:
    $ cd ../..
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   lib/awesomelib (new commits)

    no changes added to commit (use "git add" and/or "git commit -a")
There’s nothing to stop me from committing this change in the parent, even though I haven’t pushed the submodule change to the remote. So I need to make sure that after a submodule commit I also push, from inside the submodule directory:
    $ git push origin master
    Counting objects: 62, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (40/40), done.
    Writing objects: 100% (62/62), 11.63 KiB | 0 bytes/s, done.
    Total 62 (delta 22), reused 58 (delta 21)
    To https://github.com/mowen/awesomelib
Now I’m safe to commit the submodule change in the parent repository:
    $ cd ../..
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes not staged for commit:
      (use "git add <file>..." to update what will be committed)
      (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   lib/awesomelib (new commits)

    no changes added to commit (use "git add" and/or "git commit -a")
    $ git add -A
    $ git status
    On branch master
    Your branch is up-to-date with 'origin/master'.
    Changes to be committed:
      (use "git reset HEAD <file>..." to unstage)

        modified:   lib/awesomelib

    $ git commit -m "Test file."
    [master 0297f84] Test file.
     1 file changed, 1 insertion(+), 1 deletion(-)
And push it as normal:
    $ git push origin master
    Counting objects: 3, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (3/3), 310 bytes | 0 bytes/s, done.
    Total 3 (delta 2), reused 0 (delta 0)
    To https://github.com/mowen/parentrepo
That may seem quite convoluted, but we are dealing with two separate repositories, so there is always going to be twice as much work.
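For reference, the whole sequence condensed into one script. This is a sketch under assumptions: two local bare repositories stand in for the awesomelib and parentrepo remotes, both seeded from a made-up empty commit, and the work happens on the submodule's master branch rather than on a detached HEAD, so that git push origin master actually carries the new commit. The protocol.file.allow=always setting is only needed because the submodule is cloned from a local path.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical bare stand-ins for the awesomelib and parentrepo remotes.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed awesomelib.git
git clone -q --bare seed parentrepo.git

# Working copy of the parent, with the submodule already added and pushed.
git clone -q parentrepo.git parentrepo
cd parentrepo
git -c protocol.file.allow=always submodule --quiet add "$work/awesomelib.git" lib/awesomelib
git commit -q -m "Add awesomelib submodule"
git push -q origin master

# Step 1: commit AND push inside the submodule first.
cd lib/awesomelib
git checkout -q master            # be on a branch, not a detached HEAD
touch hello.txt
git add hello.txt
git commit -q -m "Test file."
git push -q origin master

# Step 2: only then commit and push the new pointer in the parent.
cd ../..
git add lib/awesomelib
git commit -q -m "Update awesomelib pointer"
git push -q origin master

pointer=$(git rev-parse HEAD:lib/awesomelib)
echo "parent points at $pointer"
```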
The order in which you commit and push changes when working with submodules is so important that I consider it the golden rule of modifying submodules…
The golden rule of modifying submodules
Always commit and push the submodule changes first, before then committing the submodule change in the parent repository.
As mentioned above, a submodule is nothing but a pointer to a specific commit in an external repository, so how can you possibly commit and push a reference to that pointer if it doesn’t exist on a server somewhere, accessible by everyone’s parent repositories?
Without following this rule you can get into a confusing state in which the parent repository is pointing to a submodule commit that only exists on your local machine. The tooling should warn about this and reject the push, but I haven’t seen it happen yet.
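Git does have an opt-in check for exactly this situation: git push accepts --recurse-submodules=check, which aborts the push when the parent commits reference submodule commits that are not on any remote of the submodule. It is not on by default, which may be why it never seems to trigger. A minimal sketch, assuming local bare repositories as stand-ins for both remotes (all paths and commit messages here are invented for the illustration):

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical bare stand-ins for the awesomelib and parentrepo remotes.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed commit -q --allow-empty -m "Initial commit"
git clone -q --bare seed awesomelib.git
git clone -q --bare seed parentrepo.git

git clone -q parentrepo.git parentrepo
cd parentrepo
# protocol.file.allow is only needed because the submodule URL is a local path.
git -c protocol.file.allow=always submodule --quiet add "$work/awesomelib.git" lib/awesomelib
git commit -q -m "Add awesomelib submodule"

# Break the golden rule: commit in the submodule, do not push it,
# then commit the new pointer in the parent.
cd lib/awesomelib
touch hello.txt
git add hello.txt
git commit -q -m "Test file."
cd ../..
git add lib/awesomelib
git commit -q -m "Update awesomelib pointer"

# With the check enabled, the parent push is refused.
if git push -q --recurse-submodules=check origin master 2>/dev/null; then
    rejected=no
else
    rejected=yes
fi
echo "parent push rejected: $rejected"

# Push the submodule first and the same parent push goes through.
git -C lib/awesomelib push -q origin master
git push -q --recurse-submodules=check origin master
```

To make the check permanent for a repository, git config push.recurseSubmodules check applies it to every push without the extra flag.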
Issues with Submodules
Issues with submodules tend to arise due to the poor tooling. As mentioned, I’ve found that it is necessary to manually run a git submodule update each time I pull updates and find that a submodule has been updated, and it’s also necessary when switching between branches. You can tell if it’s been updated because a clean checkout will say that the submodule has been modified.
If you don’t notice that you need to update the submodule, all it takes is a lazy git add -A or git commit -a and you’ve downgraded the submodule to the version you’ve had in your working copy all along. This stale submodule can cause the entire project to get into a mess.
If you define an alias which runs git submodule update after every single git pull then you will be safe, but a newbie is unlikely to do this.
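One possible shape for such an alias, as a sketch only. The alias name pullsub is made up, the --init --recursive flags are an addition so that new or nested submodules are also picked up, and the config is written to a throwaway repository here; in practice you would set it with --global or in the repository you work in.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

# Hypothetical alias: "git pullsub" pulls, then brings every submodule
# to the commit recorded by the freshly pulled parent.
git config alias.pullsub '!git pull && git submodule update --init --recursive'

# Alternative on Git 2.14 or later: ask commands such as pull and checkout
# to recurse into submodules by themselves.
git config submodule.recurse true

alias_value=$(git config --get alias.pullsub)
echo "pullsub = $alias_value"
```

The submodule.recurse option has the advantage of also covering branch switches, which the alias does not.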
A brief overview of git subtrees
Adding a subtree
The following call to git subtree will be roughly equivalent to the git submodule command above:
    $ git subtree add --prefix lib/awesomelib https://github.com/mowen/awesomelib master --squash
    git fetch https://github.com/mowen/awesomelib master
    warning: no common commits
    remote: Counting objects: 11, done.
    remote: Compressing objects: 100% (10/10), done.
    remote: Total 11 (delta 0), reused 11 (delta 0)
    Unpacking objects: 100% (11/11), done.
    Resolving deltas: 100% (7/7), done.
    From https://github.com/mowen/awesomelib
     * branch            master     -> FETCH_HEAD
    Added dir 'lib/awesomelib'
This will clone the remote repository into the lib/awesomelib folder, and create two commits for it.
The first is the squashing down of the entire history of the remote repository that we are cloning:
    commit 70a0b8b8e2c76d9bcfd00f8f935d11941d2937d8
    Author: Martin Owen <firstname.lastname@example.org>
    Date:   Sat Apr 9 19:50:49 2016 +0100

        Squashed 'lib/awesomelib/' content from commit d3abff6

        git-subtree-dir: lib/awesomelib
        git-subtree-split: d3abff6e5307227858d5323cf8aaf108c542ad2b
The second is a merge commit that brings the squashed commit into the parent, with the squashed commit’s SHA-1 in its message:
    commit df09e101ac1bcb1e6d48cb4ab6b28c707b5b0402
    Merge: cc78b8d 70a0b8b
    Author: Martin Owen <email@example.com>
    Date:   Sat Apr 9 19:50:49 2016 +0100

        Merge commit '70a0b8b8e2c76d9bcfd00f8f935d11941d2937d8' as 'lib/awesomelib'
If I run git status, I’ll see nothing, as git subtree will have created the commits for me and left the working copy clean. Also there will be nothing in the lib/awesomelib folder to indicate that it ever came from another git repository. And as with submodules, this is both an advantage and a disadvantage.
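These effects can be reproduced without touching a real remote. A minimal sketch, assuming your Git installation ships the git subtree command (it lives in contrib and some packages leave it out) and using a made-up local repository in place of https://github.com/mowen/awesomelib:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical local stand-in for the awesomelib repository.
git init -q awesomelib
git -C awesomelib symbolic-ref HEAD refs/heads/master
echo "hello" > awesomelib/README
git -C awesomelib add README
git -C awesomelib commit -q -m "Initial commit"

# The parent needs at least one commit before a subtree can be added.
git init -q parentrepo
cd parentrepo
git commit -q --allow-empty -m "Initial commit"

git subtree add --prefix lib/awesomelib "$work/awesomelib" master --squash

# Two new commits: the squashed history and the merge that brings it in.
git log --oneline -2
commits=$(git rev-list --count HEAD)
merge_parents=$(git log -1 --format=%P | wc -w)

# The working copy is clean, and the folder holds plain files only:
# no nested .git and no .gitmodules in the parent.
git status --short
ls -A lib/awesomelib
```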
Pulling from a subtree
Pulling changes from the remote to the subtree isn’t difficult at all, and is very similar to the add:
$ git subtree pull --prefix lib/awesomelib https://github.com/mowen/awesomelib master --squash
You should be able to see that the parameters are exactly the same as for the add; we’ve just changed the command to pull. The command will also create a similar set of commits to the earlier add.
So far so good.
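Because the repository has to be named on every add and pull, it can help to register it as a remote once and use the short name afterwards. A sketch under assumptions: the remote name awesomelib and the local stand-in repository are invented for the example, and GIT_MERGE_AUTOEDIT=no is set only so that the merge commit created by the pull does not open an editor.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_MERGE_AUTOEDIT=no

# Hypothetical local stand-in for the awesomelib repository.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/master
echo "v1" > upstream/README
git -C upstream add README
git -C upstream commit -q -m "v1"

git init -q parentrepo
cd parentrepo
git commit -q --allow-empty -m "Initial commit"

# Name the repository once...
git remote add awesomelib "$work/upstream"
git subtree add --prefix lib/awesomelib awesomelib master --squash

# ...a new commit is published in the library...
echo "v2" > "$work/upstream/README"
git -C "$work/upstream" commit -q -am "v2"

# ...and the pull takes the same parameters as the add.
git subtree pull --prefix lib/awesomelib awesomelib master --squash

cat lib/awesomelib/README
```

The remote is purely a convenience; git subtree does not record it anywhere, so collaborators would have to add it themselves.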
Pushing to a subtree
Things get really tricky when we need to push commits back to the original repository. This is understandable because our repository has no knowledge of the original repository, and has to figure out how to prepare the changes so that they can be applied to the remote before it can push.
    $ git subtree push --prefix lib/awesomelib https://github.com/mowen/awesomelib master
    git push using:  https://github.com/mowen/awesomelib master
    Counting objects: 3, done.
    Delta compression using up to 8 threads.
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (3/3), 325 bytes | 0 bytes/s, done.
    Total 3 (delta 2), reused 0 (delta 0)
    To https://github.com/mowen/awesomelib
       2c81f4f..f0a54ff  f0a54ff7151a05ae9408a45daba88164bd4ab8cd -> master
In my experience how long this takes to run depends on the amount of history in the parent repository, your OS, and your machine. I’ve seen it take so long when running the command in a large repository on Windows that I had to give up and go back to using submodules, but I’ve found it to work more quickly on OS X.
The implementation is visible at: https://github.com/git/git/blob/master/contrib/subtree/git-subtree.sh and the split command (run as part of a push) is what takes significant time, but I’ve not been able to determine exactly why.
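Going by that script, a subtree push amounts to a split followed by an ordinary git push of the resulting commit, so the two halves can be run separately to see where the time goes. A minimal sketch, assuming git subtree is installed and using a made-up local bare repository in place of the awesomelib remote:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical bare stand-in for the awesomelib remote.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
echo "hello" > seed/README
git -C seed add README
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed awesomelib.git

# Parent repository with the library added as a squashed subtree.
git init -q parentrepo
cd parentrepo
git commit -q --allow-empty -m "Initial commit"
git subtree add --prefix lib/awesomelib "$work/awesomelib.git" master --squash

# A change made in the parent, inside the subtree folder.
touch lib/awesomelib/hello.txt
git add lib/awesomelib/hello.txt
git commit -q -m "Test file."

# Half 1: rebuild a history containing only lib/awesomelib.
# This walks the parent's history, which is the slow part on big repositories.
split=$(git subtree split --prefix lib/awesomelib)

# Half 2: an ordinary push of that synthetic commit to the library's master.
git push -q "$work/awesomelib.git" "$split:refs/heads/master"
echo "pushed $split"
```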
Issues with Subtrees
After so many issues with submodules I had high hopes for subtrees, but was quite disappointed. For a start there is very little documentation. This text file is the best official documentation I’ve found, and everything else I know has come from either Stack Overflow or blog posts.
My other main issue is the slow push speed on Windows that I mentioned above. I’ve found it to be so bad that it has made subtrees unviable for me.
In my opinion subtrees are not a direct replacement for submodules. The way I believe you should split your shared code between subtrees and submodules is this:
- Is the external repository something you own yourself and are likely to push code back to? Then use a submodule. This gives you the quickest and easiest way for you to push your changes back.
- Is the external repository third party code that you are unlikely to push anything back to? Then use a subtree. This gives the advantage of not having to give people permissions to an extra repo when you are giving them access to the code base, and also reduces the chance that someone will forget to run a git submodule update.
If you think I’m a complete idiot who has totally misunderstood and misrepresented submodules or subtrees, please let me know in the comments.