Candid’s brain

Making backups with Git

Posted on: November 19th, 2009

I am the maintainer of a website where I have very restricted access to the files on the webspace. The only way to access them is via sftp or scp, and the only commands I may execute when connecting to the server via ssh are a few basic ones like ls, echo and rsync. So the only proper way to get a backup of the files is to copy them using rsync. I have been looking for a way to keep these regular backups in a revision control system for a long time, but with SVN it just looked too complicated to maintain automatically, as you have to check which directories and files have been added or removed and then svn add or svn rm them.

So far I have been using the rsync hardlink mechanism. When creating the daily backup copy, I passed the directory of the previous backup to rsync using the --link-dest parameter. rsync then did not create a new copy of files that were identical in the old backup, but instead hard-linked them, so they did not require any additional disk space.

With Git, managing the backups becomes very easy. I just update my Git working copy to the current state of the webspace and then run git add ., which automatically stages all newly created files. As Git doesn't track directories, there is no trouble with them. When I then run git commit -a, files that have been removed from the directory are automatically removed from Git! I don't have to look for them separately, as I would have to in SVN.
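The whole daily run can be sketched like this (repository path, file names and commit messages are invented; the commented-out rsync line is where the real setup would refresh the working copy):

```shell
# Sketch of the Git-based backup routine; paths and messages are examples.
set -e
mkdir -p webspace-backup && cd webspace-backup
git init -q .
git config user.email "backup@example.com"
git config user.name "Backup script"
# In the real setup, rsync would refresh the working copy here, e.g.:
#   rsync -a user@server:htdocs/ .
echo "day one" > index.html
echo "temp"    > old.html
git add .                            # newly created files are staged automatically
git commit -q -m "backup day 1"
# Next day: index.html changed on the server, old.html was deleted there.
echo "day two" > index.html
rm old.html
git add .
git commit -q -a -m "backup day 2"   # -a also records the deletion
```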

With my old rsync system, the problem was that some large files changed slightly but very often, such as an SQLite database to which an entry was added. As the file changed, a hardlink could not be created and the whole new file had to be saved, which took up a lot of disk space over time. I don't know whether Git's mechanism works better here by saving only the diffs of these files.
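This can actually be checked locally: once git gc packs the objects, Git stores such near-identical file versions as deltas against each other, so two versions take little more space than one. A rough experiment (file name invented, the size threshold is just a generous bound for this demo):

```shell
# Rough check that packed Git objects are delta-compressed.
set -e
mkdir -p deltademo && cd deltademo
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
head -c 1000000 /dev/urandom > big.db   # stands in for the SQLite file
git add . && git commit -q -m "v1"
echo "one new row" >> big.db            # a tiny change to a big file
git commit -q -a -m "v2"
git gc --quiet                          # repack; deltas are computed here
# Both versions together should take barely more than one copy:
packsize=$(cat .git/objects/pack/*.pack | wc -c)
```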

With the old rsync system it was very easy to remove old backups: I could just remove the backup directory. As hardlinks were used instead of softlinks, it was no problem when the old files were removed. I don't know what I'll do when my Git repository gets too big; maybe there is a way to remove old commits. Suggestions are welcome.

Update: The files on the webspace take about 1.1 GiB in total. The Git repository (in .git) with one commit takes 965 MiB. The next day's backup takes an additional 3.7 MiB on the file system with the rsync method; in Git, the new commit takes about 2 MiB. Looks good so far.

Update: I backed up another directory that contains lots of text files and some SQLite databases. The directory uses 50 MiB of disk space, the Git repository with one commit only 9.9 MiB. The second backup takes an additional 29 MiB with the rsync method, but only 7.1 MiB in the Git repository. Looks great so far.

I just found this article, which shows Git's ability to commit only parts of the changes you made to a file. Just great.

Git submodules

Posted on: November 18th, 2009

Another example of why the current Git submodule support is complete shit: I have various PHP applications that use the same Java backend, calling it using exec(). Each of these PHP applications has its own Git repository, and the Java backend has one, too. The PHP repositories include the Java repository as a submodule, so if someone clones one of these PHP repositories, they have to run git submodule update --init after git clone.

If I now make an update in my Java library and commit it to the public repository, the new version won't be used automatically in the PHP applications. Instead, I have to run these commands in all the PHP applications:

cd java
git pull
cd ..
git commit -a
git push

After updating the working copy of the Java submodule (using git pull), it appears as a modified file in the PHP application repository, so I have to commit that change.

Users of the PHP applications now cannot just run git pull to get the new version of the application (including the new version of the Java submodule); instead, they have to run an additional git submodule update afterwards so that the working copy of the submodule gets updated, too. So you have to tell your users that they can't just git pull changes, but have to run an additional command every time.

Now things get even funnier: the Java library requires an external library to work, so it contains a submodule itself. The thing is, when people download a PHP application and load the Java submodule using git submodule update --init, the submodules of the Java submodule won't be initialised automatically. So users have to run the following commands to get a working copy of my PHP application after git clone:

git submodule update --init
cd java
git submodule update --init
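The whole two-level situation can be reproduced locally. The following sketch invents all repository names; the protocol.file.allow setting is only needed because newer Git versions refuse local-path submodule clones by default (older versions simply ignore it):

```shell
# Local reproduction of the nested-submodule situation; all names invented.
set -e
mkdir -p subdemo && cd subdemo
for r in extlib java phpapp; do
  git init -q "$r"
  git -C "$r" config user.email "demo@example.com"
  git -C "$r" config user.name "Demo"
done
( cd extlib && echo "external lib" > lib.txt && git add . \
  && git commit -q -m "init" )
( cd java \
  && git -c protocol.file.allow=always submodule --quiet add "$PWD/../extlib" extlib \
  && git commit -q -m "use extlib as submodule" )
( cd phpapp \
  && git -c protocol.file.allow=always submodule --quiet add "$PWD/../java" java \
  && git commit -q -m "use java backend as submodule" )
# What a user has to do after cloning the PHP application:
git clone -q phpapp copy
cd copy
git -c protocol.file.allow=always submodule update --init
( cd java && git -c protocol.file.allow=always submodule update --init )
```

Only after the second, manual descent into java/ does the external library actually appear in the working copy.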

Now imagine that the external library used by my Java library introduces a new feature that I start to use in my Java library. I have to update the external library's submodule in my Java library and commit that change. Then I have to update the Java library submodule in all my PHP applications and commit there, too. Imagine what a user of my PHP application has to run every time they want to update their working copy to a new working version:

git pull
git submodule update
cd java
git submodule update

My projects are rather small; imagine what you'd have to do to update a working copy of a huge project…

When I use a web application on my server (such as a webmail client or phpMyAdmin), I usually check out the stable branch of its SVN repository and run svn up every now and then to get updates. I don't need to know anything about the repository structure of the project to do that. With Git, I would have to know in which directories I have to update the submodules manually; alternatively, there could be a shell script to update the working copy, which I would have to remember to call instead of git pull. This makes things unbelievably complicated. I hope Git will at least introduce a feature that automatically fetches the submodules on git clone or git pull.
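Such a shell script might look like this (the name update.sh and the java/ path are specific to my setup; this is a sketch, not something Git provides):

```shell
# Sketch of a wrapper script users would run instead of a bare "git pull".
cat > update.sh <<'EOF'
#!/bin/sh
set -e
git pull
git submodule update --init
# The java/ submodule contains a submodule itself, so descend once more:
( cd java && git submodule update --init )
EOF
chmod +x update.sh
```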

Update: Man pages on the internet list a --recursive option for both git submodule and git clone that does exactly this. None of my computers has a Git version that supports it yet, so it must be a very new feature. I don't know whether the option is available for git pull or git checkout as well. I hope it will become (or already is) the default behaviour of Git. I am still missing an option to always include the newest commit of a repository as a submodule instead of a specific one.

Update: Apparently, the --recursive option was added in Git 1.6.5, which is still marked unstable in Gentoo Portage.

SVN pain

Posted on: November 18th, 2009

Ugliest SVN issue: when committing, the revision number of the working copy is not increased, even though it would only have to be increased by 1. I tend to never run svn update on my working copies of projects that I develop all alone. My working copy is always up to date, but the revision number is not…

Recently I've been removing directories from SVN quite often, but every time I want to commit the change, SVN tells me that the directories are out of date. And updating a directory that has been removed using svn rm does really strange things…

Git disadvantages

Posted on: November 16th, 2009

I've been considering migrating from Subversion to Git lately, and have finally managed to understand how Git works. A good introduction is "Understanding Git Conceptually". The Wikipedia article, on the other hand, is not a good introduction, as it mainly explains what Git is trying to be, not what it actually is.

The biggest misunderstanding is that Git is called a "distributed" or "decentralised" revision control system, as opposed to a "centralised" one. In fact, aside from the fact that you normally have a copy of the full repository in your working copy, Git isn't that at all. When you hear about a "decentralised" revision control system, you expect it to work a bit like p2p file sharing: commits would be exchanged directly between developers, with a central server only mediating. This is not the case with Git; you will always have a central repository that you commit to. If you try to develop without a central repository, you will end up in a mess.

The fact that Git is trying to be decentralised without actually being so leads to confusion that could have been avoided by designing it as a centralised system. For example, when you create a public branch in the central repository, you cannot use that branch directly but instead have to create a local branch that "tracks" the remote branch.
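What this looks like in practice (repository and branch names invented; the "central" repository here is just a local directory):

```shell
# Demonstration of having to create a local tracking branch.
set -e
mkdir -p trackdemo && cd trackdemo
git init -q central
git -C central config user.email "demo@example.com"
git -C central config user.name "Demo"
( cd central && echo x > file.txt && git add . && git commit -q -m "init" \
  && git branch public )
git clone -q central work
cd work
# "origin/public" cannot be committed to directly; a local branch that
# tracks it has to be created first:
git checkout -q -b public origin/public
```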

The main difference between Git and Subversion is often claimed to be that in Git you have the whole repository in your working copy, whereas in Subversion you only have the newest commit. In my eyes this is a minor difference; the main difference (and advantage) is that Git has native support for branches (whereas in Subversion you have to emulate branch behaviour by duplicating directories), and these branches can optionally exist only locally. It is very easy to create different branches for different minor changes, and you can develop on them even when you don't have an internet connection.
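The local-branch workflow in a nutshell (branch and file names invented; no server is involved at any point):

```shell
# Purely local branching; names are examples.
set -e
mkdir -p branchdemo && cd branchdemo
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "base" > file.txt
git add . && git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)   # works whether it's master or main
# A cheap local branch for one minor change:
git checkout -q -b fix-typo
echo "fixed" > file.txt
git commit -q -a -m "fix typo"
git checkout -q "$main"                 # back to the main branch
```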

The only disadvantage of Git compared to Subversion that I have come across is a huge one, and you should really consider it before migrating to Git. Git comes with very useful functionality for participating in the development of a project. However, another very important use of Subversion is to get a copy of a project or a part of it (such as a library) and to easily keep up to date with changes without dropping your own local hacks (something like a "read-only working copy" that you never intend to commit any changes to). Git is currently not made for this use at all, as you always have to download a lot more than you need. There is in fact an option to avoid downloading old commits that you don't need (using --depth=1), but there is no way to download only a specified sub-directory.

Common practice in Git is to create a separate repository for every part of a project that one might want to check out on its own, and then to include these repositories using submodules (something similar to SVN externals). The problem with that is that it creates a lot of work in Git to make a change in one submodule, commit it, and then load the changes into the other repositories that include it. And for many projects, it is just impractical to split the code into multiple repositories. If I want to include a part of a large project (such as a part of a library) in my own project, I have to include that whole project, which can take hours to download given today's slow internet connections. There might be workarounds for this, but they certainly aren't as simple as a single "svn up".
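The --depth option in action (local repository with invented content; the file:// URL is needed because Git treats plain local paths specially). The shallow copy contains only the newest commit, but can still be updated later with git pull:

```shell
# Shallow clone demo: only the newest commit is transferred.
set -e
mkdir -p shallowdemo && cd shallowdemo
git init -q project
git -C project config user.email "demo@example.com"
git -C project config user.name "Demo"
( cd project && echo v1 > f && git add . && git commit -q -m "v1" \
  && echo v2 > f && git commit -q -a -m "v2" )
# A "read-only" copy without the old commits:
git clone -q --depth=1 "file://$PWD/project" readonly-copy
cd readonly-copy
```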

So the major advantage of SVN over Git is that it is very easy and fast to get a complete and up-to-date "read-only" copy of a project by just using "svn co" or "svn up". In Git, you have to clone the repository, and even then it might be incomplete because you have to additionally initialise the submodules. And downloading those might take hours.

As long as this disadvantage persists in Git, many projects will keep using SVN. And as long as these projects keep using SVN, it will be difficult for other projects to migrate to Git, because they reference these projects using svn:externals.

I hope that these possibilities will be included in future versions of Git:

  1. Download a “read-only” copy of a Git repository or one of its sub-directories, with automatic initialisation of its submodules. (The copy should of course still be updatable using “git pull”.) Transmitting old commits and other useless bandwidth usage should be avoided.
  2. Something like svn:externals: something that automatically pulls the newest commit of a repository or one of its sub-directories into a sub-directory of my working copy.
  3. Subversion support for this svn:externals-like mechanism. It is already possible to clone a Subversion repository, so it shouldn't be a problem to support importing them.
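For reference, this is how the svn:externals mechanism I keep referring to is configured on the Subversion side (URL and directory name invented); it is a versioned property on a directory, set once and picked up by every later update:

```shell
svn propset svn:externals 'java https://example.com/svn/javalib/trunk' .
svn commit -m "pull in the Java library as an external"
svn update   # checks the external library out into java/
```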