Candid’s brain

Making backups with Git

I maintain a website where I have very restricted access to the files on the webspace. The only way to reach them is via sftp or scp, and the only commands I may execute when connecting to the server over ssh are a few basic ones such as ls, echo and rsync. So the only proper way to get a backup of the files is to copy them with rsync. I have long been looking for a way to keep these regular backups in a revision control system, but with SVN it looked too complicated to automate, as you have to work out which directories and files have been added or removed and then svn add or svn rm them.

So until now I have been using rsync's hard-link mechanism. When creating the daily backup copy, I passed the directory of the previous backup to rsync using the --link-dest parameter. rsync then did not create a new copy of any file that was unchanged since the old backup, but hard linked to it instead, so unchanged files did not require any additional disk space.
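The hard-link scheme described above can be sketched roughly as follows. This is only a minimal sketch, not the exact script used for the backups: the function name snapshot, the host name and the directory names are all made up for illustration.

```shell
#!/bin/sh
# snapshot SRC DEST [PREV]
# Copy SRC into the new backup directory DEST. If PREV (the directory of
# the previous backup) is given, files that are unchanged since PREV are
# hard linked to the old copy instead of being stored a second time.
snapshot() {
    src=$1
    dest=$2
    prev=${3:-}
    mkdir -p "$dest" || return 1
    if [ -n "$prev" ]; then
        # rsync resolves a relative --link-dest against DEST,
        # so turn PREV into an absolute path first.
        prev=$(cd "$prev" && pwd) || return 1
        rsync -a --delete --link-dest="$prev" "$src/" "$dest/"
    else
        # First backup: there is nothing to link against yet.
        rsync -a --delete "$src/" "$dest/"
    fi
}

# Hypothetical usage (host and paths are placeholders):
#   snapshot user@example.org:htdocs /backups/day2 /backups/day1
```

Because every backup directory is a complete tree of hard links, deleting an old directory never damages the newer ones.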

With Git, it becomes very easy to manage the backups. I just update my Git working tree to the current state of the webspace and then run git add . in it. This way, all newly created files are added automatically. As Git tracks only files and doesn't care about directories, there is no trouble with them. When I then run git commit -a, files that have been removed from the directory are automatically removed from Git as well! I don't have to look for them separately, as I would have to in SVN.
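As a rough sketch of that workflow, assuming the working tree has already been synced from the webspace: the function name commit_backup, the default commit message and the rsync line in the comment (including its host name) are illustrative assumptions, not the exact commands used here.

```shell
#!/bin/sh
# commit_backup DIR [MESSAGE]
# Record the current state of DIR as one Git commit.
#
# DIR is expected to have been refreshed beforehand, for example with
# (hypothetical host; the exclude keeps --delete away from the repository):
#   rsync -a --delete --exclude=/.git/ user@example.org:htdocs/ "$DIR/"
commit_backup() {
    dir=$1
    msg=${2:-"Backup $(date +%Y-%m-%d)"}
    (
        cd "$dir" || exit 1
        # Create the repository on the first run.
        [ -d .git ] || git init -q || exit 1
        # Stage all new and modified files.
        git add . || exit 1
        # Nothing added, changed or deleted: skip the commit.
        if [ -z "$(git status --porcelain)" ]; then
            exit 0
        fi
        # -a additionally stages files that have disappeared from DIR.
        git commit -q -a -m "$msg"
    )
}
```

Calling commit_backup once a day after the rsync run yields one commit per backup. Directories need no special handling, because Git only records the files inside them.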

With my old rsync system, the problem was that some large files changed slightly but very often, such as an SQLite database to which an entry was added. As the file had changed, a hard link could not be created and the whole new file had to be saved, which took up lots of disk space over time. I don't know whether Git's mechanism works better here by saving only the diffs of these files.
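One way to find out is to measure it. Git first stores each new version of a file as a separate compressed object, and only stores similar versions as deltas against each other once the objects are packed, which git gc does. The sketch below repacks a repository and prints the resulting pack size; the function name packed_size_kib is made up for this example.

```shell
#!/bin/sh
# packed_size_kib DIR
# Repack the Git repository in DIR and print the total size of its
# pack files in KiB, as reported by "git count-objects -v".
packed_size_kib() {
    (
        cd "$1" || exit 1
        # gc packs loose objects and delta-compresses similar ones.
        git gc --quiet || exit 1
        git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
    )
}
```

Comparing the number before and after a backup commit shows how much that commit really costs once it has been packed.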

With the old rsync system it was very easy to remove old backups: I could just delete the backup directory. As hard links were used instead of soft links, removing the old files caused no problem for the newer backups. I don't know what I'll do when my Git repository gets too big; maybe there is a way to remove old commits. Suggestions are welcome.
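One possible approach, offered only as a sketch and not something tried on these backups, is to make a shallow clone that keeps just the most recent commits and to continue in that copy. The function name truncate_history and its arguments are made up for this example.

```shell
#!/bin/sh
# truncate_history REPO NEW KEEP
# Create NEW as a shallow clone of REPO that contains only the KEEP
# most recent commits of the current branch.
truncate_history() {
    repo=$(cd "$1" && pwd) || return 1
    # A file:// URL is needed because Git ignores --depth when
    # cloning from a plain local path.
    git clone -q --depth "$3" "file://$repo" "$2"
}

# Hypothetical usage (paths are placeholders):
#   truncate_history /backups/site /backups/site-short 30
```

After checking that the new repository contains what is expected, the old one can be deleted and the new one moved into its place. Until then the new repository still refers to the old one as its origin remote.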

Update: In total, the files on the webspace take up about 1.1 GiB. The Git repository (in .git) with one commit takes 965 MiB. The next day's backup takes an additional 3.7 MiB on the file system with the rsync method; in Git, the new commit takes about 2 MiB. Looks good so far.

Update: I backed up another directory that contains lots of text files and some SQLite databases. The directory uses 50 MiB of disk space, the Git repository with one commit only 9.9 MiB. The second backup takes an additional 29 MiB with the rsync method, but only 7.1 MiB in the Git repository. Looks great so far.

