The Art of Backup
Introduction
I’ve been meaning to write a post about backup for some time now. This morning, after posting a comment on Stephen’s blog about it, I decided it was time to finish it off.
My backup strategy has been evolving since around 2006, when I decided that my photos were not adequately protected against disaster. My backup solution at the time was to periodically burn a copy to CD, which I kept in my house. The obvious limitation is that if the house is destroyed, by fire for example, all my photos are lost with it. What originally got me thinking about this was a major fire in one of the apartments in the block I lived in. I was away at the time, and fortunately it didn’t spread to the other apartments.
The iPod Solution
I looked into off-site backup solutions such as burning CDs and sending them to my parents, or syncing to an external drive and storing it at work. Online “cloud” solutions were too expensive at the time for the amount of data I wanted to store. My solution in the end was to buy the biggest iPod available at the time: a fifth-generation, 60 GB model. This had enough room to store all my music and photos as well as some other important data, and it was always on me when I went out. I figured that if a fire or something happened while I was at home, I would grab it on the way out.
This worked reasonably well, as iTunes synced the music and photos automatically. There was a manual step to sync the other data though, which meant it was never as up to date. This was all fine until the iPod was stolen. Whilst I was more upset at the circumstances that led to it being taken, I also didn’t like the idea of someone having all of my important data.
ExaVault
With the iPod gone I started using ExaVault, an online service that gives you access to a certain amount of storage via rsync for a monthly fee. This worked well for smaller datasets (i.e. not photos and music), but it was too expensive for pushing tens of gigabytes. ExaVault was great because I could schedule the backup to run via cron, and it didn’t require any thought after that.
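For anyone curious, the whole setup amounts to a single cron entry running rsync. The sketch below is illustrative only: the schedule, local folder and remote host are placeholders, not my actual configuration.

# Illustrative crontab entry: mirror ~/Documents to the remote storage every night at 2am
0 2 * * * rsync -az --delete ~/Documents/ username@remote.example.com:backup/documents/

The -a flag preserves permissions and timestamps, -z compresses data over the wire, and --delete keeps the remote copy an exact mirror of the local folder.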
To mitigate the risk of losing photos I set up a Time Machine drive when Mac OS X 10.5 was released and also stored a copy of my Aperture library in a vault on an external drive. I still didn’t have a remote solution for them though.
Amazon S3
When I started a new job a colleague mentioned that he was using Amazon S3 for backup and that it was really cheap. I looked into it and found that S3’s pricing was low enough that it was financially feasible to push a copy of all my photos to the service. I considered Jungle Disk but in the end didn’t go with it because I wanted a single solution that I could use on my Linux VPS as well as my Mac. I ended up using s3sync, which is basically rsync for S3. This had the added benefit of being a drop-in replacement for my ExaVault setup.
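Because s3sync takes a local source and a bucket:prefix destination much like rsync takes a source and a remote path, the cron job can stay essentially the same. The line below is a sketch only: the bucket name, prefix and local path are placeholders, and s3sync picks up its AWS credentials from its own configuration.

# Illustrative nightly sync of a photo library to an S3 bucket using s3sync
0 3 * * * s3sync.rb -r --ssl --delete /Users/me/Pictures/ my-backup-bucket:photos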
The initial upload of over 40 GB of photos to S3 took a couple of weeks, I think, but after that it happily ran without any drama. I also moved VPS hosts and was able to pull down configs from S3 after the old server had become inaccessible.
Whilst I had the S3 solution in place I upgraded my Mac. I bought a second drive for the new Mac, which left me with the factory-supplied drive. I use that drive as a Time Machine drive, and it is big enough to keep a copy of almost everything on the main drive. The few things that aren’t backed up are my Downloads folder and my Movies folder (although it does copy the movies I take on my camera).
S3 Problems
After a while I noticed that s3sync was copying a few files every time it ran, even though they hadn’t changed. These files had Unicode filenames. After some research I found out that HFS+ stores filenames as decomposed character sequences, which did not play nicely with s3sync. Presumably the names were stored in pre-composed form on S3 and were then seen as different when the sync was run.
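To illustrate the difference (a rough sketch, not the actual files involved): the pre-composed and decomposed forms of a name like “café” look identical on screen but are different byte sequences, so a byte-for-byte comparison treats them as two different names.

# Pre-composed (NFC): 'é' is a single code point, U+00E9
printf 'caf\xc3\xa9' | hexdump -C     # ... 63 61 66 c3 a9
# Decomposed (NFD), as HFS+ stores it: 'e' followed by the combining acute accent, U+0301
printf 'cafe\xcc\x81' | hexdump -C    # ... 63 61 66 65 cc 81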
s3sync also uses a questionable method to store the attributes of the folders it syncs. Since S3 doesn’t have a concept of folders, s3sync creates an object with the name of the folder and stores the folder’s attributes (permissions, owner, etc.) in it. In practice this means s3sync is the only tool that can perform a restore: other tools download that object as a file with the folder’s name and then can’t create the folder’s contents, because the name is already taken.
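In other words, after a sync the bucket ends up containing keys along these lines (names illustrative):

photos                <- small object holding the folder's permissions, owner, etc.
photos/IMG_0001.JPG
photos/IMG_0002.JPG

A generic restore tool downloads “photos” as an ordinary file first and then has nowhere to put “photos/IMG_0001.JPG”.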
Moving to Backblaze
To save money I moved all the services I had running on my Linux VPS to my Mac at home. Since I now only had one machine that needed backing up and I didn’t need Linux compatibility I had some more options. With the S3 issues I was seeing I decided to move to Backblaze. For US$50 per year I get unlimited, versioned, encrypted storage and decent Mac support. They also have a web interface that can be handy for grabbing files from my Mac at home when I’m at work.
The move to Backblaze took a lot longer than I had planned. Uploading nearly 200 GB of data was always going to take a long time, but it took far longer than it needed to due to a questionable default upload limit in the Backblaze application. It wasn’t until after the roughly two-month initial upload had finished that I discovered the setting. So, my advice to anyone performing an initial upload: remove the upload limit!
Deleting All Data in an S3 Bucket
An additional unexpected complexity in the move away from S3 was deleting the data in S3 after the Backblaze upload was complete. It isn’t possible to just delete an S3 bucket and have all its data go with it; you have to delete every object individually. This is where S3 as a backup service started to show its limitations. It’s great if you’re storing just photos or music, but when you start syncing arbitrary files from a computer you end up putting lots of little files up, some of which are quite weird. The prime example is the Mac OS X Icon files, which are actually named ‘Icon\r’ (yes, that’s a carriage return character in the filename). I have no idea why they are named like this, but attempting to delete these files gave me all sorts of grief.
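You can spot these files locally with something like the commands below (the path is a placeholder); in bash, $'Icon\r' is how you write the name with its embedded carriage return.

# Show non-printable characters in filenames as escape codes
ls -b /path/to/folder
# Find the Icon files explicitly by their real name, carriage return included
find /path/to/folder -name $'Icon\r'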
I started deleting everything using s3cmd and its ‘deleteall’ action.
S3CONF=/path/to/conf.yml s3cmd -s -v deleteall bucket-name
-v : verbose - show files
-s : SSL
This ran for a couple of days before I noticed it had got stuck in an infinite loop, deleting the same files over and over again. It was here that I had to step in and start deleting things interactively, and this is when I found the awkward Icon files. I tried several S3 clients for the Mac; Cyberduck ended up being the only one capable of removing them:
- S3 Hub could see the files but not delete them
- S3 Browser couldn’t delete them either
- s3cmd looked like it deleted them but didn’t
Eventually I did get everything deleted but it seemed a lot harder than it should have been.
Conclusion
So that’s my backup story to date. Pretty much everything is stored on two or more hard disks plus remotely on Backblaze.
S3 is a great service, but I don’t think it’s well suited to backup, particularly if you’re storing lots of small files. Using something like Jungle Disk would probably make it a lot more suitable though.
Backup is important, and it can be difficult to get right while keeping it affordable. I believe my current setup keeps my data safe barring any worldwide catastrophes, which I think is good enough. The one remaining thing this strategy doesn’t cover is silent data corruption. Hopefully ZFS will become fully supported in Mac OS X soon and that will no longer be a problem.