Large databases can pose many problems, one of which is how to back up and (more importantly) restore them within the required Recovery Time Objective while losing no more data than is acceptable (the Recovery Point Objective). I’ve recently had the opportunity to experiment with a whole variety of high-end storage, including solid-state cards, solid-state SANs and conventional SANs. Here are my thoughts on backups and restores.
All successful solutions start with a thorough understanding of the requirement. Backups and restores are no exception, but being non-functional requirements they are often overlooked by the business users expressing the application requirement. In essence, the requirement is mostly about restore time (the Recovery Time Objective) and data-loss in the event of a restore (the Recovery Point Objective), but there is also the impact regular backups will have on the performance/functionality of the system (if any). Furthermore, there is an additional set of requirements that must be teased out, including backup retention period (for practical/legal/forensic reasons) and the need to retain copies of this data off-site for disaster recovery purposes. These requirements will shape the backup technique and technology employed as well as the operational processes required.
As with all requirements, the business users will start off by saying that they want the system restored as quickly as possible with no data loss – a view that will quickly be tempered when they are presented with the cost – so it makes sense to have an idea of the general budget available.
It may sound obvious, but you should really understand the backup capabilities (or lack thereof) of your data source. For example, many mature transactional RDBMSs like SQL Server and Oracle have in-built backup features that enable on-line, transactionally consistent backups. However, some sources do not have such a feature (an example being the MyISAM engine in MySQL), requiring a database using this storage engine either to be taken offline during a backup operation, or for a copy of the underlying data to be made (more on this later). Other technologies present their own challenges: many of the NoSQL databases do not support a transactionally consistent backup process because they do not support transactional consistency. Other data sources (e.g. regular document or image files) do not require transactional consistency.
Yet another factor to take into account is whether the data you’re storing is already compressed (in which case it’s highly unlikely you’ll gain anything from backup compression). For example, most image formats (e.g. JPG, GIF, PNG) are already compressed. Further compression tends to yield very minor reductions in size, but uses considerable CPU and therefore extends backup time.
All backup media can be defined according to the following criteria:
Different technologies offer different patterns: full, differential, incremental, snapshot, etc. It’s also possible to mix and match technologies, for example using one method and technology for a local backup that is fast to make and fast to restore (but may well prove too expensive to permit long-term retention), together with a long-term retention technique that backs up to cheap, removable media that can be archived off-site.
The important thing to remember with your backup pattern/s is to check that your backup technique addresses the following challenges:
To meet your Recovery Time Objective, you first need to establish the write speed of the storage subsystem on which your data files reside. This is essential because the write speed of your storage subsystem will place an ultimate cap on the speed with which you can recover the system regardless of your backup subsystem. If your data files can’t be written back to their location within your RTO, you’ve work to do improving this subsystem before you look into the backup system.
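The cap described above is simple arithmetic, and a rough sketch of it follows. The function names and the figures used (2 TB of data, 500 MB/s sustained write speed) are hypothetical; substitute a write speed you have actually measured on your own storage subsystem.

```python
TB = 1000 ** 4  # bytes in a decimal terabyte
MB = 1000 ** 2  # bytes in a decimal megabyte


def min_restore_seconds(data_bytes: int, write_bytes_per_sec: float) -> float:
    """Lower bound on restore time imposed by the write speed of the data volume."""
    if write_bytes_per_sec <= 0:
        raise ValueError("write speed must be positive")
    return data_bytes / write_bytes_per_sec


def meets_rto(data_bytes: int, write_bytes_per_sec: float, rto_seconds: float) -> bool:
    """True if the data could be written back within the RTO at the given speed."""
    return min_restore_seconds(data_bytes, write_bytes_per_sec) <= rto_seconds


if __name__ == "__main__":
    # Hypothetical figures: 2 TB of data files, 500 MB/s sustained write speed.
    seconds = min_restore_seconds(2 * TB, 500 * MB)
    print(f"best case restore: {seconds / 60:.0f} minutes")
    print("meets a 1 hour RTO:", meets_rto(2 * TB, 500 * MB, 3600))
```

This is a best case only: it ignores the read speed of the backup media, decompression and any database recovery work, all of which add to the real restore time.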
Assuming the storage subsystem on which your files reside does permit writes at a speed facilitating your RTO, you’ve now got to consider your backup strategy:
If you’re backing up simple files, you may want to consider the following things:
Is the data already compressed? Most image files are already highly compressed. Highly compressed files cannot be compressed much further, and attempting to do so wastes a large amount of CPU time. Also remember that image files do not require decompression to be useful, an important factor considering that decompression time can vastly affect your RTO.
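One way to answer the compression question is to compress a small sample and look at the ratio before committing CPU to the whole data set. The sketch below uses Python's standard zlib module; the 0.9 threshold and the function names are assumptions for illustration, and random bytes stand in for already-compressed data such as JPG files.

```python
import os
import zlib


def compression_ratio(sample: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    if not sample:
        raise ValueError("sample must not be empty")
    return len(zlib.compress(sample, level)) / len(sample)


def worth_compressing(sample: bytes, threshold: float = 0.9) -> bool:
    """Assumed rule of thumb: compress only if the sample shrinks below the threshold."""
    return compression_ratio(sample) < threshold


# Repetitive log-like text compresses very well.
text_like = b"2024-01-01 INFO backup completed\n" * 2000
# Random bytes behave like data that has already been compressed.
random_like = os.urandom(64 * 1024)

if __name__ == "__main__":
    print("text-like ratio:  ", round(compression_ratio(text_like), 3))
    print("random-like ratio:", round(compression_ratio(random_like), 3))
```

A backup agent could apply a check like this per file type rather than per file, so that the probe itself does not become an overhead.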
A large number of files can adversely affect your backup speed, particularly so when it comes to incremental backups and even more so when the average file size is small. This is because the overhead of checking whether any file has changed since it was last backed up is constant per file, regardless of the file size. Depending upon the number and size of the files, it can sometimes be quicker to back up and restore all the files. A useful utility from the Unix world is the TAR (Tape ARchive) utility, which concatenates files together. A native Windows version of this utility can be downloaded from http://gnuwin32.sourceforge.net/packages/gtar.htm.
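The same concatenation idea is available from Python's standard tarfile module, which may be convenient where installing a separate TAR utility is not an option. This is a minimal sketch: the function names and the sample file layout are hypothetical, and the archive is deliberately written uncompressed.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(source_dir: Path, archive_path: Path) -> int:
    """Concatenate every file under source_dir into one uncompressed tar file.

    Returns the number of files added. archive_path should live outside source_dir.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the archive restores anywhere.
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())
                count += 1
    return count


def restore_archive(archive_path: Path, target_dir: Path) -> None:
    """Unpack a tar file created by archive_directory into target_dir."""
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "images"
        src.mkdir()
        for i in range(1000):  # many small files, the slow case for per-file backups
            (src / f"thumb_{i:04d}.bin").write_bytes(b"\x00" * 512)
        n = archive_directory(src, Path(tmp) / "images.tar")
        print(f"archived {n} files into a single tar file")
```

The backup system then has one large file to move rather than a thousand small ones, at the cost of restoring at archive rather than file granularity.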
How much data do you have to restore in the event of a disaster? If the answer is a prohibitively large amount to meet your RTO, then you may have to revisit the requirement to determine some sort of prioritization so that the data required for your business to get up and running is restored as soon as possible.
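To make the prioritization concrete, the sketch below orders data sets by business priority and works out when each would be back online at a given restore throughput. The data set names, sizes, priorities and the 10 GB per minute throughput are all invented for illustration.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str
    size_gb: float
    priority: int  # 1 = needed first to get the business running


def restore_schedule(datasets, throughput_gb_per_min: float):
    """Return (name, minutes until restored) pairs, highest priority first."""
    elapsed = 0.0
    schedule = []
    for d in sorted(datasets, key=lambda d: d.priority):
        elapsed += d.size_gb / throughput_gb_per_min
        schedule.append((d.name, elapsed))
    return schedule


# Hypothetical data sets for illustration only.
example = [
    Dataset("archive", 3000, 3),
    Dataset("orders", 200, 1),
    Dataset("catalogue", 100, 2),
]

if __name__ == "__main__":
    for name, minutes in restore_schedule(example, throughput_gb_per_min=10):
        print(f"{name:10s} available after {minutes:.0f} minutes")
```

Restored in this order, the data the business needs first is available long before the bulk of the archive has been written back, which may be enough to satisfy the RTO for the critical subset.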
How important is it that all the files are backed up at a single point in time? Is a snapshot truly required to ensure consistency? These questions are especially important when considering the new generation of NoSQL technologies, some of which don’t have their own backup technology built-in.
Some database technologies have their own in-built backup technology, for example SQL Server, Oracle, Sybase, DB2, etc. Where these do exist, they should be used, because most of them offer recovery to a single point in time as well as recovery along a time continuum, permitting the underlying database to be recovered to any point in time. While many SAN vendors offer snapshot technology, SAN snapshots do not offer recoverability to a point in time between snapshots.
If your database does not offer its own backup technology, then you should look and see if your storage supports snapshots. Snapshots enable a copy of the data to be taken at a consistent point in time. Contrary to what SAN vendors may tell you, a snapshot is not a backup – most copy-on-write snapshots are still reliant on some of the same underlying data blocks, so if you lose your underlying data blocks, you’ll lose your snapshot! Other types of snapshots provide a full second copy (snap-clone), but the underlying media may well be the same unless you specify otherwise. What snapshot technology does give you is consistency - you can copy/backup the files from the snapshot elsewhere and you’ll have a real backup on different media.
Incremental backups (backups of the differences since the last full or last incremental backup) and differential backups (backups of ALL the differences since the last full backup) are two techniques to permit recoverability to a single point in time without the overhead of a full backup during a system’s peak hours.
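The practical difference between the two techniques shows up at restore time, in the number of backups that must be applied. The sketch below models this with a hypothetical Backup record and integer timestamps (hours since the full backup); it assumes a schedule where, after the base full backup, you apply the latest differential (if any) followed by any later incrementals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # hypothetical timestamp, e.g. hours since the start of the week
    kind: str      # "full", "differential" or "incremental"


def restore_chain(backups: list[Backup], target: int) -> list[Backup]:
    """Backups to apply, in order, to reach the latest backed-up state at or before target."""
    usable = sorted((b for b in backups if b.taken_at <= target), key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    after_base = [b for b in usable if b.taken_at > base.taken_at]

    chain = [base]
    start = base.taken_at
    # A differential holds ALL changes since the full, so only the latest is needed.
    diffs = [b for b in after_base if b.kind == "differential"]
    if diffs:
        chain.append(diffs[-1])
        start = diffs[-1].taken_at
    # Incrementals each hold one step of changes, so every later one is needed.
    chain += [b for b in after_base if b.kind == "incremental" and b.taken_at > start]
    return chain


incremental_schedule = [Backup(0, "full")] + [Backup(t, "incremental") for t in (24, 48, 72)]
differential_schedule = [Backup(0, "full")] + [Backup(t, "differential") for t in (24, 48, 72)]

if __name__ == "__main__":
    print(len(restore_chain(incremental_schedule, 80)), "backups to apply (incremental)")
    print(len(restore_chain(differential_schedule, 80)), "backups to apply (differential)")
```

Incrementals are smaller to take but lengthen the restore chain; differentials grow over time but keep the chain to two backups, which is often the better trade when the RTO is tight.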
Assuming the read and write speeds of your source and backup systems are evenly matched, you may find that compressing files is still a bottleneck. One highly effective way of increasing backup/restore performance in such a scenario is parallelization. This is because many compression utilities are single-threaded: only one processor core can be brought to bear on the compression of a single file. Where you want to leverage the benefits of parallelization, you first have to find a way to split up the data evenly, such that an Nth of the data is processed by each one of N processors.
Many RDBMSs will permit you to back up data to multiple files, effectively striping the data over the files. Assuming sufficient random-write speed, you should therefore back up to as many files as threads you wish to bring to bear. Note that you may have to reduce the number of files, and therefore threads, from the optimal one-thread-per-server-core if you want to prevent the backup hogging all the CPU on the machine while it runs.
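The split-then-compress idea can be sketched with Python's standard library. This is an illustration under stated assumptions rather than a production backup tool: the function names are hypothetical, the data is held in memory, and threads are used because CPython's zlib module releases the interpreter lock while compressing, so each chunk can occupy its own core.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data: bytes, parts: int) -> list[bytes]:
    """Split data into `parts` contiguous chunks whose sizes differ by at most one byte."""
    base, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def compress_parallel(data: bytes, workers: int) -> list[bytes]:
    """Compress one Nth of the data on each of N workers; results keep chunk order."""
    chunks = split_evenly(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress_parallel(compressed: list[bytes], workers: int) -> bytes:
    """Reverse compress_parallel: decompress each chunk and rejoin them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, compressed))


if __name__ == "__main__":
    payload = bytes(range(256)) * 40_000  # roughly 10 MB of sample data
    parts = compress_parallel(payload, workers=4)
    assert decompress_parallel(parts, workers=4) == payload
    print("chunk sizes after compression:", [len(p) for p in parts])
```

Compressing chunks independently usually costs a little compression ratio compared with a single stream, which is the price paid for using several cores at once.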
Because the network utilisation of regular backups and (less regular) restores is largely a known quantity, you should ensure that you have sufficient network bandwidth to accommodate it. Many large corporates utilise a separate backup network so that backup network utilisation does not impact nightly processes requiring network bandwidth. Another technique that can be used is to ensure that each server runs a backup agent – in this way the backup agent can make decisions on what data to compress and what data is already compressed. Most importantly, compression happens on the server before data traverses the network.
Yet another challenge is moving large amounts of data over high-latency WAN connections, such as to a remote data-centre. In this scenario, it’s important to cut down on the number of back-and-forth communications, so techniques such as parallel transfers of large concatenated files, jumbo frames (where every hop on the path supports them) and UDP-based transfer protocols are invaluable. Note that rsync itself runs over TCP; its value here comes from its delta-transfer algorithm, which sends only the parts of a file that have changed.