Since 2003, the majority of IT organisations have placed some disk in front of their tape libraries. This approach, called “disk staging”, allows for faster and more reliable backups and restores because the latest backups are kept on disk. However, due to the cost of disk, most organisations keep only one to two weeks of retention on disk and hold longer-term retention on tape.
How deduplication has accelerated the move to disk
Keeping multiple copies obviously increases the amount of backup data stored and thus creates a storage challenge. However, the data from one backup to the next is highly redundant. If you have 40TB of data to be backed up, only about 800GB, or 2%, changes from backup to backup. Instead of backing up 40TB over and over, why not back up the 40TB once and then only the changes from then on? This would drastically reduce the amount of disk required. Data deduplication solves the challenge by storing only unique bytes and blocks and not re-storing bytes or blocks that have already been stored. This approach can reduce the amount of disk required by a ratio of about 20:1.

For example, if 40TB of data is kept for 20 weeks, 800TB of storage would be required. However, if the 40TB is compressed 2:1 to 20TB and then only the 2% change between backups is kept, you would store 20TB plus 19 copies at 800GB each, or about 35.2TB of data. In this very simplistic example, the amount of storage required is reduced to roughly 1/20th of what storing the data on disk without data deduplication would need. This is driven by the fact that backup storage keeps weeks, months, and years of retention. By storing only unique bytes and blocks, data deduplication uses far less disk and brings the cost of disk down to about the cost of tape. Disk is faster and more reliable than tape for both backups and restores, so with the advent of data deduplication, many IT organisations have already moved to disk and eliminated tape backup.
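As a sanity check on the arithmetic, the figures from the example above (40TB of data, 2:1 compression, a 2% change rate, 20 weekly retention points) can be plugged into a few lines of Python. These are the same illustrative numbers used in the text, not measurements from any particular product.

```python
# Illustrative storage maths for the example above (assumed figures, not benchmarks).
full_backup_tb = 40.0        # size of one full backup
retention_points = 20        # weekly backups kept
compression_ratio = 2.0      # assumed 2:1 compression on the first full
change_rate = 0.02           # ~2% of the data changes between backups

# Without deduplication: every retention point is stored in full.
without_dedup_tb = full_backup_tb * retention_points                    # 800 TB

# With deduplication: one compressed baseline plus the unique changes
# for each of the remaining retention points.
baseline_tb = full_backup_tb / compression_ratio                        # 20 TB
incremental_tb = full_backup_tb * change_rate                           # 0.8 TB per backup
with_dedup_tb = baseline_tb + (retention_points - 1) * incremental_tb   # 35.2 TB

print(f"Without dedup: {without_dedup_tb:.1f} TB")
print(f"With dedup:    {with_dedup_tb:.1f} TB")
print(f"Reduction:     {without_dedup_tb / with_dedup_tb:.1f}:1")       # ~22.7:1, roughly 20:1
```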
Choosing the right approach to deduplication is vital
One of the downsides of data deduplication is that it is a compute-intensive process, as all of the data has to be split into blocks and compared against what has already been stored. Depending on how deduplication is implemented, backup and restore speed can be greatly impacted. In some cases, the data is stored only as deduplicated bytes and blocks and has to be ‘rehydrated’ for every restore request.
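To illustrate why this is compute-intensive, here is a minimal sketch of the split-and-compare step: every byte of every backup is chunked and hashed, and every fingerprint is looked up against an index of what has already been stored. Fixed-size 4KB chunks and SHA-256 fingerprints are assumptions made for brevity; shipping products typically use variable-size chunking and their own fingerprinting and indexing schemes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunks; real systems typically chunk on variable boundaries

def dedupe_stream(data: bytes, chunk_store: dict[str, bytes]) -> list[str]:
    """Split a backup stream into chunks, hash each one, and keep only unseen chunks.
    Returns the ordered list of chunk fingerprints (the 'recipe' for this backup)."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()   # every byte gets hashed...
        if fingerprint not in chunk_store:                # ...and compared against the index
            chunk_store[fingerprint] = chunk              # only unique chunks consume disk
        recipe.append(fingerprint)
    return recipe
```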
There are three major areas of impact. The first is backup performance, which determines the backup window, or the time it takes to complete a backup. One implementation approach is inline deduplication, where the data is deduplicated in the data path on its way to disk. This approach can slow backups down, as it puts a very compute-intensive process in front of every write. The alternative is to write direct to disk and perform deduplication after the data is committed, but still in parallel with incoming backups. This allows for the fastest backup performance.
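The difference between the two ingest models can be contrasted in a simplified sketch. The `deduplicate()` function below is a deliberately crude stand-in for the chunk/hash/compare work; what matters is the ordering: inline deduplication does that work in the data path before the backup completes, while the write-direct-to-disk approach commits the data first and queues the deduplication to run afterwards, in parallel with other incoming backups.

```python
import hashlib

def deduplicate(data: bytes) -> str:
    """Stand-in for the compute-heavy chunk/hash/compare work (simplified to one hash)."""
    return hashlib.sha256(data).hexdigest()

def inline_ingest(backup: bytes, dedup_store: dict[str, bytes]) -> None:
    # Deduplication happens in the data path: the backup is not complete
    # until the chunking, hashing, and index lookups are done.
    fingerprint = deduplicate(backup)
    dedup_store.setdefault(fingerprint, backup)

def post_process_ingest(backup: bytes, landing_zone: list[bytes],
                        dedup_queue: list[bytes]) -> None:
    # Data is committed straight to disk at full speed; the deduplication
    # work is queued and performed afterwards, outside this job's backup
    # window but in parallel with other incoming backups.
    landing_zone.append(backup)
    dedup_queue.append(backup)
```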
The second area of impact is the storage architecture that the deduplication is deployed on. A traditional scale-up architecture has a front-end controller with disk shelves; as data grows, disk shelves are added. Using this approach, the backup window invariably gets longer as data grows, because no additional compute resources are added: the more data there is, the longer it takes to deduplicate, since only disk capacity is added, with no additional processor, memory, or bandwidth. The backup window will ultimately grow to the point where you have to replace the controller with a bigger, faster controller, which increases cost. The alternative is a scale-out architecture, where appliances are added into a grid. As data grows, the backup window stays fixed in length because each appliance brings processor, memory, and bandwidth as well as disk. Both compute and capacity resources are therefore added, providing additional deduplication resources and resulting in a fixed-length backup window.
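A simple model shows why the two architectures diverge as data grows. The per-appliance throughput and capacity figures below are invented for illustration only; the point is that scale-up throughput stays fixed while the data grows, whereas scale-out throughput grows with every appliance added, so the backup window stays flat.

```python
# Illustrative backup-window model (all figures are assumed, not vendor specifications).
appliance_throughput_tb_per_hr = 10.0   # hypothetical ingest + dedup rate per appliance
appliance_capacity_tb = 40.0            # hypothetical usable capacity per appliance

for data_tb in (40, 80, 160, 320):
    # Scale-up: capacity is added but compute is not, so throughput stays fixed.
    scale_up_window = data_tb / appliance_throughput_tb_per_hr

    # Scale-out: an appliance (compute + capacity) is added per 40 TB of data,
    # so aggregate throughput grows in step with the data.
    appliances = data_tb / appliance_capacity_tb
    scale_out_window = data_tb / (appliances * appliance_throughput_tb_per_hr)

    print(f"{data_tb:4d} TB  scale-up window: {scale_up_window:5.1f} h   "
          f"scale-out window: {scale_out_window:4.1f} h")
```

In this toy model the scale-up window doubles every time the data doubles, while the scale-out window stays constant.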
The third area of impact revolves around restores, VM boots, and offsite tape copies. With an inline approach, the stored data is 100% deduplicated, and for each restore request the data needs to be rehydrated, which takes time. If the alternative approach is used – writing direct to disk and deduplicating afterwards – the most recent backups sit in their complete, undeduplicated form, ready for fast restores, instant VM boots, and fast tape copies. The difference in restore time, boot time, and copy time between the two approaches can be measured in minutes versus hours. The best approach keeps the most recent backups in full form for fast restores, recoveries, tape copies, and VM boots, while longer-term retention is deduplicated.
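The two restore paths can be contrasted in a short sketch, using in-memory dictionaries as simplified stand-ins for a landing zone of full backups and a fully deduplicated chunk store. The extra work in the second function is the rehydration step described above: every chunk has to be looked up and the stream reassembled before anything can be handed back.

```python
def restore_from_landing_zone(landing_zone: dict[str, bytes], backup_id: str) -> bytes:
    """Recent backups kept in full, undeduplicated form: a restore is a straight read."""
    return landing_zone[backup_id]

def restore_from_dedup_store(recipes: dict[str, list[str]],
                             chunk_store: dict[str, bytes],
                             backup_id: str) -> bytes:
    """Fully deduplicated storage: every restore must rehydrate the backup by
    looking up each chunk fingerprint and reassembling the stream in order."""
    return b"".join(chunk_store[fingerprint] for fingerprint in recipes[backup_id])
```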
As can be seen above, with data deduplication, architecture matters. Understanding this before making your disk-based backup choice will help you avoid costly mistakes. You cannot simply buy disk with deduplication or bolt deduplication onto a backup application media server. You need to understand the different architectural approaches and their impact on backup performance and backup window length as data grows, as well as their impact on restore, VM boot, and tape copy performance.
The key to making a good decision is understanding both that data deduplication is a must for reducing the amount of data stored, resulting in the lowest-cost disk backup, and that not all deduplication is created equal. How deduplication is implemented can make or break your backups and significantly impact the cost of backup storage.
This article was written and supplied by ExaGrid.