Deduplication – It’s not just about capacity
There is no debating that deduplication is one of the hottest topics in IT. The question is whether the hype has started to become bigger than the technology. Today, there are two primary use cases driving deduplication in the marketplace. The first is backup to disk and the second is virtual guest operating systems (VMware, Hyper-V, and Xen guests). (I will talk a bit about the backup to disk scenario in this article and the virtual guest topic in the next one.) These are both logical markets to adopt deduplication because they share a common challenge: both create a tremendous amount of redundant data on the disk array. The goal in both cases is to pack more data onto a disk drive and reduce the cost per GB. This is the first and most obvious use case for deduplication.
Disk drive capacity is growing exponentially, but disk performance is increasing at a much slower rate. In many cases, when helping customers size for their workload, it is performance rather than capacity that dictates the spindle count. It is easy to meet the capacity needs with large drives, but will those drives meet the performance requirement? That is the problem. It is no longer sufficient to size a storage device based solely on capacity requirements; this is a general challenge that must be taken into account when sizing any storage array.
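To make the sizing argument concrete, here is a minimal sketch of the arithmetic. The function name and every number in it (a 20 TB data set, 3,000 IOPS, 2 TB drives at roughly 100 IOPS each) are hypothetical figures chosen for illustration, not measurements from any particular array.

```python
import math

def spindles_needed(capacity_tb, workload_iops, drive_tb, drive_iops):
    """Return (drives for capacity, drives for performance, drives required).

    Simplified model: ignores RAID overhead, caching, and hot spares.
    """
    for_capacity = math.ceil(capacity_tb / drive_tb)
    for_performance = math.ceil(workload_iops / drive_iops)
    # The array has to satisfy both requirements, so the larger count wins.
    return for_capacity, for_performance, max(for_capacity, for_performance)

# Hypothetical workload: 20 TB that must sustain 3,000 random IOPS,
# on hypothetical 2 TB drives assumed to deliver about 100 IOPS each.
cap, perf, required = spindles_needed(20, 3000, 2, 100)
print(f"capacity needs {cap} drives, performance needs {perf}, buy {required}")
```

With these assumed numbers, capacity alone would call for 10 drives while performance calls for 30, so performance sets the spindle count.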
So how is the growing disparity between capacity and performance affected by deduplication? Deduplication can make the performance issue worse by reducing the number of spindles even further. If the bottleneck in the storage device is the spindles, then using deduplication to pack more data onto those spindles is only going to exacerbate the situation.
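A rough sketch of that effect, under a deliberately simple assumption: deduplication shrinks the physical capacity required but leaves the IO demand of the workload unchanged. The function name, the deduplication ratio, and the drive figures below are all hypothetical.

```python
import math

def load_per_drive(logical_tb, workload_iops, drive_tb, dedup_ratio):
    """Size an array purely by capacity and report the IOPS each drive must serve.

    Assumes (for illustration only) that deduplication reduces stored
    capacity by dedup_ratio but does not reduce the workload's IO demand.
    """
    physical_tb = logical_tb / dedup_ratio
    drives = math.ceil(physical_tb / drive_tb)
    return drives, workload_iops / drives

DRIVE_IOPS = 100  # hypothetical per-drive capability

# Hypothetical workload: 40 TB of logical data, 1,500 IOPS, 2 TB drives.
for ratio in (1, 4):
    drives, iops_each = load_per_drive(40, 1500, 2, ratio)
    verdict = "ok" if iops_each <= DRIVE_IOPS else "spindle-bound"
    print(f"dedup {ratio}:1 -> {drives} drives, {iops_each:.0f} IOPS per drive ({verdict})")
```

In this example, the array without deduplication spreads the load over 20 drives at 75 IOPS each, while a 4:1 ratio leaves 5 drives facing 300 IOPS each, three times what the assumed drive can deliver.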
Let’s take a closer look at sizing storage for a backup to disk workload. Delivering on the highly sequential read and write requirements of disk to disk backups is much easier than serving a more random workload. Disk drives do a great job with sequential reads and writes. This makes backup to disk all about sizing for capacity. When deduplication is added into the mix, the disk drives should still meet the performance requirement as long as the deduplication technology being used does not turn sequential IO into random IO. This is why it is important to understand how a specific deduplication implementation works.
The reality is that nearly every other IT workload is more random than backup to disk. If deduplication were used to pack more data onto the same number of spindles for a highly random workload, the spindles would likely not meet the performance requirements. Does that mean deduplication is a point solution for highly sequential workloads? I do not believe so. I am working on an entry covering the potential performance benefits of deduplication in a more random environment.
Interesting thoughts. My response is specific to using dedupe for backup only. This is why ExaGrid really thought through “backup” and architected a product from the ground up that really “makes backup better”. It makes a whole lot of sense to land your backups on disk…at the speed of disk…thus using a post-processing approach. This shrinks the backup window and keeps performance high. Inline dedupe slows things down. Then the most recent backup is kept in full, simply compressed, so that it is ready for a rapid restore without “re-hydrating”. When the next backup lands, it is compared to the previous backup and only the bytes that have changed are kept. Combine this with a GRID architecture that brings not only additional capacity into a virtualized single pool of storage, but all the necessary processor, memory and bandwidth…so that as your data grows your performance stays the same. If you just add capacity as another “shelf”, as you point out, you are sending more data through the same processor head and slowing things down further. This is a very “purpose-built” approach to disk backup with deduplication. I encourage you to learn more about ExaGrid. http://www.exagrid.com