Deduplication – Sometimes it’s about performance

June 11th, 2009 Jesse St. Laurent

In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most obvious cost – the cost per GB of disk capacity. This market has grown quickly over the last few years. Both startups and established storage vendors have products that compete in the space. They are most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions.

Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another somewhat less obvious benefit of deduplication.

What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. What about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, this becomes very interesting.

The cost per GB of cache (DRAM) is orders of magnitude higher than the cost per GB of hard drives. If the goal is to reduce storage capital expenses by making the storage array more efficient, then let’s focus on the most expensive component: the cache. If the physical data footprint on disk is reduced, then it is logical that the array cache should also benefit from that space savings. If the same deduplicated physical block is accessed multiple times through different logical blocks, then it should result in one read from disk and many reads from cache.
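To make the "one read from disk, many reads from cache" idea concrete, here is a minimal sketch of a read cache that is keyed by physical block rather than by logical address. It is not how any particular array implements it; the class name `DedupCache`, the logical-to-physical map, and the tiny in-memory "disk" are all assumptions made up for illustration.

```python
from collections import OrderedDict


class DedupCache:
    """Hypothetical LRU read cache keyed by physical block ID, not logical address."""

    def __init__(self, capacity_blocks, logical_to_physical, disk):
        self.capacity = capacity_blocks
        self.l2p = logical_to_physical  # logical block address -> physical block ID
        self.disk = disk                # physical block ID -> data (stand-in for real disk)
        self.cache = OrderedDict()      # physical block ID -> data, oldest entry first
        self.disk_reads = 0
        self.cache_hits = 0

    def read(self, lba):
        pba = self.l2p[lba]             # resolve the deduplicated physical block
        if pba in self.cache:
            self.cache.move_to_end(pba)  # mark as most recently used
            self.cache_hits += 1
            return self.cache[pba]
        data = self.disk[pba]           # cache miss: go to disk once
        self.disk_reads += 1
        self.cache[pba] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used block
        return data


# Three logical blocks share one physical block; a fourth is unique.
disk = {0: b"common OS block", 1: b"unique block"}
l2p = {100: 0, 200: 0, 300: 0, 400: 1}
cache = DedupCache(capacity_blocks=2, logical_to_physical=l2p, disk=disk)
for lba in (100, 200, 300, 400):
    cache.read(lba)
print(cache.disk_reads, cache.cache_hits)  # 2 2
```

Four logical reads turn into two disk reads, and the shared block occupies a single cache slot no matter how many logical addresses point at it.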

Storing VMware, Hyper-V, or Xen virtual machine images creates a tremendous amount of duplicate data. It is not uncommon to see storage arrays that are storing tens or even hundreds of these virtual images. If 10 or 20 or all of those virtual servers need to boot at the same time, then it places an extreme workload on the storage array. If all of these requests have to hit disk, the disk drives will be overwhelmed. If each duplicate logical block requires its own space in cache, then the boot data will push everything else out of the cache. If each duplicate block is read off disk once and stored in cache once, then the clients will boot quickly and most of the existing cache contents will survive. Preserving the current contents of the cache ensures that the performance of the rest of the applications in the environment is not impacted.
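A rough way to see the difference is to simulate a boot storm twice: once with the cache keyed by (VM, block) and once keyed by the shared block. This is only a toy model. It assumes every image is identical (the best case for deduplication), that the VMs boot in lockstep, and that the cache is a simple LRU; the function name and the numbers are invented for the example.

```python
from collections import OrderedDict


def simulate_boot_storm(num_vms, blocks_per_image, cache_blocks, dedup):
    """Return (disk_reads, evictions) when every VM reads its boot image once.

    Assumes all images are identical and the VMs read block by block in lockstep.
    """
    cache = OrderedDict()
    disk_reads = 0
    evictions = 0
    for block in range(blocks_per_image):
        for vm in range(num_vms):
            # With dedup, all VMs share one cache entry per block.
            key = block if dedup else (vm, block)
            if key in cache:
                cache.move_to_end(key)
                continue
            disk_reads += 1
            cache[key] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # LRU eviction
                evictions += 1
    return disk_reads, evictions


plain = simulate_boot_storm(20, 1000, 500, dedup=False)
deduped = simulate_boot_storm(20, 1000, 500, dedup=True)
print("without dedup:", plain)    # (20000, 19500)
print("with dedup:   ", deduped)  # (1000, 500)
```

In this model the deduplicated cache does one twentieth of the disk reads and churns through far fewer cache slots, which is the part that protects the other applications sharing the array.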

This virtual machine example does make a couple of assumptions. The most significant is that the storage controllers can keep up with the workload. Deduplication on disk allows for fewer disk drives. Deduplication in cache expands the logical cache size and makes the disk drives more efficient. Neither of these does anything to reduce the performance demand on the CPU, Fibre Channel, network, or bus infrastructure on the controllers. In fact, they likely place more demand on these controller resources. If the controller is spending less time waiting for disk drives, it needs the horsepower to deliver higher IOPS from cache with less latency.
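A back-of-the-envelope calculation shows why the controller ends up working harder. If the hosts keep a fixed number of I/Os outstanding, the rate the controller has to sustain is roughly that number divided by the average latency (Little's law). The latencies and queue depth below are illustrative assumptions, not measurements from any array.

```python
def offered_iops(hit_ratio, outstanding_ios=64,
                 cache_latency_s=0.0001, disk_latency_s=0.005):
    """Estimate the IOPS a controller must sustain at a given cache hit ratio.

    All default values are made-up, illustrative figures.
    """
    # Average service time is a weighted mix of cache hits and disk reads.
    avg_latency_s = hit_ratio * cache_latency_s + (1 - hit_ratio) * disk_latency_s
    # Little's law: throughput = outstanding requests / average latency.
    return outstanding_ios / avg_latency_s


for hit_ratio in (0.0, 0.5, 0.95):
    print(f"hit ratio {hit_ratio:.2f}: ~{offered_iops(hit_ratio):,.0f} IOPS")
```

As the hit ratio climbs, the disks drop out of the picture and the same set of hosts drives many times more I/O through the controller's CPU, ports, and buses.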

This same deduplication theory applies to flash technology as well. Whether the flash devices are being used to expand cache or provide another storage tier, they should be deduplicated. Flash devices are more expensive than disk drives, so let’s get the capacity utilization up as high as possible.

As the mainstream storage vendors start to bring deduplication to market for primary storage, it will be interesting to see how they deal with these challenges. Deduplication for primary storage presents an entirely different set of requirements from backup to disk.


