Deduplication – The NetApp Approach
After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of this test is to compare the storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters. The test is random 4KB reads against a 100GB file, which is significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp.
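For readers who want to reproduce this kind of workload, here is a minimal Python sketch of a random 4KB read loop. It is not the tool used for the numbers in this post (the post does not name one), and the function names `random_block_offsets` and `random_reads` are made up for illustration. It also does nothing to bypass client-side caching, which would have to be handled separately, for example through mount options.

```python
import os
import random

BLOCK_SIZE = 4096  # 4KB, matching the read size used in the test


def random_block_offsets(file_size, count, block_size=BLOCK_SIZE, seed=None):
    """Return `count` random block-aligned offsets inside a file of `file_size` bytes."""
    rng = random.Random(seed)
    total_blocks = file_size // block_size
    if total_blocks == 0:
        raise ValueError("file is smaller than one block")
    return [rng.randrange(total_blocks) * block_size for _ in range(count)]


def random_reads(path, count, block_size=BLOCK_SIZE, seed=None):
    """Issue `count` random block-sized reads against `path` and return the bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        file_size = os.fstat(fd).st_size
        bytes_read = 0
        for offset in random_block_offsets(file_size, count, block_size, seed):
            # os.pread reads at an absolute offset without moving the file position
            bytes_read += len(os.pread(fd, block_size, offset))
        return bytes_read
    finally:
        os.close(fd)
```

Pointing `random_reads` at a large file on the NFS mount and timing the call gives a rough operations-per-second figure to compare against what the storage system reports.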
To make the results as easy to observe as possible, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real-world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.
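To see just how redundant a file of zeros is at the block level, the following Python sketch splits a buffer into 4KB blocks and counts the distinct ones. Using SHA-256 as the fingerprint is my assumption for illustration only and says nothing about how NetApp identifies duplicate blocks. The helper name `unique_block_ratio` is likewise made up, and a 1 MiB buffer stands in for the 100GB file.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, the same size used for the reads in this test


def unique_block_ratio(data, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) for `data` split into fixed-size blocks."""
    fingerprints = set()
    total = 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Identical blocks produce identical digests, so the set keeps one entry per distinct block
        fingerprints.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(fingerprints)


# 1 MiB of zeros stands in for the 100GB file created from /dev/zero
zeros = bytes(1024 * 1024)
total, unique = unique_block_ratio(zeros)
print(total, unique)  # 256 1
```

Every block is identical, so in principle the whole file collapses to a single stored block. Real data sets will land somewhere far less extreme.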
This is the output from sysstat -x during the first test. The data is being transferred over NFS and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)
Random 4KB reads from a 100GB file – pre-deduplication:
CPU   NFS  CIFS  HTTP  Total    Net   kB/s   Disk   kB/s   Tape   kB/s  Cache  Cache    CP  CP  Disk  FCP  iSCSI   FCP  kB/s  iSCSI  kB/s
                                 in    out   read  write   read  write    age    hit  time  ty  util                in   out     in   out
19%  6572     0     0   6579   1423  27901  23104     11      0      0      7    16%    0%   -  100%    0      7     0     0      0     0
19%  6542     0     0   6549   1367  27812  23265    726      0      0      7    17%    5%   T  100%    0      7     0     0      0     0
19%  6550     0     0   6559   1305  27839  23146     11      0      0      7    15%    0%   -  100%    0      9     0     0      0     0
19%  6569     0     0   6576   1362  27856  23247    442      0      0      7    16%    4%   T  100%    0      7     0     0      0     0
19%  6484     0     0   6491   1357  27527  22870      6      0      0      7    16%    0%   -  100%    0      7     0     0      0     0
19%  6500     0     0   6509   1300  27635  23102    442      0      0      7    17%    9%   T  100%    0      9     0     0      0     0
The system is delivering an average of 6536 NFS operations per second across these samples. The cache hit rate hovers around 16%. As you can see, the working set does not fit in primary cache. This makes sense. The 3170 has 16GB of primary cache and we are randomly reading from a 100GB file. With uniformly random reads, we would expect a cache hit rate of about 16% (16GB cache / 100GB working set), and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of. So what happens if we deduplicate this data?
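The arithmetic behind those figures can be checked with a few lines of Python, using only the six samples shown in the sysstat output above. The per-operation transfer size is my own derived sanity check and not a number that sysstat reports.

```python
# Values copied from the six sysstat samples above
nfs_ops = [6572, 6542, 6550, 6569, 6484, 6500]            # NFS operations per second
net_out_kb = [27901, 27812, 27839, 27856, 27527, 27635]   # Net kB/s out
cache_hit_pct = [16, 17, 15, 16, 16, 17]                  # Cache hit %

avg_ops = sum(nfs_ops) / len(nfs_ops)
avg_hit = sum(cache_hit_pct) / len(cache_hit_pct)

# Outbound network throughput divided by operations gives the data sent per read
kb_per_op = sum(net_out_kb) / sum(nfs_ops)

# Uniformly random reads over a 100GB file with a 16GB cache
expected_hit = 16 / 100

print(f"average NFS ops/s:  {avg_ops:.0f}")       # 6536
print(f"average cache hit:  {avg_hit:.1f}%")      # 16.2%
print(f"kB sent per op:     {kb_per_op:.2f}")     # 4.25
print(f"expected cache hit: {expected_hit:.0%}")  # 16%
```

At roughly 4.25 kB sent per operation, the network traffic is consistent with 4KB reads plus some protocol overhead, and the observed hit rate sits right on top of the 16% expected from the cache-to-file size ratio.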