Archive for the ‘Storage’ Category

Jumbo Frames for NFS & iSCSI VMWare Datastores

June 1st, 2010 Jesse St. Laurent Comments off

We have been working on a comparison between VMware datastores running on NFS, iSCSI, and FC. (Stay tuned. We will publish those results shortly.) Along the way we were reminded of the performance boost that jumbo frames can provide. These tests were run using the same ‘boot storm’ test harness on the server side we have used before (details can be found at the end of this post). The question is, “How much faster will ESX be with jumbo frames enabled?”

Let’s jump right to the answer… Read more…

Categories: Storage, Systems

Oracle/Sun F20 Flash Card – How fast is it?

April 15th, 2010 Jesse St. Laurent Comments off

I received several questions about the performance of the Oracle/Sun F20 flash card I used in my previous post about block alignment, so I put together a quick overview of the card’s performance capabilities. The following results are from testing the card in a dual socket 2.93GHz Nehalem (X5570) system running Solaris x64. This is similar to the server hardware Oracle uses in the Exadata V2.

The F20 card is a SAS controller with 4 x 24GB flash modules attached to it. You can find more info on the flash modules on Adam Leventhal’s blog and the official Oracle product page has the F20 details.

All of my tests used 100% random 4KB blocks. I focused on random operations, because in most cases it is not cost effective to use SSD for sequential operations. These tests were run with a variety of different thread counts to give an idea of how the card scales with multiple threads. The first test compared the performance of a single 24GB flash module to the performance of all 4 modules. Read more…

Block alignment is critical

March 26th, 2010 Jesse St. Laurent Comments off

Block alignment is an important topic that is often overlooked in storage. I read a blog entry by Robin Harris a couple of months back about the importance of block alignment with the new 4KB drives. I was curious to test the theory on one of the new 4KB drives, but I did not have one on hand. That got me thinking about Solid State Disk (SSD) devices. If filesystem misalignment hurts traditional spinning disk performance, how would it impact SSD performance? In short, it is ugly.
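To make the cost of misalignment concrete, here is a minimal Python sketch (not taken from the original tests) that counts how many physical blocks a single I/O touches. The function name and the 63-sector starting offset are assumptions chosen for illustration; the 4KB physical block size matches the drives discussed above.

```python
def physical_blocks_touched(offset: int, length: int, block_size: int = 4096) -> int:
    """Count the physical blocks covered by an I/O spanning [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // block_size                 # index of the first block touched
    last = (offset + length - 1) // block_size   # index of the last block touched
    return last - first + 1


# A 4KB I/O that starts on a 4KB boundary touches exactly one physical block.
aligned = physical_blocks_touched(offset=0, length=4096)

# The same I/O shifted by 63 x 512-byte sectors (an assumed partition offset,
# used here only as an example) straddles two physical blocks.
misaligned = physical_blocks_touched(offset=63 * 512, length=4096)

print(aligned, misaligned)  # 1 2
```

Every misaligned 4KB request in this model costs two physical block accesses instead of one, which is one plausible reason random I/O suffers so visibly.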

Here is a chart showing the difference between aligned and misaligned random read operations to a Sun F20 card. I guess it is officially an Oracle F20 card. Read more…

TechForum Presentation

March 12th, 2010 Jesse St. Laurent Comments off

I spoke at TechForum in New York earlier this week. Here is a copy of my presentation for anyone who is interested. The official title is “Rethinking Storage Strategies: How Virtualization is Transforming Storage.” At a high level, I spoke about the current trends in storage and how they play together with server virtualization. I do not think it will have the same impact without the running commentary, so feel free to comment here or drop me a line if you have any questions.

  Storage Trends and Server Virtualization (199.0 KiB)

Exadata V2 Surprises

February 22nd, 2010 Peter Galvin Comments off

When Oracle announced the Exadata V2 database appliance late last year, it created quite a stir. The performance numbers for the box are extremely high, and the feature set and capacity are quite large.

Last week we had an executive briefing for folks interested in Exadata V2. My colleagues Kurt Rosenfeld and John Laferrier presented information on business intelligence and the Exadata, as well as the business case and use cases for considering buying one. Joe LaFlamme from Oracle presented some reference customer examples.

I presented the Exadata V2 technical overview, traveling through the architecture details, migration strategies, and component details. Along the way there were a few points I made that seemed a bit surprising to the audience, and that led to a lively discussion. I summarize those points here, as they do not seem to be well known within the industry.

  • Existing Oracle licenses are transferable to Exadata (including Oracle DB, RAC, and Partitioning). That can greatly reduce the cost of an Exadata that is being used for database consolidation, for example.
  • The Exadata looks to be an excellent consolidation engine. Included with the Exadata software are resource management tools that can, for example, give some databases resource priority over others. These tools also allow the use of the flash storage to be fine tuned, pinning specific tables into flash or letting Oracle use the flash as an extended cache.
  • The Exadata V2 is designed to be able to perform OLTP and Data Warehouse transactions concurrently. If a single system can be used both ways, consider the implications compared to stand-alone, separate Data Warehouse solutions. Normally data must be extracted from the OLTP system, copied to the DW system, imported there, and then processed. The extraction and copying are overhead, on both the OLTP and DW systems. And, any reports or queries on the DW system are performed against “stale data” – data from the time the extraction started. Now consider being able to do DW operations against live, current OLTP data. And according to the performance numbers published by Oracle, those operations could run much faster than on most DW systems. That speed could result in completing more complex reports, the allowing of more ad hoc queries, and so on. Such a change could be a fundamental advantage to DW consumers (finance and senior management, for example).
  • Read more…

Categories: Storage, Systems

VMware boot storm on NetApp – Part 2

December 28th, 2009 Jesse St. Laurent 2 comments

I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second generation Performance Accelerator Modules (PAM 2) in hand and will be publishing updated VMware boot storm results with the larger capacity cards.

What type of disk were the virtual machines stored on?

  • The virtual machines were stored on a SATA RAID-DP aggregate.

What was the rate of data reduction through deduplication?

  • The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. Deduplication reduced the physical footprint of the data by 97%.

Here are a few interesting stats gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.

  • The CPU utilization moved inversely with the boot time. The shorter the boot time, the higher the CPU utilization. This is not surprising, as during the faster boots the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPUs could stay more utilized.
  • The total NFS operations required for each test was 2.8 million.
  • The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
  • The total GB read from disk trended down between cold and warm cache boots. This is what I expected, and I would be somewhat concerned if it were not true.
  • The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this was not the case.
  • The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.

How much disk load was eliminated by the combination of dedup and PAM?

  • The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM (or 32GB of extended dedup aware cache) dropped the amount of data read from disk to less than 4GB.
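For readers who prefer percentages, the figures above can be turned into relative reductions with a few lines of Python. This is only arithmetic on the approximate numbers quoted in the answer; the 4GB value is an upper bound ("less than 4GB"), so the last figure is a lower bound on the real reduction. The function name is my own.

```python
def reduction_pct(baseline_gb: float, observed_gb: float) -> float:
    """Percentage of disk reads eliminated relative to the baseline."""
    return 100.0 * (baseline_gb - observed_gb) / baseline_gb


baseline_gb = 67.0        # cold boot, no dedup, no PAM (approximate)
dedup_only_gb = 16.0      # cold boot, dedup, no PAM (approximate)
dedup_plus_pam_gb = 4.0   # cold boot, dedup, 2 PAM (upper bound)

print(f"dedup only:   {reduction_pct(baseline_gb, dedup_only_gb):.0f}%")      # 76%
print(f"dedup + PAM: >{reduction_pct(baseline_gb, dedup_plus_pam_gb):.0f}%")  # >94%
```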

Building a ZFS Deduplication System

December 24th, 2009 Peter Galvin 3 comments

The news of Sun integrating an in-line deduplication feature into ZFS has created quite a buzz in storage circles. And our clients have been asking us about how to gain access to this new feature. This blog post describes the steps needed to build an OpenSolaris server, integrate the deduplication feature, and enable it.

For details about the ZFS deduplication feature, what it does, and how it does it, have a look at Jeff Bonwick’s blog post on the topic. He was the lead engineer on the project so you can take his word on it.

Deduplication was integrated into OpenSolaris build 128. That takes a little explanation. Solaris is Sun’s current commercial operating system. OpenSolaris has two flavors – the semiannual supportable release, and the frequently updated developer release. The current supportable release is called 2009.06 and is available for download here. Also at that location is the “SXCE” latest build. That distribution is more like Solaris 10 – a big ol’ DVD including all the bits of all the packages. OpenSolaris is the acknowledged future of Solaris, including a new package manager (more like Linux) and a live-CD image that can be booted for exploration, and installed as the core release. To that core more packages can be added via the package manager.
Read more…

Categories: Storage, Systems

VMware boot storm on NetApp

November 1st, 2009 Jesse St. Laurent Comments off

UPDATE: I have posted an update to this article here: More boot storm details

Measuring the benefit of cache deduplication with a real world workload can be very difficult unless you try it in production. I have written about the theory in the past and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.

The test infrastructure consists of 4 dual socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch. A FAS3170 is connected to the same 10GbE switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results. Instead we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores and the guests were split evenly across the 4 physical servers.

The test consisted of booting all 200 guests simultaneously. This test was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. Here is a table summarizing the boot timing results, measured as the amount of time between starting the boot and the 200th system acquiring an IP address: Read more…

Categories: Storage, Systems

ZFS Capacity Usage – Optimizing Compression and Record Size Settings

October 2nd, 2009 Jesse St. Laurent Comments off

I have migrated some data to ZFS filesystems recently and the capacity consumed has surprised me a couple times. In general, it has appeared that the data uses more capacity when stored on the ZFS filesystem. This prompted me to do a little investigating. Is ZFS using more capacity? Is it simply a reporting anomaly? Where is that space going? Does ZFS record size have a major impact? Does enabling compression have a significant impact?

In part, the extra space use is a result of ZFS reporting space utilization differently than other filesystems. When a ZFS filesystem is formatted, almost no capacity is used. A df command will show nearly the entire raw capacity. Many other filesystems take a portion of the raw capacity off the top and reserve it for metadata. This reserve will not show up in df. As data is added to the ZFS filesystem, blocks are allocated for both data and metadata. Both the data and metadata blocks will show up as used capacity. In many other filesystems, at least some of the metadata blocks will be taken from the reserve and only the data blocks will show as consumed capacity. For example, in Solaris, the du command will return the capacity used by the data blocks in a file. In ZFS, that du command returns the total space consumed by the file including metadata and compression. So the question at hand is, when storing a given set of files, does ZFS use more total space than other file systems? That one is difficult to test, given all the variables. But we can test various ZFS configuration options to determine the best settings for minimizing block use.
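The difference between a file's logical size and the space it actually occupies can be observed from any POSIX system. The sketch below is an illustration rather than the method used in the post: it compares st_size (the apparent size) with st_blocks (allocated space, reported in 512-byte units on POSIX systems). The helper name space_report is hypothetical, and exactly what the allocated figure includes depends on the filesystem.

```python
import os
import sys


def space_report(path: str) -> tuple[int, int]:
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is the logical file length; allocated_bytes is what the
    filesystem reports as consumed. st_blocks is POSIX-only and counted in
    512-byte units regardless of the filesystem's own block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    # Usage: python space_report.py file1 [file2 ...]
    for path in sys.argv[1:]:
        apparent, allocated = space_report(path)
        print(f"{path}: apparent={apparent} bytes, allocated={allocated} bytes")
```

On a compressed filesystem the allocated figure can be smaller than the apparent size; with small files and a large record size it can be noticeably larger.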

Read more…

Deduplication – The NetApp Approach

July 20th, 2009 Jesse St. Laurent 5 comments

After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters. The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp.

To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.
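To see why a zero-filled file is the extreme case for block-level deduplication, the following Python sketch splits a buffer into 4KB blocks and counts how many are distinct. It is a toy model only: SHA-256 is my choice of fingerprint for the illustration and says nothing about how NetApp identifies duplicate blocks, and a 1 MiB buffer stands in for the 100GB file.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB, the block size used in the test above


def count_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return (total_blocks, unique_blocks) for fixed-size blocks of data."""
    digests = set()
    total = 0
    for start in range(0, len(data), block_size):
        digests.add(hashlib.sha256(data[start:start + block_size]).digest())
        total += 1
    return total, len(digests)


# 1 MiB of zeros, similar in spirit to writing from /dev/zero.
zeros = bytes(256 * BLOCK_SIZE)
total, unique = count_blocks(zeros)
print(total, unique)  # 256 1
```

Every logical block maps to the same content, so in principle a single physical block could back the entire file.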

This is the output from sysstat -x during the first test. The data is being transferred over NFS and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)

Random 4KB reads from a 100GB file – pre-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 19%  6572     0     0    6579  1423 27901  23104     11     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6542     0     0    6549  1367 27812  23265    726     0     0     7   17%   5%  T  100%      0     7     0     0     0     0
 19%  6550     0     0    6559  1305 27839  23146     11     0     0     7   15%   0%  -  100%      0     9     0     0     0     0
 19%  6569     0     0    6576  1362 27856  23247    442     0     0     7   16%   4%  T  100%      0     7     0     0     0     0
 19%  6484     0     0    6491  1357 27527  22870      6     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6500     0     0    6509  1300 27635  23102    442     0     0     7   17%   9%  T  100%      0     9     0     0     0     0

The system is delivering an average of 6536 NFS operations per second. The cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense. The 3170 has 16GB of primary cache and we are randomly reading from a 100GB file. Ideally, we would like to get a 16% cache hit rate (16GB cache / 100GB working set) and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of. So what happens if we deduplicate this data?
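The averages quoted above can be checked directly against the six sysstat samples. This short Python sketch copies the NFS and cache hit columns from the output and computes the mean operation rate, the mean observed hit rate, and the ideal hit rate for uniformly random reads (cache size divided by working set size). The variable names are mine.

```python
# NFS ops/s and cache hit % columns from the six sysstat -x samples above.
nfs_ops = [6572, 6542, 6550, 6569, 6484, 6500]
cache_hit_pct = [16, 17, 15, 16, 16, 17]

cache_gb = 16         # primary read cache on the 3170
working_set_gb = 100  # size of the file being read at random

avg_ops = sum(nfs_ops) / len(nfs_ops)
avg_hit_pct = sum(cache_hit_pct) / len(cache_hit_pct)
expected_hit_pct = 100.0 * cache_gb / working_set_gb

print(f"average NFS ops/s:  {avg_ops:.0f}")            # 6536
print(f"observed hit rate:  {avg_hit_pct:.1f}%")       # 16.2%
print(f"ideal hit rate:     {expected_hit_pct:.0f}%")  # 16%
```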

Read more…