When Oracle announced the Exadata V2 database appliance late last year, it created quite a stir. The performance numbers for the box are extremely high, and the feature set and capacity are quite large.
Last week we had an executive briefing for folks interested in Exadata V2. My colleagues Kurt Rosenfeld and John Laferrier presented information on business intelligence and the Exadata, as well as the business case and use cases for considering buying one. Joe LaFlamme from Oracle presented some reference customer examples.
I presented the Exadata V2 technical overview, traveling through the architecture details, migration strategies, and component details. Along the way there were a few points I made that seemed a bit surprising to the audience, and that led to a lively discussion. I summarize those points here, as they do not seem to be well known within the industry.
- Existing Oracle licenses are transferable to Exadata (including Oracle DB, RAC, and Partitioning). That can greatly reduce the cost of an Exadata that is being used for database consolidation, for example.
- The Exadata looks to be an excellent consolidation engine. Included with the Exadata software are resource management tools that can, for example, give some databases resource priority over others. These tools also allow the use of the flash storage to be fine-tuned, pinning specific tables into flash or letting Oracle use the flash as an extended cache.
- The Exadata V2 is designed to be able to perform OLTP and Data Warehouse transactions concurrently. If a single system can be used both ways, consider the implications compared to stand-alone, separate Data Warehouse solutions. Normally data must be extracted from the OLTP system, copied to the DW system, imported there, and then processed. The extraction and copying are overhead on both the OLTP and DW systems. In addition, any reports or queries on the DW system are performed against “stale data” (data from the time the extraction started). Now consider being able to do DW operations against live, current OLTP data. And according to the performance numbers published by Oracle, those operations could run much faster than on most DW systems. That speed could allow more complex reports to complete, more ad hoc queries to run, and so on. Such a change could be a fundamental advantage to DW consumers (finance and senior management, for example).
Read more…
I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second generation Performance Accelerator Modules (PAM 2) in hand and will be publishing updated VMware boot storm results with the larger capacity cards.
What type of disk were the virtual machines stored on?
- The virtual machines were stored on a SATA RAID-DP aggregate.
What was the rate of data reduction through deduplication?
- The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. The deduplication reduced the physical footprint of the data by 97%.
A few interesting stats were gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.
- The CPU utilization moved inversely with the boot time. The shorter the boot time, the higher the CPU utilization. This is not surprising, as during the faster boots the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPU could stay more utilized.
- The total NFS operations required for each test was 2.8 million.
- The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
- The total GB read from disk trended down between cold and warm cache boots. This is what I expected, and I would be somewhat concerned if it were not true.
- The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this were not the case.
- The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.
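The effect described in the last point can be illustrated with a toy model. The sketch below is not NetApp's implementation; it is a minimal LRU cache simulation, with a made-up block layout and cache size, that compares a cache keyed by logical block against one keyed by physical block. Every cache miss stands in for one disk read.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache that counts misses (each miss stands for one disk read)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def read(self, key):
        if key in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            return
        # Cache miss: fetch from "disk" and evict the oldest entry if full.
        self.misses += 1
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def disk_reads(logical_reads, block_map, capacity, dedup_aware):
    """Count disk reads for a sequence of logical block reads."""
    cache = LRUCache(capacity)
    for logical in logical_reads:
        # A dedup-aware cache is keyed by the shared physical block.
        key = block_map[logical] if dedup_aware else logical
        cache.read(key)
    return cache.misses


# Hypothetical layout: 8 logical blocks deduplicated down to 2 physical blocks.
block_map = {logical: logical % 2 for logical in range(8)}
reads = list(range(8)) * 3  # read every logical block three times

plain = disk_reads(reads, block_map, capacity=4, dedup_aware=False)
aware = disk_reads(reads, block_map, capacity=4, dedup_aware=True)
print(f"disk reads without dedup-aware cache: {plain}")
print(f"disk reads with dedup-aware cache:    {aware}")
```

In this toy setup the logical working set does not fit in the cache, so the cache keyed by logical block misses on every read, while the cache keyed by physical block only goes to disk once per unique physical block.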
How much disk load was eliminated by the combination of dedup and PAM?
- The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM (or 32GB of extended dedup aware cache) dropped the amount of data read from disk to less than 4GB.
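To put those figures in relative terms, here is a quick arithmetic sketch. It uses the approximate numbers quoted above, and treats "less than 4GB" as 4GB, so the combined figure is a lower bound on the reduction.

```python
def reduction_pct(before_gb, after_gb):
    """Percentage of disk reads eliminated relative to the baseline."""
    return (before_gb - after_gb) / before_gb * 100


baseline_gb = 67   # cold boot, no dedup, no PAM (approximate)
dedup_gb = 16      # cold boot, dedup, no PAM (approximate)
dedup_pam_gb = 4   # cold boot, dedup + 2 PAM (upper bound: "less than 4GB")

dedup_saving = reduction_pct(baseline_gb, dedup_gb)
combined_saving = reduction_pct(baseline_gb, dedup_pam_gb)

print(f"dedup alone:   about {dedup_saving:.0f}% less data read from disk")
print(f"dedup + 2 PAM: at least {combined_saving:.0f}% less data read from disk")
```

With the rounded inputs above, dedup alone removes roughly three quarters of the disk reads, and dedup plus two PAM cards removes well over nine tenths.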
The news of Sun integrating an in-line deduplication feature into ZFS has created quite a buzz in storage circles. And our clients have been asking us about how to gain access to this new feature. This blog post describes the steps needed to build an OpenSolaris server, integrate the deduplication feature, and enable it.
For details about the ZFS deduplication feature, what it does, and how it does it, have a look at Jeff Bonwick’s blog post on the topic. He was the lead engineer on the project so you can take his word on it.
Deduplication was integrated into OpenSolaris build 128. That takes a little explanation. Solaris is Sun’s current commercial operating system. OpenSolaris has two flavors: the semiannual supportable release, and the frequently updated developer release. The current supportable release is called 2009.06 and is available for download here. Also at that location is the “SXCE” latest build. That distribution is more like Solaris 10: a big ol’ DVD including all the bits of all the packages. OpenSolaris is the acknowledged future of Solaris, including a new package manager (more like Linux) and a live-CD image that can be booted for exploration, and installed as the core release. To that core more packages can be added via the package manager.
Read more…
UPDATE: I have posted an update to this article here: More boot storm details
Measuring the benefit of cache deduplication with a real world workload can be very difficult unless you try it in production. I have written about the theory in the past and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.
The test infrastructure consists of 4 dual socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch. A FAS3170 is connected to the same 10GbE switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results. Instead we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores and the guests were split evenly across the 4 physical servers.
The test consisted of booting all 200 guests simultaneously. This test was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. The boot time is the amount of time between starting the boot and the 200th system acquiring an IP address. Here are the results: Read more…
I have migrated some data to ZFS filesystems recently and the capacity consumed has surprised me a couple of times. In general, it has appeared that the data uses more capacity when stored on the ZFS filesystem. This prompted me to do a little investigating. Is ZFS using more capacity? Is it simply a reporting anomaly? Where is that space going? Does ZFS record size have a major impact? Does enabling compression have a significant impact?
In part, the extra space use is a result of ZFS reporting space utilization differently than other filesystems. When a ZFS filesystem is formatted, almost no capacity is used. A df command will show nearly the entire raw capacity. Many other filesystems take a portion of the raw capacity off the top and reserve it for metadata. This reserve will not show up in df. As data is added to the ZFS filesystem, blocks are allocated for both data and metadata. Both the data and metadata blocks will show up as used capacity. In many other filesystems, at least some of the metadata blocks will be taken from the reserve and only the data blocks will show as consumed capacity. For example, in Solaris, the du command will return the capacity used by the data blocks in a file. In ZFS, that du command returns the total space consumed by the file including metadata and compression. So the question at hand is, when storing a given set of files, does ZFS use more total space than other file systems? That one is difficult to test, given all the variables. But we can test various ZFS configuration options to determine the best settings for minimizing block use.
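The difference between a file's logical size and the space it actually occupies can be checked per file. The following is a minimal sketch, assuming a POSIX system where `os.stat` exposes `st_blocks` in 512-byte units (it is not available on Windows); the helper name `space_report` and the sample file are made up for illustration. It reports the same kind of figure that du works from, so the result depends on how the underlying filesystem accounts for blocks.

```python
import os
import tempfile


def space_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    apparent_bytes is the logical file length; allocated_bytes is what the
    filesystem reports as consumed, which is the figure du is based on.
    """
    st = os.stat(path)
    apparent = st.st_size
    allocated = st.st_blocks * 512  # POSIX: st_blocks counts 512-byte units
    return apparent, allocated


# Hypothetical sample file, created only to have something to measure.
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"x" * 1_000_000)
    sample_path = sample.name

apparent, allocated = space_report(sample_path)
print(f"apparent size:  {apparent} bytes")
print(f"allocated size: {allocated} bytes")
os.remove(sample_path)
```

Running the same check on copies of a file stored on different filesystems, or on ZFS datasets with different record size and compression settings, is one way to compare how much space each configuration reports for identical data.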
Read more…
After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters. The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp.
To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.
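To see why zero-filled data is the extreme case, the sketch below splits a small zero-filled buffer into 4KB blocks and counts how many distinct blocks it contains. The buffer is a 1MiB stand-in for the 100GB file, and fingerprinting blocks with SHA-256 is only an illustrative way to detect duplicates, not a description of how NetApp identifies them.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, matching the block size used in the test


def unique_blocks(data, block_size=BLOCK_SIZE):
    """Count distinct fixed-size blocks in data by hashing each block."""
    digests = set()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digests.add(hashlib.sha256(block).digest())
    return len(digests)


# Small stand-in for the test file: 1MiB of zeros, like dd from /dev/zero.
zero_data = bytes(BLOCK_SIZE * 256)

total_blocks = len(zero_data) // BLOCK_SIZE
distinct_blocks = unique_blocks(zero_data)
print(f"{total_blocks} logical blocks, {distinct_blocks} distinct block(s)")
```

Every block is identical, so the whole file collapses to a single distinct block regardless of its size.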
This is the output from sysstat -x during the first test. The data is being transferred over NFS and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)
Random 4KB reads from a 100GB file – pre-deduplication:
CPU  NFS   CIFS  HTTP  Total  Net kB/s      Disk kB/s     Tape kB/s    Cache  Cache  CP    CP  Disk  FCP  iSCSI  FCP kB/s   iSCSI kB/s
                              in     out    read   write  read  write  age    hit    time  ty  util              in   out   in    out
19%  6572  0     0     6579   1423   27901  23104  11     0     0      7      16%    0%    -   100%  0    7      0    0     0     0
19%  6542  0     0     6549   1367   27812  23265  726    0     0      7      17%    5%    T   100%  0    7      0    0     0     0
19%  6550  0     0     6559   1305   27839  23146  11     0     0      7      15%    0%    -   100%  0    9      0    0     0     0
19%  6569  0     0     6576   1362   27856  23247  442    0     0      7      16%    4%    T   100%  0    7      0    0     0     0
19%  6484  0     0     6491   1357   27527  22870  6      0     0      7      16%    0%    -   100%  0    7      0    0     0     0
19%  6500  0     0     6509   1300   27635  23102  442    0     0      7      17%    9%    T   100%  0    9      0    0     0     0
The system is delivering an average of 6536 NFS operations per second. The cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense. The 3170 has 16GB of primary cache and we are randomly reading from a 100GB file. Ideally, we would like to get a 16% cache hit rate (16GB cache / 100GB working set) and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of. So what happens if we deduplicate this data?
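The averages quoted above can be reproduced directly from the six sysstat samples. This is just the arithmetic, using the NFS and cache hit columns from the output and the 16GB / 100GB figures from the text.

```python
# NFS ops/s and cache hit % from the six sysstat samples above.
nfs_ops = [6572, 6542, 6550, 6569, 6484, 6500]
cache_hit_pct = [16, 17, 15, 16, 16, 17]

average_ops = sum(nfs_ops) / len(nfs_ops)
average_hit_pct = sum(cache_hit_pct) / len(cache_hit_pct)

# Ideal hit rate for uniformly random reads: the cached fraction of the file.
cache_gb = 16
working_set_gb = 100
ideal_hit_rate = cache_gb / working_set_gb

print(f"average NFS ops/s:  {average_ops:.0f}")
print(f"observed hit rate:  {average_hit_pct:.1f}%")
print(f"ideal hit rate:     {ideal_hit_rate:.0%}")
```

The ideal figure assumes uniformly random reads, in which case the chance that a requested block is already cached is simply the fraction of the file that fits in cache.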
Read more…
Sun has published a usable capacity calculator available for the Sun 7000. It was originally written by Adam Leventhal and the latest update from Ryan Matthews is available here. The calculator connects to a 7000 series appliance (or simulator) to calculate the usable capacity. Unfortunately, not everyone has easy access to a system. This is an online version of the calculator so you do not need to have a system locally. It is nothing fancy, but it should get the job done.
The online calculator is here.
In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most obvious cost – the cost per GB of disk capacity. This market has grown quickly over the last few years. Both startups and established storage vendors have products that compete in the space. They are most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions.
Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another somewhat less obvious benefit of deduplication.
What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. What about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, this becomes very interesting.
Read more…
Why is there such a buzz among the analyst, press, and blogging community that Oracle is going to sell off the Sun hardware business? It makes no sense to me. I shared my thoughts on the acquisition in a previous post, but I am going to elaborate a bit here. Not only do I believe Oracle will continue selling Sun hardware, I think it is the primary reason they bought Sun.
Why would Oracle spend $7.4B to buy Sun? Is it for Solaris? I don’t think so. Solaris is open source and Sun would have welcomed Oracle’s help in tuning the operating system for Oracle’s software applications. Is it for Java? That is a little more plausible, but there was no need for Oracle to control Java. As far as I know, Sun was not doing anything to make it difficult for Oracle to use Java. Oracle is buying Sun for the hardware business. The hardware (and support) business is what generates the revenue at Sun.
I would like to share a few relevant quotes. The first comes from Larry Ellison in a recent interview. He did his best to shut down the rumor mill churning on what will happen to Sun’s hardware business.
Interviewer – “Are you going to exit the hardware business?”
Read more…
A few weeks ago, it looked like IBM was going to make a deal to purchase Sun. That fell through when the Sun board could not come to an agreement. On April 20th, with very little rumor in the marketplace, Oracle announced they were buying Sun for $7.4B in cash. What does this mean for the new company?
- To quote a recent Oracle publication, “Oracle plans to engineer and deliver an integrated system – applications to disk – where all the pieces fit and work together, so customers do not have to do it themselves.” Sun is already shipping InfiniBand switches and blades with InfiniBand on the motherboard. They have also mentioned IB is on the roadmap for the Sun 7000. Andy Bechtolsheim mentioned it at the Sun product announcement on April 14th. What about an integrated Oracle appliance running on Nehalem blades, Solaris x64, Sun 7000 storage, and using InfiniBand switches? It should not be a major technology leap to put it all together. What would this mean for the Oracle/HP appliance?
- Sun SPARC processors are at an Oracle pricing disadvantage to IBM Power processors in the current Oracle pricing model. Oracle has never been afraid to use pricing to move the market in their direction. Watch for them to use their pricing model to encourage customers to buy Sun servers.
- Solaris x64 has been intentionally neglected by Oracle. Oracle delivers patches on Solaris SPARC and Linux immediately. Then, they have historically waited up to 6 months to release that same patch for Solaris x64. In the past, this has helped Oracle push their Linux agenda in the marketplace. Given the ease of porting the Oracle patches to Solaris x64, there is no logical technical reason for this lag. Watch for Solaris x64 to become a first class citizen in the Oracle OS support matrix now that growth of Solaris x64 means growth for Oracle.
Read more…