
The right and wrong places to use Sun’s “T” servers

May 27th, 2009

Sun uses three CPU families as the basis for its products: SPARC64 VI and VII, UltraSPARC T1 and T2, and x86. The more choices there are, the harder it is to choose the best CPU, in the best system, for a given problem. I’m frequently asked to recommend a best-fit solution. Sometimes I need to debug the performance of a system to determine where its bottlenecks are and whether it is the best fit for the workload. Too often the “T” CPUs are used in the wrong environment, leaving users and sysadmins unhappy with the performance.

In this blog entry I’ll talk about how to determine whether a given workload will run well on Sun’s T servers (the servers that use the T CPUs).

The T servers have one to four sockets. Each socket holds a CPU with up to eight cores. The CPUs currently range up to 1.4GHz in clock rate. Each core can have eight “hot” threads, meaning that eight threads can be making progress on the core without the system performing a context switch. However, there are not eight computation engines per core. Rather, the eight threads are round-robin scheduled on the core. For details of the architecture of the Niagara CPUs, take a look at the Sun Niagara page. An architecture diagram of a single socket of the Niagara II CPU is shown here for easy reference.

Sun Niagara II CPU Architecture

These T system CPUs are more than just integer units, which raises expectations further. Each chip also includes eight cryptographic accelerators and eight floating point units. In some configurations the systems also have dual 10-Gb Ethernet ports. Finally, Logical Domains, or LDoms, are an included virtualization technology that allows, at the maximum, one virtual machine per thread. The T systems have won many benchmarking records, including world record single-socket SPEC integer and floating point benchmarks. So what could go wrong?

In many instances, T servers are deployed into environments where they are doomed to have poor performance. For sites that understand the architecture of the T server and want to determine ahead of time whether a given workload will perform well there, the cooltst tool is the first step. This tool runs on x86 or SPARC hardware, on Solaris or Linux operating systems. It gathers more data if run as user “root” but can be used by non-root users. For best results it should be run on the system that currently hosts the workload, while that system is under a usual or high load. It runs for five minutes by default and gathers data about floating point operations (important for T1-based systems, which share a single floating point unit among all cores) and multi-threading. It then outputs a summary of the analysis it performs, including a basic “green, yellow, red” rating of the workload. Unfortunately, this simple rating system is, well, too simple. A “green” rating will not ensure that the same applications, running under the same workload, will run well on a T server.

There are specific cases, which turn out to be quite common, in which a multi-threaded workload will run slower than expected on the T servers. Let’s have a look at each of these problem scenarios.

First, the cores used in the Niagara CPUs are rather basic. They lack the advanced features of CPUs with fewer cores, such as out-of-order execution, deep superscalar pipelines, and sophisticated branch prediction. These features help those CPUs accomplish more in a given clock cycle. Conversely, the lack of those features decreases the amount of work a CPU does per cycle. Before the Niagara CPUs, we were used to one SPARC CPU being roughly comparable to another at the same clock rate. That is no longer the case. The Niagara CPUs do less work per clock cycle than other SPARC CPUs, so clock rate is no longer a good indicator of how fast a CPU is or how much work it can do compared to other CPUs. By combining data from a variety of sources, I’ve determined that a single thread on a Niagara CPU performs as if it were running at about 70% of the core clock rate, on average. That is, a given thread running on a 1.2GHz Niagara core will finish in about the same time as it would have on a conventional SPARC core (for example an UltraSPARC III) running at roughly 800MHz. The percentage varies depending on workload, so as always a real, well-run benchmark based on your actual workload is the best way to determine performance.
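The rule of thumb above is simple arithmetic, and a minimal Python sketch makes it easy to apply to other clock rates. The 0.70 factor is the rough average quoted in this post, not a measured constant, and the function name is only illustrative. Taken literally, 70% of 1.2GHz works out to about 840MHz, in the same ballpark as the figure given above.

```python
# Rough average from this post: one Niagara thread behaves like a conventional
# SPARC core clocked at about 70% of the Niagara clock. Workloads vary.
NIAGARA_EFFICIENCY = 0.70


def equivalent_clock_mhz(niagara_clock_mhz, efficiency=NIAGARA_EFFICIENCY):
    """Clock rate of a conventional core that would finish a single thread
    in about the same time as a Niagara core at niagara_clock_mhz."""
    return niagara_clock_mhz * efficiency


if __name__ == "__main__":
    for clock in (1200, 1400):
        equivalent = equivalent_clock_mhz(clock)
        print(f"{clock} MHz Niagara is roughly a {equivalent:.0f} MHz conventional core")
```

This only describes single-thread speed. It says nothing about total throughput, which is where the T servers do well.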

Second, overestimating how multi-threaded a workload is can be painful. If the workload isn’t highly multi-threaded, the chosen system can end up with a lot of unused CPU resources. The highest-end T server has four sockets of Niagara T2 Plus CPUs, each with 64 hot threads. Thus, the system can reasonably run 256 concurrent threads. Of course the load would be less “reasonable” if each thread were performing heavy I/O; 256 threads performing heavy I/O would overtax many networks or SANs. Most developers and systems administrators consider dozens of threads to be highly threaded, not hundreds. Cooltst helps some in determining the number of active threads, but other system tools can be more useful. On Solaris, observe the “r” column generated by “vmstat 10 10”. Each value is the average number of threads in the run queue over a ten-second interval. As with all the “*stat” commands, the first line of output is the average since system boot and is usually ignored. Note that the run queue contains only threads that are ready to run but are not yet running. So to estimate the number of active threads, add the number of CPUs to the number in the “r” column (this assumes every CPU is already busy, which holds when the run queue is non-empty). The result is a good indication of how many threads were active on the system during that time. Perhaps an easier way to determine the long-term number of active threads is to look at the output of “uptime”. The load averages are the one-, five-, and fifteen-minute averages of the number of active threads. If these numbers are low, say less than 20, then this workload is not a good candidate for a T server.
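The vmstat rule of thumb can be sketched in a few lines of Python. This is a rough illustration, not a supported tool: the function names are hypothetical, the sample text is made up to resemble Solaris vmstat output, and the estimate inherits the assumption that every CPU is busy whenever threads are queued.

```python
def run_queue_samples(vmstat_text):
    """Return the 'r' column values from vmstat output, without the first
    data line (which is the average since boot)."""
    samples = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Header lines start with words ('kthr', 'r'); data lines start with a number.
        if fields and fields[0].isdigit():
            samples.append(int(fields[0]))
    return samples[1:]


def estimate_active_threads(vmstat_text, ncpus):
    """Rule of thumb from the text: average run queue length plus CPU count."""
    samples = run_queue_samples(vmstat_text)
    if not samples:
        raise ValueError("no interval samples found in vmstat output")
    return sum(samples) / len(samples) + ncpus


# Made-up sample in the shape of 'vmstat 10 4' output on a 4-CPU system.
SAMPLE_VMSTAT = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 8123456 2123456 3 10 0 0 0 0 0 0 0 0 0 400 900 350 2 1 97
 6 0 0 8000000 2000000 5 20 0 0 0 0 0 1 0 0 0 900 4000 1500 60 20 20
 10 0 0 7990000 1990000 6 25 0 0 0 0 0 1 0 0 0 950 4200 1600 65 25 10
 8 0 0 7985000 1985000 4 18 0 0 0 0 0 1 0 0 0 920 4100 1550 62 22 16
"""

if __name__ == "__main__":
    print(estimate_active_threads(SAMPLE_VMSTAT, ncpus=4))
```

On a live system, Python’s os.getloadavg() returns the same three load averages that “uptime” prints, which can be compared against the threshold of 20 suggested above.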

You can also use prstat(1M) to determine whether your application processes are threaded, and how active the threads are. Running “prstat” with no arguments provides a dynamically updated list of all running processes, with the process name and the number of threads in the process shown in the last column (PROCESS/NLWP). If the NLWP value is larger than 1, the process is threaded. Active threads per process can be determined by selecting the PID of an application process and running “prstat -Lmp PID”. This instructs “prstat” to look only at that process and to display one row of output per thread. If the threads show non-zero values in the USR or SYS columns, they are spending some time executing on CPU. If most or all of the threads are showing mostly SLP time, the threads are not that busy. Please be aware that there are many reasons a threaded application may show most threads sleeping, and the pattern of the threads’ behavior can change dramatically if the platform changes, the environment changes, or of course if user behavior changes. These are just high-level guidelines, and are not intended to produce hard conclusions about an application’s concurrency.
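To turn the per-thread output into a count of busy threads, something like the following Python sketch could be used on captured “prstat -Lmp PID” output. The function name, the 1% cutoff, and the sample text are all assumptions made for illustration. The column names are the ones prstat prints in microstate mode.

```python
def busy_threads(prstat_text, min_cpu_pct=1.0):
    """Return (PROCESS/LWPID, USR+SYS percent) for each thread whose time on
    CPU is at least min_cpu_pct. The cutoff is an arbitrary illustration."""
    header = None
    busy = []
    for line in prstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":
            header = fields  # remember column names from the header row
            continue
        # Skip the 'Total:' summary line and anything that is not a data row.
        if header is None or not fields[0].isdigit() or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        on_cpu = float(row["USR"]) + float(row["SYS"])
        if on_cpu >= min_cpu_pct:
            busy.append((row["PROCESS/LWPID"], on_cpu))
    return busy


# Made-up sample in the shape of 'prstat -Lmp 1234' output.
SAMPLE_PRSTAT = """\
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  1234 mysql     45 5.0 0.1 0.0 0.0 0.0  50 0.2 900 200 500   0 mysqld/7
  1234 mysql     12 3.0 0.0 0.0 0.0 0.0  85 0.1 300  40 200   0 mysqld/9
  1234 mysql    0.2 0.1 0.0 0.0 0.0 0.0 100 0.0  10   0  20   0 mysqld/3
  1234 mysql    0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   2   0 mysqld/1
Total: 1 processes, 4 lwps, load averages: 1.02, 0.98, 0.91
"""

if __name__ == "__main__":
    for name, pct in busy_threads(SAMPLE_PRSTAT):
        print(f"{name}: {pct:.1f}% on CPU")
```

A single snapshot can mislead, for the reasons given above, so it is worth sampling several intervals under representative load before drawing conclusions.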

Third, even a highly-threaded workload may not run well on a T server. Consider a job in which one thread calls another, and waits for it to complete its work before continuing. Even though this is a multi-threaded task, it is essentially “sequentially multithreaded”. The threads depend on each other and cannot independently make progress. Multiply this by dozens of instances and a seemingly highly-multi-threaded workload actually uses only a small amount of CPU resources.
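The difference between “many threads” and “many threads making progress” can be shown with a small Python sketch. It is a toy model with hypothetical names: it counts how many threads are inside their work section at the same moment, first for a chain in which each thread waits for the thread it started, then for the same number of independent threads. It measures logical concurrency only, not CPU usage.

```python
import threading


class PeakCounter:
    """Tracks the highest number of threads inside a 'with' block at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def dependent_chain(depth):
    """Each thread starts the next one and waits for it before doing its own work."""
    counter = PeakCounter()

    def worker(level):
        if level < depth - 1:
            child = threading.Thread(target=worker, args=(level + 1,))
            child.start()
            child.join()  # the caller makes no progress until the callee is done
        with counter:
            pass  # stand-in for this thread's own work

    root = threading.Thread(target=worker, args=(0,))
    root.start()
    root.join()
    return counter.peak


def independent_workers(count):
    """Threads that do not depend on each other all work at the same time."""
    counter = PeakCounter()
    barrier = threading.Barrier(count)

    def worker():
        with counter:
            barrier.wait()  # keep the work section open until every worker is in it

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.peak


if __name__ == "__main__":
    print("dependent chain peak:", dependent_chain(8))
    print("independent workers peak:", independent_workers(8))
```

Both cases create eight threads, yet the chain never has more than one of them working at a time, which is the pattern that leaves most of a T server’s hot threads idle.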

Fourth, if response time is an important component of a computing task, the T servers may not be a good fit. If all threads that affect response time are short-lived, then the job will likely run well on a T server. On the other hand, if many tasks are short but one or more longer tasks are important to the overall response time, then the job does not fit well. For example, a MySQL database that serves indexed reads and writes will likely perform well, but add a table scan to the mix and the user waiting for that scan to finish will likely be unhappy.

In essence, the T servers are trucks, not cars. They can move a lot of computing from start to finish, but any given compute job does not move quickly. Web servers tend to be a perfect fit for T servers, and the further a job moves from that scenario of many short-running threads, the less likely the T server is to provide satisfactory performance.
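The truck-versus-car trade-off can be put into numbers with a deliberately crude Python model. Everything here is hypothetical: the two machines, their thread counts, and their per-thread speeds are invented to illustrate the shape of the trade-off and are not measurements of any real server.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    hw_threads: int      # threads that can make progress at the same time
    thread_speed: float  # work units per second for one thread (hypothetical)


def batch_seconds(machine, n_tasks, work_per_task):
    """Time to finish n independent, equal tasks run in waves of hw_threads."""
    waves = math.ceil(n_tasks / machine.hw_threads)
    return waves * work_per_task / machine.thread_speed


def single_task_seconds(machine, work):
    """Time one long, single-threaded task takes; extra threads do not help."""
    return work / machine.thread_speed


# Invented machines: a few fast threads versus many slower ones.
CAR = Machine("car", hw_threads=4, thread_speed=2.0)
TRUCK = Machine("truck", hw_threads=64, thread_speed=1.0)

if __name__ == "__main__":
    for m in (CAR, TRUCK):
        short = batch_seconds(m, n_tasks=640, work_per_task=1.0)
        long_task = single_task_seconds(m, work=100.0)
        print(f"{m.name}: 640 short tasks in {short:.0f}s, one long task in {long_task:.0f}s")
```

With these made-up numbers the truck clears the batch of short tasks eight times faster, while the one long task takes twice as long on it. A real workload needs a real benchmark.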

Why is it worth fighting the battle of determining which workloads are right for T servers? Sun’s T servers have many aspects that separate them from Sun’s other servers (and the industry’s servers as well). They use extremely little power per thread, and can run many threads concurrently. If a workload needs a truck to move it from start to finish, then the T server may be the best truck going. Just be sure a truck is what is needed before deploying a workload on the T servers.

The full column for ;login: Magazine that expands on this discussion is available for download here:

  August 2009 Usenix ;login: column (150.9 KiB)
