Details on Nehalem and NUMA

April 2nd, 2009

Some folks have asked about the sources of Nehalem’s performance improvements. Certainly a major contributor is its new on-board memory controller and the new QuickPath Interconnect (QPI). Rather than having multiple sockets of CPUs (each possibly containing multiple cores) share a bus to get to memory in a Uniform Memory Access (UMA) model, Intel with Nehalem has moved to a Non-Uniform Memory Access (NUMA) architecture. As the number of cores within a system increases, so does the need for fast interconnects between the cores and their memory. Unfortunately, at scales beyond about 4 cores, it’s infeasible to have all components talking directly to all other components (uniformly). A single bus can be overwhelmed and become a bottleneck, and just cranking up CPU and bus speeds has failed to solve the problem because the amount that the crank can turn is limited.

Rather, components connect to other components, which then connect to other components. Each connection is very fast (especially when it is non-shared and there is no contention to mitigate), but a component may need multiple hops across these fast connections to reach some other components. Some components are “closer” than others, so communication between them is faster. Thus the creation of NUMA architectures.
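To make the idea of “hops” concrete, here is a minimal sketch in Python. It assumes a made-up 4-socket box in which each socket has a direct link to only two neighbours (a ring); the `LINKS` table and the function names are hypothetical and do not describe any particular Nehalem system.

```python
from collections import deque

# Hypothetical 4-socket topology: each socket links directly to two
# neighbours, forming a ring. Not taken from any real machine.
LINKS = {
    0: [1, 3],
    1: [0, 2],
    2: [1, 3],
    3: [2, 0],
}

def hops(links, src, dst):
    """Return the minimum number of point-to-point links between two sockets."""
    seen = {src}
    queue = deque([(src, 0)])
    # Breadth-first search: the first time we reach dst is the shortest path.
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("sockets are not connected")

def hop_table(links):
    """Build a table of hop counts from every socket to every other socket."""
    order = sorted(links)
    return {s: [hops(links, s, d) for d in order] for s in order}
```

In this toy layout, socket 0 reaches its own memory in zero hops, sockets 1 and 3 in one hop, and socket 2 in two hops, which is exactly the “some components are closer than others” effect.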

In moving to NUMA, Intel has joined other CPU designs:

  • Sun’s UltraSPARC uses a shared parallel bus called the Sun Fireplane Interconnect Bus, combined with a crossbar switch architecture for even larger multi-tiered NUMA fabrics
  • AMD’s Opteron uses a non-shared serial transport called the HyperTransport bus; the limited number of HyperTransport connections available restricts how far it can scale into even larger multi-tiered NUMA fabrics
  • IBM’s P6 uses a combination of non-shared serial inter-chip transport (for connecting 4 CPUs together into a fast NUMA node) and serial inter-node transport (for connecting 4-CPU NUMA nodes together into an even larger multi-tiered NUMA fabric)

NUMA as a trend should continue unabated. NUMA and virtualization complement each other extremely well, especially when you consider that a virtualized machine will typically run within its own NUMA node unaffected and unimpeded by virtual machines running within other NUMA nodes on the box.

HPC is another area where NUMA will only bring gains. Using larger 8+ CPU boxes in an HPC design should yield higher marks than using many smaller 2-CPU boxes. Intel’s switch to NUMA will allow for Intel boxes that contain more than 4 CPUs without the need for special chipsets that split the front-side bus (FSB) and provide a cache coherency mechanism between the FSBs.

NUMA architectures put more pressure on operating systems. An operating system needs to understand the performance “distance” between the various components of the system so it can properly schedule CPUs and allocate memory to optimize performance. The higher the number of threads and interconnected components, the more important this is.
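As a rough illustration of what “understanding distance” means for an allocator, the sketch below picks the NUMA node to allocate memory from, given the node a thread is running on. The `DISTANCE` matrix, the free-memory figures, and the function name are all assumptions made up for this example; a real operating system uses its own data structures and far more elaborate policies.

```python
# Hypothetical relative access costs between 4 NUMA nodes (lower is closer).
# The numbers are invented for illustration; 10 stands for local access.
DISTANCE = [
    [10, 20, 30, 20],
    [20, 10, 20, 30],
    [30, 20, 10, 20],
    [20, 30, 20, 10],
]

def pick_memory_node(cpu_node, free_mb, needed_mb, distance=DISTANCE):
    """Choose the node closest to cpu_node that has enough free memory.

    free_mb is a list of free memory per node, in MB. Ties in distance
    are broken by taking the lowest node number.
    """
    candidates = [n for n, free in enumerate(free_mb) if free >= needed_mb]
    if not candidates:
        raise MemoryError("no node has enough free memory")
    return min(candidates, key=lambda n: (distance[cpu_node][n], n))
```

For example, a thread on node 0 gets local memory when node 0 has room, and falls back to a one-hop neighbour rather than the most distant node when it does not:

```python
pick_memory_node(0, [1024, 4096, 4096, 4096], 512)  # local node 0
pick_memory_node(0, [100, 4096, 4096, 4096], 512)   # nearest node with room
```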

Thanks to Ed Hamilton, Principal Solutions Architect at Corporate Technologies, Inc., for much of the content in this blog posting.
