Details on Nehalem and NUMA
Some folks have asked about the sources of Nehalem’s performance improvements. Certainly a major contributor is its new integrated memory controller and the new QuickPath Interconnect (QPI). Rather than having multiple sockets of CPUs (each possibly containing multiple cores) share a single bus to get to memory in a Uniform Memory Access (UMA) model, Intel with Nehalem has moved to a Non-Uniform Memory Access (NUMA) architecture.

As the number of cores in a system increases, so does the need for fast interconnects between the cores and their memory. Unfortunately, at scales of more than 4 cores, it’s infeasible to have all components talking directly to all other components (uniformly). A single bus can be overwhelmed and become a bottleneck, and just cranking up CPU and bus speeds has failed to solve the problem because there is a limit to how far that crank can turn. Rather, components connect to other components, which then connect to other components. Each connection is very fast (especially when it is not shared and there is no contention to mitigate), but a component may have to take multiple hops across these fast connections to reach some other components. Some components are “closer” than others, so communication between them is faster. Hence the creation of NUMA architectures.
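
To make the “closer” versus “farther” idea concrete, here is a minimal Python sketch that counts hops between sockets. The four-socket ring in `LINKS` is purely an assumed layout for illustration (it is not a description of any actual Nehalem system), and the names `hop_counts` and `distance_matrix` are hypothetical helpers, not part of any library.

```python
from collections import deque

# Hypothetical 4-socket topology: each key is a socket (NUMA node) and each
# value lists the sockets it has a direct point-to-point link to.
# This ring layout is an assumption for illustration only.
LINKS = {
    0: [1, 3],
    1: [0, 2],
    2: [1, 3],
    3: [2, 0],
}


def hop_counts(links, start):
    """Breadth-first search: links crossed from `start` to reach each node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def distance_matrix(links):
    """Row i, column j holds the hop count from node i to node j."""
    nodes = sorted(links)
    return [[hop_counts(links, src)[dst] for dst in nodes] for src in nodes]


if __name__ == "__main__":
    for row in distance_matrix(LINKS):
        print(row)
```

In this assumed ring, a socket reaches its own memory in zero hops, its two neighbors in one hop, and the socket on the opposite side in two hops. That unevenness is the “non-uniform” part of NUMA: every link is fast, but the cost of a memory access depends on how many links it has to cross.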