1. Introduction
With the arrival of the Big-Data and in-memory computing era, NUMA (Non-Uniform Memory Access) has become a prevalent and important architecture for hardware platforms, as it meets the increasing memory-bandwidth requirements of many-core architectures.
As the number of cores in PC servers increases, memory contention among data-intensive applications becomes increasingly serious. Instead of relying on faster and larger processor caches, NUMA alleviates the memory-bandwidth problem to some extent by adopting an asymmetric, hierarchical memory model: the memory controllers are distributed among the nodes while a global address space is maintained over all the memory. Accessing remote memory requires an inter-processor interconnect, so a remote access costs more than a local one and gives rise to bandwidth contention along the routing path to the remote node.
Because of the distributed and asymmetric nature of NUMA, performance depends heavily on how well the software exploits the characteristics of the underlying hardware. In the virtualized environments of cloud computing, the authors of (Rao et al., 2010) proposed a method called vNUMA-mgr to optimize the deployment of VMs (Virtual Machines) on NUMA architectures; their experimental results showed that a High-Performance Computing workload (a memory-intensive application) achieved 30%–50% performance gains. Research in (Ali, Q, 2012) likewise showed that VMware's ESXi Server improved performance by up to 167% after adopting virtual NUMA topology technology.
Earlier works paid much attention to data locality and are well documented (Zhang et al., 1991; LaRowe et al., 1992; Brecht, 1993; Holliday et al., 1994; Bircsak et al., 2000); others focused on how to map threads and memory onto a particular NUMA architecture so as to maximize locality (Osiakwan et al., 1990; Castro et al., 2009; da Cruz et al., 2012; Tudor et al., 2011), using OS-provided APIs or other tools (Drepper, 2007; Kleen, 2005; Ribeiro et al., 2009; Lameter, 2006; Hursey et al., 2011). In the last two years, however, studies (Awasthi et al., 2010; Majo et al., 2011; Luo et al., 2013; Dashti et al., 2013) have shown that the microarchitecture has a great effect on memory performance on NUMA platforms; under some circumstances, decreasing data locality can even yield better performance.
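As an illustration of such OS-provided placement APIs, on Linux a process can be pinned to the CPUs of a single NUMA node using the affinity call in Python's standard library together with the CPU list that the kernel exports through sysfs. The `parse_cpulist` and `bind_to_node` helpers below are our own minimal sketch, not part of any cited tool:

```python
import os

def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def bind_to_node(node):
    """Pin the calling process to the CPUs of the given NUMA node (Linux only)."""
    # The sysfs path assumes the usual Linux layout for NUMA node information.
    with open("/sys/devices/system/node/node%d/cpulist" % node) as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
```

Memory placement would additionally require libnuma-style allocation calls; the sketch only constrains where the threads run.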
All of these works need to know the interconnect topology of the NUMA nodes and assume that this topology is provided, which is not true in all circumstances. What is more, software should not be optimized for the fixed topology of one particular architecture or hardware platform; we should instead be able to detect the topology and construct the routing table automatically when software is ported to a new platform. To the best of our knowledge, no existing article or method addresses this problem.
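A partial starting point does exist on Linux: the kernel exports the firmware-reported node distances (the ACPI SLIT table) through sysfs, although this table only gives coarse relative costs and may not reflect the actual interconnect routing. A minimal sketch of reading it, assuming the standard sysfs layout (the function name and `base` parameter are our own, not from this article):

```python
import os

def read_numa_distances(base="/sys/devices/system/node"):
    """Parse the kernel-exported SLIT distance table.

    Each nodeN/distance file holds one space-separated row of relative
    access costs (10 conventionally means local). Returns a mapping
    {node_id: [cost_to_node0, cost_to_node1, ...]}.
    """
    table = {}
    for entry in sorted(os.listdir(base)):
        # Keep only directories named nodeN, skipping files like 'online'.
        if entry.startswith("node") and entry[4:].isdigit():
            with open(os.path.join(base, entry, "distance")) as f:
                table[int(entry[4:])] = [int(x) for x in f.read().split()]
    return table
```

On a two-node machine this typically yields something like `{0: [10, 21], 1: [21, 10]}`; recovering the real routing paths behind those numbers is precisely what the firmware table does not provide.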