An Extensive Power and Performance Analysis for High Dimensional Mesh and Torus Interconnection Networks

Next-generation parallel computers are the key to achieving exascale performance, as sequential computers have already reached saturation. One of the main challenges in reaching this mighty target of exascale computing is reducing power consumption while still achieving suitable performance. Energy efficiency is the key metric for the trade-off between the performance delivered and the power required. Hence, this article analyzes the performance versus power usage trade-off for conventional networks like Mesh and Torus. High degree networks show much better performance than low degree networks; however, they require higher power usage because of their larger number of interconnected links. This article shows that, in terms of zero load latency, the 3D Torus achieves about 57.07% better performance than a 2D Torus. On the other hand, a 2D Mesh network requires about 24.22% less router power than the 3D Mesh, and a 5D Torus requires about 66.8% higher router power than a 3D Torus network.

For example, the 3D Torus (used for the Cray T3D) requires 23.07% higher router power in comparison to the 3D Mesh (used in the MIT M-Machine) at the lowest network level with 64 nodes (Al Faisal et al., 2017). The power usage of MPC systems can be divided into two modules: the on-chip module and the off-chip module. Considering the on-chip module, an 80-tile teraFLOPS processor arranged as an (8 x 10) 2D Mesh network in a 65nm CMOS process requires 97W of electrical power at an operating frequency of 4.27GHz (Liu & Svensson, 1994; Vangal et al., 2008). In addition, a 16-tile on-chip network requires about 36% of the total chip power (Wang & Peh, 2003). However, off-chip links also have a high impact on total power usage; for example, an Infiniband QDR 40Gbps switch requires 1W of electrical power per link (NVIDIA, n.d.). The on-chip network of a supercomputer consists of CPU cores, on-chip shared memory, and routers that connect neighboring cores. The Quriosity supercomputer, ranked in the Top500 (Supercomputer, n.d.), requires about 15km of wiring and 600 kilowatts of electric power (BASF Supercomputer, n.d.). Finally, this research is immensely important for the field of interconnection networks because it considers the design of on-chip as well as off-chip networks based on conventional networks like Mesh and Torus, especially for hierarchical networks, where the on-chip levels are often designed as high degree networks and the off-chip levels as low degree networks in order to reduce power consumption. For example, the 3D-TTN network (Al Faisal et al., 2021) uses a 3D Torus network for its on-chip module and a 2D Torus network for its off-chip connectivity.
Similarly, one of the latest hierarchical interconnection networks designed specifically for exascale supercomputers, named the Hierarchical Flattened Butterfly Network (HFBN), adopts a 2D flattened-butterfly-like architecture at the NoC level, while its upper level is designed with a 2D Torus network (Faisal et al., 2022). Hence, investigating network performance with respect to latency and power usage is highly important for the field of interconnection networks.

ARCHITECTURE OF MESH AND TORUS NETWORKS
The modern interconnection networks of MPC systems mainly focus on a fixed router radix, because it is very important to maintain a fixed router cost as the system scales. An increased router radix improves performance but also increases the router cost and the power usage due to the increased link connectivity and router activity. On the other hand, a constant router radix helps to construct very large networks from the lowest network module. This regularity and modularity is why direct networks are often chosen for MPC systems. Table 2 shows several topologies that have been adopted by various MPC systems. Mesh and Torus networks are the most common networks used in supercomputers; even a modern supercomputer like Sunway TaihuLight, which achieves 93 PFLOPS with a power efficiency of 6.051 GFLOPS/watt, uses 2D Mesh interconnects (Fu, 2016). The main problem of those networks is their ill-conceived number of off-chip links; nevertheless, these conventional networks are vastly used due to their simplicity and reduced wiring complexity. Mesh (Miura et al., 2013), one of the k-ary n-cube networks, is a well-known network in the field of interconnection networks and is easy to lay out because of its regular, equal-length links. The Mesh network has been used in the Tilera 100-core CMP. Moreover, hierarchical networks (Rahman & Horiguchi, 2005) like TESH (Jain et al., 1997; Rahman et al., 2008) and 3D-TESH (Al Faisal et al., 2017) both use 2D Mesh networks for their on-chip module. Fig. 1(a) shows the architectural interconnect of a 2D Mesh. In addition, it has high path diversity, i.e., there are many ways to reach one node from another. Besides its use in supercomputers, this network is also very popular for Wi-Fi systems, where multiple Wi-Fi devices act as a single Wi-Fi network.
Torus is also a k-ary n-cube network. The main difference between Mesh and Torus networks is the extra wrap-around links used by the Torus: the Mesh is not symmetric at its edges, whereas the Torus does not have this problem (Miura et al., 2013). The Torus has a higher bisection bandwidth than a Mesh network and, hence, higher fault tolerance and path diversity. In contrast, it has higher cost and power usage, is harder to lay out on-chip, and requires unequal link lengths. Fig. 1(b) displays the interconnectivity of nodes in a 3D Torus network. Similar to the TESH network, hierarchical networks like TTN (Hafizur Rahman et al., 2013) and 3D-TTN (Al Faisal et al., 2016) use the 2D Torus and 3D Torus networks, respectively, at the on-chip module.

Table 2. Supercomputers and their interconnection networks

Supercomputer | Interconnection Network
Connection Machine CM-5 (Leiserson et al., 1996) | Fat-Tree
Intel iPSC-2 (Arlanskas, 1988; Nugent, 1988) | Hypercube
Intel Paragon (Intel Corp, 1991) | 2D Mesh
MIT J-Machine (Noakes & Dally, 1990; Noakes et al., 1993) | 3D Mesh

Fig. 2 shows the system cost with respect to chip-chip links (level-2 links) and intra-rack links (level-3 links); Fig. 2(a) is the cost for 4096 cores and Fig. 2(b) is the scaled cost for 1M cores. This research considered electrical links at the inter-chip level and optical links at the intra-rack level; the assumed numbers of links are shown in Table 3. This analysis shows that the 2D Mesh network requires about 90.39% less cost for designing level-2 and level-3 off-chip links than the 4D Torus network. On the other hand, Fig. 3 shows the power usages for the various network topologies with 4096 cores and 1M cores based on static power usage. Here, 0.0135W is considered for each electrical interconnect at the inter-chip level and 0.0101W for each optical connection, with 1.2W for each GBIC module at the intra-rack level.
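As a rough cross-check of the link counts behind the Table 3 cost analysis, the number of bidirectional links in k-ary n-dimensional Mesh and Torus networks follows the standard closed forms sketched below (the function names are ours, for illustration only):

```python
def mesh_links(k, n):
    """k-ary n-Mesh: each dimension has k**(n-1) rows of k nodes,
    and each row contributes k-1 links (no wrap-around)."""
    return n * (k - 1) * k ** (n - 1)

def torus_links(k, n):
    """k-ary n-Torus: the wrap-around link turns every row into a ring
    of k links, so each dimension contributes k**n links."""
    return n * k ** n

# 64-node examples: an 8x8 2D Mesh vs. a 4-ary 3-cube (3D Torus)
print(mesh_links(8, 2))   # 112
print(torus_links(4, 3))  # 192
```

The gap widens quickly with dimension, which is why high degree networks pay so much more in off-chip link cost and power.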

ROUTING ALGORITHM
In this paper, the authors consider a deterministic dimension-order routing algorithm. In dimension-order routing, each packet routes within one dimension until the remaining distance in that dimension becomes zero, and is then forwarded to the next dimension; a packet thus travels from the source node to the destination by resolving each dimension one by one. The authors use the BookSim simulator (Jiang & Dally, 2013) for the evaluation of dynamic performance, so the routing algorithm has been designed with the simulator environment in mind. The simulator requires the outgoing port number at each node along the routing path; hence, when the routing algorithm is called, it returns the outgoing channel number used to send the packet to the next node on the path to the destination. Mesh networks don't have any wrap-around connections, so no partition logic is adopted for Mesh routing; this partition logic is adopted for the Torus networks to obtain a better saturation rate. For Mesh routing, the authors use the default routing algorithm (Mesh_Routing()), where gN is the network dimension, gK is the number of nodes in each dimension, cur is the current routing node number, gNodes is the total number of nodes, and dest is the destination node number. For Torus routing (Torus_Routing()), the authors developed their own algorithm, although this article also shows some results for the default Torus routing. Our Torus routing algorithm makes less use of the wrap-around connections, which is the only difference from the default routing algorithm.
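A minimal sketch of the dimension-order logic described above, using the paper's variable names (gN, gK, cur, dest). BookSim's routing functions return an outgoing channel number; for readability this sketch returns a (dimension, direction) pair from which that channel index would be derived, and the wrap-around rule in torus_routing is our assumption about the intent, not the authors' exact code:

```python
def mesh_routing(cur, dest, gN, gK):
    """Deterministic dimension-order routing for a gK-ary gN-Mesh.
    Resolve the lowest dimension whose coordinate still differs and
    return (dimension, +1/-1); None means the packet has arrived."""
    for dim in range(gN):
        c = (cur // gK ** dim) % gK   # current coordinate in this dimension
        d = (dest // gK ** dim) % gK  # destination coordinate
        if c != d:
            return dim, (1 if d > c else -1)
    return None

def torus_routing(cur, dest, gN, gK):
    """Same dimension order, but take the wrap-around link only when it
    gives the shorter way around the ring in that dimension."""
    for dim in range(gN):
        c = (cur // gK ** dim) % gK
        d = (dest // gK ** dim) % gK
        if c != d:
            forward = (d - c) % gK    # hops when moving in the +1 direction
            return dim, (1 if forward <= gK // 2 else -1)
    return None
```

For example, in a 4-ary 2D network, routing from node 0 to node 3 takes three +1 hops in a Mesh but a single wrap-around hop (direction -1) in a Torus.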

STATIC NETWORK PERFORMANCE ANALYSIS
The network topology holds the key to the overall performance of MPC systems. Interconnection networks are expected to have low cost, low degree (radix), high connectivity, and low packet congestion. In this section, the diameter and the average distance of the Mesh and Torus networks are analyzed.
The diameter of a network, which is the maximum inter-node distance between any distinct pair of nodes along their shortest path, should be small. A short distance reduces the total message passing time; hence, networks with small diameters are extremely desirable. The diameter of the 2D Mesh network can be calculated using equation 1 (where N is the number of nodes), the 2D Torus uses equation 2, and k-ary n-cube networks require equation 3:

Diameter of 2D Mesh = 2(√N - 1) (1)
Diameter of 2D Torus = 2⌊√N / 2⌋ (2)
Diameter of k-ary n-cube = n⌊k / 2⌋ (3)

Fig. 4 illustrates this diameter analysis for the various networks, where the 5D Torus (5DT) network outperforms every other. A small diameter is preferable; however, considering only the diameter is not a good choice, because a node typically communicates with every other node rather than with a single farthest node. To mitigate this issue, a preferable metric is the average distance, the mean distance over all distinct pairs of nodes in the network. The average distance of the 2D Mesh is calculated with equation 4 (where N is the number of nodes), equation 5 is used for the 2D Torus, and equation 6 for k-ary n-cube networks:

Avg. distance of 2D Mesh = (2/3)√N (4)
Avg. distance of 2D Torus = √N / 2 (5)
Avg. distance of k-ary n-cube = nk / 4 (6)

Fig. 5 shows the complete result analysis, where the 5D Torus outperforms every other network due to its high node degree (10).
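The closed forms above can be cross-checked by brute force on a small network. The sketch below measures the mean shortest-path hop count over all ordered node pairs of a 4-ary 3-cube (a 64-node 3D Torus) and compares it with nk/4; this is our illustrative check, not code from the paper:

```python
from itertools import product

def torus_dist(a, b, k):
    """Hop count between coordinate tuples a and b in a k-ary n-Torus:
    each dimension may use the wrap-around link."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

def avg_distance(k, n):
    """Mean torus distance over all ordered node pairs."""
    nodes = list(product(range(k), repeat=n))
    total = sum(torus_dist(a, b, k) for a in nodes for b in nodes)
    return total / len(nodes) ** 2

k, n = 4, 3
print(avg_distance(k, n))   # n*k/4 = 3.0 for even k
print(max(torus_dist(a, (0,) * n, k)
          for a in product(range(k), repeat=n)))  # diameter n*floor(k/2) = 6
```

The same brute-force loop with the wrap-around term removed reproduces the Mesh figures, which is a quick way to sanity-check Fig. 4 and Fig. 5.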

DYNAMIC COMMUNICATION PERFORMANCE ANALYSIS
The performance of supercomputers heavily depends on the network latency and throughput (Nakao et al., 2020); high network latency and low throughput are not desirable at all. The performance of a network also depends on the number of virtual channels and router buffers. In this section, the authors evaluate the dynamic communication performance of Mesh and Torus networks through the cycle-accurate BookSim network simulator (Jiang & Dally, 2013) using only two virtual channels. Latency is defined as the time period from the instant a packet is generated until the last flit of the message is received at the destination node, whereas network throughput is defined as the packet transmission rate for a specific traffic pattern. Table 4 depicts the simulation environment for the dynamic performance analysis.

Uniform Traffic Pattern
In the uniform traffic pattern, the source and destination nodes are selected randomly for each generated message; hence, each node sends messages to every other node with equal probability. The simulation results for this pattern show that our algorithm achieves slightly better zero load latency, whereas the default algorithm shows better saturation load latency. The important finding is the zero load difference between the various networks: the 3D, 4D, and 5D Mesh/Torus networks show very marginal differences between them. In contrast, the zero load latency of the 2D Mesh/Torus is completely undesirable for MPC systems.

Transpose Traffic Pattern
The transpose traffic pattern transmits packets between fixed source-destination pairs; for example, node a_0 transmits its packets to node a_{n/2}, where n is the total number of nodes. Fig. 7 shows the transpose traffic analysis for the Mesh and Torus networks. This figure also confirms that Torus networks show much better zero load latency than Mesh networks. In contrast to uniform traffic, this pattern shows better saturation load performance for Torus networks over Mesh networks even with few virtual channels. The figure also confirms the worst performance for the 2D Mesh/Torus networks. Interestingly, however, the 2D Torus network shows a better saturation rate than the 3D Mesh and the 4D Mesh/Torus networks.

Perfect Shuffle Traffic Pattern
The perfect shuffle traffic pattern also transmits packets between fixed source-destination pairs, where the destination address is obtained by rotating the source address left by 1 bit. Fig. 8 shows the perfect shuffle traffic performance for the conventional networks. This traffic pattern shows the exact opposite trend from the other two patterns: the figure confirms the superiority of Mesh networks over Torus networks in terms of both zero load latency and saturation load latency. The performance of the 2D networks is as poor as before, and the others show very narrow differences between them. Hence, a supercomputer designed with a Torus network should avoid the perfect shuffle pattern when running its applications.

Bit-Reverse Traffic Pattern
The bit-reverse traffic pattern sends packets from a source with address bits (a_0, a_1, ..., a_{n-1}) to the destination with the reversed address bits (a_{n-1}, ..., a_0), again a fixed pair. Fig. 9 shows the analysis for the bit-reverse traffic pattern. This pattern is very favorable for networks with a low number of nodes, ensuring better performance especially at saturation load; for example, a 5D network with few nodes can tolerate heavy traffic loads before getting saturated. However, other than the 2D Torus, the 3D, 4D, and even 5D Torus networks show worse saturation load performance than the Mesh networks.
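The three permutation patterns above are all fixed bit-manipulations of the source node address. A sketch of the destination functions, assuming b-bit node addresses (the function names are ours; BookSim defines these patterns in a similar bit-level fashion):

```python
def transpose(src, b):
    """Transpose: swap the high and low halves of the address bits."""
    half = b // 2
    low = src & ((1 << half) - 1)
    return (low << half) | (src >> half)

def perfect_shuffle(src, b):
    """Perfect shuffle: rotate the address left by 1 bit."""
    msb = src >> (b - 1)
    return ((src << 1) & ((1 << b) - 1)) | msb

def bit_reverse(src, b):
    """Bit-reverse: mirror the address bits (a_0..a_{b-1} -> a_{b-1}..a_0)."""
    out = 0
    for i in range(b):
        out = (out << 1) | ((src >> i) & 1)
    return out

# 16-node (4-bit address) examples
print(transpose(0b0110, 4))        # 0b1001 = 9
print(perfect_shuffle(0b1000, 4))  # 0b0001 = 1
print(bit_reverse(0b0010, 4))      # 0b0100 = 4
```

Because every source maps to one fixed destination, these patterns stress specific links instead of spreading load uniformly, which is why their saturation behavior differs so sharply between Mesh and Torus.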

ESTIMATION OF POWER CONSUMPTION
With modern advancements, the biggest concern for supercomputers is power dissipation. A modern supercomputer like the Sunway system achieves about 93 PFLOPS, but to do so it requires about 15.3MW of electrical power with its 2D Mesh network installed (Fu, 2016). This section explains the effect of on-chip power usage on high degree Mesh and Torus networks.

Assumptions for Power Model
Power consumption at the on-chip network level can be up to 50% of the total chip power usage (Liu & Svensson, 1994), due to the large number of on-chip links compared to the smaller number of off-chip links. Hence, only on-chip power estimation is considered for the various networks. The power estimation in this paper is based on on-chip network simulation with the table-based default routing of the GARNET network simulator (Agarwal et al., 2009), from which the authors estimated the static and dynamic power for the links as well as for the routers. Therefore, at the on-chip level, every interconnected link is taken into account.

On-Chip Power Model
The Orion energy model (Kahng et al., 2011) is used as the on-chip power model in this paper, assuming a 32nm fabrication process for the various networks. This allows the evaluation of the static and dynamic power usage of the routers and the inter-router links. For analyzing the various networks, this paper uses the GARNET network simulator (Agarwal et al., 2009) along with the Orion energy model. Dynamic power and leakage power are the main sources of power consumption; hence, the authors analyze both dynamic and leakage power for the routers and the interconnected links. The router's total energy depends on the energy consumed by the activity of the local and global arbiters (E_arb), the reads (E_br) and writes (E_bw) to the router buffers, and the total number of crossbar traversals (E_xb). Equation 7 shows the router's total energy consumption (Kahng et al., 2011):

E_router = E_arb + E_br + E_bw + E_xb (7)

The dynamic energy is defined in the Orion power model as equation 8, where α is the switching activity, C is the capacitance, and V is the supply voltage (Kahng et al., 2011):

E_dynamic = (1/2) α C V² (8)
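The two equations above can be sketched directly. The component energies and the capacitance value below are illustrative placeholders rather than Orion's calibrated parameters; only the formula structure follows the model:

```python
def dynamic_energy(alpha, cap, vdd):
    """Equation 8: E = 0.5 * alpha * C * V^2 (joules), with alpha the
    switching activity, cap the switched capacitance (F), vdd the supply (V)."""
    return 0.5 * alpha * cap * vdd ** 2

def router_energy(e_arb, e_br, e_bw, e_xb):
    """Equation 7: total router energy as the sum of the arbiter,
    buffer-read, buffer-write and crossbar-traversal components."""
    return e_arb + e_br + e_bw + e_xb

# Illustrative numbers: full switching activity, 1 pF switched, 1.2 V supply
e = dynamic_energy(1.0, 1e-12, 1.2)
print(e)  # ~7.2e-13 J per transition
```

Multiplying the per-transition energy by the switching frequency yields dynamic power, which is why the traffic load (and hence α) changes the dynamic term but leaves leakage untouched, as seen in the results below.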

Power Consumption for Mesh and Torus Networks
In January 2011, Samsung developed a 30nm ~ 39nm class fabricated processor module that achieves a data transfer rate of about 2.133 Gbit/s at a supply voltage of 1.2V (Intel, n.d.). Hence, this paper considers a clock frequency of 2.133 GHz, a supply voltage of 1.2V, the uniform traffic pattern, and a link length of 2.5mm. Table 5 shows the simulation conditions for the on-chip networks. The simulations cover power usage at traffic loads from 0.0015 to 0.0065 flits/cycle; although increasing the traffic load has no effect on leakage power, it highly affects the dynamic power usage. Fig. 10 shows the link power dissipation of the various networks, considering 64 nodes for the 2D and 3D Mesh/Torus networks, 256 nodes for the 4D Torus, and 512 nodes for the 5D Torus. Fig. 11 shows the router power dissipation, including the router leakage power, router dynamic power, and clock power, considering 64 routers for every network. Fig. 12 (with the same numbers of nodes as Fig. 10) and Fig. 13 (with the same number of routers as Fig. 11) show results at a traffic load of 0.0065 flits/cycle. Fig. 10 and Fig. 12 show that the 3D Mesh/Torus networks require higher link power usage than the 2D Mesh/Torus networks. In Fig. 12, the static power remains the same as in Fig. 10, but the dynamic power changes, and the 3D Torus requires less dynamic power than the 2D networks. On the other hand, the router power dissipation in Fig. 11 and Fig. 13 shows that Torus networks require higher power usage than Mesh networks and that clock power accounts for a high portion of the total power usage. Even though the traffic load changes in Fig. 13, the static power and clock power remain the same as in Fig. 11. The router power analysis shows that high degree networks consume much more power than low degree networks.

EFFECTS OF VIRTUAL CHANNELS
Virtual channels are key to solving the deadlock avoidance problem in any interconnection network; moreover, they help to increase the network throughput. Suppose a single router receives two packets (P_0, P_1) from two different source nodes and is expected to forward both through the same physical link. Without virtual channels, one of them could block the whole physical link until its transmission is over; with virtual channels, both packets can move forward at half speed in a flit-by-flit manner. Hence, MPC systems must consider the number of virtual channels needed to make their networks deadlock-free. For instance, the BlueGene/L supercomputer uses 4 virtual channels for its deadlock-free routing (IBM Blue Gene 100, n.d.): two are used for deterministic routing and the other two for adaptive routing. In this section, the authors investigate the effects of using more virtual channels in the 3D Mesh/Torus networks with respect to packet latency and router-level power usage. The simulation environment and parameters for Table 6 are similar to Table 4, with 4096 nodes and the uniform traffic pattern; however, only the zero load injection rate (0.0015 flits/cycle) is considered. Table 6 shows the average packet latency at zero traffic load for the 3D Mesh/Torus networks using 2 and 4 virtual channels; the results illustrate that the Mesh network yields the worst performance in both cases. On the other hand, Table 7 shows the power estimation for two and four virtual channels based on the simulation conditions of Table 5 with 64 nodes only. This analysis shows that router static power is highly affected by a change in the number of virtual channels, whereas router clock power and dynamic power show little or no impact. Hence, the consideration of a suitable number of virtual channels is also important for reducing the total power usage of MPC systems.
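The flit-by-flit link sharing described above can be sketched as round-robin arbitration between virtual channel queues (a toy model of our own; real routers arbitrate per cycle with credits and finite buffers):

```python
from collections import deque

def share_link(vcs):
    """Round-robin multiplexing: each cycle, every non-empty virtual
    channel in turn places one flit on the shared physical link."""
    link = []
    while any(vcs):
        for vc in vcs:
            if vc:
                link.append(vc.popleft())
    return link

# Two packets buffered in two virtual channels of the same router
p0 = deque(["P0.flit0", "P0.flit1"])
p1 = deque(["P1.flit0", "P1.flit1"])
print(share_link([p0, p1]))
# Both packets advance at half link bandwidth instead of P1 waiting for P0
```

With a single channel, P1 would stall until all of P0 had drained; the interleaved schedule is what removes the head-of-line blocking and the associated deadlock cycles.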

DISCUSSION
The main contribution of this research is the power and performance analysis of high dimensional Mesh and Torus networks; the authors also investigated the effects of virtual channels on the three-dimensional Mesh and Torus networks. It is evident that high degree networks like three- or four-dimensional Mesh/Torus networks show a higher saturation rate and better network latency than two-dimensional Mesh/Torus networks. However, the power usage of two-dimensional Mesh/Torus networks is lower than that of the high dimensional networks; for example, the 2D Mesh network requires about 24.22% less router power than the 3D Mesh network. On the other hand, considering only 2 virtual channels, the Torus saturates earlier than the Mesh network, whereas with four virtual channels the Torus network shows a much better saturation rate than the Mesh networks. The Mesh network even shows slightly poorer zero load latency with four virtual channels than with two, a decrease of about 0.04% in clock cycles. In the case of two virtual channels, the difference in clock cycles at zero load latency between the 3D Mesh and the 3D Torus is about 7.03%.

CONCLUSION
In this research, the authors analyzed the network performance of Mesh and Torus networks using parameters like diameter and average distance for the static network performance, and network latency for the dynamic network performance. Regarding performance, this paper showed that the 3D Torus can achieve about 57.07% better dynamic communication performance than the 2D Torus network, and the 4D Torus about 25.95% better than the 3D Torus network, considering zero load latency under the uniform traffic pattern. Moreover, low degree networks saturate much faster than high degree networks. In the power analysis of Mesh and Torus, considering the links' and routers' static as well as dynamic power usage, the authors showed that the 2D Mesh network requires about 24.22% less router power than the 3D Mesh network, and that the 5D Torus requires about 66.8% higher router power than the 3D Torus network at a traffic load of 0.0015 flits/cycle; at 0.0065 flits/cycle, the 5D Torus requires about 67.31% higher router power than the 3D Torus. This analysis confirms that high degree networks require huge power even at the on-chip level. Furthermore, in analyzing the effects of virtual channels on Mesh and Torus networks, the difference in clock cycles at zero load latency between the 3D Mesh and the 3D Torus is about 8.17% with four virtual channels. On the other hand, router static power increases significantly with the number of virtual channels; for example, for the 3D Torus network, 2 VCs require 71.98% less static power than 4 VCs. Hence, the proper choice of the number of virtual channels is very important for both network performance and power usage.