On Allocation Algorithms for Manycore Systems With Network on Chip

Single-chip multicore processors and their network on chip interconnection mechanisms have received extensive interest since the early 2000s. The mesh topology is popular in networks on chip. A common issue with meshes is that they can result in high energy consumption and chip temperatures. It has recently been shown that mapping communicating tasks to neighboring cores can reduce communication delays and the associated power consumption and improve throughput. This paper evaluates the contiguous first-fit allocation strategy and non-contiguous allocation strategies that attempt to achieve a degree of contiguity among the cores allocated to a job. One of the non-contiguous strategies is a new strategy, referred to as the neighbor allocation strategy, which decomposes the job request so that it can be accommodated by free core submeshes and individual cores that have a degree of contiguity. The results show that the relative merits of the policies depend on the job's communication pattern.


INTRODUCTION
Single-chip multicore processors and their network on chip (NoC) interconnection mechanisms have received extensive interest since the early 2000s. They have become attractive as the number of transistors that can be placed on a single die has continued to increase, as per Moore's law. A multicore processor consists of multiple processing cores that can execute instructions in parallel. Multicore processors that consist of a large number (tens to thousands) of relatively simple cores are referred to as manycore processors. Manycore processors aim to achieve a high level of explicit parallelism.
The mesh interconnection topology is popular in NoCs. This includes both 2D and 3D meshes (Matsutani et al., 2007; Wentzlaff et al., 2007). An example is the 2D 8×10 mesh interconnection network used in an Intel manycore research chip that integrates 80 cores. Each of the cores contains a processing element (PE) and a 5-port router for communication (Vangal et al., 2007). Four ports are used for communicating with the east, west, north, and south neighbors of internal PEs or cores, as in Figure 1. Edge and corner cores have fewer neighbors. The fifth port is for communication with the PE. Another example is the TRIPS On-Chip Network (OCN), which uses a 4×10 wormhole-routed 2D mesh interconnection network (Gratz et al., 2007). A 1024-node manycore system that has a mesh topology and is partitioned into 32 clusters, where each cluster is a 4×8 mesh, has been proposed recently. To decrease the system's overall diameter, each cluster has in its middle a Radio Hub (RH) that attaches to the four routers of the middle cluster node. Communication between clusters takes place through the Radio Frequency (RF) waveguides of the RHs. Within clusters, communication takes place over a mesh interconnection network (Lahdhiri et al., 2020).
A 2D mesh interconnection network (see Figure 1) is an example of a direct network, in which each processor or core is connected directly to its neighbors. In addition to being used over the past two decades in multicore systems, the 2D mesh interconnection network was used in earlier large-scale multicomputer systems. This is due to its simplicity, regularity, scalability (i.e., it can be scaled up as the system needs), ease of implementation, and ability to benefit from the locality property in communication for many parallel applications. The 3D mesh is an extension of the 2D mesh, where several 2D mesh chips are stacked vertically on top of each other with additional up and down interconnection links, as appropriate.
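To make the direct-network structure concrete, the following short sketch (function name and coordinate convention are ours, not from the paper) computes the east, west, north, and south neighbors of a core in a w×h mesh; internal cores have four neighbors, while edge and corner cores have fewer:

```python
def mesh_neighbors(x, y, w, h):
    """Direct (east, west, north, south) neighbors of core (x, y)
    in a w x h 2D mesh; coordinates run from (0, 0) to (w-1, h-1)."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(nx, ny) for nx, ny in candidates if 0 <= nx < w and 0 <= ny < h]

# An internal core has 4 neighbors; a corner core has only 2.
print(len(mesh_neighbors(3, 3, 8, 10)))  # 4
print(len(mesh_neighbors(0, 0, 8, 10)))  # 2
```

In the 80-core Intel chip described above, each of these links corresponds to one of the four inter-core ports of the 5-port router; the fifth port connects the router to its local PE.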
A common issue in mesh NoCs is that they can suffer from high energy consumption and temperatures. Mapping communicating tasks to neighboring cores can reduce communication delays and associated power consumption and improve throughput and job execution times (Mosayyebzadeh et al., 2016; Agyeman et al., 2018; Dahir et al., 2021).

Core Allocation
We assume in this paper that a subclass of manycore systems implemented as a system on chip (SoC) will be used as general-purpose multicomputers. In such systems, the performance of an application depends on the job scheduling strategy, the core allocation strategy, and the characteristics of the SoC, including the NoC used. The job scheduler determines the order in which jobs are executed. The core allocation strategy determines the set of cores or PEs on which a job that is selected for execution will run. In addition, we assume that the NoC used has the 2D mesh topology and that jobs request the allocation of submeshes of size a×b upon arrival. As in previous works, we assume static space-sharing allocation, where a job is allocated a set of cores exclusively until it terminates execution (Lo et al., 1997; Bani-Mohammad & Ababneh, 2018).
In general, two types of static space-sharing allocation have been investigated by researchers: contiguous and non-contiguous allocation (Chuang & Tzeng, 1994; Lo et al., 1997; Seo & Kim, 2003; Bani-Mohammad & Ababneh, 2018). In contiguous allocation, physical contiguity among allocated cores is required, and the shape of the allocated submesh must be the same as that requested. This submesh restriction causes the fragmentation problem, which directly influences the ability to utilize the PEs of the system. Fragmentation has two types: internal fragmentation and external fragmentation. Internal fragmentation occurs when the allocation strategy assigns to the job more cores than it has requested. External fragmentation occurs when there are enough free cores, but the allocation algorithm is unable to allocate them because of the contiguity condition or a lack of recognition by the algorithm.
Figure 2-A illustrates internal fragmentation, where a job that requests two cores is allocated four cores, causing internal fragmentation of 50%. The cores allocated are within the black frame. Figure 2-B illustrates external fragmentation, assuming that contiguous allocation is used, and a job requests eight cores as a 2×4 submesh. Eight cores are available in the mesh system, but they are not allocated because they are not contiguous.
In non-contiguous allocation, by contrast, neither physical contiguity among the allocated cores nor the same shape as the request is required; wherever cores are free, they can be allocated to the job.
Researchers have proposed non-contiguous allocation schemes that aim to improve the performance of the system in terms of average response time, which is the total time that the job spends in the computer system from arrival until departure, and in terms of system utilization, which is the percentage of cores that are utilized over time. Non-contiguous allocation can suffer from communication overhead, as distances between communicating entities can be high, and interference from the messages of other co-running applications can occur. These effects can be reduced by preserving a good degree of contiguity among the cores allocated to the same job to decrease the network distances among communicating entities (Seo & Kim, 2003; Bani-Mohammad et al., 2007; Bani-Mohammad & Ababneh, 2018). In addition to improving communication delays and execution times, assigning communicating tasks to neighboring cores can reduce power consumption and chip temperatures (Mosayyebzadeh et al., 2016; Agyeman et al., 2018; Dahir et al., 2021).
Motivated by the above observations, a new allocation algorithm for 2D mesh systems is proposed in this paper. The algorithm aims to improve system performance by trying to allocate submeshes that have some degree of contiguity among their cores. It starts by allocating large available rectangular submeshes, and then allocates neighbor cores. The proposed algorithm mainly aims to decrease the communication overhead in the network by preserving contiguity among the allocated cores.
The performance of the proposed algorithm is compared with that of the well-known contiguous First Fit (FF) algorithm (Zhu, 1992) and the non-contiguous L-Shaped Submesh Allocation algorithm (LSSA) (Seo & Kim, 2003) using simulations. The influence of different communication patterns on the performance of these allocation algorithms is studied. The communication patterns considered are one-to-all, all-to-all, and near-neighbor.
The simulation results show that the performance of the proposed algorithm is better than that of the previous contiguous and non-contiguous allocation strategies considered in this work in terms of system utilization. This is because of its ability to alleviate external fragmentation. In terms of average response times, the simulation results show that the proposed algorithm has better performance than all other schemes when the one-to-all communication pattern is used. However, for the all-to-all and near-neighbor communication patterns, the results show that the performance of the proposed strategy is better than that of LSSA, but FF is the best.

BACKGROUND AND RELATED WORKS
This section briefly describes some well-known contiguous and non-contiguous allocation strategies for 2D meshes that have been proposed and investigated.

2D Buddy System
The Two-Dimensional Buddy System, known as 2DBS (Li & Cheng, 1991), is applicable only to power-of-two square mesh systems and square submesh requests. A job that requests a submesh of size a×a is allocated a square submesh of side length 2^k, where k = ⌈log2 a⌉. For instance, if a job requests 9 PEs as a 3×3 submesh, then k = 2 and a 4×4 submesh is allocated to that request, which results in an internal fragmentation of about 44%. This scheme suffers from both internal and external fragmentation, in addition to requiring that the system and requests be square (Chuang & Tzeng, 1994).
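The 2DBS arithmetic above can be reproduced in a few lines (function names are ours, for illustration only):

```python
import math

def buddy_side(a):
    """Side length 2**k allocated by 2DBS to an a x a request, k = ceil(log2 a)."""
    return 2 ** math.ceil(math.log2(a))

def internal_fragmentation(requested, allocated):
    """Fraction of allocated cores that the job did not request."""
    return (allocated - requested) / allocated

side = buddy_side(3)                      # a 3x3 request receives a 4x4 submesh
frag = internal_fragmentation(3 * 3, side * side)
print(side, round(frag * 100))            # 4, 44 (i.e., ~44% internal fragmentation)
```

The exact value is (16 − 9)/16 = 43.75%, which the paper rounds to 44%.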

First-Fit (FF) and Best-Fit (BF) Allocation Strategies
First-Fit (FF) and Best-Fit (BF) allocation strategies (Zhu, 1992) are applicable to any mesh system regardless of its size. In these strategies, internal fragmentation is eliminated by allocating precisely the requested number of cores or processors. The nodes that can serve as base nodes for submeshes that can hold the job are stored in an array of the size of the mesh. FF selects the first free suitable node as the base node for allocation. BF, on the other hand, chooses the node that has the largest number of busy neighbors and the smallest number of free areas as the base node for the allocated submesh. The simulation results (Zhu, 1992) show that FF and BF perform similarly in both average response time and system utilization. FF and BF suffer from external fragmentation, as they are contiguous policies and do not support changing the orientation of the job request, which degrades system performance (Seo & Kim, 2003). Switching the request orientation was proposed in (Ding & Bhuyan, 1993), where allocation of a b×a submesh is attempted if allocation fails for the requested a×b submesh.
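A minimal sketch of first-fit submesh search, assuming a set of busy cores and row-major scanning order (the representation is ours; the original uses an array of candidate base nodes):

```python
def first_fit(busy, w, h, a, b):
    """First-fit: return the first base node (x, y) whose a x b submesh
    (a wide, b high) is entirely free, scanning in row-major order;
    return None if no such base node exists. `busy` is a set of (x, y)."""
    for y in range(h - b + 1):
        for x in range(w - a + 1):
            if all((x + i, y + j) not in busy
                   for i in range(a) for j in range(b)):
                return (x, y)
    return None

# With core (0, 0) busy in a 4x4 mesh, the first free 2x2 base node is (1, 0).
print(first_fit({(0, 0)}, 4, 4, 2, 2))  # (1, 0)
```

BF would instead score every candidate base node (e.g., by the number of busy neighbors of the resulting submesh) and pick the best one, which this sketch omits.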

Flexfold Strategy
In this strategy (Gupta & Jayendran, 1996), both switching the orientation of the request and changing its shape are utilized. For a job request for a submesh S(a,b), the Flexfold strategy attempts allocation for the original shape, its rotation, and folded shapes in which one side length is halved and the other doubled, using a first-fit strategy. Despite improving system utilization, this strategy suffers from external fragmentation and has a constraint on the job side lengths, where both sides must be even (Seo & Kim, 2003).

All-Shapes First-Fit Sub-Mesh Allocation Strategy (ASFF)
The All-Shapes First-Fit contiguous submesh allocation strategy (ASFF) (Ababneh et al., 2010) attempts allocation of an incoming job request by permitting all possible 2D shapes. In ASFF, given a job request for n cores, all valid request shapes are constructed, and then these shapes are considered for allocation in a specific order. Priority is given to shapes that have smaller differences between width and height. The allocation process stops upon the first successful allocation. For example, if 12 processors are requested and the mesh system size is 6×6, then ASFF generates the following request sequence: (4,3), (3,4), (6,2), and (2,6), while if a job requests 25 processors, then only one shape, (5,5), is generated. ASFF uses FF for allocation.
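The shape-generation step of ASFF can be sketched as follows. The secondary tie-break (preferring the wider of two shapes with equal width-height difference) is inferred from the paper's example sequence and is an assumption on our part:

```python
def asff_shapes(n, w, h):
    """All a x b shapes with a*b == n that fit in a w x h mesh, ordered so
    shapes with the smallest |width - height| come first; ties broken by
    putting the wider shape first (assumed from the example in the text)."""
    shapes = [(a, n // a) for a in range(1, n + 1)
              if n % a == 0 and a <= w and n // a <= h]
    return sorted(shapes, key=lambda s: (abs(s[0] - s[1]), -s[0]))

print(asff_shapes(12, 6, 6))  # [(4, 3), (3, 4), (6, 2), (2, 6)]
print(asff_shapes(25, 6, 6))  # [(5, 5)]
```

Each generated shape would then be passed to FF in turn until one is allocated.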

L-Shape Sub-Mesh Allocation Strategy (LSSA)
The L-Shape Sub-mesh Allocation strategy (LSSA) (Seo & Kim, 2003) uses a scheme that changes a rectangular submesh request S(a,b) into an L-shaped request. This is done by virtually removing a smaller rectangular submesh R(c,d) from S(a,b) and attaching it to a side of the remaining block in S. The lengths of the attachment sides in R and the remaining block of S are chosen to be equal, to produce an L-shaped allocation request for a×b processors. Note that there are normally multiple choices for R, resulting in various L-shaped requests.
In LSSA, if the allocation of a submesh S(a,b) is requested, LSSA first checks whether the requested number of processors a×b is free in the system. If not, allocation fails. Otherwise, LSSA attempts allocation for the L-shaped submeshes in order until allocation succeeds or all choices are exhausted.

Minimum Interference Paging (MIP)
The Minimum Interference Paging strategy (MIP) (Bani-Mohammad & Ababneh, 2018) attempts to reduce the distances among allocated cores using a paging variant that chooses a set of cores with the lowest distance between the first and last allocated cores. In this strategy, the valid request shapes are generated for the number of cores requested. Then, the successive shapes are first considered for contiguous allocation using the FF policy (Zhu, 1992). Contiguous allocation is given priority because it eliminates interference among the messages of different jobs, which reduces contention and communication latency. If contiguous allocation fails, the MIP algorithm (Bani-Mohammad & Ababneh, 2018) is applied: the cores are scanned, as in Paging[0] (Lo et al., 1997), starting with the first core in the mesh system, to determine the base and end cores of all available sets of cores that can accommodate the job request. The free cores with the lowest distance between the base and end cores are allocated to the job request, where the distance of a set is the number of cores visited between its base and end cores during the scan.
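The core-selection step of MIP amounts to a sliding-window minimization over the scan order. A sketch under our own assumptions (the distance is taken here as the difference between the end and base scan indices; the paper's exact convention may differ by one):

```python
def mip_select(free_scan_indices, n):
    """Given the scan (paging) indices of the free cores, in row-major scan
    order, and a request for n cores, return the n free cores whose
    base-to-end scan distance is smallest; None if fewer than n are free."""
    idx = sorted(free_scan_indices)
    if len(idx) < n:
        return None
    # Every candidate set is n consecutive entries of idx; minimize the span.
    best = min(range(len(idx) - n + 1), key=lambda i: idx[i + n - 1] - idx[i])
    return idx[best:best + n]

# Free cores at scan positions 0, 1, 5, 6, 7, 12; a request for 3 cores
# picks the tightest window, 5..7.
print(mip_select([0, 1, 5, 6, 7, 12], 3))  # [5, 6, 7]
```

Minimizing this span keeps the allocated cores close together in the scan order, which is what reduces inter-job interference relative to plain paging.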

Efficient Maximal Free Submesh Detection Scheme for Space-Sharing Allocation in Manycore Systems with 2D NoCs (PMFL)
In this strategy (Ababneh & Bani-Mohammad, 2022), an efficient recognition-complete maximal free submesh detection scheme for 2D mesh-connected manycore systems is proposed. An advantage of this scheme over the previous recognition-complete scheme (KYMFL), proposed in (Kim & Yoon, 1998), is that its time complexity is quadratic in the number of free submeshes, while the time complexity of the previous scheme is cubic in this number. The two schemes have been evaluated and compared using detailed simulations, in which various promising previous policies for deciding where allocation takes place are considered. The simulation results show that when allocation and de-allocation times are considered, the proposed submesh detection scheme (PMFL) substantially outperforms the previous maximal free submesh detection scheme (KYMFL), achieving up to 70% improvement in the measured combination of these times.

PROPOSED NEIGHBOR ALLOCATION STRATEGY (NAS)
The main aim of the proposed Neighbor Allocation Strategy (NAS) is to decrease job communication costs and achieve good system utilization by trying to execute jobs on submeshes that have contiguity among their cores; in this context, contiguity means that the network routers of the cores are connected directly or via the router of at least one intermediate core.

The Neighbor Allocation Strategy (NAS)
A 2D mesh is represented by M(h,w), where h is the height of the mesh and w is its width. Each node is identified by two coordinates (x,y), where 0 ≤ x < w and 0 ≤ y < h. A node P and a submesh S(a,b) are said to be neighbors when P is directly connected to one of the border nodes of the submesh. For example, node P = (4,1) in Figure 3 is a neighbor of the submesh border node (3,1), and node P = (1,4) is a neighbor of the submesh border node (1,3). In general, Figure 3 clarifies the neighbor relation between a node and a submesh: the submesh S(3,3), whose nodes are shown as black circles, is a neighbor of the set of nodes {(4,1), (4,2), (4,3), (1,4), (2,4), (3,4)}. Also, if (4,1) is allocated to the same job as S(3,3), then (5,1) is considered a neighbor of the submesh S. Likewise, (4,4) is considered a neighbor of the submesh S if (4,3) or (3,4) is allocated to the same job.
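The neighbor relation can be computed as the 4-connected frontier of the set of cores already allocated to the job. The sketch below reproduces the Figure 3 example under the assumption (ours) that the figure's coordinates start at (1,1):

```python
def neighbor_set(allocated, valid):
    """All non-allocated nodes of the mesh that are direct (4-connected)
    neighbors of at least one node already allocated to the job."""
    result = set()
    for (x, y) in allocated:
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in valid and nxt not in allocated:
                result.add(nxt)
    return result

# Figure 3 example: a 3x3 submesh with base (1, 1), in a mesh whose
# nodes we assume to be (1..5, 1..5).
mesh = {(x, y) for x in range(1, 6) for y in range(1, 6)}
sub = {(x, y) for x in range(1, 4) for y in range(1, 4)}
print(sorted(neighbor_set(sub, mesh)))
# [(1, 4), (2, 4), (3, 4), (4, 1), (4, 2), (4, 3)]
```

Note that the relation grows with the allocation: once (4,1) is added to the allocated set, (5,1) joins the neighbor set, exactly as stated in the text.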

Figure 3. Neighbor relation between a node and a sub-mesh
An incoming job request is represented by S(a,b), where the number of PEs requested is a×b. The NAS scheme initially checks whether the requested number of processors is available in the system. If not, allocation fails. Otherwise, it searches for free submeshes in the following sequence until the requested free submeshes are allocated or allocation fails. Initially, NAS tries allocation for the original submesh S(a,b) using FF, and if this fails, ASFF is attempted for S(a,b). If this fails, LSSA is applied by constructing submeshes from S(a,b) and its rotation S(b,a), and attempting FF allocation for the constructed submeshes. If this fails, the following NAS scheme is applied.
The largest submesh that NAS assigns to the job is called the nucleus submesh (NS). For any job that requests a submesh S(a,b), the NS is constructed by reshaping the request as in the L-shape (LS) submesh construction used in LSSA (Seo & Kim, 2003). Next, NAS searches the system to allocate the remaining requested processors as neighbors of the already allocated processors. The goal is to maintain a degree of contiguity while attempting to increase system utilization.
This allocation process is implemented by the algorithm in Figure 4. Note that allocation succeeds if the number of free processors is greater than or equal to a×b.
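The overall NAS flow described above can be sketched as a driver that fails fast when too few cores are free and otherwise tries the strategies in order (FF, ASFF, LSSA, then the nucleus-plus-neighbors step). The strategy callables below are hypothetical stand-ins, not the paper's implementations:

```python
def nas_allocate(request, free_count, strategies):
    """High-level NAS flow (a sketch): fail fast if the mesh does not hold
    enough free cores, then try each allocation strategy in order until
    one succeeds; return None if all of them fail."""
    a, b = request
    if free_count < a * b:
        return None                      # not enough free cores anywhere
    for strategy in strategies:
        result = strategy(request)
        if result is not None:
            return result
    return None

# Hypothetical stand-ins: the first two strategies fail, the third succeeds.
fail = lambda req: None
succeed = lambda req: "allocated"
print(nas_allocate((5, 3), 20, [fail, fail, succeed]))  # allocated
```

The real algorithm (Figure 4) differs in that the final step decomposes the request into a nucleus submesh plus neighbor cores rather than treating it as one opaque strategy call.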

Figure 4. Outline of the NAS allocation algorithm
Figure 5 shows the initial state of an example 2D mesh-connected multicore system, where all free (not allocated) cores are shown as white circles.
We give some carefully selected examples to clarify how NAS works. In the first example, a job requests a submesh of size 5×3, as shown in Figure 6. The NAS strategy tries FF, ASFF, and LSSA. When these fail, NAS rebuilds the job request by constructing its nucleus submesh of size 2×6 and allocates it in the mesh system, as shown in Figure 7. Next, NAS searches for free neighbor nodes. As shown in Figure 7, the neighbor cores (1, 2, and 3) are allocated to the job. Assuming the system state shown in Figure 8, the second example shows an incoming job that requests a submesh of size 6×3. Here, NAS rebuilds the request as a 6×2 nucleus submesh and allocates it in the system; then 6 more cores are allocated to the job, as shown in Figure 9.
In the third example, a job requests a 5×1 submesh, as shown in Figure 10. NAS tries to form a nucleus submesh, but this fails although the required number of cores is free in the mesh system. In this case, NAS considers the nucleus submesh to be 1×1 and searches the system to allocate the remaining cores while keeping contiguity, as shown in Figure 11. In Figure 12, the request is for a submesh of size 5×2. In this example, NAS tries allocation for the nucleus submesh 4×3, but this is not available. Therefore, NAS tries again to rebuild the request by forming another nucleus submesh as 5×2. Again, this shape is not available. NAS forms another nucleus submesh as 6×1, allocation succeeds for this submesh, and the 4 remaining processors are then allocated as shown in Figure 13.

SIMULATION RESULTS
In our simulation experiments, carried out using the ProcSimity v4.3 simulator (ProcSimity v4.3, 1996), we assume a two-dimensional mesh-connected manycore system where cores or PEs are interconnected using a direct 2D NoC, and we use three common communication patterns for comparing the performance of FF, LSSA, and the proposed strategy. The communication patterns are the Near-Neighbor, One-to-All, and All-to-All patterns. In Near-Neighbor communication, every core allocated to the job sends a message to each of its four neighbor cores: the east, west, north, and south neighbors. This pattern was added to ProcSimity v4.3 in (Bani-Mohammad & Ababneh, 2013). In One-to-All communication, a core, randomly selected among those allocated to the same job, sends a message to all cores that are allocated to the same job (Lo et al., 1997). In All-to-All communication, every core assigned to a job sends a message to all other cores executing the same job (Lo et al., 1997). A job runs until all messages it sends are received, and each job executes exactly one iteration of the given communication pattern. ProcSimity is a software simulation tool for processor allocation and job scheduling schemes in meshes and k-ary n-cube multicomputers. It is an open-source simulator that was developed at the University of Oregon and written in the C programming language (ProcSimity v4.3, 1996). The ProcSimity simulator has been updated in (Bani-Mohammad & Ababneh, 2013; Ababneh & Bani-Mohammad, 2022; Al Abass et al., 2022) by adding new communication patterns and new allocation and scheduling strategies. ProcSimity is suitable for evaluating processor allocation and job scheduling strategies for mesh-connected multicomputers and manycore systems with direct 2D NoCs.
We conduct enough runs to achieve a confidence level above 95% that the relative errors are at most 5%. In each run, the measured metrics, which include job response time, system utilization, service time, and finish time, are calculated. Overall average performance values are computed at the end of the runs. The performance metrics considered are the average job response time and system utilization. The number of jobs per run is 1000.
The mesh system in the simulations is a 2D square mesh with side length L. Job inter-arrival times are exponentially distributed. The job scheduling strategy used in this work is First-Come First-Served (FCFS). FCFS is used because it is fair, and the goal is to compare allocation schemes, not scheduling policies.
The job arrival rate is the inverse of the mean inter-arrival time of jobs. Two distributions are used to generate the widths and heights of job requests. The first is the uniform distribution over [1, L], where the width and height of a request are generated independently. The second is the uniform-decreasing distribution. It is governed by four probabilities p1, p2, p3, and p4, and three integers l1, l2, and l3, where the probabilities that the width and height of a request fall in the ranges [1, l1], [l1+1, l2], [l2+1, l3], and [l3+1, L] are p1, p2, p3, and p4, respectively. Again, the width and height of a request are generated independently, and all side lengths within a range are equally likely. For the simulation experiments in this work, L = 16, p1 = 0.4, p2 = 0.2, p3 = 0.2, p4 = 0.2, l1 = L/8, l2 = L/4, and l3 = L/2. These distributions have often been used in the literature (Lo et al., 1997; Bani-Mohammad & Ababneh, 2018).
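The uniform-decreasing side-length generation can be sketched as follows (the function name is ours; with L = 16 the ranges are [1,2], [3,4], [5,8], and [9,16]):

```python
import random

def uniform_decreasing_side(L=16, probs=(0.4, 0.2, 0.2, 0.2)):
    """Sample one side length of a job request from the uniform-decreasing
    distribution: pick a range with probability p1..p4, then a length
    uniformly within it. Uses l1 = L/8, l2 = L/4, l3 = L/2 as in the text."""
    l1, l2, l3 = L // 8, L // 4, L // 2
    ranges = [(1, l1), (l1 + 1, l2), (l2 + 1, l3), (l3 + 1, L)]
    lo, hi = random.choices(ranges, weights=probs)[0]
    return random.randint(lo, hi)

# Width and height are drawn independently.
w, h = uniform_decreasing_side(), uniform_decreasing_side()
print(1 <= w <= 16 and 1 <= h <= 16)  # True
```

Because p1 is largest and the first range is narrowest, small side lengths are disproportionately likely, which is why successful allocation becomes more probable under this distribution.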
Wormhole switching is used as the flow control mechanism in the interconnection network. Flits are assumed to take one time unit to move between two adjacent nodes, and ts time units to be routed through a node. Message sizes are denoted by Plen. The performance figures in this paper are based on using ts = 3 time units and Plen = 8.
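Under these parameters, a standard no-contention wormhole latency model (our simplification, not necessarily the simulator's internal model, and assuming Plen is measured in flits) is: the header flit pays ts routing time plus one link time at every hop, and the remaining Plen − 1 flits follow in a pipeline, one per time unit.

```python
def wormhole_latency(hops, t_s=3, p_len=8):
    """No-contention wormhole latency estimate:
    header flit: hops * (t_s + 1); pipelined body: p_len - 1 time units."""
    return hops * (t_s + 1) + (p_len - 1)

print(wormhole_latency(1))   # 11
print(wormhole_latency(4))   # 23
```

The distance term grows only linearly with the hop count, which is why wormhole switching keeps latency low, yet it is also why keeping communicating cores close together (fewer hops) still pays off, especially when contention is added.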

System Utilization Results
Figures 14 to 19 show the mean system utilization for the One-to-All, All-to-All, and Near-Neighbor communication patterns, for both the uniform and uniform-decreasing job size distributions.
The results show that for heavy system loads (i.e., high job arrival rates), NAS achieves higher utilization than the remaining policies for the two job size distributions and all communication patterns considered. This means that allocation is more likely to succeed in NAS than in FF and LSSA. However, because higher system utilization may be accompanied by longer execution times, due to the longer distances and congestion that can occur in non-contiguous allocation, response times are a better metric. Moreover, they are of primary direct interest to users.
In Figure 14, for example, the mean system utilization for all considered allocation algorithms is almost the same for job arrival rates below 0.0005 jobs/time unit. However, for job arrival rates above 0.001 jobs/time unit, the performance of NAS is better than that of LSSA and FF. This is because NAS is better able to avoid external fragmentation by utilizing the free processors. In Figure 15, it can be seen that the mean system utilization improves for all allocation strategies when the uniform-decreasing distribution is used. This is because of the increased probability of generating jobs that are small compared to the size of the mesh system; successful allocation is more probable for smaller jobs. NAS also performs much better than LSSA and FF for high job arrival rates.
In Figure 16, the mean system utilization is almost the same for NAS and LSSA for job arrival rates below 0.00005 jobs/time unit, and it is better than that of FF. This is again because of the ability of NAS and LSSA to better avoid external fragmentation. For high job arrival rates (0.00006667 jobs/time unit and above), NAS performs better than LSSA and FF, because it avoids external fragmentation better than LSSA does.
In Figure 17, the mean system utilization improves for all allocation strategies when the uniform-decreasing distribution is applied. This is because of the increased probability of generating jobs that are small compared to the size of the mesh system, which makes successful allocation more probable. NAS performs much better than LSSA and FF.
In Figure 18, the results show that the mean system utilization of LSSA is better than that of NAS and FF when the job arrival rates are below 0.00333333 jobs/time unit. However, NAS performs better than LSSA and FF when the job arrival rates are above 0.004 jobs/time unit. This is because of the ability of NAS to better avoid external fragmentation when the load is heavy.
In Figure 19, the results show that the mean system utilization of LSSA is better than that of NAS and FF for job arrival rates below 0.0125 jobs/time unit. However, NAS performs better than LSSA and FF for job arrival rates above 0.025 jobs/time unit, again because of the ability of NAS to better avoid external fragmentation under heavy loads.

Average Response Time
Figures 20-25 show the average response time results for the One-to-All, All-to-All, and Near-Neighbor communication patterns when the FCFS scheduling scheme is used, for both job size distributions considered, uniform and uniform-decreasing.
Figures 20 and 21 show that the NAS allocation strategy outperforms all other allocation strategies for the One-to-All communication pattern under both the uniform and uniform-decreasing job size distributions. This is because of NAS's ability to reduce external fragmentation compared to FF and LSSA, which increases the probability of successful allocation and hence improves system performance in terms of job response time.
For the All-to-All and Near-Neighbor communication pattern results shown in Figures 22-25, FF has the best performance among the allocation strategies considered for both the uniform and uniform-decreasing job size distributions. This is because contiguous allocation strategies allocate a rectangular submesh to each job request, which can reduce inter-job interference and communication distances, and thus the communication overhead in the system. However, NAS is better than LSSA, presumably because of its ability to achieve superior system utilization.

CONCLUSION AND FUTURE DIRECTIONS
Contiguous allocation schemes allocate a single submesh to each job, and they suffer from processor fragmentation. In non-contiguous allocation, a job can be executed on several small submeshes instead of waiting for one submesh of the requested size and shape to become available. The goal is to improve system performance by minimizing processor fragmentation; however, non-contiguous allocation can result in high communication overhead. Generally, the aim of any allocation strategy is to improve system performance by increasing system utilization and decreasing jobs' communication overhead and response times.
Motivated by the previous observations, NAS has been proposed. Its goal is to reduce external processor fragmentation while preserving a good degree of contiguity among the allocated processors to alleviate the communication overhead. In NAS, the job request can be allocated to several small submeshes that have a large degree of contiguity among them. In the first step, the nucleus submesh is constructed, and then the remaining required processors are allocated as neighbors of the nucleus submesh or of other processors allocated to the job.
Simulation experiments have been conducted to compare FF, LSSA, and NAS. The results show that, in terms of system utilization, the performance of NAS is overall better than that of FF and LSSA. The results also show that, in terms of average response times, NAS is better than all other schemes when the One-to-All communication pattern is used. On the other hand, when the All-to-All and Near-Neighbor communication patterns are used, the contiguous FF allocation scheme outperforms the non-contiguous allocation schemes considered. This is because contiguous allocation assigns a single submesh to each job request, which decreases inter-job interference and communication distances. However, NAS is better than LSSA because of its ability to maintain more contiguity.
As a continuation of this research, it would be interesting to study the performance of the proposed NAS allocation strategy with different scheduling schemes, such as window-based scheduling (Ababneh & Bani-Mohammad, 2011). It would also be interesting to assess the proposed NAS allocation strategy using other performance metrics, such as power consumption.

Figure 13. The allocation of a 5×2 job in the system

Figure 14. System utilization vs. arrival rate using the One-to-All communication pattern and uniform distribution for job size in a 16×16 mesh

Figure 15. System utilization vs. arrival rate using the One-to-All communication pattern and uniform-decreasing distribution for job size in a 16×16 mesh

Figure 17. System utilization vs. arrival rate using the All-to-All communication pattern and uniform-decreasing distribution for job size in a 16×16 mesh

Figure 19. Mean system utilization vs. arrival rate using the Near-Neighbor communication pattern and uniform-decreasing distribution for job size in a 16×16 mesh

Figure 20. Average response time vs. arrival rate using the One-to-All communication pattern and uniform distribution for job size in a 16×16 mesh

Figure 22. Average response time vs. arrival rate using the All-to-All communication pattern and uniform distribution for job size in a 16×16 mesh

Figure 24. Average response time vs. arrival rate using the Near-Neighbor communication pattern and uniform distribution for job size in a 16×16 mesh