1. Introduction
The average of the shortest path lengths between all pairs of nodes in a network is a global metric of high relevance. Popularly known as the average path length (APL), it provides useful insight into the level of interconnectivity in a network and the time it would take for information or goods to flow between any two randomly selected points on the network. APL has been shown to be an important metric for tasks such as the design of real-life transportation networks (Balmer et al., 2004; Klunder and Post, 2006; Ziliaskopoulos et al., 1997), design of routing networks (Costa et al., 2007; Dabek et al., 2004), design of web-based networks (Backstrom et al., 2012; Fu et al., 2008; Kleinberg, 2000; Newman, 2000), studying the propagation of diseases (Dekker, 2013), and diffusion of information and opinion dynamics (Yildiz et al., 2011). Researchers have also shown that search and navigation are easier when APL is small (Zhang et al., 2008). However, estimating APL is time-consuming and is sometimes infeasible owing to a lack of computational resources (Wang, 2006).
Researchers have shown that the computational time required to estimate APL scales as O(nm), where n is the number of vertices and m is the number of edges in the network (Madduri et al., 2007). m in turn scales as n^k, where 1 ≤ k ≤ 2. This quadratic-to-cubic scaling of the computational time with network size makes the estimation of APL impractical as network sizes increase (Wang, 2006). As an example, it takes approximately 9.6 hours to estimate the APL of a synthetic network created using the preferential attachment algorithm with about 100,000 nodes on a 16 GB RAM PC, and the time requirement increases to more than five days for a network with one million nodes. Such a long wait time is generally impractical, especially in simulation and emulation studies that require the generation of hundreds of synthetic networks and the estimation of APL for each scenario.
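To make the cost concrete, the exact APL of an unweighted network is typically computed with one breadth-first search (BFS) per node, which is where the O(nm)-type scaling comes from. The sketch below is an illustrative implementation (not the paper's code); it assumes an undirected, connected graph given as an adjacency-list dictionary.

```python
from collections import deque

def exact_apl(adj):
    """Exact average path length via one BFS per node.

    adj: dict mapping each node to a list of its neighbours
    (undirected, connected, unweighted graph assumed).
    Cost: one BFS per source node, i.e. O(n * (n + m)) overall,
    which is why exact APL becomes impractical on large networks.
    """
    nodes = list(adj)
    n = len(nodes)
    total = 0
    for src in nodes:
        # BFS gives shortest path lengths in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    # Average over all ordered node pairs, excluding self-pairs.
    return total / (n * (n - 1))

# Example: path graph 0-1-2-3 has APL = 20/12 ≈ 1.667.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(exact_apl(path))  # → 1.666...
```

Running this on graphs of increasing size makes the quadratic-to-cubic growth in wall-clock time directly observable.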
In this paper, we develop a random node-pair sampling strategy to estimate the APL of mid-sized networks when both computational time and capacity are limited. Though sampling introduces some uncertainty into the reliability of the parameter estimates, the precision and confidence can be controlled through a combination of the right sampling strategy and sample size. We therefore propose a node-pair sampling algorithm and demonstrate its efficacy in terms of computational time, confidence level, and precision. We compare the proposed algorithm with a random node sampling algorithm and with algorithms in which the node sampling is non-uniform. The random node-pair sampling algorithm yields a speed-up factor of more than 411 compared with the algorithm that uses random node sampling, and a speed-up of 750 compared with the algorithm that computes APL from the full population. The proposed algorithm uses the central limit theorem approximation to determine the sample size for a given precision and confidence level.
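The idea above can be sketched as follows. This is an illustrative outline, not the paper's exact algorithm: the pilot sample size, confidence level, and precision defaults are assumptions, and the central-limit-theorem sample size m = ceil((z * sigma / e)^2) uses a standard deviation sigma estimated from the pilot sample.

```python
import math
import random
from collections import deque

def pair_distance(adj, s, t):
    """Shortest path length between s and t by BFS (unweighted graph)."""
    if s == t:
        return 0
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == t:          # stop as soon as the target is reached
                    return dist[v]
                queue.append(v)
    return math.inf                  # s and t are disconnected

def sampled_apl(adj, precision=0.1, z=1.96, pilot=30, rng=random):
    """Estimate APL by uniform random node-pair sampling.

    Sample size from the CLT approximation: m = ceil((z * sigma / e)^2),
    where sigma is the standard deviation of pair distances estimated
    from a small pilot sample, e is the desired precision (half-width
    of the confidence interval), and z the normal quantile for the
    chosen confidence level (1.96 for 95%). All defaults here are
    illustrative assumptions.
    """
    nodes = list(adj)

    def draw():
        s, t = rng.sample(nodes, 2)  # uniform random node pair
        return pair_distance(adj, s, t)

    sample = [draw() for _ in range(pilot)]
    mean = sum(sample) / len(sample)
    var = sum((d - mean) ** 2 for d in sample) / (len(sample) - 1)
    m = max(pilot, math.ceil((z * math.sqrt(var) / precision) ** 2))
    sample += [draw() for _ in range(m - pilot)]
    return sum(sample) / len(sample)
```

Because each draw costs a single (early-terminating) BFS rather than an all-pairs computation, the total cost is governed by the CLT-determined sample size instead of the network size, which is the source of the speed-ups reported above.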