A Sensor Data Stream Collection Scheme Considering Phase Differences for Load Balancing

In the Internet of Things (IoT), various devices (things), including sensors, generate data and publish them via the Internet. The authors define continuous sensor data with different cycles as a sensor data stream and have proposed methods to collect distributed sensor data streams. In this paper, the authors describe a skip graph-based collection scheme for sensor data streams that considers phase differences. In the proposed scheme, the phase differences balance the collection times within each collection cycle, which decreases the probability that load concentrates on a specific time or node. The simulation results show that the proposed scheme can equalize the loads of nodes even if the distribution of collection cycles is not uniform.


INTRODUCTION
The Internet of Things (IoT) (Hodges et al., 2013) has attracted greater interest and attention with the spread of small network-connected devices such as sensors, smartphones, and wearable devices. In the data science field, stream data generated from IoT devices are analyzed to obtain various information. A larger amount of data can lead to higher-quality information, so faster stream data collection is one of the main techniques in the data science field, and various schemes have been proposed. To enable IoT applications for data collection, pub/sub messaging (Eugster et al., 2003) is considered a promising event delivery method that can achieve the asynchronous dissemination and collection of information in real time in a loosely coupled environment. For example, the sensor devices correspond to publishers, and the IoT application corresponds to a subscriber. Topic-Based Pub/Sub (TBPS) protocols are already widely utilized by many IoT applications (Teranishi et al., 2017). These systems have a broker server for managing topics. The broker gathers all the published messages and forwards them to the corresponding subscribers.

Assumed Environment
The purpose of this study is to disperse the communication load in sensor stream collections that have different collection cycles. The source nodes have sensors that acquire sensor data periodically. The source nodes and the collection node (sink node) of those sensor data construct a P2P network. The sink node searches for source nodes and requests a sensor data stream with a given collection cycle from each of them in the P2P network. Upon reception of the query from the sink node, the source node starts to deliver the sensor data stream via other nodes in the P2P network. The intermediate nodes relay the sensor data stream to the sink node based on their routing tables.

Input Setting
The source nodes are denoted as N_i (i = 1, ..., n), and the sink node of sensor data is denoted as S. In addition, the collection cycle of N_i is denoted as C_i.
In Figure 1, each node indicates a source node or the sink node, and the branches indicate collection paths for the sensor data streams. Concretely, they indicate communication links in the application layer. The branches are drawn as dotted lines because, depending on the collection method, a branch may not carry a sensor data stream. The sink node S is at the top and the four source nodes N_1, ..., N_4 (n = 4) are at the bottom. The number next to each source node indicates its collection cycle: C_1 = 1, C_2 = 2, C_3 = 2, and C_4 = 3. This corresponds, for example, to the case where a live camera acquires an image once every second, and N_1 records the image once every second, N_2 and N_3 record the image once every two seconds, and N_4 records the image once every three seconds. Table 1 shows the collection cycle of each source node and the sensor data to be received in the example in Figure 1.
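As a minimal sketch of this input setting, the following Python snippet lists which source nodes of the Figure 1 example are collected at each time step. The function name is illustrative, and the snippet assumes that every stream starts at time 0, which is not stated in the text.

```python
# Collection cycles from the Figure 1 example: C_1 = 1, C_2 = 2, C_3 = 2, C_4 = 3.
CYCLES = {1: 1, 2: 2, 3: 2, 4: 3}


def nodes_collected_at(t, cycles):
    """Return the indices i of the source nodes N_i collected at time t.

    Assumption (not from the paper): every stream starts at time 0, so N_i
    is collected whenever t is a multiple of its collection cycle C_i.
    """
    return sorted(i for i, c in cycles.items() if t % c == 0)


if __name__ == "__main__":
    for t in range(7):
        print(t, nodes_collected_at(t, CYCLES))
```

Under this assumption, all four source nodes coincide at every multiple of 6, the least common multiple of the cycles.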

Definition of a Load
The communication load of the source nodes and the sink node is given as the total of the load due to the reception of the sensor data stream and the load due to the transmission. The communication load due to the reception is referred to as the reception load; the reception load of N_i is I_i and the reception load of S is I_0. The communication load due to the transmission is referred to as the transmission load; the transmission load of N_i is O_i and the transmission load of S is O_0.
In many cases, the reception load and the transmission load are proportional to the number of sensor data pieces per unit time of the sensor data stream to be sent and received. The number of pieces of sensor data per unit time of the sensor data stream that is delivered by N_p to N_q (q ≠ p; p, q = 1, ..., n) is R(p, q), and the number delivered by S to N_q is R(0, q).
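The load definition can be sketched as follows, assuming that the loads are exactly equal to (rather than merely proportional to) the number of data pieces, and that index 0 of the rate matrix stands for the sink node S. The matrix layout, the function name, and the use of R(p, 0) for deliveries to the sink are illustrative assumptions.

```python
from fractions import Fraction


def loads(R):
    """Compute reception loads I and transmission loads O from a rate matrix.

    R[p][q] is the number of data pieces per unit time delivered from node p
    to node q. Index 0 is the sink S; indices 1..n are the source nodes.
    Assumption: load equals the number of pieces (proportionality factor 1).
    """
    size = len(R)
    reception = [sum(R[p][q] for p in range(size) if p != q) for q in range(size)]
    transmission = [sum(R[p][q] for q in range(size) if q != p) for p in range(size)]
    return reception, transmission


# Hypothetical example: the four source nodes of Figure 1 send directly to S,
# so N_i delivers 1 / C_i pieces per unit time to the sink.
CYCLES = [1, 2, 2, 3]
n = len(CYCLES)
R = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
for i, c in enumerate(CYCLES, start=1):
    R[i][0] = Fraction(1, c)

I, O = loads(R)

if __name__ == "__main__":
    print("I_0 =", I[0])  # reception load of the sink
    print("O_i =", O[1:])  # transmission loads of the source nodes
```

With direct delivery, the whole reception load falls on the sink, which is the concentration that relaying over an overlay is meant to disperse.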

Skip Graphs
In this paper, we assume an overlay network for skip graph-based TBPS. Skip graphs are overlay networks that apply skip lists to the P2P model (Aspnes et al., 2007). Figure 2 shows the structure of a skip graph. In Figure 2, squares show entries of routing tables on peers (nodes), and the number inside each square shows the key of the peer. The peers are sorted in ascending order by their keys, and bidirectional links are created among the peers. The number below each entry is called the "membership vector." The membership vector is an integral value assigned to each peer when the peer joins. Each peer creates links to other peers on multiple levels based on the membership vector. In skip graphs, when a single key and its assigned peer are searched for, queries are forwarded to other peers over the higher-level links first. This is because the higher-level links can reach the searched key in fewer hops than the lower-level links. In the case of range queries that specify the beginning and end of the keys to be searched, the queries are forwarded to the peers whose keys are within the range or less than the end of the range. The number of hops for a key search is O(log n), where n denotes the number of peers. In addition, the average number of links on each peer is log n.
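The level structure and the single-key search can be illustrated with a small, centralized sketch. It is not a distributed implementation: the keys and membership vectors below are hypothetical (they are not those of Figure 2), membership vectors are represented as strings of binary digits, and peers that share the first k digits are assumed to form one sorted list at level k.

```python
# Hypothetical peers: key -> membership vector (binary digits as a string).
MEMBERSHIP = {13: "00", 21: "10", 33: "01", 36: "00", 48: "11", 75: "11", 99: "10"}
MAX_LEVEL = 2


def build_levels(membership, max_level):
    """For each level k, group the sorted keys by the first k digits."""
    levels = []
    for level in range(max_level + 1):
        groups = {}
        for key in sorted(membership):
            groups.setdefault(membership[key][:level], []).append(key)
        levels.append(groups)
    return levels


def neighbors(key, level, levels, membership):
    """Return the (left, right) neighbors of a peer in its list at a level."""
    group = levels[level][membership[key][:level]]
    i = group.index(key)
    left = group[i - 1] if i > 0 else None
    right = group[i + 1] if i + 1 < len(group) else None
    return left, right


def search(start, target, levels, membership, max_level):
    """Route from `start` toward `target`, trying the highest level first."""
    path = [start]
    current, level = start, max_level
    while current != target and level >= 0:
        left, right = neighbors(current, level, levels, membership)
        if target > current and right is not None and right <= target:
            current = right  # move right without overshooting the target
            path.append(current)
        elif target < current and left is not None and left >= target:
            current = left  # move left without overshooting the target
            path.append(current)
        else:
            level -= 1  # no useful link at this level, so drop one level
    return path


LEVELS = build_levels(MEMBERSHIP, MAX_LEVEL)

if __name__ == "__main__":
    print(search(13, 75, LEVELS, MEMBERSHIP, MAX_LEVEL))
```

In this toy example the query from key 13 to key 75 first jumps over two peers on a level-2 link and only then falls back to the level-0 list.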

Phase Differences
We have previously proposed a large-scale data collection scheme for distributed TBPS. In Teranishi et al. (2017), we employ "Collective Store and Forwarding," which stores and merges multiple small messages into one large message along a multi-hop tree structure on the structured overlay for TBPS, taking into account the delivery time constraints. This makes it possible to reduce the overhead of network processes even when a large number of sensor data are published asynchronously. In addition, we have proposed a collection system considering phase differences (Kawakami et al., 2018; Kawakami et al., 2019). In the proposed method, the phase difference of the source node N_i is denoted as d_i (0 ≤ d_i < C_i). In this case, the collection time is C_i p + d_i (p = 0, 1, 2, ...). Table 2 shows the times to collect data in the case of Figure 1, where the collection cycle of each source node is 1, 2, or 3. By considering phase differences as in Table 2, the collection times are balanced within each collection cycle, and the probability that load concentrates on a specific time or node is decreased. Each node sends sensor data at the times based on its collection cycle and phase difference, and other nodes relay the sensor data to the sink node. In this paper, we call this consideration of phase differences "phase shifting (PS)." Figures 3 and 4 show examples of the data forwarding paths on skip graphs without PS and with PS, respectively.
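The effect of the collection times C_i p + d_i can be checked with a short sketch. The cycles are those of Figure 1, while the phase differences below are hypothetical values chosen for illustration and are not necessarily those of Table 2; the function names are also illustrative.

```python
def collection_times(cycle, phase, horizon):
    """Return the times C_i * p + d_i (p = 0, 1, 2, ...) below `horizon`."""
    if not 0 <= phase < cycle:
        raise ValueError("phase difference must satisfy 0 <= d_i < C_i")
    return list(range(phase, horizon, cycle))


def load_per_time(nodes, horizon):
    """Count how many source nodes are collected at each time step."""
    counts = [0] * horizon
    for cycle, phase in nodes:
        for t in collection_times(cycle, phase, horizon):
            counts[t] += 1
    return counts


# (C_i, d_i) pairs for N_1..N_4; the phases in WITH_PS are hypothetical.
WITHOUT_PS = [(1, 0), (2, 0), (2, 0), (3, 0)]
WITH_PS = [(1, 0), (2, 0), (2, 1), (3, 1)]

if __name__ == "__main__":
    print("without PS:", load_per_time(WITHOUT_PS, 6))
    print("with PS:   ", load_per_time(WITH_PS, 6))
```

Over one period of six time steps, the peak number of simultaneous collections drops from 4 to 3 with these phases, while the total number of collections stays the same.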

EVALUATION
In this section, we describe the evaluation of the proposed skip graph-based method with phase shifting (PS) by simulation. Table 3 shows the simulation environments. We evaluate our proposed system in two environments that differ in the combination of collection cycles. There is one sink node, and the collection cycle C_i and phase difference d_i of each node are determined at random. In the simulations, we measure the number of nodes targeted for data collection from time 0 to 99 and compare the results with the case of not considering phase differences. Figures 5 and 6 show the number of nodes targeted for data collection from time 0 to 99. The horizontal axis shows the time, and the vertical axis shows the number of targeted nodes at each time. In simulation environment 1, shown in Figure 5, the case of not considering phase differences collects data from all 1000 nodes at times 0, 6, 12, ..., 96. This is because the collection cycle is 1, 2, or 3, and the least common multiple is 6. At other times in the case of not considering phase differences, the number of nodes rises and falls sharply and repeatedly. On the other hand, the collection times are shifted by the phase differences in our proposed system, and the number of nodes is probabilistically equalized across times if the phase difference of each node is determined at random. Therefore, the probability of load concentration is decreased. In simulation environment 2, shown in Figure 6, our proposed system also achieves a high degree of balancing similar to that in simulation environment 1, while in the case of not considering phase differences the number of nodes changes in a complex pattern due to the combination of cycles from 1 to 10.
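A rough reproduction of simulation environment 1 can be sketched as follows. Only the values stated in the text are used (1000 nodes, cycles 1, 2, or 3, times 0 to 99); the uniform random choice of cycles and phases, the fixed seed, and all names are assumptions, since Table 3 is not reproduced here.

```python
import random


def targeted_nodes(cycles, phases, horizon):
    """Count the nodes targeted for collection at each time in [0, horizon)."""
    counts = [0] * horizon
    for cycle, phase in zip(cycles, phases):
        for t in range(phase, horizon, cycle):
            counts[t] += 1
    return counts


def simulate(num_nodes=1000, selectable=(1, 2, 3), horizon=100, seed=0):
    """Return per-time counts without and with phase shifting (PS).

    Assumption: cycles are drawn uniformly from `selectable`, and each
    phase difference d_i is drawn uniformly from 0..C_i - 1.
    """
    rng = random.Random(seed)
    cycles = [rng.choice(selectable) for _ in range(num_nodes)]
    without_ps = targeted_nodes(cycles, [0] * num_nodes, horizon)
    phases = [rng.randrange(c) for c in cycles]
    with_ps = targeted_nodes(cycles, phases, horizon)
    return without_ps, with_ps


if __name__ == "__main__":
    without_ps, with_ps = simulate()
    print("peak without PS:", max(without_ps))
    print("peak with PS:   ", max(with_ps))
```

Without PS, every multiple of 6 targets all 1000 nodes; with random phases, the count at each time stays close to its expected value.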

Communication Loads and Hops
In the simulation environments of our previous work, the collection cycle C_i of each source node is determined at random between 1 and 10. The selectable cycles are assumed to be limited by practical systems; however, the distribution of the selected cycles is not uniform in the real world. In this paper, therefore, we evaluate our proposed method under different distributions of collection cycles. The employed distributions are the normal (Gaussian) distribution and the exponential distribution.
To determine the integer cycle between 1 and 10, the normal distribution has a mean of 5.5 and a variance of 1.5. In addition, the exponential distribution determines the integer cycle for each node based on its cumulative distribution function (CDF), 10(1 - e^{-x}), where x is determined at random between 0.0 and 5.0. For other parameters, the simulation time t runs from 0 to 2519; its length, 2520, is the least common multiple of the selectable cycles. In addition, this simulation has no communication delays among nodes, although there are various communication delays in the real world. As comparison methods, we compare the proposed method with the skip graph-based method without PS shown in Figure 4, the method in which all source nodes send data to the destination node directly (Source Direct, SD), and the method in which all source nodes send data to the next node toward the destination node (Daisy Chain, DC). Figures 7 and 8 show examples of SD and DC with PS, respectively.
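The cycle selection can be sketched as below. The text does not state how the real-valued samples are mapped to integer cycles, so the rounding, the ceiling, and the clamping to the range 1 to 10 are assumptions, as are the function names.

```python
import math
import random
from functools import reduce

# Simulation length: least common multiple of the selectable cycles 1..10.
SIMULATION_LENGTH = reduce(lambda a, b: a * b // math.gcd(a, b), range(1, 11))


def clamp(value, low=1, high=10):
    """Keep an integer cycle inside the selectable range."""
    return min(high, max(low, value))


def normal_cycle(rng, mean=5.5, variance=1.5):
    """Draw a cycle from a normal distribution (assumption: round, then clamp)."""
    return clamp(round(rng.gauss(mean, math.sqrt(variance))))


def exponential_cycle(rng):
    """Draw a cycle from 10 * (1 - e^{-x}) with x uniform in [0.0, 5.0].

    Assumption: the real value is rounded up to the next integer and clamped.
    """
    x = rng.uniform(0.0, 5.0)
    return clamp(math.ceil(10 * (1 - math.exp(-x))))


if __name__ == "__main__":
    rng = random.Random(0)
    normal = [normal_cycle(rng) for _ in range(2000)]
    exponential = [exponential_cycle(rng) for _ in range(2000)]
    print("simulation length:", SIMULATION_LENGTH)
    print("mean cycle (normal):     ", sum(normal) / len(normal))
    print("mean cycle (exponential):", sum(exponential) / len(exponential))
```

Under these assumptions, the exponential mapping favors long cycles, whereas the normal distribution concentrates the cycles around the middle of the range.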
Figures 9 and 10 show the maximum instantaneous load and the total loads of nodes, respectively, as a function of the number of nodes. The horizontal axis shows the number of nodes, and the allowable number of stream aggregation is under 11. In all the distributions in Figure 9, the proposed method, skip graphs (SG) with PS, has a lower instantaneous load than the SD-based methods, in which the destination node receives data directly from the source nodes. In the DC-based methods, the larger the allowable number of stream aggregation, the smaller the number of transmissions and receptions; in this simulation environment, however, the proposed method has a lower instantaneous load than the DC-based methods. In addition, the proposed method has a lower instantaneous load than SG without PS because the phase difference gives each node a different timing of transmission and reception even if another node is configured with the same collection cycle. In Figure 10, on the other hand, the SD-based methods have the lowest total loads. However, the proposed method has lower total loads than the DC-based methods in this simulation environment. In addition, the total loads are the lowest in the exponential distribution because longer cycles have higher probabilities of being selected.
Similar to the results for the maximum instantaneous load and total loads of nodes, Figures 11 and 12 show the average number and the maximum number of hops, respectively, as a function of the number of nodes under 11 streams aggregation. In Figures 11 and 12, the SD-based methods have only one hop as both the average number and the maximum number, although their instantaneous loads shown in Figure 9 are high. The proposed method has log n as the average number of hops, where n denotes the number of nodes, while the number of hops in the DC-based methods grows linearly with n. Figures 13 and 14 show the maximum instantaneous load and the total loads of nodes, respectively, as a function of the allowable number of stream aggregation. The allowable number of stream aggregation is the value on the horizontal axis, and the number of nodes is 200. The SD-based methods have a constant maximum instantaneous load, which is not affected by the allowable number of stream aggregation because the source nodes send data to the destination node directly. In Figures 13 and 14, most of the results decrease as the allowable number of stream aggregation increases. The proposed method, SG with PS, has lower results for both the maximum instantaneous load and the total loads even in the realistic situation, 4 1 streams aggregation, compared to the DC-based methods, which require many streams aggregation to reduce those loads. In addition, the average number and the maximum number of hops are the same as in Figures 11 and 12 because they are not affected by the allowable number of stream aggregation.
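The hop counts of the two baseline methods can be sketched directly. The sketch assumes that DC arranges all n source nodes in a single chain, so that the k-th node is k hops from the destination; this layout and the function names are illustrative, and log2 n is printed only as a reference value for the skip graph-based method.

```python
import math


def hops_source_direct(n):
    """SD: every source node reaches the destination in one hop."""
    return [1] * n


def hops_daisy_chain(n):
    """DC (assumed single chain): the k-th source node is k hops away."""
    return list(range(1, n + 1))


def summarize(hops):
    """Return the (average, maximum) number of hops."""
    return sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    n = 200
    print("SD:", summarize(hops_source_direct(n)))
    print("DC:", summarize(hops_daisy_chain(n)))
    print("log2(n) reference for SG:", math.log2(n))
```

For 200 nodes, the chain layout yields an average of about 100 hops, compared with a single hop for SD and a logarithmic number for the skip graph.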

DISCUSSION
We described the data collection scheme with the approach of phase shifting (PS) in the third section. Our experimental results show that our proposed method can reduce the loads of nodes and realize highly scalable systems to periodically collect distributed sensor data.
As a limitation of our current study, we assume that the pieces of data do not differ much from each other. In the real world, however, various types of data such as texts, images, and audio are published at the same time. Those pieces of data have different sizes and processing loads. We can address this limitation by considering not only the number of data pieces (or transmissions/receptions) but also their types. Similarly, our current study does not consider the performance of nodes, which is another limitation. We can address it by considering node performance such as processing power, memory size, and network environment. In addition, our current study has a limitation from the viewpoint of security and privacy. For example, private data should preferably be sent to the subscriber via fewer nodes. Encryption of the data or communication is one of the common approaches, and arranging the data forwarding paths with security and privacy in mind is another solution to this limitation, e.g., the forwarding paths for private data are directly connected to the valid subscribers.

RELATED WORK
In relation to distributed stream data collection, various techniques have been proposed to disperse the communication loads for stream delivery (Shen et al., 2011; Win et al., 2018).
P2P stream delivery techniques have been proposed that use a P2P architecture to disperse the communication loads among the processing computers (nodes) (Zhang et al., 2005; Liao et al., 2006; Magharei et al., 2009; Yu et al., 2011; Sakashita et al., 2005). The P2P stream delivery techniques are divided into a pull type and a push type. In pull type techniques such as PPLive, DONet (Zhang et al., 2005), and SopCast, the reception nodes request data from other nodes and receive them. The reception nodes request only the data that they have not yet received, so redundant communications do not occur. In push type techniques such as AnySee, data are sent from the transmission nodes to other nodes (Liao et al., 2006). The transmission nodes find the nodes which have not yet received the requested data, so redundant communications do not occur. Techniques combining a pull type and a push type, such as PRIME, have also been proposed (Magharei et al., 2009).
Data delivery path construction techniques based on multicast trees have been proposed to prevent the concentration of communication loads on a specific node (Tran et al., 2003; Jin et al., 2007; Silawarawet et al., 2011; Le et al., 2012). In the ZIGZAG method, nodes construct clusters, and the multicast tree is constructed from the clusters (Tran et al., 2003). The number of clusters included at each depth of a multicast tree is made the same, and thus the loads are dispersed. Multicast trees are constructed only from information obtained in the application layer, and it is not necessary to understand the physical network structure.
In the MSMT/MBST method, the concentration of the communication loads is prevented more effectively than in the ZIGZAG method by considering the physical network structure (Jin et al., 2007). However, the MSMT/MBST method is not easy to implement because it is necessary to understand all the network structures between the nodes. In LAC (locality aware clustering), the loads are dispersed more than in the ZIGZAG method by considering a part of the nodes, even though the physical network structure cannot be understood (Silawarawet et al., 2011).
In the above-described P2P stream delivery techniques, the same data streams are assumed to be sent to many reception nodes. In the delivery of sensor data streams, however, the same sensor data stream is assumed to be delivered with different delivery cycles. In this case, those sensor data streams are delivered as different data streams for each delivery cycle. Thus, the communication loads cannot be efficiently dispersed. On the other hand, our proposed methods consider the different frequencies or cycles of each data stream and construct delivery paths to efficiently collect them.
As for distributed stream data collection systems, an existing method to reduce the number of messages needed to receive data from large-scale nodes is to execute a range query on key order-preserving overlays (Alaei et al., 2010; Gu et al., 2011; Legtchenko et al., 2012; Ohnishi et al., 2015; Shinomiya et al., 2011; Takeuchi et al., 2010). For example, "split forward broadcasting (SFB)" (Banno & Fujino et al., 2015) is an efficient way to construct tree structures for range queries. The data collection from publishers on a subscriber in TBPS corresponds to the execution of a range query for nodes that have keys with a topic. It can reduce the number of messages by merging responses from nodes along the reverse path of the query delivery tree structure. However, this method loses the asynchronous real-time feature of TBPS. The latest sensor data is not delivered until the subscriber executes a range query. Once the tree structure is constructed, it can be reused, but the periodic execution of range queries is needed to catch up with the joins and leaves of publishers and subscribers. Some existing works address the "aggregation problem" on structured overlays. DAT (Distributed Aggregation Tree) (Cai et al., 2007) constructs a tree to aggregate data from distributed nodes using the Chord (Stoica et al., 2003) overlay structure. DAT computes the aggregated value of all the local values by applying a given aggregation function on the distributed nodes. DAT can be used for data collection if the nodes execute the message merging function as an aggregation function. However, to merge and collect published messages, the publishers need to publish messages at the same time, which is not a realistic assumption. Moreover, the aforementioned methods cause path concentration of the forwarded data on the nodes that are located close to the subscriber nodes on the overlay. In addition, once the tree structure is decided, it cannot be changed dynamically. As a result, a network process overload tends to occur on these nodes. On the other hand, our proposed ADCT method can construct a flexible collection tree and adaptively adjust the maximum overhead for the nodes to merge and forward messages.

CONCLUSION
In this paper, we proposed a skip graph-based collection scheme for sensor data streams considering phase differences. Our method uses phase shifting to avoid load concentration at a specific time. Our simulation results show that the proposed scheme can equalize the loads of nodes even if the distribution of collection cycles is not uniform.
As future work, we will try to address the current limitations described in the fifth section entitled Discussion. More specifically, we will consider other information to determine the data forwarding paths, such as data types, node performance, and security/privacy.

ACKNOWLEDGMENT
This research was supported by JSPS KAKENHI Grant Number 18K11316, the G-7 Scholarship Foundation, and Research Grants from the University of Fukui.