Adaptive Modularized Recurrent Neural Networks for Electric Load Forecasting

In order to provide more efficient and reliable power services than the traditional grid, the smart grid must accurately predict the electric load. Recently, recurrent neural networks (RNNs) have attracted increasing attention in this task because they can discover the temporal correlation between current load data and those long ago through the self-connection of the hidden layer. Unfortunately, the traditional RNN is prone to the vanishing or exploding gradient problem as the memory depth increases, which degrades predictive accuracy. Many RNN architectures address this problem at the expense of complex internal structures and increased network parameters. Motivated by this, this article proposes two adaptive modularized RNNs that not only solve the gradient problem effectively with a simple architecture, but also achieve better performance with fewer parameters than other popular RNNs.

As load data are usually recorded sequentially at a certain time interval, electric load forecasting can be regarded as time series prediction in the field of data mining. Therefore, the Recurrent Neural Network (RNN), which can capture the temporal correlation of a sequence well, has recently been considered a good choice for this task (Bianchi et al., 2017). As the simplest architecture, a Vanilla RNN is generally composed of three parts: the input layer, the hidden layer, and the output layer. Unlike the traditional Multi-Layer Perceptron, which contains only input and output connections, the Vanilla RNN introduces a recurrent connection from the previous to the current moment in the hidden layer (Elman, 1990). This means that the current input x_t and the hidden state of the previous moment h_{t-1} together determine the hidden state h_t. Because of the self-connection of the hidden layer, an RNN designed for sequence modeling can be regarded as a deep network when it is unrolled along the time axis. Such a deep network structure can easily lead to the vanishing or exploding gradient problem for long sequences when parameters are trained using Back Propagation Through Time (BPTT) (Fernando et al., 2018). To address this challenge, different architectures were proposed to improve the trainability of RNNs, but at the cost of significant computational overhead, such as Long Short-Term Memory (LSTM) (Hochreiter et al., 1997) and its variants; the introduction of gates poses the challenge of training many more parameters. Recently, the Clock-Work RNN (CW-RNN) was proposed as another simple but effective architecture to alleviate the gradient problem (Koutnik et al., 2014). It first divides the hidden layer into several modules with different update frequencies, and then gives the slow-updating modules recurrent connections with longer time delays.
This allows data dependencies to be passed in fewer time steps, avoiding excessive multiplications of the gradient. Besides, CW-RNN uses a predefined rule instead of training to determine module updates, greatly reducing the number of network parameters; but this inevitably weakens the generalization ability of the network. The motivation of this article is to refine and extend this architecture based on multi-timescale connections, aiming to resolve the contradiction between performance and the number of parameters. The research content mainly includes the adaptive updating strategy of modules in the hidden layer and the pruning problem of recurrent connections caused by this strategy. The contributions of this article are:
• A new modularized RNN (M-RNN) is proposed to generalize the existing CW-RNN. M-RNN is a framework with the skip length of the module as the key component, which realizes the update of the hidden state with the module as the minimum unit.
• Two adaptive strategies for updating the hidden modules are proposed by designing two new activation functions to calculate the priority of each module. On this basis, the unordered and ordered adaptive M-RNNs (AM-RNNs) are defined respectively to achieve dynamic multi-timescale connections.
• Since the existing pruning strategy is only applicable to the ordered AM-RNN, a two-way pruning strategy is designed for the unordered AM-RNN to realize the sparsification of recurrent connections.
• Both versions of AM-RNN are compared with other popular RNNs for electric load forecasting.
The experimental results show that AM-RNNs can achieve better predictive accuracy with fewer network parameters than the current RNNs widely used in this field.

RELATED WORK
Considering the economic and environmental implications of even a slight improvement in accuracy, there is still a lot of research on electric load forecasting (Bendaoud et al., 2020). Methods to accomplish this task can be generally divided into two categories: statistics-based methods and data-driven methods (Hong et al., 2016). Although the statistics-based methods have been developed very maturely, their abilities are limited by the nonlinearity of the sequences, the randomness of user behaviors, the diversity of external factors, etc. (Hafeez et al., 2020). This has made data-driven methods based on artificial intelligence a research hotspot, including dynamic neural networks (Mordjaoui et al., 2017), extreme learning machines (Chen et al., 2018), deep belief networks (Ouyang et al., 2019), convolutional neural networks, etc. However, the above models do not take the temporal relationship of the data as a definite feature of the load, which may lead to performance degradation when complex dependencies between the current load and those long ago need to be considered. From this point, RNNs that can continuously transmit early input information through the hidden state have been widely applied to electric load forecasting, especially in the absence of feature engineering (Shi et al., 2018). Because the gradient problem mentioned above makes the Vanilla RNN perform poorly in long sequence learning, some measures including algorithm substitution, weight constraints, gate mechanisms, and multi-timescale connections have been proposed for its improvement. The first measure is to replace BPTT with other optimization algorithms, such as the Hessian-Free (HF) method (Martens et al., 2012), Reservoir Computing (RC) (Gallicchio et al., 2017), etc. However, they are also criticized for problems like difficulty of implementation or limited learning capacity.
The second measure is to constrain the recurrent weights to guarantee that the continuous multiplication of multiple matrices does not cause the gradient to approach zero or become very large (Arjovsky et al., 2016). Note that the strict enforcement of weight constraints may hinder training speed and generalization ability (Vorontsov et al., 2017). The gate mechanism, represented by LSTM and its variants, is the third measure and has been applied in numerous time series tasks, including short-term or mid-term electric load forecasting (Li et al., 2021) (Dudek et al., 2021). The hidden state of LSTM needs to be obtained through a cell state and three gate vectors, all of which require additional connection weights with the input and the previous hidden state, so a typical LSTM requires about four times as many training parameters as the Vanilla RNN (Greff et al., 2016). Reducing the number of gates is a common method that leads to faster convergence and improved generalization. As a simplified variant of LSTM, the Gated Recurrent Unit (GRU) without a memory cell performed better than LSTM on some tasks, even though it has only the update and reset gates (Cho et al., 2014). Furthermore, the Minimal Gated Unit (MGU), which contains only one gate, can be regarded as the simplest design among all gated architectures by sharing the gate vector (Zhou et al., 2016). To handle the problem of gate undertraining, a new gate mechanism was designed to perform an element-wise refining operation on the input and the output of each gate (Cheng et al., 2020). While it can be equipped with any kind of gated RNN to improve performance, it does not reduce the number of parameters. The final common measure to improve the performance of RNNs on long-term dependencies is to introduce multi-timescale connections with increasing recurrent skip lengths.
Similar to the skip-connections of residual neural networks (He et al., 2016), the main idea of multi-timescale connections is that recurrent connections should exist not only between adjacent time steps but also across larger time steps. For example, Skip RNN allows the network to determine whether to fully replicate the previous state or to update it at the current moment, based on a calculated update likelihood (Campos et al., 2018). In addition to neurons, multi-timescale connections can also be formed between modules, as in the CW-RNN described earlier (Koutnik et al., 2014). Dilated RNN stacks multiple hidden layers that work on different skip lengths to focus on different temporal dependencies (Chang et al., 2017). More recently, the latest advances have focused on integrating some of the aforementioned measures (Jing et al., 2019) (Moirangthem et al., 2021).
In summary, from the perspective of trade-off optimization between accuracy and efficiency, the authors consider the modularized RNN (M-RNN) modified on the Vanilla RNN to be a very competitive multi-timescale architecture because it does not require additional hidden layers or connection weights. However, little work has been done on this architecture, prompting them to study it in this article.

THE NEW FRAMEWORK M-RNN
The core idea of M-RNN is to obtain the short-term and long-term dependencies of the data by introducing different skip lengths for the hidden modules at different moments. According to the framework shown in Figure 1, the hidden layer is first divided into k modules. It is assumed that each module has m neurons, so the number of hidden neurons is n_h = k·m. Secondly, a skip length S_t^i (i = 1, 2, …, k) is introduced to determine the time delay of the module H_t^i, clearly distinguishing its responsibility for short-term or long-term dependency at time t. A module with a smaller skip length is more conducive to short-term dependency. Conversely, the longer the skip length, the fewer steps are needed to transmit the information, and the more favorable it is for long-term dependency. The equations of M-RNN are defined as (1)-(3):

h_t' = f_h(W_ih x_t + (W_hh * M) h_{t-1} + b_h)  (1)
h_t = u_t ⊙ h_t' + (1 − u_t) ⊙ h_{t-1}  (2)
y_t = f_o(W_ho h_t + b_o)  (3)
Here, x_t, h_t, h_t', and y_t are the input, the hidden state, the candidate state, and the output at time step t; ⊙ and * denote the element-wise product between two vectors and between two matrices, respectively; f_h(·) and f_o(·) are the activation functions of the hidden and the output layer, such as tanh or ReLU. W_ih, W_hh, and W_ho represent the input, the recurrent, and the output weight matrices. They and the bias vectors b_h, b_o are the parameters to be learned; W is used to represent all the above parameters, which are required to be the same at each time step. Note that the equations of M-RNN are very similar to those of GRU, with two main differences. One is that the update vector u_t of GRU is learned by the update gate, while the update vector of M-RNN is obtained more easily by the updated strategy described later. The other is that GRU adopts a reset vector r_t learned by the reset gate, while M-RNN adopts the mask matrix M. Besides, the elements of u_t and r_t for GRU range from 0 to 1 due to the activation function (usually sigmoid), while the elements of u_t and M for M-RNN are either 1 or 0. In other words, M-RNN can be thought of as containing a binary gate that controls the flow of information. Furthermore, M-RNN requires the same update values for all neurons in the same module to ensure that they are updated or retained at the same time. Finally, as a general framework, note that the skip length S_t^i can be either a constant or a variable, depending on the updated strategy of the module described below.
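To make the framework concrete, one step of the computation in (1)-(3) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation; names and shapes are illustrative.

```python
import numpy as np

def m_rnn_step(x_t, h_prev, params, u_t, M):
    """One M-RNN step following Eqs. (1)-(3): a GRU-like update in which
    the binary update vector u_t and the binary mask M replace the learned
    update and reset gates."""
    W_ih, W_hh, W_ho, b_h, b_o = params
    # (1) candidate state: the recurrent weights are pruned by the mask M
    h_cand = np.tanh(W_ih @ x_t + (W_hh * M) @ h_prev + b_h)
    # (2) module-wise update: u_t is 0/1 and constant within each module
    h_t = u_t * h_cand + (1.0 - u_t) * h_prev
    # (3) output layer
    y_t = W_ho @ h_t + b_o
    return h_t, y_t
```

In use, u_t would be built by repeating a per-module 0/1 decision m times (e.g. `np.repeat(active_modules, m)`), so every neuron in a module is updated or retained together.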

THE UPDATED STRATEGY OF HIDDEN MODULES
Different from RNNs with a gate mechanism, the update of M-RNN is based not on the neuron but on the module, which greatly reduces the network parameters and training time. As the simplest model of M-RNN, CW-RNN achieves computational efficiency that is not only much higher than gated RNNs but even higher than the Vanilla RNN, because not all modules are updated at every time step; which modules are updated is determined by the updated vector. Therefore, designing an appropriate strategy to obtain the updated vector is crucial to the performance of M-RNN. In this article, starting from the fixed strategy adopted by CW-RNN, several other effective updating strategies are discussed to meet the challenge of finding suitable time scales.

The Fixed Strategy
As a typical representative, CW-RNN utilizes a fixed strategy to determine whether each module is updated at time step t through a predefined rule. More specifically, after assigning a period T_i to each module, CW-RNN stipulates that the states of all neurons in the module can be updated only if the current time t is divisible by the period of the module; otherwise, the previous state is always maintained. According to this rule, the updated vector at any time step can be obtained without training. If the updated vector u_t is divided into k sub-vectors according to module size, then the element values of each sub-vector are either all zeros or all ones; one means that the corresponding neuron is involved in the update, while zero means that it retains the value of the previous time. Figure 2 shows the locations of the activated modules for a CW-RNN with 4 modules. As can be seen from Figure 2, this CW-RNN provides a set of recurrent connections with skip lengths of 1, 2, 4, and 8. Modules with longer skip lengths update more slowly than the other modules. Thanks to these slow-updating modules, information that would take many time steps to transmit has alternative paths with fewer time steps, which effectively alleviates the gradient problem caused by overly long time steps. The benefit of the fixed strategy is that its simplicity reduces the number of network parameters and saves a lot of training time. However, it is undeniable that the generalization ability of the fixed strategy is unsatisfactory.
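The divisibility rule requires no training and can be sketched in a few lines; the exponential period set 1, 2, 4, 8 follows the example in Figure 2.

```python
import numpy as np

def cw_update_vector(t, periods, m):
    """Fixed CW-RNN rule: module i is updated at step t iff t is divisible
    by its period T_i; the per-module decision is repeated for the m
    neurons of the module."""
    active = [1.0 if t % T_i == 0 else 0.0 for T_i in periods]
    return np.repeat(active, m)
```

For example, with periods (1, 2, 4, 8) only the first module updates at t = 3, while all four update at t = 8, reproducing the hierarchical pattern of Figure 2.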

The Random Strategy
Inspired by Zoneout, which regularizes LSTM by randomly preserving hidden states (Krueger et al., 2017), the authors designed an updated strategy for the modules based on random probability in their early work (Zhuang et al., 2020). Different from traditional Zoneout based on neurons, this strategy randomly assigns some modules to participate in the update at each time step. Firstly, the module H_t^i is randomly assigned an update probability P_t^i at each time step. Secondly, given an updated threshold ε (0 < ε < 1), the elements of the updated vector corresponding to the neurons in H_t^i are set to 1 only if P_t^i > ε, and to 0 otherwise. In this way, the updated vector for any time step can be obtained. Figure 3 shows an example of the Zoneout M-RNN (ZM-RNN) based on the random strategy. It can be concluded from Figure 3 that the skip length of each module is no longer a constant but a variable, which means that each module is sometimes updated faster and sometimes slower. A strict division between faster and slower modules can thus be avoided, and each module can accommodate both short-term and long-term dependencies. Although the random strategy is simple to implement, it inevitably introduces an additional hyper-parameter, namely the updated threshold, which can be adjusted on the validation set. Besides, the performance fluctuation caused by random probability is also a problem to be considered.
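A minimal sketch of this random strategy follows; drawing the update probabilities uniformly is an assumption here, since the distribution of P_t^i is not fixed above.

```python
import numpy as np

def zoneout_update_vector(k, m, eps, rng):
    """Random (Zoneout-style) module updates: each module draws an update
    probability P_t^i at each step and is updated only if it exceeds the
    threshold eps, a hyper-parameter tuned on the validation set."""
    p = rng.uniform(size=k)            # P_t^i for each of the k modules
    active = (p > eps).astype(float)   # per-module 0/1 decision
    return np.repeat(active, m)        # expand to the m neurons per module
```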

The Adaptive Strategy Without Order
Compared with the existing strategies, the adaptive strategy can be regarded as a more active and effective one. The key is to determine which modules are worth updating and which are worth retaining at each time step. In this article, a new concept of "module priority" is proposed to measure the importance of each module's participation in the update. Firstly, the element-wise activation function in (5) is applied to the candidate state h_t' to obtain the priority of each hidden neuron. Secondly, as shown in (6), the priorities of the neurons in the same module are accumulated to obtain the priority of the module. Thirdly, by comparing each module's priority with the updated threshold, the corresponding updated sub-vector can be obtained by (7).
The updated threshold is set to 1/(k+1) to ensure that every module has the opportunity to participate in the update at some particular moment. Finally, the full updated vector u_t can be obtained by concatenating all the updated sub-vectors. For convenience of expression, the adaptive M-RNN based on this strategy is called AM-RNN-I. Figure 4 shows several examples of AM-RNN-I with 4 modules obtaining the updated vectors based on (5)-(7). The skip length obtained by AM-RNN-I is similar to that obtained by ZM-RNN: it is also variable and does not strictly distinguish long/short dependencies. Therefore, Figure 3 can also be used to illustrate the locations of the activated modules for AM-RNN-I. The difference between them is that ZM-RNN determines whether a module is updated based on random probability, while AM-RNN-I does so based on module priority. By comparing Figure 2 and Figure 3, it can be observed that ZM-RNN and AM-RNN-I do not have an obvious hierarchy like CW-RNN, because the update of a module is independent of its index.
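Since (5)-(7) are not reproduced here, the following sketch instantiates the strategy under stated assumptions: the element-wise function of (5) is taken to be the absolute value, the module priorities of (6) are normalized to sum to one, and (7) compares each priority with the threshold 1/(k+1) as above.

```python
import numpy as np

def amrnn1_update_vector(h_cand, k, m):
    """Unordered adaptive strategy (AM-RNN-I), sketched: neuron priorities
    are assumed to be |h_t'|, summed per module and normalized; a module
    is updated iff its priority exceeds 1/(k+1)."""
    prio = np.abs(h_cand).reshape(k, m).sum(axis=1)  # per-module priority
    prio = prio / prio.sum()                         # normalize to sum to 1
    active = (prio > 1.0 / (k + 1)).astype(float)
    return np.repeat(active, m)
```

Note that with uniform priorities every module exceeds 1/(k+1), so all modules update, which is consistent with the claim that every module has a chance to participate.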

The Adaptive Strategy With Order
The latest research has shown an explicit preference for the hierarchical hidden layer (Schoene et al., 2021). This leads the authors to further consider how to improve the above adaptive strategy so that the updated frequency of the modules decreases in order. In other words, when the module H_t^i is updated, all modules with smaller indexes (H_t^j, 1 ≤ j ≤ i) are updated at the same time. To achieve this goal, the learning process of the updated vector performs the following two steps after obtaining the priority of each module. Step 1 accumulates the module priorities in order of their indexes. Step 2 sets an updated threshold ε (0 < ε < 1) to determine the number of modules to be updated per time step; the updated vector is then obtained by (9). More specifically, the updated vector is split into two segments: the 1-segment and the 0-segment. The length of each segment is variable, depending on the number and size of the updated modules. The AM-RNN based on the above update rule is called AM-RNN-II. Figure 5 shows several examples of AM-RNN-II with 4 modules obtaining the updated vectors, where the updated threshold ε = 0.5.
It can be concluded from Figure 5 that module H_t^1 is updated at every time step, which allows it to be treated as a Vanilla RNN. The updated frequency of the modules keeps decreasing, making modules with larger indexes more conducive to long-term dependencies. Figure 6 shows more details about the locations of the activated modules for an AM-RNN-II with 4 modules. For example, the updated vectors at times t = 2, 6, 10, 13 can be considered similar to the four examples in Figure 5, respectively. Similar to CW-RNN, AM-RNN-II has strict differences in updated frequency between modules. However, except for module H_t^1, whose skip length is a constant equal to 1, the skip lengths of all other modules are variables learned in a data-driven way.
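The ordered strategy can be sketched as follows, under an assumption about the accumulation step: the normalized module priorities are summed cumulatively in index order, and the first n modules whose cumulative priority reaches ε are updated. This guarantees a 1-segment followed by a 0-segment and that module 1 always updates, but the paper's exact rule is not reproduced here.

```python
import numpy as np

def amrnn2_update_vector(module_prio, m, eps=0.5):
    """Ordered adaptive strategy (AM-RNN-II), sketched: update the first n
    modules whose cumulative normalized priority reaches the threshold eps,
    so faster-updating modules always have smaller indexes."""
    prio = np.asarray(module_prio, dtype=float)
    prio = prio / prio.sum()
    csum = np.cumsum(prio)
    n = int(np.searchsorted(csum, eps)) + 1  # smallest n with csum[n-1] >= eps
    k = len(prio)
    active = np.array([1.0] * n + [0.0] * (k - n))
    return np.repeat(active, m)
```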
In summary, M-RNN can contain multiple models based on the above updated strategies, as shown in Figure 7. The characteristic of CW-RNN, based on the fixed strategy, is that the skip length of each module is a preset constant; the Vanilla RNN can thus be regarded as a special case of CW-RNN that retains all recurrent connections, in which the hidden layer is a single module with skip length 1. Different from CW-RNN, a module of ZM-RNN, based on the random strategy, is sometimes updated fast and sometimes slowly; the probabilistic change of skip lengths achieves dynamic updating of the modules. Finally, AM-RNN-I and AM-RNN-II, based on the adaptive strategy, can be regarded as improvements of ZM-RNN and CW-RNN respectively. Their common feature is that the skip lengths are adjusted according to the input information, rather than determined in advance or at random.

THE PRUNING STRATEGY OF RECURRENT CONNECTIONS
In addition to the updated strategy of hidden modules, another key issue for M-RNN is the pruning strategy of recurrent connections. According to different pruning strategies, different mask matrices M in (1) can be obtained to determine which information of h_{t-1} is used to calculate the candidate state h_t'. It should be pointed out that the purpose of M-RNN adopting a pruning strategy instead of the reset gate of GRU is to further reduce network parameters and avoid over-fitting.

THE EXISTING STRATEGY FOR ONE-WAY PRUNING AND ITS LIMITATIONS
CW-RNN emphasizes that a hidden neuron at time t−1 is only allowed to establish connections with hidden neurons running at the same or a faster updated frequency at time t. Since a larger module index means a slower updated frequency, the neurons of the module H_{t-1}^j can be fully connected to the neurons of the module H_t^i only if j ≥ i. Figure 8(a) shows an illustration of this strategy. According to this slow-to-fast connection, the mask matrix M can be partitioned into k × k blocks as shown in (10), where the block M_ij is all ones if j ≥ i and all zeros otherwise. Since M is a block-upper-triangular matrix, the weights below the block diagonal of the Hadamard product W_hh * M in (1) are all 0, realizing the pruning of the recurrent connections. This means that only m²k(k+1)/2 non-zero parameters in W_hh need to be trained, which is much smaller than the m²k² parameters of the Vanilla RNN. It can be concluded from (1) that this strategy makes the slow-updating modules retain less information about h_{t-1} and rely more on the input x_t when calculating the candidate state h_t'. However, this strategy needs to determine which modules update faster or slower to achieve one-way pruning of the recurrent connections. Therefore, it is only suitable for ordered M-RNNs (CW-RNN or AM-RNN-II). The unordered M-RNNs (ZM-RNN or AM-RNN-I) cannot meet this requirement due to the variable skip length of each module at different moments. A two-way strategy is therefore designed to solve the problem that the updating speed of modules cannot be strictly distinguished.
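The block-upper-triangular mask described above can be constructed directly; the sketch below is illustrative and verifies the m²k(k+1)/2 count of retained weights.

```python
import numpy as np

def one_way_mask(k, m):
    """Block-upper-triangular mask for one-way (slow-to-fast) pruning:
    block M_ij is all ones iff j >= i, so a module at time t-1 only feeds
    modules with the same or a faster updated frequency at time t."""
    M = np.zeros((k * m, k * m))
    for i in range(k):           # target (row) module index
        for j in range(i, k):    # source (column) module index, j >= i
            M[i*m:(i+1)*m, j*m:(j+1)*m] = 1.0
    return M
```

With the paper's setting k = 7, m = 30, this keeps 30²·7·8/2 = 25200 of the 44100 recurrent weights.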

THE PROPOSED STRATEGY FOR TWO-WAY PRUNING
Previous work by the authors has shown that CW-RNN can also perform well when there is not only the slow-to-fast connection but also the fast-to-slow connection (Huang et al., 2019). The fast-to-slow connection means that a module at time t−1 can be connected to a module with a slower updated frequency at time t. On this basis, a two-way strategy is designed for the unordered M-RNNs, whose principle is to decide which sub-matrices in M have all-1 or all-0 elements according to the pruning threshold p. This strategy is similar to the sparse connections of the reservoir in an ESN (Hu et al., 2020), except that it prunes the recurrent connections between modules directly rather than between neurons. To ensure the basic performance of the model, the two-way strategy first retains the recurrent connections of the same module at adjacent moments (i.e., all elements of M_ii are 1). The goal is to form a simple, fixed, non-random topology in the network to ensure that each module is fully self-connected at different times (Rodan et al., 2010). Secondly, the remaining sub-matrices (M_ij, i ≠ j) are determined according to their random probabilities to control the sparsification of the recurrent connections. The goal is to make use of random connections between different modules to achieve undifferentiated pruning. This poses a new question: is it better to generate one mask matrix M for the entire sequence or a new M for each step? Considering that the candidate state affected by M will later be used for learning the module priority, it is recommended to generate a new M for each step to ensure sufficient dynamic changes in the model (Qiao et al., 2016). Figure 8(b) shows an illustration of this strategy. Different from Figure 8(a), the updated speed of each module in Figure 8(b) is variable at different moments, so there are both slow-to-fast and fast-to-slow connections in the hidden layer between adjacent moments.
Besides, the one-way strategy forces the same pruning of the recurrent connections at every step, while the two-way strategy allows different pruning at different times.
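A sketch of the two-way strategy follows, assuming each off-diagonal block is kept or dropped as a whole by an independent uniform draw compared against the pruning threshold p; a fresh mask would be drawn at every step, as recommended above.

```python
import numpy as np

def two_way_mask(k, m, p, rng):
    """Two-way pruning for unordered M-RNNs: diagonal blocks M_ii are
    always kept (each module stays self-connected across adjacent steps),
    while each off-diagonal block M_ij (i != j) is kept as a whole only if
    its random draw exceeds the pruning threshold p."""
    keep = (rng.uniform(size=(k, k)) > p).astype(float)
    np.fill_diagonal(keep, 1.0)                      # retain within-module links
    return keep.repeat(m, axis=0).repeat(m, axis=1)  # expand blocks to neurons
```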
To sum up, as a general framework with the module as the minimum updated unit, M-RNN needs to pay attention to two parts. One is the updated strategy of hidden modules, including the fixed strategy, the random strategy, and the adaptive strategy without or with order. The other is the pruning strategy of recurrent connections, including the one-way strategy and the two-way strategy. According to different combinations of the above two parts, four models can be obtained, as shown in Table 1.

EXPERIMENTS AND RESULTS
In this section, all the above M-RNNs are compared with some popular gated RNNs for electric load forecasting. All models are implemented with TensorFlow. The experiments are performed according to the following principles: 1) all models contain only one hidden layer and have the same number of neurons; 2) the output layer is added only at the last moment of the sequence; 3) no tricks, such as recurrent dropout (Semeniuta et al., 2016), batch normalization (Laurent et al., 2016), gradient clipping (Pascanu et al., 2013), etc., are adopted.

Experimental Settings
A real-world dataset provided by the Australian Energy Market Operator is used for one-day-ahead prediction of the load. The daily maximum load values from 2014 to 2017 in Queensland are selected for the experiment, among which the data from 2014 to 2016 are taken as the training set, the first six months of 2017 as the validation set, and the last six months as the test set. Z-score standardization is adopted for data preprocessing to make the distribution more regular. To investigate the memory capacity of each model for long-term information, the time step is set to 365 (the length of a year). The number of hidden neurons is set to 210. For M-RNN, the hidden layer is divided into 7 modules, with 30 neurons in each module. The initial weights are drawn from a truncated normal distribution with zero mean and a standard deviation of 0.01. All models are trained for 20000 epochs using the RMSProp optimizer, where the learning rate is optimized in {10⁻³, 10⁻⁴, 10⁻⁵} and the decay rate is set to 0.9. Other settings are as follows: the batch size = 128, the updated threshold ε = 0.5, and the pruning threshold p = 0.5. Finally, the number of parameters (NP), the mean absolute percentage error (MAPE), and the normalized root mean square error (NRMSE) are used to comprehensively evaluate the performance of each model. The equations of MAPE and NRMSE are as follows:

MAPE = ⟨|y_t − y_t*| / y_t*⟩ × 100%  (11)
NRMSE = √⟨(y_t − y_t*)²⟩ / ⟨y_t*⟩  (12)

where ⟨•⟩ is the operation of the mean; y_t and y_t* represent the predicted value and the ground-truth value respectively. The smaller the values of the above indicators, the higher the predictive accuracy.

Table 2 shows the performance of each model on the test set. In terms of predictive accuracy, the Vanilla RNN is the worst performing model due to the gradient problem of the long sequence. Among the three models based on the gate mechanism, GRU performs best, followed by MGU, and LSTM performs worst. This indicates that more network parameters do not mean better performance.
Finally, among the four models based on multi-timescale connections, CW-RNN is inferior to LSTM in MAPE but superior to it in NRMSE. ZM-RNN performs better than LSTM but worse than MGU and GRU. This illustrates the inadequacy of module updates using fixed or random strategies. Both versions of the proposed AM-RNN are superior to all of the aforementioned models, which indicates that the modeling capability of AM-RNN is significantly enhanced by introducing adaptive updating of modules in a data-driven way. In terms of training complexity, the difference in network architecture determines the difference in the number of training parameters; according to the structure of each model, the number of parameters can be deduced by (13). The number of parameters of M-RNN is not only much less than that of the gated RNNs but even less than that of the Vanilla RNN, thanks to the partial update of modules and the pruning of recurrent connections adopted by M-RNN. Note that since ZM-RNN and AM-RNN-I prune the recurrent connections according to probability, their number of parameters is an indeterminate value; when they happen to have the same amount of weight pruning as CW-RNN, the number of parameters can be considered equal. Finally, AM-RNN-II has the same number of parameters as CW-RNN because both adopt the slow-to-fast strategy for pruning the recurrent connections. As can be seen from Table 2, AM-RNN-II requires only 58% of the Vanilla RNN's parameters to reduce MAPE by 37%. Compared to GRU, the best-performing gated RNN, it takes only 19% of GRU's parameters to reduce MAPE by 11%. Figure 9 shows more detail of the residuals between the ground-truth values and the predicted values of all models. The larger the black area in the figure, the larger the predictive error. In particular, it can be observed that the areas with large residuals tend to be concentrated in the same intervals.
These intervals are characterized by large abnormal fluctuations that make load changes particularly difficult to predict. Especially in the last segment, which corresponds to December 2017, the load shows a significant upward trend due to the Australian summer. Nevertheless, both versions of the proposed AM-RNN still track the target time series very closely and obtain the best predictive accuracy. Their performance is very similar, with only minor differences; the consistently high performance of AM-RNN across both versions can be explained by its adaptive multi-timescale connectivity.
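As a rough check on the reported parameter ratios, the sketch below implements the two metrics under standard definitions (the exact NRMSE normalization, taken here as the mean of the ground truth, is an assumption) and counts parameters for the Queensland setup, assuming univariate input and output (n_x = n_y = 1). It reproduces the approximate 58% and 19% ratios quoted above.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, standard definition assumed."""
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)) * 100)

def nrmse(y_true, y_pred):
    """RMSE normalized by the mean of the ground truth (one common choice)."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.mean(np.abs(y_true)))

def n_params(model, n_x=1, n_h=210, n_y=1, k=7, m=30):
    """Rough parameter counts: gated models multiply the input/recurrent
    weights by their gate count; M-RNN with one-way pruning keeps only
    m^2 * k(k+1)/2 recurrent weights."""
    gate_mult = {"rnn": 1, "mgu": 2, "gru": 3, "lstm": 4, "m-rnn": 1}[model]
    if model == "m-rnn":
        rec = m * m * k * (k + 1) // 2      # pruned recurrent weights
    else:
        rec = gate_mult * n_h * n_h         # full recurrent weights
    inp = gate_mult * (n_x * n_h + n_h)     # input weights + biases
    out = n_h * n_y + n_y                   # output layer
    return inp + rec + out
```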

DISCUSSIONS
For M-RNN, the most important parameter is considered to be the number of modules in the hidden layer, because it determines the diversity of the multi-timescale connections. To better determine the applicability of the proposed AM-RNNs, comparative experiments with different numbers of modules were carried out on the Queensland load dataset with the other experimental settings unchanged. Since this parameter does not exist for the Vanilla RNN and the gated RNNs, Table 3 shows only the performance of the models belonging to M-RNN. The experimental results show that both versions of AM-RNN remain competitive under different numbers of modules. To further evaluate the performance of AM-RNNs, additional experiments are carried out on a baseline dataset for chaotic time series prediction. Firstly, a time series containing 1700 sampling points is generated using the following Mackey-Glass equation:

dx/dt = 0.2·x(t−τ) / (1 + x(t−τ)¹⁰) − 0.1·x(t)

where the initial condition x(0) = 0.5 and the delay of the chaotic attractor τ = 17. The first 1500 points are used for training, while the last 200 points are divided equally between validation and testing. The number of hidden neurons is set to 12 and can be equally divided into 6 modules for M-RNN. The initial weights are drawn from a truncated normal distribution with zero mean and a standard deviation of 0.1. All models are trained using the Adam optimizer with a learning rate of 10⁻⁴. All other settings remain the same as in the previous experiment, except that the batch size = 100. To assess dependencies of different lengths, the time steps are set to 50, 100, and 150 respectively, with the results shown in Table 4. Consistent with the previous experimental results, both versions of AM-RNN show good performance. The performance of AM-RNN-II is particularly outstanding: it almost always has the best predictive accuracy regardless of the time-step size.
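The Mackey-Glass series can be generated, for example, by Euler integration with unit step; the discretization scheme used for Table 4 is not specified above, so this is one common choice rather than the authors' exact procedure.

```python
import numpy as np

def mackey_glass(n_points=1700, tau=17, x0=0.5, dt=1.0):
    """Generate a Mackey-Glass series, dx/dt = 0.2*x(t-tau)/(1+x(t-tau)^10)
    - 0.1*x(t), by simple Euler integration with a constant history x0."""
    x = np.full(n_points + tau, x0)          # history buffer: x(t) = x0, t <= 0
    for t in range(tau, n_points + tau - 1):
        x_tau = x[t - tau]                   # delayed term x(t - tau)
        x[t + 1] = x[t] + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x[t])
    return x[tau:]
```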

CONCLUSION
RNNs have gained widespread attention in many time-related tasks, such as electric load forecasting studied in this article. However, capturing complex time dependencies in sequence data, especially long-term dependencies, remains an open challenge for RNNs. By introducing the skip length of the module, this article mainly studies the general framework called the modularized RNN (M-RNN).
The adaptive M-RNN (AM-RNN) is designed under this framework and can capture long-term dependencies adaptively thanks to its multi-timescale recurrent connections. The main feature of AM-RNN is that it dynamically adjusts the updated frequency of each module by calculating its priority. By reducing the number of modules being updated to obtain longer skip lengths, AM-RNN provides shortcuts for gradient propagation. Finally, two versions of AM-RNN are obtained by combining the updated strategy of hidden modules with the pruning strategy of recurrent connections, and their superiority is demonstrated by experiments. The experimental results show that AM-RNN-II is superior not only to the Vanilla RNN and the popular gated RNNs but also to the existing multi-timescale RNNs. In the future, the authors will further study the application of AM-RNN in other fields of smart cities, such as traffic flow prediction and air quality prediction.