In the near future, several radio access technologies will coexist in Beyond 3G (B3G) mobile networks, and they will eventually be transformed into one seamless global communication infrastructure. Self-managing systems (i.e., those that self-configure, self-protect, self-heal and self-optimize) are the solution to tackle the high complexity inherent in these networks. In this context, this chapter proposes a system for automated fault management in the Radio Access Network (RAN) of wireless systems. The chapter presents some basic definitions and describes how fault management is performed in current mobile communication networks. It also discusses some methods proposed for auto-diagnosis, which is the most complex task in fault management. The presented systems use Key Performance Indicators (KPIs) to identify the cause of the network malfunction.
There is no doubt that during the last decade mobile communications have played an increasingly important role in the telecommunication business, and they will continue to do so in the years to come. In recent years, 3G networks, called Universal Mobile Telecommunications System (UMTS) networks in Europe, have started to be deployed throughout the world. In the near future, thanks to 3G, mobile internet services are expected to be available "anywhere and anytime". Users will surf the Web, check their email, download files or hold real-time videoconferences in a shopping mall, the airport, the city center or their homes. Beyond 3G (B3G) mobile networks (Jamalipour, 2005) will comprise a set of interrelated and rapidly growing wireless networks, applications that will require increasing bandwidth, and users who will demand high quality of service at low cost, all within a limited spectrum allocation. In these networks, the highly complex and heterogeneous Radio Access Network (RAN) will be composed of different technologies, such as GSM, UMTS and WLAN.
Until now, most operational tasks have been performed manually, requiring dedicated staff, with the resulting inflexibility and delay in response. However, network operators are currently showing growing interest in automating most network management activities. This has stimulated intense research in the field of self-managing networks (Pras, 2007; Kephart, 2003; Strassner, 2004). In this context, the self-managing property refers to the capability of the network to self-configure, self-protect, self-heal and self-optimize. These issues have been the main drivers behind recent studies dealing with automation and optimization of cellular networks (Halonen, 2003; Johnson, 2004; Lempiäinen, 2001; Laiho, 2002a).
In a mature cellular network that has undergone most of its site roll-out, the major cost is associated with the operation of the network. As the network consists of a large number of pieces of equipment distributed across the entire country, maintaining and operating this large and technically complicated system is a difficult task that requires operator personnel around the clock in several regional offices. For example, a GSM network in a typical European country may consist of about 10,000 sites. Due to the large size of the networks, it is common for some of the deployed pieces of equipment not to work as planned. The consequence of such problems is poor end-user service. As several operators compete for subscribers in most countries, it is imperative to rectify such occurrences; otherwise, users will be dissatisfied with the service and will likely switch to competing network operators. Hence, fault management, also called troubleshooting (TS), is a key aspect of operating a cellular system in a competitive environment. As the RAN is by far the biggest part of a cellular network, most TS activities are focused on this area.
Key Terms in this Chapter
Fault Detection: Identification of those cells in a cellular network that have some problems.
Diagnosis Model: A representation of how the identification of the fault cause should be performed. It comprises a qualitative part (causes and symptoms) and a quantitative part (parameters that quantify the relations among causes and symptoms).
Auto-Diagnosis: Automated identification of the cause of a fault, based on the analysis of the symptom values.
Decision-Theoretic Troubleshooting: Method to solve problems that tries to find a sequence of tests and actions that maximizes the efficiency of the troubleshooting process.
Trouble Ticket: Procedure used to reflect the status of a problem while a fault is being investigated. It consists of the fault description and the steps performed so far to solve it, or the identification of the faulty equipment in case the problem is deemed a hardware fault.
Diagnosis Accuracy: Percentage of cases in a data set correctly classified, that is, the percentage of cases where the diagnosed cause is equal to the actual cause.
Key Performance Indicators: Quantifiable measurements that reflect the performance of the wireless network.
Alarm Correlation: Filtering of alarms into meaningful high-level alarms in order to avoid overloading the operators. It involves different tasks: reduction of multiple occurrences of an alarm into a single alarm, inhibition of low-priority alarms in the presence of higher-priority alarms, substitution of a specific set of correlated alarms by a new one, etc.
Bayesian Network: A directed graph and a set of conditional probability functions that allow an efficient representation of a joint distribution over a set of random variables.
Troubleshooting: Also referred to as fault management. Procedures carried out to solve problems. It comprises three phases: fault detection, diagnosis and solution deployment.
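
To make several of the terms above concrete (diagnosis model, auto-diagnosis, Bayesian network and diagnosis accuracy), the following minimal Python sketch may be helpful. It is only an illustration under stated assumptions, not the system proposed in this chapter: it uses the simplest Bayesian network structure, in which symptoms are assumed to be conditionally independent given the cause, and every cause name, symptom name and probability value is hypothetical.

```python
from math import prod

# Hypothetical diagnosis model; all names and numbers are illustrative assumptions.
# Qualitative part: the causes and the symptoms.
# Quantitative part: the prior and conditional probabilities below.
PRIORS = {"interference": 0.5, "hardware_fault": 0.2, "lack_of_coverage": 0.3}

# P(symptom is present | cause), for symptoms derived from discretised KPIs.
LIKELIHOODS = {
    "interference":     {"high_dropped_calls": 0.8, "low_signal_level": 0.1},
    "hardware_fault":   {"high_dropped_calls": 0.6, "low_signal_level": 0.3},
    "lack_of_coverage": {"high_dropped_calls": 0.7, "low_signal_level": 0.9},
}


def posterior(symptoms):
    """Return P(cause | symptoms), assuming symptoms are independent given the cause."""
    scores = {}
    for cause, prior in PRIORS.items():
        likelihood = prod(
            p if symptoms[name] else 1.0 - p
            for name, p in LIKELIHOODS[cause].items()
        )
        scores[cause] = prior * likelihood
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


def diagnose(symptoms):
    """Auto-diagnosis: pick the cause with the highest posterior probability."""
    post = posterior(symptoms)
    return max(post, key=post.get)


def diagnosis_accuracy(cases):
    """Percentage of (symptoms, actual_cause) cases whose diagnosis equals the actual cause."""
    correct = sum(1 for symptoms, actual in cases if diagnose(symptoms) == actual)
    return 100.0 * correct / len(cases)


# Hypothetical labelled cases, e.g. taken from closed trouble tickets.
CASES = [
    ({"high_dropped_calls": True, "low_signal_level": True}, "lack_of_coverage"),
    ({"high_dropped_calls": True, "low_signal_level": False}, "interference"),
    ({"high_dropped_calls": True, "low_signal_level": False}, "hardware_fault"),
    ({"high_dropped_calls": False, "low_signal_level": True}, "lack_of_coverage"),
]

if __name__ == "__main__":
    print(diagnose(CASES[0][0]))      # lack_of_coverage
    print(diagnosis_accuracy(CASES))  # 75.0
```

With these assumed values, the third case is diagnosed as interference although its actual cause is a hardware fault, so three of the four cases are classified correctly and the diagnosis accuracy is 75%.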