The growth of the internet and network-based services brings many new opportunities but also poses many new security threats. Intrusion detection systems (IDSs) have been studied and developed over the years to cope with external attacks from the internet. The task of an IDS is to classify malicious traffic from outside and stop it from entering the computer system. In recent years, machine learning-based IDSs have attracted considerable attention from industry and academia. IDSs based on state-of-the-art machine learning algorithms usually achieve higher predictive performance than traditional approaches. In addition, several open datasets have been introduced so that researchers can evaluate and compare their algorithms. This chapter reviews the classification techniques used in IDSs, mainly machine learning algorithms, and the published datasets. The authors discuss the achievements and some open problems and suggest a few directions for future research.
Introduction
Network security plays a crucial role in our modern world. It helps to secure our communication and information, reduce financial loss, and prevent network disruption due to attacks. In particular, because of the growth and importance of the Internet, almost every computer in the world is connected and hence vulnerable to attacks. We can claim that for any computer, attacks will happen sooner or later. Intrusion detection systems (IDSs) are designed to help computer systems deal with external attacks by classifying and stopping any incoming malicious traffic (Tsai et al., 2009). It is worth noting that the IDS does not deal with internal attacks, i.e., attacks that start from a computer within the local network. The role of the IDS is visualized in Figure 1.
Figure 1. The IDS in a computer system
Tsai et al. (2009) divided IDSs into two main categories: those that rely on anomaly detection and those that rely on signature detection. However, in light of recent developments in IDS research, we propose dividing IDSs into three main categories: the signature-based approach, the anomaly detection approach, and the classification approach.
The signature-based IDS relies on signatures, or rules, defined by security experts. For instance, the experts might define any packet whose source IP address is the same as its destination IP address as malicious. The signature-based IDS was very popular in the early days of the Internet and is still widely used in industry today (Dang, 2020a).
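As a minimal sketch of how such an expert-defined rule could be expressed in code, the following Python example encodes the same-source-and-destination rule from the paragraph above. The `Packet` type, the `Signature` alias, and the function names are hypothetical and chosen only for illustration; they do not correspond to any particular IDS product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified view of a packet header."""
    src_ip: str
    dst_ip: str


# A signature is modeled here as a predicate over a packet (an assumption).
Signature = Callable[[Packet], bool]


def same_source_and_destination(packet: Packet) -> bool:
    """Expert rule from the text: source IP equals destination IP."""
    return packet.src_ip == packet.dst_ip


# The rule base is just a list; experts would append further signatures.
SIGNATURES: List[Signature] = [same_source_and_destination]


def is_malicious(packet: Packet, signatures: List[Signature] = SIGNATURES) -> bool:
    """Flag the packet if at least one signature matches it."""
    return any(signature(packet) for signature in signatures)
```

In this sketch, detection quality depends entirely on the rules that have been written down: traffic that matches no signature is passed through, which is why the rule base has to be maintained by experts.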
The anomaly detection-based IDS relies on the assumption that benign traffic forms a common group, while malicious traffic is an outlier that does not share its properties. Anomaly detection is visualized in Figure 2. We can see that the majority of the data is assumed to be normal (benign) data, and the anomalies are defined as anything that differs from the normal data.
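The outlier assumption can be illustrated with a deliberately simple z-score detector. This is only a sketch under stated assumptions, not a method taken from the cited works: the single numeric feature (for example, bytes per flow), the class name `ZScoreDetector`, and the default threshold of 3 standard deviations are all hypothetical choices. The detector is fitted on unlabeled traffic that is assumed to be mostly benign.

```python
from statistics import mean, stdev
from typing import List


class ZScoreDetector:
    """Toy one-feature anomaly detector (illustrative assumption)."""

    def __init__(self, threshold: float = 3.0) -> None:
        # Number of standard deviations beyond which a value is an outlier.
        self.threshold = threshold
        self.mu = 0.0
        self.sigma = 1.0

    def fit(self, observed_values: List[float]) -> "ZScoreDetector":
        """Estimate the 'normal' profile from mostly benign, unlabeled data."""
        self.mu = mean(observed_values)
        self.sigma = stdev(observed_values)
        if self.sigma == 0:
            raise ValueError("observed values must not all be identical")
        return self

    def is_anomaly(self, value: float) -> bool:
        """Flag values that lie far from the estimated normal profile."""
        return abs(value - self.mu) / self.sigma > self.threshold


# Hypothetical feature values, e.g. bytes per flow.
detector = ZScoreDetector().fit([500, 520, 480, 510, 490, 505, 495])
```

No attack labels are used anywhere: a value is reported only because it deviates from the bulk of the data, which mirrors the vague "different from normal" definition of an anomaly.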
The classification-based IDS relies on one or more machine learning classification algorithms (Umadevi and Marseline, 2017). To perform the classification, the algorithms require a labeled dataset (Ring et al., 2019). The required dataset is usually huge and needs to be updated frequently. The difference between anomaly detection and classification is that in the classification setting we know exactly which traffic constitutes an attack, while in the anomaly detection setting we have only a vague definition that attacks are different from benign traffic. Hence, the classification algorithms are supervised learning techniques, while the anomaly detection algorithms are unsupervised. Classification algorithms have been the dominant algorithms in the recent literature (Alqahtani et al., 2021).
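To make the supervised setting concrete, the sketch below trains a nearest-centroid classifier on a tiny labeled dataset. The choice of algorithm, the two features (packets per second and mean packet size), the label names, and the function names are all assumptions made for illustration; the works cited above use a variety of other classifiers and far larger datasets.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Features = Tuple[float, ...]


def fit_centroids(samples: Sequence[Features],
                  labels: Sequence[str]) -> Dict[str, Features]:
    """Compute one mean feature vector (centroid) per class label."""
    grouped: Dict[str, List[Features]] = defaultdict(list)
    for sample, label in zip(samples, labels):
        grouped[label].append(sample)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids: Dict[str, Features], sample: Features) -> str:
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], sample))


# Hypothetical labeled flows: (packets per second, mean packet size in bytes).
train_x = [(10.0, 500.0), (12.0, 480.0), (900.0, 60.0), (950.0, 64.0)]
train_y = ["benign", "benign", "dos", "dos"]
centroids = fit_centroids(train_x, train_y)
```

Unlike the anomaly detection setting, every training sample here carries an explicit label, so the model can only recognize the attack classes that appear in the labeled data.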
In recent years, machine learning techniques have achieved considerable success in empowering IDSs (Ahmad et al., 2021). In the rest of the chapter, we discuss the machine learning algorithms studied in the literature and the published datasets for training and testing the models. We then discuss the results and the limitations of the presented approaches. Finally, we outline some further research directions and conclude the chapter.
Datasets
In this section, we review some of the most popular published intrusion datasets. These datasets can be used by researchers to train and test intrusion detection algorithms.
DARPA
The DARPA 98/99 dataset (McHugh, 2000) is one of the first intrusion datasets to be introduced (Ring et al., 2019). The dataset has two parts: one for offline evaluation and one for online evaluation. The dataset was created using an emulated network system. The 1998 version contains seven weeks of data, while the 1999 version contains five weeks of data. The data has been criticized for containing too much redundancy (McHugh, 2000).