Big Data Analytics for Intrusion Detection: An Overview

Luis Filipe Dias, Miguel Correia
DOI: 10.4018/978-1-5225-9611-0.ch014


Intrusion detection has become a problem of big data, with a semantic gap between vast security data sources and real knowledge about threats. The use of machine learning (ML) algorithms on big data has already been successfully applied in other domains. Hence, this approach is promising for dealing with cyber security's big data problem. Rather than relying on human analysts to create signatures or classify huge volumes of data, ML can be used. ML allows the implementation of advanced algorithms to extract information from data using behavioral analysis or to find hidden correlations. However, the adversarial setting and the dynamism of the cyber threat landscape stand as difficult challenges when applying ML. The next generation security information and event management (SIEM) systems should provide security monitoring with the means for automation, orchestration and real-time contextual threat awareness. However, recent research shows that further work is needed to fulfill these requirements. This chapter presents a survey on recent work on big data analytics for intrusion detection.
Chapter Preview


Over the past two decades, network intrusion detection systems (NIDSs) have been intensively investigated in academia and deployed by industry (Debar et al., 1999). More recently, intrusion detection has become a big data problem because of the growing volume and complexity of the data needed to unveil increasingly sophisticated cyberattacks. The security information and event management (SIEM) systems adopted during the last decade show limitations when processing big data, and even more so when extracting the information that such data can provide. Therefore, new techniques to handle high volumes of security-relevant data, along with machine learning (ML) approaches, are receiving much attention from researchers. This chapter presents an overview of the state of the art on this subject.

The Cloud Security Alliance (CSA) suggested that intrusion detection systems (IDSs) have been going through three stages of evolution corresponding to three types of security tools (Cárdenas et al., 2013):

  • IDS: able to detect well-known attacks efficiently using signatures (misuse detection) and to unveil unknown attacks at the cost of high false alarm rates (anomaly detection);

  • SIEM: collect and manage security-relevant data from different devices in a network (e.g., firewalls, IDSs, and authentication servers), providing increased network security visibility by aggregating and filtering alarms, while providing actionable information to security analysts;

  • 2nd generation SIEM: the next generation, which should be able to handle big data and make the most of it, reducing the time needed to correlate, consolidate, and contextualize even more diverse and unstructured security data (e.g., global threat intelligence, blogs, and forums); such systems should also provide long-term storage for correlating historical data as well as for forensic purposes.
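
The difference between the two IDS detection styles named in the first item can be illustrated with a minimal sketch. It is not taken from the chapter: the signature patterns, the function names, and the z-score threshold are all assumptions chosen for illustration, and real rule sets and anomaly models are far richer.

```python
import re
from statistics import mean, pstdev
from typing import Sequence

# Hypothetical signatures for illustration only.
SIGNATURES = [
    re.compile(r"/etc/passwd"),
    re.compile(r"(?i)union\s+select"),
]


def misuse_detect(event: str) -> bool:
    """Misuse detection: flag an event that matches a known signature."""
    return any(sig.search(event) for sig in SIGNATURES)


def anomaly_detect(baseline: Sequence[float], value: float,
                   threshold: float = 3.0) -> bool:
    """Anomaly detection: flag a value whose z-score against the
    baseline exceeds the (assumed) threshold."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A constant baseline: any different value counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Requests per minute from one host during normal operation (made-up numbers).
baseline = [10, 12, 11, 9, 10]
known_attack = misuse_detect("GET /item?id=1 UNION SELECT password FROM users")
unusual_rate = anomaly_detect(baseline, 50)
```

Signature matching only recognises what has already been described, while the statistical check can flag previously unseen behaviour at the risk of false alarms when legitimate activity simply changes.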

The European Agency for Network and Information Security (ENISA) stated that the next generation SIEMs are the most promising domains of application for big data (ENISA, 2015). According to recent surveys from the SANS Institute, most organizations are just starting to evolve from traditional SIEMs to more advanced forms of security analytics and big data processing (Shackleford, 2015, 2016). In fact, industry developments in the area led Gartner to start publishing market guides for user and entity behavior analytics (UEBA) (Litan, 2015). While recent UEBA technologies target specific security use cases (e.g., insider threats), typical SIEM technologies provide comprehensive rosters of all security events, which are also important for compliance requirements. Recent guides from Gartner (Bussa et al., 2016; Kavanagh et al., 2018) state that “Vendors with more mature SIEM technologies are moving swiftly to incorporate big data technology and analytics to better support detection and response”, revealing the trend towards 2nd generation SIEMs.

The focus of this survey is on state-of-the-art techniques that can contribute to such next generation SIEMs, and on the challenges of big data and ML applied to cybersecurity analytics. There are a few related surveys available in the literature, none with this focus. Buczak & Guven (2016) analyze papers that use different ML techniques in the cybersecurity domain; although interesting, most of the experiments in those studies were done with datasets that date back to 1999 and that do not represent the current cybersecurity landscape. Bhuyan et al. (2014) provide a comprehensive overview of network anomaly detection methods. Zuech et al. (2015) review works considering the problem of big heterogeneous data associated with intrusion detection. While this last work touches on topics similar to those of this chapter, it lacks a comprehensive study of recent techniques that tackle the problem of extracting useful information from such volumes of data.

Key Terms in this Chapter

SIEM: A software solution that collects, manages, and correlates security events generated by several network devices.

Clustering: The automatic grouping of data according to their degree of similarity.

Big Data: Datasets so large that traditional data processing technologies have difficulty handling them in a tolerable time.

Anomaly detection: Finds deviations from normal behavior. Rare events or observations raise suspicion when they differ significantly from most of the data.

Stream Processing: Given a data set (a stream), an operation is applied to each element of the stream as it arrives.

Security Analytics: An approach to cybersecurity focused on the analysis of data.

Log Analysis: The extraction of knowledge about threats or malfunctions from records generated by a device.
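
To make the stream processing and log analysis terms more concrete, the following is a minimal sketch of an operation applied to each element of a stream of log records. It is not from the chapter: the function name `failed_login_alerts`, the `(timestamp, source)` record shape, the 60-second window, and the limit of three failures are all hypothetical, and the sketch assumes records arrive in timestamp order.

```python
from collections import Counter, deque
from typing import Deque, Iterable, Iterator, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, source identifier)


def failed_login_alerts(events: Iterable[Record],
                        window: float = 60.0,
                        limit: int = 3) -> Iterator[Record]:
    """Yield (timestamp, source) each time a source has more than `limit`
    failed logins within the last `window` seconds."""
    recent: Deque[Record] = deque()  # records currently inside the window
    counts: Counter = Counter()      # failures per source inside the window
    for ts, src in events:
        recent.append((ts, src))
        counts[src] += 1
        # Evict records that have fallen out of the sliding window.
        while recent and recent[0][0] <= ts - window:
            _, old_src = recent.popleft()
            counts[old_src] -= 1
        if counts[src] > limit:
            yield ts, src


# Made-up failed-login records for illustration.
stream = [(0, "a"), (10, "a"), (20, "a"), (30, "b"), (40, "a"), (200, "a")]
alerts = list(failed_login_alerts(stream))
```

Because the function is a generator that keeps only the records inside the window, memory use depends on the window size rather than on the total length of the stream, which is the property that makes this style suitable for high-volume security data.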
