Malware Detection in Network Flows With Self-Supervised Deep Learning


Thomas Alan Woolman, Philip Lunsford
Copyright: © 2023 | Pages: 18
DOI: 10.4018/978-1-7998-9220-5.ch139


This article explores the application of anomaly detection models to network flow data, using multilayer perceptron autoencoding neural networks for self-supervised detection of novel network intrusion events and malware classes over unrestrained internet connections. The authors used network flows rather than more detailed (and larger) packet capture logs in order to create a more cost-effective and potentially faster anomaly detection tool that could more easily scale to enterprise-class network traffic analysis. Unsupervised/self-supervised deep learning anomaly detection was applied to this less granular dataset to maximize the likelihood of detecting novel network activities without relying on predefined rules and training data. The authors conclude with a test of statistical significance against known threat classes (unknown to the anomaly detection model), which indicates that the proposed methodology produced statistically significant results for detecting threat classes in unrestrained internet networks using network flow data.
Chapter Preview


With ever-increasing network complexity and threat actor sophistication, the vulnerabilities of critical network infrastructure and host systems are potentially greater than ever before. Targeted organizations and government entities that defend their network perimeters with traditional threat detection systems have only a limited set of tools, which are traditionally based on simple statistical tests of network activities and on known threat signatures. These threat signatures generally rely on predefined malware detection rules derived from known, previously encountered network intrusion attack types. As a result, sensitive information and critical resource applications can be highly vulnerable to novel, sophisticated, and evolving network intrusion types, putting commercial and public sector resources and information in mounting jeopardy.

The ability to detect legacy cyber threats through a multilayered defense approach is based on research pioneered by Chess and White (1987), initially based on permutations of signature detection methods proposed by Cohen (1987). While signature-based network malware and intrusion detection are still among the most heavily used techniques, heuristic approaches that can discern multiple, related threats from a single definition source have become increasingly common, as defined by Kaspersky Lab ZAO (2013). However, novel threats, as well as more advanced malware and intrusion events that are explicitly designed to avoid detection by the most commonly used tools and techniques, are also on the rise. Once they bypass the network security perimeter, intrusions and malware can quickly propagate throughout the network and operate undetected for substantial lengths of time. In many cases, these network intrusions can access restricted information while remaining undetected, disguising their traffic signatures as legitimate, benign activities.

As the latest generation of network intrusion technologies becomes better at resisting successful classification, continuous improvement in the multilayered network defense approach first proposed in 1987 becomes increasingly necessary. One example of this emerging malware threat class is a sophisticated modular malware known as Flame, first discovered in 2012 on networked devices running the Microsoft Windows operating system (ICIRT, 2012). Flame, also known as Skywiper, is believed to have been developed by a state actor as a cyber-weapon that was deployed for espionage purposes against one or more targets in the Middle East (Kaspersky Lab ZAO, 2013).

First detected inadvertently in 2012, Flame is now generally regarded as an unusually robust backdoor attack toolkit with worm-like features and Trojan capabilities. It can replicate both within a targeted network and on removable media when commanded to do so by a remote threat actor’s command and control server. Although the exact method of entry into a network has not yet been determined, Flame’s ability to take on different roles through a wide range of add-on functional libraries makes it both extraordinarily adaptable and difficult to analyze with traditional mitigation and detection methods. It also uses the novel technique of concealment through an unusually large and variable codebase compared to most other network malware threats.

Flame is capable of harvesting sensitive data in a variety of ways, including robust SQL database query insertions, compressed digital audio microphone recording, Bluetooth wireless connectivity attacks from inside the network, and file and network traffic ingestion and analysis. Flame can also take recurring screenshot images from infected devices. According to Kaspersky Lab ZAO (2013), Flame is capable of reporting back to an external command and control server from within the targeted network via a covert SSL data channel, as well as turning other host devices within the network into beacons that are discoverable via Bluetooth connections.

Key Terms in this Chapter

Deep Learning: Deep learning refers to a subset of the field of machine learning that uses neural network algorithms built from successive layers of neurons, called perceptrons, for the purpose of representation learning. The learning conducted by these various forms of neural networks can be supervised, semi-supervised, or unsupervised.

Network Intrusion Detection System (NIDS): A hardware device and/or software application designed to monitor a digital network for malicious activity and violations of network policies, and to record network activities for analysis. A NIDS typically records incoming network traffic from the perspective of an enterprise host system. These systems are traditionally subdivided into classes, such as signature-based systems (which rely on specific, predefined patterns of network behavior to identify specific malware events) and anomaly-based systems. Anomaly-based systems are intended to be more adaptable to previously unknown malware attacks because they are not limited to being pre-programmed for specific malware signatures, but they are more challenging to develop.

Multilayer Perceptron: Also known as an MLP neural network, a multilayer perceptron is a type of deep learning neural network algorithm composed of multiple layers of perceptrons. Every MLP model contains an input layer, at least one “hidden” layer, and an output layer. MLP models generally contain a non-linear activation function and are thus sensitive to non-linear relationships present in the dataset. MLP models are typically referred to as “vanilla” neural networks, as opposed to recurrent neural networks, convolutional neural networks, and other more mission-specific forms of neural networks.
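
The layer structure in this definition can be illustrated with a small forward pass. This is only a sketch under stated assumptions: the function names, the tanh activation, and the weight values are all hypothetical choices made for illustration and are not taken from the chapter.

```python
import math


def dense(inputs, weights, biases, activation=None):
    """Apply one fully connected layer: one weight row and one bias per perceptron."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (tanh, an assumed activation) -> linear output layer."""
    hidden = dense(x, hidden_w, hidden_b, activation=math.tanh)
    return dense(hidden, out_w, out_b)


# Arbitrary illustrative weights (hypothetical, not learned from data):
# two inputs, two hidden perceptrons, one output perceptron.
HIDDEN_W = [[1.0, 1.0], [1.0, -1.0]]
HIDDEN_B = [0.0, 0.0]
OUT_W = [[1.0, 1.0]]
OUT_B = [0.5]

prediction = mlp_forward([1.0, 0.0], HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
```

Because of the non-linear activation in the hidden layer, doubling the input does not double the change in the output, which is the property that lets an MLP respond to non-linear relationships.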

Network Flow: Network flow data typically refers to metadata (higher-level digital records) that characterizes connections made across a digital network, without recording the exact content of each network activity. Network flows typically contain the internet protocol addresses and port numbers used in each recorded network connection event, along with protocols, time stamps and connection durations, source and destination hosts, and network interfaces used. Network flows do not include the actual content of each network connection, which makes them a much more “lightweight” information monitoring tool than a packet capture log, which does include the connection content and related content attributes.
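
To make the shape of such a record concrete, the fields listed in this definition could be represented roughly as follows. The class name, field names, and example values are hypothetical (the addresses come from ranges reserved for documentation), and real flow export formats define their own schemas.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlowRecord:
    """One network flow: connection metadata only, with no packet payload."""
    start_time: datetime       # time stamp of the connection
    duration_seconds: float    # connection duration
    protocol: str              # e.g. "TCP" or "UDP"
    src_ip: str                # source host address
    src_port: int
    dst_ip: str                # destination host address
    dst_port: int
    interface: str             # network interface that observed the flow


# Hypothetical example values for illustration only.
example_flow = FlowRecord(
    start_time=datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    duration_seconds=4.2,
    protocol="TCP",
    src_ip="192.0.2.10",
    src_port=51514,
    dst_ip="198.51.100.7",
    dst_port=443,
    interface="eth0",
)

flow_fields = sorted(asdict(example_flow))
```

Note that the record has no field for connection content, which is what keeps flow data small relative to a packet capture.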

Anomaly detection: Anomaly detection refers to finding unusual events or outliers within a given dataset. Statistics offers many techniques for detecting anomalies in univariate data. Multivariate techniques exist as well, but with standard statistical methods they are more complex and less scalable. They are also potentially less accurate, because a traditional model is limited to homogenized methods across all variables and is generally limited to linear relationships between independent variables. Anomaly detection in this paper uses a multilayer perceptron neural network that is potentially more sensitive to complex features of the multivariate data, including both linear and nonlinear attributes.
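
For contrast with the neural network approach, a classic univariate technique of the kind mentioned in this definition is the z-score test. The sketch below is not the chapter's method. The function name, the threshold of 2.0, and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the indices of values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical connection durations in seconds; the last one is unusually long.
connection_durations = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = zscore_outliers(connection_durations)
```

A test like this looks at a single variable in isolation, which is why it cannot capture an observation that is unusual only in the combination of its attributes.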

Autoencoding Neural Network: A form of unsupervised machine learning that uses multiple layers within a neural network to first encode and then decode the attributes of a dataset, for the purpose of learning which attributes are significant features. This feature extraction process is often referred to as a self-supervised process. The autoencoding neural network can then produce an anomaly score for each observation from its reconstruction error, which the chapter calls the reconstructed mean square of the error (RMSE). Higher RMSE scores denote increasing difficulty in reproducing the observation in the decoding layer of the neural network, and thus generally denote a greater potential anomaly in the multivariate dataset observation.
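
The scoring step in this definition can be sketched as follows, assuming a trained autoencoder has already produced a reconstruction of each observation. The function names and sample values are hypothetical, and the score is computed as the mean of the squared differences, following the expansion of RMSE given in this definition (an assumption about the exact formula, with no square root taken).

```python
def reconstruction_mse(original, reconstructed):
    """Mean of squared differences between an observation and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("observation and reconstruction must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def rank_by_anomaly_score(observations, reconstructions):
    """Return (index, score) pairs, highest score (most anomalous) first."""
    scores = [reconstruction_mse(o, r) for o, r in zip(observations, reconstructions)]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical scaled feature vectors and the autoencoder's attempts to reproduce them.
observations = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.1]]
reconstructions = [[0.1, 0.2, 0.3], [0.5, 0.4, 0.1]]

ranking = rank_by_anomaly_score(observations, reconstructions)
```

Here the second observation is reproduced poorly, so it ranks first as the stronger anomaly candidate, while the first observation is reproduced exactly and scores zero.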

Unsupervised Learning: Unsupervised machine learning is designed to discover patterns or groupings within datasets where no dependent or target variable is present. It is typically used to discover relationships where no labeled outcome variable is known to the algorithm and is frequently used in the area of clustering, association and dimensionality reduction. For the purposes of this paper, unsupervised learning is used for anomaly detection within multivariate data without the use of a labeled outcome dependent variable.
