An Opcode-Based Malware Detection Model Using Supervised Learning Algorithms

Om Prakash Samantray, Satya Narayan Tripathy
Copyright: © 2021 |Pages: 13
DOI: 10.4018/IJISP.2021100102

Abstract

Many available malware detection techniques rely on a signature-based approach. This approach detects known malware very effectively but may fail to detect unknown or zero-day attacks. In this article, the authors propose a malware detection model that uses the operation codes (opcodes) of malicious and benign executables as features. The proposed model uses the opcode extract and count (OPEC) algorithm to prepare the opcode feature vector for the experiment. The most relevant features are selected using the extra tree classifier feature selection technique and then passed to several supervised learning algorithms, namely support vector machine, naive Bayes, decision tree, random forest, logistic regression, and k-nearest neighbour, to build classification models for malware detection. The proposed model achieves a detection accuracy of 98.7%, which compares favourably with many of the similar works discussed in the literature.

Introduction

Malware is a malicious program designed with the intent to damage sensitive information stored in computers or mobile devices. Malware is a generic name for different kinds of malicious software, such as viruses, ransomware, rootkits, botnets, trojans, worms, adware and so on. These are computer pollutants that enter a system and exploit vulnerabilities of the operating system to execute unintended code (Behera & Bhaskari, 2017). Malware writers design malware not only for fame but also for financial gain. Many anti-malware or malware detection systems have been developed so far, but the need for more efficient detection strategies continues to motivate researchers in this domain. Signature-based detection systems may not detect unknown and zero-day attacks because they rely on databases of known signatures. Although signature-based detection is good at identifying well-known malicious code, it may not be suitable for detecting obfuscated code and previously unknown malware. The situation is even more critical in the case of metamorphic and polymorphic malware, which are created to evade detection engines. Malware creators usually create such malware by inserting different code with similar functionality. The purpose of metamorphic and polymorphic malware is the same, but they use different obfuscation and propagation techniques. Packing, encryption and compression are the most common obfuscation techniques used to change the appearance of malware so as to evade detection engines. Therefore, there is a need for a detection model that can use the code (in particular, the opcodes) of an executable as a feature to classify it as malware or benignware.
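The weakness of signature matching described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a "signature" is a whole-file SHA-256 digest; real detection engines use richer byte patterns. The byte string, the `signature_db` set and the helper names are hypothetical and do not come from the article.

```python
import hashlib


def signature(data: bytes) -> str:
    """Return a whole-file SHA-256 digest, used here as a stand-in signature."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical byte string standing in for a known sample
# (x86: push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret).
known_sample = bytes.fromhex("558bec33c05dc3")

# Hypothetical database of known signatures.
signature_db = {signature(known_sample)}


def is_flagged(data: bytes) -> bool:
    """Flag a file only if its digest is already in the database."""
    return signature(data) in signature_db


# Variant with a single NOP (0x90) inserted: same behaviour, different bytes.
variant = known_sample[:3] + b"\x90" + known_sample[3:]

print(is_flagged(known_sample))  # True
print(is_flagged(variant))       # False
```

A single inserted instruction with no functional effect is enough to defeat this exact-match lookup, which is the kind of evasion that motivates features derived from the code itself.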

Malware analysis is an important step in malware research. This process is used to understand the structure and behavior of malware and benign samples. Analysis can be done either statically or dynamically. If an executable is analysed without being executed, it is called static analysis. Abimannan & Kumaravelu (2019) have presented a detailed mathematical description of heuristic-based static malware analysis. Dynamic analysis is performed by executing the file in a safe and controlled environment to understand its behavior. These two analysis methods can also be combined to extract the best features from the samples for classification (Vidyarthi et al., 2017). In this article, the operation codes of the samples are extracted using static analysis.

To solve a classification problem, machine learning methods are applied to a feature set to train a model and then test it. These algorithms learn the patterns present in the training set and look for the same patterns in the test dataset to classify each input as either malware or benign (Shabtai et al., 2012). This article proposes an opcode-based malware detection model using machine learning classification algorithms.

The features (opcodes) for this experiment are extracted by disassembling the sample programs using the IDA Pro tool. Disassembling a program yields the assembly language form of its executable instructions, which comprises many operation codes (opcodes). An opcode is the part of an instruction that states the action to be performed. Executable files (malware or benign) usually contain opcodes such as SUB, ADD, AND, OR, XOR, INC, DEC, MOV, MOVZX, CALL, TEST, SBB, IMUL, CMP, RETN, PUSH, PUSHF, POP, NOP, JZ, JNZ, JMP, LEA, JB, FDIVP, etc. Opcodes disclose important differences between malicious and legitimate programs (Bilar, 2007). Therefore, operation codes can be used as a feature in malware classification.
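As a rough illustration of turning a disassembly listing into opcode counts, the following sketch counts mnemonics from the list above. It is not the authors' OPEC algorithm, which this preview does not specify. The listing layout (`section:address mnemonic operands ; comment`), the regular expression and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Opcode vocabulary taken from the examples listed in the text.
OPCODES = [
    "SUB", "ADD", "AND", "OR", "XOR", "INC", "DEC", "MOV", "MOVZX", "CALL",
    "TEST", "SBB", "IMUL", "CMP", "RETN", "PUSH", "PUSHF", "POP", "NOP",
    "JZ", "JNZ", "JMP", "LEA", "JB", "FDIVP",
]
OPCODE_SET = set(OPCODES)

# Assumed line layout: "<section>:<hex address>   <mnemonic> <operands>".
LINE_RE = re.compile(r"^\.?\w+:[0-9A-Fa-f]+\s+([A-Za-z]+)\b")


def count_opcodes(asm_text: str) -> Counter:
    """Count known opcodes in a disassembly listing, ignoring comments."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0].strip()  # drop trailing comment
        match = LINE_RE.match(code)
        if match:
            mnemonic = match.group(1).upper()
            if mnemonic in OPCODE_SET:
                counts[mnemonic] += 1
    return counts


def to_feature_vector(counts: Counter) -> list:
    """Arrange counts in the fixed OPCODES order; absent opcodes count as 0."""
    return [counts[op] for op in OPCODES]


# Hypothetical listing fragment used only to exercise the functions.
sample_asm = """
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 xor     eax, eax
.text:00401005                 mov     eax, [ebp+8] ; first argument
.text:00401008                 pop     ebp
.text:00401009                 retn
.data:00402000 aHello          db 'hello',0
"""

counts = count_opcodes(sample_asm)
vector = to_feature_vector(counts)
print(dict(counts))
print(vector)
```

Using a fixed opcode order means every sample yields a vector of the same length, so the vectors of many samples can be stacked into one feature matrix.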

The contributions of this work are as follows:

  1. Performing static analysis of malware and benign samples using the IDA Pro disassembler to generate an output file containing the assembly language form of the input file.

  2. Applying the opcode extract and count (OPEC) algorithm to the output file (.asm) to create an opcode count feature vector.

  3. Applying the Extra Tree Classifier method to the feature vector to select relevant features for the experiment.

  4. Implementing machine learning algorithms on the dataset and comparing their results.
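The feature selection and classification steps listed above might be sketched with scikit-learn as follows. This is an assumption-laden illustration, not the authors' implementation: the data are synthetic random counts rather than real opcode counts, `GaussianNB` is one possible choice of naive Bayes variant, and all hyperparameters are library defaults or arbitrary values.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for an opcode-count matrix: 400 samples x 25 opcodes.
X = rng.poisson(lam=20, size=(400, 25)).astype(float)
y = rng.integers(0, 2, size=400)  # 0 = benign, 1 = malware (synthetic labels)
# Make the first three columns informative by raising their counts for class 1.
X[y == 1, :3] += 15

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Keep features whose extra-trees importance is at least the mean importance
# (the default SelectFromModel threshold for importance-based estimators).
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

models = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
}

# Train each classifier on the selected features and record test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    scores[name] = model.score(X_test_sel, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

The selector is fitted on the training split only, so the held-out samples do not influence which features are kept. Accuracies on this synthetic data say nothing about the results reported in the article.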
