Introduction
Malware is a malicious program designed to damage sensitive information stored on computers or mobile devices. It is a generic name for different kinds of malicious software such as viruses, ransomware, rootkits, botnets, trojans, worms, adware and so on. These computer pollutants enter a system and exploit vulnerabilities of the operating system to execute unintended code (Behera & Bhaskari, 2017). Malware writers design malware not only for fame but also for financial gain. Many anti-malware or malware detection systems have been developed so far, but the need for a more efficient detection strategy continues to motivate researchers in this domain.
Signature based detection systems may fail to detect unknown and zero-day attacks because they rely on databases of known signatures. Although signature based detection is good at identifying well-known malicious code, it may not be suited to detecting obfuscated code and previously unknown malware. The situation is even more critical in the case of metamorphic and polymorphic malware, which are created to evade detection engines. Malware creators usually produce such malware by inserting different code with similar functionality. Metamorphic and polymorphic malware share the same purpose but use different obfuscation and propagation techniques. Packing, encryption and compression are the most common obfuscation techniques used to change the appearance of malware so as to evade detection engines. Therefore, there is a need for a detection model that can use the code (in particular, the opcodes) of an executable as features to classify it as malware or benignware.
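The weakness of signature matching described above can be illustrated with a minimal sketch. It assumes the simplest possible signature, a SHA-256 hash of the whole file; real detection engines typically use richer byte patterns, and the byte strings and the names `SIGNATURE_DB` and `is_known_malware` are hypothetical, chosen only for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for executables (x86 machine code):
# push ebp; mov ebp, esp; xor eax, eax; pop ebp; ret
known_sample = b"\x55\x8b\xec\x33\xc0\x5d\xc3"
# Same behaviour with one NOP (0x90) inserted, as an obfuscator might do.
variant = b"\x55\x8b\xec\x90\x33\xc0\x5d\xc3"

# Assumed signature database: hashes of previously seen malware.
SIGNATURE_DB = {sha256_of(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its hash is already in the database."""
    return sha256_of(data) in SIGNATURE_DB


print(is_known_malware(known_sample))  # known sample is matched
print(is_known_malware(variant))       # trivially modified variant is missed
```

Both byte strings contain almost the same instructions, so features derived from the opcodes would stay similar even though the hash signature no longer matches.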
Malware analysis is an important step in malware research. It is used to understand the structure and behavior of malware and benign samples. Analysis can be done either statically or dynamically. If an executable is analysed without being executed, the process is called static analysis. Abimannan & Kumaravelu (2019) have presented a detailed mathematical description of heuristic based static malware analysis. Dynamic analysis is performed by executing the file in a safe and controlled environment to understand its behavior. The two methods can also be combined to extract the best features from the samples for classification (Vidyarthi et al., 2017). In this article, the operation codes of the samples are extracted using static analysis.
To solve a classification problem, machine learning methods are applied to a feature set to train a model and then test it. These algorithms learn the patterns present in the training set and look for the same patterns in the test dataset to classify the inputs as either malware or benign (Shabtai et al., 2012). This article proposes an opcode based malware detection model using machine learning classification algorithms.
The features (opcodes) for this experiment are extracted by disassembling the sample programs using the IDA Pro tool. Disassembling a program yields the assembly language form of its executable instructions, which comprises many operation codes (opcodes). An opcode is the part of an instruction that states the action to be performed. Executable files (malware or benign) usually contain opcodes such as SUB, ADD, AND, OR, XOR, INC, DEC, MOV, MOVZX, CALL, TEST, SBB, IMUL, CMP, RETN, PUSH, PUSHF, POP, NOP, JZ, JNZ, JMP, LEA, JB, FDIVP, etc. They disclose important differences between malicious and legitimate programs (Bilar, 2007). Therefore, operation codes can be used as features in malware classification.
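To make the notion of an opcode concrete, the following minimal sketch separates a single assembly instruction into its opcode and its operands. The function name `split_instruction` is hypothetical, and the sketch assumes Intel-style syntax with `;` starting a comment; an instruction prefix such as `rep` would be returned as the opcode, which a fuller parser would need to handle.

```python
from typing import Optional, Tuple


def split_instruction(instruction: str) -> Optional[Tuple[str, str]]:
    """Split one assembly instruction into (opcode, operands).

    Returns None for lines that contain no instruction text.
    """
    # Drop a trailing comment, assuming ';' introduces comments.
    text = instruction.split(";", 1)[0].strip()
    if not text:
        return None
    # The opcode (mnemonic) is the first whitespace-delimited token.
    parts = text.split(None, 1)
    opcode = parts[0].upper()
    operands = parts[1].strip() if len(parts) > 1 else ""
    return opcode, operands


print(split_instruction("mov eax, [ebp+8]"))
print(split_instruction("xor eax, eax ; clear register"))
print(split_instruction("retn"))
```

Only the first element of each returned pair is used as a feature in an opcode based model; the operands are discarded.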
The contributions of this work are:
1. Performing static analysis of malware and benign samples using the IDA Pro disassembler to generate an output file containing the assembly language form of the input file.
2. Applying the opcode extract and count (OPEC) algorithm to the output file (.asm) to create an opcode count feature vector.
3. Applying the Extra Tree Classifier method to the feature vector to select relevant features for the experiment.
4. Implementing machine learning algorithms on the dataset and comparing their results.
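The second contribution, turning a disassembly listing into an opcode count feature vector, can be sketched roughly as follows. This is not the OPEC algorithm itself, whose details are not given in this section; it is an illustrative approximation that assumes listing lines of the form `.text:00401000    push    ebp` and uses the opcodes named in the introduction as a fixed vocabulary. The names `OPCODES`, `count_opcodes`, `feature_vector` and the sample listing are hypothetical.

```python
import re
from collections import Counter

# Fixed vocabulary: the opcodes listed in the introduction.
OPCODES = [
    "SUB", "ADD", "AND", "OR", "XOR", "INC", "DEC", "MOV", "MOVZX", "CALL",
    "TEST", "SBB", "IMUL", "CMP", "RETN", "PUSH", "PUSHF", "POP", "NOP",
    "JZ", "JNZ", "JMP", "LEA", "JB", "FDIVP",
]

# Assumed "segment:address" prefix at the start of each listing line.
ADDRESS_PREFIX = re.compile(r"^\s*\.?\w+:[0-9A-Fa-f]+\s*")


def count_opcodes(asm_text: str) -> Counter:
    """Count occurrences of known opcodes in a disassembly listing."""
    known = set(OPCODES)
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0]          # strip trailing comment
        code = ADDRESS_PREFIX.sub("", code)   # strip segment:address prefix
        tokens = code.split()
        # Labels and directives are skipped: first token is not a known opcode.
        if tokens and tokens[0].upper() in known:
            counts[tokens[0].upper()] += 1
    return counts


def feature_vector(asm_text: str) -> list:
    """Return counts in the fixed order of OPCODES (zeros for absent opcodes)."""
    counts = count_opcodes(asm_text)
    return [counts[op] for op in OPCODES]


# Hypothetical listing fragment used only to exercise the sketch.
sample_asm = """\
.text:00401000 sub_401000      proc near
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 xor     eax, eax        ; clear return value
.text:00401005                 pop     ebp
.text:00401006                 retn
.text:00401006 sub_401000      endp
"""

vector = feature_vector(sample_asm)
print(dict(zip(OPCODES, vector)))
```

Because every sample is mapped onto the same ordered vocabulary, the resulting vectors have equal length and can be stacked into a matrix for the feature selection and classification steps.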