Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process

Georgia Frantzeskou, Stephen G. MacDonell, Efstathios Stamatatos
DOI: 10.4018/978-1-60566-836-9.ch020


Source code authorship identification has become a major concern in a wide variety of situations. These include authorship disputes, proof of authorship in court, cyber attacks in the form of viruses, Trojan horses, and logic bombs, as well as fraud and credit card cloning. Source code author identification is the task of identifying the most likely author of a computer program, given a set of predefined author candidates. We present a new approach, called the SCAP (Source Code Author Profiles) approach, which uses byte-level n-grams to represent a source code author’s style. Experiments on data sets in different programming languages (Java, C++, and Common Lisp) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cybercrime cases.
Chapter Preview

Section 1. Introduction

Statement of the Problem

With the increasingly pervasive nature of software systems, cases arise in which it is important to identify the author of a usually limited piece of programming code. Such situations include cyber attacks in the form of viruses, Trojan horses and logic bombs, fraud and credit card cloning, code authorship disputes, and intellectual property infringement.

Why do we believe it is possible to identify the author of a computer program? Humans are creatures of habit, and habits tend to persist. That is why, for example, our handwriting style stays consistent over long periods of our lives, although it may vary as we grow older. Does the same apply to programming? Could we identify programming constructs that a programmer uses all the time? Spafford and Weber (1993) suggested that a field they called software forensics could be used to examine and analyze software in any form, be it source code in any language or executable programs, to identify the author. Spafford and Weber wrote the following of software forensics:

“It would be similar to the use of handwriting analysis by law enforcement officials to identify the authors of documents involved in crimes or to provide confirmation of the role of a suspect”

The closest parallel is found in computational linguistics. Authorship analysis of natural language texts, including literary works, has been widely debated for many years, and a large body of knowledge has been developed. Authorship analysis of computer software, however, differs from authorship analysis of natural language texts and is more difficult.

Several reasons make this problem difficult. Programmers reuse code, programs are developed by teams of programmers, and programs can be altered by code formatters and pretty printers.

Identifying the authorship of malicious or stolen source code in a reliable way has become a common goal for digital investigators. Spafford and Weber (1993) have suggested that it might be feasible to analyze the remnants of software after a computer attack, through means such as viruses, worms or Trojan horses, and identify its author through characteristics of executable code and source code. Zheng et al. (2003) proposed the adoption of an authorship analysis framework in the context of cybercrime investigation to help law enforcement agencies deal with the identity tracing problem.

Researchers addressing the issue of code authorship (Krsul and Spafford, 1995; MacDonell et al., 2001; Ding and Samadzadeh, 2004) have tended to adopt a methodology comprising two main steps (Frantzeskou et al., 2004). The first step is the extraction of apparently relevant software metrics. The second step is the use of these metrics to develop models capable of discriminating between several authors, using a statistical or machine learning algorithm. In general, the software metrics used are programming language dependent. Moreover, the metrics selection process is a nontrivial task.

With this in mind, our objective in this chapter is to provide a language-independent methodology for source code authorship attribution, called the SCAP (Source Code Author Profiles) approach (Frantzeskou et al., 2008; Frantzeskou et al., 2007). The effectiveness of the SCAP method is also demonstrated through a number of experiments (Frantzeskou et al., 2006a; Frantzeskou et al., 2006b; Frantzeskou et al., 2005a; Frantzeskou et al., 2005b).

Key Terms in this Chapter

Programming Language: An artificial language that can be used to control the behavior of a machine, particularly a computer. Programming languages, like human languages, are defined through the use of syntactic and semantic rules, to determine structure and meaning respectively. Programming languages are used to facilitate communication about the task of organizing and manipulating information, and to express algorithms precisely (Abelson and Sussman

Program: A collection of instructions that describes a task, or set of tasks, to be carried out by a computer. More formally, it can be described as an expression of a computational method written in a programming language (Knuth 1997).

Authorship: Defined as “the state of being an author.” As in literature, a particular work can have multiple authors. Furthermore, some of these authors can take an existing work and add to it, evolving the original creation.

Author: Defined by Webster (Merriam-Webster 1992) as “one that writes or composes a literary work,” or as “one who originates or creates.” In the context of software development, the author or programmer is someone who originates or creates a piece of software.

RD (Relative Distance): The dissimilarity measure used by Keselj et al. (2003) in text authorship attribution:

$$RD = \sum_{n \in \text{profile}} \left( \frac{2\,\bigl(f_1(n) - f_2(n)\bigr)}{f_1(n) + f_2(n)} \right)^{2} \qquad (1)$$

where $f_1(n)$ and $f_2(n)$ are either the normalized frequencies of an n-gram $n$ in the two compared texts or 0 if the n-gram does not exist in the text(s).
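A minimal Python sketch of the relative distance may help make the definition concrete. It assumes the form reported by Keselj et al. (2003): for every n-gram found in either text, take the difference of the two normalized frequencies, divide by their mean, square, and sum. The function names `normalized_ngram_freqs` and `relative_distance` are illustrative choices, and the sketch uses every n-gram in each text rather than a truncated profile.

```python
from collections import Counter


def normalized_ngram_freqs(text: bytes, n: int) -> dict:
    """Map each byte-level n-gram to its count divided by the total n-gram count."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def relative_distance(f1: dict, f2: dict) -> float:
    """Sum of squared relative frequency differences over n-grams in either text.

    Assumption: this follows the Keselj et al. (2003) form of the measure.
    """
    distance = 0.0
    for gram in set(f1) | set(f2):
        a = f1.get(gram, 0.0)  # 0 if the n-gram does not occur in text 1
        b = f2.get(gram, 0.0)  # 0 if the n-gram does not occur in text 2
        # a + b > 0 because the n-gram occurs in at least one of the texts.
        distance += (2.0 * (a - b) / (a + b)) ** 2
    return distance
```

Under this form, an n-gram present in only one of the two texts always contributes 4 to the sum, and identical frequency tables give a distance of 0.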

N-Gram: A contiguous sequence of n items, which can be defined at the byte, character, or word level.

Authorship Analysis: The application of the study of linguistic style, usually to written language, often used to attribute authorship to anonymous or disputed documents.

Simple Profile Intersection (SPI): Letting $SP_A$ and $SP_T$ be the simplified profiles of one known author and the test or disputed program, respectively, the distance measure SPI is given by the size of the intersection of the two profiles: $SPI = |SP_A \cap SP_T|$.
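The SPI definition can be sketched in a few lines of Python by treating each simplified profile as a set of n-grams. The names `spi` and `most_likely_author` are hypothetical, and the rule of picking the candidate with the largest intersection is an assumption made for illustration, consistent with the stated goal of finding the most likely author.

```python
def spi(sp_author: set, sp_test: set) -> int:
    """Size of the intersection of two simplified profiles."""
    return len(sp_author & sp_test)


def most_likely_author(author_profiles: dict, sp_test: set) -> str:
    """Return the candidate whose profile shares the most n-grams with the test profile.

    Assumption: ties are resolved in favor of the first candidate in the dict.
    """
    return max(author_profiles, key=lambda name: spi(author_profiles[name], sp_test))


# Toy profiles made of byte-level 3-grams (illustrative data only).
profiles = {
    "author_a": {b"for", b"int", b"i++"},
    "author_b": {b"def", b"int"},
}
disputed = {b"for", b"i++", b"xyz"}
print(most_likely_author(profiles, disputed))  # author_a
```

Because only set membership matters, this measure ignores how frequent each n-gram is once it has made it into a profile.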

Byte-Level N-Grams: Raw character n-grams, without any pre-processing. For example, the word sequence “In the” is composed of the following byte-level n-grams (the character “_” stands for a space): bi-grams In, n_, _t, th, he; tri-grams In_, n_t, _th, the; 4-grams In_t, n_th, _the; 5-grams In_th, n_the; 6-gram In_the. N-grams have long been used successfully in a wide variety of problems and domains, including information retrieval (Heer, 1974), detection of typographical errors (Morris and Cherry, 1975), automatic text categorization (Cavnar and Trenkle, 1994), music representation (Downie, 1999), computational immunology (Marceau, 2000), analysis of whole genome protein sequences (Ganapathiraju et al., 2002), and protein classification (Cheng et al., 2005).
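The “In the” example can be reproduced with a short Python sketch. The function name `byte_ngrams` and the choice of UTF-8 encoding are assumptions made for illustration; the technique itself is simply a sliding window of n bytes over the raw text.

```python
def byte_ngrams(text: str, n: int) -> list:
    """Return every contiguous run of n bytes in the text, in order of appearance."""
    data = text.encode("utf-8")  # assumed encoding; no other pre-processing is applied
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# Show spaces as "_" to match the notation used in the definition.
trigrams = [g.decode("utf-8").replace(" ", "_") for g in byte_ngrams("In the", 3)]
print(trigrams)  # ['In_', 'n_t', '_th', 'the']
```

A text of m bytes yields m - n + 1 n-grams, which is why the six-byte example has five bi-grams and a single 6-gram.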

Simplified Profile (SP) of Size L: In the SCAP methodology, the SP is defined as the L most frequent n-grams found for a specific programmer.
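As a rough illustration, a simplified profile can be built by counting byte-level n-grams across a programmer's known programs and keeping the L most frequent. The function name `simplified_profile`, the choice to count each program separately so that n-grams never span two files, and the tie-breaking behavior inherited from `Counter.most_common` are all assumptions of this sketch, not details taken from the chapter.

```python
from collections import Counter
from typing import Iterable


def simplified_profile(sources: Iterable[bytes], n: int, size: int) -> set:
    """Return the `size` most frequent byte-level n-grams across an author's programs."""
    counts = Counter()
    for source in sources:
        # Slide a window of n bytes over each program and tally every n-gram.
        counts.update(source[i:i + n] for i in range(len(source) - n + 1))
    # Assumption: ties at the cutoff are broken by first occurrence.
    return {gram for gram, _ in counts.most_common(size)}


# Toy usage: two tiny "programs" by the same hypothetical author.
profile = simplified_profile([b"i++; j++;", b"k++;"], n=3, size=2)
```

The profile size L and the n-gram length n are the two parameters to choose; the chapter preview does not state recommended values.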

Source Code Authorship Analysis: The process of examining the characteristics of a piece of code in order to draw conclusions on its authorship (Abbasi and Chen 2005).

Language-Independent Methodology: An approach that is not based on metrics specific to a particular language. The low-level information used for classification can be applied to any language.
