InfoScipedia
A Free Service of IGI Global Publishing House

What Are Byte-Level N-Grams?

Handbook of Research on Computational Forensics, Digital Crime, and Investigation: Methods and Solutions
Raw character n-grams, extracted without any pre-processing. For example, the word sequence “In the” is composed of the following byte-level n-grams, where the character “_” stands for a space: bi-grams (In, n_, _t, th, he); tri-grams (In_, n_t, _th, the); 4-grams (In_t, n_th, _the); 5-grams (In_th, n_the); and the 6-gram (In_the). N-grams have long been used successfully in a wide variety of problems and domains, including information retrieval (Heer, 1974), detection of typographical errors (Morris and Cherry, 1975), automatic text categorization (Cavnar and Trenkle, 1994), music representation (Downie, 1999), computational immunology (Marceau, 2000), analysis of whole-genome protein sequences (Ganapathiraju et al., 2002), and protein classification (Cheng et al., 2005).
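The sliding-window extraction described in this definition can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `byte_ngrams` and `show` are hypothetical, UTF-8 is an assumed encoding, and the “_” rendering simply mirrors the notation used in the definition.

```python
from typing import List


def byte_ngrams(text: str, n: int, encoding: str = "utf-8") -> List[bytes]:
    """Return every overlapping n-gram of the raw bytes of `text`, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # No lowercasing, stripping, or tokenizing: the bytes are used as-is.
    data = text.encode(encoding)
    # A window of width n slides one byte at a time; if n exceeds the
    # input length the range is empty and no n-grams are produced.
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def show(grams: List[bytes]) -> List[str]:
    """Render ASCII n-grams for display, writing '_' for a space."""
    return [g.decode("ascii").replace(" ", "_") for g in grams]


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, show(byte_ngrams("In the", n)))
```

For the input “In the”, this should reproduce the bi-grams through the 6-gram listed in the definition. The `show` helper assumes ASCII input and exists only for display; the n-grams themselves are plain `bytes` objects.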
Published in Chapter:
Source Code Authorship Analysis For Supporting the Cybercrime Investigation Process
Georgia Frantzeskou (University of the Aegean, Greece), Stephen G. MacDonell (Auckland University of Technology, New Zealand), and Efstathios Stamatatos (University of the Aegean, Greece)
DOI: 10.4018/978-1-60566-836-9.ch020
Abstract
Source code authorship identification has become a major concern in a wide variety of situations. These include authorship disputes, proof of authorship in court, and cyber attacks in the form of viruses, Trojan horses, logic bombs, fraud, and credit card cloning. Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. We present a new approach, called the SCAP (Source Code Author Profiles) approach, which uses byte-level n-grams to represent a source code author’s style. Experiments on data sets in different programming languages (Java, C++, and Common Lisp) and of varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
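The abstract does not say how author profiles are built or compared, so the following is only a minimal sketch under stated assumptions, not the chapter's implementation. It assumes a profile is the set of the most frequent byte-level n-grams in an author's known code, and that an unknown program is attributed to the candidate whose profile shares the most n-grams with it. The function names, the toy samples, and the default values of `n` and `size` are all hypothetical.

```python
from collections import Counter


def profile(source: bytes, n: int, size: int) -> set:
    """Assumed profile: the `size` most frequent byte-level n-grams."""
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def most_likely_author(unknown: bytes, samples: dict,
                       n: int = 3, size: int = 1000) -> str:
    """Return the candidate whose profile overlaps most with `unknown`.

    `samples` maps each candidate author to the concatenated bytes of
    that author's known programs. The score is the number of n-grams
    the two profiles have in common (an assumed similarity measure).
    """
    target = profile(unknown, n, size)
    scores = {
        author: len(target & profile(code, n, size))
        for author, code in samples.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data: the two "authors" differ mainly in spacing and layout.
    samples = {
        "alice": b"for (int i = 0; i < n; i++) {\n    total += values[i];\n}\n",
        "bob": b"for(int idx=0;idx<n;++idx){total+=values[idx];}\n",
    }
    unknown = b"for (int j = 0; j < m; j++) {\n    sum += data[j];\n}\n"
    print(most_likely_author(unknown, samples))
```

On this toy input the unknown fragment shares far more tri-grams with the spaced, multi-line sample, so the sketch selects "alice". Because it works on raw bytes, the same code applies unchanged to any programming language and does not rely on comments, which is in the spirit of the language independence the abstract claims.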