Authorship Attribution for Online Social Media

Ritu Banga (Jaypee Institute of Information Technology, India), Akanksha Bhardwaj (Jaypee Institute of Information Technology, India), Sheng-Lung Peng (National Dong Hwa University, Taiwan) and Gulshan Shrivastava (National Institute of Technology Patna, India)
Copyright: © 2018 | Pages: 25
DOI: 10.4018/978-1-5225-5097-6.ch008

Abstract

This chapter gives a comprehensive overview of machine learning classifiers for authorship attribution (AA) on short texts, specifically tweets. The need for authorship identification stems from increasing crime on the internet, which breaches cyber ethics by raising the level of anonymity. AA of online messages has attracted interest from many research communities. Linguists and researchers have proposed many methods, both statistical and computational, to identify an author from their writing style. Various ways of extracting and selecting features on the basis of the dataset are reviewed. The authors focus on n-gram features, as these proved very effective in identifying the true author from a given list of known authors. The study demonstrates that AA is achievable on short texts, depending on the selection of features and methods, and that the accuracy of the analysis changes with the combination of features. The authors found that character n-grams are good features for identifying the author but cannot yet identify the author on their own.

Introduction

Authorship attribution is the task of identifying the author of a given text or document by inferring the author's distinctive traits from the characteristics of documents written by that author (Juola, 2008). Documents are examined for stylistic details concealed in the text, from which conclusions about the features of the author are drawn. It is useful in cases of repudiation, where no person is willing to state whether a given text was written by him/her. The field has benefited from research already pursued in areas such as machine learning, natural language processing and information retrieval. Developments in these areas influence authorship attribution technology in several ways, as described below:

  • Powerful machine learning techniques now make it practical to handle high-dimensional, sparse data, which permits more expressive representations.

  • Efficient technologies such as information retrieval have been developed for representing and classifying large amounts of data or messages.

  • Natural Language Processing (NLP) tools are used to analyze data or documents systematically, and they provide new kinds of measures for characterizing style.

Authorship attribution (AA) as a scholarly discipline has maintained a special connection to scholarship in the digital humanities, where the authentication of disputed scholarly articles is of interest. For that reason, much of the work in this field remains unfamiliar to researchers and professionals in digital forensics at large, in spite of efforts at outreach.

Authorship Analysis has four principal aspects:

  1. Authorship Attribution (or Identification): Determine which author, among a list of known authors, wrote a particular document.

  2. Authorship Verification: Determine whether or not the target document was written by a particular author.

  3. Authorship Profiling (or Characterization): Determine the demographic characteristics of the author (such as age, gender, personality) from an anonymous document.

  4. Similarity Detection: Compare multiple pieces of writing and determine whether they were produced by a single author. Most research on this aspect relates to plagiarism detection. Plagiarism is the full or partial replication of a piece of writing without the owner's permission. Plagiarism detection investigates the similarity between two pieces of writing, which is informative because the writing styles of two authors differ in various respects.
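
As a rough illustration of similarity detection, the sketch below compares two pieces of writing by the cosine similarity of their character n-gram counts. This is only one possible similarity measure and is not taken from the chapter; the function names (char_ngrams, cosine_similarity, style_similarity) and the choice of n = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares no style evidence with anything
    return dot / (norm_a * norm_b)


def style_similarity(text_a, text_b, n=3):
    """Hypothetical helper: similarity of two texts in character n-gram space."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

A score near 1 suggests heavily overlapping character patterns and a score near 0 suggests little overlap. Any threshold for declaring two texts to be by the same author would have to be chosen empirically for the dataset at hand.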

The problem of identifying the most likely author from a list of potential suspects, who are treated as classes, is a typical authorship identification and classification problem. A digital document can be used as evidence to prove the guilt of a suspect involved in cybercrime or cyber-bullying. If the suspect or target authors are unknown, we are dealing with an authorship identification problem.

Authorship analysis of online messages proceeds in several steps (Figure 1):

  • Data Gathering: Messages are collected from prolific users to build a relevant dataset and obtain better results. The more effective the dataset, the more information can be extracted from it.

  • Preprocessing the Data: The raw data is molded into an appropriate format. Data is cleaned using natural language processing techniques to make it more useful. Stylistic features are extracted from individual writing styles to generate feature vectors, which play a key part in tracing the identity of an author.

  • Training the Model: Models are trained by applying appropriate methods to the feature sets. Statistical, computational and machine learning methods can all be applied to a dataset for authorship identification.

  • Authorship Identification: Identify the anonymous author, i.e. the most likely author of a given anonymous tweet from the list of known authors, by extracting features from the tweet and validating them against the trained model.
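
The four steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's method: the tweets and author names in TOY_TWEETS are invented, the cleaning rules in preprocess are one plausible choice, and the "model" is a simple per-author character n-gram profile matched by cosine similarity rather than any of the classifiers the chapter evaluates.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1, data gathering: a hypothetical toy dataset of (author, tweet) pairs.
TOY_TWEETS = [
    ("alice", "omg!!! cant wait 4 the weekend lol"),
    ("alice", "lol this is sooo funny omg"),
    ("bob", "Pleased to announce that our quarterly report is available."),
    ("bob", "We are pleased to share the report with our partners."),
]


def preprocess(tweet):
    """Step 2a: lowercase, then replace URLs and @mentions with placeholders."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    tweet = re.sub(r"@\w+", "MENTION", tweet)
    return re.sub(r"\s+", " ", tweet).strip()


def char_ngrams(text, n=3):
    """Step 2b: feature vector of overlapping character n-gram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_tweets, n=3):
    """Step 3: build one aggregated n-gram profile per known author."""
    profiles = defaultdict(Counter)
    for author, tweet in labelled_tweets:
        profiles[author].update(char_ngrams(preprocess(tweet), n))
    return dict(profiles)


def identify(tweet, profiles, n=3):
    """Step 4: return the known author whose profile is most similar."""
    features = char_ngrams(preprocess(tweet), n)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


profiles = train(TOY_TWEETS)
print(identify("omg lol cant wait!!!", profiles))
```

Keeping preprocessing, feature extraction, training and identification as separate functions mirrors the steps in the list, so a different feature set or a different classifier could be swapped in without touching the other stages.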
