Predicting the Writer's Gender Based on Electronic Discourse

Szde Yu (Wichita State University, Wichita, USA)
Copyright: © 2020 | Pages: 15
DOI: 10.4018/IJCRE.2020010102

Abstract

The present study compared three methods for predicting a writer's gender from writing features manifested in electronic discourse: qualitative content analysis, statistical analysis, and machine learning. These methods were then combined to create a mixed methods model. The findings showed that the machine learning model combined with qualitative content analysis produced the best prediction accuracy. Adding qualitative content analysis improved accuracy rates even when the training set for machine learning was relatively small. Thus, this study presents a concise model that can predict gender from electronic discourse with fairly reliable, high accuracy rates. This accuracy was found consistently when the model was tested on two separate samples.

Introduction

As digital evidence is increasingly involved in all types of crime, digital investigation is no longer a topic related only to cybercrime. Digital investigation is warranted when the subject’s presence and behavior in a digital environment may reveal crucial clues. Such presence and behavior are normally referred to as digital footprints. Digital footprints can include a wide variety of digital files, such as photos, computer logs, and videos. Nonetheless, text is still the most commonly encountered form of digital footprint, as it appears in many forms of electronic discourse, including emails, text messages, social media comments, online discussions, blogs, and online advertisements (Doane et al., 2016; Hill et al., 2014; Koivunen et al., 2014). As such, many research endeavors have been devoted to the analysis of electronic discourse (Miner et al., 2012; Dunne et al., 2012). The field of text analytics is growing, and its importance in digital investigation is increasingly recognized (Anwar and Abulaish, 2014; Al-Zaidy et al., 2012; Yu, 2015). When criminal investigators need to analyze textual evidence without knowing who the writer is, the task is essentially text-based profiling: investigators try to predict the characteristics of a person based solely on his or her electronic discourse. This concept is not new, and many people already use what they see on the Internet to make assumptions or inferences about a person they have never met. For instance, some researchers have tried to predict mental illness from a person’s online text, such as tweets (Preotiuc-Pietro, 2015). While Internet users are not usually held responsible for inaccurate predictions, in criminal investigation the accuracy of a prediction matters and often bears critical consequences.

Ideally, investigators could identify an unknown writer by analyzing his or her writing alone, but this is not yet a reliable technique. However, some research has found that it is possible to predict a person’s general characteristics, such as gender, age, and education, based on nothing but digital footprints (Steel, 2014; Yu, 2013). Whether electronic discourse alone, without other types of digital footprints, is sufficient for this purpose has not yet been confirmed (Nguyen et al., 2014; Merler et al., 2015). The current trend focuses mainly on big data analysis, whereas in criminal investigations the text files available for analysis are generally limited in both quantity and content. A technique that criminal investigators can apply when dealing with text-based evidence is severely lacking.

Accordingly, this study aimed to test the ability to predict gender using nothing but the subject’s writing on an electronic platform (i.e., electronic discourse). Gender was chosen for prediction in this exploratory attempt because it is a personal trait that is relatively easy to verify. In criminal investigations, the ability to predict gender correctly is a substantial step toward narrowing down suspects. The goal of this study was to test different methods for this purpose. Two research questions were asked. First, can we predict the writer’s gender based solely on writing features in electronic discourse? Second, which method produces better accuracy? Building on the findings for these questions, an additional inquiry was conducted to see whether a mixed methods approach could further improve accuracy. The methods compared here were qualitative content analysis, a logistic regression model, and machine learning. It is important to stress that this study was not intended to create a new artificial intelligence technique that handles big data. Rather, the focus is on identifying the technique most applicable to criminal investigations, where investigators do not normally need to handle big data.
