Domain-Specific Search Engines for Investigating Human Trafficking and Other Illicit Activities

Mayank Kejriwal (University of Southern California, USA)
Copyright: © 2020 | Pages: 19
DOI: 10.4018/978-1-5225-9715-5.ch033

Abstract

Web advertising related to human trafficking (HT) activity has been on the rise in recent years. Question answering over crawled sex advertisements to assist investigators in the real world is an important social problem involving many technical challenges. This article will describe the problem of domain-specific search (DSS), a specific set of technologies that can address these challenges, drawing on cutting-edge techniques developed over three years of DARPA-funded research conducted collaboratively in academic (e.g., the University of Southern California's Information Sciences Institute), government (e.g., NASA's Jet Propulsion Laboratory), and industrial (e.g., Uncharted) settings to assist analysts and investigative experts in the HT domain. Specifically, the article will describe the background, main principles, and challenges of building a DSS in illicit domains, as well as the general architecture of a DSS designed for illicit sex advertisements on the open web. The article will conclude with the scope of future research in this area.

Introduction

Web advertising related to Human Trafficking (HT) activity has been on the rise in recent years (Szekely et al., 2015). Question answering over crawled sex advertisements to assist investigators in the real world is an important social problem that involves many technical challenges (Kejriwal & Szekely, 2017c). This article will describe the problem of domain-specific search (DSS), a specific set of technologies that can address these challenges. Modern DSS systems for investigative activities draw on cutting-edge techniques developed over three years of DARPA-funded research conducted collaboratively in academic (e.g., the University of Southern California’s Information Sciences Institute), government (e.g., NASA’s Jet Propulsion Laboratory), and industrial (e.g., Uncharted) settings. Evidence from the HT domain shows that these systems can be of real value to analysts and investigative experts.

In illicit domains such as HT, as well as others like securities fraud and narcotics, domain-specific search involves a form of Information Retrieval (IR) that takes as input a large domain-specific corpus of pages crawled from the Web. The system allows investigators to satisfy their information needs by posing sophisticated queries to a special-purpose engine. A workflow of this process is shown in Figure 1. Since investigators are largely non-technical, they must be able to issue such queries to (and receive responses from) intuitive, graphical interfaces. A fully functional DSS engine must also have some notion of semantics, since sophisticated queries go beyond keyword specification: investigative queries are more like real-world questions that require complex operations such as aggregations (e.g., find me all email addresses linked to the phone 123-456).
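To make the aggregation example concrete, the following is a minimal sketch of how such a question could be answered once attributes have been extracted from the ads. The triple layout, the ad identifiers, the email addresses, and the function names are all assumptions made for illustration; they are not taken from any particular DSS.

```python
from collections import defaultdict

# Hypothetical extractions: (ad_id, attribute, value) triples from crawled ads.
TRIPLES = [
    ("ad1", "phone", "123-456"),
    ("ad1", "email", "a@example.com"),
    ("ad2", "phone", "123-456"),
    ("ad2", "email", "b@example.com"),
    ("ad3", "phone", "555-000"),
    ("ad3", "email", "c@example.com"),
]

def build_index(triples):
    """Map each (attribute, value) pair to the set of ads that mention it."""
    index = defaultdict(set)
    for ad_id, attribute, value in triples:
        index[(attribute, value)].add(ad_id)
    return index

def emails_linked_to_phone(triples, phone):
    """Return the sorted emails that co-occur with `phone` in at least one ad."""
    index = build_index(triples)
    ads = index.get(("phone", phone), set())
    return sorted({value for ad_id, attribute, value in triples
                   if attribute == "email" and ad_id in ads})

print(emails_linked_to_phone(TRIPLES, "123-456"))
# ['a@example.com', 'b@example.com']
```

A keyword search for "123-456" would only return the matching pages; the aggregation step is what collects the linked email addresses across those pages.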

Figure 1. A procedural workflow of domain-specific search from the point of view of an investigative user, using the domain-specific insight graph (DIG) DSS for example interfaces

A viable solution to the problem has to allow the user to pose queries both intuitively and interactively.

For such a DSS to operate semi-automatically and be useful in the real world, several challenges must be addressed and several desiderata fulfilled. Possibly the most important of these is handling the unusual nature of an illicit domain, since the investigators who use the system have special needs. To understand why this can be challenging, consider the recent advent of technologies like neural networks and deep learning. Pre-trained tools such as word embeddings and Named Entity Recognizers from the natural language processing community have also been released for public use (Pennington, Socher & Manning, 2014). However, many of these tools have been trained on datasets and corpora that are fairly ‘regular’, i.e., that consist of relatively well-structured text (like news corpora and Wikipedia articles). Consequently, they are not necessarily suitable for language or data acquired in illicit domains. Table 1 illustrates some examples of real text scraped from sex advertisement webpages (but with identifying phone numbers appropriately modified). Acquiring and labeling data from such domains is both expensive and sensitive, and is not easily amenable to crowdsourcing (Kejriwal & Szekely, 2017a). A purely machine learning-based approach is therefore not feasible.

Table 1. Example fragments of text extracted from real-world illicit sex advertisements. Note that identifying information has been replaced. Information that is potentially useful to investigators and/or to a semantics-aware domain-specific search engine is highlighted in bold.
Italian 19 hello guys…My name is charlotte, New to town from kansas
[ GORGOUS BLONDE beauty] ? FROM Florida ? (Petite) ? [ CURVy ]?
NO DISAPPOINTMENTS. 34C..Brazilian,ITALIAN beauty…
Hey gentleman im Newyork and i’m looking for generous
Hi guy’s this is sexy newyork .& ready to party.
AVAILABLE NOW! ?? – (1 two 1) six 5 six – 0 9 one 2-21
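The last fragment in Table 1 mixes digits with spelled-out number words, which is one reason off-the-shelf extractors can miss such phone numbers. The following is a minimal sketch of one possible normalization step, assuming the span containing the number has already been located; the function name and word list are illustrative and this is not the extraction method used by the systems described in this article.

```python
import re

# Spelled-out digits that the sketch recognizes (an illustrative, incomplete list).
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# A token is either a run of letters or a single ASCII digit.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]")

def digits_from_obfuscated(text):
    """Collect digits and spelled-out digit words in order; ignore everything else."""
    out = []
    for token in TOKEN.findall(text):
        if token.isdigit():
            out.append(token)
        else:
            digit = WORD_TO_DIGIT.get(token.lower())
            if digit is not None:
                out.append(digit)
    return "".join(out)

fragment = "AVAILABLE NOW! ?? – (1 two 1) six 5 six – 0 9 one 2-21"
print(digits_from_obfuscated(fragment))
# 121656091221
```

Applied to a whole advertisement, this naive approach would also pick up ages, measurements, and prices, so a practical extractor would need additional context to decide which digit runs are phone numbers.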

Key Terms in this Chapter

Query Reformulation: Query reformulation refers to a set of techniques wherein a query (in some domain-specific language like SPARQL or SQL) that is originally posed against a DSS engine is reformulated into a set of queries (in the same or different language) to increase query retrieval performance. Query reformulation is a useful technique both when the underlying KG is noisy and when the original query does not fully express (or over-conditions) user intent.
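As an illustration of one simple reformulation strategy, constraint relaxation, the sketch below rewrites a conjunctive query into progressively less strict variants and ranks results by the strictest variant they satisfy. The record layout, field names, and values are hypothetical, and real DSS engines may use quite different reformulation strategies.

```python
from itertools import combinations

# Hypothetical records standing in for a noisy KG; ad2 is missing its city.
ADS = [
    {"id": "ad1", "name": "charlotte", "city": "miami", "phone": "555-0100"},
    {"id": "ad2", "name": "charlotte", "phone": "555-0100"},
    {"id": "ad3", "name": "anna", "city": "miami", "phone": "555-0199"},
]

def reformulate(query):
    """Yield the original query, then relaxed copies with fewer constraints."""
    keys = sorted(query)
    for size in range(len(keys), 0, -1):
        for kept in combinations(keys, size):
            yield {k: query[k] for k in kept}

def run(query, ads):
    """Return ids of ads matching every constraint in `query`."""
    return [ad["id"] for ad in ads
            if all(ad.get(k) == v for k, v in query.items())]

def search(query, ads):
    """Rank ads by the size of the strictest reformulation they satisfy."""
    ranked, seen = [], set()
    for relaxed in reformulate(query):
        for ad_id in run(relaxed, ads):
            if ad_id not in seen:
                seen.add(ad_id)
                ranked.append((ad_id, len(relaxed)))
    return ranked

query = {"name": "charlotte", "city": "miami", "phone": "555-0100"}
print(search(query, ADS))
# [('ad1', 3), ('ad2', 2), ('ad3', 1)]
```

The original query alone would return only ad1; the relaxed variants also surface ad2, whose city was never extracted, which is the kind of noise the definition above refers to.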

Knowledge Graph (KG): A knowledge graph (KG) is a directed, labeled multi-relational graph that is used to model and represent semi-structured data to make it more amenable to machine reasoning (‘knowledge’).

Entity Resolution: Entity resolution (ER) is the problem of algorithmically determining when two entities in a KG refer to the same underlying entity. For example, the same entity ‘Barack Obama’ may have been independently extracted from two webpages under names such as ‘President Obama’ and ‘Obama’.
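A minimal sketch of ER over name mentions is given below, assuming a token-overlap similarity and a fixed threshold; both are illustrative choices rather than a recommended method, and the fourth mention is a hypothetical addition to show a non-match.

```python
def tokens(name):
    """Lowercased whitespace tokens of a mention."""
    return set(name.lower().split())

def overlap(a, b):
    """Overlap coefficient: shared tokens divided by the smaller token set's size."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def resolve(mentions, threshold=0.5):
    """Group mentions whose pairwise similarity reaches the threshold (single link)."""
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if overlap(mentions[i], mentions[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())

mentions = ["Barack Obama", "President Obama", "Obama", "Angela Merkel"]
print(resolve(mentions))
# [['Barack Obama', 'President Obama', 'Obama'], ['Angela Merkel']]
```

A similarity this simple would also merge ‘Michelle Obama’ with ‘Obama’, which illustrates why ER generally needs more evidence than surface names alone.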

Ontology: An ontology may be practically defined as a controlled set of terms and constraints for expressing the domain of interest. An ontology can range from a simple set of terms (e.g., {PERSON, LOCATION, ORGANIZATION}) to a taxonomy with concepts and sub-concepts (e.g., ACTOR and ENTREPRENEUR would be sub-concepts of PERSON) to a general graph with equational constraints (e.g., that the domain and range of the relation starred-in are ACTOR and MOVIE, respectively). Any KG that is ontologized in this way should obey such constraints at the instance level.
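The domain and range constraint in this definition can be checked mechanically at the instance level. The sketch below assumes a tiny hand-written ontology and hypothetical instance names; treating relations that the ontology does not declare as violations is a design choice of the sketch, not something the definition requires.

```python
# Hypothetical ontology: a small taxonomy plus domain/range constraints.
SUBCLASS_OF = {"ACTOR": "PERSON", "ENTREPRENEUR": "PERSON"}
RELATIONS = {"starred-in": ("ACTOR", "MOVIE")}

def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or is a descendant of it in the taxonomy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def violations(triples, types):
    """Return triples that use an undeclared relation or break its domain/range."""
    bad = []
    for subject, relation, obj in triples:
        if relation not in RELATIONS:
            bad.append((subject, relation, obj))
            continue
        domain, rng = RELATIONS[relation]
        if not (is_a(types.get(subject), domain) and is_a(types.get(obj), rng)):
            bad.append((subject, relation, obj))
    return bad

types = {"alice": "ACTOR", "bob": "ENTREPRENEUR", "film1": "MOVIE"}
triples = [("alice", "starred-in", "film1"), ("bob", "starred-in", "film1")]
print(violations(triples, types))
# [('bob', 'starred-in', 'film1')]
```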

Investigative Schema: An investigative schema is an ontology expressing an investigative domain of interest, usually involving an illicit activity like sex advertising. The investigative schema is usually simple and shallow, hence the term ‘schema’ and not ‘ontology’.

Information Extraction (IE): Information extraction (IE) is an algorithmic technique that generally accepts either text or raw HTML as input, and outputs a set of ontologically typed instances. For illicit domains, the investigative schema serves as the ontology. Supervised machine learning and deep learning IE methods have emerged as state-of-the-art in recent times.

Domain-Specific Search (DSS): Domain-specific search (DSS) is the problem of building a (rudimentary or advanced) search engine over a domain-specific corpus. Domains of special interest in this article are illicit domains such as human trafficking, over which building such an engine is an especially challenging problem.
