Text Preprocessing: A Tool of Information Visualization and Digital Humanities

Piotr Malak (University of Wroclaw, Poland)
DOI: 10.4018/978-1-5225-4990-1.ch006

Abstract

Digital humanities and information visualization (InfoVis) rely on huge sets of digital data, most of which are delivered as text. Although computational linguistics provides many valuable tools for text processing, the initial phase, text preprocessing, is involved and time-consuming. The problems stem from the human factor: they are not always errors, but also inconsistencies in form that affect data quality. In this chapter, the author describes and discusses the main issues that arise during the preprocessing phase of textual data gathering for InfoVis. Selected examples of InfoVis applications are presented. Besides the problems with raw, original data, solutions to them are also covered. Canonical approaches to text preprocessing, common issues affecting the process, and ways to prevent them are presented, and the quality of data from different sources is discussed. The content of this chapter is the result of several years of practical experience in natural language processing, gained during different projects and evaluation campaigns.

Background

Big Data components and sources, such as databases, Web and Web 2.0 sites, and social media, are ever-growing sources of digital data. Although improving data transfer speeds over the Internet make it possible to present sound, pictures, and video in real time, text remains the most common document type. Text documents are also still the richest source of data to analyze and visualize.

From the perspective of human comprehension, reading a text is a complex process that employs not only reading ability but also other cognitive predispositions. We humans always comprehend text in a context defined by our education, knowledge, and experience. Thus we can readily detect and understand metaphors, as well as the social, political, and other contexts and relations that are only implicit in the text. Thanks to this ability to read latent information, so-called “reading between the lines”, a raw text can be very informative for us but, in contrast, not for computer systems.
