One of the goals of data scientists and curators is to organize and integrate information contained in text so that it can be easily consumed by people and machines. A starting point is a model for representing that information, one that makes it easier to derive knowledge semantically (e.g., using reasoners and inference rules). The Semantic Web represents information through the Resource Description Framework (RDF) model, in which the triple (subject, predicate, object) is the basic unit of information. In this context, natural language processing (NLP) has been a cornerstone in identifying elements of text that can be represented as Semantic Web triples. However, existing approaches for producing RDF triples from text rely on diverse techniques and tasks, which makes the process hard to follow for non-expert users. This chapter discusses the main concepts involved in representing information through the Semantic Web and NLP fields.
Introduction
The Web is a rich data source that can be consumed by humans and applications in diverse areas such as Data Analytics, Information Retrieval, Information Extraction, and Machine Learning, which in turn can benefit Smart Environments, Smart Business, Education and Learning, and the Internet of Things (IoT) as a whole (Liu, Fang, & Ansari, 2016). Diverse elements of information can be obtained from the Web (e.g., product descriptions, relevant information about organizations, profiles of persons, and contextual information) and connected to provide new insights, which can later be represented as knowledge that supports decision making in organizations. Actors from diverse domains (education, healthcare, logistics, tourism, energy, etc.) can benefit from gathering this kind of information to provide digital services and products with added value for consumers. Typically, data published on the Web is represented as plain text for human consumption, with no structure or descriptions that facilitate its comprehension by machines (even when it is presented in a visual format intended for humans, such as HTML). Thus, computers cannot easily process the text to obtain its underlying information and meaning. To get information organized and represented for further consumption by humans or applications, the following aspects should be taken into account:
- a) A representation model. This model should be useful for representing and querying facts about real-world objects and their connections. Moreover, it should allow computers to infer new information from rules and already represented facts, which is a step toward obtaining knowledge.
- b) Feeding the representation model. Once a model is defined, the next task is to represent data following its specifications and rules. However, this task is often infeasible for humans when the data source is huge and constantly growing, as the Web is.
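As a rough illustration of the first aspect, the sketch below stores facts as (subject, predicate, object) tuples and queries them by pattern. It is a minimal sketch in plain Python, not an RDF implementation; the `match` helper and the product and store names are hypothetical and chosen only for illustration.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical facts about real-world objects and their connections.
facts: Set[Triple] = {
    ("Product_42", "type", "Laptop"),
    ("Product_42", "soldBy", "Acme_Store"),
    ("Acme_Store", "locatedIn", "Madrid"),
}


def match(
    triples: Set[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the pattern; None acts as a wildcard."""
    for s, p, o in triples:
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield (s, p, o)


# Everything known about one object.
about_product = sorted(match(facts, subject="Product_42"))

# Every fact that uses a given connection.
located = list(match(facts, predicate="locatedIn"))

print(about_product)
print(located)
```

Writing such facts by hand works for three entries but not for a source the size of the Web, which is the difficulty described in the second aspect.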
Given these two aspects, two fields are directly involved in the modeling and representation of data: the Semantic Web (Berners-Lee, Hendler, & Ora, 2001) and Natural Language Processing (NLP) (Hirschberg & Manning, 2015). The first is an extension of the Web aimed at the representation, integration, sharing, reuse, and connection of information in a format readable by humans and applications. The Semantic Web relies on the Resource Description Framework (RDF), which provides a model for the formal representation of information based on a basic unit, the RDF triple. A set of RDF triples can represent information in the form of a graph (a Knowledge Graph, KG (Ehrlinger & Wöß, 2016)), where nodes represent real-world elements (resources) and edges define the relationships between nodes or their descriptions. NLP, on the other hand, is aimed at preparing and processing text so that computers can handle it. By the nature of its tasks, NLP has become a cornerstone for the representation of data on the Semantic Web, providing methods and techniques (e.g., segmenting text into sentences and words, grammatical analysis, and parsing) for the identification and extraction of two main elements from text: named entities and semantic relations. Named entities are real-world objects that can be assigned to a specific class (e.g., person, organization, location), while semantic relations are the relationships that hold between the identified named entities. Named entities and semantic relations can therefore be represented in the RDF model as nodes and edges (properties), respectively. For example, a KG built from the sentence “LeBron James plays basketball for Los Angeles Lakers. He was born in Ohio” contains named entities (LeBron James, Los Angeles Lakers, Ohio), which are linked by predicates (isA, play, birthplace).
From such relations, additional information that is not explicitly stated in the sentence can be inferred (basketball team, country, marital status, age, hobbies, etc.). The explicit information feeds the KG, while the implicit information enriches it.
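The distinction between explicit and implicit information can be sketched with the example sentence. The code below is a minimal illustration in plain Python rather than an RDF toolkit, and it rests on several assumptions: the class names `BasketballPlayer` and `BasketballTeam`, the background fact that Ohio is located in the United States, the `locatedIn` and `country` predicates, and the two hand-written rules are all hypothetical and not taken from the chapter.

```python
# Explicit triples read off the sentence, using the predicates named in the text.
# Typing LeBron James as a BasketballPlayer is an assumed reading of "plays basketball".
explicit = {
    ("LeBron_James", "isA", "BasketballPlayer"),
    ("LeBron_James", "play", "Los_Angeles_Lakers"),
    ("LeBron_James", "birthplace", "Ohio"),
}

# Assumed background knowledge that the sentence itself does not state.
background = {
    ("Ohio", "locatedIn", "United_States"),
}


def infer(triples):
    """Apply two illustrative rules until no new triple appears.

    Rule 1: ?x play ?t and ?x isA BasketballPlayer  =>  ?t isA BasketballTeam
    Rule 2: ?x birthplace ?p and ?p locatedIn ?c    =>  ?x country ?c
    Returns only the triples that were not in the input.
    """
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            if p == "play" and (s, "isA", "BasketballPlayer") in known:
                new.add((o, "isA", "BasketballTeam"))
            if p == "birthplace":
                for s2, p2, o2 in known:
                    if s2 == o and p2 == "locatedIn":
                        new.add((s, "country", o2))
        if new <= known:  # fixpoint reached: nothing left to add
            return known - set(triples)
        known |= new


implicit = infer(explicit | background)

# Explicit triples feed the graph; inferred triples enrich it.
knowledge_graph = explicit | background | implicit

for triple in sorted(implicit):
    print(triple)
```

Items such as marital status, age, or hobbies would need further background facts or external sources; the two rules here cover only the team and country cases.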