Towards an Objective Assessment Framework for Linked Data Quality: Enriching Dataset Profiles With Quality Indicators

Ahmad Assaf, Aline Senart, Raphaël Troncy
DOI: 10.4018/978-1-5225-5191-1.ch021

Abstract

Ensuring data quality in Linked Open Data is a complex process, as it consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links. In this paper, the authors first propose an objective assessment framework for Linked Data quality. The authors build upon previous efforts that have identified potential quality issues but focus only on objective quality indicators that can be measured regardless of the underlying use case. Secondly, the authors present an extensible quality measurement tool that helps data owners, on the one hand, to rate the quality of their datasets and, on the other hand, data consumers to choose their data sources from a ranked set. The authors evaluate this tool by measuring the quality of the LOD cloud. The results demonstrate that the general state of the datasets needs attention, as they mostly have low completeness, provenance, licensing and comprehensibility quality scores.
Chapter Preview

1. Introduction

In the last few years, the Semantic Web has gained momentum, supported by the introduction of many related initiatives such as Linked Open Data (LOD)1. From 12 datasets cataloged in 2007, the Linked Open Data cloud has grown to nearly 1,000 datasets containing more than 82 billion triples. Data is being published by both the public and private sectors and covers a diverse set of domains, from life sciences to military. This success lies in the cooperation between data publishers and consumers, where users are empowered to easily find, share and combine information in their applications.

We are entering an era where open is the new default. Governments, universities, organizations and even individuals are publicly publishing huge amounts of open data. This openness should be accompanied by a certain level of trust or guarantees about the quality of the data. Linked Open Data is a gold mine for those trying to leverage external data sources in order to produce more informed business decisions (Crawford, 2011).

However, the heterogeneous nature of these sources directly affects data quality, as they often contain inconsistent, misinterpreted and incomplete information.

Traditional data quality is a thoroughly researched field with several benchmarks and frameworks to grasp its dimensions (Kahn, 2002; Stvilia, 2007; Wang, 1996). Data quality principles typically rely on many subjective indicators that are complex to measure automatically. The quality of data is indeed realized only when the data is used (Godfrey, 1999), and thus relates directly to its ability to satisfy users' continuous needs.

Web documents, which are by nature unstructured and interlinked, require different quality metrics and assessment techniques than traditional datasets. For example, the importance and quality of Web documents can be estimated subjectively via algorithms like PageRank (Page, 1999). Ensuring data quality in Linked Open Data is a complex process, as it consists of structured information supported by models, ontologies and vocabularies and contains queryable endpoints and links, which makes data quality assurance a challenge. Despite the fact that Linked Open Data quality is a trending and highly demanded topic, very few efforts currently try to standardize, track and formalize frameworks that issue scores or certificates to help data consumers in their integration tasks.
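To make the notion of an objective, use-case-independent indicator concrete, the short Python sketch below probes whether a dataset's SPARQL endpoint answers a trivial query. This is only an illustration: the endpoint URL and the use of the SPARQLWrapper library are our assumptions, and the actual indicators measured by the framework are defined in the following sections.

    # Illustrative sketch (not the chapter's tool): endpoint availability as an
    # objective, use-case-independent quality indicator.
    from SPARQLWrapper import SPARQLWrapper, JSON

    def endpoint_is_available(endpoint_url: str, timeout: int = 10) -> bool:
        """Return True if the endpoint answers a trivial ASK query within the timeout."""
        sparql = SPARQLWrapper(endpoint_url)
        sparql.setTimeout(timeout)
        sparql.setQuery("ASK { ?s ?p ?o }")  # cheapest possible probe
        sparql.setReturnFormat(JSON)
        try:
            result = sparql.query().convert()
            return bool(result.get("boolean", False))
        except Exception:
            # Network failures, HTTP errors or malformed responses count as unavailable.
            return False

    # Example call (the endpoint URL is only illustrative):
    print(endpoint_is_available("https://dbpedia.org/sparql"))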

Data quality assessment is the process of evaluating whether a piece of data meets the consumer's needs in a specific use case (Bizer, 2009a). The dimensionality of data quality makes it dependent on the task and the users' requirements. For example, DBpedia (Bizer, 2009b) and YAGO (Suchanek, 2007) are knowledge bases containing data extracted from structured and semi-structured sources. They are used in a variety of applications, e.g., annotation systems (Mendes, 2011), exploratory search (Marie, 2013) and recommendation engines (Di Noia, 2012). However, their data is not integrated into critical systems, e.g., life-critical (medical applications) or safety-critical (aviation applications) ones, as its quality is found to be insufficient. In this paper, we first propose a comprehensive objective framework to evaluate the quality of Linked Data sources. Secondly, we present an extensible quality measurement tool that helps data owners, on the one hand, to rate the quality of their datasets and get hints on possible improvements, and, on the other hand, data consumers to choose their data sources from a ranked set. The aim of this paper is to provide researchers and practitioners with a comprehensive understanding of the objective issues surrounding Linked Data quality.
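As a minimal illustration of how such a tool might combine objective indicators into a single score used for ranking, the Python sketch below computes a weighted mean over per-indicator scores. The indicator names, example scores and equal weights are hypothetical; the actual indicators, their groupings and the weighting scheme of our framework are presented later in this paper.

    # Hypothetical sketch: aggregating objective indicator scores (each in [0, 1])
    # into one dataset quality score via a weighted mean.
    from typing import Dict

    def aggregate_quality(indicators: Dict[str, float], weights: Dict[str, float]) -> float:
        """Weighted mean of the per-indicator scores."""
        total_weight = sum(weights.get(name, 1.0) for name in indicators)
        if total_weight == 0:
            return 0.0
        weighted_sum = sum(score * weights.get(name, 1.0) for name, score in indicators.items())
        return weighted_sum / total_weight

    # Example scores for a dataset with missing licensing information (illustrative only).
    scores = {"completeness": 0.5, "provenance": 0.25, "licensing": 0.0, "comprehensibility": 1.0}
    weights = {name: 1.0 for name in scores}  # equal weights for this sketch
    print(f"{aggregate_quality(scores, weights):.2f}")  # prints 0.44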
