User Assisted Creation of Open-Linked Data for Training Web Information Extraction in a Social Network

Martin Necasky (Charles University, Czech Republic), Ivo Lasek (Charles University, Czech Republic), Dominik Fiser (Charles University, Czech Republic), Ladislav Peska (Charles University, Czech Republic) and Peter Vojtas (Charles University, Czech Republic)
Copyright: © 2013 |Pages: 11
DOI: 10.4018/978-1-4666-2827-4.ch002

Abstract

To address the first problem, we propose several procedures for creating Open-Linked Data, including the assisted creation of annotations (serving as a baseline or training set for Web Information Extraction tools), employing the social network, and specific approaches to creating Open-Linked Data from governmental data resources. We describe cases where such data can be used (e.g., in e-commerce, recommender systems, and governmental and public policy projects).
Chapter Preview

Human Assisted Creation Of Semantic Content

As we already mentioned in the introduction, the main problem of the Semantic Web (the web of data) is a sociological one (only afterwards is it a managerial problem, and then also a technological one). The problem is: who (and also why, how, when, where, etc.) will create semantic content? This is the main goal of our project of Web Semantization (Dědek, Eckhardt, & Vojtáš, 2009). Here, web semantization is understood as a process of gradual enrichment of the web with semantic content through third-party annotation. This work is based on four main ideas: (1) A web repository of indexes for a part of the web (e.g., the Czech web (.cz), terabytes in size). This was based on the Egothor system, which offered full-text indexing and some additional features. (2) The use of web information extraction to automate the enrichment (annotation) process. (3) A semantic repository, which is a crucial component of our approach. With third-party annotation, we cannot change the original pages, so annotations have to be stored elsewhere. (4) A “user software agent” that supports users (customers) in specific applications and makes use of the semantic content.
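The third idea above, a repository of third-party annotations kept separate from the annotated pages, can be sketched as a simple data structure. This is a minimal illustration only; the class and method names are our own assumptions, not the actual interface of the Web Semantization project.

```python
from collections import defaultdict

# Sketch of idea (3): a third-party semantic repository. Because the
# original pages cannot be modified, annotations (RDF-style triples)
# are stored separately, keyed by the URL of the annotated page.
class SemanticRepository:
    def __init__(self):
        self._annotations = defaultdict(list)

    def annotate(self, page_url, subject, predicate, obj):
        """Attach a (subject, predicate, object) triple to a page URL."""
        self._annotations[page_url].append((subject, predicate, obj))

    def annotations_for(self, page_url):
        """Return all third-party triples attached to a page (may be empty)."""
        return list(self._annotations[page_url])


repo = SemanticRepository()
repo.annotate("http://example.org/offer/42",
              "http://example.org/offer/42",
              "http://example.org/ontology/price",
              "199 EUR")
triples = repo.annotations_for("http://example.org/offer/42")
```

A user software agent (idea 4) could then query such a repository by page URL instead of re-parsing the original HTML.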

Community Efforts

Considerable results in creating semantic content have been achieved by various community efforts. If publishers themselves annotated their pages (e.g., during SEO using the schema.org vocabulary), it would substantially enlarge the semantic content on the web (we do not go into the problems of this approach; it is not clear how far such annotations will be interlinked). Another impulse that can push publishers to annotate their resources comes from smaller, well-organized communities, typically around licensed work (e.g., in the medical domain). Last, but not least, there are various laws pushing publishers toward machine-readable publishing of (mostly legal) documents.
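As a concrete illustration of publisher-side annotation with the schema.org vocabulary, the sketch below generates a JSON-LD description of a product offer, of the kind a publisher would embed in a page inside a `<script type="application/ld+json">` element. The product data is invented for illustration; only the `@context`, `@type`, and property names come from the schema.org vocabulary.

```python
import json

def product_jsonld(name, price, currency):
    """Build a schema.org Product/Offer annotation as a JSON-LD string."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": str(price),
            "priceCurrency": currency,
        },
    }, indent=2)

# Example: annotate a (fictional) e-commerce product page.
annotation = product_jsonld("Example Phone", "299.00", "EUR")
```

Annotations of this kind make product data machine-readable for crawlers without changing the visible page content.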

Another big human effort is Wikipedia. From a human point of view, it contains a lot of semantics, especially about named entities and their disambiguation. DBpedia is a community effort to extract structured and interlinked information from Wikipedia and make it available as an RDF database on the web. The main source of information for DBpedia are the so-called “infoboxes” (i.e., small tables on the right side of some Wikipedia articles containing basic information about the described entity). Based on hand-written mapping rules maintained by the DBpedia community, the information is extracted from the source code of particular articles and mapped to a common DBpedia ontology.
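The extraction step described above can be sketched in miniature: parse the key-value pairs of an infobox in Wikipedia source markup and map them to RDF-style triples through a hand-written mapping table. The mapping dictionary and property URIs below are simplified stand-ins, not the real DBpedia mappings or ontology.

```python
import re

# Hand-written mapping rules: infobox keys -> (assumed) ontology properties.
MAPPING_RULES = {
    "name": "http://dbpedia.org/ontology/name",
    "birth_date": "http://dbpedia.org/ontology/birthDate",
    "occupation": "http://dbpedia.org/ontology/occupation",
}

def extract_infobox_triples(subject_uri, wiki_source):
    """Parse '| key = value' infobox lines and emit (s, p, o) triples
    for every key that is covered by a mapping rule."""
    triples = []
    for match in re.finditer(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$", wiki_source, re.M):
        key, value = match.group(1), match.group(2)
        predicate = MAPPING_RULES.get(key)
        if predicate:  # unmapped keys are silently skipped, as in DBpedia
            triples.append((subject_uri, predicate, value))
    return triples

infobox = """{{Infobox person
| name = Alan Turing
| birth_date = 23 June 1912
| occupation = Mathematician
}}"""

triples = extract_infobox_triples(
    "http://dbpedia.org/resource/Alan_Turing", infobox)
```

The real DBpedia framework additionally normalizes values (dates, units, links to other resources), but the pattern is the same: community-maintained rules turn semi-structured infobox fields into ontology-typed triples.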
