Semantic Search on Unstructured Data: Explicit Knowledge through Data Recycling

Alex Kohn (Roche Diagnostics GmbH, Germany), François Bry (University of Munich, Germany) and Alexander Manta (Roche Diagnostics GmbH, Germany)
Copyright: © 2010 | Pages: 19
DOI: 10.4018/jswis.2010040102

Abstract

Studies agree that searchers are often dissatisfied with the performance of current enterprise search engines. As a consequence, more and more scientists worldwide are investigating new approaches to search in order to improve retrieval performance. This paper contributes YASA (Your Adaptive Search Agent), a fully implemented and thoroughly evaluated ontology-based information retrieval system for the enterprise. A salient feature of YASA is that large parts of the ontology are automatically filled with facts by recycling and transforming existing data. YASA offers context-based personalization, faceted navigation, and semantic search capabilities. YASA has been deployed and evaluated in the pharmaceutical research department of Roche, Penzberg, and the results show that even semantically simple ontologies suffice to considerably improve search performance.

Introduction

Nowadays most data produced in business is captured electronically and stored in computer systems. Search engines are of key importance in making this “hidden” information visible to employees. Spoiled by the improvements in Web search, experts now expect similar search performance in their intranet environment. However, current state-of-the-art enterprise search engines underperform (Feldman & Sherman, 2004). As a result, searching for information becomes a central problem in companies.

A particularity of enterprise search is the lack of scientific publications. In the case of commercial products, the information provided in booklets or white papers gives only a vague picture of the applied algorithms. An aggravating factor is that the effectiveness of these methods in improving information retrieval in the enterprise has barely been investigated empirically. Indeed, published methods are often restricted to synthetic evaluations. Further, scientific publications often describe methods that are optimized for the Web but not for intranet environments. Lastly, papers addressing intranet search often focus on the intranet web, ignoring the fact that file shares, e-mails, databases, applications, etc. are also part of an intranet that needs to be searched.

We conclude that search for information in intranet environments is disappointing both in theory and in practice. The question that arises is: why is search for information in the enterprise such a challenge?

Many reasons can be given (Fagin et al., 2003; Hawking, 2004): heterogeneous data sources and formats, complex security permissions, fewer user observations, few or missing metadata, growing amounts of data, etc.

The World Wide Web is dominated by the hypertext protocol. This is in contrast to intranets, where only a small portion of the data is in a Web-accessible format. This heterogeneity makes data integration a difficult task, as large portions of the intranet are not search-engine friendly. Further, ranking of search results is made more difficult by a different or missing link structure (Xue et al., 2003).

The complex security permissions present in companies are a mixed blessing. On the one hand, the information landscape is fragmented into many silos, i.e. any employee can see only a small subset of all data. On the other hand, ranking of search results is eased, as only a subset of all data needs to be sorted by relevance. The degree of fragmentation depends, of course, on the company’s philosophy of information sharing across departments.
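The interplay between permissions and ranking described above can be illustrated with a minimal sketch. It is not taken from the paper: the `Document` class, the group names, and the term-count relevance score are hypothetical stand-ins chosen only to show that filtering by access rights first leaves a smaller candidate set to rank.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical document model; the field names are assumptions.
    doc_id: str
    text: str
    allowed_groups: frozenset = frozenset()


def visible_documents(docs, user_groups):
    """Keep only documents that share at least one group with the user."""
    return [d for d in docs if d.allowed_groups & set(user_groups)]


def rank(docs, query):
    """Toy relevance: count occurrences of the query terms in the text."""
    terms = query.lower().split()

    def score(doc):
        words = doc.text.lower().split()
        return sum(words.count(t) for t in terms)

    # Drop documents that match no term, then sort by descending score,
    # breaking ties by doc_id so the order is deterministic.
    scored = [(score(d), d) for d in docs]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [d.doc_id for _, d in scored]


def search(docs, query, user_groups):
    """Apply the permission filter before ranking."""
    return rank(visible_documents(docs, user_groups), query)


corpus = [
    Document("assay-report", "antibody assay results for antibody screening",
             frozenset({"research"})),
    Document("hr-policy", "travel policy and antibody lab safety",
             frozenset({"hr"})),
    Document("project-plan", "screening project plan",
             frozenset({"research", "management"})),
]

print(search(corpus, "antibody screening", {"research"}))
```

In this sketch a member of the hypothetical `research` group never sees `hr-policy`, so only two of the three documents have to be scored and sorted.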

Observing a user’s search behavior enables search engines to detect the context of a user, which ultimately leads to personalization services (Micarelli et al., 2007). Such services are already part of the leading Web search engines. Offering personalization services in the enterprise, however, is a difficult task due to the lack of feedback data: few users face a large amount of data.

Another challenge is that hardly any explicit metadata is available and that most documents are unstructured free text. It is therefore difficult to offer semantic search capabilities, a problem well known from the Internet.

Considering the challenges mentioned above, the problem is how to improve search for information in the enterprise. Could integration, i.e. federated search, make the information landscape accessible? Could the ranking of search results be improved by applying faceted navigation or personalized search? Could high-quality metadata be obtained by applying automatic information extraction? Could domain knowledge (e.g., organizational charts, project databases, etc.) be used to set the searcher as well as the results in context?

Technically, we contribute by compiling and developing several approaches to the listed challenges, namely role-based adaptation, guided navigation, and incorporation of domain knowledge. The approaches are implemented in YASA (Your Adaptive Search Agent). YASA is deployed in the pharmaceutical research department of Roche in Penzberg.
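To make guided navigation and role-based adaptation more concrete, the following is a minimal sketch under stated assumptions. It does not reproduce YASA's implementation, which this preview does not describe; the result records, the facet names `department` and `type`, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical search results carrying simple metadata facets.
results = [
    {"id": "d1", "department": "research", "type": "report"},
    {"id": "d2", "department": "research", "type": "protocol"},
    {"id": "d3", "department": "quality", "type": "report"},
]


def facet_counts(results, facet):
    """Count how many results fall under each value of a facet."""
    return Counter(r[facet] for r in results if facet in r)


def refine(results, facet, value):
    """Guided navigation step: keep results matching a chosen facet value."""
    return [r for r in results if r.get(facet) == value]


def adapt_to_role(results, preferred_department):
    """Role-based adaptation (assumed form): list results from the user's
    own department first, keeping the original order otherwise."""
    # sorted() is stable, so ties keep their incoming order.
    return sorted(results, key=lambda r: r.get("department") != preferred_department)


print(facet_counts(results, "type"))
print([r["id"] for r in refine(results, "department", "research")])
print([r["id"] for r in adapt_to_role(results, "quality")])
```

The facet counts would be shown next to the result list so that a searcher can narrow the results step by step, while the role-based reordering only changes the presentation order and hides nothing.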
