User Profiles for Personalizing Digital Libraries
Giovanni Semeraro (University of Bari, Italy), Pierpaolo Basile (University of Bari, Italy), Marco de Gemmis (University of Bari, Italy) and Pasquale Lops (University of Bari, Italy)
Copyright: © 2009
Exploring digital collections to find information relevant to a user’s interests is a challenging task. Information preferences vary greatly across users; therefore, filtering systems must be highly personalized to serve the individual interests of each user. Algorithms designed to solve this problem base their relevance computations on user profiles, in which representations of the users’ interests are maintained. The main focus of this chapter is the adoption of machine learning to build user profiles that capture user interests from documents. These profiles are used for intelligent document filtering in digital libraries. This work proposes exploiting the knowledge stored in machine-readable dictionaries to obtain accurate user profiles that describe user interests by referring to concepts in those dictionaries. The main aim of the proposed approach is to show a real-world scenario in which the combination of machine learning techniques and linguistic knowledge helps to achieve intelligent document filtering.
Our research was mainly inspired by the following works.
Syskill & Webert (Pazzani & Billsus, 1997) is an agent that learns a user profile to identify interesting Web pages. The learning process first converts a hypertext markup language (HTML) source into positive and negative examples, represented as keyword vectors, and then applies learning algorithms such as Bayesian classifiers, a nearest neighbor algorithm, and a decision tree learner.
Personal WebWatcher (Mladenic, 1999) is a Web browsing recommendation service that generates a user profile based on the content analysis of the requested pages. Learning is done by a naïve Bayes classifier where documents are represented as weighted keyword vectors, and classes are “interesting” and “not interesting.”
Mooney and Roy (2000) adopt a text categorization method in their Libra system that performs content-based book recommendations by exploiting product descriptions obtained from the Web pages of the Amazon online digital store. Also in this case, documents are represented by using keywords, and a naïve Bayes text classifier is adopted.
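The keyword-based learning scheme shared by these systems can be illustrated with a minimal sketch of a multinomial naïve Bayes classifier over bags of keywords. This is not the implementation of any of the systems cited above; the function names, the whitespace tokenizer, the add-one smoothing, and the tiny training set are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real keyword extraction.
    return text.lower().split()

def train(examples):
    """Build per-class keyword counts from (text, label) pairs."""
    class_docs = Counter()   # number of training documents per class
    word_counts = {}         # class label -> Counter of keyword frequencies
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore keywords never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents rated by a user.
examples = [
    ("machine learning text classification", "interesting"),
    ("bayesian learning user profiles", "interesting"),
    ("football match results", "not interesting"),
    ("celebrity gossip football news", "not interesting"),
]
model = train(examples)
print(classify(model, "learning user profiles from text"))
print(classify(model, "football news"))
```

Because the profile is made of surface keywords, two documents that express the same concept with different words share no features in this representation.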
The main limitation of these approaches is that they represent items by using keywords. The objective of our research is to create accurate semantic user profiles. Among the state-of-the-art systems that produce semantic user profiles, SiteIF (Magnini & Strapparava, 2001) is a personal agent for a multilingual news Web site that exploits a sense-based representation to build a user profile as a semantic network, whose nodes represent senses of the words in documents requested by the user.
The role of linguistic ontologies in knowledge-retrieval systems is explored in OntoSeek (Guarino, Masolo, & Vetere, 1999), a system designed for content-based information retrieval from online yellow pages and product catalogs. OntoSeek combines an ontology-driven content-matching mechanism based on WordNet with a moderately expressive representation formalism. The approach has shown that structured content representations coupled with linguistic ontologies can increase both recall and precision of content-based retrieval.
We adopted a content-based method that learns user profiles from documents represented by word senses, which are obtained through a word sense disambiguation strategy that exploits the WordNet IS-A hierarchy.
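To give an intuition of how an IS-A hierarchy can drive word sense disambiguation, the following sketch picks, for an ambiguous word, the sense that lies closest to the senses of its context words in a small hand-made hierarchy. It is only an illustration under stated assumptions, not the disambiguation strategy used in this chapter: the hierarchy, the sense identifiers, and the path-based similarity 1 / (1 + path length) are all hypothetical and stand in for WordNet and for whatever measure the actual method adopts.

```python
# Hypothetical toy IS-A hierarchy: child -> parent (not real WordNet data).
PARENT = {
    "mouse.animal": "rodent",
    "rodent": "mammal",
    "cat.animal": "feline",
    "feline": "mammal",
    "mammal": "animal",
    "animal": "entity",
    "mouse.device": "device",
    "keyboard.device": "device",
    "device": "artifact",
    "artifact": "entity",
}

# Hypothetical sense inventory: word -> candidate senses.
SENSES = {
    "mouse": ["mouse.animal", "mouse.device"],
    "cat": ["cat.animal"],
    "keyboard": ["keyboard.device"],
}

def ancestors(node):
    """Return the chain from a node up to the root, node included."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_length(a, b):
    """Number of IS-A edges between a and b through their lowest common ancestor."""
    depth_from_b = {node: i for i, node in enumerate(ancestors(b))}
    for i, node in enumerate(ancestors(a)):
        if node in depth_from_b:
            return i + depth_from_b[node]
    return None  # no common ancestor

def similarity(a, b):
    distance = path_length(a, b)
    return 0.0 if distance is None else 1.0 / (1.0 + distance)

def disambiguate(word, context):
    """Choose the sense of `word` most similar to the senses of the context words."""
    def score(sense):
        return sum(
            max((similarity(sense, s) for s in SENSES.get(w, [])), default=0.0)
            for w in context
        )
    return max(SENSES[word], key=score)

print(disambiguate("mouse", ["cat"]))
print(disambiguate("mouse", ["keyboard"]))
```

In this toy setting, "mouse" resolves to the animal sense next to "cat" and to the device sense next to "keyboard", because the corresponding nodes are fewer IS-A edges apart.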
Key Terms in this Chapter
Synset: A group of data elements that are considered semantically equivalent for the purposes of information retrieval.
Personalization: The process of tailoring products or services to users based on their user profiles.
Word Sense Disambiguation: The problem of determining in which sense a word having a number of distinct senses is used in a given sentence.
User Profile: A structured representation of interests (and disinterests) of a user or group of users.
NLP (Natural Language Processing): A subfield of artificial intelligence and linguistics that studies the problems of automated generation and understanding of natural human languages. It converts samples of human language into more formal representations that are easier for computer programs to manipulate.
WordNet: A semantic lexicon for the English language. It groups English words into sets of synonyms called synsets. It provides short, general definitions, and records the various semantic relations between these synonym sets.
Recommender system: A system that guides users in a personalized way to interesting or useful objects in a large space of possible options.
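The Synset and WordNet entries above can be made concrete with a small sketch that contrasts a bag-of-keywords representation with a bag-of-synsets one. The lexicon below is a hypothetical toy mapping, not WordNet data, and it assumes each word has exactly one sense; with real text, a word sense disambiguation step would be needed to choose the synset.

```python
from collections import Counter

# Hypothetical toy lexicon: each word maps to the identifier of a single synset.
WORD_TO_SYNSET = {
    "car": "synset:car",
    "automobile": "synset:car",
    "auto": "synset:car",
    "book": "synset:book",
    "volume": "synset:book",
}

def bag_of_keywords(text):
    """Count surface words: synonyms stay separate features."""
    return Counter(text.lower().split())

def bag_of_synsets(text):
    """Count synset identifiers: synonyms collapse into one feature."""
    return Counter(
        WORD_TO_SYNSET[word]
        for word in text.lower().split()
        if word in WORD_TO_SYNSET
    )

doc = "car automobile auto book"
print(bag_of_keywords(doc))
print(bag_of_synsets(doc))
```

With keywords, the document has four features that each occur once; with synsets, the three synonyms are counted together as one concept occurring three times.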