This chapter presents a new approach to mining the Web to identify people of similar background. Finding people on the Web who are similar to a given person raises two major research issues: how to represent a person and how to match persons. This chapter proposes a person representation method that uses a person’s personal Web site to represent that person’s background. Based on this representation, the main proposed algorithm integrates the textual content and the hyperlink information of all the Web pages belonging to a personal Web site in order to represent a person and match persons. Other algorithms are also explored and compared with the main proposed algorithm. The evaluation methods and experimental results are presented.
In this section, we introduce previous studies on person search and people search.
Previous studies of person search mainly focus on how to find Web pages related to a person, given this person’s name as the query. In these systems, the query, which is usually a person’s name, is sent to regular search engines, and the results returned by those search engines are then refined to find the Web pages related to this person.
Key Terms in this Chapter
People Search: People search is the task of finding other people whose interests or background are similar to those of a given person. It is called “people search” because its purpose is to find a list of people who are similar to the given one in terms of interests and background.
Link Similarity: The degree of similarity between two Web sites (or Web pages), based on the link information (inlinks and outlinks) of the two Web sites (or Web pages).
Content Similarity: The degree of similarity between two Web sites (or Web pages), based on the textual content (terms appearing in them) of the two Web sites (or Web pages).
Web Page and Web Site: In this study, a Web page is a single Web document in a Web site. A Web site holds one or more Web pages.
Person Search: Person search is a type of search that finds pages related to a specific person, given this person’s name as the query. It aims to find pages authored by that person or containing information about that person.
Inlink and Outlink: For a Web page W, an inlink is the URL of another Web page that contains a link pointing to W, and an outlink is a link (URL) appearing in W that points to another Web page.
Term Weight: Different terms have different importance in a textual unit, e.g., a document, a document collection, or a Web site. A term’s weight is a value representing how important the term is in a textual unit. Usually a term’s frequency of appearance in a document or its TF.IDF value is used as its weight.
Word Stemming: A process that strips off word endings, reducing words to a root form or a common stem. For example, after word stemming is applied to the words “designed,” “designs,” and “designing,” they all have the same root form, “design.”
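To make the key terms above more concrete, the following Python sketch shows one plausible way to compute term weights, content similarity, and link similarity for a few Web sites. It is an illustration under stated assumptions, not the chapter’s algorithm: the definitions above do not fix particular formulas, so cosine similarity over TF.IDF weights and Jaccard overlap over link sets are choices made here, `toy_stem` is a deliberately naive suffix stripper rather than a real stemming algorithm, and all site names, texts, and URLs are hypothetical.

```python
import math
from collections import Counter


def toy_stem(word):
    """Naive suffix stripping for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tf_idf_weights(site_terms):
    """Map each site to {term: TF.IDF weight}, treating each site as one unit."""
    n_sites = len(site_terms)
    # Site frequency: the number of sites in which each term occurs.
    df = Counter()
    for terms in site_terms.values():
        df.update(set(terms))
    weights = {}
    for site, terms in site_terms.items():
        tf = Counter(terms)
        weights[site] = {t: tf[t] * math.log(n_sites / df[t]) for t in tf}
    return weights


def content_similarity(w1, w2):
    """Cosine similarity between two term-weight dictionaries (assumed measure)."""
    dot = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def link_similarity(links1, links2):
    """Jaccard overlap between two sets of inlink/outlink URLs (assumed measure)."""
    union = links1 | links2
    if not union:
        return 0.0
    return len(links1 & links2) / len(union)


# Hypothetical personal Web sites: raw text and pooled inlinks/outlinks.
raw_text = {
    "alice": "web mining search web",
    "bob": "web mining links",
    "carol": "cooking recipes gardens",
}
links = {
    "alice": {"http://example.org/kdd", "http://example.org/www", "http://example.org/lab-a"},
    "bob": {"http://example.org/kdd", "http://example.org/www", "http://example.org/lab-b"},
    "carol": {"http://example.org/food"},
}

# Stem every token, then weight the stemmed terms per site.
stemmed = {site: [toy_stem(w) for w in text.split()] for site, text in raw_text.items()}
weights = tf_idf_weights(stemmed)

stems = [toy_stem(w) for w in ("designed", "designs", "designing")]
sim_content_ab = content_similarity(weights["alice"], weights["bob"])
sim_content_ac = content_similarity(weights["alice"], weights["carol"])
sim_link_ab = link_similarity(links["alice"], links["bob"])
sim_link_ac = link_similarity(links["alice"], links["carol"])
```

In this toy data, the sites for "alice" and "bob" share stemmed terms and two of four distinct links, so both of their similarity scores are positive, while "alice" and "carol" share neither terms nor links and score zero on both measures. How content and link similarity should be combined into a single matching score is left open here, since the text above does not specify it.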