This chapter focuses on topic analysis and identification of search engine user queries. Topic analysis and identification of queries is an important task in information retrieval and a key element in the development of successful personalized search engines. Identifying the topic of a text is not a simple task and remains an unsolved problem. The problem is even harder for search engine user queries because of real-time requirements and the limited number of terms in each query. The chapter includes a detailed literature review on topic analysis and identification, with an emphasis on search engine user queries; a survey of the analytical methods that have been and can be used; and a discussion of the challenges and research opportunities related to topic analysis and identification.
Literature Review For Topic Analysis And Identification Of User Queries
In this chapter, we review the state of the art in topic analysis and identification. The review covers studies ranging from basic analysis of the topics and terms of search engine queries, through query clustering, session identification, automatic new topic identification, and query topic identification models, to the more general context of text classification, categorization, and mining.
Key Terms in this Chapter
Topic Analysis: Analysis aiming to identify the topic of search engine queries.
Topic Identification: Automatically identifying or estimating the topic of search engine queries without human intervention.
Hidden Markov Models: A hidden Markov model is a stochastic process whose underlying process or parameters are not directly observable and can only be monitored through another stochastic process with observable parameters (Rabiner, 1989).
Session Identification: Session identification is discovering the group of sequential log entries that are related to a common user or topic. Related term: New Topic Identification.
Query Clustering: Grouping the sequential log entries into different clusters in terms of topics or users.
Conditional Random Fields: Conditional random fields are a probabilistic framework for labeling and segmenting sequential data, based on conditional probabilities (Wallach, 2004).
Maximum Entropy Modeling: Maximum entropy modeling is a methodology for modeling random, stochastic events. It is motivated by the principle of generating probability distributions from a training dataset and calculating the conditional probability that event y occurs given that event x has occurred.
New Topic Identification: New topic identification is discovering when the user has switched from one topic to another during a single search session, in order to group sequential log entries that are related to a common topic (He, Goker and Harper, 2002). Related term: Session Identification.
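To make the idea of new topic identification concrete, the following is a minimal sketch that flags a topic shift between consecutive log entries. It is an illustrative heuristic only, not the method of He, Goker and Harper (2002) or of any study reviewed in the chapter. The LogEntry structure, the mark_topic_shifts function, and the thresholds max_gap and min_overlap are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    # Hypothetical log record: seconds since the start of the log, and the query text.
    timestamp: float
    query: str


def jaccard(a: set, b: set) -> float:
    """Term overlap between two term sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mark_topic_shifts(entries, max_gap=900.0, min_overlap=0.2):
    """Return one flag per entry: True when the entry is treated as starting a new topic.

    An entry starts a new topic if it is the first entry, if the time since the
    previous entry exceeds max_gap seconds, or if its term overlap with the
    previous query falls below min_overlap. Both thresholds are arbitrary.
    """
    flags = []
    previous = None
    for entry in entries:
        terms = set(entry.query.lower().split())
        if previous is None:
            flags.append(True)
        else:
            prev_time, prev_terms = previous
            gap = entry.timestamp - prev_time
            overlap = jaccard(terms, prev_terms)
            flags.append(gap > max_gap or overlap < min_overlap)
        previous = (entry.timestamp, terms)
    return flags


log = [
    LogEntry(0.0, "hidden markov model"),
    LogEntry(40.0, "hidden markov model tutorial"),
    LogEntry(95.0, "cheap flights istanbul"),
    LogEntry(4000.0, "cheap flights istanbul"),
]
shifts = mark_topic_shifts(log)
print(shifts)
```

In this sketch the second query is kept with the first because the two share most of their terms, the third is flagged because it shares none, and the fourth is flagged only because of the long time gap. A heuristic this simple would misjudge reformulations that use entirely different words for the same topic, which is one reason the short length of queries makes the problem hard.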