Processing and Analysis of Search Query Logs in Chinese
Michael Chau (The University of Hong Kong, Hong Kong), Yan Lu (The University of Hong Kong, Hong Kong), Xiao Fang (The University of Toledo, USA) and Christopher C. Yang (Drexel University, USA)
Copyright: © 2009
More non-English content is now available on the World Wide Web, and the number of non-English users on the Web is increasing. While it is important to understand the Web searching behavior of these users, many previous studies of Web query logs have focused on English search logs, and their results may not apply directly to other languages. In this chapter we discuss some methods and techniques that can be used to analyze search queries in Chinese. We also show an example of applying our methods to a Chinese Web search engine. Some interesting findings are reported.
Search engines have been widely used for finding useful information on the World Wide Web. Many users start their Web activities with popular search engines such as Google, or with regional search engines such as Ayna (http://www.ayna.com) for Arabic. The information needs and search behavior of non-English users differ from those of native English users because of differences in language and culture (Chau et al., 2007). More importantly, some languages, such as Asian languages, have characters, grammars, and structures that differ significantly from those of English. Consequently, the methods and techniques for processing search logs in these languages can be quite different from those for processing English search logs. In this chapter, we discuss methods and issues involved in processing search logs in Chinese. As one of the most widely used non-English languages, Chinese has its own unique characteristics. On the other hand, it shares some characteristics with other Asian languages such as Japanese and Korean. We believe that the methods in this chapter can be extended to these languages.
The chapter is structured as follows. In the next section, we give some background on the characteristics of the Chinese language. Then we discuss the methods and techniques used to analyze Chinese search queries. The section that follows presents the application of our methods to a Chinese Web search engine called Timway. The last section provides a summary of this chapter.
Key Terms in this Chapter
Bigram Analysis: The analysis of all sequences of two adjacent words in each query.
N-Gram Analysis: The analysis of all sequences of n adjacent words in each query.
Zipf Distribution: A distribution in which the frequency of any object is inversely proportional to its frequency rank. It has been observed in text corpora, database contents, and other natural phenomena.
Chinese Search Logs: Search logs containing Chinese queries, which are often received in different character encodings. GB-2312, GBK, and BIG 5 are the three most popular Chinese character encoding schemes, and their popularity varies across Chinese-speaking regions. For example, Traditional Chinese, usually encoded in BIG 5, is widely used in Hong Kong and Taiwan, while Simplified Chinese, usually encoded in GB-2312, is more commonly used in mainland China and Singapore.
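The key terms above can be illustrated with a minimal Python sketch. It is not the chapter's actual implementation. It assumes a hypothetical log format in which each entry is a pair of raw query bytes and the name of the encoding they arrived in, and it treats each Chinese character as the unit of an n-gram, since Chinese text has no spaces between words. The function names (decode_query, char_ngrams, rank_frequency) and the sample queries are invented for illustration.

```python
from collections import Counter

def decode_query(raw: bytes, encoding: str) -> str:
    """Decode raw query bytes into a Unicode string using the declared encoding."""
    return raw.decode(encoding)

def char_ngrams(query: str, n: int) -> list[str]:
    """Return every sequence of n adjacent characters in the query.

    Assumption: characters (not segmented words) are the n-gram unit,
    and whitespace is ignored.
    """
    chars = [c for c in query if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def rank_frequency(counts: Counter) -> list[tuple[int, str, int]]:
    """List (rank, n-gram, frequency) rows, most frequent first.

    Under a Zipf distribution, frequency is roughly proportional to 1 / rank.
    """
    return [
        (rank, gram, freq)
        for rank, (gram, freq) in enumerate(counts.most_common(), start=1)
    ]

# Hypothetical log entries: (raw bytes, encoding in which they were received).
log = [
    ("香港大學".encode("big5"), "big5"),      # Traditional Chinese query
    ("香港天氣".encode("big5"), "big5"),      # Traditional Chinese query
    ("香港大学".encode("gb2312"), "gb2312"),  # Simplified Chinese query
]

# Bigram analysis (n = 2) over all decoded queries.
counts = Counter()
for raw, encoding in log:
    counts.update(char_ngrams(decode_query(raw, encoding), 2))

for rank, gram, freq in rank_frequency(counts):
    print(rank, gram, freq)
```

Decoding every query to Unicode first lets queries received in BIG 5 and GB-2312 be counted together. Note that this sketch does not map Traditional characters to their Simplified equivalents, so 大學 and 大学 are still counted as different bigrams.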