With the rapid growth of the World Wide Web, the use of automated Web-mining techniques to discover useful and relevant information has become increasingly important. One challenging direction is Web usage mining, which attempts to discover user navigation patterns from Web access logs. Properly exploited, the information obtained from Web usage logs can help improve the design of a Web site, refine queries for effective Web search, and build personalized search engines. However, Web log data are usually large and extremely detailed, because they are likely to record every aspect of a user request to a Web server. It is thus of great importance to process the raw Web log data appropriately and to identify the target information intelligently. In this chapter, we first briefly review the concept of Web usage mining and discuss how it differs from classic Knowledge Discovery techniques. We then focus on exploiting Web log sessions, defined as groups of requests made by a single user for a single navigation purpose, in Web usage mining. We also compare some state-of-the-art techniques for identifying log sessions from Web servers, and present some popular Web mining techniques, including Association Rule Mining, Clustering, Classification, Collaborative Filtering, and Sequential Pattern Learning, that can be applied to Web log data for different research and application purposes.
Web Usage Mining (WUM), defined as the discovery and analysis of useful information from the World Wide Web, has been an active area of research and commercialization in recent years (Cooley, Srivastava, & Mobasher, 1997). In general, as shown in Figure 1, WUM can be viewed as a three-phase process consisting of data preparation, pattern discovery, and pattern analysis (Srivastava, Cooley, Deshpande, & Tan, 2000).
This process implicitly covers the standard process of Knowledge Discovery in Databases (KDD), and WUM can therefore be regarded as an application of KDD to the Web domain. Nevertheless, it is distinct from standard KDD methods in that it faces the unique challenge of dealing with the overwhelming volume of resources on the Internet. To assist Web users in browsing the Internet more efficiently, it is widely accepted that the easiest way to find knowledge about user navigation is to explore the Web server logs. Generally, Web logs record all user requests to a Web server. Each request is recorded as a log file entry, which contains different types of information, including the IP address of the computer making the request, the timestamp of the access, and the document or image requested. Figure 2 shows an example extracted from the Livelink Web server log (Huang, An, Cercone, & Promhouse, 2002).
Livelink is a database-driven, Web-based knowledge management system developed by Open Text Corporation (http://www.opentext.com). It provides a Web-based environment (such as an intranet or extranet) to facilitate collaboration among cross-functional employees within an organization. In this example, a user at the computer with IP address 188.8.131.52 requested a query with object ID 12856199 on April 10 at 7:22 p.m.
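To make the structure of a log entry concrete, the following minimal sketch extracts the requesting host, the access timestamp, and the requested resource from a single entry. It assumes the widely used Common Log Format rather than the Livelink log format, which differs from it; the sample line, the `LogEntry` type, and the `parse_entry` function are hypothetical and serve only as illustration.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout (Common Log Format, NOT the Livelink format):
#   host ident authuser [timestamp] "method resource protocol" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


class LogEntry(NamedTuple):
    """The fields of one request that usage mining typically relies on."""
    host: str            # IP address or host name of the requesting computer
    timestamp: datetime  # time of the access
    method: str          # HTTP method, e.g. GET
    resource: str        # document or image requested
    status: int          # HTTP status code returned by the server


def parse_entry(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed format."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return LogEntry(
        host=match.group("host"),
        timestamp=timestamp,
        method=match.group("method"),
        resource=match.group("resource"),
        status=int(match.group("status")),
    )


# Hypothetical sample entry, invented for illustration only.
SAMPLE = '192.0.2.1 - - [10/Apr/2002:19:22:00 -0400] "GET /docs/report.html HTTP/1.0" 200 2326'

entry = parse_entry(SAMPLE)
if entry is not None:
    print(entry.host, entry.timestamp.isoformat(), entry.resource)
```

Returning `None` for lines that do not match, rather than raising an error, reflects the fact that raw server logs often contain malformed or irrelevant entries that the data preparation phase has to skip.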