Studies that rely on Web usage mining can be experimental or observational in nature. The focus of such studies varies widely and may include predicting online purchase intentions (Hooker & Finkelman, 2004; Moe, 2003; Montgomery, Li, Srinivasan, & Liechty, 2004), designing recommender systems for e-commerce products and sites (Cho & Kim, 2004; Kim & Cho, 2003), understanding navigation and search behavior (Chiang, Dholakia, & Westin, 2004; Gery & Haddad, 2003; Johnson, Moe, Fader, Bellman, & Lohse, 2004; Li & Zaiane, 2004), or myriad other subjects. Regardless of the issue being studied, data collection for Web usage mining studies often proves to be a vexing problem, and ideal research designs are frequently sacrificed in the interest of finding a workable data capture or collection mechanism. Despite the difficulties involved, the research community has recognized the value of Web-based experimental research (Saeed, Hwang, & Yi, 2003; Zinkhan, 2005), and has, in fact, called on investigators to exploit “non-intrusive means of collecting usage and exploration data” (Gao, 2003, p. 31) in future Web studies. In this article we discuss some of the methodological complexities that arise when conducting studies that involve Web usage mining. We then describe an innovative, software-based methodology that addresses many of these problems. The methods described here are most applicable to experimental studies, but they can also be applied in ex post observational research settings.
Approaches to Web usage mining can be server-centric or client-centric. In the server-centric case the data are harvested from a server machine. In some instances this approach requires no special software mechanisms, since server logs are maintained routinely by server software. Client-centric approaches always require special data collection mechanisms because standard browsers do not document user actions. An example of such a mechanism is the PCMeter usage mining application. This software runs in the background on the client machine, recording click-stream data as the research subject interacts with a Web browser (see Johnson et al., 2004, and Montgomery et al., 2004, for usage examples).
Server logs are the most frequently used data source for usage mining studies. This is because the data are readily available in a standard, machine-readable format, and preexisting Web sites can be used as long as the server log data can be procured for analysis. However, the literature is rife with criticism of the shortcomings of this data source (e.g., Bracke, 2004; Fenstermacher & Ginsburg, 2003; Huysmans, Baesens, & Vanthienen, 2004; Montgomery et al., 2004; Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003). The problems arise from such confounding elements as multiple server types (e.g., proxy servers, image servers, and application servers), server farms and load balancing procedures, caching activities, the stateless nature of sessions, and so forth. In the words of Shahabi, Banaei-Kashani, and Faruque (2001, p. 1), “... usage data acquisition via server logs is neither reliable, nor efficient. It is unreliable due to the side effects of the network ... [it is] inefficient because of usage data requiring extensive preprocessing before it can be utilized.”
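To make the preprocessing burden concrete, the following minimal sketch parses a few server log lines and discards image requests so that only page views remain. It assumes the server writes the widely used Common Log Format, which the studies cited above do not necessarily use; the sample lines, the `parse_clf_line` helper, and the list of image extensions are hypothetical and serve only as illustration.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical list of embedded-resource extensions to filter out.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".png")


def parse_clf_line(line):
    """Return a dict of fields from one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    parts = fields["request"].split()
    # The request line is "METHOD path PROTOCOL"; keep only the path.
    fields["path"] = parts[1] if len(parts) >= 2 else None
    return fields


# Hypothetical sample lines, not taken from any real log.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2004:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/2004:13:55:37 -0700] "GET /images/logo.gif HTTP/1.0" 200 512',
    '192.0.2.10 - - [10/Oct/2004:13:56:02 -0700] "GET /products.html HTTP/1.0" 200 4810',
]

records = [parse_clf_line(line) for line in SAMPLE_LOG]

# Keep only requests that look like page views rather than embedded images.
page_views = [
    r["path"]
    for r in records
    if r and r["path"] and not r["path"].endswith(IMAGE_EXTENSIONS)
]
print(page_views)
```

Even this small example has to drop one of three requests before the navigation path becomes visible. It also cannot recover pages served from a browser or proxy cache, and it cannot tell apart several users behind the same address, which is the kind of unreliability the criticism above describes.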
Key Terms in this Chapter
ASP (Active Server Page) Scripting: A simple server-side scripting approach where script code (usually VBScript or JScript) is mixed with HTML code on a Web page. The script code is processed by a script engine before the page is rendered by the server. This can be used to create dynamic Web pages and to share data within or between Web sessions. This is a predecessor of ASP.NET technology and is sometimes called Classic ASP. ASP pages are identified by an “.asp” file extension. (See ASP.NET.)
ASP.NET: The new generation of ASP provided by the Microsoft .NET environment. ASP.NET supports a number of advanced features including server-side controls, dynamic data binding, Web services, and Web forms. All .NET programming languages (e.g., C++, C#, VB) are fully supported, so the developer is no longer restricted to using simple scripting languages. ASP.NET components are compiled, thereby providing major security and performance enhancements. ASP.NET pages are identified by an “.aspx” file extension. (See ASP Scripting.)
Web Usage Mining: Harvesting and processing data for the purpose of uncovering usage and navigation patterns of Web users. This is usually recognized as a sub-category of Web Mining. The other elements of this broader term are Web Content Mining and Web Structure Mining. These focus, respectively, on the information content of Web documents and on the hyperlink structure among Web documents.
Click-Stream: In Web research, the click-stream is the sequence of Web pages visited by the experimental subject. A click-stream data record can be as simple as a URL and a sequence number, or a timestamp can also be added. Including the timestamp allows for analysis of page viewing time.
Method: In object-oriented programming, methods are the actions or behaviors that an object can perform. At the coding level, a method is created by including a procedure (function or sub) within the class.
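The Click-Stream and Method entries above can be illustrated together with a short sketch. It is written in Python rather than the ASP or .NET languages named in this chapter, and the `ClickRecord` and `ClickStream` classes, the `viewing_times` method, and the sample URLs and times are all hypothetical. The sketch assumes that the viewing time of a page is the gap between its timestamp and the timestamp of the next page in the sequence, so the last page has no measurable viewing time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class ClickRecord:
    """One click-stream record: sequence number, URL, optional timestamp."""
    sequence: int
    url: str
    timestamp: Optional[datetime] = None


class ClickStream:
    """An ordered sequence of pages visited by one subject."""

    def __init__(self, records: List[ClickRecord]):
        # Order by sequence number so the records need not arrive sorted.
        self.records = sorted(records, key=lambda r: r.sequence)

    def viewing_times(self) -> List[Tuple[str, float]]:
        """Method returning (url, seconds viewed) for each page but the last.

        Pairs in which either record lacks a timestamp are skipped.
        """
        times = []
        for current, following in zip(self.records, self.records[1:]):
            if current.timestamp is None or following.timestamp is None:
                continue
            seconds = (following.timestamp - current.timestamp).total_seconds()
            times.append((current.url, seconds))
        return times


# Hypothetical session of three page visits.
stream = ClickStream([
    ClickRecord(1, "/home", datetime(2004, 10, 10, 13, 55, 36)),
    ClickRecord(2, "/products", datetime(2004, 10, 10, 13, 56, 2)),
    ClickRecord(3, "/checkout", datetime(2004, 10, 10, 13, 57, 30)),
])
print(stream.viewing_times())  # [('/home', 26.0), ('/products', 88.0)]
```

Here `viewing_times` is a method in the sense defined above: a procedure placed inside the class that describes something a `ClickStream` object can do. Without timestamps the same records would still give the navigation sequence, but no viewing times.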