Statistical Web Object Extraction
Jun Zhu (Tsinghua University, China), Zaiqing Nie (Web Search and Mining Group Microsoft Research Asia, China) and Bo Zhang (Tsinghua University, China)
Copyright: © 2009
The World Wide Web is a vast and rapidly growing repository of information. Various kinds of objects, such as products, people, and conferences, are embedded in both statically and dynamically generated Web pages. Extracting information about real-world objects is a key technique for Web mining systems. For example, object-level search engines such as Libra (http://libra.msra.cn) and Rexa (http://rexa.info), which help researchers find academic information such as papers, conferences, and researchers' personal information, rely entirely on structured Web object information. However, extracting object information from diverse Web pages is a challenging problem. Traditional methods are mainly template-dependent and thus do not scale to the huge number of Web pages. Furthermore, many methods are based on heuristic rules and are therefore not robust enough.
Recent developments in statistical machine learning make it possible to develop advanced statistical Web object extraction models. One key difference between Web object extraction and traditional information extraction from natural language text documents is that Web pages carry plenty of structure information, such as two-dimensional spatial layouts and hierarchical vision tree representations. Properly designed statistical models can leverage this information effectively. Another challenge of Web object extraction is that much of the text on Web pages does not consist of regular natural language sentences: it has some structure but lacks natural language grammar. Thus, existing natural language processing (NLP) techniques are not directly applicable. Fortunately, statistical Web object extraction models can easily be combined with statistical NLP methods, which have been a central theme in natural language processing over the last decades.
Thus, the structure information on Web pages can be leveraged to help process text content, and traditional NLP methods can be used to extract more features. Finally, Web object extraction from diverse and large-scale Web pages provides a valuable and challenging problem for machine learning researchers. To solve the problem well, new learning methodologies and new models (Zhu et al., 2007b) have to be developed.
Web object extraction is the task of identifying object information of interest in Web pages. Many methods have been proposed in the literature. Wrapper learning approaches such as (Muslea et al., 2001; Kushmerick, 2000) take some manually labeled Web pages as input and learn extraction rules (wrappers). Since the learned wrappers can only be used to extract data from similar pages, maintaining them as Web sites change requires substantial effort. Furthermore, in wrapper learning a user must provide explicit information about each template, so it is expensive to train a system that extracts data from many Web sites. The methods in (Zhao et al., 2005; Embley et al., 1999; Buttler et al., 2001; Chang et al., 2001; Crescenzi et al., 2001; Arasu & Garcia-Molina, 2003) do not need labeled training samples; they automatically produce wrappers from a collection of similar Web pages.
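To make the template dependence of wrappers concrete, the following is a minimal sketch of a delimiter-based wrapper. It is an illustration under stated assumptions, not the algorithm of any cited work: the function apply_wrapper, the rule format, and the two sample pages are all hypothetical. Each rule extracts the text between a fixed left and right delimiter, so the same rules work on one page layout and fail on another that shows the same object.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule format (an assumption for illustration):
# (field name, left delimiter, right delimiter).
Rule = Tuple[str, str, str]


def apply_wrapper(page: str, rules: List[Rule]) -> Dict[str, Optional[str]]:
    """Extract one record by scanning the page left to right.

    Each field is the text between its left and right delimiter.
    A field whose delimiters are not found is set to None.
    """
    record: Dict[str, Optional[str]] = {}
    pos = 0  # search resumes after the previously extracted field
    for field, left, right in rules:
        start = page.find(left, pos)
        if start == -1:
            record[field] = None
            continue
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            record[field] = None
            continue
        record[field] = page[start:end].strip()
        pos = end + len(right)
    return record


# Rules written for one (made-up) site template.
RULES: List[Rule] = [("name", "<b>", "</b>"), ("price", "<i>", "</i>")]

# Two made-up pages describing the same product with different markup.
page_a = "<li><b>Canon A80</b> <i>$299</i></li>"
page_b = "<div class='n'>Canon A80</div><span>$299</span>"

result_a = apply_wrapper(page_a, RULES)  # both fields are found
result_b = apply_wrapper(page_b, RULES)  # delimiters are absent, fields are None
```

The second page carries the same information, yet the rules extract nothing from it, which is why wrappers of this kind must be relearned or maintained for every template.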
Two general extraction methods are proposed in (Zhai & Liu, 2005; Lerman et al., 2004); they do not explicitly rely on the templates of Web sites. The method in (Lerman et al., 2004) segments data on list pages using the information contained in their detail pages, while the method in (Zhai & Liu, 2005) mines data records by string matching and also incorporates some visual features to achieve better performance. However, the data extracted by these two methods have no semantic labels.