Abstract
The scale and scope of information on the Internet have expanded enormously over the past decade. The spread of increasingly intelligent Web-based services and applications has produced an enormous amount of potentially useful data of both commercial and non-commercial interest. While this growth has had a strong positive impact on global economic, social, and political development, it also implies a flood of information into an increasingly complex information space. The information covers a vast variety of topics and originates from a vast variety of sources, ranging from private Web sites containing different kinds of information to business-to-business (B2B) platforms. In most cases, these data are unsorted and unstructured, which makes efficient, target-oriented information retrieval very hard, if not nearly impossible. This lack of transparency can be remedied by intelligent software agents, also referred to as softbots, which, like commonly used search engines, guide users through finding, sorting, and filtering the data accruing on the Internet.
Key Terms in this Chapter
Web Usage Mining: Based on the interaction of Internet users with Web sites, Web usage mining deals with the identification of commercially valuable information in order to create personalized Web pages or provide enhanced search engines.
Authorities and Hubs: Based on an information-centric view, the Internet in general can be sub-structured into two main kinds of Web pages: authorities, which provide useful information about the topic searched on, and hubs, which contain pointers to high-quality information sources.
Data Mining: (Semi-)automatic and systematic exploration and extraction of previously unknown information that accrues within large data pools.
Web Content Mining: The extraction of specific information from unstructured raw text of unknown structure is referred to as Web content mining. A set of information extraction tools, such as text extraction and wrapper induction, is used to identify and collect content items.
Web Structure Mining: Web structure mining comprises several integral subtopics, such as graph structures and graph searching, as well as content categorization and classification techniques, which lay a sound foundation for exploring, extracting, and analyzing Web data.
Training Classifier: To cope with the high degree of freedom inherent in Internet data, machine learning systems need sensible categories as a clear basis, such as special sets of words and phrases (training classifiers). Training classifiers are also essential to automated spam filtering.
Wrapper: Given highly standardized, regular, validated, and tag-consistent source pages, wrappers can map content in the pages’ source code to specific content attributes and thereby extract items from structured HTML or XML content.
Web Intelligence: Web intelligence denotes the extraction, exploration, and utilization of unstructured data accruing on the Internet by means of techniques known from Web mining, which is subdivided into Web structure mining, Web content mining, and Web usage mining.
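
The distinction between authorities and hubs defined among the key terms can be made concrete with a small link-analysis sketch. The code below follows the general idea of Kleinberg's HITS algorithm: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. It is a minimal illustration under stated assumptions, not necessarily the procedure used in this chapter. The function name `hits`, the fixed iteration count, and the toy link graph with its page names are all hypothetical.

```python
from math import sqrt


def hits(links, iterations=50):
    """Compute hub and authority scores by repeated mutual reinforcement.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical toy representation of a Web graph).
    """
    # Collect every page that appears as a link source or a link target.
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages pointing here.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub update: sum the authority scores of all pages pointed to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}

        # Normalize both score vectors to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm

    return hub, auth


# Hypothetical toy graph: two link collections pointing at two content pages.
toy_links = {
    "portal": ["paper_a", "paper_b"],
    "blog": ["paper_a"],
    "paper_a": [],
    "paper_b": [],
}

hub, auth = hits(toy_links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print("strongest authority:", top_authority)
print("strongest hub:", top_hub)
```

In this toy graph the content page that receives the most links ends up as the strongest authority, while the page that points to the most content pages ends up as the strongest hub. A real Web structure mining system would first restrict the graph to pages relevant to a query and would iterate until the scores converge rather than for a fixed number of rounds.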