Visual Analytics and Interactive Technologies: Data, Text and Web Mining Applications

Qingyu Zhang (Arkansas State University, USA), Richard S. Segall (Arkansas State University, USA) and Mei Cao (University of Wisconsin-Superior, USA)
Indexed In: SCOPUS
Release Date: October, 2010|Copyright: © 2011 |Pages: 362
ISBN13: 9781609601027|ISBN10: 1609601025|EISBN13: 9781609601041|DOI: 10.4018/978-1-60960-102-7

Description

Large volumes of data and complex problems inspire research in computing and data, text, and Web mining. However, analyzing data is not sufficient, as it has to be presented visually with analytical capabilities.

Visual Analytics and Interactive Technologies: Data, Text and Web Mining Applications is a comprehensive reference on concepts, algorithms, theories, applications, software, and visualization of data mining, text mining, Web mining and computing/supercomputing. This publication provides a coherent set of related works on the state-of-the-art of the theory and applications of mining, making it a useful resource for researchers, practitioners, professionals and intellectuals in technical and non-technical fields.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Data mining techniques for outlier detection
  • Database analysis with ANNs
  • Design of specialized biological databases
  • Effective Web personalization
  • Feature selection methods for knowledge discovery
  • Interactive visual clustering
  • Ontology-based framework to extract external Web data
  • Visual analytic system for frequent set mining
  • Visual survey analysis
  • Web mining and social network analysis

Reviews and Testimonials

This is a comprehensive book on concepts, algorithms, theories, applications, software, and visualization of data mining and computing. It provides a coherent set of related works on the state of the art in the theory and applications of mining and its relations to computing, visualization, and other fields. It is intended for researchers, practitioners, professionals, and intellectuals in technical and non-technical fields, and should appeal to a multi-disciplinary audience. Because each chapter is designed to be stand-alone, readers can focus on the topics that most interest them.

– Qingyu Zhang, Arkansas State University, USA; Richard S. Segall, Arkansas State University, USA; and Mei Cao, University of Wisconsin-Superior, USA

Table of Contents and List of Contributors

Preface

Large volumes of data and complex problems inspire research in computing and data, text, and web mining. However, analyzing data is not sufficient, as it has to be presented visually with analytical capabilities, i.e., a chart, diagram, or image illustration that enables humans to perceive, relate, and conclude in the knowledge discovery process. In addition, how to use computing or supercomputing techniques (e.g., distributed, parallel, and clustered computing) to improve the effectiveness of data, text, and web mining is an important aspect of visual analytics and interactive technology. This book extends visual analytics by using the tools of data, web, and text mining and computing, and the associated software and technologies available today.

This is a comprehensive book on concepts, algorithms, theories, applications, software, and visualization of data mining and computing. It provides a coherent set of related works on the state of the art in the theory and applications of mining and its relations to computing, visualization, and other fields. It is intended for researchers, practitioners, professionals, and intellectuals in technical and non-technical fields, and should appeal to a multi-disciplinary audience. Because each chapter is designed to be stand-alone, readers can focus on the topics that most interest them.

With a unique collection of recent developments, novel applications, and techniques for visual analytics and interactive technologies, the sections of the book are Concepts, Algorithms, and Theory; Applications of Mining and Visualization; and Visual Systems, Software and Supercomputing, pertaining to Data mining, Web mining, Data Visualization, Mining for Intelligence, Supercomputing, Database, Ontology, Web Clustering, Classification, Pattern Recognition, Visualization Approaches, Data and Knowledge Representation, and Web Intelligence.

Section I consists of seven chapters on concepts, algorithms, and theory of mining and visualizations. 

Chapter I, Towards the Notion of Typical Documents in Large Collections of Documents, by Mieczyslaw A. Klopotek, Slawomir T. Wierzchom, Krzysztof Ciesielski, Michal Draminski, and Dariusz Czerski, focuses on how best to represent a typical document in a large collection of objects (i.e., documents). They propose a new measure of document similarity, GNGrank, inspired by the popular idea that links between documents reflect similar content. The idea is to create a rank measure based on the well-known PageRank algorithm, using document similarity to insert links between the documents. Various link-based similarity measures (e.g., PageRank) and GNGrank are compared in the context of identifying a typical document of a collection. The experimental results suggest that each algorithm measures a different aspect of the document space, and hence the respective degrees of typicality do not correlate.

Chapter II, Data Mining Techniques for Outlier Detection, by N. Ranga Suri, M Narasimha Murty, and G Athithan, highlights some of the important research issues that determine the nature of the outlier detection algorithm required for a typical data mining application. Detecting the objects in a data set with unusual properties is important, as such outlier objects often contain useful information on abnormal behavior of the system or its components described by the data set. They discuss issues including methods of outlier detection, size and dimensionality of the data set, and nature of the target application. They attempt to cover the challenges posed by large volumes of high-dimensional data, along with possible research directions, through a survey of various data mining techniques dealing with the outlier detection problem.

Chapter III, Using an Ontology-based Framework to Extract External Web Data for the Data Warehouse, by Charles Greenidge and Hadrian Peter, proposes a meta-data engine for extracting external data on the Web for data warehouses, forming a bridge between the data warehouse and search engine environments. This chapter also presents a framework named the semantic web application that facilitates semi-automatic matching of instance data from opaque web databases using ontology terms. The framework combines information retrieval, information extraction, natural language processing, and ontology techniques to produce a viable building block for semantic web applications. The application uses a query-modifying filter to maximize efficiency in the search process. The ontology-based model consists of a pre-processing stage aimed at filtering, a basic matching phase followed by more advanced matching phases, a combination of thresholds and a weighting that produces a matrix that is further normalized, and a labeling process that matches data items to ontology terms.

Chapter IV, Dimensionality Reduction for Interactive Visual Clustering: A Comparative Analysis, by P. Alagambigai and K. Thangavel, discusses VISTA as a Visual Clustering Rendering System that can include algorithmic clustering results and serve as an effective validation and refinement tool for irregularly shaped clusters. Interactive visual clustering methods allow users to partition a data set into clusters that are appropriate for their tasks and interests through an efficient visualization model, which requires effective human-computer interaction. This chapter addresses reliable human-computer interaction through dimensionality reduction by comparing three different dimensionality reduction methods: (1) Entropy Weighting Feature Selection (EWFS), (2) Outlier Score Based Feature Selection (OSFS), and (3) Contribution to the Entropy based Feature Selection (CEFS). The performance of the three feature selection methods was compared with clustering of the dataset using the whole set of features. The performance was measured with a popular validity measure, the Rand Index.
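
As a rough illustration of the validity measure named above, the Rand Index can be computed as the fraction of object pairs on which two clusterings agree (both place the pair together, or both place it apart). The sketch below is not taken from the chapter; the function name rand_index and the toy label lists are assumptions made for illustration only.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree.

    Hypothetical helper for illustration; not the chapter's implementation.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two objects to compare")
    # A pair counts as an agreement when both clusterings make the same
    # together/apart decision for it.
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: cluster labels from the full feature set versus a reduced one.
full_features = [0, 0, 1, 1]
reduced_features = [1, 1, 0, 0]  # same partition, different label names
score = rand_index(full_features, reduced_features)
```

Because the measure looks only at pairwise co-membership, renaming the clusters does not change the score, which is why it suits comparing a clustering on selected features against one on the whole feature set.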

Chapter V, Database Analysis with ANNs by Means of Graph Evolution, by Daniel Rivero, Julián Dorado, Juan R. Rabuñal, and Alejandro Pazos, proposes a new graph evolution based technique for developing ANNs and compares it with other systems such as Connectivity Matrix, Pruning, Finding network parameters, and Graph-rewriting grammar. Traditionally, the development of Artificial Neural Networks (ANNs) is a slow process guided by expert knowledge. This chapter describes a new method that makes the development of Artificial Neural Networks completely automated. Several tests were performed with some of the most widely used test databases in data mining. The performance of the proposed system is better than or on par with that of the other systems.

Chapter VI, An Optimal Categorization of Feature Selection Methods for Knowledge Discovery, by Harleen Kaur, Ritu Chauhan, and M. A. Alam, focuses on several feature selection methods and their effectiveness in preprocessing input medical data. Feature selection is an active research area in the pattern recognition and data mining communities. They evaluate several feature selection algorithms such as Mutual Information Feature Selection (MIFS), Fast Correlation-Based Filter (FCBF), and Stepwise Discriminant Analysis (STEPDISC) with the naive Bayesian machine learning algorithm and linear discriminant analysis techniques. The experimental analysis of feature selection techniques on medical databases shows that a small number of informative features can be extracted, leading to improvement in medical diagnosis by reducing the size of the data set, eliminating irrelevant features, and decreasing the processing time.

Chapter VII, From Data to Knowledge: Data Mining, by Tri Kurniawan Wijaya, conceptually discusses the techniques used to mine hidden information or knowledge that lies in data. In addition to elaborating the concept and theory, the chapter also discusses the application and implementation of data mining. It starts with the differences among data, information, and knowledge, then proceeds to describe the process of gaining the hidden knowledge, and compares data mining with other closely related terms such as data warehouse and OLAP.

Section II consists of five chapters on applications of mining and visualizations.

Chapter VIII, Patent Infringement Risk Analysis Using Rough Set Theory, by Chun-Che Huang, Tzu-Liang (Bill) Tseng, and Hao-Syuan Lin, applies rough set theory (RST), which is suitable for processing qualitative information, to induce rules and derive significant attributes for categorization of the patent infringement risk. Patent infringement risk is an important issue for firms due to the increased appreciation of intellectual property rights. If a firm gives insufficient protection to its patents, it may lose both profits and industry competitiveness. Rather than focusing on measuring the patent trend indicators and the patent monetary value, they integrate RST with the concept hierarchy and the credibility index to enhance application of the final decision rules.

Chapter IX, Visual Survey Analysis in Marketing, by Marko Robnik-Šikonja and Koen Vanhoof, makes use of the ordinal evaluation (OrdEval) algorithm as a visualization technique to study questionnaire data on customer satisfaction in marketing. The OrdEval algorithm has many favorable features, including context sensitivity, the ability to exploit the meaning of ordered features and ordered responses, robustness to noise and missing values in the data, and visualization capability. They choose customer satisfaction analysis as a case study and present visual analysis of two applications, business-to-business and customer-to-business. They demonstrate some interesting advantages offered by the new methodology and visualization, and show how to extract and interpret new insights not available with the classical analytical toolbox.

Chapter X, Assessing Data Mining Approaches for Analyzing Actuarial Student Success Rate, by Alan Olinsky, Phyllis Schumacher, and John Quinn, uses several types of predictive models to evaluate the student retention rate and enrollment management for those selecting a major in Actuarial Science at a medium-sized university. The predictive models utilized in this research include stepwise logistic regression, neural networks, and decision trees. This chapter uses data mining to investigate the percentage of students who begin in a certain major and graduate in the same major. This information is important for individual academic departments in determining how to allocate limited resources when deciding on the appropriate number of classes and sections to be offered and the number of faculty lines needed to staff the department. The study analyzes the characteristics of students who enroll as actuarial mathematics students and then either drop out of the major or graduate as actuarial students.

Chapter XI, A Robust Biclustering Approach for Effective Web Personalization, by H. Hannah Inbarani and K. Thangavel, proposes a robust Biclustering (RB) algorithm to disclose the correlation between users and pages based on constant values for integrating user clustering and page clustering techniques, which is followed by a recommendation system that can respond to the users’ individual interests. The proposed method is compared with the Simple Biclustering (SB) method. To evaluate the effectiveness and efficiency of the recommendation, experiments are conducted in terms of the recommendation accuracy metric. The experimental results demonstrate that the proposed RB method is very simple and is able to efficiently extract needed usage knowledge and to accurately make web recommendations.

Chapter XII, Web Mining and Social Network Analysis, by Roberto Marmo, reviews and discusses the use of web mining techniques and social network analysis to process and analyze large amounts of social data from sources such as blog tagging, online game playing, and instant messaging. Social network analysis views social relationships in terms of network and graph theory, with nodes (individual actors within the network) and ties (relationships between the actors). In this way, social network mining can help in understanding social structure, social relationships, and social behaviours. These algorithms differ from the established set of data mining algorithms developed to analyze individual records, since social network datasets are relational and center on the relations among entities.
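
To make the node-and-tie view above concrete, the following minimal sketch represents a small social network as a list of ties and computes normalized degree centrality, one common way to ask which actor is most connected. It is an illustration under stated assumptions, not the chapter's method: the function name degree_centrality and the actor names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(ties):
    """Normalized degree centrality for an undirected network.

    `ties` is an iterable of (actor, actor) pairs. Hypothetical helper
    written for illustration only.
    """
    neighbors = defaultdict(set)
    for a, b in ties:
        if a == b:
            continue  # ignore self-ties in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible number of neighbors (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Toy network: ann is tied to everyone else, who are tied only to ann.
ties = [("ann", "bob"), ("ann", "cat"), ("ann", "dan")]
centrality = degree_centrality(ties)
```

Note that the input is a set of relations rather than a table of independent records, which reflects the relational character of social network data described above.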

Section III consists of five chapters on visual systems, software and supercomputing.

Chapter XIII, iVAS: An Interactive Visual Analytic System for Frequent Set Mining, by Carson Kai-Sang Leung and Christopher L. Carmichael, proposes an interactive visual analytic system called iVAS for providing visual analytic solutions to the frequent set mining problem. The system enables the visualization and advanced analysis of the original transaction databases as well as the frequent sets mined from these databases. Numerous algorithms have been proposed for finding frequent sets of items, which are usually presented in a lengthy textual list. However, the use of visual representations can enhance user understanding of the inherent relations among the frequent sets.
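
For readers unfamiliar with the frequent set mining problem mentioned above, the sketch below counts, by brute force, every set of items that appears in at least a minimum number of transactions. It only illustrates the problem and the kind of textual list such mining produces; it is not the iVAS system or any algorithm from the chapter, and the function name frequent_itemsets, the min_support parameter, and the toy transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Brute-force enumeration, suitable only for tiny examples; hypothetical
    helper for illustration.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every non-empty subset of this transaction once.
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Toy transaction database.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]]
frequent = frequent_itemsets(transactions, min_support=2)
```

Even on this tiny database the result is a flat dictionary of sets and counts; on real data such lists grow long quickly, which is the motivation for visual representations.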

Chapter XIV, Mammogram Mining Using Genetic Ant-Miner, by Thangavel. K. and Roselin. R, applies a classification algorithm to image processing (e.g., mammogram processing) using genetic Ant-Miner. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the images. It is an extension of data mining to the image domain and an interdisciplinary endeavor. The C4.5 and Ant-Miner algorithms are compared, and the experimental results show that Ant-Miner performs better in the domain of biomedical image analysis.

Chapter XV, Use of SciDBMaker as Tool for the Design of Specialized Biological Databases, by Riadh Hammami and Ismail Fliss, develops SciDBMaker to provide a tool for easy building of new specialized protein knowledge bases. The exponential growth of molecular biology research in recent decades has brought growth in the number and size of genomic and proteomic databases to enhance the understanding of biological processes.  This chapter also suggests best practices for specialized biological databases design, and provides examples for the implementation of these practices.

Chapter XVI, Interactive Visualization Tool for Analysis of Large Image Databases, by Anca Doloc-Mihu, discusses an Adaptive Image Retrieval System (AIRS) that is used as a tool for actively searching for information in large image databases. This chapter identifies two types of users for an AIRS: an end-user who seeks images and a research-user who designs and researches the collection and retrieval systems. This chapter focuses on visualization techniques used by Web-based AIRS to allow different users to efficiently navigate, search, and analyze large image databases. Recent advances in Internet technology require the development of advanced Web-based tools for efficiently accessing images from tremendously large, and continuously growing, image collections. One such tool for actively searching for information is an Image Retrieval System. The interface discussed in this chapter illustrates different relationships between images by using visual attributes (colors, shape, and proximities), and supports retrieval and learning, as well as browsing, which makes it suitable for an Adaptive Image Retrieval System.

Chapter XVII, Supercomputers and Supercomputing, by Jeffrey S. Cook, describes the supercomputer as the fastest type of computer, used for specialized applications that require a massive number of mathematical calculations. The term “supercomputer” was coined in 1929 by the New York World, referring to tabulators manufactured by IBM. These tabulators represent the cutting edge of technology, which harness immense processing power so that they are incredibly fast, sophisticated, and powerful. The use of supercomputing in data mining is also discussed in the chapter.

All chapters went through a blind refereeing process before final acceptance. We hope these chapters are informative, stimulating, and helpful to the readers.

Author(s)/Editor(s) Biography

Qingyu Zhang received his Ph.D. in Manufacturing Management and Engineering from the College of Business Administration of the University of Toledo. He is a Certified Fellow in Production and Inventory Management (CFPIM) by APICS. He is also certified MCSD, MCSE, and MCDBA by Microsoft. He is an associate professor at Arkansas State University. He has published in European Journal of Operational Research, International Journal of Production Research, Journal of Operations Management, International Journal of Production Economics, Kybernetes: International Journal of Systems and Cybernetics, Industrial Management & Data Systems, International Journal of Operations and Production Management, International Journal of Logistics Management, Journal of Systems Science and Systems Engineering, International Journal of Product Development, International Journal of Quality and Reliability Management, European Journal of Innovation Management, and International Journal of Information Technology and Decision Making. Dr. Zhang’s research interests are supply chain management, value chain flexibility, e-commerce, product development, and data mining. He serves on the Editorial Boards of Journal of Computer Information Systems, Information Resource Management Journal, International Journal of Data Analysis Techniques and Strategy, and International Journal of Information Technology Project Management.

Dr. Richard S. Segall is a Professor of Computer & Information Technology in the College of Business at Arkansas State University in Jonesboro, AR and also teaches in the Master of Engineering Management (MEM) Program in the College of Agriculture, Engineering & Technology. He is also Affiliated Faculty at the University of Arkansas at Little Rock (UALR) where he serves on thesis committees. He holds a Bachelor of Science and Master of Science in Mathematics as well as a Master of Science in Operations Research and Statistics from Rensselaer Polytechnic Institute in Troy, New York. He also holds a PhD in Operations Research from the University of Massachusetts at Amherst. He has served on the faculty of Texas Tech University, University of Louisville, University of New Hampshire, University of Massachusetts-Lowell, and West Virginia University. His research interests include data mining, text mining, web mining, database management, Big Data, and mathematical modeling.

Dr. Segall’s publications have appeared in numerous journals including International Journal of Information Technology and Decision Making (IJITDM), International Journal of Information and Decision Sciences (IJIDS), Applied Mathematical Modelling (AMM), Kybernetes: The International Journal of Cybernetics, Systems and Management Sciences, Journal of the Operational Research Society (JORS) and Journal of Systemics, Cybernetics and Informatics (JSCI). He has published book chapters in Encyclopedia of Data Warehousing and Mining, Handbook of Computational Intelligence in Manufacturing and Production Management, Handbook of Research on Text and Web Mining Technologies, Encyclopedia of Information Science & Technology, and Encyclopedia of Business Analytics & Optimization.

Dr. Segall is a member of the Arkansas Center for Plant-Powered-Production (P3), and on the Editorial Board of the International Journal of Data Mining, Modelling and Management (IJDMMM) and International Journal of Data Science (IJDS), and served as Local Arrangements Chair of the MidSouth Computational Biology & Bioinformatics Society (MCBIOS) Conference that was hosted at Arkansas State University.

His research has been funded by the National Research Council (NRC), U.S. Air Force (USAF), National Aeronautics and Space Administration (NASA), Arkansas Biosciences Institute (ABI), and Arkansas Science & Technology Authority (ASTA). He is the recipient of several Session Best Paper awards at World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI) conferences. He is co-editor of two other books published by IGI Global: Visual Analytics and Interactive Technologies: Data, Text and Web Mining Applications in 2011 and Research and Applications in Global Supercomputing in 2015. Dr. Segall is the recipient of the Arkansas State University College of Business Faculty Award for Excellence in Research in 2015.

Mei Cao is an Assistant Professor at the University of Wisconsin-Superior. She received her Ph.D. in Manufacturing Management and Engineering from the College of Business Administration of the University of Toledo. She has publications in various academic journals such as European Journal of Operational Research, International Journal of Production Research, International Journal of Operations and Production Management, Journal of Systems Science and Systems Engineering, Information & Management, Industrial Management & Data Systems, International Journal of Product Development, and International Journal of Services Technology and Management. She serves on the editorial boards of International Journal of Operations Research and Information Systems, and International Journal of Information Technology Project Management. She received the Distinguished Paper Award from McGraw-Hill/Irwin at the Midwest Business Administration Association Annual Meeting in Chicago in 2004. Her research interests include Supply Chain Management, Transportation and Logistics, Flexibility, and Inter-Organizational Information Systems. Her research has been funded by the National Center for Freight and Infrastructure Research and Education (CFIRE) and Midwest Regional University Transportation Center (MRUTC) under the sponsorship of the Department of Transportation.
