Computing Dense Cubes Embedded in Sparse Data

Computing Dense Cubes Embedded in Sparse Data

Lixin Fu (University of North Carolina at Greensboro, USA)
Copyright: © 2007 |Pages: 24
DOI: 10.4018/978-1-59904-271-8.ch002
OnDemand PDF Download:
$37.50

Abstract

In high-dimensional data sets, both the number of dimensions and the cardinalities of the dimensions are large and data is often very sparse, that is, most cubes are empty. For such large data sets, it is a well-known challenging problem to compute the aggregation of a measure over arbitrary combinations of dimensions efficiently. However, in real-world applications, users are usually not interested in all the sparse cubes, most of which are empty or contain only one or few tuples. Instead, they focus more on the “big picture” information the highly aggregated data, where the “where clauses” of the SQL queries involve only few dimensions. Although the input data set is sparse, this aggregate data is dense. The existing multi-pass, full-cube computation algorithms are prohibitively slow for this type of application involving very large input data sets. We propose a new dynamic data structure called Restricted Sparse Statistics Tree (RSST) and a novel cube evaluation algorithm, which are especially well suited for efficiently computing dense sub-cubes imbedded in high-dimensional sparse data sets. RSST only computes the aggregations of non-empty cube cells where the number of non-star coordinates (i.e., the number of group by attributes) is restricted to be no more than a user-specified threshold. Our innovative algorithms are scalable and I/O efficient. RSST is incrementally maintainable, which makes it suitable for data warehousing and the analysis of streaming data. We have compared our algorithms with top, state-of-the-art cube computation algorithms such as Dwarf and QCT in construction times, query response times, and data compression. Experiments demonstrate the excellent performance and good scalability of our approach.

Complete Chapter List

Search this Book:
Reset
Table of Contents
Preface
David Taniar
Chapter 1
Torben Pedersen, Jesper Thorhauge, Søren Jespersen
Enormous amounts of information about Web site user behavior are collected in Web server logs. However, this information is only useful if it can be... Sample PDF
Combining Data Warehousing and Data Mining Techniques for Web Log Analysis
$37.50
Chapter 2
Lixin Fu
In high-dimensional data sets, both the number of dimensions and the cardinalities of the dimensions are large and data is often very sparse, that... Sample PDF
Computing Dense Cubes Embedded in Sparse Data
$37.50
Chapter 3
Karlton Sequeira, Mohammed J. Zaki
Very often, related data may be collected by a number of sources, which may be unable to share their entire datasets for reasons like... Sample PDF
Exploring Similarities Across High-Dimensional Datasets
$37.50
Chapter 4
Irene Ntoutsi, Nikos Pelekis, Yannis Theodoridis
Many patterns are available nowadays due to the widespread use of knowledge discovery in databases (KDD), as a result of the overwhelming amount of... Sample PDF
Pattern Comparison in Data Mining: A Survey
$37.50
Chapter 5
Fedja Hadzic, Tharam Dillon, Henry Tan, Ling. Feng, Elizabeth Chang
Association rule mining is one of the most popular pattern discovery methods used in data mining. Frequent pattern extraction is an essential step... Sample PDF
Mining Frequent Patterns Using Self-Organizing Map
$37.50
Chapter 6
Mafruz Ashrafi, David Taniar, Kate Smith
Association rule mining is one of the most widely used data mining techniques. To achieve a better performance, many efficient algorithms have been... Sample PDF
An Efficient Compression Technique for Vertical Mining Methods
$37.50
Chapter 7
Alex Freitas, André Carvalho
In machine learning and data mining, most of the works in classification problems deal with flat classification, where each instance is classified... Sample PDF
A Tutorial on Hierarchical Classification with Applications in Bioinformatics
$37.50
Chapter 8
Daniel Wu, Xiaohua Hu
In this chapter, we report a comprehensive evaluation of the topological structure of protein-protein interaction (PPI) networks, by mining and... Sample PDF
Topological Analysis and Sub-Network Mining of Protein-Protein Interactions
$37.50
Chapter 9
Yong Shi, Yi Peng, Gang Kou, Zhengxin Chen
This chapter provides an overview of a series of multiple criteria optimization-based data mining methods, which utilize multiple criteria... Sample PDF
Introduction to Data Mining Techniques via Multiple Criteria Optimization Approaches and Applications
$37.50
Chapter 10
Xiuju Fu, Lipo Wang, GihGuang Hung, Liping Goh
Classification decisions from linguistic rules are more desirable compared to complex mathematical formulas from support vector machine (SVM)... Sample PDF
Linguistic Rule Extraction from Support Vector Machine Classifiers
$37.50
Chapter 11
Graph-Based Data Mining  (pages 291-307)
Wenyuan Li, Wee-Keong Ng, Kok-Leong Ong
With the most expressive representation that is able to characterize the complex data, graph mining is an emerging and promising domain in data... Sample PDF
Graph-Based Data Mining
$37.50
Chapter 12
Richi Nayak
Web services have recently received much attention in businesses. However, a number of challenges such as lack of experience in estimating the... Sample PDF
Facilitating and Improving the Use of Web Services with Data Mining
$37.50
About the Authors