Parallel, Distributed, and Grid-Based Data Mining: Algorithms, Systems, and Applications

Moez Ben HajHmida, Antonio Congiusta

Source Title: Grid and Cloud Computing: Concepts, Methodologies, Tools and Applications

DOI: 10.4018/978-1-4666-0879-5.ch110

Abstract

Knowledge discovery has become a necessary task in scientific, life sciences, and business fields, both because of the growing amount of data being collected and because of the complexity of the analyses that need to be performed on it. Classic data mining techniques, developed for centralized sites, often prove inadequate because of some unique characteristics of today’s data sources. In such cases, sequential approaches to data mining cannot scale with data dimensionality and size, or deliver acceptable runtime performance. Moreover, the increasing trend towards decentralized business organizations and the distribution of users, software, and hardware systems magnifies the need for more advanced and flexible approaches and solutions. Life science is one of the application areas that best fits such a scenario. This chapter presents the state of the art of the major data mining techniques, systems, and approaches. A detailed taxonomy is drawn by analyzing and comparing parallel, distributed, and Grid-based data mining methods, with a particular focus on the exploitation of large and remotely dispersed datasets and/or high-performance computers.

Chapter Preview

Introduction

Data mining aims at extracting hidden information from large data repositories (e.g., databases, file archives, digital libraries) for building valuable knowledge patterns and predictive models. The main data mining tasks are association rules discovery, classification and clustering.

Data mining is a massive computing task that deals with memory-resident data. With the huge amounts of data stored in centralized or distributed systems, traditional data mining techniques encounter limitations and shortcomings that often lead to inefficiencies. Parallel and distributed computing becomes necessary both to deal with large-scale data mining (Freitas, 1998; Kargupta, 2000) and to address the complex needs and scenarios encountered in business as well as research organizations.

Parallel Data Mining (PDM) targets tightly-coupled systems, such as shared or distributed memory machines and clusters based on fast networks. Distributed Data Mining (DDM) deals with loosely-coupled systems: clusters with moderately fast or slow networks and geographically distributed computing nodes. The main differences between PDM and DDM are the number of involved computing nodes, the communication costs, and the degree of data distribution.

Advances in network technologies have produced huge amounts of data stored in geographically distributed databases and repositories. When these data are owned by different organizations and hosted by non-dedicated computing resources, parallel and distributed data mining techniques start showing their limits. The Grid is the computing architecture that provides means for utilizing geographically distributed resources as a single meta-system. The emergence of such a new infrastructure is highly beneficial to large-scale and compute-intensive data mining, as it offers new opportunities to optimize and speed up mining processes.

Unlike previous parallel and distributed data mining surveys (Kargupta, 2000; Zaki, 2000), this chapter differentiates between:

• PDM, where learning methods are often platform dependent, databases are centralized (for example in a cluster or a supercomputer), and network connections are reliable and fast;
• DDM, in which learning methods are based on shared-nothing machines with a much slower network and databases are naturally distributed;
• and, in addition, the Grid Data Mining (GDM) category of algorithms and systems, which is explored in depth.

Although GDM shares many commonalities with PDM and DDM, its platform peculiarities and requirements mean that the efforts and results obtained in this area cannot be compared in a homogeneous way with those achieved by PDM and DDM.

The remainder of this chapter is organized as follows. Section 2 contains a background on classical data mining techniques. Section 3 presents parallel systems and the related programming paradigms; it details the techniques used in parallel data mining and classifies them on the basis of the employed method. Section 4 presents the main differences between parallel and distributed data mining techniques, then discusses the major distribution methods and their transition from parallel to distributed systems. Section 5 describes the knowledge discovery process in Grid environments, focusing on Grid infrastructures and frameworks designed for that purpose. Finally, Section 6 draws conclusions and highlights some future trends.

Background

Association rules discovery aims at finding all the itemsets (sets of attributes) in a database that frequently occur together, the so-called frequent itemsets, and the association rules derived from them. The main algorithm for this data mining task is Apriori (Agrawal, 1996), an iterative algorithm that needs multiple scans of the database. If n is the number of items (attributes), the complexity of the Apriori algorithm is exponential in the worst case, i.e. O(2ⁿ), since every subset of the items is a potential itemset. To speed up the algorithm, the number of generated candidates and the number of database scans have to be reduced.
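The level-wise search described above can be illustrated with a minimal sequential sketch. It is not the authors' code: the function name `apriori`, the `min_support` parameter (an absolute count), and the in-memory list of transactions are assumptions made for illustration. Each iteration generates candidate k-itemsets from the frequent (k-1)-itemsets, prunes those with an infrequent subset, and then scans the database once to count the survivors.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    transactions: iterable of iterables of hashable items (assumed to fit in memory).
    min_support: minimum number of transactions that must contain an itemset.
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # One database scan per level to count the remaining candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The two costs named in the text are visible here: the size of `candidates` at each level and the one pass over `transactions` per level. Parallel and distributed variants typically attack one or both of them.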

Classification aims at assigning data items to one of n predefined categorical classes. Since the category being predicted is pre-labeled, classification is also known as supervised induction. There are several classification techniques, such as decision trees (Quinlan, 1993), rule induction (Cohen, 1995), neural networks (Lippmann, 1987), Bayesian networks (Guan, 1991), support vector machines (Boser, 1992), and evolutionary algorithms (Freitas, 2002).
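As a small illustration of supervised induction, the sketch below learns a one-level decision tree (a "stump") from pre-labeled examples and uses it to classify new items. It is a deliberately simplified assumption, not one of the cited algorithms: the feature to split on is chosen by the caller rather than selected by an impurity measure, and the names `train_stump` and `predict` are hypothetical.

```python
from collections import Counter


def train_stump(rows, labels, feature):
    """Learn a one-level decision tree on a single categorical feature.

    rows: list of dicts mapping feature names to categorical values.
    labels: list of class labels, one per row (the pre-labeled categories).
    feature: name of the feature to split on (chosen by the caller here).
    """
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], Counter())[label] += 1
    # Each observed feature value predicts the majority class seen with it.
    rules = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
    # Unseen feature values fall back to the overall majority class.
    default = Counter(labels).most_common(1)[0][0]
    return rules, default


def predict(stump, row, feature):
    """Classify one row with a stump returned by train_stump."""
    rules, default = stump
    return rules.get(row[feature], default)
```

Training amounts to counting labels per feature value, which is the same kind of aggregate that parallel and distributed classifiers compute over data partitions.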
