Clustering Algorithms for Data Streams

Christos Makris; Nikos Tsirakis

doi:10.4018/978-1-60566-026-4.ch092

Source Title: Encyclopedia of Information Science and Technology, Second Edition

Abstract

The World Wide Web has rapidly become the dominant Internet tool, overwhelming us with a combination of rich hypertext, multimedia data, and various sources of dynamic information. This evolution, together with the immense amount of available information, creates a need for new computational methods and techniques that can systematically extract useful information from billions of Web pages. In other words, deriving knowledge from Web-based information poses great challenges. The area of data mining has arisen over the last decade to address this type of issue. It comprises many methods, techniques, and algorithms that accomplish different tasks, all of which examine the data and try to find a model that fits the data's characteristics. Data can be either conventional information from files, databases, and so forth, or take the form of a stream. Streams constitute a data model in which information is an undifferentiated, byte-by-byte flow that passes over time. Algorithms for processing data streams, and their associated applications, have become an emerging area of interest, especially when the processing is done over the Web. Many data mining functions (Tan, Steinbach, & Kumar, 2006) can be applied to data streams. Among them is clustering, which belongs to the descriptive data mining models and is a useful and ubiquitous tool in data analysis.

Background

Data Mining and Knowledge Discovery

Classic algorithms handle small amounts of data well but run into performance problems when the data volume is huge. For example, a sorting algorithm that runs efficiently on a few megabytes of data may struggle with several gigabytes. Methods such as clustering and classification have been widely studied in the data mining community, but the majority of them may not work effectively on data streams. The reason is that data streams deliver huge volumes of data and at the same time require online mining, in which the data are mined in a continuous fashion. More generally, traditional algorithms face many specific problems of this kind.

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It also opens new opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. Data mining is an integral part of knowledge discovery in databases (KDD), and the two terms are often used interchangeably (Dunham, 2003). Over the last few years, however, KDD has come to refer to a process consisting of many phases, of which data mining is only one. Below are some definitions of knowledge discovery in databases and data mining (Fayyad, Piatetsky-Shapiro, & Smyth, 1996a, 1996b).

• Knowledge discovery in databases (KDD): The process of finding useful information and patterns in data.

Knowledge discovery in databases is a process that involves five different phases, which are listed below (Dunham, 2003):

1. Data selection
2. Data preprocessing
3. Data transformation
4. Data mining
5. Data interpretation/evaluation
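
The five phases can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the record layout, and the choice of a z-score outlier check as the "mining" step are all hypothetical and do not come from Dunham (2003).

```python
from statistics import mean, pstdev


def select(records, fields):
    """Phase 1, data selection: keep only the fields relevant to the task."""
    return [{f: r.get(f) for f in fields} for r in records]


def preprocess(records):
    """Phase 2, data preprocessing: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records, field):
    """Phase 3, data transformation: standardize one numeric field to z-scores."""
    values = [r[field] for r in records]
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def mine(zscores, threshold=1.0):
    """Phase 4, data mining: a toy step that flags unusually large or small values."""
    return [i for i, z in enumerate(zscores) if abs(z) > threshold]


def interpret(records, flagged):
    """Phase 5, interpretation/evaluation: report findings as source records."""
    return [records[i] for i in flagged]


def kdd(records, fields, field, threshold=1.0):
    """Run the five phases in order (hypothetical helper)."""
    clean = preprocess(select(records, fields))
    return interpret(clean, mine(transform(clean, field), threshold))
```

Calling `kdd(logs, ["user", "hits"], "hits")` on a list of dictionaries would return the records whose `hits` value lies more than one standard deviation from the mean. In a real process each phase is far richer, and the phases are often revisited iteratively.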

Data mining attempts to autonomously extract useful information or knowledge from large data stores or sets. It involves many different algorithms to accomplish different tasks. All these algorithms attempt to fit a model to the data. The algorithms examine the data and determine a model that is closest to the characteristics of the data being examined. These algorithms consist of three parts:

• Model: The purpose of the algorithm is to fit a model to the data.
• Preference: Some criterion must be used to prefer one model over another.
• Search: All algorithms require some technique to search the data.
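
To make the three parts concrete, here is a minimal sketch that uses one-dimensional k-means as the example algorithm. That choice is an assumption made for illustration, as are all function names: the model is a list of k centroids, the preference criterion is the sum of squared errors (SSE), and the search is Lloyd's iterative heuristic restarted from several random initializations.

```python
import random


def assign(points, centroids):
    """Map each point to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in points]


def sse(points, centroids):
    """Preference: sum of squared distances to the nearest centroid (lower is better)."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)


def lloyd(points, centroids, iterations=20):
    """Search: alternate point assignment and centroid update until stable."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for j, c in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else c)
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit(points, k, restarts=5, seed=0):
    """Model: k centroids; keep the candidate model with the lowest SSE."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        model = lloyd(points, rng.sample(points, k))
        if best is None or sse(points, model) < sse(points, best):
            best = model
    return sorted(best)
```

For example, `fit([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], 2)` returns two centroids close to 1.0 and 9.0. Swapping in a different model family, criterion, or search strategy changes the algorithm but not this three-part structure.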

There are many different methods used to perform data mining tasks. These techniques not only require specific types of data structures, but also imply certain types of algorithmic approaches. Data mining tasks are generally divided into two different categories.

• Predictive tasks: These tasks predict the value of a particular attribute based on the values of other attributes. Predictive tasks include classification, regression, time series analysis and prediction.
• Descriptive tasks: Here, the objective is to derive patterns or relationships in data. Descriptive tasks include clustering, summarization, association rules and sequence discovery.

Key Terms in this Chapter

Web Mining: The application of data mining techniques to discover patterns from the Web. According to the analysis target, Web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.

Databases: A database is a collection of information stored in a computer in a systematic way, such that a computer program can consult it to answer questions. The software used to manage and query a database is known as a database management system (DBMS). The properties of database systems are studied in information science.

Knowledge Discovery: The process of finding novel, interesting, and useful patterns in data.

Clustering: Clustering is an algorithmic concept where data points occur in bunches, rather than evenly spaced over their range. A data set that tends to bunch only in the middle is said to possess centrality. Data sets that bunch in several places do not possess centrality. What they do possess has not been studied very much, and there are no infallible methods for locating and describing more than one cluster in a data set (the problem is much worse when some of the clusters overlap).

Synopsis Data Structures: Data structures that use very little space; they can be any data structures that are substantively smaller than their base data sets. The design and analysis of effective synopsis data structures offer many algorithmic challenges.

Data Streams: An undifferentiated, byte-by-byte flow of data. A data stream can be distinguished in practice from a block transfer, although the moving of blocks could itself be considered a “stream” (of coarser granularity).

Data Mining: The process of autonomously extracting useful information or knowledge from large data stores or sets. Data mining can be performed on a variety of data stores, including the World Wide Web, relational databases, transactional databases, internal legacy systems, PDF documents, and data warehouses.
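
The key terms clustering, data streams, and synopsis data structures meet in one-pass clustering. The sketch below uses sequential (online) k-means, a generic textbook technique chosen here as an assumption; it is not presented as the method this chapter describes. The class name and the rule of seeding centroids with the first k points are hypothetical. The synopsis consists of only k centroids and k counters, so memory use does not grow with the length of the stream.

```python
class OnlineKMeans:
    """One-pass clustering that stores k centroids and counts, never the stream."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one coordinate list per cluster
        self.counts = []     # number of points absorbed by each cluster

    def update(self, point):
        """Consume one stream element and return the index of its cluster."""
        # Assumption: the first k points seed the k centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Find the nearest centroid by squared Euclidean distance.
        j = min(
            range(self.k),
            key=lambda i: sum((x - c) ** 2
                              for x, c in zip(point, self.centroids[i])),
        )

        # Move that centroid toward the point so it stays the running mean
        # of every point assigned to it so far.
        self.counts[j] += 1
        rate = 1.0 / self.counts[j]
        self.centroids[j] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[j], point)]
        return j
```

Each arriving point is handled by a single call such as `model.update((0.5, 0.5))` and can then be discarded. The result depends on arrival order and on the seeding rule, which is one reason stream clustering is harder than clustering a stored data set.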
