Special Offers
- IGI Global’s New Emerging Topic e-Book Collections
  Acquire highly focused and affordable Cutting-Edge Peer-Reviewed Research Content through a selection of 17 topic-focused e-Book Collections discounted up to 90%, compared to list prices. Collection topics include Artificial Intelligence, Data Science, Language Learning, Marketing and Customer Relations, Sustainability, and many more. Hosted on the InfoSci^® platform, these collections feature no DRM, no additional cost for multi-user licensing, no embargo of content, full-text PDF & HTML format, and more.
  Learn More
- Open Access Book (Free Access) - Encyclopedia of Information Science and Technology, Sixth Edition (ISBN: 9781668473665)
  The Encyclopedia of Information Science and Technology, Sixth Edition) continues the legacy set forth by the first five editions by providing comprehensive coverage and up-to-date definitions of the most important issues, concepts, and trends pertaining to technological advancements and information management within a variety of settings and industries. The entire book is being published under open access.
  Read Now
- Open Access Book (Free Access) - Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries (ISBN: 9781668456293)
  Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries provides information on the recent technology, mitigation, and environmental protection that must be applied for food sustainability in developing countries. This book is being published under Platinum Open Access through funding from Diponegoro University, Indonesia.
  Read Now
- Open Access Book (Free Access) - New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY (ISBN: 9781668438091)
  The Walmart Corporation and the Lumina Foundation have provided funding to make New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY fully open access, completely removing any paywall between scholars in education and the latest research on new models for the future of higher education.
  Read Now
- Open Access Book (Free Access) - Handbook of Research on the Global View of Open Access and Scholarly Communications (ISBN: 9781799898054)
  Through a collaboration between IGI Global and the University of North Texas, the Handbook of Research on the Global View of Open Access and Scholarly Communications has been published as fully open access, completely removing any paywall between researchers of any field, and the latest research on the equitable and inclusive nature of Open Access and all of its complications.
  Read Now
Books
- - Books by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Books by Field
Journals
- - Journals
  - OnDemand Journal Articles
  - Journals by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Journals by Field
e-Collections
Open Access
- View All Open Access Opportunities
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Find an Open Access Journal for Your Next Manuscript
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Submit an Open Access Book Proposal
  Learn more about open access book publishing and how it can propel your research forward in the field.
  Convert Your Work to Open Access
  Already published? You can convert your work to open access to increase its impact through IGI Global’s Restrospective Open Access Program.
  Utilize Open Access Collection Database
  Open up your research potential by utilizing our open access content or integrating the open access collection into your library
  Consider Open Access Agreements
  For Libraries: consider no-cost or investment-level open access agreements with IGI Global to support your faculty's research endeavors.
  Search Funding Resources
  Looking for additional funding resources to support your open accesss endeavors? View industry resources compiled by our open access team.
  Review Open Access Policies & Ethical Guidelines
  Considering IGI Global to publish your work under open access? Review IGI Global’s open access policies and ethical guidelines
Publish with Us
Resources
- - Instructors
  - Course Adoption
  - Teaching Cases
  - K-12 Online Learning Collection
  - Authors and Editors
  - eEditorial Discovery^® System
  - Peer Review Process
  - Ethics and Malpractice
  - COPE Membership
  - Fair Use Policy
  - Open Access Publishing
  - FAQ
Catalogs
About Us
Newsroom

Document Clustering: A Summarized Survey

Harsha Patil, R. S. Thakur

Source Title: Information Retrieval and Management: Concepts, Methodologies, Tools, and Applications

DOI: 10.4018/978-1-5225-5191-1.ch003

OnDemand:

(Individual Chapters)

Available

$37.50

Current Special Offers

No Current Special Offers

Abstract

As we know use of Internet flourishes with its full velocity and in all dimensions. Enormous availability of Text documents in digital form (email, web pages, blog post, news articles, ebooks and other text files) on internet challenges technology to appropriate retrieval of document as a response for any search query. As a result there has been an eruption of interest in people to mine these vast resources and classify them properly. It invigorates researchers and developers to work on numerous approaches of document clustering. Researchers got keen interest in this problem of text mining. The aim of this chapter is to summarised different document clustering algorithms used by researchers.

Chapter Preview

Top

Introduction

Clustering is the process of subseting data objects. Subsets are based on characteristics of the objects. Objects with similar characteristics are come together and make a set. So, objects from one set have high similarity while objects from other set (C. C. Aggarwalet al, 2012). Example: Suppose we have collection of 10 objects. Here we consider shape characteristic of objects. So objects of same shape are come together in one set. This set is known as cluster and process of dividing similar objects in setwise is known as clustering. After partitioning we get three clusters of objects.

Figure 1.

Clustering based on OBJECTS shape

In clustering problem, defining concepts of Optimization criteria is very important. In above example partitioning objects on the basis of their shapes is very easy task, but this is not the always case. In reality, finding the properties of object by which we can make cluster is very difficult, basically it’s very important that objects which belongs to different cluster somehow must be more dissimilar to each other. Suppose, we have collection of n objects lets o1,o2,03..........on. Lets s= {o1,o2,o3......on} and suppose number of clusters are k.so we need to partition n objects into k clusters. That means we need to make k partitions. So suppose those partitions are: P1, P2,P3......Pk. So properties that are shared by these partitions are:

(i) Pi≠⏀∀i=1,2,3...k
(ii) Pi∩Pj=⏀∀i≠j
k
(iii) ∪Pi=S
i=1

We need to find some similarity or some disimilarity among objects by which we can partition objects into different sets. Some of widely used similarity and dissimilarity measures are Euclidean distance, Cosine similarity and so on.

Euclidean Distance

The purpose of a measure of similarity is to compare two lists of numbers (i.e. vectors), and compute a single number which evaluates their similarity. The basis of many measures of similarity and dissimilarity is euclidean distance. The distance between vectors X and Y is defined as follows:

In other words, euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Note that the formula treats the values of X and Y seriously: no adjustment is made for differences in scale. Euclidean distance is only appropriate for data measured on the same scale. As you will see in the section on correlation, the correlation coefficient is (inversely) related to the euclidean distance between standardized versions of the data.

Euclidean distance is most often used to compare profiles of respondents across variables. For example, suppose our data consist of demographic information on a sample of individuals, arranged as a respondent-by-variable matrix. Each row of the matrix is a vector of m numbers, where m is the number of variables. We can evaluate the similarity (or, in this case, the distance) between any pair of rows. Notice that for this kind of data, the variables are the columns. A variable records the results of a measurement. For our purposes, in fact, it is useful to think of the variable as the measuring device itself. This means that it has its own scale, which determines the size and type of numbers it can have. For instance, the income measurer might yield numbers between 0 and 79 million, while another variable, the education measurer, might yield numbers from 0 to 30. The fact that the income numbers are larger in general than the education numbers is not meaningful because the variables are measured on different scales. In order to compare columns we must adjust for or take account of differences in scale. But the row vectors are different. If one case has larger numbers in general then another case, this is because that case has more income, more education, etc., than the other case; it is not an artifact of differences in scale, because rows do not have scales: they are not even variables. In order to compute similarities or dissimilarities among rows, we do not need to (in fact, must not) try to adjust for differences in scale. Hence, Euclidean distance is usually the right measure for comparing cases.

Complete Chapter List

Search this Book:

Reset

MLA

APA

Chicago

Export Reference

Document Clustering: A Summarized Survey

Abstract

Introduction

Euclidean Distance

Complete Chapter List