
Clustering Genes Using Heterogeneous Data Sources

Erliang Zeng, Chengyong Yang, Tao Li, Giri Narasimhan

Source Title: Computational Knowledge Discovery for Bioinformatics Research

DOI: 10.4018/978-1-4666-1785-8.ch005


Abstract

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. These data provide a means to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm developed here performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithm to perform exploratory analysis on one complete source and other potentially incomplete sources provided in the form of constraints. This paper presents a new clustering algorithm, MSC, to perform exploratory analysis using two or more diverse but complete data sources; studies the effectiveness of constraint sets and the robustness of the constrained clustering algorithm using multiple sources of incomplete biological data; and incorporates such incomplete data into the constrained clustering algorithm in the form of constraint sets.

Chapter Preview


1. Introduction

Large-scale microarray experiments performed under a variety of conditions, or at various stages of a biological process, have produced huge amounts of gene expression data and have presented major challenges for the field of data mining (de Souto et al., 2008; Kerr et al., 2008). Challenges include rapidly analyzing and interpreting data on thousands of genes measured under hundreds of different conditions, and assessing the biological significance of the results. Clustering is the exploratory, unsupervised process of partitioning the expression data into groups (or clusters) of genes sharing similar expression patterns (Yeung et al., 2003; Kerr et al., 2008). However, the quality of clusters can vary greatly, as can their ability to lead to biologically meaningful conclusions.

On a different note, the biological and medical literature databases are information warehouses with a vast store of useful knowledge. In fact, text analysis has been successfully applied in bioinformatics for various purposes such as identifying relevant literature for genes and proteins, connecting genes with diseases, and reconstructing gene networks (Yandell & Majoros, 2002). Hence, including the literature in the analysis of gene expression data offers an opportunity to incorporate additional functional information about the genes when defining expression clusters. More generally, when multiple information sources are available, it is challenging to conduct integrated exploratory analyses that extract more information than is possible from any single source.

The basic problem of learning from multiple information sources has been extensively studied by the machine learning community. In computer vision this problem is referred to as multi-modal learning. In general, there are two approaches to multi-modal learning: feature-level integration and semantic-level integration (Wu et al., 1999). Methods that use feature-level integration combine the information at the feature level and then perform the analysis in the joint feature space (Glenisson et al., 2003). Semantic-level integration methods, on the other hand, first build individual models based on separate information sources and then combine these models via techniques such as mutual information maximization (Becker, 1996).
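
As a rough illustration of feature-level integration, the sketch below standardizes the features of two complete sources and concatenates them gene by gene into a joint feature space, on which any standard clustering method could then be run. It is a minimal sketch under stated assumptions, not the method of the cited works: the function names, the z-score scaling, and the toy data are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Standardize each column (feature) to zero mean and unit variance."""
    scaled_columns = []
    for col in zip(*matrix):
        mu = mean(col)
        sigma = pstdev(col)
        # A constant column carries no information; map it to zeros.
        scaled_columns.append(
            [(x - mu) / sigma if sigma > 0 else 0.0 for x in col]
        )
    # Transpose back so that each row is again one gene.
    return [list(row) for row in zip(*scaled_columns)]


def integrate_features(expression, text_features):
    """Concatenate per-gene feature vectors from two complete sources.

    Row i of both inputs must describe the same gene (an assumption of
    this sketch). Scaling each source first keeps one source from
    dominating distances in the joint space.
    """
    if len(expression) != len(text_features):
        raise ValueError("both sources must describe the same genes")
    expr_scaled = zscore_columns(expression)
    text_scaled = zscore_columns(text_features)
    return [e + t for e, t in zip(expr_scaled, text_scaled)]


# Hypothetical toy data: 3 genes, 2 expression conditions, 2 term weights.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
text_features = [[0.0, 1.0], [0.0, 0.5], [1.0, 0.0]]
joint = integrate_features(expression, text_features)
```

Each row of `joint` is a single gene described by four features. A semantic-level approach would instead build one model per source and combine the models afterwards, which this sketch does not attempt.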

Microarray experiments usually provide gene expression data on all the genes in a genome. Hence they are inherently “complete”. A major challenge in using other sources of data to assist the analysis of gene expression data is that they may not always be complete, i.e., they may not provide information on all the genes in the genome.

Recent work from the machine learning community has focused on the use of background information in the form of instance-level constraints. Two types of pair-wise constraints have been proposed: positive constraints, which specify that two instances must be placed in the same cluster, and negative constraints, which specify that two instances must not be placed in the same cluster. Recent examples of work include methods that ensured that constraints were satisfied at each iteration (Wagstaff et al., 2001), algorithms that used constraints as initial conditions (Basu et al., 2002), algorithms that learned a distance metric trained by a shortest-path algorithm (Klein et al., 2002), a convex optimization method using Mahalanobis distances (Xing et al., 2002), and semi-supervised clustering that incorporated both metric learning and the use of pair-wise constraints in a principled manner (Bilenko et al., 2004).
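
The two constraint types can be made concrete with a small check of the kind a hard-constrained clustering method might run before assigning an instance to a cluster. This is only a sketch under stated assumptions: the function name, the dictionary-based assignment, and the gene labels are hypothetical and are not taken from any of the cited algorithms.

```python
def violates_constraints(item, cluster, assignment, must_link, cannot_link):
    """Return True if placing `item` in `cluster` breaks any constraint.

    assignment  : dict mapping already-assigned items to cluster ids
    must_link   : pairs that must share a cluster (positive constraints)
    cannot_link : pairs that must not share a cluster (negative constraints)
    Constraints whose partner is still unassigned cannot be violated yet.
    """
    for a, b in must_link:
        if item in (a, b):
            other = b if item == a else a
            # The partner already sits in a different cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if item in (a, b):
            other = b if item == a else a
            # The partner already sits in this very cluster.
            if other in assignment and assignment[other] == cluster:
                return True
    return False


# Hypothetical example: geneC must join geneA and must avoid geneB.
assignment = {"geneA": 0, "geneB": 1}
must_link = [("geneA", "geneC")]
cannot_link = [("geneB", "geneC")]

ok_in_cluster_0 = not violates_constraints(
    "geneC", 0, assignment, must_link, cannot_link
)
ok_in_cluster_1 = not violates_constraints(
    "geneC", 1, assignment, must_link, cannot_link
)
```

Here `geneC` may be placed in cluster 0 but not in cluster 1. Because constraints are stated only for the pairs they name, a source that covers just some of the genes can still contribute information, which is what makes this representation a natural fit for incomplete data sources.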

While great efforts have been made to develop efficient constrained clustering algorithm variants, the role of constraint sets in constrained clustering algorithms has not yet been fully studied. Recently, Wagstaff et al. (2006) and Davidson et al. (2006) attempted to link the quality of constraint sets with clustering algorithm performance. Two properties of constraint sets, inconsistency and incoherence, were shown to be strongly negatively correlated with clustering algorithm performance.
