Introduction to Feature and Gene Selection
DOI: 10.4018/978-1-60960-557-5.ch008

Chapter Preview

Problem of Feature Selection

The days when features were scarce in many branches of science and technology are now in the past. One might assume that this abundance of features is a blessing but, to the surprise of many, it is not. It is tempting to think that more features bring more discriminating power and thus facilitate classification. In practice, however, the opposite often occurs: redundant and irrelevant features increase the complexity of classification and degrade classification accuracy.

Hence, some features have to be removed from the original feature set in order to mitigate these negative effects before a classifier is utilized. The task of removing redundant/irrelevant features is termed feature selection in the machine learning and data mining literature. It is a form of data dimensionality reduction1 in which the original set of features F is reduced to another set F′ ⊆ F, where the symbol ⊆ means ‘subset of or equal to’, implying that in certain cases the feature set may turn out to be irreducible2. For microarray data, however, this is surely not the case, since there are thousands or even tens of thousands of features (gene expression levels) in each dataset. In analyzing high-dimensional microarray data, one is interested in retaining just a few genes out of many thousands. Gene selection thus becomes synonymous with feature selection, which means that many existing feature selection methods can be readily applied to gene selection.

There is huge interest in gene selection among researchers working in bioinformatics (see, e.g., links to collected journal articles at http://www.nslij-genetics.org/microarray/). This is because gene selection represents a challenging and important task for both biology and machine learning.

By selecting a small fraction of genes from a microarray, one aims to find the genes that can serve as indicators of a certain disease, or even as early predictors of it. Since different types of cancer threaten humankind with great persistence, the overwhelming majority of articles on gene selection apply theoretical ideas and methods to a very practical problem related to cancer.

From the machine learning point of view, feature selection removes meaningless genes, i.e. genes unrelated to the studied disease, thus mitigating overfitting of a classifier on high-dimensional microarray data. Overfitting is a plague when there are many features and only a few samples, or instances, characterized by those features. Overfitting leads to very good, often perfect, classification performance (zero or near-zero error rate) on the training data, but this seemingly wonderful result does not automatically translate to new, out-of-sample data. Put differently, a researcher who neglects the harmful effect of overfitting on microarray data may find a small set of genes that he claims predict a certain type of cancer. However, when biologists and/or doctors pay attention to the expression levels of these genes while observing test volunteers and/or real patients, they see no value in those genes, because during the machine learning stage healthy and diseased patients were separated purely on the basis of noise present in the microarray measurements rather than on the presence or absence of the disease. This happens because, without prior removal of irrelevant genes, the classification task is an instance of what statistics and machine learning call the small sample size problem (the number of features far exceeds the number of samples in a dataset). For such problems, a classifier's failure to generalize to new data is the norm rather than the exception, unless the data dimensionality is dramatically reduced.
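The overfitting danger described above can be demonstrated with a minimal numerical sketch (not taken from the chapter; the data and model here are illustrative assumptions). When the number of features far exceeds the number of samples, even pure noise with randomly assigned labels can be fit perfectly on the training set, while performance on new data stays at chance level:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2000                       # far more "genes" (p) than samples (n)
X = rng.normal(size=(n, p))           # pure noise: no real biological signal
y = rng.choice([-1.0, 1.0], size=n)   # random class labels

# Minimum-norm least-squares fit: because p >> n, the linear model
# can interpolate the random labels exactly.
w = np.linalg.lstsq(X, y, rcond=None)[0]
train_acc = np.mean(np.sign(X @ w) == y)   # perfect on training data

# The same model on fresh noise performs at chance level.
X_new = rng.normal(size=(500, p))
y_new = rng.choice([-1.0, 1.0], size=500)
test_acc = np.mean(np.sign(X_new @ w) == y_new)
```

Here `train_acc` comes out at 1.0 despite the labels being random, while `test_acc` hovers around 0.5, i.e. coin-flipping. This is exactly the small-sample-size trap: a seemingly perfect gene signature that separates patients on noise alone.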

Thus, the goal in microarray data classification is to identify the differentially expressed genes that can be used to predict the class membership of new, unseen samples. The classification of gene expression data involves two steps: feature selection and classifier design. Feature selection identifies the subset of differentially expressed genes that are good (useful, relevant) for distinguishing different classes of samples.
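A common baseline for the feature selection step just described is a univariate filter: score each gene independently by how differently it is expressed between the two classes, then keep the top-ranked genes. The sketch below (an illustrative assumption, not a method prescribed by the chapter) uses a two-sample t-test on synthetic data in which only the first five "genes" carry real signal:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic "microarray": 20 samples x 1000 genes, two classes of 10 samples.
X = rng.normal(size=(20, 1000))
y = np.array([0] * 10 + [1] * 10)
# Plant a differential-expression signal in the first five genes only.
X[y == 1, :5] += 3.0

# Univariate filter: rank every gene by its two-sample t-test p-value.
_, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0)
top = np.argsort(pvals)[:10]   # retain the 10 best-ranked genes
```

On this data the five planted genes land among the top-ranked ones, and a classifier would then be trained on the reduced 10-gene set instead of all 1000. Real gene selection pipelines add multiple-testing correction and often multivariate or wrapper methods on top of such a filter.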
