Data Classification: Its Techniques and Big Data

A. Sheik Abdullah (Thiagarajar College of Engineering, India), R. Suganya (Thiagarajar College of Engineering, India), S. Selvakumar (G. K. M. College of Engineering and Technology, India) and S. Rajaram (Thiagarajar College of Engineering, India)
DOI: 10.4018/978-1-5225-2031-3.ch003

Abstract

Classification is one of the data analysis techniques that can be applied across many applications. A classification model predicts categorical class labels, whereas clustering mainly deals with grouping data based upon similar characteristics. Classification models are tested by comparing the predicted values with the known target values in a set of test data. Data classification has many applications in business modeling, marketing analysis, credit risk analysis, biomedical engineering and drug response modeling. Extending data analysis and classification to big data offers insight into processing and managing large data sets. This chapter deals with the various techniques and methodologies that address the classification problem in the data analysis process, and with their methodological impact on big data.

Introduction

Data is an abstract concept from which information and knowledge are derived. Raw, unprocessed data typically passes through several stages before it reaches its processed form. Data is a collection of facts that represent values and measurements.

Information, meanwhile, is processed data. It reveals content or a message, either directly or indirectly. It is therefore in a meaningful form that can be easily conveyed to and understood by users, and it resolves uncertainty and ambiguity.

Qualities of Data

Quality refers to the characteristics of data that make it suitable for the data analysis process. Good data has the following characteristics:

  1. Accurate
  2. Represented numerically
  3. Relationship
  4. Serves a definite purpose
  5. Complete
  6. Clearly understandable

Types of Data Elements

At the start of every data analysis, it is necessary to identify the type of data to be analyzed. Data elements fall into the following types.

Continuous Data

Continuous data elements are defined on an interval scale. Examples include the income of employees in an organization and the sales of an enterprise.

Categorical Data

Categorical data elements are of three types:

  1. Ordinal Data: Data elements that take a restricted set of values with a meaningful ordering. An example is the classification of age into young, middle-aged and old groups.
  2. Nominal Data: Data elements that take a restricted set of values with no meaningful ordering between them. Examples include the profession and marital status of employees.
  3. Binary Data: Data elements that can take only two values. Examples include the gender and employment status of an employee.
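The distinction between the three categorical types matters when the values are coded numerically for analysis. The following is a minimal sketch in Python, assuming hypothetical field values and helper names (`AGE_ORDER`, `encode_ordinal`, `encode_nominal`, `encode_binary`) that do not come from the chapter; it shows one common way to code each type, not the only one.

```python
# Illustrative coding of categorical data elements.
# All names and category values here are hypothetical examples.

AGE_ORDER = ["young", "middle age", "old"]  # ordinal: the order is meaningful


def encode_ordinal(value, order):
    """Map an ordinal value to its rank within the given ordering."""
    return order.index(value)


def encode_nominal(value, categories):
    """One-hot code a nominal value so that no ordering is implied."""
    return [1 if value == category else 0 for category in categories]


def encode_binary(value, positive):
    """Map a two-valued element to 1 (matches `positive`) or 0."""
    return 1 if value == positive else 0
```

Ordinal values are mapped to ranks because their order carries information, while nominal values are one-hot coded so that the numeric codes do not suggest an order that does not exist.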

Data Standardization

Data standardization is the mechanism of normalizing data to a specified range, typically a smaller one than the original. The following normalization procedures are used to scale a given variable.

Min-Max Normalization

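This type of normalization scales the data into a smaller specified range, most commonly [0, 1]. It is represented as:

$$X_{new} = \frac{X_{old} - X_{min}}{X_{max} - X_{min}}\,(newmax - newmin) + newmin \qquad (1)$$

where X_min and X_max are the minimum and maximum of the original values, and newmin and newmax are the newly specified minimum and maximum values respectively.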
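Min-max normalization can be sketched in a few lines of Python. This is an illustrative sketch, not code from the chapter: the function name `min_max_normalize` and the default target range [0, 1] are assumptions, and the case where all values are equal is rejected because the original range would be zero.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max].

    Applies x_new = (x_old - min) / (max - min) * (new_max - new_min) + new_min.
    """
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The original range is zero, so the scaling is undefined.
        raise ValueError("all values are identical; cannot rescale")
    span = old_max - old_min
    return [(x - old_min) / span * (new_max - new_min) + new_min for x in values]
```

For example, `min_max_normalize([10, 20, 30])` maps the smallest value to 0, the largest to 1, and the middle value to 0.5.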

Z-Score Normalization

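In this type of normalization, the values are transformed so that their mean becomes zero and their standard deviation becomes one. It is represented by:

$$X_{new} = \frac{X_{old} - \mu}{\sigma} \qquad (2)$$

where μ is the mean value and σ is the standard deviation of the original values.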
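A minimal Python sketch of z-score normalization follows. The function name `z_score_normalize` is hypothetical, and the sketch assumes the population standard deviation (dividing by the number of values); the chapter does not say whether the population or the sample form is intended.

```python
import math


def z_score_normalize(values):
    """Transform values to have mean 0 and standard deviation 1.

    Applies x_new = (x_old - mu) / sigma, with sigma taken as the
    population standard deviation (an assumption of this sketch).
    """
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        # All values are equal, so the z-score is undefined.
        raise ValueError("standard deviation is zero; cannot normalize")
    return [(x - mu) / sigma for x in values]
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the population standard deviation is 2, so the value 9 is transformed to (9 - 5) / 2 = 2.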

Normalization by Decimal Scaling

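This type of normalization moves the decimal point of the values so that they fall in the range [-1, 1]. It is represented by the formula

$$X_{new} = \frac{X_{old}}{10^{N}} \qquad (3)$$

where N is the smallest integer such that the largest absolute value of X_new is less than 1.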
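Decimal scaling can be sketched in Python as follows. The function name `decimal_scale` is hypothetical, and the sketch assumes the usual convention that N is the smallest non-negative integer for which every scaled value has an absolute value below 1; the chapter preview does not define N explicitly.

```python
def decimal_scale(values):
    """Divide every value by 10**n so that all results lie strictly inside (-1, 1).

    n is the smallest non-negative integer with max(|x|) / 10**n < 1
    (an assumed convention). Returns the scaled values and n.
    """
    max_abs = max(abs(x) for x in values)
    n = 0
    while max_abs / (10 ** n) >= 1:
        n += 1
    return [x / (10 ** n) for x in values], n
```

For the values -986, 917 and 45, the largest absolute value is 986, so N = 3 and -986 is scaled to -0.986.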
