Data

Copyright: © 2023 | Pages: 15
DOI: 10.4018/978-1-6684-4730-7.ch002

Abstract

The initial step for a data scientist when addressing a business question is to identify the data type, as not all types can be employed in data mining analyses. Accordingly, the data scientist must select a suitable data type that corresponds to the data mining technique and classify the data into categorical and continuous types, regardless of the source of the data. Quality control is a significant factor for the data scientist, particularly if data collection was poorly administered or designed, leading to issues like missing values. Once the data scientist has acquired a relevant dataset, they should inspect the outliers associated with each feature to make sure the data is suitable for analysis. Observing outliers through data visualizations, such as scatter plots, is a common practice among data scientists, highlighting the crucial role of data type determination.
Chapter Preview

Attribute

An attribute defines the scope of a data object. For instance, if the attribute is gender, the data objects are male, female, or genderless. Data attributes are also used in different contexts (Inmon and Lindstedt, 2015). When data scientists write programs, data attributes are treated as variables and regarded as features. A data scientist collects data in a dataset whose features store values across multiple records. In data science, these records are called instances. Table 1 shows the structure used to store features, data objects, and instances (Grütter, 2019; Angiulli & Fassetti, 2021).

Table 1. Example of features, data objects, and instances

First Name    Last Name
Jirapon       Sunkpho
Sarawut       Ramjan
Kom           Campiranon

Table 1 contains two features: first name and last name. The data objects of the first name feature are Jirapon, Sarawut, and Kom, while those of the last name feature are Sunkpho, Ramjan, and Campiranon, respectively. When data objects from multiple features are combined, they form an instance. Table 1 therefore contains three instances: Jirapon Sunkpho, Sarawut Ramjan, and Kom Campiranon.
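
The relationship between features, data objects, and instances can be illustrated with a minimal Python sketch. It stores Table 1 as a list of dictionaries, one per instance. The key names first_name and last_name and the helper functions feature_names and data_objects are hypothetical names chosen for illustration; they do not come from the chapter.

```python
# Table 1 as a list of instances. Each instance maps a feature name
# (hypothetical keys: first_name, last_name) to one data object.
dataset = [
    {"first_name": "Jirapon", "last_name": "Sunkpho"},
    {"first_name": "Sarawut", "last_name": "Ramjan"},
    {"first_name": "Kom", "last_name": "Campiranon"},
]


def feature_names(rows):
    """Return the feature names, taken from the first instance."""
    return list(rows[0].keys()) if rows else []


def data_objects(rows, feature):
    """Return the data objects stored under one feature, one per instance."""
    return [row[feature] for row in rows]


features = feature_names(dataset)
print(len(features), "features:", features)
print(len(dataset), "instances")
print("first_name data objects:", data_objects(dataset, "first_name"))
```

Read column-wise, the structure yields the data objects of a feature; read row-wise, it yields the instances. A list of dictionaries is only one possible representation, and a tabular library would serve equally well.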

Data Dimensionality: The number of features within a dataset (Li, Horiguchi and Sawaragi, 2020). Data science focuses on analyzing data from various fields rather than on how the data was stored in the first place, so a dataset often arrives with more features than the analysis needs. The data scientist must therefore consider the number of dimensions and select only the features that support the analysis. Reducing the number of dimensions in this way also reduces the time and computing resources required to process datasets with massive numbers of instances.
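
As a rough illustration of reducing dimensionality by keeping only the needed features, the following Python sketch counts the features of a small dataset and then drops the ones not selected. The customer records, the feature names, and the helpers dimensionality and select_features are all hypothetical and invented for this example; which features are actually relevant depends on the business question and is simply assumed here.

```python
def dimensionality(rows):
    """Return the number of features, taken from the first instance."""
    return len(rows[0]) if rows else 0


def select_features(rows, keep):
    """Return a copy of the dataset that contains only the features in keep."""
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical instances, invented purely for illustration.
customers = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.5, "favorite_color": "blue"},
    {"customer_id": 2, "age": 51, "monthly_spend": 80.0, "favorite_color": "red"},
    {"customer_id": 3, "age": 27, "monthly_spend": 210.0, "favorite_color": "green"},
]

# Assumption: only age and monthly_spend support the analysis.
reduced = select_features(customers, ["age", "monthly_spend"])

print("dimensions before:", dimensionality(customers))
print("dimensions after:", dimensionality(reduced))
```

The number of instances is unchanged; only the number of features per instance shrinks, which is what lowers the processing cost.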
