
Amitava Mitra (Auburn University, USA)

Copyright: © 2009
Pages: 6

DOI: 10.4018/978-1-60566-010-3.ch089

Chapter Preview

As technology continues to ease data collection, the abundance of data on products, processes, and service-related operations keeps growing, and it becomes important to use these data adequately for decision making. The ultimate value of the data is realized once it can be used to derive information on product and process parameters and to make appropriate inferences.

**Inferential statistics**, where information contained in a sample is used to make inferences on unknown but appropriate population parameters, has existed for quite some time (Mendenhall, Reinmuth, & Beaver, 1993; Kutner, Nachtsheim, & Neter, 2004). Applications of inferential statistics to a wide variety of fields exist (Dupont, 2002; Mitra, 2006; Riffenburgh, 2006).

In data mining, a judicious choice has to be made to extract observations from large databases and derive meaningful conclusions. Often, decision making using statistical analyses requires the assumption of normality. This chapter focuses on methods to transform variables, which may not necessarily be normal, to conform to normality.

Because the normality assumption underlies many statistical inference procedures, it is appropriate to define the normal distribution, describe situations in which non-normality may arise, and introduce concepts of data stratification that may lead to better understanding and inference-making. Statistical procedures to test for normality are then stated.

A continuous random variable, Y, is said to have a normal distribution if its probability density function is given by

f(y) = (1/(σ√(2π))) exp[−(y − μ)²/(2σ²)], −∞ < y < ∞, *(1)*

where μ and σ denote the mean and standard deviation, respectively, of the normal distribution. When plotted, equation (1) resembles a bell-shaped curve that is symmetric about the mean (μ). The cumulative distribution function (cdf), F(y), represents the probability P[Y ≤ y], and is found by integrating the density function given by equation (1) over the range (−∞, y). So, we have the cdf for a normal random variable as

F(y) = ∫₋∞^y f(t) dt.

In general, P[a ≤ Y ≤ b] = F(b) – F(a).

A standard normal random variable, Z, is obtained through a transformation of the original normal random variable, Y, as follows:

Z = (Y − μ)/σ.

The standard normal variable has a mean of 0 and a standard deviation of 1, with its cumulative distribution function denoted F(z).
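The standardization above is what makes interval probabilities computable from a single tabulated function. As a minimal sketch (using only Python's standard library; the closed form F(z) = ½[1 + erf(z/√2)] for the standard normal cdf is a standard identity, and the function names are illustrative, not from the chapter):

```python
import math

def normal_cdf(y, mu=0.0, sigma=1.0):
    """F(y) = P[Y <= y] for Y ~ Normal(mu, sigma).

    Standardizes via Z = (Y - mu)/sigma, then applies the identity
    F(z) = 0.5 * (1 + erf(z / sqrt(2))) for the standard normal cdf.
    """
    z = (y - mu) / sigma                      # standardize: Z ~ N(0, 1)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_prob(a, b, mu=0.0, sigma=1.0):
    """P[a <= Y <= b] = F(b) - F(a), as in the text."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)
```

For example, `interval_prob(-1.96, 1.96)` returns roughly 0.95, the familiar two-sided probability for a standard normal variable, and `normal_cdf(mu, mu, sigma)` is exactly 0.5 by symmetry about the mean.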

Prior to analyzing data, careful consideration of the manner in which the data are collected is necessary. The following are some considerations that data analysts should explore when assessing whether the data satisfy the normality assumption.

Depending on the manner in which data are collected and recorded, data entry errors may severely distort the distribution. For instance, a misplaced decimal point may turn an observation into an outlier on the low or the high side. **Outliers** are observations that are “very large” or “very small” compared to the majority of the data points, and they have a significant impact on the **skewness** of the distribution. Extremely large observations create a right-skewed (positively skewed) distribution, whereas outliers on the lower side create a left-skewed (negatively skewed) distribution. Both of these distributions, obviously, deviate from normality. If outliers can be justified as data entry errors, they can be deleted prior to subsequent analysis, which may lead the distribution of the remaining observations to conform to normality.
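The screening step described above can be sketched with two simple diagnostics: the sample skewness (third standardized moment) and Tukey's common 1.5×IQR rule for flagging outliers. This is an illustrative sketch using only Python's standard library; the function names, the specific outlier rule, and the example data (with a deliberate misplaced-decimal error) are assumptions for demonstration, not procedures prescribed by the chapter:

```python
import statistics

def sample_skewness(data):
    """Third standardized moment: positive -> right-skewed,
    negative -> left-skewed, near zero for symmetric data."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)          # population SD for the moment ratio
    return sum(((x - mean) / sd) ** 3 for x in data) / n

def iqr_outliers(data, k=1.5):
    """Flag points beyond k*IQR from the quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# Hypothetical measurements where 10.1 was mistyped as 101.0:
data = [10.2, 9.8, 10.1, 9.9, 10.0, 101.0]
```

Here `sample_skewness(data)` is strongly positive (the high outlier pulls the right tail out) and `iqr_outliers(data)` flags 101.0; after deleting the justified entry error, the remaining values are far closer to symmetric.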

