Ranjit Biswas (Jamia Hamdard University, India)

Source Title: Handbook of Research on Generalized and Hybrid Set Structures and Applications for Soft Computing

Copyright: © 2016
Pages: 46
DOI: 10.4018/978-1-4666-9798-0.ch023

Chapter Preview

Statistics is one of the most important subjects in science and engineering, and it is part of everyday life. By statistics we mean a broad body of techniques and procedures for the collection, organization, analysis, interpretation, and presentation of data. Without statistical methods it would be very difficult to make good decisions from raw data, whether precise or imprecise. From a statistical point of view, the term “Universe” refers to the totality of the items or units (data elements) in a field of enquiry or survey, whereas the term “Population” refers to the totality of items about which a statistician desires information at a given moment. Thus, in statistics, a ‘population’ is a large collection of objects of a *similar nature* that is of interest as a whole, e.g. human beings, households, or readings from a measurement device. Whenever we talk about a population P, we can also think of a relevant universe U: all the members of P (which is in general a multiset, or bag) are also members of the superset U. The notion of a ‘sample’ is a little different. A sample is a sub-collection of objects drawn from a population, chosen so that inferences about the population can be made by examining or measuring only the elements of the sample.

An important direction in Statistics was opened by Biswas (2014b, 2014c), who introduced a new subject, NR-Statistics, and proposed an updated shape for the subject in which R-Statistics and NR-Statistics are the two parts of STATISTICS. In this newly proposed shape, populations are divided into two categories: *R-Population* and *NR-Population*. If a population consists only of real-number data (n-dimensional), it is of category ‘R-Population’; a population that does not fall into that category is of category ‘NR-Population’. An *NR-Population* may contain an R-population too. Examples of an ‘NR-Population’ include a collection of 30 sounds (blows) of a bus horn, a large collection of handwritten instances of the English character “A”, a collection of 150 paintings of the beautiful ‘Taj Mahal’ by 150 children under the age of 12, a collection of 5 ECG reports of a patient, and a collection of three X-ray images of a fractured bone of a patient.

The justification for dividing statistical populations into two mutually exclusive and exhaustive categories, R-population and NR-population, needs to be made clear first. Rather than argue at length, we present one simple instance:

Consider a population P of real data. Clearly P is a multiset in general. To compute the population mean (arithmetic mean), two operations must be valid on the multiset data: “Addition” and “Division by an integer”. Suppose that the population P consists of the ages of 2000 males of height 5 ft and above. Computing the mean of this population of size 2000 raises no issue, because the two operations “Addition” and “Division by the scalar 2000” are valid here. Computing the population variance requires one more operation, ‘multiplication’, to be valid in the multiset P, which is also not an issue in this case. But now consider a population P that is a collection of 2000 handwritten instances of the English character “A” written by 2000 children. How can the population mean, the population variance, etc. be computed? To compute the mean of this population of size 2000, the two operations “Addition” and “Division by the scalar 2000” must again be valid on the population data. Otherwise the mean cannot be computed, as the existing literature of Statistics provides neither a theoretical method nor an experimental technique for finding it. Yet the daily life of human society, nature, the universe, etc. cannot ignore populations of this kind, of which there are infinitely many. This is a major drawback of classical Statistics, which we rename here with a new nomenclature: “R-Statistics”.
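The role of the three operations can be illustrated with a minimal Python sketch. The ages are hypothetical values (the preview gives no data), and `HandwrittenA` is a hypothetical placeholder type standing in for an image of a handwritten character; neither comes from the chapter.

```python
# Hypothetical R-population: a few ages (the population in the text has 2000 members).
# A list keeps repeated values, as a multiset does.
ages = [34, 41, 29, 41, 55]
n = len(ages)

# Population mean: needs "addition" and "division by a scalar".
mean = sum(ages) / n

# Population variance: additionally needs "multiplication" (squaring).
variance = sum((a - mean) * (a - mean) for a in ages) / n

print(mean, variance)


class HandwrittenA:
    """Hypothetical placeholder for one handwritten character 'A'."""


# An NR-population: the objects support neither addition nor scalar division,
# so the arithmetic mean cannot be formed and Python raises TypeError.
samples = [HandwrittenA(), HandwrittenA()]
try:
    sum(samples) / len(samples)
    mean_defined = True
except TypeError:
    mean_defined = False

print(mean_defined)
```

The failure in the second half is the point of the example: without valid arithmetic operations on the members, the classical mean has no meaning for such a population.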

Dimension of a Nucleus: To compute a nucleus (if one exists) of a given population (R or NR), a statistician chooses in advance two non-negative real numbers: radius = r and large = N. The pair (r, N) is then called a dimension of the nucleus.

Diameter of a Population: Let (P, d) be a population space. The diameter of the population P is the non-negative real number D given by D = max { d(x, y) }, where x, y ∈ P.
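As a minimal sketch of this definition, assuming a finite population stored as a list and a metric passed in as a function (the names `diameter` and `abs_distance` and the sample data are illustrative, not from the chapter):

```python
from itertools import combinations


def diameter(population, d):
    """Largest distance d(x, y) over all pairs of members of the population."""
    if len(population) < 2:
        # Assumption: a population with a single member has diameter 0.
        return 0.0
    return max(d(x, y) for x, y in combinations(population, 2))


def abs_distance(x, y):
    """Usual metric on the real line."""
    return abs(x - y)


P = [2.0, 5.0, 5.0, 9.5]  # hypothetical R-population (a multiset of reals)
print(diameter(P, abs_distance))  # 7.5
```

Because the metric is a parameter, the same function applies to an NR-population as soon as a suitable distance between its objects is available.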

m-Mapping: A mapping of a multiset (bag) to a multiset (bag).

m1-Mapping: If the classical mapping f: S → T is 1-to-1, then the m-mapping f: P → T_e is called an m1-mapping (or m1-function).

Coefficient of Homogeneity: It signifies how homogeneously the data are spread in the population.

Metric Mean (MM): An object m of the universe U which is closest to all the objects of the population P (R-population or NR-population) compared to all other objects of U; i.e. which is at the centre of P.
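One possible reading of this definition can be sketched in Python. Two assumptions are made here that the preview does not state: “closest to all the objects” is interpreted as minimising the sum of distances, and the universe U is taken to be a finite list of candidates. The Hamming distance on short strings is only an illustrative metric for an NR-population.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def metric_mean(universe, population, d):
    """Object of `universe` with the smallest total distance to `population`.

    Assumption: 'closest to all the objects' is read as 'minimises the sum
    of distances'; the chapter may use a different criterion.
    """
    return min(universe, key=lambda u: sum(d(u, x) for x in population))


U = ["cat", "cot", "cut", "dog"]  # hypothetical finite universe
P = ["cat", "cot", "cot", "cut"]  # hypothetical NR-population (a multiset)
print(metric_mean(U, P, hamming))  # cot
```

No addition or scalar division of the objects themselves is needed; only distances between objects are added, which is what makes a metric-based centre usable for NR-populations.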

R-Population: If a population consists of real number data (n-dimensional) then it is of category ‘R-Population’.

Metric Standard Deviation (MSD): Standard deviation of an NR-population (or R-population) with respect to the mean MM.

Multiset Space: Consider a population P in the universe U, that is, a multiset P in U. Suppose that U forms a metric space with respect to the metric d. Then the multiset (or bag) P is said to form a “multiset space” in U with respect to the metric d, and is denoted by (P, d).

Linear Standard Deviation (LSD): Standard deviation of an NR-population (or R-population) with respect to the mean LM.

Bag: It is a hybrid set structure. Let X be a finite set of elements. A bag B drawn from the set X is characterized by a function C: X → N, where N is the set of all non-negative integers. The function C is called the ‘count function’ of the bag B. For any x ∈ X, the value C(x) indicates the number of occurrences of the object x in the bag B.
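A count function of this kind maps naturally onto Python's `collections.Counter`. The sketch below is illustrative only; the set `X`, the bag `B`, and the helper name `count` are made up for the example.

```python
from collections import Counter

X = {"a", "b", "c"}                # finite base set
B = Counter(["a", "b", "a", "a"])  # bag drawn from X


def count(x):
    """Count function C: X -> N of the bag B."""
    if x not in X:
        raise KeyError(f"{x!r} is not an element of the base set X")
    return B[x]  # Counter returns 0 for elements that never occur


print(count("a"), count("b"), count("c"))  # 3 1 0
```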

Density: Let P be a population of cardinality N and diameter D (≠ 0). The density ρ of the population P is the real number given by ρ = N/D.
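A short sketch of the ratio N/D, assuming a finite population held in a list and a metric supplied as a function (the function name and data are illustrative). The restriction on the diameter is enforced explicitly.

```python
def density(population, d):
    """Density N/D of a population with cardinality N and non-zero diameter D."""
    n = len(population)
    diam = max((d(x, y) for x in population for y in population), default=0.0)
    if diam == 0:
        raise ValueError("density is undefined when the diameter is 0")
    return n / diam


P = [1.0, 2.0, 2.0, 5.0]  # hypothetical R-population: N = 4, D = 4.0
print(density(P, lambda x, y: abs(x - y)))  # 1.0
```

Repeated members raise N without changing D, so a population with many duplicates is denser than its underlying set.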

Nucleus: A nucleus of a population (R or NR) is a member of the population around which a large number of members of the population are centred. Thus a nucleus is not a hard/rigid measure, but a kind of soft measure.

Linear Variance (LV): Variance of an NR-population (or R-population) with respect to the mean LM.

Rigid Measure: A statistical measure which can be computed using crisp mathematics only, without using any Soft-computing tools.

r-Density: Consider the population space (P, d). For any positive real number r, the r-density d_r of the population P at a point x of it is denoted by d_r(x) and is defined by d_r(x) = n/(2r), where n is the cardinality of the closed ball B[x, r]. Thus r-density is a measure of the density of a closed ball of radius r.
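The following sketch computes an r-density for a small hypothetical population. It assumes that the closed ball B[x, r] is itself a multiset, so repeated members are counted with their multiplicity; the helper names are illustrative.

```python
def closed_ball(population, d, x, r):
    """Members y of the population (with repeats) such that d(x, y) <= r."""
    return [y for y in population if d(x, y) <= r]


def r_density(population, d, x, r):
    """r-density n / (2r), where n is the cardinality of the closed ball B[x, r]."""
    if r <= 0:
        raise ValueError("r must be a positive real number")
    return len(closed_ball(population, d, x, r)) / (2 * r)


def abs_distance(x, y):
    return abs(x - y)


P = [1.0, 1.5, 2.0, 2.0, 6.0]  # hypothetical R-population
print(r_density(P, abs_distance, 2.0, 1.0))  # 2.0
```

Here the ball B[2.0, 1.0] holds the four members 1.0, 1.5, 2.0, and 2.0, and the ball has diameter 2r = 2.0.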

Coefficient of Heterogeneity: It signifies how heterogeneously the data are spread in the population.

Atrain: A heterogeneous and dynamic data structure designed exclusively to deal with big data of any 4Vs.

Linear Mean (LM): Consider a finite population P (R or NR) of which the universe is the set U. Suppose that U forms a linear space (i.e., vector space) over the field R of real numbers. Then the corresponding mean of the population P is called the linear mean of P.

Distance Multiset: The collection of all distances d(x_i, x_j) between the elements of the population P of the population space (P, d) forms a multiset D_P, which is called the ‘Distance Multiset’ of the population P.
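A minimal sketch of a distance multiset using `collections.Counter`. It assumes that each unordered pair of members with i < j contributes one distance; the preview does not say whether ordered pairs or pairs with i = j are also included.

```python
from collections import Counter
from itertools import combinations


def distance_multiset(population, d):
    """Multiset of distances d(x_i, x_j) over unordered pairs with i < j (assumed)."""
    return Counter(d(x, y) for x, y in combinations(population, 2))


P = [1, 2, 2, 4]  # hypothetical R-population
D_P = distance_multiset(P, lambda x, y: abs(x - y))
print(sorted(D_P.elements()))  # [0, 1, 1, 2, 2, 3]
```

The result is itself a bag: the distance 1 occurs twice because the value 2 occurs twice in P.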

Congestion: The data congestion of a population P (R or NR) is measured by the density ρ of the population P. If the density is high, we say that P is congested.

“Atrain Distributed System” (ADS): A specially designed distributed system with an infinitely scalable architecture for processing big data of any 4Vs.

Metric Centre (MC): Synonym of MM.

Metric Variance (MV): Variance of an NR-population (or R-population) with respect to the mean MM.

Multiset: It is a generalized set structure. In mathematics, the notion of multiset is a generalization of the notion of a set in which members are allowed to appear more than once.

Desert Point: A point x of a population P is called a desert point of dimension (r, M) if the cardinality of the multiset B[x,r] is less than M. A desert point for ‘a large value of r and a small value of M’ signifies a lot of information to the statistician.
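The test for a desert point can be sketched as follows, assuming the closed ball B[x, r] counts repeated members with their multiplicity. The function name, the data, and the chosen dimension (r, M) = (2.0, 3) are illustrative.

```python
def is_desert_point(population, d, x, r, m):
    """True if the closed ball B[x, r] holds fewer than m members of the population."""
    ball_size = sum(1 for y in population if d(x, y) <= r)
    return ball_size < m


def abs_distance(x, y):
    return abs(x - y)


P = [1.0, 1.1, 1.2, 1.3, 9.0]  # hypothetical R-population
desert_points = [x for x in P if is_desert_point(P, abs_distance, x, 2.0, 3)]
print(desert_points)  # [9.0]
```

Only 9.0 qualifies: its ball of radius 2.0 contains just the point itself, while each of the other members has four members within that radius.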

NR-Population: If a population does not fall into the category of ‘R-Population’ then it is of the category ‘NR-Population’.

Plot Function: An m1-mapping is called a plot function of a population if the congestion of data in the actual population P is the same as or greater than the congestion of data in the created population T_e.

Isolated Point: A point x of a population P is called an isolated point of radius r ( = 0) if the core set of the multiset B[x,r] is a singleton set.

Population Space: For a statistical population P (R-population or NR-population), the multiset space (P, d) is called the “population space”.
