1. Introduction
First, we consider inherent metric and ultrametric properties of data, given either a set of points endowed with at least a dissimilarity function, or a set of points in a coordinate space. Following the outline of how we can characterize data in regard to how metric it is, we review how to characterize data in regard to how ultrametric it is. Such data can include signal data in meteorology, finance, biomedicine, telecommunications, and so on. It can also include document texts such as literature, technical reports, and social media. Objectives of such characterization include the following:
- The embeddability of data: in a metric space if metric characterization is the aim, and in an ultrametric topology, hierarchical clustering or rooted tree topology, if ultrametric characterization is the aim;
- In this work, with the research hypothesis that ultrametric topology can express or represent subconscious thought processes, the determination of “islands” of inherent local ultrametricity.
Our motivation is to determine vestiges, or after-effects, of subconscious processes. Examples of such processes are emotion, trauma, dreams, infantile development and growth, and so on. We will note how there is integral linkage with cognitive, behavioural, activity-related and conscious reasoning processes. In Murtagh (2014a) we began this work. The computational implications are profound: subconscious and unconscious thought processes are vastly more efficient than conscious thought processes (Murtagh, 2014b). Clearly, conscious and unconscious thought processes are very different, yet they are complementary and integrated.
Another view of our work is that of pattern recognition involving motifs in the form of relationship triangles, as described in Chapter 9 of Neuman (2014).
1.1. Metric and Ultrametric
A metric space consists of a set on which is defined a distance function $d$, which assigns to each pair of points $x, y$ a distance $d(x, y)$ between them, and satisfies the following four axioms. The first three, for any pair of points, are referred to as positiveness, reflexivity and symmetry:

$$d(x, y) \geq 0; \quad d(x, y) = 0 \ \text{if and only if}\ x = y; \quad d(x, y) = d(y, x).$$

For any triplet of points $x, y, z$ we have the triangular inequality:

$$d(x, z) \leq d(x, y) + d(y, z).$$
If these properties, with the exception of the triangular inequality, are respected, we speak of dissimilarities. Similarities are transformed into dissimilarities by subtracting each from the maximum similarity value.
When considering an ultrametric space, we need the strong triangular inequality, or ultrametric inequality, defined as:

$$d(x, z) \leq \max\{d(x, y),\ d(y, z)\},$$

and this in addition to the positiveness, reflexivity and symmetry properties, for any triplet of points $x, y, z$.
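The strong triangular inequality has a useful geometric consequence: in an ultrametric space every triangle is isosceles with the two largest sides equal. A minimal Python sketch below counts the fraction of point triplets respecting this condition; the function names and the tolerance-based isosceles test are our own illustrative choices, not code from this work.

```python
import itertools
import math

def ultrametric_triplet_fraction(points, d, tol=1e-9):
    """Fraction of point triplets whose pairwise distances satisfy the
    strong triangular (ultrametric) inequality for every labelling:
    d(x, z) <= max(d(x, y), d(y, z)).  Equivalently, each triangle is
    isosceles with its two largest sides equal (up to `tol`)."""
    n_ok = 0
    triplets = list(itertools.combinations(points, 3))
    for x, y, z in triplets:
        a, b, c = sorted([d(x, y), d(y, z), d(x, z)])
        # Ultrametric triangles: the two largest sides b and c coincide.
        if abs(b - c) <= tol * max(c, 1.0):
            n_ok += 1
    return n_ok / len(triplets) if triplets else 1.0

def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.dist(p, q)
```

On an equilateral configuration every triplet passes, while three distinct collinear points never do. Exhaustive enumeration is cubic in the number of points, so for large datasets one would sample triplets rather than enumerate them all.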
Measurements of a set of observations on a set of attributes can be converted to a metric principal component, or principal axis, space using Principal Components Analysis. For contingency table data, i.e. frequency of occurrence data, Correspondence Analysis is typically used, since such data are categorical or qualitative rather than quantitative. Depending on input data preprocessing, Principal Components Analysis may be carried out on (i) the sums of squares and cross-products matrix, (ii) the variances and covariances, or (iii) the correlations.
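The three preprocessing choices amount to eigendecomposing three different matrices built from the same data. As a sketch, assuming the population-variance convention (dividing by n) and with a function name of our own choosing:

```python
import numpy as np

def pca_eigen(X, mode="covariance"):
    """Principal axes under three preprocessing choices:
    'crossproduct' : eigendecompose the sums-of-squares matrix X'X / n
    'covariance'   : centre the columns first
    'correlation'  : centre and scale the columns to unit variance first
    Returns eigenvalues and eigenvectors in decreasing eigenvalue order."""
    X = np.asarray(X, dtype=float)
    if mode in ("covariance", "correlation"):
        X = X - X.mean(axis=0)          # centre each attribute
    if mode == "correlation":
        X = X / X.std(axis=0)           # standardize each attribute
    S = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(S)      # ascending order for symmetric S
    return vals[::-1], vecs[:, ::-1]
```

Whichever variant is used, the trace is preserved: with correlations, for example, the eigenvalues sum to the number of attributes, and for strongly correlated attributes nearly all of that total loads onto the first axis.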
Principal Coordinates Analysis, also known as Classical Multidimensional Scaling, takes dissimilarities as input and reconstructs a coordinate space (Torgerson, 1958). The principal component, or principal coordinate, space has orthonormal axes. The mapping from the set of dissimilarities to the orthonormal axis space is carried out by singular value decomposition, i.e. eigenvalue and eigenvector decomposition. For an orthonormal decomposition with non-negative eigenvalues, the input data must comprise a positive semi-definite matrix. With dissimilarity, as opposed to distance, input, we are not guaranteed a metric-embeddable output, nor all non-negative eigenvalues. From this follows a measure of how metric a given set of data, endowed with a dissimilarity, is: we retain the metric embedding associated with the non-negative eigenvalues. Negative eigenvalues are associated with the part of the data that is not metric-embeddable.
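This can be sketched in Python using Torgerson's double-centring of the squared dissimilarities; the function names, and the ratio of eigenvalue magnitudes used to summarize metricity, are our own illustrative choices rather than a measure prescribed by this work.

```python
import numpy as np

def classical_mds_eigenvalues(D):
    """Classical Multidimensional Scaling step (Torgerson): double-centre
    the squared dissimilarity matrix and eigendecompose.  Negative
    eigenvalues flag the part of the data that is not embeddable in a
    Euclidean (metric) space."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the embedding
    return np.sort(np.linalg.eigvalsh(B))[::-1]

def metricity(D):
    """Share of total eigenvalue magnitude carried by the positive
    eigenvalues: 1.0 when D is fully Euclidean-embeddable."""
    ev = classical_mds_eigenvalues(D)
    pos = ev[ev > 0].sum()
    neg = -ev[ev < 0].sum()
    return pos / (pos + neg) if (pos + neg) > 0 else 1.0
```

For genuine Euclidean distances, e.g. the 3-4-5 triangle, the measure is 1 up to rounding; a dissimilarity that violates the triangular inequality (such as side lengths 1, 1, 10) produces a negative eigenvalue and a measure strictly below 1.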