Robust Statistical Methods for Rapid Data Labelling

Jamie Godwin (University of Durham, UK) and Peter Matthews (University of Durham, UK)
Copyright: © 2014 | Pages: 35
DOI: 10.4018/978-1-4666-6086-1.ch007

Abstract

Labelling of data is an expensive, labour-intensive, and time-consuming process and, as such, results in vast quantities of data being left unexploited when performing analysis through data mining. This chapter presents a new paradigm using robust multivariate statistical methods to encapsulate normal operational behaviour (not failure behaviour) in order to autonomously derive unsupervised classifier labels for previously collected data in a rapid, cost-effective manner. This enables traditional machine learning to take place on a much richer dataset. Two case studies are presented in the mechanical engineering domain, namely a wind turbine gearbox and a rolling element bearing. A statistically sound and robust methodology is contributed, allowing for rapid labelling of data to enable traditional data mining techniques. Model development is detailed, along with a comparative evaluation of the metrics. Robust derivatives are presented and their superiority is shown. Example “R” code is given in the appendix, allowing readers to employ the techniques discussed. High levels of agreement between the derived statistical approaches and the underlying condition of the components are found, showing the practical nature and benefit of this approach.
Chapter Preview

Introduction

Many data-driven algorithms require accurate labels in order to encapsulate the various conditions to which those labels correspond. However, in many real-world applications, deriving these labels is either impractical or not economically viable due to the resources required. As such, although significant quantities of data exist, exploiting this data effectively is not a trivial problem.

In this chapter, the performance of six multivariate distance metrics is evaluated on two datasets incorporating censored data. Unlike traditional methods, the techniques evaluated here provide a means of autonomously deriving classifier labels in an unsupervised manner for previously collected data in a rapid, cost-effective way. They can be employed where previously labelled data is scarce or highly imbalanced, allowing a greater amount of data to be incorporated into a traditional data mining analysis. To demonstrate the practicality and soundness of the approaches detailed, ‘R’ code is provided at every stage of the analysis, allowing the reader to follow the examples, which are presented on publicly available data for tutorial purposes.
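As a minimal illustration of this labelling paradigm (a sketch for orientation, not the chapter’s appendix code), the following ‘R’ fragment encapsulates normal behaviour with a location and covariance estimate taken from known-normal data and labels unseen observations by thresholding their Mahalanobis distance against a chi-squared quantile. The simulated data, the 97.5% threshold, and the label names are assumptions for demonstration only.

## Sketch: derive unsupervised labels by encapsulating normal behaviour
## and flagging departures from it (illustrative data and threshold).
set.seed(1)
baseline     <- matrix(rnorm(200 * 3), ncol = 3)   # known-normal operation
observations <- matrix(rnorm(500 * 3), ncol = 3)   # previously unlabelled data

## Encapsulate normal behaviour with a centre and covariance estimate
centre <- colMeans(baseline)
covmat <- cov(baseline)

## Squared Mahalanobis distance of each observation from the baseline
d2 <- mahalanobis(observations, center = centre, cov = covmat)

## Label observations whose distance exceeds a chi-squared quantile
## (an illustrative cut-off, assuming approximate multivariate normality)
threshold <- qchisq(0.975, df = ncol(baseline))
labels <- ifelse(d2 > threshold, "abnormal", "normal")
table(labels)

The resulting labels can then feed a conventional supervised data mining analysis in place of costly manual annotation.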

The remainder of this chapter is organised as follows. Motivation and context for the work are presented in the “background to the problem” section, along with the issues of data scarcity and data imbalance. Traditional techniques are then presented in the “data intensive techniques” section, and the datasets and degradation models used are described in the “dataset description” section. In total, six multivariate distance metrics are introduced and comparatively evaluated on both datasets for their robustness and merit in performing condition assessment: three Minkowski distances (Manhattan, Euclidean, and Chebyshev), the Penrose distance, and two forms of the Mahalanobis distance are examined in depth; a brief illustrative sketch of these metrics is given below. Multivariate normality testing is then covered. Two case studies follow in the “case study” sections, showing how the metrics can be employed for rapid labelling of data to enable traditional data mining approaches. Conclusions are then presented, with references and ‘R’ code in the appendices.
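As a taster for the metrics introduced later, the sketch below computes all six distances between a single observation and a baseline of normal operation. The Penrose formulation shown (variance-scaled, correlation-free) and the use of MASS::cov.rob with the MCD estimator for the robust Mahalanobis variant are assumptions for illustration and may differ from the chapter’s appendix code.

## Sketch: the six distance metrics for one observation x against a baseline.
library(MASS)

set.seed(1)
baseline <- matrix(rnorm(200 * 3), ncol = 3)   # known-normal operation
x  <- rnorm(3, mean = 1)                       # a single multivariate observation
mu <- colMeans(baseline)
S  <- cov(baseline)

## Minkowski family
manhattan <- sum(abs(x - mu))        # order 1
euclidean <- sqrt(sum((x - mu)^2))   # order 2
chebyshev <- max(abs(x - mu))        # order infinity

## Penrose distance: variance-scaled, ignores correlations between variables
p <- length(x)
penrose <- sum((x - mu)^2 / (p * diag(S)))

## Classical Mahalanobis distance (squared), using sample mean and covariance
maha_classical <- mahalanobis(x, center = mu, cov = S)

## Robust Mahalanobis distance, using a robust location/scatter estimate (MCD)
rob <- cov.rob(baseline, method = "mcd")
maha_robust <- mahalanobis(x, center = rob$center, cov = rob$cov)

c(manhattan = manhattan, euclidean = euclidean, chebyshev = chebyshev,
  penrose = penrose, mahalanobis = maha_classical, robust = maha_robust)

The robust variant replaces the sample mean and covariance with estimates that are resistant to outliers, which is what gives the robust derivatives their advantage when the baseline data itself contains contamination.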
