Supporting Data-Intensive Analysis Processes: A Review of Enabling Technologies and Trends


Lawrence Yao (University of New South Wales, Australia), Fethi A. Rabhi (University of New South Wales, Australia) and Maurice Peat (University of Sydney, Australia)
DOI: 10.4018/978-1-4666-6178-3.ch019


Research scientists in data-intensive science use a variety of scientific software applications to support their analyses and processes. To support this work efficiently, software applications should satisfy the essential requirements of interoperability, integration, automation, reproducibility, and efficient data handling. Various enabling technologies, including workflow, service, and portal technologies, can be used to address these requirements. Through an in-depth review, this chapter shows that no single technology can address all of the essential requirements of scientific processes, and that hybrid combinations of technologies are therefore needed to support data-intensive research. The chapter also describes current scientific applications that combine several technologies and discusses some future research directions.
Chapter Preview


Data-intensive science emerged as a fourth paradigm after the three interrelated paradigms of science: empirical, theoretical, and computational (Hey, Tansley, & Tolle, 2009). Empirical science became the dominant form of discovery in the seventeenth century as “natural philosophers” used careful and systematic descriptions of natural phenomena gathered by direct observation to produce knowledge about the world. This was followed by theoretical science, in which scientists like Newton and Einstein developed general theories that have explanatory power over complex phenomena. When systems became too complex for humans to analyze without mechanical support, computational techniques such as simulation were used to assist scientists in their work.

There has been a shift in the methodology of science driven by the massive growth in data that we are now able to capture through complex new instruments. For example, the Large Hadron Collider of the European Organization for Nuclear Research is able to produce 15 petabytes (15 million gigabytes) of data annually (European Organization for Nuclear Research, 2008). The PubChem archive of the National Institutes of Health provides information on the biological activities of small molecules. As of August 2011, the archive contained 85 million substance records representing over 30 million chemically unique compounds (National Center for Biotechnology Information, 2011). The Securities Industry Research Center of Australia (Sirca) maintains the Australian Equities Tick History archive, which contains records of activity on the Australian Securities Exchange since 1987, including all order book entries, modifications, cancellations, and trades for supported financial instruments, time stamped with millisecond precision (Sirca, 2011). Jim Gray (2009) has described this data-intensive research as the fourth paradigm of science. This reflects an epistemic shift from computational science, which uses theoretical and mathematical models where the complexity of a problem makes empirical observation impossible, to a form of science that has an overabundance of data and requires algorithmic processing to generate meaningful results. Because of the massive amounts of available data, scientists working in data-intensive areas rely on information technology (IT) infrastructure and tools to extract useful information from their datasets.

This chapter is organized as follows: the background section examines the activities and processes of data-intensive science and discusses the reasons for using scientific software. The following section describes the essential requirements of efficient scientific applications. This is followed by an in-depth review of current enabling technologies that can be used to develop applications that address these requirements for data-intensive science. The final sections discuss current trends in scientific applications and propose some future research directions.



Activities performed by scientists in data-intensive fields do not normally occur in isolation; they are part of coordinated efforts to extract knowledge or meaning from data. These activities can be thought of as a “pipeline” (Hey et al., 2009) through which knowledge is produced, or more generally as a scientific process. Scientific processes cover a wide range of activities including data acquisition, data manipulation, and the publication of analysis results. An important part of the typical scientific process is the “analysis pipeline” (Szalay, 2011), or analysis process. Analysis processes mostly involve activities such as data manipulation and filtration.
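The pipeline idea described above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the record layout, the step names (`filter_valid`, `to_cents`), and the `run_pipeline` helper are all hypothetical, chosen only to show a filtration step followed by a manipulation step.

```python
from typing import Callable, Dict, List

# Hypothetical record type: a plain dictionary per observation.
Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def filter_valid(records: List[Record]) -> List[Record]:
    """Filtration step: drop records that have no price."""
    return [r for r in records if r.get("price") is not None]


def to_cents(records: List[Record]) -> List[Record]:
    """Manipulation step: add a derived field with the price in cents."""
    return [{**r, "price_cents": round(r["price"] * 100)} for r in records]


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        records = step(records)
    return records


# Illustrative input only; these values are made up for the example.
raw = [
    {"id": 1, "price": 10.5},
    {"id": 2, "price": None},
    {"id": 3, "price": 2.25},
]
result = run_pipeline(raw, [filter_valid, to_cents])
```

Each step takes and returns a list of records, so steps can be reordered or replaced without changing the others.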

Key Terms in this Chapter

Analysis Process: An analysis process in data-intensive science is the subset of a scientific process that is concerned with data analysis activities. A typical analysis process would involve activities to do with data manipulation or data transformation.

Portal Technology: Portal technology refers to the use of centralized portals to expose underlying functionality of a system. A portal acts as a doorway, allowing the underlying functionality (which is often quite complex) to be presented and used in a much more user-friendly manner. A portal also provides a single point of entry so that it is easier for users to find what they are looking for.

Interoperability: Interoperability refers to the ability of different software components to communicate and cooperate with one another, by resolving differences in language, execution platform, interface definition, and in the meaning of what has been communicated.

Scientific Process: A scientific process in data-intensive science consists of the sequence of activities performed by scientists in order to extract knowledge from data. A typical scientific process would cover a wide range of activities including data acquisition, data manipulation, and publication of analysis results.

Service Technology: Service technology refers to the use of services for software development, where a service is an autonomous, platform agnostic software component that operates within an ecosystem of services. The ecosystem is governed by a service-oriented architecture, which relies on the composition of loosely coupled services to achieve complex functionality.

Data-Intensive Science: Data-intensive science is considered to be the fourth paradigm of science after the three interrelated paradigms of empirical, theoretical, and computational science. It is seen as a data-driven, exploration-centered style of science, where IT infrastructures and software tools are heavily used to help scientists manage, analyze, and share data.

Reproducibility: Full replication refers to independently repeating a piece of research from scratch in order to obtain the same results. Where replication is not possible or impractical, research artifacts (e.g., data, process records, and code) can be made available as a means to reproduce the same results when the artifacts are analyzed again. Hence, reproducibility of scientific results offers a stepping stone towards full replication, where the more artifacts are made available, the closer we can get towards full replication.
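As a rough illustration of sharing research artifacts, the Python sketch below stores a fingerprint of the input data, the parameters, and the result of a run, so that a later re-analysis can be checked against the record. The function names (`run_analysis`, `record_run`, `reproduces`) and the toy analysis are assumptions made for this example, not a method described in the chapter.

```python
import hashlib
import json
from typing import Dict, List


def data_fingerprint(values: List[float]) -> str:
    """Hash a canonical JSON encoding of the data so changes are detectable."""
    encoded = json.dumps(values).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def run_analysis(values: List[float], scale: float) -> float:
    """Toy analysis used only for illustration: a scaled sum."""
    return sum(v * scale for v in values)


def record_run(values: List[float], scale: float) -> Dict[str, object]:
    """Capture the artifacts needed to repeat the analysis later."""
    return {
        "data_sha256": data_fingerprint(values),
        "params": {"scale": scale},
        "result": run_analysis(values, scale),
    }


def reproduces(record: Dict[str, object], values: List[float]) -> bool:
    """Re-run the analysis on supplied data and compare with the record."""
    if data_fingerprint(values) != record["data_sha256"]:
        return False
    rerun = run_analysis(values, record["params"]["scale"])
    return rerun == record["result"]


run_record = record_run([1, 2, 3], scale=2)
```

The more of these artifacts are kept (data, parameters, code version), the closer a re-run comes to the original analysis.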

Software Component: Modern software applications and systems are most often developed as assemblies of many smaller parts. The idea of software components formalizes the definition of these “smaller parts”: a software component is a software unit with a well-defined interface and explicitly specified dependencies. A software component can be as small as a block of reusable code, or as big as an entire application.
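The two properties named in this definition can be sketched in Python as follows. The names `DataSource`, `InMemorySource`, and `MeanCalculator` are hypothetical and exist only for this illustration: the interface is declared with `typing.Protocol`, and the dependency is made explicit by passing it to the constructor.

```python
from typing import List, Protocol


class DataSource(Protocol):
    """Well-defined interface: anything with a read() method returning numbers."""

    def read(self) -> List[float]:
        ...


class InMemorySource:
    """A small component that satisfies DataSource."""

    def __init__(self, values: List[float]) -> None:
        self._values = list(values)

    def read(self) -> List[float]:
        return list(self._values)


class MeanCalculator:
    """A component whose only dependency (a DataSource) is stated explicitly."""

    def __init__(self, source: DataSource) -> None:
        self._source = source

    def compute(self) -> float:
        values = self._source.read()
        if not values:
            raise ValueError("cannot compute the mean of an empty data source")
        return sum(values) / len(values)


calculator = MeanCalculator(InMemorySource([1.0, 2.0, 3.0]))
mean = calculator.compute()
```

Because `MeanCalculator` depends only on the interface, any other source that offers `read()` could be substituted without changing it.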

Middleware: Middleware is software that facilitates and links interactions between software components across distributed computing environments. Middleware typically uses adapters to enable interoperability across a wide variety of components. Middleware also often provides higher-level and more user-friendly access to the components it manages.
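The role of adapters can be sketched in Python under simplifying assumptions: everything runs in one process rather than across a distributed environment, and the classes `CsvAdapter`, `JsonAdapter`, and `Middleware` are hypothetical names invented for this example. Each adapter translates one native format into a common record shape, and the middleware offers a single higher-level call.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


class CsvAdapter:
    """Adapter for a component that exposes its data as CSV text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> List[Record]:
        return [dict(row) for row in csv.DictReader(io.StringIO(self._text))]


class JsonAdapter:
    """Adapter for a component that exposes its data as a JSON array."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> List[Record]:
        # Normalise every value to a string so both adapters agree.
        return [{k: str(v) for k, v in row.items()} for row in json.loads(self._text)]


class Middleware:
    """Single access point that hides the format differences behind adapters."""

    def __init__(self) -> None:
        self._adapters = {}

    def register(self, name: str, adapter) -> None:
        self._adapters[name] = adapter

    def fetch(self, name: str) -> List[Record]:
        return self._adapters[name].records()


broker = Middleware()
broker.register("csv_source", CsvAdapter("id,value\n1,a\n"))
broker.register("json_source", JsonAdapter('[{"id": 1, "value": "a"}]'))
```

Callers use `broker.fetch(...)` and receive the same record shape regardless of which underlying component supplied the data.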

Component Technology: Component technology refers to the use of software components for software development. Software components usually conform to a component model, and they are often hosted and managed by a component framework, which provides a controlled environment where components can be composed together to form larger applications or systems.

Integration: Integration describes the idea of applying a consistent user experience and complementary functionality across multiple software components to produce an “integrated” environment. Within an integrated environment, users can save on cognitive resources by learning a single model of behavior and applying the learned model to understand and operate all components within the environment.
