Introduction
The demand for processing streams of data in real time has increased in recent years due to several factors, including the growing amount of sensor data made available online (Margara et al., 2014) and the number of applications that provide context-aware services using smartphones (e.g. Waze1 and RunKeeper2). Furthermore, some government organizations and agencies have a long tradition of publishing sensor data on the Web (e.g. USGS water data)3, and some companies have recently started doing so following open data initiatives (e.g. APIs for public transport data in the UK4 and Madrid5). These are examples of resources that allow developers to build new applications upon dynamic datasets. Yet extracting meaningful information from streams of data is not trivial: it requires data integration procedures and processing systems that remain scalable in the face of varying conditions in data sources, complex queries, and system failures.
The authors of this paper focus on sensor data that is available online. The concept of Sensor Web (Delin & Jackson, 2001) refers to a network of interlinked sensing devices distributed in space, which is able to monitor uncertain environments. The Open Geospatial Consortium's (OGC) Sensor Web Enablement (SWE) provides a set of standards for managing online sensor networks and the data they produce (Botts et al., 2008). SWE's data models and service specifications address syntactic heterogeneities, but lack semantic descriptions. To address this gap, the Semantic Sensor Web (Sheth et al., 2008) aims at providing a framework for the interoperable exchange and processing of sensor data by enriching observations with spatial, temporal, and thematic metadata.
Sensor data providers and consumers face several challenges caused by the data deluge (Corcho & Garcia-Castro, 2010): support for flexible querying (e.g. including spatio-temporal parameters), the need for on-the-fly aggregations, detection of relevant events and outliers, integration of heterogeneous data sources, and efficient management of system scalability. Making the collected sensor data available to consumers is also a data management task. Commonly, this task is solved by setting up a data access Web portal, by deploying OGC's Sensor Observation Services, or by providing an API. According to the Linked Data principles, data should be published on the Web using the Resource Description Framework (RDF).6 The W3C RDF Stream Processing (RSP) Community Group aims at defining a common model for producing, transmitting, and continuously querying data streams encoded in RDF.7 In this paper, the authors show how the conversion of sensor data (coming from the Sensor Cloud infrastructure) to RDF allows the ingestion of sensor data time series into RSP engines. For this purpose, the authors implemented morph-streams++,8 a distributed and parallelized version of morph-streams9 (Calbimonte et al., 2010; Calbimonte et al., 2012) that provides ontology-based data access to execute SPARQL-like queries over a range of data streaming systems. In previous work, the authors have discussed how to extend morph-streams in terms of scalability, adaptive query processing, and RDF stream compression (Llaves et al., 2014; Fernandez et al., 2014). In this paper, the focus is on the use of morph-streams++ for sensor data management in the environmental domain and, more concretely, on the pre-processing of environmental sensor data before it is ingested by the RSP engine.
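To make the idea of converting sensor data to RDF more concrete, the following minimal sketch serializes a single timestamped sensor reading as N-Triples. It is an illustration only and not the mapping used by morph-streams++: the choice of the W3C SOSA vocabulary, the `http://example.org/` base IRI, and the function name `observation_to_ntriples` are all assumptions made for this example.

```python
from datetime import datetime, timezone

# Vocabulary IRIs. SOSA is an assumed choice for this sketch,
# not necessarily the ontology used by the authors.
SOSA = "http://www.w3.org/ns/sosa/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/"  # hypothetical base IRI for sensors and observations


def observation_to_ntriples(sensor_id, prop, value, timestamp):
    """Return one sensor reading as a list of N-Triples lines.

    `timestamp` is expected to be a timezone-aware datetime; it is
    normalized to UTC before being written out.
    """
    ts = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    obs = f"<{EX}observation/{sensor_id}/{ts}>"
    return [
        f"{obs} <{RDF_TYPE}> <{SOSA}Observation> .",
        f"{obs} <{SOSA}madeBySensor> <{EX}sensor/{sensor_id}> .",
        f"{obs} <{SOSA}observedProperty> <{EX}property/{prop}> .",
        f'{obs} <{SOSA}hasSimpleResult> "{float(value)}"^^<{XSD}double> .',
        f'{obs} <{SOSA}resultTime> "{ts}"^^<{XSD}dateTime> .',
    ]


if __name__ == "__main__":
    reading_time = datetime(2014, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
    for line in observation_to_ntriples("s1", "temperature", 21.5, reading_time):
        print(line)
```

Each reading becomes a small group of triples that share the observation IRI and carry an explicit result time, which is the kind of timestamped RDF data that an RSP engine can consume as a stream.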