Introduction
Every second, massive amounts of data are being produced by sensors all around the world. From environmental measurement devices to smartphones, the sources of sensor data continue to proliferate, increasing the possibility of blending the diverse sources to collaboratively detect and identify a multitude of observations, from simple phenomena to complex events and situations. As these sensors become more accessible, due to lower costs and simpler configuration and maintenance, they can be deployed not only by companies and government institutions, but also by enthusiasts and citizen scientists. As a result, the data produced is extremely large in volume and highly heterogeneous, which makes it difficult to discover and use.
The heterogeneity of data as well as sensing environments is a key obstacle for realizing a connected sensor world. Different sensor network deployments usually represent the information that they capture in different ways. The data models and schemas are different, the data types and structures are not always compatible, and even the data values often use different representations. For example, consider multiple sensor networks measuring the same type of physical phenomenon. Each sensor deployment may have its own way to represent semantically identical information, e.g., “wind speed” vs. “average wind speed,” or “temperature” vs. “thermometer”. If a user wants to obtain the latest wind speed or temperature data values over the region where all the sensor networks are deployed, the user must employ a mechanism for letting the system understand the semantically equivalent but different representations of data, in order to fully answer the query.
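To make the example concrete, the following minimal Python sketch shows one naive way to reconcile such labels: a hand-written lookup table that maps each source-specific field name to a shared concept. The names `FIELD_TO_CONCEPT`, `Reading`, and `latest_by_concept`, as well as the concept labels, are hypothetical and used only for illustration; this is not the mechanism proposed in this article, which relies on ontologies and mappings rather than a hard-coded table.

```python
from dataclasses import dataclass

# Hypothetical lookup table: source-specific field names -> shared concept.
FIELD_TO_CONCEPT = {
    "wind speed": "WindSpeed",
    "average wind speed": "WindSpeed",
    "temperature": "Temperature",
    "thermometer": "Temperature",
}


@dataclass(frozen=True)
class Reading:
    network: str    # name of the sensor network that produced the value
    field: str      # field name as used by that network
    value: float
    timestamp: int  # seconds since some epoch


def latest_by_concept(readings, concept):
    """Return the most recent reading per network whose field maps to `concept`."""
    latest = {}
    for r in readings:
        # Skip fields that are unknown or belong to another concept.
        if FIELD_TO_CONCEPT.get(r.field.lower()) != concept:
            continue
        current = latest.get(r.network)
        if current is None or r.timestamp > current.timestamp:
            latest[r.network] = r
    return latest
```

A query for `"WindSpeed"` would then return readings from a network that reports "wind speed" and from one that reports "average wind speed", which a purely syntactic match on field names would miss. A fixed table like this does not scale to many independently deployed networks, which is part of the motivation for shared ontologies.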
One solution for dealing with this heterogeneity is the semantic annotation of sensor data (Sheth, Henson, & Sahoo, 2008), combined with ontology-based access to it (Calbimonte, Corcho, & Gray, 2010; Taylor & Leidinger, 2011). However, there is little evidence of how this approach scales, especially at high data rates and with push-based delivery of streaming data.
In this article we focus on two problems in this context: (i) how to find relevant heterogeneous sensor data sources based on their metadata, and (ii) how to query streaming sensor data from these sources. We summarize our contributions as follows:
- Our main contribution to the first problem is the use of the SSN ontology (Compton et al., in press), along with domain-specific vocabularies, for modeling sensor metadata and observations, augmented with mappings to the original sensor schemas. To this end, we use R2RML, the RDB-to-RDF mapping language (Das, Sundara, & Cyganiak, 2012), for mapping relational streams (instead of tables) to ontologies. Thus we use ontologies as a common model for representing sensor data and metadata, which makes it possible to search for data sources and to access them through ontological schemas.
- For the second problem, we propose a query rewriting and data translation approach that allows querying virtual RDF streams using the SPARQL language with streaming extensions. This approach exploits the R2RML mappings to provide access to the sensor streaming data, not only the metadata. Furthermore, we show that our query rewriting and execution mechanisms are applicable to both pull and push delivery modes, and to various state-of-the-art stream processing engines, such as SNEE (Galpin, Brenninkmeijer, Jabeen, Fernandes, & Paton, 2009), GSN (Aberer, Hauswirth, & Salehi, 2006), Pachube, and Esper (http://esper.codehaus.org/). We provide empirical evidence of performance with respect to sampling rates and delivery latency in both pull and push-based modes.
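The general idea of mapping-driven query rewriting named in the contributions above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the rewriting algorithm of this article: `StreamMapping` is a hypothetical stand-in for an R2RML-style mapping entry, the `ex:` terms and stream and column names are invented, and the generated query text uses a generic SQL-like window clause that is not the syntax of SNEE, GSN, or Esper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamMapping:
    """Hypothetical, simplified stand-in for an R2RML-style mapping entry."""
    observed_property: str  # ontology-level term, e.g. "ex:WindSpeed"
    stream: str             # name of the underlying relational stream
    value_column: str       # column holding the observed value
    time_column: str        # column holding the observation time


# Invented mappings: two sources expose wind speed under different column names.
MAPPINGS = [
    StreamMapping("ex:WindSpeed", "station_a", "wind_speed", "ts"),
    StreamMapping("ex:WindSpeed", "station_b", "avg_wind_speed", "time"),
    StreamMapping("ex:Temperature", "station_a", "temperature", "ts"),
]


def rewrite(observed_property, window_seconds, mappings=MAPPINGS):
    """Rewrite a request for one ontology-level property over a time window
    into one generic, SQL-like continuous query per matching source stream."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    queries = []
    for m in mappings:
        if m.observed_property == observed_property:
            # Illustrative window syntax only; real engines differ.
            queries.append(
                f"SELECT {m.time_column}, {m.value_column} "
                f"FROM {m.stream} [RANGE {window_seconds} SECONDS]"
            )
    return queries
```

The point of the sketch is that the user asks for an ontology-level property once, and the mappings decide which streams and columns are actually queried. Translating the returned tuples back into RDF, the data translation half of the approach, is not covered here.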