Introduction
With the ever-increasing popularity of location-based services, geo-tagging a document - the process of identifying the geographic locations (toponyms) it mentions - has gained much attention in recent years. In such services, geographic locations act as the glue that binds together disparate document sets (such as textual content, images and videos) from multiple data sources. Devices that produce multimedia documents such as images and videos can be equipped with additional sensors (e.g. GPS sensors) that geo-tag each document with geographic information such as latitude and longitude, and this information is stored as metadata alongside the document. Web services that accumulate such documents (e.g. YouTube and Flickr) can retrieve this information automatically. In addition, such services allow users to manually tag a multimedia document with geographic locations when it has not been geo-tagged by the capturing device. Unfortunately, geo-tagging is rather cumbersome for textual documents and generally relies on manual human input. Several works have addressed this limitation, and some of them report a high level of accuracy: (Ding, 2000), (Amitay, 2004), (Garbin, 2005), (Lieberman, 2007), (Andogah, 2012) and (Ignazio, 2014).
As part of a large-scale project, we have been collecting news stories about a country from the country-specific RSS feeds of different online news websites on a daily basis for around a year. The main idea is to aggregate this data set with other modes of public data, such as social media posts from Twitter, multimedia data from image-sharing websites such as Flickr, and data from wearable sensors such as lifeloggers and GPS trackers, to create a unique multi-modal (textual as well as multimedia) data set about a particular geographic location. Such a data set will encode experiences from multiple user perspectives and has enormous potential to be exploited for public benefit. One of the core challenges in dealing with such a heterogeneous data set is to define the parameters that can be used to link its components together for different use-case scenarios. Among several candidate parameters, the spatio-temporal attribute pair is the simplest choice because it is present in all our data sets except the news stories.
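To make the idea of linking on a spatio-temporal attribute pair concrete, the following is a minimal sketch rather than the project's actual method. The `Record` type, its field names, and the distance and time thresholds are all assumptions introduced for illustration; two records are treated as linked when they fall within a chosen radius and time window of each other.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Record:
    """One item from any source (hypothetical schema, not from the paper)."""
    source: str
    timestamp: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))


def linked(a, b, max_km=5.0, max_gap=timedelta(hours=6)):
    """True when two records are close in both space and time.

    The default thresholds are arbitrary illustrative values.
    """
    close_in_time = abs(a.timestamp - b.timestamp) <= max_gap
    close_in_space = haversine_km(a.lat, a.lon, b.lat, b.lon) <= max_km
    return close_in_time and close_in_space
```

A tweet and a photo taken a few hundred metres and two hours apart would be linked under these defaults, whereas a news story could only take part in such a comparison once a latitude and longitude had been assigned to it.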
News stories, which are mostly textual, carry a temporal attribute (in the form of a timestamp) that gives the time and date of publication. However, they lack any accompanying metadata that exposes the spatial attribute, even though every news story generally has a geographic focus (Andogah, 2012). The lack of a spatial attribute makes it challenging to geo-tag a news story automatically. To geo-tag our collection of news stories, we looked for publicly available geo-tagging APIs. CLAVIN (CLAVIN, 2016) and CLIFF (Ignazio, 2014), (CLIFF, 2015) are two such APIs.
After applying CLAVIN and CLIFF to a subset of our news data set, we noticed the following shortcomings: