Introduction
Processing emergency information from social media can help identify regions affected by natural hazards such as earthquakes or tsunamis, since the feeds are real-time and often contain location information (ca. 1.2% with exact coordinates; ca. 50% with a city or state derived from the user profile). Given the massive growth of Twitter data and its increasing number of users, however, accessing and interpreting the data stream efficiently is a challenge. In recent years, there have been major achievements in using such “weak” human sensors as a complement to seismic sensors in some early warning systems (see Sakaki, 2010; Guy, 2010), focusing on English and Japanese. At present, there is no similar alerting system for the Mediterranean region. We try to fill this gap within the European TRIDEC project (www.tridec-online.eu) by adapting state-of-the-art algorithms to the common Twitter languages in the endangered zones.
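To illustrate the two kinds of location information mentioned above, the following minimal sketch prefers exact coordinates and falls back to the location string from the user profile. The field names `coordinates` and `profile_location` are hypothetical and chosen for illustration only; they are not taken from the Twitter API or from the TRIDEC implementation.

```python
from typing import Optional


def resolve_location(tweet: dict) -> Optional[dict]:
    """Return the best available location for a tweet, or None.

    Assumes a hypothetical record layout:
      tweet["coordinates"]      -> [lat, lon], or absent/None
      tweet["profile_location"] -> free-text city or state, or absent/None
    """
    # Exact coordinates are rare but the most precise, so they win.
    coords = tweet.get("coordinates")
    if coords is not None:
        return {"source": "coordinates", "value": tuple(coords)}

    # Otherwise fall back to the city/state string from the user profile.
    profile = (tweet.get("profile_location") or "").strip()
    if profile:
        return {"source": "profile", "value": profile}

    # No usable location information in this message.
    return None
```

A profile-derived location is only a rough proxy for where the event was observed, so a real system would likely weight the two sources differently.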
Social media often play a crucial role in disaster management during and after a crisis: citizens commonly use Twitter postings or SMS messages to report emergencies. The information these messages contain may be relevant for crisis management (relief and medical care for those affected, repair of broken infrastructure, etc.), so there is a strong need to classify, cluster, and extract such information effectively from large-scale, noisy, unstructured data. Because the messages are very short (max. 140 characters), NLP analysis is particularly difficult.
A number of text mining tools have been applied to recognize tactical, actionable information in tweets (Verma, 2011), to find messages that contain real-world or real-event information (Becker, 2011; Naaman, 2011), or to extract Named Entities (Neubig, 2011) or other news content (Sankaranarayanan, 2009), in each case for a single language (mostly Japanese or English).
In some cases, though, it is crucial to cross language boundaries, for instance when the epicenter is near the border of a country (e.g., Western Turkey and Greece), or when a Twitter user reports an event in his/her native language (e.g., Romanian) and the report needs to be translated into a different language (e.g., English, German, or Spanish).
Therefore, our long-term goal within TRIDEC is to support access to relevant information across languages, focusing on the translation of under-resourced Mediterranean languages such as Turkish, Greek, and Romanian into English.
The multilingual nature of the blogosphere was a major hindrance during the Haitian earthquake, when reports arrived in languages ranging from Japanese to English and Spanish. The work of Caragea (2011) is one of the few that deal with multilinguality, classifying either English or Spanish messages into one of 10 emergency classes.