Domain Adaptation for Crisis Data Using Correlation Alignment and Self-Training

Hongmin Li (Kansas State University, Manhattan, USA), Oleksandra Sopova (Kansas State University, Manhattan, USA), Doina Caragea (Kansas State University, Manhattan, USA) and Cornelia Caragea (University of Illinois at Chicago, Chicago, USA)
DOI: 10.4018/IJISCRAM.2018100101

Abstract

Domain adaptation methods have been introduced for auto-filtering disaster tweets to address the lack of labeled data for an emerging disaster. In this article, the authors present and compare two simple, yet effective approaches for the task of classifying disaster-related tweets. The first approach leverages the unlabeled target disaster data to align the source disaster distribution to the target distribution, and subsequently learns a supervised classifier from the modified source data. The second approach uses the strategy of self-training to iteratively label the available unlabeled target data, and then builds a classifier as a weighted combination of source and target-specific classifiers. Experimental results using Naïve Bayes as the base classifier show that both approaches generally improve performance compared to the baseline. Overall, the self-training approach gives better results than the alignment-based approach. Furthermore, combining correlation alignment with self-training improves over correlation alignment alone, but self-training on its own still achieves the best results.
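The first approach aligns the source distribution to the target by matching second-order statistics, in the spirit of CORrelation ALignment (CORAL): whiten the source features with the source covariance, then re-color them with the target covariance, and train a supervised classifier on the transformed source data. A minimal numpy sketch of this idea (the function name and the regularization constant `eps` are illustrative assumptions, not taken from the article):

```python
import numpy as np

def coral_align(source_X, target_X, eps=1e-3):
    """Align source features to the target distribution by
    matching covariances (second-order statistics)."""
    d = source_X.shape[1]
    # Regularized covariances (eps * I keeps them invertible)
    cs = np.cov(source_X, rowvar=False) + eps * np.eye(d)
    ct = np.cov(target_X, rowvar=False) + eps * np.eye(d)

    def mat_pow(m, p):
        # Matrix power of a symmetric PSD matrix via eigendecomposition
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(vals ** p) @ vecs.T

    # Whiten with the source covariance, re-color with the target covariance
    return source_X @ mat_pow(cs, -0.5) @ mat_pow(ct, 0.5)
```

A classifier (e.g., Naïve Bayes) would then be trained on `coral_align(source_X, target_X)` with the original source labels and applied to the target disaster data.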
Introduction

From user groups and online forums to Facebook, Twitter, Instagram, and YouTube, social media platforms have become ubiquitous. The use of social media is particularly prevalent during emergencies. For instance, the Federal Emergency Management Agency (FEMA) wrote in its 2013 National Preparedness report (Maron, 2013) that during and immediately following Hurricane Sandy in 2012 “users sent more than 20 million Sandy-related Twitter posts, or tweets, despite the loss of cell phone service during the peak of the storm.” Such huge amounts of user-generated data contributed by disaster-affected communities have become an important source of big crisis data for disaster response (Castillo, 2016; Reuter & Kaufhold, 2018), and at the same time have been used by the public at large to make sense of an event from social media (Stefan, Deborah, Milad, & Christian, 2018). Many research and practical studies have demonstrated the value of social media data in disseminating warning and response information, enhancing situational awareness, facilitating allocation of resources, informing disaster risk reduction strategies and risk assessments (Watson, Finn, & Wadhwa, 2017; Reuter, Hughes, & Kaufhold, 2018; National Research Council, 2013), as well as fostering community resilience (Zhang, Drake, Li, Zobel, & Cowell, 2015). Despite these benefits, the challenges presented by the volume of the data still preclude large emergency organizations from using it routinely (Meier, 2013).

Manually sifting through voluminous streaming data to filter useful information in real time is infeasible. Machine learning techniques show promising results in automating the process of identifying useful, relevant, and trustworthy information in big crisis data (Qadir et al., 2016), despite many practical challenges (Mendoza, Poblete, & Castillo, 2010). Many works have successfully used supervised learning algorithms to automatically classify tweets (Caragea, Squicciarini, Stehle, Neppalli, & Tapia, 2014; Imran, Elbassuoni, Castillo, Diaz, & Meier, 2013). Supervised algorithms require labeled training data to learn classifiers that can then be used to label new data of the same type (also called test data). The labels generated for the test data are usually accurate when the training and the test data are drawn from the same distribution.

The requirements above result in two main challenges that machine learning algorithms face when used to classify user-generated tweets about emerging disasters such as floods, hurricanes, and terrorist attacks. First, labeled data is not easily available for an emerging “target” disaster for which a classifier is needed to help disaster response teams identify relevant tweets, and ultimately information useful for situational awareness. Labeling data is an expensive and time-consuming process, which does not provide a real-time solution for disaster response. Labeled data from a prior “source” disaster can potentially be used to learn a supervised classifier for the target disaster (Starbird, Palen, Hughes, & Vieweg, 2010). However, a second challenge arises from the fact that data from the source disaster and data from the target disaster may not share the same distribution (or characteristics), so a classifier learned from the source may not perform well on the target.
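The self-training strategy described in the abstract addresses this setting by letting a source-trained classifier iteratively pseudo-label the most confidently predicted unlabeled target tweets, retraining a target-specific classifier on those pseudo-labels, and combining the two classifiers' probabilities with a weight. A minimal sketch using scikit-learn's GaussianNB (the hyperparameter names `rounds`, `thresh`, and `w_src`, and the confidence-threshold selection rule, are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train_predict(Xs, ys, Xt_unlab, Xt_test,
                       rounds=3, thresh=0.9, w_src=0.5):
    # Initial classifier trained on the labeled source disaster
    src = GaussianNB().fit(Xs, ys)
    pseudo_X = np.empty((0, Xs.shape[1]))
    pseudo_y = np.empty(0, dtype=ys.dtype)
    pool, clf = Xt_unlab, src
    for _ in range(rounds):
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        keep = proba.max(axis=1) >= thresh  # confident predictions only
        if not keep.any():
            break
        # Add confidently predicted target tweets as pseudo-labeled data
        pseudo_X = np.vstack([pseudo_X, pool[keep]])
        pseudo_y = np.concatenate(
            [pseudo_y, clf.classes_[proba[keep].argmax(axis=1)]])
        pool = pool[~keep]
        if len(np.unique(pseudo_y)) > 1:
            # Retrain the target-specific classifier on pseudo-labels
            clf = GaussianNB().fit(pseudo_X, pseudo_y)
    tgt = clf
    # Weighted combination of source and target-specific probabilities
    p = (w_src * src.predict_proba(Xt_test)
         + (1 - w_src) * tgt.predict_proba(Xt_test))
    return src.classes_[p.argmax(axis=1)]
```

The key design choice is the confidence threshold: pseudo-labeling only high-confidence target tweets limits the propagation of labeling errors across iterations, at the cost of leaving some target data unused.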
