Text Mining and Pre-Processing Methods for Social Media Data Extraction and Processing

Santoshi Kumari
DOI: 10.4018/978-1-7998-9594-7.ch002

Abstract

A huge amount of unstructured data is generated from social media platforms such as Twitter. The volume of tweets and the velocity with which they are generated on various topics present extensive challenges for data analytics and processing techniques. The linguistic flexibility with which tweets are written presents further challenges for preprocessing and natural language processing tasks. Addressing these challenges, this chapter aims to select, adapt, and apply information retrieval and preprocessing steps for retrieving, storing, organizing, and cleaning large-scale real-time unstructured Twitter data. The work focuses on reviewing previous research and applying suitable preprocessing methods to improve data quality by removing unessential content. It is also observed that Twitter APIs and access tokens provide easy access to real-time tweets. Preprocessing methods are fundamental steps of text analytics and NLP tasks for processing unstructured data. Suitable preprocessing methods such as tokenization, stop word removal, stemming, and lemmatization are applied to normalize the extracted Twitter data.

Introduction

Real-time sentiment analysis of continuously generated social media data aims to understand people’s attitudes and behavior toward topics being discussed at the present moment. Analyzing current data for a particular location helps to identify and understand people’s sentiments and opinions on critical events such as terrorism, fire alarms, tsunamis, other critical incidents, elections, political parties, and natural hazards.

The main goal of this work is to identify a platform that provides large-scale real-time social media data, together with suitable techniques and tools to extract that data and preprocessing methods that improve its quality so that it can be easily represented as feature vectors for further analysis. This work identifies Twitter, one of the top social networking platforms, on which people interact through short posts known as tweets. It provides a huge data source for academic and industrial social media researchers (Ahmed, 2018). Ahmed (2018) also describes the top resources, platforms, methods, and practical tools for retrieving, storing, and analyzing social media data when conducting social media research. According to Ahmed (2018), “The popularity of using Twitter for social media research, both in academia and in industry, remains high; no other platform has attracted as much attention from academics”. No other social media platform offers an infrastructure like Twitter’s, which makes its data accessible through Application Programming Interfaces (APIs). These APIs make it possible to extract tweets in real time, which is useful for applications such as crisis communication, identifying emergency situations, natural disasters, e-governance, and elections.
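
As a concrete illustration of this API-based access, the sketch below uses the tweepy Python client to stream tweets matching a filter rule in real time. The library choice, the rule, and the BEARER_TOKEN placeholder are assumptions made for illustration; the chapter itself does not prescribe a particular client or credentials.

```python
# Minimal sketch of real-time tweet extraction through the Twitter API,
# assuming the tweepy library and a developer account whose bearer token
# replaces the BEARER_TOKEN placeholder below.
import tweepy

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # hypothetical credential, not a real token

class TweetCollector(tweepy.StreamingClient):
    """Prints each incoming tweet; a real pipeline would store it instead."""

    def on_tweet(self, tweet):
        print(tweet.id, tweet.text)

collector = TweetCollector(BEARER_TOKEN)
# Filter rule: English tweets about natural disasters, excluding retweets.
collector.add_rules(tweepy.StreamRule("natural disaster lang:en -is:retweet"))
collector.filter()  # blocks and delivers matching tweets as they are posted
```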

Bruns & Liang (2012) explained the importance of twitter for capturing information at real time in emergency situations. They also identified a method for extracting tweets related to natural disaster in real time, for the development of suitable research infrastructure, for tracking and analyzing large-scale tweets at nearly real time.

To perform real-time text analytics research, techniques such as machine learning and sentiment analysis are used to extract and analyze data from Twitter, a platform with a huge number of social media posts on various topics. Various advanced data analysis tools (Ahmed, 2018), such as R, Weka, and KNIME, can be used to analyze social media data.

Tweets are expressed in an informal and cryptic form containing emoticons, special characters, URLs, and short forms. Applying data analysis methods directly to these unstructured tweets is difficult; they require cleaning and the removal of unnecessary information to improve data quality for subsequent machine learning and data analysis.
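
For illustration, a minimal cleaning routine of this kind could be written as follows; it assumes plain Python with the standard re module, and the particular regular expressions and the sample tweet are illustrative choices rather than the chapter’s exact rules.

```python
# Illustrative cleaning step, assuming the raw tweet text is available as a
# Python string. The regular expressions below strip URLs, user mentions,
# hashtag symbols, and other non-alphabetic characters.
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)               # remove user mentions
    text = re.sub(r"#", " ", text)                  # drop the hashtag symbol, keep the word
    text = re.sub(r"[^a-z\s]", " ", text)           # remove numbers, punctuation, emoticons
    return re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace

print(clean_tweet("Fire alarm at @CityHall!! Updates: https://t.co/xyz #emergency :)"))
# -> "fire alarm at updates emergency"
```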

To improve the quality of extracted tweets, preprocessing methods are essential. Preprocessing is the fundamental first step of all data analytics and NLP tasks, reducing complexity and improving data quality. It is also a foremost part of information retrieval, removing unwanted data and improving the quality of the text.

Applying preprocessing methods such as removing special characters, URLs, hashtags, stop words, numbers, and punctuation from these unstructured tweets and filtering out unessential data is a challenging task. The cleaned text is then represented as individual tokens and encoded with representations such as bag-of-words, term-document matrix (TDM), document-term matrix (DTM), and term frequency-inverse document frequency (TF-IDF). The preprocessing step is necessary to extract meaningful, high-quality data from the corpus of tweets and to build the feature vector for further analysis. In previous work, much of the misclassification and confusion in machine learning was caused by the roughly 40% of unwanted data present in datasets (Fayyad et al., 2003), which needs to be identified and preprocessed. Therefore, preprocessing is a major step of data analytics: it improves data quality for machine learning, reduces data size, and enhances computational speed and accuracy.
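
The normalization and representation steps listed above can be sketched as follows, assuming the NLTK and scikit-learn Python libraries; the sample tweets are invented for illustration and are not taken from the chapter’s dataset.

```python
# Sketch of tokenization, stop word removal, stemming/lemmatization, and
# DTM / TF-IDF feature extraction, assuming NLTK (with its punkt, stopwords,
# and wordnet resources downloaded) and scikit-learn are installed.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> str:
    tokens = word_tokenize(text.lower())                  # tokenization
    tokens = [t for t in tokens if t.isalpha()]           # drop numbers and punctuation
    tokens = [t for t in tokens if t not in stop_words]   # stop word removal
    # the chapter applies both lemmatization and stemming to normalize tokens
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
    return " ".join(tokens)

tweets = [
    "Flooding reported near the river banks tonight",
    "Rivers are flooding the nearby roads again",
]
corpus = [normalize(t) for t in tweets]

# Document-term matrix (term counts) and TF-IDF weighted feature vectors.
dtm = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)
print(dtm.shape, tfidf.shape)
```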

The rest of this chapter is structured as follows. Section 2 reviews the related literature. Section 3 presents methods for data extraction and preprocessing of unstructured social media data. Section 4 describes the experimental setup. Section 5 discusses the results and analysis. Section 6 presents conclusions and future work.
