Text Mining and Pre-Processing Methods for Social Media Data Extraction and Processing

Santoshi Kumari
DOI: 10.4018/978-1-7998-9594-7.ch002

Abstract

A huge amount of unstructured data is generated from social media platforms such as Twitter. The volume of tweets and the velocity with which they are generated on various topics present extensive challenges for data analytics and processing techniques. The linguistic flexibility with which tweets are written presents further challenges for preprocessing and natural language processing tasks. Addressing these challenges, this chapter aims to select, adapt, and apply information retrieval and preprocessing steps for retrieving, storing, organizing, and cleaning large-scale real-time unstructured Twitter data. The work focuses on reviewing previous research and applying suitable preprocessing methods to improve data quality by removing unessential content. It is also observed that Twitter APIs and access tokens provide easy access to real-time tweets. Preprocessing methods are fundamental steps of text analytics and NLP tasks for processing unstructured data. Suitable preprocessing methods such as tokenization, stop word removal, stemming, and lemmatization are applied to normalize the extracted Twitter data.

Introduction

Real-time sentiment analysis of continuously generated social media data aims to understand people’s attitudes and behavior toward topics being discussed at the present moment. Analyzing current data for a particular location helps to identify and understand people’s sentiments and opinions on critical events such as terrorism, fire alarms, tsunamis, other critical incidents, elections, political parties, and natural hazards.

The main goal of this work is to identify a platform that provides large-scale real-time social media data, together with suitable techniques and tools to extract that data and preprocessing methods that improve its quality so that it can be easily represented as feature vectors for further analysis. This work identifies Twitter, one of the top social networking platforms, on which people interact through short posts known as tweets. It provides a huge data source for academic and industrial social media researchers (Ahmed, 2018). Ahmed (2018) also describes the top resources, platforms, methods, and practical tools for retrieving, storing, and analyzing social media data when conducting social media research. According to Ahmed (2018), “The popularity of using Twitter for social media research, both in academia and in industry, remains high; no other platform has attracted as much attention from academics”. No other social media platform offers an infrastructure like Twitter’s, which makes its data accessible through Application Programming Interfaces (APIs). These APIs make it possible to extract tweets in real time, which is useful for applications such as crisis communication, identifying emergency situations, natural disasters, e-governance, and elections.
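
As a concrete illustration of this API-based access, the sketch below uses the tweepy Python client to stream tweets matching a filter rule in real time. The library choice, the rule, and the BEARER_TOKEN placeholder are assumptions made for illustration; the chapter itself does not prescribe a particular client or credentials.

```python
# Minimal sketch of real-time tweet extraction through the Twitter API,
# assuming the tweepy library and a developer account whose bearer token
# replaces the BEARER_TOKEN placeholder below.
import tweepy

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # hypothetical credential, not a real token

class TweetCollector(tweepy.StreamingClient):
    """Prints each incoming tweet; a real pipeline would store it instead."""

    def on_tweet(self, tweet):
        print(tweet.id, tweet.text)

collector = TweetCollector(BEARER_TOKEN)
# Filter rule: English tweets about natural disasters, excluding retweets.
collector.add_rules(tweepy.StreamRule("natural disaster lang:en -is:retweet"))
collector.filter()  # blocks and delivers matching tweets as they are posted
```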

Bruns & Liang (2012) explained the importance of twitter for capturing information at real time in emergency situations. They also identified a method for extracting tweets related to natural disaster in real time, for the development of suitable research infrastructure, for tracking and analyzing large-scale tweets at nearly real time.

To perform real-time text analytics research, techniques such as machine learning and sentiment analysis are used to extract and analyze data from Twitter, a platform with a huge number of social media posts on various topics. Various advanced data analysis tools (Ahmed, 2018), such as R, Weka, and KNIME, can be used to analyze social media data.

Tweets are expressed in an informal and cryptic form containing emoticons, special characters, URLs, and short forms. Applying data analysis methods directly to these unstructured tweets is difficult; they require cleaning and the removal of unnecessary information to improve data quality for subsequent machine learning and data analysis.
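
For illustration, a minimal cleaning routine of this kind could be written as follows; it assumes plain Python with the standard re module, and the particular regular expressions and the sample tweet are illustrative choices rather than the chapter’s exact rules.

```python
# Illustrative cleaning step, assuming the raw tweet text is available as a
# Python string. The regular expressions below strip URLs, user mentions,
# hashtag symbols, and other non-alphabetic characters.
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)               # remove user mentions
    text = re.sub(r"#", " ", text)                  # drop the hashtag symbol, keep the word
    text = re.sub(r"[^a-z\s]", " ", text)           # remove numbers, punctuation, emoticons
    return re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace

print(clean_tweet("Fire alarm at @CityHall!! Updates: https://t.co/xyz #emergency :)"))
# -> "fire alarm at updates emergency"
```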

To improve the quality of extracted tweets, preprocessing methods are essential. Preprocessing is the fundamental first step of all data analytics and NLP tasks, reducing complexity and improving data quality. It is also a foremost part of information retrieval, removing unwanted data and improving the quality of the text.

Applying preprocessing methods such as removing special characters, URLs, hashtags, stop words, numbers, and punctuation from these unstructured tweets and filtering out unessential data is a challenging task. The cleaned text is then represented as individual tokens and encoded with representations such as bag-of-words, term-document matrix (TDM), document-term matrix (DTM), and term frequency-inverse document frequency (TF-IDF). The preprocessing step is necessary to extract meaningful, high-quality data from the corpus of tweets and to build the feature vector for further analysis. In previous work, much of the misclassification and confusion in machine learning was caused by the roughly 40% of unwanted data present in datasets (Fayyad et al., 2003), which needs to be identified and preprocessed. Therefore, preprocessing is a major step of data analytics: it improves data quality for machine learning, reduces data size, and enhances computational speed and accuracy.
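
The normalization and representation steps listed above can be sketched as follows, assuming the NLTK and scikit-learn Python libraries; the sample tweets are invented for illustration and are not taken from the chapter’s dataset.

```python
# Sketch of tokenization, stop word removal, stemming/lemmatization, and
# DTM / TF-IDF feature extraction, assuming NLTK (with its punkt, stopwords,
# and wordnet resources downloaded) and scikit-learn are installed.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> str:
    tokens = word_tokenize(text.lower())                  # tokenization
    tokens = [t for t in tokens if t.isalpha()]           # drop numbers and punctuation
    tokens = [t for t in tokens if t not in stop_words]   # stop word removal
    # the chapter applies both lemmatization and stemming to normalize tokens
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
    return " ".join(tokens)

tweets = [
    "Flooding reported near the river banks tonight",
    "Rivers are flooding the nearby roads again",
]
corpus = [normalize(t) for t in tweets]

# Document-term matrix (term counts) and TF-IDF weighted feature vectors.
dtm = CountVectorizer().fit_transform(corpus)
tfidf = TfidfVectorizer().fit_transform(corpus)
print(dtm.shape, tfidf.shape)
```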

The rest of this chapter is structured as follows. Section 2 reviews the related literature. Section 3 presents methods for data extraction and preprocessing of unstructured social media data. Section 4 describes the experimental setup. Section 5 discusses the results and analysis. Section 6 presents conclusions and future work.
