Clustering Methods and Tools to Handle High-Dimensional Social Media Text Data

DOI: 10.4018/978-1-6684-6909-5.ch003
Abstract

Social media data has changed the way big data is used. The amount of data available offers more natural insights that make it possible to find relations and social interactions. Natural language processing (NLP) is an essential tool for such a task. NLP promises to scale traditional methods that allow the automation of tasks for social media datasets. A social media text dataset with a large number of attributes is referred to as a high-dimensional text dataset. One of the challenges that high-dimensional text datasets pose for NLP text clustering is that not all the measured variables are important for understanding the underlying phenomena of interest, so dimension reduction needs to be performed. Nonetheless, for text clustering, the existing literature is remarkably segmented, and the well-known methods do not address the problems of the high dimensionality of text data. Thus, different methods were identified and classified into four areas. The chapter also describes evaluation metrics and technical tools, as well as future directions.
Introduction

The Big Data era has forced different industries to rethink computational solutions to obtain useful insights in real-life scenarios. Attention has been focused on the design of algorithms for analyzing the available information. However, analyzing this massive volume of data brings challenges for the extraction of meaningful information, because those data come in different forms.

Currently, Social Media Data (SMD) has changed how big data is used. The amount and types of data available (text, audio, image, video) offer richer and more natural insights expressing feelings or emotions that make it possible to find relations, language expressions, trending topics, and other social interactions (Kauffmann, et al., 2020). Therefore, to efficiently utilize SMD from various sources for decision-making and innovation, industries must use big data analytics to handle the information. For example, Chatterjeea et al. (2019) used a dataset of 17.62 million tweet conversational pairs to model the task of understanding emotions as a multi-class classification problem.

Typical applications of big data analytics for social media include monitoring activities to measure loyalty, keeping track of sentiment towards brands or products, observing the impact of campaigns and the success of marketing messages, and identifying and engaging top influencers. However, most existing approaches to social media analysis rely on Machine Learning (ML) techniques, which include text analytics over the available unstructured, qualitative text data. Thus, text analytics requires inspecting the target textual document (corpus) to turn it into structured information.

To process a text corpus, a general framework for text analytics proposed by Hu and Liu (2012) consists of three consecutive phases: text preprocessing, text representation, and knowledge discovery. Text preprocessing makes the input documents consistent to facilitate text representation for text analytics tasks. Traditional text preprocessing methods include stop-word removal and stemming. Text representation transforms documents into sparse numeric vectors; basic text representation models are the Bag of Words (BOW) and the Vector Space Model (VSM). In knowledge discovery, after the text corpus has been transformed into numeric vectors, existing ML methods such as classification or clustering are applied.
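The three phases above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's reference implementation; the stop-word list and the two-document corpus are toy assumptions.

```python
# Minimal sketch of the three-phase text-analytics framework:
# (1) preprocessing, (2) BOW representation, (3) knowledge discovery.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}  # toy list (assumption)

def preprocess(doc):
    """Phase 1: lowercase, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_bow(tokens, vocabulary):
    """Phase 2: represent a document as a BOW count vector."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

corpus = ["The brand campaign is trending", "Track the trending topics"]
processed = [preprocess(d) for d in corpus]
vocabulary = sorted({t for doc in processed for t in doc})
vectors = [to_bow(doc, vocabulary) for doc in processed]
# Phase 3 (knowledge discovery) would feed `vectors` into an ML
# method such as k-means clustering.
```

In practice the resulting vectors are very sparse and high-dimensional, which is exactly the motivation for the dimension-reduction methods discussed in this chapter.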

Hence, Natural Language Processing (NLP) is an essential and fundamental tool for such tasks. NLP promises to scale traditional supervised and unsupervised methodologies that allow the automation of tasks such as document classification, content analysis, sentiment analysis, part-of-speech tagging, machine translation, and information retrieval on new social media data. Standard tasks carried out with NLP include text mining, which encompasses association, categorization, and clustering. However, before implementation, and considering the general framework for text analytics, it is necessary to preprocess and represent the texts through indexing and encoding.

The process of text indexing consists of segmenting a text into a list of words, through three basic steps. The first is tokenization, the operation of segmenting a text into tokens by white spaces or punctuation marks. The second is stemming or lemmatization, in which each token is converted into its root form using grammar rules. The third is stop-word removal, in which grammatical words such as articles, conjunctions, and prepositions are removed. Tokenization is the prerequisite for the other two steps, but stemming and stop-word removal can be switched with each other, depending on the situation.
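The three indexing steps can be sketched as follows. This is a simplified illustration under stated assumptions: the suffix-stripping stemmer is a crude stand-in for a real algorithm such as Porter's, and the stop-word list is a toy subset.

```python
# Sketch of the three indexing steps: tokenization, stemming,
# and stop-word removal (in that order here, though the last two
# may be switched, as noted above).
import re

STOP_WORDS = {"the", "and", "in", "of", "a"}  # toy subset (assumption)

def tokenize(text):
    # Step 1: segment by white space / punctuation into lowercase tokens.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Step 2: crude root form via suffix stripping (not a full Porter stemmer).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def remove_stop_words(tokens):
    # Step 3: drop articles, conjunctions, prepositions, etc.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Tracking the trending topics in social media")
indexed = remove_stop_words([stem(t) for t in tokens])
```

Here `indexed` is the word list that would then be encoded into a numeric representation such as BOW or VSM.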

Key Terms in this Chapter

Evolutionary Text Clustering: The task of finding patterns within a set of mapped text data through temporal networks that have no fixed set of instances but grow as time passes. Information available in the history of the network is used to define the clusters for its next generation.

Dirichlet Method: A stochastic process used as a prior for mixture models, often applied in Bayesian inference to describe prior knowledge about the distribution of random variables. It enables clustering when the number of clusters is not known ahead of time.

Text Streams Clustering: The task of assigning a massive amount of text generated from different sources to a new or an existing cluster in a reasonable time, achieving clustering results through similarity-based or model-based stream methods. Similarity-based methods use the vector space model to represent documents; model-based methods commonly use Gibbs sampling to estimate the parameters of a mixture model.
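A similarity-based stream method can be illustrated with a minimal sketch: each incoming document, represented as a VSM count vector, joins the most similar existing cluster if its cosine similarity exceeds a threshold, and otherwise opens a new cluster. The single-representative-per-cluster simplification and the threshold value are assumptions made for brevity.

```python
# Illustrative similarity-based stream clustering over VSM vectors.
import math

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def stream_cluster(vectors, threshold=0.5):
    clusters = []      # one representative vector per cluster (simplification)
    assignments = []
    for vec in vectors:
        sims = [cosine(vec, rep) for rep in clusters]
        if sims and max(sims) >= threshold:
            assignments.append(sims.index(max(sims)))  # join nearest cluster
        else:
            clusters.append(vec)                       # open a new cluster
            assignments.append(len(clusters) - 1)
    return assignments

labels = stream_cluster([[1, 1, 0], [1, 0, 0], [0, 0, 1]])
```

A production stream method would additionally update cluster summaries incrementally and age out stale clusters; a model-based counterpart would instead sample cluster assignments with Gibbs sampling.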

Vector Space Model: An algebraic model in NLP that considers the relationships between texts (words and documents) and represents them as vectors of identifiers, so that similarities can be computed. This model makes two assumptions: (1) each term is unique, and (2) the terms have no order. Thus, the relevance of a word or document is given by its similarity to the query.
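The query-similarity notion of relevance can be shown with a toy example: documents and a query become term-count vectors over a shared vocabulary, and relevance is the query-document cosine similarity. The vocabulary and documents here are illustrative assumptions.

```python
# Toy vector space model: relevance as query-document cosine similarity.
import math
from collections import Counter

def to_vector(tokens, vocabulary):
    """Map a token list to a count vector over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts.get(t, 0) for t in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

docs = [["social", "media", "clustering"], ["image", "compression"]]
query = ["media", "clustering"]
vocabulary = sorted({t for d in docs for t in d} | set(query))
q_vec = to_vector(query, vocabulary)
scores = [cosine(q_vec, to_vector(d, vocabulary)) for d in docs]
best = scores.index(max(scores))   # index of the most relevant document
```

Note how term order plays no role, matching assumption (2) above.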

Optimization Text Clustering: Evolutionary and nature-inspired techniques that adapt to a changing environment, learn to perform better, and can explain their decisions, with the aim of improving the efficiency of document clustering.

Word Embedding: A dense vocabulary representation in which similar words have similar encodings, mapping each word into a low-dimensional distributed real-valued vector. Word embeddings capture the context of words in a document, semantic and syntactic similarity, relations with other words, and so on. The technique is often associated with the field of deep learning.

Deep Learning Text Clustering: A deep neural network approach that combines feature extraction, dimensionality reduction, and clustering to learn representations from text data and adapt them to the clustering module.
