Predicting Marathi News Class Using Semantic Entity-Driven Clustering Approach


Jatinderkumar R. Saini, Prafulla Bharat Bafna
Copyright: © 2021 | Pages: 13
DOI: 10.4018/JCIT.20211001.oa12

Abstract

Document management is a need of the era, and managing documents in regional languages is a significant and largely untouched area. A Marathi corpus consisting of news is processed to form a Group Entity Document Matrix for Marathi (GEDMM), a Vector Space Model for Marathi (VSMM) and a Hysynset Vector Space Model for Marathi (HSVSMM). GEDMM uses entity groups extracted using a conditional random field (CRF). Frequent terms are used to construct VSMM using TF-IDF. HSVSMM uses synsets built from hypernyms-hyponyms and synonyms. GEDMM and HSVSMM achieve dimension reduction by selecting significant feature groups. Hierarchical agglomerative clustering (HAC) is applied, and a dendrogram is produced to visualize the clusters. The performance analysis is carried out using several parameters such as entropy, purity, misclassification error and accuracy. The clusters produced using GEDMM show the minimum entropy and the highest purity. A random forest classifier is applied, and the results are evaluated using misclassification error and accuracy.

Introduction

In the contemporary era, data is present in many languages. Processing English text is common and supported by a rich literature, but processing data in regional languages like Marathi is a challenging task. Abundant data is available in Marathi, and classifying Marathi text using dimension reduction, by selecting appropriate tokens so that context is involved in the classification process, is the need of the era.

Textual data is a popular means of information exchange. The available data is divided into structured (tabular form), unstructured (reviews, comments, emails, etc.) and semi-structured (HTML) data (Maksimenko et al., 2020) (Gao et al., 2020). Several techniques are available to mine structured data (Gharehchopogh & Khalifelu, 2011). To process a huge amount of data, text mining techniques are available to generate useful patterns. Clustering and classification are popularly used techniques (Larsen & Aone, 1999) to identify patterns in the data. Several steps need to be carried out to process the text. Text documents are formed of sentences. The sentences need to be fragmented into tokens by splitting on blanks and other punctuation. Stop words (Meyer et al., 2008) (Aggarwal & Zhai, 2012) do not contribute to the decision-making process and increase noisy features; they are removed to reduce the dimensions. Lemmatization is an effective way to bring words into their meaningful root form (Sharma, 2019) (Kubosawa et al., 2019). For example, in the sentence ‘राम मंदिरासाठी २८ वर्ष उपास करणारी आधुनिक शबरी’ (“a modern Shabari who fasted for 28 years for the Ram temple”), tokenization is implemented first, which results in 8 tokens. ‘मंदिरासाठी’ (“for the temple”) is converted to the lemma ‘मंदिर’ (“temple”). ‘२८’ is ignored in the preprocessing steps. ‘मंदिर’ is tagged as a noun by a part-of-speech (PoS) tagger.
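The preprocessing pipeline above can be sketched in a few lines of Python. This is a minimal illustration on the example sentence only; the suffix list used to approximate lemmatization and the numeral-dropping rule are assumptions for illustration, not the authors' actual preprocessing tools.

```python
# Minimal preprocessing sketch: tokenize, drop numerals, strip an
# illustrative case suffix to approximate a lemma.
import re

def tokenize(sentence):
    """Split a sentence into tokens on whitespace and punctuation."""
    return [t for t in re.split(r"[\s,.?!]+", sentence) if t]

def preprocess(tokens, suffixes=("साठी",)):
    """Drop numeral tokens and strip illustrative suffixes (crude stemming)."""
    lemmas = []
    for tok in tokens:
        if tok.isdigit():          # Devanagari numerals such as २८ are dropped
            continue
        for suf in suffixes:
            if tok.endswith(suf):  # मंदिरासाठी loses the 'साठी' case suffix
                tok = tok[:-len(suf)]
                break
        lemmas.append(tok)
    return lemmas

sentence = "राम मंदिरासाठी २८ वर्ष उपास करणारी आधुनिक शबरी"
tokens = tokenize(sentence)
print(len(tokens))          # the sentence yields 8 tokens
print(preprocess(tokens))   # numeral removed, suffix stripped
```

A real lemmatizer maps मंदिरासाठी all the way to the dictionary form मंदिर; the suffix-stripping here only approximates that step.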

POS tagging is also a significant activity that identifies nouns, adjectives and other parts of speech and helps to perform morphological analysis of the text (Bustikova et al., 2020).

Unique lemmas can be weighted using the TF-IDF measure, a deterministic statistical procedure preferred for its simple and effective calculation. It is based on the frequency count of words and is normalized by considering word occurrences across the entire corpus.
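The TF-IDF weighting just described can be sketched directly from its definition: term frequency within a document, scaled by the log inverse of how many documents contain the term. The three toy "documents" below are illustrative assumptions, not drawn from the paper's corpus.

```python
# TF-IDF sketch: weight = (term count / doc length) * log(N / doc frequency).
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter()                 # in how many documents each term appears
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (tf[t] / len(doc)) * math.log(n / df[t])
            for t in tf
        })
    return weights

docs = [["मंदिर", "राम"], ["मंदिर", "उपास"], ["शबरी", "राम"]]
w = tfidf(docs)
# मंदिर appears in 2 of 3 documents, so its IDF factor is log(3/2);
# a term present in every document would get weight 0.
```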

A corpus is a set of processed text data on which various operations, e.g. preprocessing and linguistic analysis, are applied. Processing a corpus in the English language is an easy task due to the wide availability of resources, but processing a corpus in a regional language (Hanumanthappa & Swamy, 2014) like Marathi (Bafna & Saini, 2020 (2)) is a challenging and largely untouched domain of text mining (Vijaymeena & Kavitha, 2016).

Named Entity Recognition (NER) (Nadeau & Sekine, 2007) is regarded as a subtask of information extraction. It first detects proper nouns, called named entities, in the corpus and then assigns each entity a tag such as location, person, etc.
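The two NER steps, detecting entities and tagging them, can be illustrated with a toy lookup-based tagger. A real system, such as the CRF-based approach used in this paper, learns these decisions from annotated data; the gazetteer entries below are purely illustrative assumptions.

```python
# Toy gazetteer-based NER sketch: known entities get a tag,
# everything else gets the conventional 'O' (outside) tag.
GAZETTEER = {
    "राम": "PERSON",     # Ram
    "शबरी": "PERSON",    # Shabari
    "पुणे": "LOCATION",  # Pune (hypothetical entry)
}

def tag_entities(tokens):
    """Return (token, tag) pairs for a tokenized sentence."""
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]

print(tag_entities(["राम", "मंदिर", "शबरी"]))
```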

A conditional random field (CRF) is a classifier model that captures the dependency between the class and the entity. It gives normalized results by avoiding bias, and is used for multi-class problems (Kim et al., 2020).

Clustering is an unsupervised technique which groups unlabeled objects into several groups, whereas classification is a data mining technique used for labeled datasets. Clustering can be carried out in several ways. Hierarchical agglomerative clustering is one of these ways and follows a bottom-up approach; it is a popularly used technique for grouping texts due to its high accuracy. Every object starts in its own cluster, and pairs of clusters are merged while moving up the levels in the hierarchy.
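The bottom-up merging can be sketched without any library: start with singleton clusters and repeatedly merge the closest pair. The one-dimensional toy values and single-linkage distance below are illustrative assumptions; the paper clusters high-dimensional document vectors.

```python
# Agglomerative (bottom-up) clustering sketch with single linkage.
def hac(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]        # every object starts alone
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)      # merge the closest pair
    return clusters

print(hac([0.1, 0.2, 5.0, 5.1, 9.9], 3))
# three natural groups emerge: {0.1, 0.2}, {5.0, 5.1}, {9.9}
```

The sequence of merges recorded by this loop is exactly what a dendrogram visualizes.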

Entropy measures the wrongly clustered data objects; minimum entropy represents relevant clusters. VSMM (Vector Space Model for Marathi) has documents occurring in rows and dimensions occupying columns. HSVSMM has hysynsets (hypernyms-hyponyms and synonyms in one group) as the features of the matrix. GEDMM (Group Entity Document Matrix for Marathi) represents grouped entities as columns and Marathi news documents as rows. In all the matrices, feature weights are recorded.
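The entropy and purity measures used in the evaluation can be computed from the true class labels of the members of each cluster, as sketched below. The tiny labeled clustering is an illustrative assumption, not the paper's data.

```python
# Cluster-quality sketch: weighted label entropy per cluster (lower is
# better) and purity as the overall share of majority-class members
# (higher is better).
import math
from collections import Counter

def entropy_purity(clusters):
    """clusters: list of lists of true class labels, one list per cluster."""
    n = sum(len(c) for c in clusters)
    entropy = purity = 0.0
    for c in clusters:
        counts = Counter(c)
        entropy += (len(c) / n) * -sum(
            (k / len(c)) * math.log2(k / len(c)) for k in counts.values()
        )
        purity += counts.most_common(1)[0][1] / n  # majority-class share
    return entropy, purity

e, p = entropy_purity([["sport", "sport"], ["politics", "politics", "sport"]])
# the mixed second cluster raises entropy and lowers purity
```

A perfect clustering, where each cluster holds a single class, yields entropy 0 and purity 1.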
