Named Entity System for Tweets in Hindi Language

Arti Jain, Anuja Arora
Copyright: © 2018 |Pages: 22
DOI: 10.4018/IJIIT.2018100104

Abstract

With the growing need for smart-health applications in Hindi, demand for a health-related Named Entity Recognition (NER) system for Hindi is rising quickly. To address this, the research extracts tweets dated 1st October 2016 to 15th October 2017 from Patanjali, Dabur and other Hindi language-oriented, Twitter-based health sites, and considers four NE types: Person, Disease, Consumable and Organization. To the best of the authors' knowledge, this Twitter dataset and set of NE types is one of the first such resources for Hindi. This article introduces a three-stage NER system for tweets in Hindi (the HinTwtNER system): a pre-processing stage; a machine learning stage (Hyperspace Analogue to Language (HAL) and Conditional Random Field (CRF)); and a post-processing stage. HinTwtNER uses binary features and achieves an overall F-score of 49.87%, which is comparable to Twitter-based NER systems for English and other languages.
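
The abstract names HAL as part of the machine learning stage. As a rough illustration only, the sketch below shows how a HAL-style co-occurrence matrix is commonly built: a window slides over the token sequence, and each word that precedes the target within the window adds a weight that shrinks with distance. The function name `hal_matrix`, the window size, the weighting `window - distance + 1` and the example tweet are all assumptions made for illustration; the preview does not show the authors' implementation or how HAL output is combined with the CRF.

```python
from collections import defaultdict


def hal_matrix(tokens, window=5):
    """Build a HAL-style co-occurrence table (illustrative sketch).

    hal[target][preceding] accumulates a weight for every word that
    appears up to `window` positions before `target`; closer words
    receive larger weights (window - distance + 1).
    """
    hal = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i - distance
            if j < 0:
                break  # ran off the start of the sequence
            hal[target][tokens[j]] += window - distance + 1
    return hal


if __name__ == "__main__":
    # Made-up example tweet, split on whitespace (not taken from the dataset).
    tweet = "पतंजलि का शहद"
    table = hal_matrix(tweet.split(), window=2)
    for target, context in table.items():
        print(target, dict(context))
```

Whitespace splitting is used here on purpose: Devanagari vowel signs are combining marks, so a naive `\w+` regular expression may break Hindi words apart.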

1. Introduction

Online Social-media Networks (OSNs) (Kwak et al., 2010; Lin & Huang, 2013; He et al., 2014; Baldwin et al., 2015; Zhu et al., 2016; Cresci et al., 2017; Egele et al., 2017) such as Facebook and Twitter are becoming an everyday part of many people's lives, and they play a major role in modern society. These OSNs act as key elements in transforming the cultural, social, technological and other diverse aspects of modern civilization, which in turn impacts various sectors such as business, education, health and psychology. Statistics (Facebook, 2017) reveal that Facebook currently has 2.01 billion monthly active users. On Twitter (Twitter, 2017), around 6,000 tweets are posted every second on average, corresponding to over 350,000 tweets per minute, 500 million tweets per day and around 200 billion tweets per year. Tweets are short messages restricted to a maximum length of 140 characters. They are often noisy, containing spelling and grammatical mistakes (because of informal, mixed and gibberish language); short forms of words (because of slang); multiple words merged together; and special symbols and characters (such as emoticons (._.)) embedded within words. Even so, users nowadays prefer to tweet for the following reasons:

  • Users do not get the posts they prefer on their newsfeed, i.e. the system does not analyze and display posts according to users’ interests perfectly;

  • Users prefer not to read long posts, even on topics that interest them, and prefer short posts most of the time;

  • Users prefer posts with images, which are easier to understand than posts containing facts alone.

So, Twitter is used by a large number of users to share their posts, follow-ups, re-tweets etc. on a variety of trending topics as tweets. Although this gives an idea of what is current, important and popular among Twitter users, it is tedious to sift through the vast pool of tweets. In order to filter out specific tweets from millions of tweets, researchers have applied numerous Natural Language Processing (NLP) utilities such as Named Entity Recognition (NER) (Liu et al., 2011; Li et al., 2012; Cano et al., 2014; Derczynski et al., 2014; Godin et al., 2015; Rizzo et al., 2015; Belainine et al., 2016; Sikdar & Gambäck, 2016; Baksa et al., 2017; Lopez et al., 2017; Tran et al., 2017). In such work, NER-based tweet topic extraction usually plays a vital role and seems to provide effective results compared with other approaches. So, taking advantage of NER, this research work filters out tweets relevant to a specific theme.
