Introduction
Named Entities are names of famous places, persons, artifacts, etc. For example, in the sentence "I love staying in Manhattan", the word "Manhattan" is a Named Entity (NE) because it is the name of a famous location. Identifying the Named Entities in a sentence is the task of a Named Entity recognizer. Named Entity Recognition (NER) is important in many applications: in sentiment analysis it helps in removing words with no sentiment attached, and in Document Retrieval it helps in narrowing down the results. For English, many systems have been designed to tackle the problem, prominent ones being the Stanford NER tagger (Finkel, Grenager & Manning, 2005) and the Illinois NER tagger (Ratinov & Roth, 2009). NER is also widely used in applications such as Question Answering systems, Coreference Resolution, Query Labeling (Bhargava, Y. Sharma, S. Sharma & Baid, 2015), Sentiment Analysis (Bhargava, Y. Sharma & S. Sharma, 2016), Question Classification (Bhargava, Khandelwal, Bhatia & Sharma, 2016), etc.
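To make the task concrete, the following is a minimal sketch of the kind of output a recognizer produces for the example sentence. It uses a toy gazetteer lookup, which is an assumption made purely for illustration and is not the method of the Stanford or Illinois taggers or of this paper. The `GAZETTEER` contents, the `LOCATION` label and the `tag_entities` function are all hypothetical.

```python
# Toy gazetteer: a hand-written, hypothetical list of known entity names.
GAZETTEER = {"manhattan": "LOCATION"}


def tag_entities(sentence):
    """Label each whitespace token with an entity type, or "O" for non-entities."""
    tagged = []
    for token in sentence.split():
        # Strip surrounding punctuation and ignore case before the lookup.
        key = token.strip(".,!?").lower()
        tagged.append((token, GAZETTEER.get(key, "O")))
    return tagged


if __name__ == "__main__":
    for token, label in tag_entities("I love staying in Manhattan"):
        print(token, label)
```

A lookup of this kind only finds names it already knows, which is one reason practical taggers rely on learned models rather than fixed lists.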
With the rapid growth in the use of online networking websites such as Facebook, Twitter and Instagram, written sentences have tended to become more informal. The informal sentences found on social media differ in characteristic ways from region to region. In countries where English is the major language, an informal sentence generally contains shortcuts for specific words, acronyms, emoticons and hashtags. In countries where English is not the major language, there is a further problem in addition to the use of these tokens: words from the native language are used in a sentence alongside English words. This tendency to use native-language words in sentences along with English is called code-mixing. Code-mixing lets users express their opinions or feelings without the boundaries of a single language. The trend is generally observed when the writer is more comfortable expressing certain phrases of a sentence in the native language. For example, consider the sentence "You have that DJ wala look". In this sentence, the word "wala" is Hindi while the other words are English. Generally, the syntax and grammar of code-mixed sentences are not the same as those of a native English statement. When the language used becomes informal, the available linguistic tools become less reliable because of the change in the grammar rules. This causes the native NER taggers to perform poorly on social media texts.
Multilingual code-mixing can also occur, where a sentence contains more than two languages. For example, consider "You have that DJ wala look kani, konchem hairstyle change chesi unte thoda acha hoga". In this sentence three languages are mixed, namely English, Telugu and Hindi. To deal with a bilingual code-mixed sentence, one needs to consider the grammar patterns of both languages along with the hashtags, emoticons and acronyms. With multilingual code-mixed sentences, one needs to deal with the patterns observed in all the languages along with these other tokens.
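The mixing in the example above can be made visible with a minimal word-level language-tagging sketch. It is an illustration only, not the approach proposed in this paper. The `LEXICON` word lists, the assignment of each word to a language, the language codes and the function names are assumptions that cover just the example sentence.

```python
# Hypothetical, hand-written word lists covering only the example sentence.
LEXICON = {
    "hi": {"wala", "thoda", "acha", "hoga"},        # Hindi (assumed assignment)
    "te": {"kani", "konchem", "chesi", "unte"},     # Telugu (assumed assignment)
    "en": {"you", "have", "that", "dj", "look", "hairstyle", "change"},
}


def tag_languages(sentence):
    """Return (token, language code) pairs; "unk" marks words not in LEXICON."""
    tagged = []
    for token in sentence.split():
        # Strip surrounding punctuation and ignore case before the lookup.
        key = token.strip(".,!?").lower()
        lang = next((code for code, words in LEXICON.items() if key in words), "unk")
        tagged.append((token, lang))
    return tagged


def languages_used(sentence):
    """Return the set of known languages that appear in the sentence."""
    return {lang for _, lang in tag_languages(sentence) if lang != "unk"}


if __name__ == "__main__":
    example = ("You have that DJ wala look kani, konchem hairstyle "
               "change chesi unte thoda acha hoga")
    for token, lang in tag_languages(example):
        print(token, lang)
```

Under these assumed word lists, the bilingual sentence "You have that DJ wala look" yields two languages and the longer example yields three, and a tool for multilingual text has to handle the patterns of each of them.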
Analysing the social media texts written by a user can reveal many things, such as the person's state of mind or the person's opinion on a certain issue, event or product. For example, consider a hypothetical ad campaign for a famous brand on social media platforms. For the company to analyse the response to the campaign, it needs tools that can analyse the text written on those platforms. At present, there are tools for basic English and other monolingual texts, and there are tools for handling monolingual informal sentences in some of the widely spoken languages. Unfortunately, few tools are available for handling multilingual or code-mixed sentences.
In this paper, the problem of Named Entity Recognition is tackled for Hindi-English code-mixed text. The remainder of the paper is organized as follows. First, the challenges specific to both social media and code-mixed texts are identified. Second, the related work of the previous couple of years is reviewed as background. Three approaches to NER are then proposed in the proposed methodology section. Next, the data set provided by the CMEE-IL 2016 Task Organizers (Rao & Devi, 2016) is analysed with respect to its 5 major categorical tags. Lastly, the experiments conducted, the results obtained and the error analysis of the proposed method are presented, followed by the conclusion and future scope.