HITS@FIRE task 2015: Twitter based Named Entity Recognizer for Indian Languages

Natural language processing (NLP) provides the means to transform natural language text into useful information. Named Entity Recognition (NER) is a key NLP task that identifies and classifies named entities in natural language text. Although several algorithms exist for named entity classification, identifying named entities in Twitter data remains a demanding task. People share large volumes of information on Twitter every day. This information is unstructured and often carries important details about organizations, politics, disasters, promotional advertisements, etc. In this paper, we present an NER system that can effectively classify named entities in Twitter data for languages used in India, such as English, Hindi and Tamil. POS, chunk, suffix and prefix information is used as features to train a Conditional Random Fields (CRF) based NER model. CRF is a popular model for labeling and classification in text mining. Performance was analyzed using n-fold validation and F-measure. Under n-fold validation, a maximum precision of 93.82 for English, 92.28 for Hindi and 86.94 for Tamil Twitter data was achieved. The results reported by the ESM-IL shared task, in terms of precision, are 50.48 for English, 81.49 for Hindi and 70.42 for Tamil. The proposed approach thus attains higher precision under n-fold validation than in the shared-task evaluation.
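To make the feature set concrete, the following is a minimal sketch of how POS, chunk, prefix and suffix information might be turned into per-token features for a CRF, together with a token-level precision, recall and F-measure calculation. It is not the authors' implementation. The function names, the affix length of 3, the inclusion of the lowercased word as a feature, the tag names and the token-level scoring scheme are all assumptions made for illustration, and the POS and chunk tags are assumed to come from external tools.

```python
from typing import Dict, List, Tuple

# Each token is (word, pos_tag, chunk_tag). The tags are assumed to be
# produced beforehand by an external POS tagger and chunker.
Token = Tuple[str, str, str]


def token_features(sentence: List[Token], i: int, affix_len: int = 3) -> Dict[str, str]:
    """Build a feature dictionary for the i-th token of a sentence.

    affix_len is a hypothetical setting; the abstract does not state
    which prefix and suffix lengths were used.
    """
    word, pos, chunk = sentence[i]
    return {
        "word": word.lower(),          # assumption: the word itself is a feature
        "pos": pos,                    # part-of-speech tag
        "chunk": chunk,                # chunk tag
        "prefix": word[:affix_len],    # leading characters of the word
        "suffix": word[-affix_len:],   # trailing characters of the word
    }


def sentence_features(sentence: List[Token]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token in the sentence, in order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


def precision_recall_f1(gold: List[str], predicted: List[str],
                        outside: str = "O") -> Tuple[float, float, float]:
    """Token-level precision, recall and F-measure over entity labels.

    Tokens labeled `outside` are not counted as entities. Scoring per
    token (rather than per entity span) is an assumption of this sketch.
    """
    true_pos = sum(1 for g, p in zip(gold, predicted) if p != outside and g == p)
    n_predicted = sum(1 for p in predicted if p != outside)
    n_gold = sum(1 for g in gold if g != outside)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative tweet fragment with made-up tags.
    tweet = [("Chennai", "NNP", "B-NP"), ("floods", "NNS", "I-NP")]
    for feats in sentence_features(tweet):
        print(feats)
    print(precision_recall_f1(["B-LOC", "O", "B-PER"], ["B-LOC", "B-ORG", "O"]))
```

Feature dictionaries of this shape could be passed to any CRF toolkit that accepts per-token features. For n-fold validation, the labeled tweets would be split into n parts, with a model trained on n-1 parts and scored on the remaining part in turn.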