Amazigh part-of-speech tagging with machine learning and deep learning

Evaluating LSTM Networks, HMM and WFST in Malay Part-of-Speech Tagging

Journal of Telecommunication, Electronic and Computer Engineering, 2017

Long short-term memory (LSTM) networks have been gaining popularity in sequence modeling tasks such as phoneme recognition, speech translation, language modeling, speech synthesis, chatbot-like dialog systems and others. This paper investigates attention-based encoder-decoder LSTM networks for Malay part-of-speech (POS) tagging and compares them with a weighted finite state transducer (WFST) and a hidden Markov model (HMM). The appeal of LSTM networks lies in their strength in modeling long-distance dependencies. Malay POS tagging is examined under two conditions: with and without morphological information. The experimental results show that LSTM networks trained without any explicit morphological knowledge perform nearly as well as the WFST and better than the HMM approach trained with morphological information.
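To make the point about long-distance dependencies concrete, here is a minimal sketch of one LSTM step in plain Python. It is a generic single-unit cell with scalar inputs, not the attention-based encoder-decoder evaluated in the paper; the function names and the `params` layout are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """Run one step of a single-unit LSTM cell on scalar values.

    params maps each gate name ("input", "forget", "output", "candidate")
    to a tuple (w_x, w_h, b). This layout is an illustrative assumption.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # proposed new content

    c = f * c_prev + i * g   # additive update of the cell state
    h = o * math.tanh(c)     # hidden state passed to the next step
    return h, c


def run_lstm(xs, params):
    """Feed a sequence of scalars through the cell, starting from zero state."""
    h, c = 0.0, 0.0
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        states.append((h, c))
    return states
```

When the forget gate stays close to 1 and the input gate close to 0, the cell state `c` is carried across steps almost unchanged, which is the mechanism usually credited for retaining information over long spans.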

A Study on the Performance of Recurrent Neural Network based Models in Maithili Part of Speech Tagging

ACM Transactions on Asian and Low-Resource Language Information Processing

This article presents our effort in developing a Maithili Part of Speech (POS) tagger. Substantial effort has been devoted to developing POS taggers in several Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, Punjabi, and Marathi, but Maithili has not received much attention from the research community. Maithili is one of the official languages of India, with around 50 million native speakers. We therefore worked on developing a POS tagger for Maithili. For the development, we use a manually annotated in-house Maithili corpus containing 56,126 tokens. The tagset contains 27 tags. We train a conditional random fields (CRF) classifier to prepare a baseline system that achieves an accuracy of 82.67%. Then, we employ several recurrent neural network (RNN)-based models, including Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), LSTM with a CRF layer (LSTM-CRF), and GRU with a CRF layer (GRU-CRF), and perform a comparative study. We also study the effect of both word ...
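Accuracy figures such as the 82.67% baseline are commonly computed per token. The abstract does not state the exact protocol, so the sketch below assumes plain token-level accuracy; the function name and the example tags in the usage note are hypothetical.

```python
def token_accuracy(gold, predicted):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    gold and predicted are lists of sentences, each a list of tag strings.
    Token-level scoring is an assumption; the source does not specify it.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence length mismatch")
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            correct += int(gold_tag == pred_tag)
            total += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total
```

For example, `token_accuracy([["N", "V"], ["N"]], [["N", "N"], ["N"]])` scores two of three tokens as correct.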

Part of Speech Tagging in Urdu: Comparison of Machine and Deep Learning Approaches

IEEE Access

In Urdu, part of speech (POS) tagging is a challenging task because the language is morphologically rich, both inflectionally and derivationally. Verbs are generally considered more highly inflected than nouns in Urdu. POS tagging is used as a preliminary linguistic text analysis in diverse natural language processing domains such as speech processing, information extraction, machine translation and others. It is a task that first identifies the appropriate syntactic category for each word in running text and then assigns the predicted syntactic tag to that word. The current work is an extension of our previous work [1], in which we presented a conditional random field (CRF) based POS tagger with both language-dependent and language-independent feature sets. In the current study we a) implement both machine learning and deep learning models for the Urdu POS tagging task with a well-balanced language-independent feature set and b) highlight the diverse challenges that make Urdu POS tagging difficult. We demonstrate the effectiveness of machine learning and deep learning models for the Urdu POS task and empirically evaluate the performance of all models on two benchmark datasets. The core models evaluated in this study are CRF, support vector machine (SVM), two variants of the deep recurrent neural network (DRNN) and a variant of the n-gram Markov model, the bigram hidden Markov model (HMM). The two DRNN variants evaluated are a forward Long Short-Term Memory (LSTM)-RNN and an LSTM-RNN with CRF output.

INDEX TERMS Urdu; Part of speech (POS); Conditional random field (CRF); Support vector machine (SVM); recurrent neural network (RNN); Hidden Markov model (HMM)
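As an illustration of how a bigram HMM tagger picks a tag sequence, the following sketch implements Viterbi decoding in plain Python. The two-tag model (`TAGS`, `START_P`, `TRANS_P`, `EMIT_P`) uses made-up probabilities and English toy words, and the probability floor for unseen events is an assumption; none of it comes from the paper.

```python
import math

# Hypothetical toy model, for illustration only.
TAGS = ["N", "V"]
START_P = {"N": 0.8, "V": 0.2}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT_P = {
    "N": {"people": 0.7, "fish": 0.3},
    "V": {"people": 0.1, "fish": 0.9},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM.

    Scores are kept in log space; events missing from the tables get
    the small probability `floor` instead of zero (an assumption).
    """
    if not words:
        return []

    def log_p(p):
        return math.log(max(p, floor))

    # Each column maps tag -> (best log score so far, previous tag).
    columns = [{
        t: (log_p(start_p.get(t, 0.0)) + log_p(emit_p[t].get(words[0], 0.0)), None)
        for t in tags
    }]
    for word in words[1:]:
        prev_col = columns[-1]
        col = {}
        for t in tags:
            score, prev = max(
                (prev_col[p][0] + log_p(trans_p[p].get(t, 0.0)), p) for p in tags
            )
            col[t] = (score + log_p(emit_p[t].get(word, 0.0)), prev)
        columns.append(col)

    # Follow back-pointers from the best final tag to the first position.
    path = [max(tags, key=lambda t: columns[-1][t][0])]
    for col in reversed(columns[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path
```

With the toy tables, `viterbi(["people", "fish"], TAGS, START_P, TRANS_P, EMIT_P)` tags "people" as `N` and "fish" as `V`, because the `N` to `V` transition and the emission of "fish" from `V` dominate the alternatives.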

Bidirectional LSTMs-CRFs Networks for Bangla POS Tagging

Part-of-speech (POS) information is one of the fundamental components in the natural language processing pipeline, which helps in extracting higher-level information such as named entities, discourse, and the syntactic structure of a sentence. For some languages, such as English, Dutch, and Chinese, it is considered a solved problem because of the high accuracy (97%) of the tagging systems. Significant efforts have been made for such languages in terms of making the data publicly accessible and organizing evaluation campaigns. By comparison, far fewer efforts have been made for Bangla (ethnonym: Bangla; exonym: Bengali). In this paper, we present a knowledge-poor approach for POS tagging, which we evaluated using a publicly accessible dataset from LDC. The motivation for our approach is that we did not want to rely on existing resources such as a lexicon or a named entity recognizer for designing the system, as they are not publicly available and are difficult to develop. We have not used any hand-crafted features; rather, we employed distributed representations of words and characters. We designed the system using Long Short Term Memory (LSTM) neural networks followed by Conditional Random Fields (CRFs), with the inclusion of a pre-trained word embedding model. We obtained promising results with an accuracy of 86.0%.

BaNeP: An End-to-End Neural Network Based Model for Bangla Parts-of-Speech Tagging

IEEE Access

In Natural Language Processing, Parts-of-Speech tagging is a vital component that significantly impacts applications like machine translation, spell checking, information retrieval, and speech processing. In languages such as English and Dutch, POS tagging is considered a solved problem (accuracy: 97%). However, for low-resource languages like Bangla, challenges remain. In this article, we have proposed a novel RNN-based network named BaNeP to determine parts of speech for Bangla words. The proposed network extracts structural features through a bidirectional LSTM-based sub-network, and intricate contextual relations among words of a sentence are identified through an elaborate weighted context extraction procedure. These features are then combined to generate the final Parts-of-Speech prediction. Training the model requires only an annotated dataset, eliminating the need for any hand-crafted features. Experimental results on the LDC2010T16 dataset show significant accuracy improvement compared to existing Bangla POS taggers.

INDEX TERMS Bangla, POS tagging, RNN, sequence labeling.

I. INTRODUCTION

Part-of-Speech (POS) Tagging, also known as grammatical tagging or word category disambiguation, is a popular natural language processing (NLP) task that refers to mapping words in a text or corpus to the corresponding part of speech, depending on the structure of the word and its context. Although POS tagging may not be the solution to any particular NLP problem alone, it is a prerequisite for many NLP applications as it provides a linguistic signal on how a word is being used within the scope of a phrase, sentence or document. Earlier, when there was no language to communicate, humans used sign language to exchange their thoughts, like how we communicate with our pets. Suppose, when we tell our dog, "Cooper, we love you", he responds by wagging his tail. This does not mean he actually understands what we say, but he can read our expressions, and understand our emotions and gestures more than words. As the most intellectual beings, humans have developed an understanding of many nuances of natural languages, more than any other animals on this planet.

... 'refuse' has been used twice with two different meanings. In the first case, 'refuse' is a verb meaning 'deny', while later 'refuse' is a noun meaning 'trash'. These two 'refuse'-es are not homophones. So, it is crucial to identify the proper Part-of-Speech of a word to pronounce it correctly. Other popular NLP applications, such as information retrieval, emotion analysis, spell checking, word sense disambiguation, etc., also perform POS tagging in preprocessing.

Considering the importance of POS tagging in NLP, a tremendous amount of research has been done, and is still going on, to develop efficient networks for languages like English, Dutch and Chinese. Traditional high-performance models for those languages are mostly based on the Hidden Markov Model (HMM) and Conditional Random Fields (CRF) [1], [2], [3]. However, those models require hand-crafted features and task-specific resources like carefully designed word spelling features, orthographic features and gazetteers. These task-specific resources make the system costly to develop and difficult to adapt to new tasks or domains. Recently, non-linear neural networks with a distributed word representation, known as word embeddings, have been broadly applied for higher accuracy. In the past few years, researchers have already developed high-accuracy systems with the help of the Recurrent Neural Network (RNN) or its variants, such as Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), for high-resource languages like English [4], [5], [6]. But low-resource languages such as Bangla still lack efficient and accurate POS tagging models.

The associate editor coordinating the review of this manuscript and approving it for publication was Long Xu.

Myanmar Language Part-of-Speech Tagging Using Deep Learning Models

2019

Part-of-speech (POS) tagging is one of the most important processes in Natural Language Processing (NLP). It is useful in many areas of linguistic research such as information retrieval, natural language translation, word sense disambiguation and sentiment analysis. The goal of this process is to correctly assign a POS tag to each word in a sentence. It is also an essential process for Myanmar language translation. Although there are many approaches to Myanmar language POS tagging, this paper proposes deep learning models. In this work, a Recurrent Neural Network (RNN) with Bi-directional Long Short-Term Memory (Bi-LSTM RNN) is applied to Myanmar word segmentation and POS tagging. Moreover, GloVe is used to produce syllable vector representations and word vector representations.

Nepali POS Tagging Using Deep Learning Approaches

NU. International Journal of Science, 2020

Deep learning approaches are being used extensively in Part of Speech (POS) tagging. POS tagging is one of the important steps in Natural Language Processing (NLP) applications, including machine translation, information retrieval, question answering systems, word sense disambiguation, text summarization, named entity recognition, text-to-speech conversion and classification. The efficiency of POS tagging relies heavily on syntactic and contextual information and on the morphology of the language. POS tagging in the Nepali language is very difficult as it is morphologically rich. This research paper focuses on implementing and comparing various deep learning approaches for POS tagging in the Nepali language. Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Long Short Term Memory (LSTM) and Bidirectional LSTM were implemented on a tagged Nepali corpus. The Bidirectional LSTM (Bi-LSTM) gave better results than the other approaches.

Keywords: POS, NLP, RNN, GRU, LSTM, Bi-LSTM, Nepali corpus

Utilizing Morphological Features for Part-of-Speech Tagging of Bahasa Indonesia in Bidirectional LSTM

2020 6th International Conference on Science in Information Technology (ICSITech)

Part of Speech (PoS) tagging has been widely explored, especially for high-resource languages such as English. However, only a small number of studies have been conducted for Bahasa Indonesia. In this study, we present our experiment on utilizing morphological features for PoS tagging of Bahasa Indonesia in a Bidirectional Long Short Term Memory architecture. Three different features, namely prefix, suffix, and capitalization, have been examined. The results of our study show that combining morphological features with word embeddings is effective for improving the tagger performance. Our study also provides a more detailed explanation of which morphological features are useful for the PoS tagging task.
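The three features named above can be approximated with simple string operations. The sketch below uses fixed-length character affixes rather than a real morphological analysis; the function name, the default affix length of 3, and the output keys are assumptions, since the abstract does not say how the features were encoded.

```python
def morph_features(word, affix_len=3):
    """Extract surface prefix, suffix and capitalization features for a word.

    Fixed-length character affixes stand in for true morphological
    prefixes and suffixes; affix_len=3 is an illustrative assumption.
    """
    lower = word.lower()
    return {
        "prefix": lower[:affix_len],             # first affix_len characters
        "suffix": lower[-affix_len:],            # last affix_len characters
        "is_capitalized": word[:1].isupper(),    # True if the first letter is uppercase
    }
```

For the word "Membaca" this yields the prefix "mem", the suffix "aca" and a capitalization flag of `True`. In a tagger, such features would typically be mapped to embeddings or one-hot vectors and concatenated with the word embedding before the Bi-LSTM layer, though the abstract does not specify the combination method.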

Part of speech tagging: a systematic review of deep learning and machine learning approaches

Journal of Big Data

Natural language processing (NLP) tools have sparked a great deal of interest due to rapid improvements in information and communications technologies. As a result, many different NLP tools are being produced. However, there are many challenges for developing efficient and effective NLP tools that accurately process natural languages. One such tool is part of speech (POS) tagging, which tags a particular sentence or words in a paragraph by looking at the context of the sentence/words inside the paragraph. Despite enormous efforts by researchers, POS tagging still faces challenges in improving accuracy while reducing false-positive rates and in tagging unknown words. Furthermore, the presence of ambiguity when tagging terms with different contextual meanings inside a sentence cannot be overlooked. Recently, Deep learning (DL) and Machine learning (ML)-based POS taggers are being implemented as potential solutions to efficiently identify words in a given sentence across a paragraph. T...

Romanian Part of Speech Tagging using LSTM Networks

2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), 2019

In this paper we present LSTM-based neural network architectures for determining the part of speech (POS) tags for Romanian words. LSTM networks combined with fully connected output layers are used for predicting the root POS, and sequence-to-sequence models composed of LSTM encoders and decoders are evaluated for predicting the extended MSD and CTAG tags. The highest accuracy achieved for the root POS is 99.18% and for the extended tags is 98.25%. This method proves to be efficient for the proposed task and has the advantage of being language independent, as no expert linguistic knowledge is used in the input features.