A Unified POS Tagging Architecture and its application to Greek
Related papers
Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy
IEEE/WIC/ACM International Conference on Web Intelligence
This paper proposes a machine learning approach to part-of-speech tagging and named entity recognition for Greek, focusing on the extraction of morphological features and the classification of tokens into a small set of named entity classes. The architecture of the model is introduced. Greek language support was added to the spaCy source code, a feature that did not exist before our contribution, and was used to build the models. Additionally, a part-of-speech tagger was trained that can detect the morphology of tokens and outperforms state-of-the-art results when classifying only the part of speech. For named entity recognition with spaCy, a model was built that extends the standard ENAMEX types (organization, location, person). Experiments indicate the need for flexibility in handling out-of-vocabulary words, and work on resolving this issue is ongoing. Finally, the evaluation results are discussed.
POS Tagger Improvisation with the Addition of Foreign Word Labels on Telkom University News
Building of Informatics, Technology and Science (BITS)
News is a daily source of information for the public. It carries a great deal of information and is composed of structured sentences. Each language has its own sentence structure, and Indonesian is no exception. Nowadays, however, many media outlets mix Indonesian with foreign languages, which makes the sentence structure differ from that of standard Bahasa Indonesia. To classify these words, part-of-speech tagging is needed to determine the class of each word in a sentence by learning from a corpus of each language. With the new sentence structure, a POS tagger requires a larger corpus to learn from. The language structure can affect the tagging results, and words that are not in the corpus can reduce the tagger's accuracy. We sought to improve on earlier results by adding data with a different sentence structure to the Indonesian language corpus, using sentences from online media. Added about 242 sentences with 7,04...
A multistage PoS-tagger at the EVALITA 2009 PoS-tagging Task
This paper presents an experimental system architecture for part-of-speech tagging of Italian that can handle a large tagset and provide both lexical and morphological information. The tagger is built as a cascade of four classifiers: each classifier accepts either the initial input or the guesses of the previous classifier, performs its annotation, and passes the result to the next stage or to the output of the cascade.
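The cascade idea described above can be illustrated with a minimal sketch. It is not the EVALITA system: the three stages, the toy lexicon, the suffix heuristic, and the tag names are all hypothetical, chosen only to show how each stage receives the previous stage's guesses and fills in what is still undecided.

```python
from typing import Callable, List, Optional

# A stage receives the tokens plus the previous stage's guesses and
# returns a new list of guesses (None means "still undecided").
Guesses = List[Optional[str]]
Stage = Callable[[List[str], Guesses], Guesses]


def lexicon_stage(tokens: List[str], guesses: Guesses) -> Guesses:
    # Hypothetical toy lexicon; a real system would use a large resource.
    lexicon = {"il": "DET", "gatto": "NOUN", "dorme": "VERB"}
    return [g or lexicon.get(t.lower()) for t, g in zip(tokens, guesses)]


def suffix_stage(tokens: List[str], guesses: Guesses) -> Guesses:
    # Hypothetical suffix heuristic, applied only to undecided tokens.
    def guess(token: str) -> Optional[str]:
        return "ADV" if token.endswith("mente") else None

    return [g or guess(t.lower()) for t, g in zip(tokens, guesses)]


def fallback_stage(tokens: List[str], guesses: Guesses) -> Guesses:
    # Last stage: assign a default tag to anything left undecided.
    return [g or "NOUN" for g in guesses]


def run_cascade(tokens: List[str], stages: List[Stage]) -> Guesses:
    guesses: Guesses = [None] * len(tokens)
    for stage in stages:
        guesses = stage(tokens, guesses)  # output of one stage feeds the next
    return guesses


tokens = ["Il", "gatto", "dorme", "tranquillamente", "oggi"]
tags = run_cascade(tokens, [lexicon_stage, suffix_stage, fallback_stage])
print(list(zip(tokens, tags)))
```

Keeping every stage behind the same function signature means stages can be reordered, added, or removed without touching the driver loop.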
Some Well-Known Part-Of-Speech (POS) Tagging Systems: A Short Survey
Osmania Papers in Linguistics Vol. 46&47. Pp. 1-21., 2020
In natural language processing, POS tagging is a process of assigning parts-of-speech to words that are used in a text. It tries to capture some linguistic properties and functions of words used in a corpus. It is a complex process embedded with several theoretical and technical issues relating to identifying words and determining their lexicosemantic identity and syntactico-grammatical roles in a text. It also involves a process of defining the basic hierarchical modalities of tag assignment and designing rule-based schemas that are applied for the automatic assignment of tags to words. It uses a strategy with a combination of linguistic and extralinguistic knowledge and computation to achieve success. The output is a POS-tagged corpus which is a useful resource for language processing, language computation, machine learning, cognitive processing, data mining, information extraction, language teaching, dictionary compilation, and language description. Keeping these issues in view, I attempt to briefly describe a few POS tagging systems that are widely used across all major languages in this paper. This paper is for those scholars who come from linguistics and want to explore areas of corpus linguistics and language technology with a mission to serve their mother languages.
Proceedings of PACLIC, 2009
This paper investigates how to best couple hand-annotated data with information extracted from an external lexical resource to improve POS tagging performance. Focusing on French tagging, we introduce a maximum entropy conditional sequence tagging system that is enriched with information extracted from a morphological resource. This system gives a 97.7% accuracy on the French Treebank, an error reduction of 23% (28% on unknown words) over the same tagger without lexical information. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data vs. developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.
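As a worked check of how the two figures relate, the sketch below back-computes the baseline accuracy implied by a 97.7% accuracy and a 23% relative error reduction. The baseline value is an inference from those two numbers, not a figure quoted in the abstract, and the function name is hypothetical.

```python
def baseline_accuracy(accuracy_with_lexicon: float, error_reduction: float) -> float:
    """Accuracy of the tagger without the lexicon, implied by a relative error reduction."""
    error_with = 1.0 - accuracy_with_lexicon
    # error_with = error_without * (1 - error_reduction)
    error_without = error_with / (1.0 - error_reduction)
    return 1.0 - error_without


implied = baseline_accuracy(0.977, 0.23)
print(f"implied baseline accuracy: {implied:.4f}")  # prints 0.9701
```

Under this reading, a 23% error reduction corresponds to a gain of roughly 0.7 percentage points in overall accuracy.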
PoS-tagging Italian texts with CORISTagger
This paper presents an evolution of CORISTagger [1], a high-performance PoS-tagger for Italian developed at the University of Bologna. The system is composed of a second-order Hidden Markov Model tagger followed by a Transformation Based tagger. This stacked structure, paired with a powerful morphological analyser based on a large lexicon of 120,000 lemmas, allowed the tagger to perform well in the EVALITA 2009 PoS-tagging task. The performance of the tagger and the most common classification errors are discussed in detail.
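The stacked structure can be sketched as a first-pass tagger whose output is then rewritten by contextual transformation rules. This is an illustration only, not CORISTagger: a simple lexicon lookup stands in for the second-order HMM, and the rule format, lexicon, and English example are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    """Change from_tag to to_tag when the preceding tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def initial_tagger(tokens: List[str], lexicon: Dict[str, str], default: str = "NN") -> List[str]:
    # Stand-in for the HMM stage: one fixed tag per word form.
    return [lexicon.get(t.lower(), default) for t in tokens]


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    # Transformation stage: apply each rule in order, scanning left to right.
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


# Hypothetical toy resources for the example.
LEXICON = {"she": "PRP", "wants": "VBZ", "to": "TO", "race": "NN"}
RULES = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

tokens = ["She", "wants", "to", "race"]
first_pass = initial_tagger(tokens, LEXICON)
second_pass = apply_rules(first_pass, RULES)
print(first_pass)   # ['PRP', 'VBZ', 'TO', 'NN']
print(second_pass)  # ['PRP', 'VBZ', 'TO', 'VB']
```

Unlike a cascade that only fills in undecided tokens, the second stage here overwrites tags the first stage already assigned, which is what lets it correct systematic first-pass errors.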
UniBA @ KIPoS: A Hybrid Approach for Part-of-Speech Tagging (short paper)
2020
English. Part-of-speech tagging is becoming increasingly important because it is the starting point for higher-level operations such as Speech Recognition, Machine Translation, Parsing and Information Retrieval. Although state-of-the-art POS taggers reach high accuracy (around 96-97%), the problem cannot yet be considered solved because there are many variables to take into account. For example, most of these systems use lexical knowledge to assign a tag to unknown words. The solution proposed in this work is a hybrid tagger that does not use any prior lexical knowledge and consists of two different types of POS taggers applied sequentially: an HMM tagger and RDRPOSTagger (Nguyen et al., 2014; Nguyen et al., 2016). We trained the hybrid model using the Development set and the combination of the Development and Silver sets. The results show an accuracy of 0.8114 and 0.8100, respectively, for the main task. Italiano. L’opera...