Morphological Analysis for the Maltese Language: The challenges of a hybrid system

English Morphological Analysis With Machine-Learned Rules

dspace.wul.waseda.ac.jp

This paper presents an algorithm for morphological analysis of the English language. The algorithm consists of two closely related components: morphological rule learning and morphological analysis. The morphological rules are obtained through statistical learning from a word list, taking particular morphological features of English into consideration. The analysis procedure handles two types of ambiguity, intersectional and combinatory, and also takes into account the order of word-form formation in the language. Experiments show that the algorithm performs distinctively compared with other algorithms.
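
To make the idea of learning rules from a word list concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes a simple frequency heuristic in which a suffix counts as a rule when stripping it from several words leaves another word in the list. The function names, `min_support`, and `max_suffix_len` are hypothetical.

```python
from collections import Counter


def learn_suffix_rules(words, min_support=2, max_suffix_len=4):
    """Keep suffixes whose removal leaves another listed word often enough.

    Assumption: support counting is a stand-in for the statistical
    learning step; the paper's actual criterion is not reproduced here.
    """
    vocab = set(words)
    support = Counter()
    for word in vocab:
        # Try every suffix length that leaves a non-empty stem.
        for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
            stem, suffix = word[:-k], word[-k:]
            if stem in vocab:
                support[suffix] += 1
    return {s: n for s, n in support.items() if n >= min_support}


def analyze(word, vocab, rules):
    """Return every (stem, suffix) split licensed by the learned rules."""
    return [
        (word[: -len(s)], s)
        for s in rules
        if word.endswith(s) and word[: -len(s)] in vocab
    ]


words = [
    "walk", "walks", "walked", "walking",
    "jump", "jumps", "jumped", "jumping",
    "talk", "talks",
]
rules = learn_suffix_rules(words)
print(sorted(rules.items()))  # [('ed', 2), ('ing', 2), ('s', 3)]
print(analyze("walked", set(words), rules))  # [('walk', 'ed')]
```

Because `analyze` returns all licensed splits rather than the first one, a word with several competing analyses would surface as a list with more than one entry, which is where an ambiguity-resolution step would apply.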

Automatic morphological alignment and clustering

Technical report TR-2014-07, Department of Computer Science, University of Chicago, 2014

This paper describes an unsupervised algorithm, with no language-specific knowledge, which takes a list of morphological paradigms and explores cross-paradigmatic structure in terms of two computational tasks: alignment and clustering. Based on complexity computation in a minimum description length approach, the proposed algorithm learns the relationship across the paradigms based purely on surface strings and formalizes the intuitive idea that, for instance, jumping and loving belong to the same morphological category; this is alignment. Moreover, the algorithm simultaneously learns morphological groupings of the paradigms akin to conjugation and declension classes; this is clustering. The clustering analysis also reveals more fine-grained hierarchical structure among the inflectional classes. The algorithm is applied to verbal paradigms from English and Spanish. The results are useful for further work on the unsupervised learning and prediction-oriented research of paradigmatic structure. We also show the value of computational techniques in linguistics for both explicitly evaluating competing analyses and rigorously implementing analyses.

Language independent morphological analysis

Proceedings of the Sixth Conference on Applied Natural Language Processing, 2000

This paper proposes a framework for language-independent morphological analysis and concentrates mainly on tokenization, the first step of morphological analysis. Although tokenization is usually not regarded as a difficult task in most segmented languages such as English, there are a number of problems in achieving precise treatment of lexical entries. We first introduce the concept of morpho-fragments, which are intermediate units between characters and lexical entries. We then describe our approach to resolving the problems that arise in tokenization so as to attain a language-independent morphological analyzer.

Semi-supervised learning of concatenative morphology

2010

We consider morphology learning in a semi-supervised setting, where a small set of linguistic gold standard analyses is available. We extend Morfessor Baseline, which is a method for unsupervised morphological segmentation, to this task. We show that known linguistic segmentations can be exploited by adding them into the data likelihood function and optimizing separate weights for unlabeled and labeled data. Experiments on English and Finnish are presented with varying amounts of labeled data. Results of the linguistic evaluation of Morpho Challenge improve rapidly even with small amounts of labeled data, surpassing the state-of-the-art unsupervised methods at 1000 labeled words for English and at 100 labeled words for Finnish.
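
The weighting scheme described in this abstract can be illustrated with a small sketch. It assumes a unigram morph model and a cost of the form prior cost plus separately weighted negative log-likelihoods for unlabeled and labeled data; the names `weighted_cost`, `alpha`, `beta`, and the toy probabilities are hypothetical and are not taken from the Morfessor implementation.

```python
import math


def neg_log_likelihood(segmentations, morph_probs):
    """Negative log-likelihood of segmented words under a unigram morph model."""
    return -sum(math.log(morph_probs[m]) for seg in segmentations for m in seg)


def weighted_cost(prior_cost, unlabeled, labeled, morph_probs, alpha=1.0, beta=1.0):
    """Model cost plus separately weighted data costs.

    Assumption: `prior_cost` stands in for the model-prior term, and the
    unlabeled words are scored under a single fixed segmentation each.
    """
    return (
        prior_cost
        + alpha * neg_log_likelihood(unlabeled, morph_probs)
        + beta * neg_log_likelihood(labeled, morph_probs)
    )


# Toy morph probabilities and segmentations, invented for illustration.
morph_probs = {"walk": 0.5, "ed": 0.25, "s": 0.25}
unlabeled = [["walk", "ed"]]
labeled = [["walk", "s"]]

cost = weighted_cost(0.0, unlabeled, labeled, morph_probs, alpha=1.0, beta=2.0)
print(round(cost, 3))  # 6.238
```

Raising `beta` relative to `alpha` makes a small labeled set count for more in the total cost, which is the lever the abstract refers to when it mentions optimizing separate weights.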

Initial Experiments In Cross-Lingual Morphological Analysis Using Morpheme Segmentation

Proceedings of the Sixth Workshop on

The paper describes initial experiments in data-driven cross-lingual morphological analysis of open-category words using a combination of unsupervised morpheme segmentation, annotation projection and an LSTM encoder-decoder model with attention. Our algorithm provides lemmatisation and morphological analysis generation for previously unseen low-resource language surface forms, given annotated data only for related languages. Despite the inherently lossy annotation projection, we achieved the best lemmatisation F1-score in the VarDial 2019 Shared Task on Cross-Lingual Morphological Analysis for both Karachay-Balkar (Turkic languages, agglutinative morphology) and Sardinian (Romance languages, fusional morphology).

Clustering Morphological Paradigms Using Syntactic Categories

Lecture Notes in Computer Science, 2010

We propose a new clustering algorithm for the induction of morphological paradigms. Our method is unsupervised and exploits the syntactic categories of the words acquired by an unsupervised syntactic category induction algorithm [1]. Previous research [2,3] on joint learning of morphology and syntax has shown that both types of knowledge affect each other, making it possible to use one type of knowledge to help learn the other.

Statistical models for unsupervised learning of morphology and POS tagging

2011

This thesis concentrates on two fields in natural language processing. The main contribution of the thesis is in the field of morphology learning. Morphology is the study of how words are formed by combining different language constituents (called morphemes), and morphology learning is the process of analysing words by splitting them into these constituents. In the scope of this thesis, morphology is learned mainly by paradigmatic approaches, in which words are analysed in groups, called paradigms. Paradigms are morphological structures capable of generating various word forms. We propose approaches for capturing paradigms to perform morphological segmentation. One of the proposed approaches captures paradigms within a hierarchical tree structure. Using a hierarchical structure covers a wide range of paradigms by spotting morphological similarities. The second scope of the thesis is part-of-speech (POS) tagging. Parts-of-speech are linguistic categories, which group words havi...

An Agglomerative Hierarchical Clustering Algorithm for Morpheme Labelling

In this paper, we present an agglomerative hierarchical clustering algorithm for labelling morphs. The algorithm aims to capture allomorphs and homophonous morphemes for a deeper analysis of segmentation results of a morphological segmentation system. Most morphological segmentation systems focus only on segmentation rather than labelling morphs according to their roles in words, i.e. inflectional (cases, tenses etc.) vs. derivational. Nevertheless, it is helpful to have a better understanding of the roles of morphs in a word to be able to judge the grammatical function of that word in a sentence, i.e. the syntactic category. We believe that a good morph labelling system can also help part-of-speech tagging. The proposed clustering algorithm can capture allomorphs in Turkish successfully. We obtain a recall of 86.34% for Turkish and 84.79% for English.
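
As a rough illustration of agglomerative clustering applied to morphs, the sketch below repeatedly merges the two most similar clusters until no pair is similar enough. It makes several assumptions that are not from the paper: morphs are described by invented context-feature sets, similarity is Jaccard overlap, a merged cluster takes the union of its members' features, and `threshold` is a hypothetical stopping parameter.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(contexts, threshold=0.5):
    """Bottom-up clustering of morphs by the similarity of their contexts.

    `contexts` maps each morph to a set of context features.
    """
    clusters = [({morph}, set(feats)) for morph, feats in contexts.items()]
    while len(clusters) > 1:
        best, pair = -1.0, None
        # Find the most similar pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = pair
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [members for members, _ in clusters]


# Toy feature sets, invented for illustration only.
contexts = {
    "s": {"after:noun", "before:END"},
    "es": {"after:noun", "before:END"},
    "ed": {"after:verb", "before:END"},
    "d": {"after:verb", "before:END"},
}
result = agglomerate(contexts)
print(sorted(sorted(c) for c in result))  # [['d', 'ed'], ['es', 's']]
```

In this toy run the allomorph pairs end up together because they share contexts, while the two resulting clusters stay apart because their overlap (1/3) falls below the threshold.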

Learning multilingual morphology with CLOG

1998

Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, as, for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as word-forms cannot be matched against a morphological lexicon. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: learning to perform morphosyntactic tagging of words in a text, and learning to perform morphological analysis, which produces the lemma from the word-form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging and a first-order decision list learning system is used to learn rules for morphological analysis. We train the analyser on open-class inflecting Slovene words, namely nouns, adjectives, and main verbs, together being characterised by more than 400 different morphosyntactic tags. Our training sets consist of a morphological lexicon containing 15,000 lemmas and a manually annotated corpus consisting of 100,000 running words. We evaluate the learned model on word lists extracted from a corpus of Slovene texts containing 500,000 words, and show that our morphological analysis module achieves 98.6% accuracy, while the combination of the tagger and analyser is 92.0% accurate on unknown inflecting Slovene words.
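
The two-stage decomposition described above (tag first, then derive the lemma from the word-form and its tag) can be sketched with a tiny ordered rule list in which the first matching rule wins. This is only an illustration: the rules are hand-written rather than learned, the tag is assumed to be supplied by a separate tagger, and the MULTEXT-East-style tags and the name `lemmatise` are assumptions, not the paper's system.

```python
# Ordered rules: (tag, suffix to strip, replacement). First match wins.
# Hand-written for illustration; a decision list learner would induce these.
RULES = [
    ("Ncfsg", "e", "a"),  # feminine noun, genitive singular: miz-e -> miz-a
    ("Ncnsg", "a", "o"),  # neuter noun, genitive singular: mest-a -> mest-o
    (None, "", ""),       # default rule: leave the form unchanged
]


def lemmatise(word, tag, rules=RULES):
    """Apply the first rule whose tag and suffix both match the word-form."""
    for rule_tag, suffix, replacement in rules:
        if rule_tag in (None, tag) and word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


print(lemmatise("mize", "Ncfsg"))   # miza
print(lemmatise("mesta", "Ncnsg"))  # mesto
```

Conditioning each rule on the tag is what lets the same ending map to different lemmas: the form "mesta" is rewritten only when it is tagged as a neuter genitive singular.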

Unsupervised Learning of Morphology and the Languages of the World

2009

This thesis presents work in two areas: Language Technology and Linguistic Typology. In the field of Language Technology, a specific problem is addressed: Can a computer extract a description of word conjugation in a natural language using only written text in the language? The problem is often referred to as Unsupervised Learning of Morphology and has a variety of applications, including Machine Translation, Document Categorization and Information Retrieval. The problem is also relevant for linguistic theory.